Home » Conferences Journals and Workshops

Category Archives: Conferences Journals and Workshops

April 2024
S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Where has the time gone?

August 5, 2020

It’s been more than a year since I last posted; it’s not that I haven’t been busy, but rather that I’ve been trying to do too many things and have been (more slowly than I’d like) cutting back on some of my activities.

Still, I miss using this as a (one way) discussion about my own work. In the past year I’ve managed to publish one new (short) paper, though the amount of work that I put into it was substantial (it was just published in Computer Architecture Letters). This short article (letter) journal normally provides at most one revise and resubmit opportunity, but they gave me two such opportunities, then accepted the paper, albeit begrudgingly over the objections of Reviewer # 2 (who agreed to accept it, but didn’t change their comments).

Despite the lack of clear publications to demonstrate forward progress, I’ve been working on a couple of projects to push them along. Both were presented, in some form, at Eurosys as posters.

Since I got back from a three month stint at Microsoft Research (in the UK) I’ve been working on one of those, evolving the idea of kernel bypasses and really analyzing why we keep doing these things; this time through the lens of building user mode file systems. I really should write more about it, since that’s on the drawing board for submission this fall.

The second idea is one that stemmed from my attendance at SOSP 2019. There were three papers that spoke directly to file systems:

Each of these had important insights into the crossover between file systems and persistent memory. One of the struggles I had with that short paper was explaining to people “why file systems are necessary for using persistent memory”. I was still able to capture some of what I’d learned, but a fair bit of it was sacrificed to adding background information.

One key observation was around the size of memory pages and their impact on performance; it convinced me that we’d benefit from using ever larger page sizes for PMEM. Some of this is because persistent memory is, well, persistent and thus we don’t need to “load the contents from storage”. Instead, it is storage. So, we’re off testing out some ideas in this area to see if we can contribute some additional insight.

The other area – the one that I have been ignoring too long – is the thesis of this PhD work in the first place. Part of the challenge is to reduce the problem down to something that is tractable and can be finished in a reasonable amount of time.

One of the questions (and the one I wanted to explore when I started writing this) is a rather famous article from 1945 entitled As We May Think. Vannevar Bush described something quite understandable, yet we have not achieve this, though we have been trying – one could argue that hypertext stems from these ideas, but I would argue that hypertext links are a pale imitation of the rich assistive model Bush lays out when he describes the Memex.

Thus, to the question, which I will reserve for another day: why have we not achieved this yet? What prevents us from having this, or something better, and how can I move us towards this goal?

I suspect, but am not certain, that one culprit may be the fact we decided to stick with an existing and well-understood model of organization:

Maybe the model is wrong when the data doesn’t fit?

Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery

June 25, 2019

Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery, Robert Ross, Lee Ward, Philip Carns, Gary Grider, Scott Klasky, Quincey Koziol, Glenn K. Lockwood, Kahthryn Mohror, Bradley Settlemyer, and Matthew Wolf, United States: N. p., 2019. Web. doi:10.2172/1491994.

Google brought this to my attention recently because I’ve set triggers for people citing prior works that I’ve found interesting or useful. There are a number of things that caught my eye with this technical report that apply to my own current research focus. In addition, I received a suggestion that I pick a specific field or discipline and look at how to solve problems in that specific community as an aid to providing focus and motivation for my own work.

Data center Image (supercomputer). — Supercomputer Cluster

This report is a wealth of interesting information and suggestions for areas in which additional work will be useful. The report itself is quite long – 134 pages – so I am really only going to discuss the sections I found useful for framing my own research interests.

The report itself is essentially capturing the discussion at a US DOE workshop. Based upon my reading of the report, it appears that the workshop was structured, with a presentation of information and seeded with topics expected to elicit discussion. Much of the interesting information for me was in Section 4.2 “Metadata, Name Spaces, and Provenance”.

The report starts by defining what they mean by metadata:

Metadata, in this context, refers generally to the information about data. It may include traditional user-visible file system metadata (e.g., file names, permissions, and access times), internal storage system constructs (e.g., data layout information), and extended metadata in support of features such as provenance or user-defined attributes. Metadata access is often characterized by small, latency-bound operations that present a significant challenge for SSIO systems that are optimized for large, bandwidth-intensive transfers. Other challenging aspects of metadata management are the interdependencies among metadata items, consistency requirements of the information about the data, and volume and diversity of metadata workloads.
4.2.1 Metadata (p. 39)

I was particularly interested in seeing their observation that relationships were one of the things that they found missing from existing systems. While somewhat self-serving, this is the observation that has led me into my current research direction. The observation about how meta-data I/O behavior is a mismatch for the bandwidth-intensive nature of accessing the data was a useful insight: there is a real mismatch between the needs of accessing meta-data versus accessing the data itself, particularly for HPC style workloads.

Not explicitly mentioned here is the challenge of constructing a system in which meta-data is an inherent part of the file itself, despite these very different characteristics. The authors do point out other challenges, such as the difficulty in constructing efficient indexing, which suffers from scalability issues:

While most HPC file systems support some notion of extended attributes for files [Braam2002, Welch2008, Weil2006], this type of support is insufficient to capture the desired requirements to establish relationships between distributed datasets, files, and databases; attribute additional complex metadata based on provenance information; and support the mining and analysis of data. Some research systems provide explicit support for searching the file system name space based on attributes [Aviles-Gonzalez2014, Leung2009], but most of these systems rely on effective indexing, which has its own scalability and data-consistency challenges [Chou2011].
4.2.1 Metadata – “State of the Art” (p. 39)

In other words: the performance requirements of accessing meta-data versus data are quite different; the obvious solution is to provide separate storage tiers or services for satisfying these needs. The disadvantage to this is that when we start separating meta-data from data, we create consistency problems and classic solutions to these consistency problems in turn create challenges in scalability. In other words, we are faced with classic distributed systems problems, in which we must trade off consistency versus performance. That is the CAP Theorem in a nutshell.

Another important point in looking at this technical report is that it emphasizes the vast size of datasets in the scientific and HPC communities. These issues of scale are exacerbated in these environments because their needs are extraordinary, with vast data sets, large compute clusters, geographical diversity, and performance demands.

The need for solutions is an important one in this space, as these are definitely pain points. Further, the needs of reproducibility in science are important – the report expressly mentions the DOE policy now requires a data management plan. The emphasis is clearly on reproducibility and data sharing. It seems fairly clear (to me, at least) that having better data sharing can only benefit scientific work.

The seeding discussion for the Metadata section of the report raises some excellent points that, again, help buttress my own arguments and hopefully will be useful in shaping the forward direction:

A number of nontraditional use cases for the metadata management system have emerged as key to DOE missions. These include multiple views of the metadata to support, for example, different views at different levels of the name space hierarchy and different views for different users’ purposes; user-defined metadata; provenance of the metadata; and the ability to define relationships between metadata from different experiments (e.g., to support the provenance use case).
As the collection of metadata expands, it is important to ensure that all metadata associated with a dataset remains with the data. Metadata storage at different storage tiers, storage and recovery of metadata from archive, and the transfer of datasets to different storage systems are all important use cases to consider.
4.2.1 Metadata – “Seeding Workshop Discussion” (p. 40)

The idea of multiple views is important. It is something we’ve been exploring recently as we consider how to look at data just using current information available to us, something I should describe in another post.

So what do I pick out here as being important considerations? Different views, user-defined metadata, provenance, and defining relationships. Their requirement that metadata be associated with the underlying dataset. As I noted previously, this becomes more challenging when you consider that metadata operations have very different interface and performance characteristics than data access.

I have not really been looking at the issues related to metadata management in tiered storage systems, but clearly I have to do so if I want to address the concerns of the HPC community.

The attendees agreed that the added complexity in storage hierarchies presents challenges for locating users’ data. A primary reason is that the community does not yet have efficient mechanisms for representing and querying the metadata of users’ data in a storage hierarchy. Given the bottlenecks that already exist in metadata operations for simple parallel file systems, there is a strong research need to explore how to efficiently support metadata in hierarchical storage systems. A promising direction of research could be to allow users to tag and name their data to facilitate locating the data in the future. The appropriate tagging and naming schemes need investigation and could include information about the data contents to facilitate locating particular datasets, as well as to communicate I/O requirements for the data (e.g., data lifetime or resilience).
Section 4.4.5 Hierarchy and Data Management

How do we efficiently support medatadata in hierarchical storage systems? Implicit in this question is the assumption that there are peculiar challenges for doing this. The report does delve into some of the challenges:

Hashing can be useful in sharding data for distribution and load balancing, but this does not capture locality – the fact that various files are actually related to one another. We have been considering file clustering; I suspect that cluster sharding might be a useful mechanisms for providing load balancing.
Metadata generation consists of both automatically collected information (e.g., timestamps, sizes, and name) as well as manually generated information (e.g., tags). The report argues that manual generation is not a particularly effective approach and suggests automatically capturing workflow and provenance information is important. As I was reading this, I was wondering if we might be able to apply inheritance to metadata, in a way that is similar to taint tracking systems.

The report has a short, but useful discussion of namespaces. This includes the traditional POSIX hierarchical name space as well as object oriented name spaces. They point to views as a well-understood approach to the problem from the database community. I would point out the hierarchical approach is one possible view. The report is arguing that their needs would be best met by having multiple views.

The existing work generally is hierarchical and focused on file systems. A number of researchers, however, have argued that such hierarchical namespaces impose inherent limitations on concurrency and usability. Eliminating these limitations with object storage systems or higher-level systems could be the fundamental breakthrough needed to scale namespaces to million-way concurrency and to enable new and more productive interaction modalities.
4.2.2 Namespaces “Seeding Workshop Discussion”

There is a dynamic tension here, between search and navigation. I find myself returning to this issue repeatedly lately and this section reminds me that this is, in fact, an important challenge. Navigation becomes less useful when the namespace becomes large and poorly organized; humans then turn to search. Views become alternative representations of the namespace that humans can use to navigate. They can filter out data that is not useful, which simplifies the task of finding relevant data. We apply views already: we hide “hidden files” or directories beginning with a special character (e.g., “.” in UNIX derived systems). The source code control system git will ignore files (filter them from its view) via a .gitignore file. Thus, we are already applying a primitive, limited form of filtering to create the actual view we show.

This report goes on further. It considers some really interesting issues within this area:

Storage aware systems for maintaining provenance data.
The scaling issues inherent in collecting more provenance data; or what do we do when managing the metadata becomes a huge issue itself?
Cross-system considerations. This doesn’t require HPC data – I have commented more than once that when humans are looking for something, they don’t want to restrict it to the current storage device. Data flows across devices and storage systems; we need to be able to capture these relationships. “[T]here is no formal way to construct, capture, and manage this type of data in an interoperable manner.”
External meta-data. We need to remember that the context in which the data is collected or consumed is an important aspect of the data itself. Thus, the tools used, the systems, etc. might be factors. I would argue that a storage system can’t reasonably be expected to capture these, but it certainly should be able to store this metadata.

The discussion for this section is equally interesting, because it reflects the thoughts of practitioners and thus their own struggles with the current system:

Attendees mentioned that tracking provenance is a well-explored aspect of many other fields (art history, digital library science, etc.) and that effort should be made to apply the best practices from those fields to our challenges, rather than reinventing them. Attendees extensively discussed the high value of provenance in science reproducibility, error detection and correction in stored data, software fault detection, and I/O performance improvement of current and future systems.
Attendees also discussed the need for research into how much provenance information to store, for how long, in what level of detail, and how to ensure that the provenance information was immutable and trustworthy. The value of using provenance beyond strictly validating science data itself was brought up; attendees pointed out that provenance information can be used to train new staff as well as help to retain and propagate institutional knowledge of data gathering processes and procedures.
4.2.4 Discussion Themes “Provenance”

A generally useful observation: look to how other fields have approached common problems, to see if there are insights from those fields that we can use to address them here. I found the vast reach of the discussion here interesting – the idea that such a system can be used to “… retain and propagate institutional knowledge…”

Finally, I’m going to capture the areas in which the report indicates participants at the workshop reached consensus. I’ll paraphrase, rather than quote:

Scalable metadata storage – a key point for me here was decoupling meta-data from the data itself. That, despite the seeding suggestion that we keep meta-data associated with the file.
Improve namespace query and display capabilities – make them dynamic, programmable, and extensible.
Better provenance information – the emphasis was on reproducibility of results, but they wanted to ensure that this could embed domain specific features, so that such systems can be useful beyond just reproducibility.

I’ve really only touched on a small part of this report’s total content; there is quite a bit of other, useful, insight within the report. I will be mulling over these issues

To FUSE or Not to FUSE: Performance of User-Space File Systems

June 20, 2019

To FUSE or Not to FUSE: Performance of User-Space File Systems
Bharath Kumar Reddy Vangoor, Vasily Tarasov, and Erez Zadok,
in The 15th USENIX Conference on File and Storage Technologies (FAST ’17),
February 27 – March 2, 2017, Santa Clara, CA, USA.

Previously, I discussed some of the rationale behind FUSE and a basic introduction to why we use it. This paper is actually a detailed analysis of the performance of FUSE. I have spent a fair bit of time reading – and re-reading – this paper, in order to understand what some of the bottlenecks are within FUSE itself. One area I have previously explored are possible ways of improving its performance; indeed, this is one of my current projects.

Figure 1 in the paper is actually a basic block diagram providing an interface model for how FUSE fits into the file systems layer. FUSE consists of:

The kernel mode FUSE driver; on Linux this is part of the kernel; and
The user mode FUSE library. This handles interactions between the FUSE file system driver and the user mode file system process (referred to as a “daemon” or “background process”) in the diagram.
The user mode FUSE file system itself – this is the code that implements the FUSE library interface (one of them, since there are two different interfaces).

Applications can then access this FUSE file system without knowing any details of the implementation.

In Figure 2 from the paper, the authors show the internal structure of how the FUSE kernel driver manages various internal data structures that handle events between the user mode file system and the kernel mode file system support. This includes the messages between the kernel and user mode application (requests and their corresponding replies) as well as synchronous and asynchronous file system operations and the cache invalidation mechanisms (“forgets”).

Caching is an essential part of the system because it allows the system to quickly respond to repetitive events. The downside to caches are that they can become invalid because the underlying state changes. Thus, there is a mechanism for invalidating the cache itself.

The author describes the model within the FUSE library and the two interfaces it exposes: the low level API, which provides greater control to the user mode file system, at the cost of more state management – specifically the mapping of a path name to the inode (index node).

The authors describe how they constructed a new FUSE file system, StackFS, for evaluating the behavior of the system. StackFS sits on top of an existing file system and attempts to minimize the amount of mapping it performs, since the goal of the authors is to evaluate the performance of FUSE itself.

The authors summarize their findings in Table 3, using both a hard disk and an SSD, the two most common types of storage media used on modern computer systems.

These results were both surprising and interesting to me because some of them surprised me. The performance bottlenecks were not for I/O as much as they were for meta-data operations. Random write performance (which is what we see with databases, for example) is not ideal, but their optimizations did a good job of addressing this, bringing the I/O overhead of the FUSE model down substantially relative to the native file system.

Bottom line: the challenge in improving FUSE performance now moves squarely into the arena of meta-data operations. Creating and deleting files is quite expensive in FUSE.

The authors conclude by pointing out that there is further room for improvement; they suggest some potential future directions. I have been looking at ways to improve this as well and I will discuss those in a future post.

GFS: a Graph-based File System Enhanced with Semantic Features

May 14, 2019

GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.

In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.

I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected and one reviewer specifically cited to this paper (as well as the QDMS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing this from our work.

The abstract does a good job of explaining how this work is different than what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files rather than just characteristics of the files themselves.

The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”

Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.

“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”

In my own way, I have been looking at this question for quite some time. Over a year ago I was working on finding a way in which I could support both classic file system interfaces as well as augmenting them with new features without requiring invasive operating systems level changes. While I expect that ultimately a successful demonstration of new interfaces will lead to OS level changes, it makes more sense to explore what interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux as well as FUSE for Windows, and quite a few other OS platforms too) and the other is a message passing mechanism that directly routes requests from the application to the user mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!

So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.

One of the strengths of this paper is the emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Trying to argue that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user which means helping them in the way they use the system now. I suspect the ideal will be to enhance both the current way, as well as provide better search tools; in other words navigation and search are not mutually exclusive approaches to the problem.

The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system; they separate the semantic spaces from the hierarchical spaces, though. My concern is that this creates the semantic spaces as second class citizens (though, this system pushes them to the front of the bus). One thing that surprised me is their comment about how they returned semantic information before regular directory information. In my experience, application programs sort the results of directory enumerations and do not rely upon the order in which entries are returned.

The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of pushing multiple tags at once, which avoids repeated calls into the file systems interface. Copy needs to be optimized as well to deal with the inherent non-atomic nature of the beast. Rename and unlink also have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as being a real-world counter-example to the idea that you need path based security to work properly; while NTFS supports path-based security, the OS default is to grant traverse right to all users on the system. POSIX compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior between POSIX and the extension. That’s certainly a useful lesson.

In the end, this paper focuses on using tags for their files and creating namespace extensions that identify the files. It is a short (4 page) paper, and there is no evaluation of what they constructed or how effective it was. It presents one point in the design space and it is certainly a useful paper to consider as I design my own point in the design space.

Collaboration versus Cheating

May 7, 2019 / 1 Comment on Collaboration versus Cheating

Collaboration Versus Cheating: Reducing Code Plagiarism in an Online MS Computer Science Program Tony Mason, Ada Gavrilovska, and David A. Joyner, Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pp. 1004-1010, Minneapolis, MN, February 27 – March 2, 2019.

This is a rather different paper than my usual but I thought I would write about it since I was the one that drove this particular work. I first started as a teaching assistant for the Graduate Introduction to Operating Systems in Georgia Tech’s Online Master of Science in Computer Science program. One of my motivations at the time was to really learn more about how online education could be realized. The program’s goals were to achieve scale (which it has) and to maintain a quality level that was non-inferior to their online MSc program offering. I achieved many of my goals here, as I learned how to improve the pedagogical goals (by improving feedback to students, and helping them understand the topics) of the class while also finding ways to ensure academic honesty.

Figure 1: Detected plagiarism rates over time

I started working with the class in Fall 2015; at that point we had no proactive plagiarism detection in place, though we were collecting similarity data. It wasn’t until 2016 that I started looking at that data and began to realize we had a noticeable plagiarism rate. We tried a number of techniques to try and reduce this. We explained what plagiarism was, we offered “amnesties” if people that cheated would come forward and admit it. Neither worked. Amnesty actually lead to cases where students we did not think had plagiarized were admitting to things we considered to be fair use (e.g., using existing code to set up socket connections).

Figure 1 shows the progress over time, and clearly delineates when our intervention strategies were successful. What it does not capture is the qualitative difference. By Spring 2017, I was observing students wholesale submitting other students’ projects and claiming it as their own! It was that semester when I prioritized “early intervention” in which I quickly identified students suspected of cheating and confronted them. The plagiarism rate for subsequent projects dropped precipitously.

The downside to this approach is that it took a tremendous amount of work to do this, which makes it impractical because it does not scale. Effective is great, but scalability was also a requirement if it was going to be useful. In the summer of 2017 we introduced one new mechanism to the class: we added a mandatory quiz that went over the policy and asked students if they understood the course policies defining collaboration and cheating, as well as understanding why this was important and the potential penalties.

The results surprised me: the plagiarism rate plummeted. I was cautiously optimistic though because the Summer 2017 class was the smallest we’d ever had. Thus, I suspected it might simply be an anomaly. But this encouraged me to begin making the evaluation more rigorous. In the Fall of 2018 I spent time automating the process of doing code comparisons not only with other students in the same class, but also comparing submissions across all prior submissions. My motivation for doing this work was to validate what I thought I was seeing.

Thus I performed retrospective analysis on prior semesters in which the course was offered. I used that data to write a paper for Learning at Scale 2018 but that paper was rejected. I used the feedback from that to revise my draft and submit an update to it to ICER 2018 but that was also rejected. I once again dug deep and revised the paper and submitted it to SIGCSE 2019, where it was accepted (and astoundingly, all the reviews were accepts!) I revised the final paper to address the concerns of some reviewers (the weak accepts) and that is what was included at SIGCSE.

This doesn’t move my research forward per-se, but it helped me understand how incredibly important data visualization is – like Figure 1. I also learned how important it is to construct reproducible data analysis flows. The scripts I wrote helped me build a suite of tools for improving reproducibility. The research I did helped me better understand why the primary automated tool we use works (MOSS) and its limitations. When I was done I realized that I could easily continue exploring this research area, but since it isn’t core to my focus, I’ll leave it to others to push the frontiers.

Still it was an interesting project and helped confirm my interest in doing research.

I have a public repository with further information (beyond what is in my paper) on GitHub. It includes a copy of the slides from my presentation in March at SIGCSE (in which I think I had around 60 people, which was far more than I’d expected given that it was the last half-day of the conference!)

Eurosys 2019

April 25, 2019

I attended Eurosys 2019 last month and in fact just returned from that trip, as I added three weeks of vacation to the end of it, though the first week had me spending most of the time in a hotel room working on a paper for submission.

I attended the doctoral workshop at Eurosys to pitch my idea for an associative file system. I received useful feedback from both the paper I submitted as well as my presentation. My poster for the doctoral workshop attempted to capture a non-textual perspective on the problem and the approach I am seeking to achieve

I did achieve my goal of ensuring there was minimal text in the poster, though I’m not sure I quite hit the right balance with it.

I also presented a second poster based upon a paper we had submitted around the same idea. This was a very different realization of the same basic concepts.

As I promised in my earlier post, I will be discussing my forward moving file systems idea(s) in more detail as I move along through this project.

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

May 14, 2018

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores. The authors point out that the idea of using timestamps for constructing a transactional system. When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system. After all, you can just have a global counter that is incremented as each transaction proceeds. That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck. While we do not usually think of memory operations as being a bottleneck, they certainly can be. This is because multi-processor computers look much like a distributed system. They exchange messages to provide coherency guarantees. Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees. This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor). So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor. It writes those changes back at some point in the future. One reason is because another processor tries to access something within that same memory region. This is where the messages go flying back and forth, the modified cache line gets written back to main memory. This turns out to be “expensive”. Access to RAM on my laptop is around 65 nanoseconds. Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU. On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds. So it is 130 times slower. On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock. They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650. The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU. From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock. The Intel chip, on the other hand, has a clock that can be used to construct global atomicity. There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system. They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases. They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety. Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.

NVMCached: An NVM-based Key-Value Cache

April 30, 2018

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory. Because the authors do not have NVM available, they focus on simulating it using DRAM. They simulate behaviors by using appropriate cache flushes and fence. In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees. If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this. One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously. The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive. Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event. For memcached, there are users that are asking for persistence. Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern. They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing. They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots). To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation. Memory fragmentation is something we often fix in dynamic memory by rebooting the machine. However, now we have the storage level problem with fragmentation; fixing this involves write amplification. They organize their allocation strategy by observing there is considerable locality across the KV store. They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance. They also use DRAM to track access information as part of their LRU implementation. This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM. Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached. They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.” I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event. They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version. When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance. Two of them, SYS and VAR, show substantially better performance. Both benchmarks are write intensive. I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

The Future of Synchronization on Multicores: The Mulitcore Transformation

April 27, 2018

The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.

I’m going to round out the week with a much lighter read. Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.

The editor’s introduction to this piece really does a good job of summing up the problem:

Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.

The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.” I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”

When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks create deadlocks, which can be almost impossible to debug without instrumentation. In operating systems, we prevent deadlocks by defining a lock hierarchy. The order in which locks can be acquired forms a graph. To prevent deadlocks, we require that the graph be acyclic. That sounds simple and for simple code bases, it is. However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear. Cycles can be introduced because we have multiple discreet components, each doing something logical, that creates a lock cycle unwittingly.

The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.” A common maxim in systems programming is to optimize the common case. Locks do the opposite: they burden the common case with logic that is normally not useful.

The author also points out that our lock mechanisms do not compose well: when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state. I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.

This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity. It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing. It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).

One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made. General transaction systems can be quite complicated: this is a common database approach.

How do we make transactions simple enough to be useful in multicore shared memory environments?

Keep them small: they don’t change much state
Keep them brief: they either commit or abort quickly
Keep them ephemeral: they don’t involve disk I/O, they aren’t related to persistence they are related to consistency.

One benefit of transactions is they are composable: they can be nested. Transactions can avoid issues around priority inversion, convying, and deadlocks. The author points to other evidence that says they’re easier for programmers and yield better code.

Transactions aren’t new. We’ve been using them for decades. When we use them at disk I/O speeds, we find the overhead is acceptable. When we use them at memory speeds we find the overhead of transactions is too high to make them practical to do in software. This gave birth to the idea of hardware transactions. Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively. They don’t suffer from the high overhead of software transactions. The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”. When a cache line cannot remain in the CPU, the transaction is aborted. Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.” As we saw in previous work (again Exploiting Hardware Transactional Memory in Main-Memory Databases) just retrying the operation once or twice often resolve the fault. But sometimes the operation is just not viable on the system at the present time.

The author’s summary of the impact of hardware transactions is interesting:

The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-level constructs like mutual exclusion locks, to concurrent data structures such as skip-‐lists or priority queues, to system-‐level constructs such as read-‐copy-‐update (RCU), all the way to run-time support for high-‐level language synchronization mechanisms.

So far, this change has not been pervasive. I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances. Systems, and software stacks, do change slowly. Backwards compatibility is a big issue. As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.

Exploiting Hardware Transactional Memory in Main-Memory Databases

April 26, 2018

Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineering, pp 580-591.

I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work. By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective. Intel hardware with transactional memory support was just starting to emerge around the time this paper was released. I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases. Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).

Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code). Cache lines are read from memory, any changed made to the cache line, and then written back to memory. This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.

The idea behind HTM is that sometimes you want to change more than a single element of memory. For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion. Even if there is no contention for the lock, you will pay the lock cost. If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back. In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).

With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction. Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system. The goal then is to avoid doing any memory operations unless absolutely necessary.

The authors point out that the fastest option is partitioning (ignoring hot spots). They graphically demonstrate this in Figure 1 (from the paper). HTM has some overhead, but it tracks with partitioning fairly linearly. This difference is the overhead of HTM.

They compare this to serial execution, which just means performing them one at a time. The traditional mechanism for doing this kind of paralleism is the two phase commit protocol. That’s the lock/work/unlock paradigm.

If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work. Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed. Instead, we have hot spots. This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion. There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.” Thus, their observation is: “we have to deal with this problem”.

So, how can HTM be exploited to provide robust scalability without partitioning. The authors do a good job of explaining how HTM works on Intel platforms. Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.) Level 1 cache is the fastest to access, but the most expensive to provide. Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU. Level 3 cache might be present on the CPU, in which case it is shared between all three cores. Note that none of this is required. It just happens to be how CPUs are constructed now.

The benefit of HTM then is that it exploits the cache in an interesting new way. Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core. Note, however, that this could mean just the L1 cache. In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper). Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software). That probability approaches 100% as the size goes above above 8KB. Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache. What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating). If we need to use a ninth we evict one of the existing pages in the cache.

Similarly, when the duration of the transaction goes up, the probability of it aborting also rises. This is shown in Figure 6 (from the paper). This is because the chance that various systems events will occur, which cause the transaction to abort. This includes various types of interrupts: hardware and software.

Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them.

We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient. The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.

Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.” They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.

They put all of this insight together and that forms the basis for their evaluation. Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.

Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well . Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation. They address this a bit later in the paper.

Figure 13 (from the paper) compares their performance against a partitioned system. Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown. One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance. This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.

Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation. They admit as much in the paper: the hardware just isn’t available to them. They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.

Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations. If an HTM is retried, many will complete with one or two tries. Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.

Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.

Category Archives: Conferences Journals and Workshops

Recent Posts

Recent Comments

Archives

Categories

Subscribe to Blog via Email