Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
January 2025
S M T W T F S
 1234
567891011
12131415161718
19202122232425
262728293031  

QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language

QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language

Sasha Ames, Maya B. Gokhale, and Carlos Maltzahn, International Journal of Parallel, Emergent and Distributed Systems, Volume 28, Number 2, pp. 159-183, 2013.

This paper came to my attention via feedback from an anonymous reviewer, observing that our idea of constructing a graph file system had “already been done”. It never ceases to amaze me that, despite how much time I have spent combing the literature, there seem to be things I miss. In this particular case, I have to agree with the reviewer that the basic idea we proposed really had been done before, though it seems as if the design space has not been exhausted and this paper actually will save me considerable time because up to this point it’s been a challenge to even explain why this kind of file system is useful.

Indeed, my read of this paper really suggests that these authors also struggled with similar objections because they spend considerable time justifying the need for their work: the introduction is fairly long because it explains the underlying problem, and the prior work section also goes to great length to explain why prior work is inadequate to the job.

In this paper, we discuss an exploration of our approach to the problem: the use of a graph data model for representing file system user-de fined metadata and a query language for retrieval. The purpose of this approach is to provide management of user-defined file metadata along with data under a single file system interface, delivering a common service across applications. Applications would be able to offload their metadata management needs to the service, alleviating the need for their own solution. This arrangement would benefit applications by reducing their code complexity, by virtue of not having their own custom metadata management components. A second benefit is improved opportunities for interoperability among separate applications.

The model we have been discussing, and trying to present, is one in which we have a richer model for meta-data to capture not only attributes of files, but also relationships across files. Like these authors we reached the conclusion that a graph is likely a better representational model for data. This encompasses the hierarchical model that is a fundamental part of POSIX, while at the same time providing us with a robust platform on which to build additional functionality. Before I start explaining that, though, I should go through this paper because it has valid results that I can use moving forward.

In Figure 1 from the paper, the authors describe the type of graph they are using: it has vertices (files) and edges (parent/child), with labels on the edges (attributes). I am not convinced this is the right graph model, but I will save that conversation for a future blog post.

In Figure 2, the authors delve into the structure of their graph in greater detail, as they compare their model to that of Resource Description Framework (RDF) triples that are used in several graph processing systems. Here we see a better description of their format (which is actually closer to what I’ll describe for my own work): “Our data model for file system metadata is a directed graph with attributes on nodes and edges, shown in Figure 1. Nodes in the graph can represent files, and this allows the system to manage relationships among files. We call our directed edges links, connecting parent and child nodes.”

They note that applications do not explicitly define schemas (something else I need to discuss in a future blog post) nor does their system require classes be defined. The authors argue this provides greater flexibility and indeed, the fact it does not force an application to be locked into a specific model.

“A heterogeneous approach to managing metadata gives all applications the
same tools to manage relationships.” It seems to me that this is one of the most compelling reasons on why this is a systems problem and not an application problem. If we insist applications implement this, most will not. Those that do will have no mechanism for interaction across applications. If files were truly isolated from one another, that would be fine, but in the real world, files do have relationships with other objects, whether it is other files, or external references (e.g., the “get me the e-mail from whence this file originated” example I described recently.) This helps with motivation, which I mentioned before has been an area of resistance I’ve received as well.

Figure 3 shows how the authors optimized their file system’s meta-data efficiency. This provides some interesting insight into the cases they expect to be common. I found their emphasis on navigation a useful one as well, particularly given the discussion of it recently.

The choice of optimization models certainly seems to be an important one, given that we can’t optimize for everything, and if we optimize for the wrong things we end up with something that looks much like brute-force search, which I can’t imagine is going to perform well.

In Figure 4 the authors turn their attention to their implementation model. They use the FUSE file systems interface to aid their implementation. This is interesting because one of the areas I’ve been exploring (sigh, yet another area to discuss in greater detail) is ways to more easily enhance the FUSE interface to enable exploring enhanced interfaces.

It seems that one downside to this approach is that it focuses on existing mechanisms for finding and accessing files, without providing a useful mechanism for exploiting enhanced search. Admittedly, the extended attribute interface does provide some mechanism for achieving this, but this is a useful paradigm for exploring how such a file system will work with existing applications – certainly an important aspect of constructing any file system that one expects will be useful.

Figure 5 was quite interesting to me, because it addresses one of the concerns I’ve seen in prior work based on relational databases (e.g., they’re often slow). I suspect that these results, with good times within QMDS relative to their evaluation relate to their optimization model for the queries they are executing.

One thing this doesn’t evaluate is how QDMS performs relative to other FUSE based file systems. The queries they do execute on QMDS seem to be targeted search queries and not necessarily well-correlated to actual usage as a file system.

There is no follow-up to this work, unfortunately, which makes it difficult to understand the general usefulness of QMDS. The upside to this is that it leaves considerable room for future work. It does provide a strong case for exploring this approach more thoroughly and I have already suggested that better evaluation seems justified under the circumstances.

I’ll discuss more of these issues as I turn my attention to describing my own work in future posts.

The Ubiquitous Digital File: A Review of File Management Research

The Ubiquitous Digital File: A Review of File Management Research

Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.

I recently stumbled across this recent paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.

This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.

This paper is a survey paper, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which is roughly 7.5 pages of 32 page paper, or almost 25% of the full paper (32 pages). Since I have spent a considerable amount of time digesting much of the systems focused research as well as some of the Human Computer Interface (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature as well as identifying useful research questions, some of which I find particularly interesting.

Frameworks

One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well-understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or likely both!

Systems

Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.

They point out that their work is relevant to “computer science” (and particularly the branch with which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have
examined only files stored on servers and software development
machines.”

Thus two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems really has focused on specific workloads that are not likely representative of what is useful for file management (and correspondingly, for users of file management).

One observation the users make is surprising to me: “A preference for navigating to files is much more common than a preference for searching , even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that using augmented search mechanisms still likely have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.

Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.

This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.


Eurosys 2019

I attended Eurosys 2019 last month and in fact just returned from that trip, as I added three weeks of vacation to the end of it, though the first week had me spending most of the time in a hotel room working on a paper for submission.

I attended the doctoral workshop at Eurosys to pitch my idea for an associative file system. I received useful feedback from both the paper I submitted as well as my presentation. My poster for the doctoral workshop attempted to capture a non-textual perspective on the problem and the approach I am seeking to achieve




I did achieve my goal of ensuring there was minimal text in the poster, though I’m not sure I quite hit the right balance with it.

I also presented a second poster based upon a paper we had submitted around the same idea. This was a very different realization of the same basic concepts.

As I promised in my earlier post, I will be discussing my forward moving file systems idea(s) in more detail as I move along through this project.


Reboot time

It’s been almost a year since I posted anything substantive. It is so easy to just focus on other things, which is what I’ve been doing. In the past year I’ve continued to explore some interesting areas related to file systems. For example, for the past year I’ve been looking at persistent memory, which acts somewhat like storage (because it is persistent) and somewhat like DRAM (because it is byte addressed). The findings have been interesting: surprising in some cases, close to predicted in some cases. We’re still doing more work in this area, and I’m hoping to submit two papers later this year.

But the other big project is a file systems project. I’ve decided to use Windows as my platform of choice, both because I’m quite comfortable with Windows and because I think it’s the best choice for this project. Since it has been a goal of mine to do more writing in this area, I thought it would be great to use my blog to capture how I go about writing this file system for Windows with the (probably unrealistic) hope that I can eventually turn it into an online guide. I also plan on using a public repository for my project so that other people can see what I end up doing. I think I’ll turn that into a separate blog post, so I can talk more about that project.

Hopefully, by using this as a mechanism for describing my forward progress I can continue to post new information and content. That’s the theory at least.

If you are interested in persistent memory, two good recent papers are:

Basic Performance Measurements of the Intel Optane DC Persistent Memory Module

Single Machine Graph Analytics on Massive Datasets Using Intel® OptaneTM DC Persistent Memory



MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14).  April 2-4, 2014, Seattle, WA.

It’s time to turn attention to key value stores.  There are quite a few papers on this subject.  Seemingly simple, they turn out to be an important storage mechanism.  I had a delightful discussion about them over the weekend with Keith Bostic, who knows a fair bit about key-value stores (being the author of one of the most widely used KV stores.)  Of course, I have discussed some KVS already: MassTree, NVMCached, and SILT.

I’m going to continue looking at key-value stores (KVS) by starting with MICA. It is neither the first, nor the latest, work in this area, but it seems like a good place to start really describing key value stores in this space.  I have quite a few to cover so let’s get started.

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.

It is an interesting system in that the authors decided to look at making their KVS tunable – it can serve as a cache (which means items can be thrown out by the KVS as space becomes scarce) or it can serve as storage (which means items cannot be removed from the KVS by the system itself, though it permits the applications using it to do so.)

To achieve this, they propose a series of data structures tuned to the task: circular logs, lossy concurrent hash indexes, and bulk chaining.  Their goal is to provide high performance through low contention.  They utilize dpdk for fast network processing.  They split their data structures so they independently handle caching and storage separately.  I might argue that this makes it two systems fused together, but let’s give them the benefit of the doubt at this point by calling it “generality”.

One of the concerns they try to address is the hot key issue.  In many KVS systems they get terrific performance on synthetic workloads because they use a uniform distribution.  In the real world, particularly for web objects, the workloads tend to be mostly cold with a small subset of the keys that are hot.  Further, these keys change over time.

They do restrict their domain somewhat to make the problem more tractable.  For example, they do not handle large items – everything must fit within a single network packet.  For systems that require larger items, they suggest using a separate memory allocator.  Their argument is that the extra indirection cost is a marginal increase in total latency.  They are optimizing for the common case.  Another aspect they exclude is durability.  That makes sense – they are only implementing an in-memory KVS, so durability is not a logical part of their work.  This also plays well in terms of performance, since persistence makes things much more complicated.

So what do they want to achieve with MICA?

  • High single-node throughput
  • Low end-to-end latecy
  • Consistent performance across workloads
  • Support small (but variable length) key-value objects
  • Support basic KVS operations: get, put, delete
  • Work efficiently on commodity hardware

They propose achieving this using a combination of design choices:

  • Parallel data access – in fact, they mean read access, as they do not parallelize these, as the cost is too high.
  • Optimized network overhead – as noted previously, they use dpdk to achieve this
  • Efficient KV structures – this includes dual memory allocators (one optimized for caching the other for store) and efficient index implementations.

Figure 7

There are some interesting characteristics to this system, including the fact that it employs dangling pointer handling.  Rather than remove pointers from the index, they allow the pointers to become invalid and instead trap that condition.  The paper describes how they do that in more detail.

They also use an interesting “prefetcher” to encourage efficient memory loading.  This is done by receiving requests in bursts and scheduling those requests to be processed “soon” to this prefetch step.

Figure 9Their evaluation of the system tries to focus on a range of workloads, varying key and value size as well as the range of operations being performed, with a read intensive workload being 95% reads and 5% writes.  A write intensive workload is split evenly.

They also look at how the key space is handled.  In the “Exclusive Read, Exclusive Write” paradigm (EREW) each core of the system is responsible for handling both reads and write for a given key (or range of keys more likely).  In the “Concurrent Read, Exclusive Write” paradigm (CREW), any core may satisfy a read for the key, but only one core may satisfy a write operation for the core.  This is a hat tip towards the cost of “bouncing” control of the cache lines between cores and we will see this approach used by other systems as well.  One interesting question that this reminds me of each time I read about this model is what impact, if any, the Level 3 cache might have on this.  Level 1 and 2 caches are traditionally per core but Level 3 is typically per socket.  Thus, things stored in L3 cache are substantially faster to access than RAM.  Something to consider, perhaps.

The paper has an extensive evaluation section and, of course, they demonstrate how their solution is substantially faster.  One interesting aspect to their evaluation is that they claim that each component of their design decision is important in enabling them to achieve their target performance.   To do this, they evaluate each individual aspect of their design decision.  For example for their parallel data model, they evaluate the variations on their choice and demonstrate that CREW is only faster with read skewed workloads.  They even go so far as to model Concurrent Read Concurrent Write (CRCW) workloads and point out that the cache overhead blunts the benefits of the concurrency.

To make this point on their choice of network implementations, they take MassTree and convert it to use dpdk, which yields a substantial performance improvement.  While they argue that their key indexes also contribute, they do not have a distinct break-out of that implementation and fall back to pointing out how they work better, even in CREW than MassTree.

There are certainly grounds to criticize MICA.  For example, they do choose workloads that fit well with their available resources.  I suspect that once they become resource constrained (e.g., cache) that the performance will drop substantially.  However, it provides an interesting contribution in the space and does point out how picking your data structures carefully can make a tremendous difference in the overall performance of the system.

 

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores.  The authors point out that the idea of using timestamps for constructing a transactional system.  When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system.  After all, you can just have a global counter that is incremented as each transaction proceeds.  That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck.  While we do not usually think of memory operations as being a bottleneck, they certainly can be.  This is because multi-processor computers look much like a distributed system.  They exchange messages to provide coherency guarantees.  Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees.  This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor).  So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor.   It writes those changes back at some point in the future.  One reason is because another processor tries to access something within that same memory region.  This is where the messages go flying back and forth, the modified cache line gets written back to main memory.  This turns out to be “expensive”.  Access to RAM on my laptop is around 65 nanoseconds.  Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU.  On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds.  So it is 130 times slower.  On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock.  They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650.  The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU.  From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock.  The Intel chip, on the other hand, has a clock that can be used to construct global atomicity.  There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system.   They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.


Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores

Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores
Irina Calciu, Dave Dice, Tim Harris, Maurice Herlihy, Alex Kogan, Virendra Marathe, and Mark Moir, in International Conference on Principles of Distributed Systems, pp. 83-97, 2013, Springer.

The issue of isolation versus sharing is one that permeates systems design for the past 50 plus years. Isolation makes things clean and easy to reason about, but suffers from some disadvantages, such as increased resource utilization and increased latency costs for communications. Shared memory is actually much more difficult to program due to the need to coordinate activity to a shared resource (the memory) potentially across processors. In this paper the authors explore this space. Their rationale is simple: shared memory with lots of processors is
hard.

Even for small multi-core systems, it has become harder and harder to support a simple shared memory abstraction: processors access some memory regions more quickly than others, a phenomenon called non-uniform memory access (NUMA). These trends have prompted researchers to investigate alternative programming abstractions based on message passing rather than cache-coherent shared memory. To advance a pragmatic understanding of these models’ strengths and weaknesses, we have explored a range of different message passing and shared memory designs, for a variety of concurrent data structures, running on different multicore architectures. Our goal was to evaluate which combinations perform best, and where simple software or hardware optimizations might have the most impact. We observe that different approaches per-form best in different circumstances, and that the communication over-head of message passing can often outweigh its benefits. Nonetheless, we discuss ways in which this balance may shift in the future. Overall, we conclude that, by emphasizing high-level shared data abstractions, software should be designed to be largely independent of the choice of low-level communication mechanism.

Non-uniform memory access (NUMA) was originally constructed as a scaling mechanism for “large scale systems”.  Today, it is generally present on any multi-processor system.  What this means is that memory is divided between the processors.  Some of the memory is bound to the processor and then the rest of the memory is bound to other processors.  Thus, memory is either local or remote, with a corresponding cost differential to access it.

Each processor usually has multiple cores, with each of those cores making up one element of the node.  Some computers may have multiple sockets within a single NUMA node.  No doubt there are other models for managing physical memory beyond is basic model, but the idea is the same.

The authors are thus observing that this configuration has become so common that it is now commonly observed on real world systems.  This in turn means that programmers must deal with this.  One way to achieve this is to change the tools so they hide this complexity from the typical programmer; for some this works fine.  For operating systems and high performance systems, such as key-value stores (KVS), it is important to understand these issues in order to optimize their performance.

So these authors look at performance across a “range of message passing and shared-memory designs”.  This sounds like a laudable goal to me.

They set out to define things.  Delegation is where access to a data structure is only performed by a specific set of threads.  If another thread wishes to make a modification to the data structure, it must request that one of the controlling threads do so by sending a request.  A controlling thread for the data structure validated the operation, performs it, and then returns the results to the original requester.  Simple.  In addition, it can be used to simplify the locking model – after all, if a single thread controls the data structure then there cannot be any contention for the data structure and thus there is no locking required.  In addition, if that controlling thread is running on one core and the requester thread is running on a different processor, you can end up with better resource utilization, notably the processor caches.

But the downside to delegation is that you introduce a message passing delay.  If the data structure is hot – because it has a high volume of usage – then the delegation mechanism can become a bottleneck, involving queuing delays.

The authors then turn their attention to message passing overhead.  They evaluate different queuing mechanisms, though their message format remains fairly uniform.  Their queues are the multiple producer, single consumer (MPSCChannel), a per-NUMA node message queue (InletQueue), and a direct access without using compare-and-swap queue (DNCInletQueue).   They carefully describe how each is implemented.  Then they turn to shared memory, where the issue is synchronizing access to the shared memory.  They use several different locking mechanisms: a spin lock, an MCS lock (“ticket lock”), and two variations of a NUMA optimized MCS lock, one that attempts some level of fairness and the other that does not.

Then given these simple mechanisms they evaluate against a concurrent hash map and a concurrent linked list.  They describe their memory allocation strategy (since they are storing key/value data).  The use a two key workloads (“small” and “large”) and then use three operations mixes: a get-only mix, an insert/delete only mix, and a 50/50 mix of read and write operations.

Then they describe their results: small hash maps just don’t benefit from more threads.  Delegation is slower than shared memory.  Simple MCS locks do best under minimal contention.  Interestingly, unfair NUMA aware MCS locks perform best under high contention of the four lock types.  Linked lists is where delegation wins.  The authors argue this is due to better cache locality.  They point out considerable variance in the ticket locks depending upon the workload.

The also evaluated a mixed workload – trying to see the “hot spot” problem in their evaluation. Once again their unfair NUMA aware ticket lock has the best performance, but the authors point out that it introduces substantial delay for some threads.  If you are concerned about bounded latencies, this is not the lock to use.

They had an interesting observation about the impact of hyper-threading on performance (“sibling rivalry”) and end up ensuring that they do not compete for resources on the same core between clients and servers for the delegation case.

They point out that reducing contention for locking is important because it helps minimize the impact on the overall memory bus; thus minimizing cache contention is helpful to the overall system performance, not just the specific applications in question.

Their conclusion?  “[D]elegation can sometimes outperform direct shared-memory approaches”.

We’re going to see some strong arguments for this position in later papers.

 

 

 

NVMCached: An NVM-based Key-Value Cache

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory.  Because the authors do not have NVM available, they focus on simulating it using DRAM.  They simulate behaviors by using appropriate cache flushes and fence.  In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees.  If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this.  One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously.  The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive.  Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event.  For memcached, there are users that are asking for persistence.  Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern.   They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing.  They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots).  To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation.  Memory fragmentation is something we often fix in dynamic memory by rebooting the machine.  However, now we have the storage level problem with fragmentation; fixing this involves write amplification.  They organize their allocation strategy by observing there is considerable locality across the KV store.  They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance.   They also use DRAM to track access information as part of their LRU implementation.  This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM.  Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached.  They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.”  I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event.  They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version.  When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance.  Two of them, SYS and VAR, show substantially better performance.  Both benchmarks are write intensive.  I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

The Future of Synchronization on Multicores: The Mulitcore Transformation

The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.

I’m going to round out the week with a much lighter read.  Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.

The editor’s introduction to this piece really does a good job of summing up the problem:

Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.

The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.”  I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”

When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks create deadlocks, which can be almost impossible to debug without instrumentation.  In operating systems, we prevent deadlocks by defining a lock hierarchy.  The order in which locks can be acquired forms a graph.  To prevent deadlocks, we require that the graph be acyclic.  That sounds simple and for simple code bases, it is.  However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear.  Cycles can be introduced because we have multiple discreet components, each doing something logical, that creates a lock cycle unwittingly.

The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.”  A common maxim in systems programming is to optimize the common case.  Locks do the opposite: they burden the common case with logic that is normally not useful.

The author also points out that our lock mechanisms do not compose well:  when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state.  I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.

This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity.  It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing.  It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).

One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made.  General transaction systems can be quite complicated: this is a common database approach.

How do we make transactions simple enough to be useful in multicore shared memory environments?

  • Keep them small: they don’t change much state
  • Keep them brief: they either commit or abort quickly
  • Keep them ephemeral: they don’t involve disk I/O, they aren’t related to persistence they are related to consistency.

One benefit of transactions is they are composable: they can be nested.  Transactions can avoid issues around priority inversion, convying, and deadlocks.  The author points to other evidence that says they’re easier for programmers and yield better code.

Transactions aren’t new.  We’ve been using them for decades.  When we use them at disk I/O speeds, we find the overhead is acceptable.  When we use them at memory speeds we find the overhead of transactions is too high to make them practical to do in software.  This gave birth to the idea of hardware transactions.  Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively.  They don’t suffer from the high overhead of software transactions.  The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”.   When a cache line cannot remain in the CPU, the transaction is aborted.  Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.”  As we saw in previous work (again Exploiting Hardware Transactional Memory in Main-Memory Databasesjust retrying the operation once or twice often resolve the fault.  But sometimes the operation is just not viable on the system at the present time.

The author’s summary of the impact of hardware transactions is interesting:

The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-­level constructs like mutual exclusion locks, to concurrent data structures such as skip-­‐lists or priority queues, to system-­‐level constructs such as read-­‐copy-­‐update (RCU), all the way to run-­time support for high-­‐level language synchronization mechanisms.

So far, this change has not been pervasive.  I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances.   Systems, and software stacks, do change slowly.  Backwards compatibility is a big issue.  As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.

Exploiting Hardware Transactional Memory in Main-Memory Databases

Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineeringpp 580-591.

I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work.  By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective.  Intel hardware with transactional memory support was just starting to emerge around the time this paper was released.  I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases.  Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).

Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code).  Cache lines are read from memory, any changed made to the cache line, and then written back to memory.  This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.

The idea behind HTM is that sometimes you want to change more than a single element of memory.  For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion.  Even if there is no contention for the lock, you will pay the lock cost.  If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back.  In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).

With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction.  Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system.  The goal then is to avoid doing any memory operations unless absolutely necessary.

The authors point out that the fastest option is partitioning (ignoring hot spots).  They graphically demonstrate this in Figure 1 (from the paper).  HTM has some overhead, but it tracks with partitioning fairly linearly.  This difference is the overhead of HTM.

They compare this to serial execution, which just means performing them one at a time.  The traditional mechanism for doing this kind of paralleism is the two phase commit protocol.  That’s the lock/work/unlock paradigm.

If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work.   Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed.  Instead, we have hot spots.  This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion.  There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.”  Thus, their observation is: “we have to deal with this problem”.

So, how can HTM be exploited to provide robust scalability without partitioning.  The authors do a good job of explaining how HTM works on Intel platforms.  Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.)  Level 1 cache is the fastest to access, but the most expensive to provide.  Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU.  Level 3 cache might be present on the CPU, in which case it is shared between all three cores.  Note that none of this is required.  It just happens to be how CPUs are constructed now.

The benefit of HTM then is that it exploits the cache in an interesting new way.  Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core.  Note, however, that this could mean just the L1 cache.  In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper).  Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software).  That probability approaches 100% as the size goes above above 8KB.  Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache.  What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating).  If we need to use a ninth we evict one of the existing pages in the cache.

Similarly, when the duration of the transaction goes up, the probability of it aborting also rises.  This is shown in Figure 6 (from the paper).  This is because the chance that various systems events will occur, which cause the transaction to abort.  This includes various types of interrupts: hardware and software.

Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them. 

We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient.   The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.

Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.”  They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.

They put all of this insight together and that forms the basis for their evaluation.  Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.

Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well .  Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation.  They address this a bit later in the paper.

 

Figure 13 (from the paper) compares their performance against a partitioned system.  Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown.  One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance.   This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.

Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation.  They admit as much in the paper: the hardware just isn’t available to them.  They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.

Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations.  If an HTM is retried, many will complete with one or two tries.  Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.

Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.