MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14). April 2-4, 2014, Seattle, WA.
It’s time to turn attention to key value stores. There are quite a few papers on this subject. Seemingly simple, they turn out to be an important storage mechanism. I had a delightful discussion about them over the weekend with Keith Bostic, who knows a fair bit about key-value stores (being the author of one of the most widely used KV stores.) Of course, I have discussed some KVS already: MassTree, NVMCached, and SILT.
I’m going to continue looking at key-value stores (KVS) by starting with MICA. It is neither the first, nor the latest, work in this area, but it seems like a good place to start really describing key value stores in this space. I have quite a few to cover so let’s get started.
MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.
It is an interesting system in that the authors decided to look at making their KVS tunable – it can serve as a cache (which means items can be thrown out by the KVS as space becomes scarce) or it can serve as storage (which means items cannot be removed from the KVS by the system itself, though it permits the applications using it to do so.)
To achieve this, they propose a series of data structures tuned to the task: circular logs, lossy concurrent hash indexes, and bulk chaining. Their goal is to provide high performance through low contention. They utilize dpdk for fast network processing. They split their data structures so they independently handle caching and storage separately. I might argue that this makes it two systems fused together, but let’s give them the benefit of the doubt at this point by calling it “generality”.
One of the concerns they try to address is the hot key issue. In many KVS systems they get terrific performance on synthetic workloads because they use a uniform distribution. In the real world, particularly for web objects, the workloads tend to be mostly cold with a small subset of the keys that are hot. Further, these keys change over time.
They do restrict their domain somewhat to make the problem more tractable. For example, they do not handle large items – everything must fit within a single network packet. For systems that require larger items, they suggest using a separate memory allocator. Their argument is that the extra indirection cost is a marginal increase in total latency. They are optimizing for the common case. Another aspect they exclude is durability. That makes sense – they are only implementing an in-memory KVS, so durability is not a logical part of their work. This also plays well in terms of performance, since persistence makes things much more complicated.
So what do they want to achieve with MICA?
- High single-node throughput
- Low end-to-end latecy
- Consistent performance across workloads
- Support small (but variable length) key-value objects
- Support basic KVS operations: get, put, delete
- Work efficiently on commodity hardware
They propose achieving this using a combination of design choices:
- Parallel data access – in fact, they mean read access, as they do not parallelize these, as the cost is too high.
- Optimized network overhead – as noted previously, they use dpdk to achieve this
- Efficient KV structures – this includes dual memory allocators (one optimized for caching the other for store) and efficient index implementations.
There are some interesting characteristics to this system, including the fact that it employs dangling pointer handling. Rather than remove pointers from the index, they allow the pointers to become invalid and instead trap that condition. The paper describes how they do that in more detail.
They also use an interesting “prefetcher” to encourage efficient memory loading. This is done by receiving requests in bursts and scheduling those requests to be processed “soon” to this prefetch step.
Their evaluation of the system tries to focus on a range of workloads, varying key and value size as well as the range of operations being performed, with a read intensive workload being 95% reads and 5% writes. A write intensive workload is split evenly.
They also look at how the key space is handled. In the “Exclusive Read, Exclusive Write” paradigm (EREW) each core of the system is responsible for handling both reads and write for a given key (or range of keys more likely). In the “Concurrent Read, Exclusive Write” paradigm (CREW), any core may satisfy a read for the key, but only one core may satisfy a write operation for the core. This is a hat tip towards the cost of “bouncing” control of the cache lines between cores and we will see this approach used by other systems as well. One interesting question that this reminds me of each time I read about this model is what impact, if any, the Level 3 cache might have on this. Level 1 and 2 caches are traditionally per core but Level 3 is typically per socket. Thus, things stored in L3 cache are substantially faster to access than RAM. Something to consider, perhaps.
The paper has an extensive evaluation section and, of course, they demonstrate how their solution is substantially faster. One interesting aspect to their evaluation is that they claim that each component of their design decision is important in enabling them to achieve their target performance. To do this, they evaluate each individual aspect of their design decision. For example for their parallel data model, they evaluate the variations on their choice and demonstrate that CREW is only faster with read skewed workloads. They even go so far as to model Concurrent Read Concurrent Write (CRCW) workloads and point out that the cache overhead blunts the benefits of the concurrency.
To make this point on their choice of network implementations, they take MassTree and convert it to use dpdk, which yields a substantial performance improvement. While they argue that their key indexes also contribute, they do not have a distinct break-out of that implementation and fall back to pointing out how they work better, even in CREW than MassTree.
There are certainly grounds to criticize MICA. For example, they do choose workloads that fit well with their available resources. I suspect that once they become resource constrained (e.g., cache) that the performance will drop substantially. However, it provides an interesting contribution in the space and does point out how picking your data structures carefully can make a tremendous difference in the overall performance of the system.
NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.
This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory. Because the authors do not have NVM available, they focus on simulating it using DRAM. They simulate behaviors by using appropriate cache flushes and fence. In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.
Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees. If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this. One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously. The authors of this paper do this.
They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive. Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)
One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event. For memcached, there are users that are asking for persistence. Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.
However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern. They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing. They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots). To prevent this situation they choose to split their data structures.
The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.
The authors also touch on the challenges of memory allocation. Memory fragmentation is something we often fix in dynamic memory by rebooting the machine. However, now we have the storage level problem with fragmentation; fixing this involves write amplification. They organize their allocation strategy by observing there is considerable locality across the KV store. They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance. They also use DRAM to track access information as part of their LRU implementation. This creates a burden on crash recovery, but benefits from significant improved runtime performance.
They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM. Those hot items will thus be discarded if the system crashes.
Their evaluation is against a modified version of memcached. They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.” I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.
They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event. They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version. When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance. Two of them, SYS and VAR, show substantially better performance. Both benchmarks are write intensive. I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.
Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.
SILT: A Memory-Efficient, High-Performance Key-Value Store
Hyeontaek Lim, Bin Fan, David G. Andersen, Michael Kaminsky, in Proceedings of the 23rd Symposium on Operating Systems Principles (SOSP ’11), October 23-26, 2011, Cascais, Portugal.
In this paper, we have an interesting hybrid key-value store; part of the store is in DRAM, while the rest of it is stored in flash storage. The authors have focused on optimizing the performance of the store across the specific media chosen, which leads to some interesting choices and observations.
SILT (Small Index Large Table) is a memory-efficient, highperformance
key-value store system based on flash storage that
scales to serve billions of key-value items on a single node. It requires
only 0.7 bytes of DRAM per entry and retrieves key/value
pairs using on average 1.01 flash reads each. SILT combines new
algorithmic and systems techniques to balance the use of memory,
storage, and computation. Our contributions include: (1) the design
of three basic key-value stores each with a different emphasis on
memory-efficiency and write-friendliness; (2) synthesis of the basic
key-value stores to build a SILT key-value store system; and (3) an
analytical model for tuning system parameters carefully to meet the
needs of different workloads. SILT requires one to two orders of
magnitude less memory to provide comparable throughput to current
high-performance key-value systems on a commodity desktop
system with flash storage.
Thus, to achieve optimal performance they construct a hybrid ensemble of key-value stores that are optimized to particular underlying media.
They start early with their performance observations, and continue this pace throughout the paper. Thus, they point out that they have a small memory requirement (0.7 bytes per key value) and only a single read from flash required:
This fits well with their observation that memory efficiency is critically important for scalable key-value stores. Their system saturated the capabilities of their hardware with 46,000 lookups per second, though this doesn’t guarantee that it will do so on newer, faster hardware (and we will get to several current KV stores.
- LogStore – uses a CPU-efficient cuckoo-hash based only upon a tag derived from the full key (20 bytes). The benefit of this approach is that it avoids accessing the flash memory if the key is not present. It is probabilistic, in that a key might not be present in Flash with some low probability.
- HashStore – uses a memory-efficient storage model for their cuckoo flash table. These stores are immutable and are stored in flash.
- SortedStore – They use an entropy-coded trie, which provides a very space efficient indexing mechanism: the average according to the authors is 0.4 bytes per key. Given that keys are 160 bits (20 bytes), which means the index is 64 bits (8 bytes). The authors provide us with an impressive list of specific optimizations they have applied to sorting data, packing indexes, intelligent (non-pointer) based addressing,
They also provide us with pretty figures to graphically lay out their data structures. First, Figure 3, which shows the in-memory cuckoo hash table:
When we move onto HashStore, they show us how the structure of the data is moved from insertion order to hash sorted order as the data is moved from memory to flash.
Of course, one of the important aspects of a key-value store is the ability to look up the data quickly. Thus, they also show us how they sort data using a prefix tree (Trie). This is a key aspect of their SortedStore:
The in-memory Trie, permits them to rapidly find the corresponding data in Flash. This is one of their key motivations for very efficient encoding of the Trie: it ensures they can store it in memory. Once they can no longer store the indices in memory the lookup system will become I/O bound (to the Flash, in this case).
Figure 6 then shows how they actually encode the length of the tree in each direction, and how it is entropy encoded:There is some serious thought that has gone into their layout. They do admit, however, that they are not providing crash tolerance: there will be data that was stored in DRAM that cannot be recovered after a crash. They do not discuss the use of NVM for achieving this. Instead, they suggest a synchronous write mechanism in which the write blocks until the log data has been written to the flash-based log.
One of the interesting aspects of their performance evaluation is the recognition that they have quite a few parameters that must be chosen. Thus, they spend more than a page of the paper just discussing the relative merits of the trade-offs involved in selecting various parameters. They consider write amplification (the need to write more than just the data being preserved), read amplification (performing reads only to discover the data that you seek is not present), memory overhead (how much DRAM is used), and update rate versus flash life time. This latter point is an interesting one to keep in mind when reading about non-volatile memory: it wears out. There is a certain amount of degradation that occurs in the underlying physical medium each time it is erased and re-used. The specific details are dependent upon the characteristics of the medium, but it is an issue to keep in mind.
The evaluation of SILT is certainly interesting. They choose not to compare to any other KV system and instead focus on the performance of their system using a variety of fairly standard loads (YCSB) as well as simple get/put key operations as micro-benchmarks. They end up claiming they they can lookup almost 45k keys per second in the slowest of their hybrid stores. When combined, they indicate they can achieve almost 58k get operations per second on a single core. As they move to multiple cores they see some further improvement (180k get operations per second for data that is already in memory).
SILT offers some interesting observations, particularly about highly efficient use of memory and novel indexing schemes.
Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao, Eddie Kohler, Robert Morris, in Proceedings of the 7th ACM european conference on Computer Systems (Eurosys ’12), pp. 183-196, Bern, Switzerland, April 10 – 13, 2012.
In this work, the authors build a key-value system called MassTree. This is an in-memory Key-Value store (I will look at several of them, in fact). What is novel about this particular variant is the authors focus on providing a high-performance parallel access implementation and the mechanisms by which they achieve this.
MassTree support arbitrary-length keys; it assumes nothing about the keys (so they may be binary data). It uses B+-trees for storage organized into a Trie structure. The B+-tree stores a fixed slice of the key; the Trie connects the various slices together. This use is interesting, since most of the cases I’ve seen of Tries are for strings, as they do an excellent job of managing overlapping string data (prefix trees). The authors use this for binary keys. There is nothing inherently more difficult about binary versus string keys (since they are equivalent) but this choice makes the solution very flexible, as it is not particularly data dependent. This basic layout is shown in Figure 1 (from the paper).
One of the challenges with concurrent data structures is how to handle the common case – no collisions – with minimal performance overhead. The classic mutual exclusion model (or reader/writer locks) involves a certain amount of overhead, even for non-contended locks, because of the need to perform interlocked operations against shared (common) memory where the locks are maintained. The system implemented by the authors does not require any locking for readers (lookups). Updates are done with locks local to the CPU, which helps minimize the contention of typical locks.
One interesting observation is that their choice of this hybrid Trie/B+-tree structure was motivated by prior systems that struggled with performance in the presence of variable length keys. In MassTree, the rate limiting factor for queries is the cost of walking the tree. They minimize this by using “a wide fan-out tree to reduce tree-depth, prefetches nodes from DRAM to overlap fetch latencies, and carefully lays out data in cache lines to reduce the amount of data needed per node.”
Hence, my interest in this paper: these all seem to be important lessons for persistent memory as well, where the latencies are somewhat larger than for DRAM. Further, the authors are concerned about correctness. They have not looked at persistence (and recoverable consistency), so there is still further work for me to do should I investigate it further.
The authors conclude by claiming the following contributions:
First, an in-memory concurrent tree that supports keys with shared prefixes efficiently. Second, a set of techniques for laying out the data of each tree node, and accessing it, that reduces the time spent waiting for DRAM while descending the tree. Third, a demonstration that a single tree shared among multiple cores can provide higher performance than a partitioned design for some workloads. Fourth, a complete design that addresses all bottlenecks in the way of million-query-per-second performance.
Their emphasis on correctness and cache efficiency is certainly an attractive aspect of this work.
The driving considerations for their design were: efficient support of many different key distributions including binary and variable length keys, with common prefixes; fine grained concurrent access for high performance parallel access, and cache efficiency through optimal data placement and prefetching.
The authors establish the characteristics of their tree structure, including data placement. These make reasoning about their tree characteristics simpler. For example, they note that MassTree does not guarantee that it is a balanced structure. However, due to the way the tree itself is structured, they have the same algorithmic cost: O(l log n) comparisions, where l is the length of the key and n is the depth of the tree.
As a pragmatic check on their implementation, they also use a partial-key B-tree (pkB-Tree) for comparison. Despite the fact the pkB-Tree is balanced, the authors note that MassTree performs favorably well on several benchmarks. The authors go into detail about their implementation details, including the construction of border nodes and interior nodes, as well as how they lay out data (again, with an eye towards cache line efficiency).
To achieve this efficiently, they use a versioning scheme. A node has a version number. To modify the node, a given processor must update the version to indicate that it is changing the node. A reader will snapshot the version at the start of the read, and compare it at the end of the read. If it changed, the reader knows the state may have changed and can retry the read operation (essentially a variant of software transactional memory). The detailed diagram of this is shown in Figure 3 (from the paper).
The paper describes the concurrency model in the face of conflicting writers as well. By keeping their lock in the same cache line as their data, they exploit the cache coherence protocol. The authors note that lock-free operations have comparable cache behavior (e.g., compare-and-swap or link-load-store-conditional).
Indeed, much of the rest of the technical content of the paper is explaining why their approach is correct – an essential point for concurrent access systems. Without that, there really is not much point!
While brief, their discussion about value storage is interesting: their measurements are done assuming that values will be small. They state they have a scheme for managing large values as well, via a separate allocator. While they do not claim this, my observation is that a “real world” system would likely need to have some hybrid form of this.
MassTree is an in-memory key-value store. To provide persistence, they rely upon a write-behind log scheme. When used via the network interface, the writes are not guaranteed. To minimize the loss window, they choose a 200 ms timer. Thus, the log is written to disk every 200 ms. They do not evaluate this persistence model, offering it to us as an explanation that persistence is not incompatible with performance.
This last point is an interesting one, especially when considered in the context of NVM: what are the trade-offs. The authors hint at this, but do not explore this space.
For those interested, a version of the source code can be found here: https://github.com/kohler/masstree-beta
I have made it a goal for 2018 to answer a question several people have asked me: what papers should I read to learn more about file systems. So I’ve decided to attempt to copy a format that I’ve found useful – The Morning Paper. I admit, I am not sure I’ll be able to keep up the frenetic pace of a paper each day that he maintains, but I do know there are plenty of papers to read.
My motivation for doing this is simple: there are quite a few papers on this topic, I certainly haven’t read them all (or even the majority of them) and reading through them, along with my own interpretation of why the paper matters to File Systems will be useful for me. It also gives me a set of information that I can point out to people that ask me “so what should I read to learn more about this topic.”
Why File Systems? For me it’s been largely a quirk of fate. I’ve been working on operating systems for many years now, and file systems happens to be the area in which I’ve spent more time than any other. For something that is conceptually so simple – after all, it just maps files to blocks on your disk, it has considerable complexity. File systems are also the gateway to one of the more challenging parts of the operating system: the bit that stores persistent state. Errors in file systems are often not transient. That makes them challenging. The bar is set high because we don’t want to lose anyone’s data.
File systems are part of the plumbing of the operating system. They are essential to proper operation but when everything is working properly, nobody really notices they are there. Only when things go wrong does anyone notice. So it is within this world that I will delve.
The place to start is to even attempt to define file systems. While you might think this is simple, it turns out to be surprisingly challenging. So I’ll approach it by providing some examples and then delving into what means to be a file system.
Media File Systems
The easiest place to start is in the concept of the media file system. “Media” in this context means any tangible medium on which we can systematically record and/or read persistent information. Examples would include disks, tapes, non-volatile memory, or optical media. Whether or not RAM file systems fall into this category is an interesting question – and it helps demonstrate my point that pinning down this concept is more challenging than one might otherwise think.
Disk drives are typically what most people think of first for this category, though we now often view solid-state disk drives in the same category. File systems reside on top of media devices and keep track of the organization of the data itself. Some file systems can span multiple disk, or use transactional journals, or provide resilience against errors in the media itself. The continuum here is surprisingly broad, but in the end the purpose of the file system is to map from the vagaries of the media to a (mostly) uniform model of behavior so application programs “just work”.
Network File Systems
As we added computers to networks, we found it useful to be able to transfer information stored on one computer to another. This permitted applications to access data, regardless of where it was actually stored. By constructing a file system that uses a remote access protocol, we can present the remote storage device to the local system as if it were just a local file system – or rather, almost the same.
Good examples of this include the Network File System (NFS) originally developed by SUN Microsystems in the 1980s and the Andrew File System (AFS) originally developed by Carnegie-Mellon University in the same time period. These days many people also use the Server Message Block (SMB) based network file systems; the roots of this are also back in the 1980s. All three of these remain in use today, in fact, though they have evolved from the earliest versions.
There is considerable overlap between the world of file systems and databases. Several early file systems were really just single level lookup mechanisms, rather than the hierarchical name space that is so common these days. It is actually common to implement file systems so that hierarchy is added to a flat indexed name space (a “flat” file system). These are in essence one form of key-value store. Some file systems permit addressing by using keys, whether explicitly assigned or implicitly generated. A subsection of the literature focus more on this type of file system and I will make sure to cover several of these.
Sometimes a file system is more about the namespace than anything else. For example the proc file system does not actually reside on top of any sort of storage. The name space provides a convenient way to find information that is generated on demand. In these cases, it is the name space that really matters, not how the information is stored.
This conflation of storage, presentation, and name-space add to the richness and complexity of work being done in file systems. My goal is to explore these issues, by reviewing the literature.