Home » File Systems » Key-Value Stores » MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

By Tony Mason in Key-Value Stores on 2018-05-15

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14). April 2-4, 2014, Seattle, WA.

It’s time to turn attention to key value stores. There are quite a few papers on this subject. Seemingly simple, they turn out to be an important storage mechanism. I had a delightful discussion about them over the weekend with Keith Bostic, who knows a fair bit about key-value stores (being the author of one of the most widely used KV stores.) Of course, I have discussed some KVS already: MassTree, NVMCached, and SILT.

I’m going to continue looking at key-value stores (KVS) by starting with MICA. It is neither the first, nor the latest, work in this area, but it seems like a good place to start really describing key value stores in this space. I have quite a few to cover so let’s get started.

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.

It is an interesting system in that the authors decided to look at making their KVS tunable – it can serve as a cache (which means items can be thrown out by the KVS as space becomes scarce) or it can serve as storage (which means items cannot be removed from the KVS by the system itself, though it permits the applications using it to do so.)

To achieve this, they propose a series of data structures tuned to the task: circular logs, lossy concurrent hash indexes, and bulk chaining. Their goal is to provide high performance through low contention. They utilize dpdk for fast network processing. They split their data structures so they independently handle caching and storage separately. I might argue that this makes it two systems fused together, but let’s give them the benefit of the doubt at this point by calling it “generality”.

One of the concerns they try to address is the hot key issue. In many KVS systems they get terrific performance on synthetic workloads because they use a uniform distribution. In the real world, particularly for web objects, the workloads tend to be mostly cold with a small subset of the keys that are hot. Further, these keys change over time.

They do restrict their domain somewhat to make the problem more tractable. For example, they do not handle large items – everything must fit within a single network packet. For systems that require larger items, they suggest using a separate memory allocator. Their argument is that the extra indirection cost is a marginal increase in total latency. They are optimizing for the common case. Another aspect they exclude is durability. That makes sense – they are only implementing an in-memory KVS, so durability is not a logical part of their work. This also plays well in terms of performance, since persistence makes things much more complicated.

So what do they want to achieve with MICA?

High single-node throughput
Low end-to-end latecy
Consistent performance across workloads
Support small (but variable length) key-value objects
Support basic KVS operations: get, put, delete
Work efficiently on commodity hardware

They propose achieving this using a combination of design choices:

Parallel data access – in fact, they mean read access, as they do not parallelize these, as the cost is too high.
Optimized network overhead – as noted previously, they use dpdk to achieve this
Efficient KV structures – this includes dual memory allocators (one optimized for caching the other for store) and efficient index implementations.

There are some interesting characteristics to this system, including the fact that it employs dangling pointer handling. Rather than remove pointers from the index, they allow the pointers to become invalid and instead trap that condition. The paper describes how they do that in more detail.

They also use an interesting “prefetcher” to encourage efficient memory loading. This is done by receiving requests in bursts and scheduling those requests to be processed “soon” to this prefetch step.

Their evaluation of the system tries to focus on a range of workloads, varying key and value size as well as the range of operations being performed, with a read intensive workload being 95% reads and 5% writes. A write intensive workload is split evenly.

They also look at how the key space is handled. In the “Exclusive Read, Exclusive Write” paradigm (EREW) each core of the system is responsible for handling both reads and write for a given key (or range of keys more likely). In the “Concurrent Read, Exclusive Write” paradigm (CREW), any core may satisfy a read for the key, but only one core may satisfy a write operation for the core. This is a hat tip towards the cost of “bouncing” control of the cache lines between cores and we will see this approach used by other systems as well. One interesting question that this reminds me of each time I read about this model is what impact, if any, the Level 3 cache might have on this. Level 1 and 2 caches are traditionally per core but Level 3 is typically per socket. Thus, things stored in L3 cache are substantially faster to access than RAM. Something to consider, perhaps.

The paper has an extensive evaluation section and, of course, they demonstrate how their solution is substantially faster. One interesting aspect to their evaluation is that they claim that each component of their design decision is important in enabling them to achieve their target performance. To do this, they evaluate each individual aspect of their design decision. For example for their parallel data model, they evaluate the variations on their choice and demonstrate that CREW is only faster with read skewed workloads. They even go so far as to model Concurrent Read Concurrent Write (CRCW) workloads and point out that the cache overhead blunts the benefits of the concurrency.

To make this point on their choice of network implementations, they take MassTree and convert it to use dpdk, which yields a substantial performance improvement. While they argue that their key indexes also contribute, they do not have a distinct break-out of that implementation and fall back to pointing out how they work better, even in CREW than MassTree.

There are certainly grounds to criticize MICA. For example, they do choose workloads that fit well with their available resources. I suspect that once they become resource constrained (e.g., cache) that the performance will drop substantially. However, it provides an interesting contribution in the space and does point out how picking your data structures carefully can make a tremendous difference in the overall performance of the system.

S	M	T	W	T	F	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31