Home » 2018 » May

Monthly Archives: May 2018

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14).  April 2-4, 2014, Seattle, WA.

It’s time to turn attention to key value stores.  There are quite a few papers on this subject.  Seemingly simple, they turn out to be an important storage mechanism.  I had a delightful discussion about them over the weekend with Keith Bostic, who knows a fair bit about key-value stores (being the author of one of the most widely used KV stores.)  Of course, I have discussed some KVS already: MassTree, NVMCached, and SILT.

I’m going to continue looking at key-value stores (KVS) by starting with MICA. It is neither the first, nor the latest, work in this area, but it seems like a good place to start really describing key value stores in this space.  I have quite a few to cover so let’s get started.

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.

It is an interesting system in that the authors decided to look at making their KVS tunable – it can serve as a cache (which means items can be thrown out by the KVS as space becomes scarce) or it can serve as storage (which means items cannot be removed from the KVS by the system itself, though it permits the applications using it to do so.)

To achieve this, they propose a series of data structures tuned to the task: circular logs, lossy concurrent hash indexes, and bulk chaining.  Their goal is to provide high performance through low contention.  They utilize dpdk for fast network processing.  They split their data structures so they independently handle caching and storage separately.  I might argue that this makes it two systems fused together, but let’s give them the benefit of the doubt at this point by calling it “generality”.

One of the concerns they try to address is the hot key issue.  In many KVS systems they get terrific performance on synthetic workloads because they use a uniform distribution.  In the real world, particularly for web objects, the workloads tend to be mostly cold with a small subset of the keys that are hot.  Further, these keys change over time.

They do restrict their domain somewhat to make the problem more tractable.  For example, they do not handle large items – everything must fit within a single network packet.  For systems that require larger items, they suggest using a separate memory allocator.  Their argument is that the extra indirection cost is a marginal increase in total latency.  They are optimizing for the common case.  Another aspect they exclude is durability.  That makes sense – they are only implementing an in-memory KVS, so durability is not a logical part of their work.  This also plays well in terms of performance, since persistence makes things much more complicated.

So what do they want to achieve with MICA?

  • High single-node throughput
  • Low end-to-end latecy
  • Consistent performance across workloads
  • Support small (but variable length) key-value objects
  • Support basic KVS operations: get, put, delete
  • Work efficiently on commodity hardware

They propose achieving this using a combination of design choices:

  • Parallel data access – in fact, they mean read access, as they do not parallelize these, as the cost is too high.
  • Optimized network overhead – as noted previously, they use dpdk to achieve this
  • Efficient KV structures – this includes dual memory allocators (one optimized for caching the other for store) and efficient index implementations.

Figure 7

There are some interesting characteristics to this system, including the fact that it employs dangling pointer handling.  Rather than remove pointers from the index, they allow the pointers to become invalid and instead trap that condition.  The paper describes how they do that in more detail.

They also use an interesting “prefetcher” to encourage efficient memory loading.  This is done by receiving requests in bursts and scheduling those requests to be processed “soon” to this prefetch step.

Figure 9Their evaluation of the system tries to focus on a range of workloads, varying key and value size as well as the range of operations being performed, with a read intensive workload being 95% reads and 5% writes.  A write intensive workload is split evenly.

They also look at how the key space is handled.  In the “Exclusive Read, Exclusive Write” paradigm (EREW) each core of the system is responsible for handling both reads and write for a given key (or range of keys more likely).  In the “Concurrent Read, Exclusive Write” paradigm (CREW), any core may satisfy a read for the key, but only one core may satisfy a write operation for the core.  This is a hat tip towards the cost of “bouncing” control of the cache lines between cores and we will see this approach used by other systems as well.  One interesting question that this reminds me of each time I read about this model is what impact, if any, the Level 3 cache might have on this.  Level 1 and 2 caches are traditionally per core but Level 3 is typically per socket.  Thus, things stored in L3 cache are substantially faster to access than RAM.  Something to consider, perhaps.

The paper has an extensive evaluation section and, of course, they demonstrate how their solution is substantially faster.  One interesting aspect to their evaluation is that they claim that each component of their design decision is important in enabling them to achieve their target performance.   To do this, they evaluate each individual aspect of their design decision.  For example for their parallel data model, they evaluate the variations on their choice and demonstrate that CREW is only faster with read skewed workloads.  They even go so far as to model Concurrent Read Concurrent Write (CRCW) workloads and point out that the cache overhead blunts the benefits of the concurrency.

To make this point on their choice of network implementations, they take MassTree and convert it to use dpdk, which yields a substantial performance improvement.  While they argue that their key indexes also contribute, they do not have a distinct break-out of that implementation and fall back to pointing out how they work better, even in CREW than MassTree.

There are certainly grounds to criticize MICA.  For example, they do choose workloads that fit well with their available resources.  I suspect that once they become resource constrained (e.g., cache) that the performance will drop substantially.  However, it provides an interesting contribution in the space and does point out how picking your data structures carefully can make a tremendous difference in the overall performance of the system.


Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores.  The authors point out that the idea of using timestamps for constructing a transactional system.  When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system.  After all, you can just have a global counter that is incremented as each transaction proceeds.  That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck.  While we do not usually think of memory operations as being a bottleneck, they certainly can be.  This is because multi-processor computers look much like a distributed system.  They exchange messages to provide coherency guarantees.  Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees.  This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor).  So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor.   It writes those changes back at some point in the future.  One reason is because another processor tries to access something within that same memory region.  This is where the messages go flying back and forth, the modified cache line gets written back to main memory.  This turns out to be “expensive”.  Access to RAM on my laptop is around 65 nanoseconds.  Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU.  On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds.  So it is 130 times slower.  On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock.  They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650.  The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU.  From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock.  The Intel chip, on the other hand, has a clock that can be used to construct global atomicity.  There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system.   They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.

Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores

Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores
Irina Calciu, Dave Dice, Tim Harris, Maurice Herlihy, Alex Kogan, Virendra Marathe, and Mark Moir, in International Conference on Principles of Distributed Systems, pp. 83-97, 2013, Springer.

The issue of isolation versus sharing is one that permeates systems design for the past 50 plus years. Isolation makes things clean and easy to reason about, but suffers from some disadvantages, such as increased resource utilization and increased latency costs for communications. Shared memory is actually much more difficult to program due to the need to coordinate activity to a shared resource (the memory) potentially across processors. In this paper the authors explore this space. Their rationale is simple: shared memory with lots of processors is

Even for small multi-core systems, it has become harder and harder to support a simple shared memory abstraction: processors access some memory regions more quickly than others, a phenomenon called non-uniform memory access (NUMA). These trends have prompted researchers to investigate alternative programming abstractions based on message passing rather than cache-coherent shared memory. To advance a pragmatic understanding of these models’ strengths and weaknesses, we have explored a range of different message passing and shared memory designs, for a variety of concurrent data structures, running on different multicore architectures. Our goal was to evaluate which combinations perform best, and where simple software or hardware optimizations might have the most impact. We observe that different approaches per-form best in different circumstances, and that the communication over-head of message passing can often outweigh its benefits. Nonetheless, we discuss ways in which this balance may shift in the future. Overall, we conclude that, by emphasizing high-level shared data abstractions, software should be designed to be largely independent of the choice of low-level communication mechanism.

Non-uniform memory access (NUMA) was originally constructed as a scaling mechanism for “large scale systems”.  Today, it is generally present on any multi-processor system.  What this means is that memory is divided between the processors.  Some of the memory is bound to the processor and then the rest of the memory is bound to other processors.  Thus, memory is either local or remote, with a corresponding cost differential to access it.

Each processor usually has multiple cores, with each of those cores making up one element of the node.  Some computers may have multiple sockets within a single NUMA node.  No doubt there are other models for managing physical memory beyond is basic model, but the idea is the same.

The authors are thus observing that this configuration has become so common that it is now commonly observed on real world systems.  This in turn means that programmers must deal with this.  One way to achieve this is to change the tools so they hide this complexity from the typical programmer; for some this works fine.  For operating systems and high performance systems, such as key-value stores (KVS), it is important to understand these issues in order to optimize their performance.

So these authors look at performance across a “range of message passing and shared-memory designs”.  This sounds like a laudable goal to me.

They set out to define things.  Delegation is where access to a data structure is only performed by a specific set of threads.  If another thread wishes to make a modification to the data structure, it must request that one of the controlling threads do so by sending a request.  A controlling thread for the data structure validated the operation, performs it, and then returns the results to the original requester.  Simple.  In addition, it can be used to simplify the locking model – after all, if a single thread controls the data structure then there cannot be any contention for the data structure and thus there is no locking required.  In addition, if that controlling thread is running on one core and the requester thread is running on a different processor, you can end up with better resource utilization, notably the processor caches.

But the downside to delegation is that you introduce a message passing delay.  If the data structure is hot – because it has a high volume of usage – then the delegation mechanism can become a bottleneck, involving queuing delays.

The authors then turn their attention to message passing overhead.  They evaluate different queuing mechanisms, though their message format remains fairly uniform.  Their queues are the multiple producer, single consumer (MPSCChannel), a per-NUMA node message queue (InletQueue), and a direct access without using compare-and-swap queue (DNCInletQueue).   They carefully describe how each is implemented.  Then they turn to shared memory, where the issue is synchronizing access to the shared memory.  They use several different locking mechanisms: a spin lock, an MCS lock (“ticket lock”), and two variations of a NUMA optimized MCS lock, one that attempts some level of fairness and the other that does not.

Then given these simple mechanisms they evaluate against a concurrent hash map and a concurrent linked list.  They describe their memory allocation strategy (since they are storing key/value data).  The use a two key workloads (“small” and “large”) and then use three operations mixes: a get-only mix, an insert/delete only mix, and a 50/50 mix of read and write operations.

Then they describe their results: small hash maps just don’t benefit from more threads.  Delegation is slower than shared memory.  Simple MCS locks do best under minimal contention.  Interestingly, unfair NUMA aware MCS locks perform best under high contention of the four lock types.  Linked lists is where delegation wins.  The authors argue this is due to better cache locality.  They point out considerable variance in the ticket locks depending upon the workload.

The also evaluated a mixed workload – trying to see the “hot spot” problem in their evaluation. Once again their unfair NUMA aware ticket lock has the best performance, but the authors point out that it introduces substantial delay for some threads.  If you are concerned about bounded latencies, this is not the lock to use.

They had an interesting observation about the impact of hyper-threading on performance (“sibling rivalry”) and end up ensuring that they do not compete for resources on the same core between clients and servers for the delegation case.

They point out that reducing contention for locking is important because it helps minimize the impact on the overall memory bus; thus minimizing cache contention is helpful to the overall system performance, not just the specific applications in question.

Their conclusion?  “[D]elegation can sometimes outperform direct shared-memory approaches”.

We’re going to see some strong arguments for this position in later papers.