Home » Posts tagged 'KVS'

Tag Archives: KVS

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
January 2025
S M T W T F S
 1234
567891011
12131415161718
19202122232425
262728293031  

Reflections on Teaching Distributed Systems

During January through April 2023 I taught CPSC 416 at UBC. It was the first time I had taught this course and it would not have been possible without the assistance I received from others, notably Ada Gavrilovska and Ivan Bestchastnikh both of whom allowed me to use their materials in creating my own course. Of course, I reordered things, and adapted them to fit the class.

Teaching a course for the first time is always an illuminating experience. I’ve designed and taught classes on systems topics, including elements of DCE/DFS, the distributed file system on which I worked back when I was a twenty-something software developer, classes on Windows driver development (device drivers, file systems, and file system filter drivers,) and Windows kernel debugging. I even explored utilizing online mechanisms for providing a non-linear educational approach to core OS concepts (processes, threads, scheduling, and synchronization) as part of my MSCS work at Georgia Tech. Thus, I have enjoyed engaging in education and looking for ways to do better for much of my life. I have always found insights each time I teach a new course. CPSC 416 was no exception to this.

I have never taken a distributed systems class. I learned about distributed systems organically. In my senior year at the University of Chicago I worked for the nascent Computer Science department as part of the facilities team and part of that work included networking. I remember soldering connectors together for Ethernet connections of the time – vastly different than the RJ-45 connections we use now, but the same technology we use today. After graduation I took a job at Stanford working with David Cheriton, who ran the “Distributed Systems Group.” The V operating system, which his group developed, is what I used on my desktop, and I built a number of network components as part of my work for him, including a network protocol (VMTP) which I implemented on a BSD 4.2 UNIX based system, diskless bootstrap drivers, and even an IP-multicast version of a multi-player Mazewar variant.

From Stanford I went to Transarc, a CMU-research inspired start-up company that had two very different product directions: one was an online transaction processing system, and the other was a commercialization of the AFS distributed file system. I also worked on the successor (the DCE/DFS project I mentioned earlier.)

Thus, my background in distributed systems was building distributed systems, often from the perspective of not knowing what I was doing but being surrounded by smart people that helped me figure it out. In some ways, that was my objective in teaching CPSC 416: paying that hard work forward.

One observation now: things that you learn in your twenties become “assumed knowledge” quite easily in your fifties. Thus, I just assumed that everyone knew about how databases maintain consistency in the face of failures. This turns out not to be true. So, one of my first lessons here was that I need to explain this up front in order for many of the things I say afterwards make sense. What is the point about replicating a log (what database people usually refer to as a journal) if you don’t understand that a log is the basic mechanism we use to restore consistency in a database. Lesson 1: teach people about databases and recoverability.

The second observation stems from my first oversight: building transactionally safe recoverable systems is hard. You’d think I’d know that, since I built a transactionally safe recoverable system back in the late 1980s and early 1990s as part of my work on Episode, the local physical file system that we used to support some of the nifty features of DCE/DFS. Episode, in some form, continues to be used in production today (file systems have unnaturally long lives if they get any serious adoption.) I would be quite surprised if the underlying transactional system were significantly different than it was thirty years ago when I worked on it. Lesson 2: teach people about building transactionally safe databases. Related to this is explaining key-value stores explicitly. They are a key part of the programming assignments and understanding why we use them helps. They typically form the basis of databases and file systems. Indeed, file systems are typically key-value stores with a name space built on top of them. The keys are limited (integers representing an entry in a potentially sparse table of objects) and the values are mutable (which complicates implementation and correctness).

The third observation is not an original one, as I have learned in conversations with other educators. There is a fundamental mis-alignment between the objectives of students (which is essentially grade maximization) and my objectives (which is “learn useful stuff that will help you throughout your career.”) This isn’t a big surprise – after all, I have published work in plagiarism reduction – but trying to find ways to fix it is challenging. I had not expected the insane amount of pressure students seem to feel to maximize their grades. I did try to mitigate this somewhat by offering extra credit opportunities, though in the end that seemed to create stress for many to exploit those. Why people who are getting grades in the 90%+ range are “afraid of failing” is beyond me. Lesson 3: extra credit creates more stress than it alleviates. I don’t think this is entirely the case but I’ll be more cautious about using it in the future. Still, I don’t want people to worry that they are going to fail so my thought is to provide an incentive to participate that mitigates the likelihood of them failing.

My fourth observation is that I had never read the various papers about distributed consensus side by side before. Doing so was an eye-opening experience. What I learned is that in many cases the complications in those papers relate to: (1) optimizations; and (2) recovery. Thus, next time I teach this class I want to spend more time walking through the baseline protocol and then pointing out optimizations and handling recovery. One example of this was when I had a student point out that the Paxos Made Moderately Complex paper (PMMC) states that during the leader election phase the voting party sends along a list of their accepted but not committed proposals (from the previous leadership). This is not part of the protocol. It is an optimization that makes recovery faster and more efficient, but you can’t rely upon it to maintain correctness. Now that I understand this point of confusions better, I think I can walk people through it and distinguish this. Doing so will help people understand the underlying protocol better and then the optimizations we use to ensure it works correctly. Lesson 4: walk through the papers more carefully, explaining the base protocol and then pointing out that the primary difference is in optimizations and recovery mechanisms.

My fifth observation is that students focus too much on code and not enough on understanding. Distributed systems is an area in which one must think through failure cases, identify how you will handle them, and what you assume is going to be true (your “invariants”) throughout your code base. I did introduce some tools for doing this (modeling and TLA+ specifically) but I did not incorporate them into the actual assignments. I did have them write reports, but those were post-hoc reports. I would like to try making the design cycle a more prominent portion of this work, encouraging people to think about what they are building rather than trying to hack their way through it. One piece of feedback from several students was that my advice to “walk away and think through the project” was quite helpful. I’d like to make the structure of the course make that happen more naturally. I also think that by having explicit design milestones it would reduce stress by encouraging students to work on the projects before the deadline. Lesson 5: design is more important than code, but code helps students verify their design reflects good understanding. The challenge will be in finding the right balance between the two.

I have other, smaller observations as well that I won’t break out but I’ll capture here:

  • Extensions don’t really help.
  • Providing a flexible late policy can be helpful, but it often creates quite a lot of stress.
  • Some students abhor teams, some like them. I need to find a way to accommodate both learning styles in a way that is equitable.
  • Do as much as possible to simplify grading exams.
  • Make a conscious effort after each lesson to create questions for the exam. I provided individualized exams (drawn from a pool of questions) and I wish I’d had more questions from which to draw. What was particularly nice was being able to provide people with their own personalized exam’s answers right at the end of the exam. It also helped me identify some issues.

I do think a number of steps that I took in this course worked well. People (generally) liked the failure examples, they liked the responsiveness, they liked the material. It is easy to focus on just the negatives, but I want to make sure and acknowledge the positives because it is important to preserve those elements.

Finally, I have agreed to teach this course again in the fall (which for UBC means “Winter Term 1”) so I will have an opportunity to incorporate what I have learned into the next course offering. I’m sure I’ll have more insights after that class.

Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores

Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores
Irina Calciu, Dave Dice, Tim Harris, Maurice Herlihy, Alex Kogan, Virendra Marathe, and Mark Moir, in International Conference on Principles of Distributed Systems, pp. 83-97, 2013, Springer.

The issue of isolation versus sharing is one that permeates systems design for the past 50 plus years. Isolation makes things clean and easy to reason about, but suffers from some disadvantages, such as increased resource utilization and increased latency costs for communications. Shared memory is actually much more difficult to program due to the need to coordinate activity to a shared resource (the memory) potentially across processors. In this paper the authors explore this space. Their rationale is simple: shared memory with lots of processors is
hard.

Even for small multi-core systems, it has become harder and harder to support a simple shared memory abstraction: processors access some memory regions more quickly than others, a phenomenon called non-uniform memory access (NUMA). These trends have prompted researchers to investigate alternative programming abstractions based on message passing rather than cache-coherent shared memory. To advance a pragmatic understanding of these models’ strengths and weaknesses, we have explored a range of different message passing and shared memory designs, for a variety of concurrent data structures, running on different multicore architectures. Our goal was to evaluate which combinations perform best, and where simple software or hardware optimizations might have the most impact. We observe that different approaches per-form best in different circumstances, and that the communication over-head of message passing can often outweigh its benefits. Nonetheless, we discuss ways in which this balance may shift in the future. Overall, we conclude that, by emphasizing high-level shared data abstractions, software should be designed to be largely independent of the choice of low-level communication mechanism.

Non-uniform memory access (NUMA) was originally constructed as a scaling mechanism for “large scale systems”.  Today, it is generally present on any multi-processor system.  What this means is that memory is divided between the processors.  Some of the memory is bound to the processor and then the rest of the memory is bound to other processors.  Thus, memory is either local or remote, with a corresponding cost differential to access it.

Each processor usually has multiple cores, with each of those cores making up one element of the node.  Some computers may have multiple sockets within a single NUMA node.  No doubt there are other models for managing physical memory beyond is basic model, but the idea is the same.

The authors are thus observing that this configuration has become so common that it is now commonly observed on real world systems.  This in turn means that programmers must deal with this.  One way to achieve this is to change the tools so they hide this complexity from the typical programmer; for some this works fine.  For operating systems and high performance systems, such as key-value stores (KVS), it is important to understand these issues in order to optimize their performance.

So these authors look at performance across a “range of message passing and shared-memory designs”.  This sounds like a laudable goal to me.

They set out to define things.  Delegation is where access to a data structure is only performed by a specific set of threads.  If another thread wishes to make a modification to the data structure, it must request that one of the controlling threads do so by sending a request.  A controlling thread for the data structure validated the operation, performs it, and then returns the results to the original requester.  Simple.  In addition, it can be used to simplify the locking model – after all, if a single thread controls the data structure then there cannot be any contention for the data structure and thus there is no locking required.  In addition, if that controlling thread is running on one core and the requester thread is running on a different processor, you can end up with better resource utilization, notably the processor caches.

But the downside to delegation is that you introduce a message passing delay.  If the data structure is hot – because it has a high volume of usage – then the delegation mechanism can become a bottleneck, involving queuing delays.

The authors then turn their attention to message passing overhead.  They evaluate different queuing mechanisms, though their message format remains fairly uniform.  Their queues are the multiple producer, single consumer (MPSCChannel), a per-NUMA node message queue (InletQueue), and a direct access without using compare-and-swap queue (DNCInletQueue).   They carefully describe how each is implemented.  Then they turn to shared memory, where the issue is synchronizing access to the shared memory.  They use several different locking mechanisms: a spin lock, an MCS lock (“ticket lock”), and two variations of a NUMA optimized MCS lock, one that attempts some level of fairness and the other that does not.

Then given these simple mechanisms they evaluate against a concurrent hash map and a concurrent linked list.  They describe their memory allocation strategy (since they are storing key/value data).  The use a two key workloads (“small” and “large”) and then use three operations mixes: a get-only mix, an insert/delete only mix, and a 50/50 mix of read and write operations.

Then they describe their results: small hash maps just don’t benefit from more threads.  Delegation is slower than shared memory.  Simple MCS locks do best under minimal contention.  Interestingly, unfair NUMA aware MCS locks perform best under high contention of the four lock types.  Linked lists is where delegation wins.  The authors argue this is due to better cache locality.  They point out considerable variance in the ticket locks depending upon the workload.

The also evaluated a mixed workload – trying to see the “hot spot” problem in their evaluation. Once again their unfair NUMA aware ticket lock has the best performance, but the authors point out that it introduces substantial delay for some threads.  If you are concerned about bounded latencies, this is not the lock to use.

They had an interesting observation about the impact of hyper-threading on performance (“sibling rivalry”) and end up ensuring that they do not compete for resources on the same core between clients and servers for the delegation case.

They point out that reducing contention for locking is important because it helps minimize the impact on the overall memory bus; thus minimizing cache contention is helpful to the overall system performance, not just the specific applications in question.

Their conclusion?  “[D]elegation can sometimes outperform direct shared-memory approaches”.

We’re going to see some strong arguments for this position in later papers.

 

 

 

NVMCached: An NVM-based Key-Value Cache

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory.  Because the authors do not have NVM available, they focus on simulating it using DRAM.  They simulate behaviors by using appropriate cache flushes and fence.  In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees.  If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this.  One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously.  The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive.  Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event.  For memcached, there are users that are asking for persistence.  Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern.   They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing.  They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots).  To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation.  Memory fragmentation is something we often fix in dynamic memory by rebooting the machine.  However, now we have the storage level problem with fragmentation; fixing this involves write amplification.  They organize their allocation strategy by observing there is considerable locality across the KV store.  They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance.   They also use DRAM to track access information as part of their LRU implementation.  This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM.  Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached.  They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.”  I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event.  They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version.  When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance.  Two of them, SYS and VAR, show substantially better performance.  Both benchmarks are write intensive.  I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

SILT: A Memory-Efficient, High-Performance Key-Value Store

SILT: A Memory-Efficient, High-Performance Key-Value Store
Hyeontaek Lim, Bin Fan, David G. Andersen, Michael Kaminsky, in Proceedings of the 23rd Symposium on Operating Systems Principles (SOSP ’11), October 23-26, 2011, Cascais, Portugal.

In this paper, we have an interesting hybrid key-value store; part of the store is in DRAM, while the rest of it is stored in flash storage.  The authors have focused on optimizing the performance of the store across the specific media chosen, which leads to some interesting choices and observations.

SILT (Small Index Large Table) is a memory-efficient, highperformance
key-value store system based on flash storage that
scales to serve billions of key-value items on a single node. It requires
only 0.7 bytes of DRAM per entry and retrieves key/value
pairs using on average 1.01 flash reads each. SILT combines new
algorithmic and systems techniques to balance the use of memory,
storage, and computation. Our contributions include: (1) the design
of three basic key-value stores each with a different emphasis on
memory-efficiency and write-friendliness; (2) synthesis of the basic
key-value stores to build a SILT key-value store system; and (3) an
analytical model for tuning system parameters carefully to meet the
needs of different workloads. SILT requires one to two orders of
magnitude less memory to provide comparable throughput to current
high-performance key-value systems on a commodity desktop
system with flash storage.

Thus, to achieve optimal performance they construct a hybrid ensemble of key-value stores that are optimized to particular underlying media.

They start early with their performance observations, and continue this pace throughout the paper.  Thus, they point out that they have a small memory requirement (0.7 bytes per key value) and only a single read from flash required:

This fits well with their observation that memory efficiency is critically important for scalable key-value stores.  Their system saturated the capabilities of their hardware with 46,000 lookups per second, though this doesn’t guarantee that it will do so on newer, faster hardware (and we will get to several current KV stores.

 

Figure 2 (from the paper) provides a basic description of the SILT architecture.  From this, we can see the three distinct types of stores that are used within their implementation:

  • LogStore – uses a CPU-efficient cuckoo-hash based only upon a tag derived from the full key (20 bytes).  The benefit of this approach is that it avoids accessing the flash memory if the key is not present.  It is probabilistic, in that a key might not be present in Flash with some low probability.
  • HashStore – uses a memory-efficient storage model for their cuckoo flash table.  These stores are immutable and are stored in flash.
  • SortedStore – They use an entropy-coded trie, which provides a very space efficient indexing mechanism: the average according to the authors is 0.4 bytes per key.  Given that keys are 160 bits (20 bytes), which means the index is 64 bits (8 bytes).   The authors provide us with an impressive list of specific optimizations they have applied to sorting data, packing indexes, intelligent (non-pointer) based addressing,

They also provide us with pretty figures to graphically lay out their data structures.   First, Figure 3, which shows the in-memory cuckoo hash table:

Note the fact the key-value insertions are logged to the flash storage.

When we move onto HashStore, they show us how the structure of the data is moved from insertion order to hash sorted order as the data is moved from memory to flash.

Recall that once written to Flash, the HashStore is immutable – so this order will remain fixed.

Of course, one of the important aspects of a key-value store is the ability to look up the data quickly.  Thus, they also show us how they sort data using a prefix tree (Trie).  This is a key aspect of their SortedStore:

The in-memory Trie, permits them to rapidly find the corresponding data in Flash.  This is one of their key motivations for very efficient encoding of the Trie: it ensures they can store it in memory.  Once they can no longer store the indices in memory the lookup system will become I/O bound (to the Flash, in this case).

Figure 6 then shows how they actually encode the length of the tree in each direction, and how it is entropy encoded:There is some serious thought that has gone into their layout.  They do admit, however, that they are not providing crash tolerance: there will be data that was stored in DRAM that cannot be recovered after a crash.  They do not discuss the use of NVM for achieving this.  Instead, they suggest a synchronous write mechanism in which the write blocks until the log data has been written to the flash-based log.

One of the interesting aspects of their performance evaluation is the recognition that they have quite a few parameters that must be chosen.  Thus, they spend more than a page of the paper just discussing the relative merits of the trade-offs involved in selecting various parameters.  They consider write amplification (the need to write more than just the data being preserved), read amplification (performing reads only to discover the data that you seek is not present), memory overhead (how much DRAM is used), and update rate versus flash life time.  This latter point is an interesting one to keep in mind when reading about non-volatile memory: it wears out.  There is a certain amount of degradation that occurs in the underlying physical medium each time it is erased and re-used.  The specific details are dependent upon the characteristics of the medium, but it is an issue to keep in mind.

The evaluation of SILT is certainly interesting.  They choose not to compare to any other KV system and instead focus on the performance of their system using a variety of fairly standard loads (YCSB) as well as simple get/put key operations as micro-benchmarks.    They end up claiming they they can lookup almost 45k keys per second in the slowest of their hybrid stores.  When combined, they indicate they can achieve almost 58k get operations per second on a single core.  As they move to multiple cores they see some further improvement (180k get operations per second for data that is already in memory).

SILT offers some interesting observations, particularly about highly efficient use of memory and novel indexing schemes.

Cache Craftiness for Fast Multicore Key-Value Storage

Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao, Eddie Kohler, Robert Morris, in Proceedings of the 7th ACM european conference on Computer Systems (Eurosys ’12), pp. 183-196, Bern, Switzerland, April 10 – 13, 2012.

In this work, the authors build a key-value system called MassTree.  This is an in-memory Key-Value store (I will look at several of them, in fact).  What is novel about this particular variant is the authors focus on providing a high-performance parallel access implementation and the mechanisms by which they achieve this.

MassTree support arbitrary-length keys; it assumes nothing about the keys (so they may be binary data).  It uses B+-trees for storage organized into a Trie structure.  The B+-tree stores a fixed slice of the key; the Trie connects the various slices together.  This use is interesting, since most of the cases I’ve seen of Tries are for strings, as they do an excellent job of managing overlapping string data (prefix trees).  The authors use this for binary keys.  There is nothing inherently more difficult about binary versus string keys (since they are equivalent) but this choice makes the solution very flexible, as it is not particularly data dependent.  This basic layout is shown in Figure 1 (from the paper).

One of the challenges with concurrent data structures is how to handle the common case – no collisions – with minimal performance overhead.  The classic mutual exclusion model (or reader/writer locks) involves a certain amount of overhead, even for non-contended locks, because of the need to perform interlocked operations against shared (common) memory where the locks are maintained.  The system implemented by the authors does not require any locking for readers (lookups). Updates are done with locks local to the CPU, which helps minimize the contention of typical locks.

One interesting observation is that their choice of this hybrid Trie/B+-tree structure was motivated by prior systems that struggled with performance in the presence of variable length keys.  In MassTree, the rate limiting factor for queries is the cost of walking the tree.  They minimize this by using “a wide fan-out tree to reduce tree-depth, prefetches nodes from DRAM to overlap fetch latencies, and carefully lays out data in cache lines to reduce the amount of data needed per node.”

Hence, my interest in this paper: these all seem to be important lessons for persistent memory as well, where the latencies are somewhat larger than for DRAM. Further, the authors are concerned about correctness.  They have not looked at persistence (and recoverable consistency), so there is still further work for me to do should I investigate it further.

The authors conclude by claiming the following contributions:

First, an in-memory concurrent tree that supports keys with shared prefixes efficiently.  Second, a set of techniques for laying out the data of each tree node, and accessing it, that reduces the time spent waiting for DRAM while descending the tree.  Third, a demonstration that a single tree shared among multiple cores can provide higher performance than a partitioned design for some workloads.  Fourth, a complete design that addresses all bottlenecks in the way of million-query-per-second performance.

Their emphasis on correctness and cache efficiency is certainly an attractive aspect of this work.

The driving considerations for their design were: efficient support of many different key distributions including binary and variable length keys, with common prefixes; fine grained concurrent access for high performance parallel access, and cache efficiency through optimal data placement and prefetching.

The authors establish the characteristics of their tree structure, including data placement.  These make reasoning about their tree characteristics simpler.  For example, they note that MassTree does not guarantee that it is a balanced structure.  However, due to the way the tree itself is structured, they have the same algorithmic cost: O(l log n) comparisions, where l is the length of the key and n is the depth of the tree.

As a pragmatic check on their implementation, they also use a partial-key B-tree (pkB-Tree) for comparison.  Despite the fact the pkB-Tree is balanced, the authors note that MassTree performs favorably well on several benchmarks.  The authors go into detail about their implementation details, including the construction of border nodes and interior nodes, as well as how they lay out data (again, with an eye towards cache line efficiency).

To achieve this efficiently, they use a versioning scheme.  A node has a version number.  To modify the node, a given processor must update the version to indicate that it is changing the node.  A reader will snapshot the version at the start of the read, and compare it at the end of the read.  If it changed, the reader knows the state may have changed and can retry the read operation (essentially a variant of software transactional memory).  The detailed diagram of this is shown in Figure 3 (from the paper).

The paper describes the concurrency model in the face of conflicting writers as well.  By keeping their lock in the same cache line as their data, they exploit the cache coherence protocol.  The authors note that lock-free operations have comparable cache behavior (e.g., compare-and-swap or link-load-store-conditional).

Indeed, much of the rest of the technical content of the paper is explaining why their approach is correct – an essential point for concurrent access systems.  Without that, there really is not much point!

While brief, their discussion about value storage is interesting: their measurements are done assuming that values will be small.  They state they have a scheme for managing large values as well, via a separate allocator.  While they do not claim this, my observation is that a “real world” system would likely need to have some hybrid form of this.

MassTree is an in-memory key-value store. To provide persistence, they rely upon a write-behind log scheme.  When used via the network interface, the writes are not guaranteed.  To minimize the loss window, they choose a 200 ms timer.  Thus, the log is written to disk every 200 ms.  They do not evaluate this persistence model, offering it to us as an explanation that persistence is not incompatible with performance.

This last point is an interesting one, especially when considered in the context of NVM: what are the trade-offs.  The authors hint at this, but do not explore this space.

For those interested, a version of the source code can be found here: https://github.com/kohler/masstree-beta