Home » Posts tagged 'Consistency'

Tag Archives: Consistency

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
January 2025
S M T W T F S
 1234
567891011
12131415161718
19202122232425
262728293031  

NVMCached: An NVM-based Key-Value Cache

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory.  Because the authors do not have NVM available, they focus on simulating it using DRAM.  They simulate behaviors by using appropriate cache flushes and fence.  In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees.  If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this.  One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously.  The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive.  Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event.  For memcached, there are users that are asking for persistence.  Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern.   They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing.  They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots).  To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation.  Memory fragmentation is something we often fix in dynamic memory by rebooting the machine.  However, now we have the storage level problem with fragmentation; fixing this involves write amplification.  They organize their allocation strategy by observing there is considerable locality across the KV store.  They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance.   They also use DRAM to track access information as part of their LRU implementation.  This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM.  Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached.  They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.”  I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event.  They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version.  When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance.  Two of them, SYS and VAR, show substantially better performance.  Both benchmarks are write intensive.  I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

Logic and Lattices for Distributed Programming

Logic and Lattices for Distributed Programming
Neil Conway, William R. Marczak, Peter Alvaro, Joseph M. Hellerstein, and David Maier, in Symposium on Cloud Computing 2012 (SOCC ’12), October 14-17, 2012, San Jose, CA.

This is definitely a different direction than we’ve had in prior papers, though I do have an ulterior motive in presenting this particular paper – we will see it used later.

In recent years there has been interest in achieving application-level
consistency criteria without the latency and availability costs of
strongly consistent storage infrastructure. A standard technique is to
adopt a vocabulary of commutative operations; this avoids the risk
of inconsistency due to message reordering. Another approach was
recently captured by the CALM theorem, which proves that logically
monotonic programs are guaranteed to be eventually consistent. In
logic languages such as Bloom, CALM analysis can automatically
verify that programs achieve consistency without coordination.
In this paper we present BloomL, an extension to Bloom that
takes inspiration from both of these traditions. BloomL generalizes
Bloom to support lattices and extends the power of CALM analysis
to whole programs containing arbitrary lattices. We show how the
Bloom interpreter can be generalized to support ecient evaluation
of lattice-based code using well-known strategies from logic
programming. Finally, we use BloomL to develop several practical
distributed programs, including a key-value store similar to Amazon
Dynamo, and show how BloomL encourages the safe composition
of small, easy-to-analyze lattices into larger programs.

Notice they do mention key-value stores, so you have another hint on how I’ll be referring back to this work in a future post.

This tends more to the theoretical side of systems.  It is not a theory paper (there just isn’t enough formalism, let alone proofs!)   It has performance graphs, which you certainly expect from a systems paper, but not from a theory paper.

The driving factor behind this is the issue of distributed consistency.  At a high level, “distributed consistency” is concerned with ensuring that a group of communicating computers, with some temporal separation, agree on the outcome of operations even when things go wrong.  Perhaps the most famous example of distributed consistency is Paxos.  These days we refer to these as consensus protocols.  I generally describe there being several such: two-phase commit is certainly one of the older ones.  Quorum protocols are another (e.g., weighted voting, which I described previously).  Viewstamped Replication is another.  These days, the popular consensus protocols are
Raft and Blockchain.

Figure 8 (From Paper)

The paper starts by pointing out that monotonic consisency provides a valuable mechanism for reasoning about distributed consistency.  Prior work by the authors establishes that all monotonic programs are “invariant to message reordering and retry”, a property they call confluent. This matters for distributed systems because such a system only moves forward (the operations are durable.)

They point out some weaknesses in the prior definition and motivate improving it by explaining one such obvious case that does not fit within the model (a voting quorum in a distributed protocol.)

Hence, they introduce the lattice.  They do this within the context of their language (BloomL), which works on top of Ruby.  I will not dwell on the details.

The authors define a bounded semijoined lattice.  My reading of what they are saying is that in such a set, there is a unique element that happened first.  They define this formally as a set S, with an operator (“bottom” that I don’t seem to have in my font set) that defines a partial ordering.  There is a unique element ⊥ that represents the least element.

From this definition, they construct their model; the paper drops the “bounded semijoined” part of the definition and simply discusses lattices from that point forward, but it is this partial ordering property that imparts the key characteristics to their subsequent operations.

Why is this important?  Because it demonstrates that these lattices – which are going to turn out to be equivalent to key operations in distributed systems – have consistency guarantees that are desirable.

The authors then turn their attention to utilizing lattices for key-value stores.  They describe data structure versioning and vector clocks.  Vector clocks have a property they desire for lattices: they are partially ordered.  They combine this with a quorum voting protocol, to provide the distributed consensus for their system.

Figure 9 (from paper)
Figure 9 (from paper)

Figure 8 (from the paper) shows the general structure of their key-value store implementation, which is implemented in BloomL and Ruby.  Their sample usage for this is a shopping cart, which they graphically describe in Figure 9 (from the paper).

As one would expect in a distributed system, the key benefit here is that there is no centralized authority deciding on the order of things.  They point out that prior work argues shopping carts are non-monotonic and thus cannot be solved in a distributed systems setting.  The authors point out that using the lattice structure, they achieve a monotonic ordering, which permits them to implement it without a centralized decision maker; in fact the decision maker in this case is really the client itself, as it has all the information from all the servers sufficient to complete the operation.

While a shopping cart might not be the killer application for a distributed systems technology, this paper does describe a powerful tool for providing distributed consensus in a system that can be implemented in a modest amount of code; compared to Paxos, Raft, or Viewstamped Replication, that is a significant contribution.

It does not appear to have byzantine protection, however, so if you live in a hostile environment it might not be the right protocol.  Similarly, if you need stronger consistency guarantees, this might not be the best model either.  But for many applications slightly relaxed consistency guarantees are often more than adequate.

We will see how this can be applied in the future.