Home » Memory

Category Archives: Memory

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 204 other subscribers
April 2024
S M T W T F S
 123456
78910111213
14151617181920
21222324252627
282930  

Where has the time gone?

It’s been more than a year since I last posted; it’s not that I haven’t been busy, but rather that I’ve been trying to do too many things and have been (more slowly than I’d like) cutting back on some of my activities.

Still, I miss using this as a (one way) discussion about my own work. In the past year I’ve managed to publish one new (short) paper, though the amount of work that I put into it was substantial (it was just published in Computer Architecture Letters). This short article (letter) journal normally provides at most one revise and resubmit opportunity, but they gave me two such opportunities, then accepted the paper, albeit begrudgingly over the objections of Reviewer # 2 (who agreed to accept it, but didn’t change their comments).

Despite the lack of clear publications to demonstrate forward progress, I’ve been working on a couple of projects to push them along. Both were presented, in some form, at Eurosys as posters.

Since I got back from a three month stint at Microsoft Research (in the UK) I’ve been working on one of those, evolving the idea of kernel bypasses and really analyzing why we keep doing these things; this time through the lens of building user mode file systems. I really should write more about it, since that’s on the drawing board for submission this fall.

The second idea is one that stemmed from my attendance at SOSP 2019. There were three papers that spoke directly to file systems:

Each of these had important insights into the crossover between file systems and persistent memory. One of the struggles I had with that short paper was explaining to people “why file systems are necessary for using persistent memory”. I was still able to capture some of what I’d learned, but a fair bit of it was sacrificed to adding background information.

One key observation was around the size of memory pages and their impact on performance; it convinced me that we’d benefit from using ever larger page sizes for PMEM. Some of this is because persistent memory is, well, persistent and thus we don’t need to “load the contents from storage”. Instead, it is storage. So, we’re off testing out some ideas in this area to see if we can contribute some additional insight.

The other area – the one that I have been ignoring too long – is the thesis of this PhD work in the first place. Part of the challenge is to reduce the problem down to something that is tractable and can be finished in a reasonable amount of time.

Memex

One of the questions (and the one I wanted to explore when I started writing this) is a rather famous article from 1945 entitled As We May Think. Vannevar Bush described something quite understandable, yet we have not achieve this, though we have been trying – one could argue that hypertext stems from these ideas, but I would argue that hypertext links are a pale imitation of the rich assistive model Bush lays out when he describes the Memex.

Thus, to the question, which I will reserve for another day: why have we not achieved this yet? What prevents us from having this, or something better, and how can I move us towards this goal?

I suspect, but am not certain, that one culprit may be the fact we decided to stick with an existing and well-understood model of organization:

Maybe the model is wrong when the data doesn’t fit?

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores.  The authors point out that the idea of using timestamps for constructing a transactional system.  When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system.  After all, you can just have a global counter that is incremented as each transaction proceeds.  That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck.  While we do not usually think of memory operations as being a bottleneck, they certainly can be.  This is because multi-processor computers look much like a distributed system.  They exchange messages to provide coherency guarantees.  Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees.  This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor).  So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor.   It writes those changes back at some point in the future.  One reason is because another processor tries to access something within that same memory region.  This is where the messages go flying back and forth, the modified cache line gets written back to main memory.  This turns out to be “expensive”.  Access to RAM on my laptop is around 65 nanoseconds.  Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU.  On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds.  So it is 130 times slower.  On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock.  They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650.  The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU.  From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock.  The Intel chip, on the other hand, has a clock that can be used to construct global atomicity.  There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system.   They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.


Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores

Message Passing or Shared Memory: Evaluating the Delegation Abstraction for Multicores
Irina Calciu, Dave Dice, Tim Harris, Maurice Herlihy, Alex Kogan, Virendra Marathe, and Mark Moir, in International Conference on Principles of Distributed Systems, pp. 83-97, 2013, Springer.

The issue of isolation versus sharing is one that permeates systems design for the past 50 plus years. Isolation makes things clean and easy to reason about, but suffers from some disadvantages, such as increased resource utilization and increased latency costs for communications. Shared memory is actually much more difficult to program due to the need to coordinate activity to a shared resource (the memory) potentially across processors. In this paper the authors explore this space. Their rationale is simple: shared memory with lots of processors is
hard.

Even for small multi-core systems, it has become harder and harder to support a simple shared memory abstraction: processors access some memory regions more quickly than others, a phenomenon called non-uniform memory access (NUMA). These trends have prompted researchers to investigate alternative programming abstractions based on message passing rather than cache-coherent shared memory. To advance a pragmatic understanding of these models’ strengths and weaknesses, we have explored a range of different message passing and shared memory designs, for a variety of concurrent data structures, running on different multicore architectures. Our goal was to evaluate which combinations perform best, and where simple software or hardware optimizations might have the most impact. We observe that different approaches per-form best in different circumstances, and that the communication over-head of message passing can often outweigh its benefits. Nonetheless, we discuss ways in which this balance may shift in the future. Overall, we conclude that, by emphasizing high-level shared data abstractions, software should be designed to be largely independent of the choice of low-level communication mechanism.

Non-uniform memory access (NUMA) was originally constructed as a scaling mechanism for “large scale systems”.  Today, it is generally present on any multi-processor system.  What this means is that memory is divided between the processors.  Some of the memory is bound to the processor and then the rest of the memory is bound to other processors.  Thus, memory is either local or remote, with a corresponding cost differential to access it.

Each processor usually has multiple cores, with each of those cores making up one element of the node.  Some computers may have multiple sockets within a single NUMA node.  No doubt there are other models for managing physical memory beyond is basic model, but the idea is the same.

The authors are thus observing that this configuration has become so common that it is now commonly observed on real world systems.  This in turn means that programmers must deal with this.  One way to achieve this is to change the tools so they hide this complexity from the typical programmer; for some this works fine.  For operating systems and high performance systems, such as key-value stores (KVS), it is important to understand these issues in order to optimize their performance.

So these authors look at performance across a “range of message passing and shared-memory designs”.  This sounds like a laudable goal to me.

They set out to define things.  Delegation is where access to a data structure is only performed by a specific set of threads.  If another thread wishes to make a modification to the data structure, it must request that one of the controlling threads do so by sending a request.  A controlling thread for the data structure validated the operation, performs it, and then returns the results to the original requester.  Simple.  In addition, it can be used to simplify the locking model – after all, if a single thread controls the data structure then there cannot be any contention for the data structure and thus there is no locking required.  In addition, if that controlling thread is running on one core and the requester thread is running on a different processor, you can end up with better resource utilization, notably the processor caches.

But the downside to delegation is that you introduce a message passing delay.  If the data structure is hot – because it has a high volume of usage – then the delegation mechanism can become a bottleneck, involving queuing delays.

The authors then turn their attention to message passing overhead.  They evaluate different queuing mechanisms, though their message format remains fairly uniform.  Their queues are the multiple producer, single consumer (MPSCChannel), a per-NUMA node message queue (InletQueue), and a direct access without using compare-and-swap queue (DNCInletQueue).   They carefully describe how each is implemented.  Then they turn to shared memory, where the issue is synchronizing access to the shared memory.  They use several different locking mechanisms: a spin lock, an MCS lock (“ticket lock”), and two variations of a NUMA optimized MCS lock, one that attempts some level of fairness and the other that does not.

Then given these simple mechanisms they evaluate against a concurrent hash map and a concurrent linked list.  They describe their memory allocation strategy (since they are storing key/value data).  The use a two key workloads (“small” and “large”) and then use three operations mixes: a get-only mix, an insert/delete only mix, and a 50/50 mix of read and write operations.

Then they describe their results: small hash maps just don’t benefit from more threads.  Delegation is slower than shared memory.  Simple MCS locks do best under minimal contention.  Interestingly, unfair NUMA aware MCS locks perform best under high contention of the four lock types.  Linked lists is where delegation wins.  The authors argue this is due to better cache locality.  They point out considerable variance in the ticket locks depending upon the workload.

The also evaluated a mixed workload – trying to see the “hot spot” problem in their evaluation. Once again their unfair NUMA aware ticket lock has the best performance, but the authors point out that it introduces substantial delay for some threads.  If you are concerned about bounded latencies, this is not the lock to use.

They had an interesting observation about the impact of hyper-threading on performance (“sibling rivalry”) and end up ensuring that they do not compete for resources on the same core between clients and servers for the delegation case.

They point out that reducing contention for locking is important because it helps minimize the impact on the overall memory bus; thus minimizing cache contention is helpful to the overall system performance, not just the specific applications in question.

Their conclusion?  “[D]elegation can sometimes outperform direct shared-memory approaches”.

We’re going to see some strong arguments for this position in later papers.

 

 

 

NVMCached: An NVM-based Key-Value Cache

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory.  Because the authors do not have NVM available, they focus on simulating it using DRAM.  They simulate behaviors by using appropriate cache flushes and fence.  In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees.  If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this.  One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously.  The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive.  Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event.  For memcached, there are users that are asking for persistence.  Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern.   They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing.  They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots).  To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation.  Memory fragmentation is something we often fix in dynamic memory by rebooting the machine.  However, now we have the storage level problem with fragmentation; fixing this involves write amplification.  They organize their allocation strategy by observing there is considerable locality across the KV store.  They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance.   They also use DRAM to track access information as part of their LRU implementation.  This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM.  Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached.  They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.”  I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event.  They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version.  When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance.  Two of them, SYS and VAR, show substantially better performance.  Both benchmarks are write intensive.  I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

The Future of Synchronization on Multicores: The Mulitcore Transformation

The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.

I’m going to round out the week with a much lighter read.  Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.

The editor’s introduction to this piece really does a good job of summing up the problem:

Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.

The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.”  I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”

When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks create deadlocks, which can be almost impossible to debug without instrumentation.  In operating systems, we prevent deadlocks by defining a lock hierarchy.  The order in which locks can be acquired forms a graph.  To prevent deadlocks, we require that the graph be acyclic.  That sounds simple and for simple code bases, it is.  However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear.  Cycles can be introduced because we have multiple discreet components, each doing something logical, that creates a lock cycle unwittingly.

The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.”  A common maxim in systems programming is to optimize the common case.  Locks do the opposite: they burden the common case with logic that is normally not useful.

The author also points out that our lock mechanisms do not compose well:  when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state.  I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.

This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity.  It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing.  It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).

One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made.  General transaction systems can be quite complicated: this is a common database approach.

How do we make transactions simple enough to be useful in multicore shared memory environments?

  • Keep them small: they don’t change much state
  • Keep them brief: they either commit or abort quickly
  • Keep them ephemeral: they don’t involve disk I/O, they aren’t related to persistence they are related to consistency.

One benefit of transactions is they are composable: they can be nested.  Transactions can avoid issues around priority inversion, convying, and deadlocks.  The author points to other evidence that says they’re easier for programmers and yield better code.

Transactions aren’t new.  We’ve been using them for decades.  When we use them at disk I/O speeds, we find the overhead is acceptable.  When we use them at memory speeds we find the overhead of transactions is too high to make them practical to do in software.  This gave birth to the idea of hardware transactions.  Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively.  They don’t suffer from the high overhead of software transactions.  The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”.   When a cache line cannot remain in the CPU, the transaction is aborted.  Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.”  As we saw in previous work (again Exploiting Hardware Transactional Memory in Main-Memory Databasesjust retrying the operation once or twice often resolve the fault.  But sometimes the operation is just not viable on the system at the present time.

The author’s summary of the impact of hardware transactions is interesting:

The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-­level constructs like mutual exclusion locks, to concurrent data structures such as skip-­‐lists or priority queues, to system-­‐level constructs such as read-­‐copy-­‐update (RCU), all the way to run-­time support for high-­‐level language synchronization mechanisms.

So far, this change has not been pervasive.  I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances.   Systems, and software stacks, do change slowly.  Backwards compatibility is a big issue.  As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.

Exploiting Hardware Transactional Memory in Main-Memory Databases

Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineeringpp 580-591.

I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work.  By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective.  Intel hardware with transactional memory support was just starting to emerge around the time this paper was released.  I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases.  Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).

Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code).  Cache lines are read from memory, any changed made to the cache line, and then written back to memory.  This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.

The idea behind HTM is that sometimes you want to change more than a single element of memory.  For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion.  Even if there is no contention for the lock, you will pay the lock cost.  If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back.  In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).

With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction.  Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system.  The goal then is to avoid doing any memory operations unless absolutely necessary.

The authors point out that the fastest option is partitioning (ignoring hot spots).  They graphically demonstrate this in Figure 1 (from the paper).  HTM has some overhead, but it tracks with partitioning fairly linearly.  This difference is the overhead of HTM.

They compare this to serial execution, which just means performing them one at a time.  The traditional mechanism for doing this kind of paralleism is the two phase commit protocol.  That’s the lock/work/unlock paradigm.

If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work.   Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed.  Instead, we have hot spots.  This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion.  There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.”  Thus, their observation is: “we have to deal with this problem”.

So, how can HTM be exploited to provide robust scalability without partitioning.  The authors do a good job of explaining how HTM works on Intel platforms.  Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.)  Level 1 cache is the fastest to access, but the most expensive to provide.  Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU.  Level 3 cache might be present on the CPU, in which case it is shared between all three cores.  Note that none of this is required.  It just happens to be how CPUs are constructed now.

The benefit of HTM then is that it exploits the cache in an interesting new way.  Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core.  Note, however, that this could mean just the L1 cache.  In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper).  Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software).  That probability approaches 100% as the size goes above above 8KB.  Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache.  What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating).  If we need to use a ninth we evict one of the existing pages in the cache.

Similarly, when the duration of the transaction goes up, the probability of it aborting also rises.  This is shown in Figure 6 (from the paper).  This is because the chance that various systems events will occur, which cause the transaction to abort.  This includes various types of interrupts: hardware and software.

Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them. 

We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient.   The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.

Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.”  They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.

They put all of this insight together and that forms the basis for their evaluation.  Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.

Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well .  Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation.  They address this a bit later in the paper.

 

Figure 13 (from the paper) compares their performance against a partitioned system.  Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown.  One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance.   This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.

Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation.  They admit as much in the paper: the hardware just isn’t available to them.  They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.

Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations.  If an HTM is retried, many will complete with one or two tries.  Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.

Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.

 

 

 


 

Cache Craftiness for Fast Multicore Key-Value Storage

Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao, Eddie Kohler, Robert Morris, in Proceedings of the 7th ACM european conference on Computer Systems (Eurosys ’12), pp. 183-196, Bern, Switzerland, April 10 – 13, 2012.

In this work, the authors build a key-value system called MassTree.  This is an in-memory Key-Value store (I will look at several of them, in fact).  What is novel about this particular variant is the authors focus on providing a high-performance parallel access implementation and the mechanisms by which they achieve this.

MassTree support arbitrary-length keys; it assumes nothing about the keys (so they may be binary data).  It uses B+-trees for storage organized into a Trie structure.  The B+-tree stores a fixed slice of the key; the Trie connects the various slices together.  This use is interesting, since most of the cases I’ve seen of Tries are for strings, as they do an excellent job of managing overlapping string data (prefix trees).  The authors use this for binary keys.  There is nothing inherently more difficult about binary versus string keys (since they are equivalent) but this choice makes the solution very flexible, as it is not particularly data dependent.  This basic layout is shown in Figure 1 (from the paper).

One of the challenges with concurrent data structures is how to handle the common case – no collisions – with minimal performance overhead.  The classic mutual exclusion model (or reader/writer locks) involves a certain amount of overhead, even for non-contended locks, because of the need to perform interlocked operations against shared (common) memory where the locks are maintained.  The system implemented by the authors does not require any locking for readers (lookups). Updates are done with locks local to the CPU, which helps minimize the contention of typical locks.

One interesting observation is that their choice of this hybrid Trie/B+-tree structure was motivated by prior systems that struggled with performance in the presence of variable length keys.  In MassTree, the rate limiting factor for queries is the cost of walking the tree.  They minimize this by using “a wide fan-out tree to reduce tree-depth, prefetches nodes from DRAM to overlap fetch latencies, and carefully lays out data in cache lines to reduce the amount of data needed per node.”

Hence, my interest in this paper: these all seem to be important lessons for persistent memory as well, where the latencies are somewhat larger than for DRAM. Further, the authors are concerned about correctness.  They have not looked at persistence (and recoverable consistency), so there is still further work for me to do should I investigate it further.

The authors conclude by claiming the following contributions:

First, an in-memory concurrent tree that supports keys with shared prefixes efficiently.  Second, a set of techniques for laying out the data of each tree node, and accessing it, that reduces the time spent waiting for DRAM while descending the tree.  Third, a demonstration that a single tree shared among multiple cores can provide higher performance than a partitioned design for some workloads.  Fourth, a complete design that addresses all bottlenecks in the way of million-query-per-second performance.

Their emphasis on correctness and cache efficiency is certainly an attractive aspect of this work.

The driving considerations for their design were: efficient support of many different key distributions including binary and variable length keys, with common prefixes; fine grained concurrent access for high performance parallel access, and cache efficiency through optimal data placement and prefetching.

The authors establish the characteristics of their tree structure, including data placement.  These make reasoning about their tree characteristics simpler.  For example, they note that MassTree does not guarantee that it is a balanced structure.  However, due to the way the tree itself is structured, they have the same algorithmic cost: O(l log n) comparisions, where l is the length of the key and n is the depth of the tree.

As a pragmatic check on their implementation, they also use a partial-key B-tree (pkB-Tree) for comparison.  Despite the fact the pkB-Tree is balanced, the authors note that MassTree performs favorably well on several benchmarks.  The authors go into detail about their implementation details, including the construction of border nodes and interior nodes, as well as how they lay out data (again, with an eye towards cache line efficiency).

To achieve this efficiently, they use a versioning scheme.  A node has a version number.  To modify the node, a given processor must update the version to indicate that it is changing the node.  A reader will snapshot the version at the start of the read, and compare it at the end of the read.  If it changed, the reader knows the state may have changed and can retry the read operation (essentially a variant of software transactional memory).  The detailed diagram of this is shown in Figure 3 (from the paper).

The paper describes the concurrency model in the face of conflicting writers as well.  By keeping their lock in the same cache line as their data, they exploit the cache coherence protocol.  The authors note that lock-free operations have comparable cache behavior (e.g., compare-and-swap or link-load-store-conditional).

Indeed, much of the rest of the technical content of the paper is explaining why their approach is correct – an essential point for concurrent access systems.  Without that, there really is not much point!

While brief, their discussion about value storage is interesting: their measurements are done assuming that values will be small.  They state they have a scheme for managing large values as well, via a separate allocator.  While they do not claim this, my observation is that a “real world” system would likely need to have some hybrid form of this.

MassTree is an in-memory key-value store. To provide persistence, they rely upon a write-behind log scheme.  When used via the network interface, the writes are not guaranteed.  To minimize the loss window, they choose a 200 ms timer.  Thus, the log is written to disk every 200 ms.  They do not evaluate this persistence model, offering it to us as an explanation that persistence is not incompatible with performance.

This last point is an interesting one, especially when considered in the context of NVM: what are the trade-offs.  The authors hint at this, but do not explore this space.

For those interested, a version of the source code can be found here: https://github.com/kohler/masstree-beta

 

MRAMFS: A compressing file system for non-volatile RAM

MRAMFS: A compressing file system for non-volatile RAM
Nathan K. Edel, Deepa Tuteja, Ethan L. Miller, and Scott A. Brandt in Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2004), Volendam, Netherlands, October 2004.

This paper allows me to provide both a file systems paper and look at an interesting approach to byte-addressable non-volatile memory (NVM).

We have developed a prototype in-memory file system which utilizes data compression on inodes, and which has preliminary support for compression of file blocks.  Our file system, mramfs, is also based on data structures tuned for storage efficiency in non-volatile memory.

One of the interesting aspects of NVM is that it has characteristics of storage (persistence) and memory (byte-addressability).  Storage people are used to having vast amounts of time to do things: it is quite difficult, though not impossible, to do anything computationally with data that will be an important factor when it is combined with the overhead of I/O latency to disk drives.  In-memory algorithms worry about optimal cache line usage and efficient usage of the processor, but they don’t need to worry about what happens when the power goes off.

Bringing these two things together requires re-thinking things.  NVM isn’t as fast as DRAM.  Storage people aren’t used to worrying about CPU cache effects on data resilience.

So mramfs looks at this from a very file systems centric perspective: how do we exploit this nifty new memory to build a new kind of RAM disk: it’s still RAM but now it’s persistent.   NVRAM is slower than DIMM and hence it makes sense to compress it to increase the effective data transfer rate (though it is not clear if that really will be the case.)

I didn’t find a strong motivation for compression, though I can see the viability of it now, in a world in which we want to pack as much as we can into a 64 byte cache line.  The authors point out that one of the previous systems (Conquest) settled on a 53 byte inode size. The authors studied existing systems and found they could actually compress down to 20 bytes (or less) for a single inode.   They achieved this using a combination of gamma compression and compressing common file patterns (mode, uid, and gid).  Another reason for this approach is they did not wish to burden their file system with a computationally expensive compression scheme.

MRAMFS Figure 1In Figure 1 (from the paper) the authors provide a graphic description of their data structures.  This depicts a fairly traditional UNIX style file system, with an inode table, name space (directories), references from directory entries to the inodes.  Inodes then point to control structures that eventually map to the actual data blocks.

The actual memory is managed by the file system from a single chunk of non-volatile memory; the memory is virtually addressed and the paper points out that they don’t actually care how that mapping is achieved.

Multiple inodes are allocated together in inode blocks with each block consisting of 16 (variable length) inodes.  The minimum size of a block is 256 bytes. inodes are rewritten in place whenever possible, which can lead to slack space.  If an inode doesn’t fit within its existing space, the entire block is reconstructed and then written to a new block.  Aftewards, the block pointer is changed to point to the new block.  Then the old block is freed.

One thing that is missing from this is much reasoning about crash consistency, which surprised me.

The authors have an extensive evaluation section, comparing to ext2fs, ramfs, and jffs2 (all over RAM disk).  Their test was a create/unlink micro-benchmark, thus optimizing the meta-data insertion/deletion case.  They then questioned their entire testing mechanism by pointing out that the time was also comparable to what they achieved using tmpfs building the openssl package from source.  Their final evaluation was done without the compression code enabled (“[U]nfortuantely, the data compression code is not yet reliable enough to complete significant runs of Postmark or of large builds…”).  They said they were getting about 20-25% of the speed without compression.

Despite this finding, their conclusion was “We have shown that both metadata and file data blocks are highly compressisble with little increase in code complexity.  By using tuned compression techniques, we can save more than 60% of the inode space required by previous NVRAM file systems, and with little impact on performance.”

My take-away?  This was an early implementation of a file system on NVM.  It demonstrates one of the risks of thinking too much in file systems terms.  We’ll definitely have to do better.

NV-Heaps: Making Persistent Objects Fast and Safe with Next-Generation, Non-Volatile Memories

NV-Heaps: Making Persistent Objects Fast and Safe with Next-Generation, Non-Volatile Memories
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, Steven Swanson
in ASPLOS XVI Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, Pages 105-118, March 5-11, 2011.

This paper was presented at the same conference as Mnemosyne.  The authors explore a different use of Non-Volatile Memory (NVM): using it for storing persistent objects.  The authors sum up the motivation for this:

Creating these data structures requires a system that is lightweight enough to expose the performance of the underlying memories but also ensures safety in the presence of application and system failures by avoiding familiar bugs such as dangling pointers, multiple free()s, and locking errors. In addition, the system must prevent new types of hard-to-find pointer safety bugs that only arise with persistent objects. These bugs are especially dangerous since any corruption they cause will be permanent.

Thus, their motivation is to enable the use of these nifty “persistent, user-defined objects” that are not practical when backed by disks (“[T]he slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance.”)

The authors make some important observations that are just as applicable today as they were in 2011.  These include the observation that persistent objects in NVM cannot reasonably be treated like disk based objects “… because the gap between memory and storage performance drove many design decisions that shaped them.”  Nor can they be treated like volatile memory: “To guarantee consistency and durability, non-volatile structures must meet a host of challenges, many of which do not exist for volatile memories.”

They also observe that NVMs greatly expand the possibility of bug sources from having disparate address spaces.  In other words, while you have a single address space, part of it is ephemeral and if you store a reference to the ephemeral part in the persistent part, it will be inconsistent after the current process terminates.

I found their observation about the ability of programmers to reason about this also apropos: “[t]rusting the average programmer
to “get it right” in meeting these challenges is both unreasonable…”  This is consistent with more than 50 years of experience in systems.  Personally, I don’t think this is an indictment of the programmer so much as it is a burden on the system (a perspective the authors appear to endorse as well).  To make this viable, we need to make it easy to get it right.

Figure 1 shows the general architecture of NV-Heaps: It is envisioned as a library of useful services layered on top of the operating system provided abstractions.  One important observation here is that this model completely avoids the need to interact with the operating system in ordinary program execution.  Persistence no longer relies upon utilizing the standard file systems interface.

The authors’ explanation of their goals looks like a veritable “wish list” to me: prevent programmer errors, transactions, referential integrity, performance and scalability, and ease of use.  I’m not sure how referential integrity is different than programmer errors, but clearly it is a very important aspect of their persistent system.

Figure 3 shows how they handle one of the complex consistency cases inherent in managing NVM: the need to ensure that operations can be safely restarted.  For example, when deleting a large data structure, such as a tree, it must be removed in a way that it can be stopped and restarted (e.g., if the system were to crash, it must then be able to resume removal).  To resume after a crash, they use a log of operations and replay it – a classic solution to the problem.

To make their goal of referential integrity work properly they utilize the programming language constructs to do this.  The authors note they achieve this by using 128 bit pointer values (on a 64 bit system).

The paper describes their implementation in considerable detail.  Again, as we would expect, the implementation yields substantially better performance than comparable systems backed by disks – this really shouldn’t come as a surprise, given the performance differential between disks and non-volatile memory.  Even if they had used solid state disks (which existed but were rare in 2011) their results would have still be notably better.  Figure 8 shows their performance information, comparing themselves against several other systems.  One thing to note: they do not have NVM memory.  They use a memory simulator to model the behavior of the system.  The performance figures they provide surprised me: they are substantially faster than I would have expected.  For PCM, they used a 67 nano-second (ns) read time and 215 ns write time.  The paper explains how they obtained these values and how they validated them.  For STTM (a different NVM technology) they reported 29 ns read and 95 ns write.  As a baseline, their DRAM read time was 25 ns, and write time was 35 ns.

While these numbers were lower than I would have expected, the relative ratio is close to what I expected from other things that I have read: PCM memory is about 2.5 times slower for reads, and 10 times slower for writes.  This is consistent with what the paper reports.  I guess it’s time to update my mental “Jeff Dean” numbers.  And indeed, it turns out that DRAM latency is around 15 ns.

The authors were able to modify memcached to use their library for persistence.  They report that they were able to get within 8% of the original memcached.  That seems like an excellent outcome.

All we need now are NVMs.

Mnemosyne: Lightweight Persistent Memory

Mnemosyne: Lightweight Persistent Memory
Haris Volos, Andres Jaan Tack, Michael M. Swift, ASPLOS ’11 March 5-11, 2011.

The abstract starts us off in this brave new world:

New storage-class memory (SCM) technologies, such as phase-change memory, STT-RAM, and memristors, promise user-levelvaccess to non-volatile storage through regular memory instructions. These memory devices enable fast user-mode access to persistence, allowing regular in-memory data structures to survive system crashes.

So faster, doesn’t require privilege, works like memory, and persistent.  Pretty fancy stuff.

File systems aren’t really constructed to have direct access to the disk from user applications.  Generally it is done via an I/O interface: open, close, read, and write.  But memory isn’t accessed in that fashion at all.  So, how does this affect things?  What do the services look like?  What does it mean to take something everyone thinks of as transient and make it persistent?

Let’s start exploring!

Mnemosyne provides an explicit mechanism for exposing persistent memory to applications.  This is done by extending the programming tools so they can declare something should be stored in persistent memory, or so that it can be dynamically allocated with the proviso that it be allocated from this persistent memory.

Thus, the default is that an existing application retains the same behavior – it does not use persistent memory.  If an application wishes to use persistent memory it must be modified to do so.  Mnemosyne will provide a basic service level, but it won’t change the behavior of existing applications (technical debt really does follow us around in this business).

It’s impressive: “… Mnemosyne can persist data as fast as 3 microseconds.”  It makes existing applications modified to use it much faster. Figure 1 (from the paper) describes the architecture the authors created for Mnemosyne.

Mnemosyne ArchitectureThis architecture envisions the persistent memory being exposed to the application through a persistence interface; the motivation for this is that merely having persistent memory is not enough.  It requires additional work to ensure that it is crash resistant.  In other words, the system can restore the state of the contents in memory to some well-defined consistent state.

This is something file systems routinely handle – the issues of persistence and recoverability.  I often try to think about failure: how does failure manifest?  How do I know that I can recover the state to a consistent spot and then proceed?

This is an uncommon concept for most application developers: they don’t need to worry about the contents of memory being “consistent” in the face of crashes because when the application crashes, the memory is lost.

Mnemosyne provides a model of consistency for applications by creating an explicit mechanism for providing crash consistence.  Note that Mnemosyne won’t define those consistent states – the application must define what it means for its data structures to be consistent.  What Mnemosyne offers are certain guarantees about the contents of memory.

The authors’ decision to virtualize their SCM is an interesting one: “[V]irtualization prevents a memory leak in one program from monopolizing a finite amount of SCM.”  Thus, they stage SCM content to disk between processes.  Consistency of data is provided by “ordering writes”.  The authors identify four consistency mechanisms:

  • Atomic variable update – update the data in place as a single all-or-nothing operation.
  • Append updates – data is not written in place, but rather a new copy is written, such as it might be to the end of a log (such updates are ordered).
  • Shadow updates –  data is written to a new location and once done, the pointer to the old copy is updated to point to the new copy (e.g., via an atomic variable update).  The authors point out there is a potential leak here that must be handled properly.
  • In-place updates – used for data structures that can be modified in place; provided the operations are ordered.

Consistency guarantees for persistent memory are accomplished using processor semantics and mechanisms:

  1. A write through operation (e.g., a temporal move) that is written directly to memory.
  2. Memory fences that ensure strict ordering of operations before the fence relative to operations after the fence.
  3. Cache line flushes.  The CPU stores memory inside the processor while it is acting upon it.  In fact, a modern CPU has multiple levels of memory.  The most expensive (and smallest) will be the Level 1 cache.  It’s also the fastest.  L2 cache is larger and slower than L1 cache.  L3 cache is typically shared with all CPUs on the processor; it is the largest and slowest of the caches.

For storage people, some of this is familiar and some of it is different – instead of worrying about storage stack semantics we’re now worrying about processor cache semantics.  One upside is that processor semantics are more rigidly enforced than storage semantics (e.g., disk drives that lie and say that the data has been written when it hasn’t.)  One downside is that it’s a new failure domain.  For anyone used to working with persistent storage, understanding the failure domain is vital.  I suspect it is also different for people used to thinking about the processor perspective, since persistence isn’t usually something you have to reason about.

Mnemosyne implemented a persistent heap allocator, a modified version of Intel’s STM Compiler (we’ll see later that others had to move that work to other compilers because it is now abandoned), a logging mechanism, a persistent region mechanism, a transactional system (based upon TinySTM).

Their results are, of course, good.  After all, if they had not been good, they wouldn’t have been published. They outperform BerkeleyDB (for some metrics). They demonstrated a fast and persistent red-black tree implementation.  They show the benefits of asynchronous truncation.

Mnemosyne was a useful contribution because it was an early exploration into considering how we should use byte-addressable non-volatile memory.   The library they built is used in future work as well, and this is a heavily cited paper.