Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineering, pp 580-591.
I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work. By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective. Intel hardware with transactional memory support was just starting to emerge around the time this paper was released. I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases. Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).
Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code). Cache lines are read from memory, any changed made to the cache line, and then written back to memory. This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.
The idea behind HTM is that sometimes you want to change more than a single element of memory. For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion. Even if there is no contention for the lock, you will pay the lock cost. If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back. In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).
With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction. Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system. The goal then is to avoid doing any memory operations unless absolutely necessary.
The authors point out that the fastest option is partitioning (ignoring hot spots). They graphically demonstrate this in Figure 1 (from the paper). HTM has some overhead, but it tracks with partitioning fairly linearly. This difference is the overhead of HTM.
They compare this to serial execution, which just means performing them one at a time. The traditional mechanism for doing this kind of paralleism is the two phase commit protocol. That’s the lock/work/unlock paradigm.
If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work. Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed. Instead, we have hot spots. This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion. There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.” Thus, their observation is: “we have to deal with this problem”.
So, how can HTM be exploited to provide robust scalability without partitioning. The authors do a good job of explaining how HTM works on Intel platforms. Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.) Level 1 cache is the fastest to access, but the most expensive to provide. Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU. Level 3 cache might be present on the CPU, in which case it is shared between all three cores. Note that none of this is required. It just happens to be how CPUs are constructed now.
The benefit of HTM then is that it exploits the cache in an interesting new way. Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core. Note, however, that this could mean just the L1 cache. In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper). Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software). That probability approaches 100% as the size goes above above 8KB. Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache. What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating). If we need to use a ninth we evict one of the existing pages in the cache.
Similarly, when the duration of the transaction goes up, the probability of it aborting also rises. This is shown in Figure 6 (from the paper). This is because the chance that various systems events will occur, which cause the transaction to abort. This includes various types of interrupts: hardware and software.
We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient. The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.
Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.” They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.
They put all of this insight together and that forms the basis for their evaluation. Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.
Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well . Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation. They address this a bit later in the paper.
Figure 13 (from the paper) compares their performance against a partitioned system. Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown. One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance. This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.
Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation. They admit as much in the paper: the hardware just isn’t available to them. They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.
Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations. If an HTM is retried, many will complete with one or two tries. Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.
Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.