Home » Hardware
Category Archives: Hardware
The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.
I’m going to round out the week with a much lighter read. Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.
The editor’s introduction to this piece really does a good job of summing up the problem:
Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.
The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.” I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”
When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks create deadlocks, which can be almost impossible to debug without instrumentation. In operating systems, we prevent deadlocks by defining a lock hierarchy. The order in which locks can be acquired forms a graph. To prevent deadlocks, we require that the graph be acyclic. That sounds simple and for simple code bases, it is. However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear. Cycles can be introduced because we have multiple discreet components, each doing something logical, that creates a lock cycle unwittingly.
The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.” A common maxim in systems programming is to optimize the common case. Locks do the opposite: they burden the common case with logic that is normally not useful.
The author also points out that our lock mechanisms do not compose well: when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state. I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.
This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity. It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing. It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).
One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made. General transaction systems can be quite complicated: this is a common database approach.
How do we make transactions simple enough to be useful in multicore shared memory environments?
- Keep them small: they don’t change much state
- Keep them brief: they either commit or abort quickly
- Keep them ephemeral: they don’t involve disk I/O, they aren’t related to persistence they are related to consistency.
One benefit of transactions is they are composable: they can be nested. Transactions can avoid issues around priority inversion, convying, and deadlocks. The author points to other evidence that says they’re easier for programmers and yield better code.
Transactions aren’t new. We’ve been using them for decades. When we use them at disk I/O speeds, we find the overhead is acceptable. When we use them at memory speeds we find the overhead of transactions is too high to make them practical to do in software. This gave birth to the idea of hardware transactions. Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively. They don’t suffer from the high overhead of software transactions. The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”. When a cache line cannot remain in the CPU, the transaction is aborted. Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.” As we saw in previous work (again Exploiting Hardware Transactional Memory in Main-Memory Databases) just retrying the operation once or twice often resolve the fault. But sometimes the operation is just not viable on the system at the present time.
The author’s summary of the impact of hardware transactions is interesting:
The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-level constructs like mutual exclusion locks, to concurrent data structures such as skip-‐lists or priority queues, to system-‐level constructs such as read-‐copy-‐update (RCU), all the way to run-time support for high-‐level language synchronization mechanisms.
So far, this change has not been pervasive. I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances. Systems, and software stacks, do change slowly. Backwards compatibility is a big issue. As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.
Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineering, pp 580-591.
I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work. By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective. Intel hardware with transactional memory support was just starting to emerge around the time this paper was released. I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases. Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).
Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code). Cache lines are read from memory, any changed made to the cache line, and then written back to memory. This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.
The idea behind HTM is that sometimes you want to change more than a single element of memory. For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion. Even if there is no contention for the lock, you will pay the lock cost. If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back. In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).
With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction. Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system. The goal then is to avoid doing any memory operations unless absolutely necessary.
The authors point out that the fastest option is partitioning (ignoring hot spots). They graphically demonstrate this in Figure 1 (from the paper). HTM has some overhead, but it tracks with partitioning fairly linearly. This difference is the overhead of HTM.
They compare this to serial execution, which just means performing them one at a time. The traditional mechanism for doing this kind of paralleism is the two phase commit protocol. That’s the lock/work/unlock paradigm.
If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work. Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed. Instead, we have hot spots. This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion. There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.” Thus, their observation is: “we have to deal with this problem”.
So, how can HTM be exploited to provide robust scalability without partitioning. The authors do a good job of explaining how HTM works on Intel platforms. Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.) Level 1 cache is the fastest to access, but the most expensive to provide. Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU. Level 3 cache might be present on the CPU, in which case it is shared between all three cores. Note that none of this is required. It just happens to be how CPUs are constructed now.
The benefit of HTM then is that it exploits the cache in an interesting new way. Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core. Note, however, that this could mean just the L1 cache. In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper). Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software). That probability approaches 100% as the size goes above above 8KB. Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache. What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating). If we need to use a ninth we evict one of the existing pages in the cache.
Similarly, when the duration of the transaction goes up, the probability of it aborting also rises. This is shown in Figure 6 (from the paper). This is because the chance that various systems events will occur, which cause the transaction to abort. This includes various types of interrupts: hardware and software.
We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient. The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.
Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.” They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.
They put all of this insight together and that forms the basis for their evaluation. Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.
Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well . Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation. They address this a bit later in the paper.
Figure 13 (from the paper) compares their performance against a partitioned system. Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown. One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance. This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.
Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation. They admit as much in the paper: the hardware just isn’t available to them. They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.
Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations. If an HTM is retried, many will complete with one or two tries. Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.
Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.
In my prior life I’d had an opportunity to assemble some awesome tools for my work. This included both a great desktop (workstation) system as well as a good laptop system as well. Even at four years old, the desktop system was terrific: dual Xeon processors with 48GB of RAM (24GB each CPU). Each E5 CPU had 6 cores, with hyperthreading.
I started with a small Intel NUC system – dual core i7-5557U with 16GB of RAM, a mechanical keyboard, a 4K2K UHD monitor and gaming mouse, with a Samsung SSD. Good performance and it allows me to dual boot Windows 10 (1607) and Linux Mint (Ubuntu 16.04). I needed something portable so I ultimately settled on a Microsoft Surface Book. I’m still getting used to it, but I do find the detachable screen/tablet useful.
As I started my current research project (more on that later), I concluded that I’d like to do some work with TSX, Intel’s new hardware transactional memory support. I had no problems downloading and running one of the packages that demonstrates the use of TSX and I was pleasantly surprised to find that my Surface Book actually has full TSX support, which is awesome.
Upon further thought, I concluded that I still needed to have my workstation capable as well – and unfortuantely the NUC system doesn’t have TSX support. So I looked at a variety of options, along with other potential needs for my research and concluded that what made the most sense was to build a new workstation. It took me a couple of days looking at various options to settle upon my current solution: I’m building another workstation.
The base of my workstation is an Asus Z10PE-W8 motherboard. This is a dual socket Intel 2011-3 motherboard. I’ll be using dual E5-2603 v4 CPUs (6 cores each, no hyperthreading). The work I’m doing isn’t CPU bound. Rather, for TSX testing/work I want a machine that will exhibit real memory contention. This is a NUMA architecture system, so each CPU will have its own dedicated memory channel. Hyperthreading is actually bad for TSX (since the threads share CPU resources) though I suppose there may be failure modes I won’t see (but the Surface Book has SMT, so I can test it out there as well). These are the low end CPUs, but they have TSX-NI support (according to the Intel ARK page, let’s hope it’s right). I’ve paired it up with 64GB of RAM (2x32GB) so I’ll have plenty of memory and plenty of options for expanding the memory in the future. While I’d like to add NVDIMMs as well, that’s outside my budget at the moment. The option is still there, though…
For storage, I’ll put a SANDisk 1TB m.2 SSD on the system. As far as I’m concerned, SSD is the best choice for storage on a high performance workstation these days, though there are faster options albeit at a much higher price point. Rounding that out with an AMD FirePro W4300 workstation graphics card (four DPI ports!) and a Phanteks case (this is an SSI EEB motherboard, so there aren’t a huge number of case options!) plus some Noctua CPU fans and I have a solid workstation. Ironically, it’s really just an updated version of what I built four years ago.
I’ve settled on a Das Keyboard Model S – a nice mechanical keyboard with excellent reviews. I’m definitely hard on my keyboards. I’ll use my existing Acer monitor for now, though I’m shopping now for a second monitor, as I know from prior experience that multiple monitors are great for productivity.
Parts started arriving today. Once I have everything, I’ll put it all together and make sure that it all works. The case will be mostly empty for the time being – the case is huge because the motherboard is itself huge, but I’m not going to populate it with storage yet. I’m a fan of the HGST Helium drives – for mechanical drives they are pretty amazing but for what I’ll be doing in the near-term I don’t expect to need a big storage device. Still, I can see how it will be useful moving forward.
Look forward to reporting on how it all comes together.