Category Archives: Hardware


Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system-level global clock to define a strong ordering of operations across cores. The authors point out that the idea of using timestamps to construct a transactional system is well established. When the goal is to use the timestamp to establish ordering (e.g., a Lamport clock), it isn’t really so difficult in a single system. After all, you can just have a global counter that is incremented as each transaction proceeds. That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck. While we do not usually think of memory operations as being a bottleneck, they certainly can be. This is because multi-processor computers look much like a distributed system: they exchange messages to provide coherency guarantees. Over time, in the drive for further performance, these systems have relaxed their consistency guarantees. This works because in the common case all of the changes to memory occur on a single processor (in fact, on a single core of a single processor). So when a program starts changing a value in memory, it acquires control over a small region of memory (a “cache line”) and makes the changes on that processor. It writes those changes back at some point in the future; one trigger for that write-back is another processor trying to access something within that same memory region. That is when the messages go flying back and forth and the modified cache line gets written back to main memory. This turns out to be “expensive”. Access to RAM on my laptop is around 65 nanoseconds, while access to L1 cache is about the same speed as access to a processor register, so it depends upon the clock speed of the CPU. On my laptop (1.9GHz) the clock cycle is about 0.5 nanoseconds, so RAM access is roughly 130 times slower. On my dual socket Xeon system, memory access is slower still: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory access, or NUMA, behavior). So in a multi-socket system, the cost of a shared timestamp is even higher.

In this paper the authors explore using the CPU-level tick counter as a form of Lamport clock. They describe the facilities of two processors: the UltraSPARC T2 and the Intel Xeon X5650. The UltraSPARC’s tick counter is not guaranteed to be monotonically increasing, even on a single CPU. From this they conclude that it cannot serve as the source for such a clock, since monotonic increase is the fundamental requirement. The Intel chip, on the other hand, has a cycle counter that can be used to construct a globally consistent ordering. There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.”

From this premise, the authors construct a mechanism that exploits this guarantee in the ownership records of a software transactional memory system. They then convert several existing systems to use their timestamp mechanism and show, on a series of micro-benchmarks, that it substantially outperforms the global counter mechanism.
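
To make that concrete, here is a minimal sketch (mine, not the authors’ code) of taking a commit timestamp from the hardware cycle counter instead of a shared global counter. It assumes GCC or Clang on x86-64; the ownership-record structure and function names are hypothetical.

    #include <stdint.h>
    #include <x86intrin.h>                 /* __rdtscp intrinsic */

    /* Read the time-stamp counter with rdtscp, which waits for prior
     * instructions to complete before sampling the counter. */
    static inline uint64_t hw_timestamp(void)
    {
        unsigned int aux;                  /* receives IA32_TSC_AUX (core/socket id) */
        return __rdtscp(&aux);
    }

    /* Hypothetical ownership record in an STM: a committing writer stamps the
     * orec with the hardware counter, avoiding the contended fetch-and-add on a
     * single global counter that every transaction would otherwise fight over. */
    struct orec { volatile uint64_t version; };

    static void orec_stamp(struct orec *o)
    {
        o->version = hw_timestamp();
    }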

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit of implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which arise on a regular basis.


The Future of Synchronization on Multicores: The Multicore Transformation

The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.

I’m going to round out the week with a much lighter read.  Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.

The editor’s introduction to this piece really does a good job of summing up the problem:

Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.

The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.”  I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”

When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks invite deadlocks, which can be almost impossible to debug without instrumentation. In operating systems, we prevent deadlocks by defining a lock hierarchy. The order in which locks can be acquired forms a graph. To prevent deadlocks, we require that the graph be acyclic. That sounds simple, and for simple code bases, it is. However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear. Cycles can also be introduced because we have multiple discrete components, each doing something locally reasonable, that together create a lock cycle unwittingly.
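
As a concrete illustration (not from the article), one common way to impose such a hierarchy when locks have no natural ordering is to always acquire them in address order, which keeps the acquisition graph acyclic:

    #include <stdint.h>
    #include <pthread.h>

    /* Acquire two mutexes in a globally consistent order (by address) so that
     * two threads locking the same pair can never deadlock.  Comparing object
     * addresses via uintptr_t is a common idiom, not a C-standard guarantee. */
    static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        if (a == b) { pthread_mutex_lock(a); return; }
        if ((uintptr_t)a < (uintptr_t)b) {
            pthread_mutex_lock(a);
            pthread_mutex_lock(b);
        } else {
            pthread_mutex_lock(b);
            pthread_mutex_lock(a);
        }
    }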

The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.”  A common maxim in systems programming is to optimize the common case.  Locks do the opposite: they burden the common case with logic that is normally not useful.

The author also points out that our lock mechanisms do not compose well:  when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state.  I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.

This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity.  It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing.  It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).

One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made.  General transaction systems can be quite complicated: this is a common database approach.

How do we make transactions simple enough to be useful in multicore shared memory environments?

  • Keep them small: they don’t change much state
  • Keep them brief: they either commit or abort quickly
  • Keep them ephemeral: they don’t involve disk I/O; they aren’t related to persistence, they are related to consistency.

One benefit of transactions is that they are composable: they can be nested. Transactions can also avoid issues around priority inversion, convoying, and deadlocks. The author points to other evidence that says they’re easier for programmers and yield better code.

Transactions aren’t new. We’ve been using them for decades. When we use them at disk I/O speeds, we find the overhead is acceptable. When we use them at memory speeds, we find the overhead of transactions is too high to make them practical to implement in software. This gave birth to the idea of hardware transactions. Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively. They don’t suffer from the high overhead of software transactions. The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”. When a cache line cannot remain in the CPU, the transaction is aborted. Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.” As we saw in previous work (again, Exploiting Hardware Transactional Memory in Main-Memory Databases), just retrying the operation once or twice often resolves the fault. But sometimes the operation is just not viable on the system at the present time.
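
The usual shape of that software backup, as I understand it, is a short retry loop around the hardware transaction that gives up and takes a software path (for example, a global lock) after a few aborts. Here is a rough sketch using Intel’s RTM intrinsics; the function names and retry count are my own:

    #include <immintrin.h>                 /* _xbegin, _xend, RTM status bits */

    #define MAX_HTM_RETRIES 3

    /* Run 'work' atomically: try a few times as a hardware transaction, then
     * fall back to a software path (e.g., a global lock or an STM). */
    static void run_atomically(void (*work)(void), void (*software_fallback)(void))
    {
        for (int i = 0; i < MAX_HTM_RETRIES; i++) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                work();
                _xend();                   /* commit: all stores become visible */
                return;
            }
            if (!(status & _XABORT_RETRY)) /* hardware says retrying won't help */
                break;
        }
        software_fallback();               /* e.g., acquire a global lock */
    }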

The author’s summary of the impact of hardware transactions is interesting:

The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-level constructs like mutual exclusion locks, to concurrent data structures such as skip-lists or priority queues, to system-level constructs such as read-copy-update (RCU), all the way to run-time support for high-level language synchronization mechanisms.

So far, this change has not been pervasive.  I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances.   Systems, and software stacks, do change slowly.  Backwards compatibility is a big issue.  As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.

Exploiting Hardware Transactional Memory in Main-Memory Databases

Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, and Thomas Neumann, in 2014 IEEE 30th International Conference on Data Engineering, pp. 580-591.

I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work. By the time this paper was presented, transactional memory had been fairly well explored from a theoretical perspective. Intel hardware with transactional memory support was just starting to emerge around the time this paper was released. I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct, as they released and then pulled support for it in several different CPU releases. Note that HTM itself was not new; it had already been described in the literature, and earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further) were discussing the limitations of HTM back in 2005.

Logically, HTM extends the functionality of the processor cache by tracking what is accessed by the processor (as driven by the program code). Cache lines are read from memory, any changes are made to the cache line, and the line is eventually written back to memory. This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.

The idea behind HTM is that sometimes you want to change more than a single element of memory. For example, you might acquire a mutual exclusion lock, add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the lock. Even if there is no contention for the lock, you will pay the lock cost. If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back. In a surprising number of cases, you need multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).

With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction.  Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system.  The goal then is to avoid doing any memory operations unless absolutely necessary.
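
Here is a minimal sketch of that example using Intel’s RTM intrinsics (my own illustration, not code from the paper). The list, counter, and fallback lock are hypothetical, and a production version needs a more careful fallback path (for instance, the transactional path should read the fallback lock so it aborts if someone already holds it):

    #include <immintrin.h>
    #include <pthread.h>

    struct node { struct node *next; int value; };
    static struct node *head;
    static long count;                     /* number of elements in the list */
    static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

    void list_push(struct node *n)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            n->next = head;                /* both updates are speculative... */
            head = n;
            count++;
            _xend();                       /* ...and become visible atomically here */
            return;
        }
        /* Aborted (conflict, capacity, interrupt): do it under the lock instead. */
        pthread_mutex_lock(&fallback);
        n->next = head;
        head = n;
        count++;
        pthread_mutex_unlock(&fallback);
    }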

The authors point out that the fastest option is partitioning (ignoring hot spots).  They graphically demonstrate this in Figure 1 (from the paper).  HTM has some overhead, but it tracks with partitioning fairly linearly.  This difference is the overhead of HTM.

They compare this to serial execution, which just means performing the operations one at a time. The traditional mechanism for providing this kind of parallelism safely is the two-phase locking protocol. That’s the lock/work/unlock paradigm.

If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work.   Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed.  Instead, we have hot spots.  This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion.  There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.”  Thus, their observation is: “we have to deal with this problem”.

So, how can HTM be exploited to provide robust scalability without partitioning? The authors do a good job of explaining how HTM works on Intel platforms. Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus-snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per-core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics). Level 1 cache is the fastest to access, but the most expensive to provide. Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU. Level 3 cache might be present on the CPU, in which case it is shared among all of the cores. Note that none of this is required; it just happens to be how CPUs are constructed now.

The benefit of HTM then is that it exploits the cache in an interesting new way. Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core. Note, however, that this could mean just the L1 cache. In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper). Transactions below 8KB have a low probability of aborting (and if a transaction aborts, the operation failed, so it must be tried again, either using HTM or the software fallback mechanism). That probability approaches 100% as the size goes above 8KB. Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache. What that means is that the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating). If we need a ninth line that maps to the same location, we must evict one of the existing lines, and that aborts the transaction.
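
To make the associativity point concrete, here is a small sketch (mine, with illustrative parameters for an 8-way set-associative cache) of how an address maps to a cache set; once a transaction needs more lines in one set than there are ways, something must be evicted and the transaction aborts:

    #include <stdint.h>

    #define LINE_OFFSET_BITS 6             /* 64-byte cache lines (illustrative) */
    #define SET_INDEX_BITS   6             /* 64 sets (illustrative) */
    #define WAYS             8             /* 8-way associative, as cited above */

    /* The set index comes from a handful of middle address bits; the exact
     * bit positions depend on the cache geometry of the particular CPU. */
    static unsigned cache_set(uintptr_t addr)
    {
        return (addr >> LINE_OFFSET_BITS) & ((1u << SET_INDEX_BITS) - 1);
    }

    /* A transaction that touches more than WAYS cache lines all mapping to the
     * same set cannot keep them all resident, so it will abort. */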

Similarly, when the duration of the transaction goes up, the probability of it aborting also rises. This is shown in Figure 6 (from the paper). This is because the longer a transaction runs, the greater the chance that some system event will occur and cause it to abort. This includes various types of interrupts: hardware and software.

Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them. 

We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point being that not sharing things is more efficient. The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.

Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.”  They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.
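
A sketch of that elision pattern (my own, with a hypothetical spin lock and update function): inside the hardware transaction the lock word is only read, so the uncontended case never acquires it, yet an actual lock holder forces the transaction to abort and fall back:

    #include <immintrin.h>
    #include <stdatomic.h>

    typedef atomic_int spinlock_t;         /* 0 = free, 1 = held (hypothetical) */

    /* Returns 1 if the update committed transactionally; 0 means the caller
     * should retry or acquire the spin lock for real. */
    static int elided_update(spinlock_t *lock, void (*update)(void))
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
                _xabort(0xff);             /* lock actually held: abort */
            update();                      /* the lock word is now in our read set */
            _xend();
            return 1;
        }
        return 0;
    }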

They put all of this insight together and that forms the basis for their evaluation.  Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.

Figure 12 (from the paper) evaluates the TPC-C benchmark against their resulting system to demonstrate that it scales well. Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation. They address this a bit later in the paper.
Figure 13 (from the paper) compares their performance against a partitioned system. Because cross-partition accesses cannot always be prevented, a system must “live with” the inherent slowdown they cause. One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide constant performance. This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.

Figure 14 (from the paper) attempts to address my comment earlier about Figure 12: they really don’t have a multiprocessor system under evaluation. They admit as much in the paper: the hardware just isn’t available to them. They provide simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.

Figure 15 (from the paper) finally shows how HTM aborts behave under retry: if an aborted transaction is retried, most will complete within one or two tries. Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.

Bottom line: hardware transactional memory can be a key aspect of improving performance in shared memory systems with classical synchronization.

Moving to the now

I’ve been gone for a bit; real life keeping me busy.  Part of that has been exploring the current edge of my technology space.  So I’m going to take a break from the historical paper review (but it will come back) and instead start talking about new storage technologies and the interesting things that we can do with them.

Specifically, I have been looking at the brave new world of “non-volatile dual inline memory modules”.  What this means is that we now have persistent memory that is directly accessed via processor primitives.  That means a load or a store instruction can be used to access this kind of memory.  We have actually been talking about persistent memory since the dawn of time.  Many of the early operating systems papers think of disk storage and memory as being different forms of memory.

The big change came when computers moved away from magnetic core memory to solid state memory. One of Intel’s first products was dynamic RAM (though DRAM itself was invented at IBM). DRAM was cheaper and faster than core – and much faster than magnetic storage. Thus the abstraction of memory and disk as part of the same continuum faded, and we began to think of them as distinct: DRAM was transient, while disks and tapes were persistent (note that magnetic core was persistent even across power cycles, but it lost its state when read and had to be rewritten).

Disk storage was often used to contain data from memory, such as for paging or segmentation.  Over the years the operating systems community learned many ways to make disks seem fast: caching, read-ahead, asynchronous I/O, reordering operations, etc.

Computers – and memory – were able to experience vast increases in speed.  Disk drives did improve, but only modestly in comparison.

Memory technologies have continued to evolve.  Persistent memory options increased as well; flash memory is one of the most common today and is used inside solid state drives and NVMe disks, both of which are vastly faster than rotating media disks.

Another memory technology that has been discussed for more than 20 years is persistent RAM. One way this has been done in “real products” has been to simply add a backup power source: some sort of battery. Then, if power is lost, the data contents of volatile memory (DRAM) can be written to persistent storage (e.g., flash). This approach has been a stop-gap on the way to the actual persistent memory solution that has been promised for at least a decade: persistent memory that acts like DRAM. Intel is now starting to ship their new Optane memory products, Samsung has announced their new Z-NAND products, and other technologies remain “in development” as well.

Why does this matter?  Storage is suddenly getting faster. We went from 10 millisecond access times to 0.2 millisecond access times (HDDs to SSDs). Now we are looking at going from 0.2 milliseconds to 200 nanoseconds – another three orders of magnitude faster. This sort of change is profound. We’ve been talking about it for many years, trying to reason about it. It is now materializing. Over the next several years the other promise of NVM is going to materialize: it supports higher density than DRAM. It is slower than DRAM; where DRAM access times are on the order of 50-100 nanoseconds, NVM access times are on the order of 125-800 nanoseconds (with writes being slower than reads).

File systems that have been optimized for working on hard disks or SSDs don’t necessarily make sense on NVM.

Thus, the area I’ve been looking at: expanding my own understanding of NVM.  Since this is still related to my own exploration of file systems, I’ll use this as my soapbox for exploring the space.

Let’s see where this journey goes.  And I promise, I’ll come back to the old file systems papers after this diversion.

Building the New System

In my prior life I’d had an opportunity to assemble some awesome tools for my work. This included both a great desktop (workstation) system and a good laptop. Even at four years old, the desktop system was terrific: dual Xeon processors with 48GB of RAM (24GB per CPU). Each E5 CPU had 6 cores, with hyperthreading.

I started with a small Intel NUC system – a dual core i7-5557U with 16GB of RAM, a mechanical keyboard, a 4K2K UHD monitor, a gaming mouse, and a Samsung SSD. It performs well and allows me to dual boot Windows 10 (1607) and Linux Mint (based on Ubuntu 16.04). I needed something portable, so I ultimately settled on a Microsoft Surface Book. I’m still getting used to it, but I do find the detachable screen/tablet useful.

As I started my current research project (more on that later), I concluded that I’d like to do some work with TSX, Intel’s new hardware transactional memory support.  I had no problems downloading and running one of the packages that demonstrates the use of TSX and I was pleasantly surprised to find that my Surface Book actually has full TSX support, which is awesome.
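
For what it’s worth, here is the quick check I would use to see whether a CPU reports RTM and HLE (the two parts of TSX). It reads CPUID leaf 7 via GCC/Clang’s cpuid.h helper; treat it as a sketch rather than a definitive feature test:

    #include <stdio.h>
    #include <cpuid.h>                     /* __get_cpuid_count (GCC/Clang) */

    int main(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 7 not supported");
            return 1;
        }
        printf("HLE: %s\n", (ebx & (1u << 4))  ? "yes" : "no");
        printf("RTM: %s\n", (ebx & (1u << 11)) ? "yes" : "no");
        return 0;
    }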

Upon further thought, I concluded that I still needed a capable workstation as well – and unfortunately the NUC system doesn’t have TSX support. So I looked at a variety of options, along with other potential needs for my research, and concluded that what made the most sense was to build a new workstation. It took me a couple of days of looking at various options to settle upon my current solution: I’m building another workstation.

The base of my workstation is an Asus Z10PE-W8 motherboard.  This is a dual socket Intel 2011-3 motherboard.  I’ll be using dual E5-2603 v4 CPUs (6 cores each, no hyperthreading).  The work I’m doing isn’t CPU bound.  Rather, for TSX testing/work I want a machine that will exhibit real memory contention.  This is a NUMA architecture system, so each CPU will have its own dedicated memory channel.  Hyperthreading is actually bad for TSX (since the threads share CPU resources) though I suppose there may be failure modes I won’t see (but the Surface Book has SMT, so I can test it out there as well).   These are the low end CPUs, but they have TSX-NI support (according to the Intel ARK page, let’s hope it’s right).  I’ve paired it up with 64GB of RAM (2x32GB) so I’ll have plenty of memory and plenty of options for expanding the memory in the future.  While I’d like to add NVDIMMs as well, that’s outside my budget at the moment.  The option is still there, though…

For storage, I’ll put a SanDisk 1TB M.2 SSD in the system. As far as I’m concerned, SSD is the best choice for storage on a high performance workstation these days, though there are faster options, albeit at a much higher price point. Rounding that out with an AMD FirePro W4300 workstation graphics card (four DisplayPort outputs!) and a Phanteks case (this is an SSI EEB motherboard, so there aren’t a huge number of case options!) plus some Noctua CPU fans, and I have a solid workstation. Ironically, it’s really just an updated version of what I built four years ago.

I’ve settled on a Das Keyboard Model S – a nice mechanical keyboard with excellent reviews.  I’m definitely hard on my keyboards.  I’ll use my existing Acer monitor for now, though I’m shopping now for a second monitor, as I know from prior experience that multiple monitors are great for productivity.

Parts started arriving today.  Once I have everything, I’ll put it all together and make sure that it all works.  The case will be mostly empty for the time being – the case is huge because the motherboard is itself huge, but I’m not going to populate it with storage yet.  I’m a fan of the HGST Helium drives – for mechanical drives they are pretty amazing but for what I’ll be doing in the near-term I don’t expect to need a big storage device.  Still, I can see how it will be useful moving forward.

I look forward to reporting on how it all comes together.