Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.
This paper is interesting in its use of a system-level global clock to define a strong ordering of operations across cores. The authors point out that the idea of using timestamps to construct a transactional system is well established. When the goal is to use the timestamp to establish an ordering (e.g., a Lamport clock), it isn’t really so difficult in a single system. After all, you can just have a global counter that is incremented as each transaction proceeds. That defines an ordering of events.
Here’s the problem with this approach: the clock becomes a bottleneck. While we do not usually think of memory operations as a bottleneck, they certainly can be, because a multi-processor computer looks much like a distributed system: the processors exchange messages to provide coherency guarantees. Over time, in the drive for further performance, these systems have relaxed their consistency guarantees. This works because in the common case all of the changes to a given memory location occur on a single processor (in fact, on a single core of a single processor). When a program starts changing a value in memory, its core acquires exclusive control over a small region of memory (a “cache line”) and makes the changes locally, writing them back at some point in the future. One reason to write back is that another processor tries to access something within that same memory region. That is when the messages go flying back and forth and the modified cache line gets transferred, which turns out to be expensive. Access to RAM on my laptop is around 65 nanoseconds, while access to L1 cache takes about the same time as access to a processor register, so it depends upon the clock speed of the CPU. On my laptop (1.9 GHz) a clock cycle is roughly 0.5 nanoseconds, so RAM access is about 130 times slower. On my dual-socket Xeon system, memory access is slower still: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory access (NUMA) behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.
In this paper the authors explore using the CPU-level tick counter as a form of Lamport clock. They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650. The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU. From this they conclude they cannot use it as a clock source, since monotonic increase is the fundamental requirement for such a clock. The Intel chip, on the other hand, has a counter that can be used to construct a global ordering. There are certainly some restrictions on this, but the cited guarantee is that:
“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.”
From this premise, the authors construct a mechanism that exploits the hardware counter in the ownership records of a software transactional memory system. They then convert several existing systems to use their timestamp mechanism and show, on a series of micro-benchmarks, that it substantially outperforms the global-counter mechanism.
They do establish that this solution is not more efficient for all use cases. They specifically point out that privatization-safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, eliminating the very rationale for using the hardware clock semantics.
Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit of implicit privatization safety. Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”
This certainly seems to be a useful tool for speeding up software transactional memory schemes, which come up on a regular basis.