Home » Conferences Journals and Workshops

Category Archives: Conferences Journals and Workshops

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
December 2024
S M T W T F S
1234567
891011121314
15161718192021
22232425262728
293031  

Where has the time gone?

It’s been more than a year since I last posted; it’s not that I haven’t been busy, but rather that I’ve been trying to do too many things and have been (more slowly than I’d like) cutting back on some of my activities.

Still, I miss using this as a (one way) discussion about my own work. In the past year I’ve managed to publish one new (short) paper, though the amount of work that I put into it was substantial (it was just published in Computer Architecture Letters). This short article (letter) journal normally provides at most one revise and resubmit opportunity, but they gave me two such opportunities, then accepted the paper, albeit begrudgingly over the objections of Reviewer # 2 (who agreed to accept it, but didn’t change their comments).

Despite the lack of clear publications to demonstrate forward progress, I’ve been working on a couple of projects to push them along. Both were presented, in some form, at Eurosys as posters.

Since I got back from a three month stint at Microsoft Research (in the UK) I’ve been working on one of those, evolving the idea of kernel bypasses and really analyzing why we keep doing these things; this time through the lens of building user mode file systems. I really should write more about it, since that’s on the drawing board for submission this fall.

The second idea is one that stemmed from my attendance at SOSP 2019. There were three papers that spoke directly to file systems:

Each of these had important insights into the crossover between file systems and persistent memory. One of the struggles I had with that short paper was explaining to people “why file systems are necessary for using persistent memory”. I was still able to capture some of what I’d learned, but a fair bit of it was sacrificed to adding background information.

One key observation was around the size of memory pages and their impact on performance; it convinced me that we’d benefit from using ever larger page sizes for PMEM. Some of this is because persistent memory is, well, persistent and thus we don’t need to “load the contents from storage”. Instead, it is storage. So, we’re off testing out some ideas in this area to see if we can contribute some additional insight.

The other area – the one that I have been ignoring too long – is the thesis of this PhD work in the first place. Part of the challenge is to reduce the problem down to something that is tractable and can be finished in a reasonable amount of time.

Memex

One of the questions (and the one I wanted to explore when I started writing this) is a rather famous article from 1945 entitled As We May Think. Vannevar Bush described something quite understandable, yet we have not achieve this, though we have been trying – one could argue that hypertext stems from these ideas, but I would argue that hypertext links are a pale imitation of the rich assistive model Bush lays out when he describes the Memex.

Thus, to the question, which I will reserve for another day: why have we not achieved this yet? What prevents us from having this, or something better, and how can I move us towards this goal?

I suspect, but am not certain, that one culprit may be the fact we decided to stick with an existing and well-understood model of organization:

Maybe the model is wrong when the data doesn’t fit?

To FUSE or Not to FUSE: Performance of User-Space File Systems

To FUSE or Not to FUSE: Performance of User-Space File Systems
Bharath Kumar Reddy Vangoor, Vasily Tarasov, and Erez Zadok,
in The 15th USENIX Conference on File and Storage Technologies (FAST ’17),
February 27 – March 2, 2017, Santa Clara, CA, USA.

Previously, I discussed some of the rationale behind FUSE and a basic introduction to why we use it. This paper is actually a detailed analysis of the performance of FUSE. I have spent a fair bit of time reading – and re-reading – this paper, in order to understand what some of the bottlenecks are within FUSE itself. One area I have previously explored are possible ways of improving its performance; indeed, this is one of my current projects.

Figure 1 from original Vangoor Paper

Figure 1 in the paper is actually a basic block diagram providing an interface model for how FUSE fits into the file systems layer. FUSE consists of:

  • The kernel mode FUSE driver; on Linux this is part of the kernel; and
  • The user mode FUSE library. This handles interactions between the FUSE file system driver and the user mode file system process (referred to as a “daemon” or “background process”) in the diagram.
  • The user mode FUSE file system itself – this is the code that implements the FUSE library interface (one of them, since there are two different interfaces).

Applications can then access this FUSE file system without knowing any details of the implementation.

FUSE kernel driver queuing model.

In Figure 2 from the paper, the authors show the internal structure of how the FUSE kernel driver manages various internal data structures that handle events between the user mode file system and the kernel mode file system support. This includes the messages between the kernel and user mode application (requests and their corresponding replies) as well as synchronous and asynchronous file system operations and the cache invalidation mechanisms (“forgets”).

Caching is an essential part of the system because it allows the system to quickly respond to repetitive events. The downside to caches are that they can become invalid because the underlying state changes. Thus, there is a mechanism for invalidating the cache itself.

The author describes the model within the FUSE library and the two interfaces it exposes: the low level API, which provides greater control to the user mode file system, at the cost of more state management – specifically the mapping of a path name to the inode (index node).

The authors describe how they constructed a new FUSE file system, StackFS, for evaluating the behavior of the system. StackFS sits on top of an existing file system and attempts to minimize the amount of mapping it performs, since the goal of the authors is to evaluate the performance of FUSE itself.

Table 3 from the paper

The authors summarize their findings in Table 3, using both a hard disk and an SSD, the two most common types of storage media used on modern computer systems.

These results were both surprising and interesting to me because some of them surprised me. The performance bottlenecks were not for I/O as much as they were for meta-data operations. Random write performance (which is what we see with databases, for example) is not ideal, but their optimizations did a good job of addressing this, bringing the I/O overhead of the FUSE model down substantially relative to the native file system.

Bottom line: the challenge in improving FUSE performance now moves squarely into the arena of meta-data operations. Creating and deleting files is quite expensive in FUSE.

The authors conclude by pointing out that there is further room for improvement; they suggest some potential future directions. I have been looking at ways to improve this as well and I will discuss those in a future post.

GFS: a Graph-based File System Enhanced with Semantic Features

GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.

In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.

I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected and one reviewer specifically cited to this paper (as well as the QDMS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing this from our work.

The abstract does a good job of explaining how this work is different than what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files rather than just characteristics of the files themselves.

The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”

Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.

“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”

In my own way, I have been looking at this question for quite some time. Over a year ago I was working on finding a way in which I could support both classic file system interfaces as well as augmenting them with new features without requiring invasive operating systems level changes. While I expect that ultimately a successful demonstration of new interfaces will lead to OS level changes, it makes more sense to explore what interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux as well as FUSE for Windows, and quite a few other OS platforms too) and the other is a message passing mechanism that directly routes requests from the application to the user mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!

So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.

One of the strengths of this paper is the emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Trying to argue that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user which means helping them in the way they use the system now. I suspect the ideal will be to enhance both the current way, as well as provide better search tools; in other words navigation and search are not mutually exclusive approaches to the problem.

The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system; they separate the semantic spaces from the hierarchical spaces, though. My concern is that this creates the semantic spaces as second class citizens (though, this system pushes them to the front of the bus). One thing that surprised me is their comment about how they returned semantic information before regular directory information. In my experience, application programs sort the results of directory enumerations and do not rely upon the order in which entries are returned.

The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of pushing multiple tags at once, which avoids repeated calls into the file systems interface. Copy needs to be optimized as well to deal with the inherent non-atomic nature of the beast. Rename and unlink also have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as being a real-world counter-example to the idea that you need path based security to work properly; while NTFS supports path-based security, the OS default is to grant traverse right to all users on the system. POSIX compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior between POSIX and the extension. That’s certainly a useful lesson.

In the end, this paper focuses on using tags for their files and creating namespace extensions that identify the files. It is a short (4 page) paper, and there is no evaluation of what they constructed or how effective it was. It presents one point in the design space and it is certainly a useful paper to consider as I design my own point in the design space.


Collaboration versus Cheating

Collaboration Versus Cheating: Reducing Code Plagiarism in an Online MS Computer Science Program Tony Mason, Ada Gavrilovska, and David A. Joyner, Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pp. 1004-1010, Minneapolis, MN, February 27 – March 2, 2019.

This is a rather different paper than my usual but I thought I would write about it since I was the one that drove this particular work. I first started as a teaching assistant for the Graduate Introduction to Operating Systems in Georgia Tech’s Online Master of Science in Computer Science program. One of my motivations at the time was to really learn more about how online education could be realized. The program’s goals were to achieve scale (which it has) and to maintain a quality level that was non-inferior to their online MSc program offering. I achieved many of my goals here, as I learned how to improve the pedagogical goals (by improving feedback to students, and helping them understand the topics) of the class while also finding ways to ensure academic honesty.


Figure 1: Detected plagiarism rates over time

I started working with the class in Fall 2015; at that point we had no proactive plagiarism detection in place, though we were collecting similarity data. It wasn’t until 2016 that I started looking at that data and began to realize we had a noticeable plagiarism rate. We tried a number of techniques to try and reduce this. We explained what plagiarism was, we offered “amnesties” if people that cheated would come forward and admit it. Neither worked. Amnesty actually lead to cases where students we did not think had plagiarized were admitting to things we considered to be fair use (e.g., using existing code to set up socket connections).

Figure 1 shows the progress over time, and clearly delineates when our intervention strategies were successful. What it does not capture is the qualitative difference. By Spring 2017, I was observing students wholesale submitting other students’ projects and claiming it as their own! It was that semester when I prioritized “early intervention” in which I quickly identified students suspected of cheating and confronted them. The plagiarism rate for subsequent projects dropped precipitously.

The downside to this approach is that it took a tremendous amount of work to do this, which makes it impractical because it does not scale. Effective is great, but scalability was also a requirement if it was going to be useful. In the summer of 2017 we introduced one new mechanism to the class: we added a mandatory quiz that went over the policy and asked students if they understood the course policies defining collaboration and cheating, as well as understanding why this was important and the potential penalties.

The results surprised me: the plagiarism rate plummeted. I was cautiously optimistic though because the Summer 2017 class was the smallest we’d ever had. Thus, I suspected it might simply be an anomaly. But this encouraged me to begin making the evaluation more rigorous. In the Fall of 2018 I spent time automating the process of doing code comparisons not only with other students in the same class, but also comparing submissions across all prior submissions. My motivation for doing this work was to validate what I thought I was seeing.

Thus I performed retrospective analysis on prior semesters in which the course was offered. I used that data to write a paper for Learning at Scale 2018 but that paper was rejected. I used the feedback from that to revise my draft and submit an update to it to ICER 2018 but that was also rejected. I once again dug deep and revised the paper and submitted it to SIGCSE 2019, where it was accepted (and astoundingly, all the reviews were accepts!) I revised the final paper to address the concerns of some reviewers (the weak accepts) and that is what was included at SIGCSE.

This doesn’t move my research forward per-se, but it helped me understand how incredibly important data visualization is – like Figure 1. I also learned how important it is to construct reproducible data analysis flows. The scripts I wrote helped me build a suite of tools for improving reproducibility. The research I did helped me better understand why the primary automated tool we use works (MOSS) and its limitations. When I was done I realized that I could easily continue exploring this research area, but since it isn’t core to my focus, I’ll leave it to others to push the frontiers.

Still it was an interesting project and helped confirm my interest in doing research.

I have a public repository with further information (beyond what is in my paper) on GitHub. It includes a copy of the slides from my presentation in March at SIGCSE (in which I think I had around 60 people, which was far more than I’d expected given that it was the last half-day of the conference!)



Eurosys 2019

I attended Eurosys 2019 last month and in fact just returned from that trip, as I added three weeks of vacation to the end of it, though the first week had me spending most of the time in a hotel room working on a paper for submission.

I attended the doctoral workshop at Eurosys to pitch my idea for an associative file system. I received useful feedback from both the paper I submitted as well as my presentation. My poster for the doctoral workshop attempted to capture a non-textual perspective on the problem and the approach I am seeking to achieve




I did achieve my goal of ensuring there was minimal text in the poster, though I’m not sure I quite hit the right balance with it.

I also presented a second poster based upon a paper we had submitted around the same idea. This was a very different realization of the same basic concepts.

As I promised in my earlier post, I will be discussing my forward moving file systems idea(s) in more detail as I move along through this project.


Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores.  The authors point out that the idea of using timestamps for constructing a transactional system.  When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system.  After all, you can just have a global counter that is incremented as each transaction proceeds.  That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck.  While we do not usually think of memory operations as being a bottleneck, they certainly can be.  This is because multi-processor computers look much like a distributed system.  They exchange messages to provide coherency guarantees.  Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees.  This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor).  So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor.   It writes those changes back at some point in the future.  One reason is because another processor tries to access something within that same memory region.  This is where the messages go flying back and forth, the modified cache line gets written back to main memory.  This turns out to be “expensive”.  Access to RAM on my laptop is around 65 nanoseconds.  Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU.  On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds.  So it is 130 times slower.  On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock.  They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650.  The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU.  From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock.  The Intel chip, on the other hand, has a clock that can be used to construct global atomicity.  There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system.   They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.


NVMCached: An NVM-based Key-Value Cache

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory.  Because the authors do not have NVM available, they focus on simulating it using DRAM.  They simulate behaviors by using appropriate cache flushes and fence.  In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees.  If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this.  One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously.  The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive.  Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event.  For memcached, there are users that are asking for persistence.  Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern.   They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing.  They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots).  To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation.  Memory fragmentation is something we often fix in dynamic memory by rebooting the machine.  However, now we have the storage level problem with fragmentation; fixing this involves write amplification.  They organize their allocation strategy by observing there is considerable locality across the KV store.  They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance.   They also use DRAM to track access information as part of their LRU implementation.  This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM.  Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached.  They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.”  I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event.  They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version.  When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance.  Two of them, SYS and VAR, show substantially better performance.  Both benchmarks are write intensive.  I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

The Future of Synchronization on Multicores: The Mulitcore Transformation

The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.

I’m going to round out the week with a much lighter read.  Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.

The editor’s introduction to this piece really does a good job of summing up the problem:

Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.

The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.”  I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”

When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks create deadlocks, which can be almost impossible to debug without instrumentation.  In operating systems, we prevent deadlocks by defining a lock hierarchy.  The order in which locks can be acquired forms a graph.  To prevent deadlocks, we require that the graph be acyclic.  That sounds simple and for simple code bases, it is.  However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear.  Cycles can be introduced because we have multiple discreet components, each doing something logical, that creates a lock cycle unwittingly.

The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.”  A common maxim in systems programming is to optimize the common case.  Locks do the opposite: they burden the common case with logic that is normally not useful.

The author also points out that our lock mechanisms do not compose well:  when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state.  I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.

This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity.  It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing.  It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).

One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made.  General transaction systems can be quite complicated: this is a common database approach.

How do we make transactions simple enough to be useful in multicore shared memory environments?

  • Keep them small: they don’t change much state
  • Keep them brief: they either commit or abort quickly
  • Keep them ephemeral: they don’t involve disk I/O, they aren’t related to persistence they are related to consistency.

One benefit of transactions is they are composable: they can be nested.  Transactions can avoid issues around priority inversion, convying, and deadlocks.  The author points to other evidence that says they’re easier for programmers and yield better code.

Transactions aren’t new.  We’ve been using them for decades.  When we use them at disk I/O speeds, we find the overhead is acceptable.  When we use them at memory speeds we find the overhead of transactions is too high to make them practical to do in software.  This gave birth to the idea of hardware transactions.  Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively.  They don’t suffer from the high overhead of software transactions.  The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”.   When a cache line cannot remain in the CPU, the transaction is aborted.  Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.”  As we saw in previous work (again Exploiting Hardware Transactional Memory in Main-Memory Databasesjust retrying the operation once or twice often resolve the fault.  But sometimes the operation is just not viable on the system at the present time.

The author’s summary of the impact of hardware transactions is interesting:

The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-­level constructs like mutual exclusion locks, to concurrent data structures such as skip-­‐lists or priority queues, to system-­‐level constructs such as read-­‐copy-­‐update (RCU), all the way to run-­time support for high-­‐level language synchronization mechanisms.

So far, this change has not been pervasive.  I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances.   Systems, and software stacks, do change slowly.  Backwards compatibility is a big issue.  As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.

Exploiting Hardware Transactional Memory in Main-Memory Databases

Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineeringpp 580-591.

I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work.  By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective.  Intel hardware with transactional memory support was just starting to emerge around the time this paper was released.  I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases.  Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).

Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code).  Cache lines are read from memory, any changed made to the cache line, and then written back to memory.  This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.

The idea behind HTM is that sometimes you want to change more than a single element of memory.  For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion.  Even if there is no contention for the lock, you will pay the lock cost.  If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back.  In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).

With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction.  Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system.  The goal then is to avoid doing any memory operations unless absolutely necessary.

The authors point out that the fastest option is partitioning (ignoring hot spots).  They graphically demonstrate this in Figure 1 (from the paper).  HTM has some overhead, but it tracks with partitioning fairly linearly.  This difference is the overhead of HTM.

They compare this to serial execution, which just means performing them one at a time.  The traditional mechanism for doing this kind of paralleism is the two phase commit protocol.  That’s the lock/work/unlock paradigm.

If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work.   Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed.  Instead, we have hot spots.  This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion.  There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.”  Thus, their observation is: “we have to deal with this problem”.

So, how can HTM be exploited to provide robust scalability without partitioning.  The authors do a good job of explaining how HTM works on Intel platforms.  Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.)  Level 1 cache is the fastest to access, but the most expensive to provide.  Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU.  Level 3 cache might be present on the CPU, in which case it is shared between all three cores.  Note that none of this is required.  It just happens to be how CPUs are constructed now.

The benefit of HTM then is that it exploits the cache in an interesting new way.  Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core.  Note, however, that this could mean just the L1 cache.  In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper).  Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software).  That probability approaches 100% as the size goes above above 8KB.  Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache.  What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating).  If we need to use a ninth we evict one of the existing pages in the cache.

Similarly, when the duration of the transaction goes up, the probability of it aborting also rises.  This is shown in Figure 6 (from the paper).  This is because the chance that various systems events will occur, which cause the transaction to abort.  This includes various types of interrupts: hardware and software.

Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them. 

We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient.   The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.

Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.”  They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.

They put all of this insight together and that forms the basis for their evaluation.  Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.

Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well .  Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation.  They address this a bit later in the paper.

 

Figure 13 (from the paper) compares their performance against a partitioned system.  Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown.  One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance.   This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.

Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation.  They admit as much in the paper: the hardware just isn’t available to them.  They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.

Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations.  If an HTM is retried, many will complete with one or two tries.  Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.

Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.