Home » Conferences Journals and Workshops

Category Archives: Conferences Journals and Workshops

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 532 other subscribers

October 2018
S M T W T F S
« May    
 123456
78910111213
14151617181920
21222324252627
28293031  

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores.  The authors point out that the idea of using timestamps for constructing a transactional system.  When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system.  After all, you can just have a global counter that is incremented as each transaction proceeds.  That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck.  While we do not usually think of memory operations as being a bottleneck, they certainly can be.  This is because multi-processor computers look much like a distributed system.  They exchange messages to provide coherency guarantees.  Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees.  This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor).  So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor.   It writes those changes back at some point in the future.  One reason is because another processor tries to access something within that same memory region.  This is where the messages go flying back and forth, the modified cache line gets written back to main memory.  This turns out to be “expensive”.  Access to RAM on my laptop is around 65 nanoseconds.  Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU.  On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds.  So it is 130 times slower.  On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock.  They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650.  The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU.  From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock.  The Intel chip, on the other hand, has a clock that can be used to construct global atomicity.  There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system.   They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.


NVMCached: An NVM-based Key-Value Cache

NVMCached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, and Song Jiang, in Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, 2016.

This is a short (6 page) paper that discusses the challenges of taking a memory based key-value cache (e.g., memcached) and converting it to use a non-volatile memory.  Because the authors do not have NVM available, they focus on simulating it using DRAM.  They simulate behaviors by using appropriate cache flushes and fence.  In fact, they use CLFLUSH and MFENCE, which may be a bit more than is required.

Here is an interesting thing for people familiar with Intel processor architecture for more than about the past 15 years: the CPUs no longer provide strong ordering guarantees.  If you’ve learned or refreshed your knowledge about the memory consistency models more recently, then you are likely already familiar with this.  One of the interesting issues with NVM is that we need to think about memory consistency in a way that’s never been required previously.  The authors of this paper do this.

They note that flushes are expensive; they argue for using “consistency-friendly data structures” for minimizing the number of flushes required, which certainly makes sense, particularly given that they are expensive.  Some operations (like anything with the LOCK prefix) are performed atomically as well and are conceptually similar; both tie into the processor’s cache architecture (for those that have been around a while, the cache line semantics for LOCK were introduced around 1995.)

One of the important observations in this paper is that the reason to use NVM is that it allows fast restart after a system failure, such as a crash or power event.  For memcached, there are users that are asking for persistence.  Even if the service is only a cache, it turns out that offloading to the cache can be a powerful mechanism for improving overall throughput, but the larger the cache becomes, the higher the cost when the cache gets tossed after a system crash.

However, this also simplifies the problem at hand: discarding data in this type of system is of minimal concern.   They utilize checksums to detect damaged data structures, which alleviates their need to perform ordered write operations; they reorganize their data structures to permit changes with only infrequent fencing.  They point out the problem of zombie data (data that is deleted, then the system crashes, so the data returns after the system reboots).  To prevent this situation they choose to split their data structures.

The checksums prove to be much less expensive than flush; the authors claim up to 20x faster (my own reading indicates that using the optimized CRC32 built into modern Intel CPUs the cost is just over 1 clock cycle per byte checksummed, so a 64 byte cache line would have a cost of approximately 70 clock cycles; on a 2.0GHz processor, that would be roughly 17.5 nanoseconds, which certainly is somewhere between 5 and 20 times faster than writing back a cache line.

The authors also touch on the challenges of memory allocation.  Memory fragmentation is something we often fix in dynamic memory by rebooting the machine.  However, now we have the storage level problem with fragmentation; fixing this involves write amplification.  They organize their allocation strategy by observing there is considerable locality across the KV store.  They combine a log structured allocation scheme with a “zone based” allocation model and attempt to combine cleaning, eviction, and updates to minimize the impact on performance.   They also use DRAM to track access information as part of their LRU implementation.  This creates a burden on crash recovery, but benefits from significant improved runtime performance.

They also implement a DRAM write combining cache; the idea is that when an item is detected to be “hot” the item is cached in DRAM.  Those hot items will thus be discarded if the system crashes.

Their evaluation is against a modified version of memcached.  They utilize four Facebook traces; interestingly they do not use one (VAR) because “it is write-dominant with only a few distinct keys touched.”  I am not certain that I see why this makes it unsuitable for use, though I suspect they do not obtain favorable behavior in the face of it.

They evaluate the performance of native memcached versus their NVM-enhanced version with a series of micro-benchmarks after a restart event.  They show that in several cases it takes more than 1 billion requests before the performance of the DRAM memcached to converge with the NVM enhanced version.  When they use the “real world” Facebook memached traces, they observe that three of the benchmarks show comparable performance.  Two of them, SYS and VAR, show substantially better performance.  Both benchmarks are write intensive.  I admit, this surprised me, since I am not certain why writing to (simulated) NVM should be substantially after than writing to DRAM.

Overall, it is interesting work and touches on important challenges for working with NVM as well as improving system performance.

The Future of Synchronization on Multicores: The Mulitcore Transformation

The Future of Synchronization on Multicores: The Multicore Transformation
Maurice Herlihy in Ubiquity, September 2014.

I’m going to round out the week with a much lighter read.  Despite this, it has some useful observations that underlie some of the other papers that I’ve been discussing.

The editor’s introduction to this piece really does a good job of summing up the problem:

Synchronization bugs such as data races and deadlocks make every programmer cringe — traditional locks only provide a partial solution, while high-contention locks can easily degrade performance. Maurice Herlihy proposes replacing locks with transactions. He discusses adapting the well-established concept of data base transactions to multicore systems and shared main memory.

The author points out: “Coarse-grained locks … generally do not scale: Threads block one another even when they do not really interfere, and the lock itself becomes a source of contention.”  I have personally experienced this and moved on to the next solution, which has its own separate problems: “Fine-grained locks can mitigate these scalability problems, but they are difficult to use effectively and correctly.”

When I have taught about locking in the past, I’ve often approached it from the debugging perspective: fine-grained locks create deadlocks, which can be almost impossible to debug without instrumentation.  In operating systems, we prevent deadlocks by defining a lock hierarchy.  The order in which locks can be acquired forms a graph.  To prevent deadlocks, we require that the graph be acyclic.  That sounds simple and for simple code bases, it is.  However, in the real world where we introduce such fine-grained locks, the code base is seldom simple and we end up finding complex situations, such as re-entrant behavior, where the cycles appear.  Cycles can be introduced because we have multiple discreet components, each doing something logical, that creates a lock cycle unwittingly.

The author also points out another problem with locks that is important in real systems: “Locks inhibit concurrency because they must be used conservatively: a thread must acquire a lock whenever there is a possibility of synchronization conflict, even if such conflict is actually rare.”  A common maxim in systems programming is to optimize the common case.  Locks do the opposite: they burden the common case with logic that is normally not useful.

The author also points out that our lock mechanisms do not compose well:  when we need to construct consistent higher level logic from lower level locked primitives, we have no simple way to interlock them unless they expose their own locking state.  I have built such systems and the complexity of verifying state after you acquire each lock and unwinding when the state has changed is challenging to explain and conceptualize.

This is so complicated that in many cases concurrency is handled within the tools themselves in order to insulate the programmer from that complexity.  It may be done by isolating the data structures – single threaded data structures don’t need locks – so you can use isolation and message passing.  It can be done in a transactional manner, in which the locking details are handled by the tools and lock issues cause the transaction to roll back (abort), leaving the application programmer to restart again (or the tools to attempt to handle it gracefully).

One such way to achieve this is to implement transactional memory: a series of operations that are performed sequentially and once the operation is done, the outcome is determined: the transaction either becomes visible (it is committed) or it fails (it is aborted) and no changes are made.  General transaction systems can be quite complicated: this is a common database approach.

How do we make transactions simple enough to be useful in multicore shared memory environments?

  • Keep them small: they don’t change much state
  • Keep them brief: they either commit or abort quickly
  • Keep them ephemeral: they don’t involve disk I/O, they aren’t related to persistence they are related to consistency.

One benefit of transactions is they are composable: they can be nested.  Transactions can avoid issues around priority inversion, convying, and deadlocks.  The author points to other evidence that says they’re easier for programmers and yield better code.

Transactions aren’t new.  We’ve been using them for decades.  When we use them at disk I/O speeds, we find the overhead is acceptable.  When we use them at memory speeds we find the overhead of transactions is too high to make them practical to do in software.  This gave birth to the idea of hardware transactions.  Hardware transactions can be used in databases (see Exploiting Hardware Transactional Memory in Main-Memory Databases) quite effectively.  They don’t suffer from the high overhead of software transactions.  The author points out a limitation here: “Hardware transactions, while efficient, are typically limited by the size and associativity of the last-level cache”.   When a cache line cannot remain in the CPU, the transaction is aborted.  Software must then handle the abort: “For these reasons, programs that use hardware transactions typically require a software backup.”  As we saw in previous work (again Exploiting Hardware Transactional Memory in Main-Memory Databasesjust retrying the operation once or twice often resolve the fault.  But sometimes the operation is just not viable on the system at the present time.

The author’s summary of the impact of hardware transactions is interesting:

The author predicts that direct hardware support for transactions will have a pervasive effect across the software stack, affecting how we implement and reason about everything from low-­level constructs like mutual exclusion locks, to concurrent data structures such as skip-­‐lists or priority queues, to system-­‐level constructs such as read-­‐copy-­‐update (RCU), all the way to run-­time support for high-­‐level language synchronization mechanisms.

So far, this change has not been pervasive.  I have seen signs of it in the operating system, where lock operations now take advantage of lock elision in some circumstances.   Systems, and software stacks, do change slowly.  Backwards compatibility is a big issue.  As we move forward though, we need to keep these new mechanisms in mind as we construct new functionality. Better and faster are the goals.

Exploiting Hardware Transactional Memory in Main-Memory Databases

Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineeringpp 580-591.

I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work.  By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective.  Intel hardware with transactional memory support was just starting to emerge around the time this paper was released.  I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases.  Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).

Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code).  Cache lines are read from memory, any changed made to the cache line, and then written back to memory.  This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.

The idea behind HTM is that sometimes you want to change more than a single element of memory.  For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion.  Even if there is no contention for the lock, you will pay the lock cost.  If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back.  In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).

With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction.  Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system.  The goal then is to avoid doing any memory operations unless absolutely necessary.

The authors point out that the fastest option is partitioning (ignoring hot spots).  They graphically demonstrate this in Figure 1 (from the paper).  HTM has some overhead, but it tracks with partitioning fairly linearly.  This difference is the overhead of HTM.

They compare this to serial execution, which just means performing them one at a time.  The traditional mechanism for doing this kind of paralleism is the two phase commit protocol.  That’s the lock/work/unlock paradigm.

If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work.   Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed.  Instead, we have hot spots.  This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion.  There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.”  Thus, their observation is: “we have to deal with this problem”.

So, how can HTM be exploited to provide robust scalability without partitioning.  The authors do a good job of explaining how HTM works on Intel platforms.  Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.)  Level 1 cache is the fastest to access, but the most expensive to provide.  Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU.  Level 3 cache might be present on the CPU, in which case it is shared between all three cores.  Note that none of this is required.  It just happens to be how CPUs are constructed now.

The benefit of HTM then is that it exploits the cache in an interesting new way.  Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core.  Note, however, that this could mean just the L1 cache.  In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper).  Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software).  That probability approaches 100% as the size goes above above 8KB.  Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache.  What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating).  If we need to use a ninth we evict one of the existing pages in the cache.

Similarly, when the duration of the transaction goes up, the probability of it aborting also rises.  This is shown in Figure 6 (from the paper).  This is because the chance that various systems events will occur, which cause the transaction to abort.  This includes various types of interrupts: hardware and software.

Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them. 

We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient.   The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.

Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.”  They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.

They put all of this insight together and that forms the basis for their evaluation.  Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.

Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well .  Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation.  They address this a bit later in the paper.

 

Figure 13 (from the paper) compares their performance against a partitioned system.  Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown.  One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance.   This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.

Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation.  They admit as much in the paper: the hardware just isn’t available to them.  They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.

Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations.  If an HTM is retried, many will complete with one or two tries.  Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.

Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.

 

 

 


 

The Cambridge File Server

The Cambridge File Server
Jeremy Dixon, in ACM SIGOPS Operating Systems Review,  Volume 14, Number 4, pp 26-35, 1980, ACM.

Cambridge was certainly a hotbed of systems work in the 1970s (not to say that it still is not).  They were looking at very different architectures and approaches to problems than we saw from the various Multics influenced systems.

The introduction to this paper is a testament to the vibrant research work being done here.  They author points to the Cambridge ring, which was their mechanism for implementing a shared computer network and a precursor to the Token Ring networks that followed.  The CAP computer was part of this network, and the network included a separate computer that had a vast amount of storage for the time – 150MB.  That common storage was used for both “filing systems” as well as “virtual memory”.   This computer ran the Cambridge File Server and implemented the functionality that was explored in the WFS paper.

They identify key characteristics of their file server:

  • Substantial crash resistance.
  • Capabilities used to control access.
  • Atomic file updates.
  • Automatic “garbage collection” of storage space
  • Fast transfer to random accessed, word-addressable files.

The authors make a point of noting there are only two classes of objects in their system: files and indices.  I found this interesting because it echos the hierarchical file systems models that encouraged me to start this journey in the first place.

They define a file: “… a random access sequence of 16-bit words whose contents can be read or written by client machines using the following operations”.  The operations that follow are read and write.  They go on to define an index: “… a list of unique identifiers, and is analogous to a C-list in capability machines”.  The three operations here are: preserveretrieve, and delete. This permits entries to be added, found, and removed.

The storage controlled by the file server thus appears to its clients as a directed graph whose nodes are files and indices.  Each file or index operation is authorised by quoting the object’s unique identifier to the file server, and UIDs are 64 bits long with 32 random bits. Each client, therefore, can access only some of the nodes in the graph at any time, namely those whose UIDs he knows, an dthose whose UIDs can be retrieved from accessible indices.

Thus, they actually have a graph file system that may in fact consist of nodes that are not connected – essentially a pool of disconnected trees that can be traversed if you know how to find the tree, but is effectively hidden otherwise.  They do point out that the sparse space may not be sufficient protection (though I suspect a small finite delay on an invalid lookup with discourage brute force browsing).

Objects are deleted when they cannot be found from some distinguished root index; the paper describes that each client is given its own entry in the root index, pointing to the client specific index.  There is the implication that they will scan the storage looking for such unreferenced objects that can be cleaned up and indeed they refer to a companion paper for a detailed description of this garbage collector.

Their argument for this omission is that it relieves the client of the burden of managing object lifetimes (“… removes from the clients the burden of deciding when to delete an object…”)

Storage space is segregated into “data” and “map” blocks.  The data blocks contain object contents.  The map blocks contain meta-data. New files are stored as a single data block. As the file grows in size, map blocks are inserted to create a tree of up to three levels deep.

The paper then turns its attention to the atomic nature of the updates to the file server.  The author points out that moving from consistent state to consistent state may require multiple distinct changes. Since failures can interrupt you between any two operations, the discussion revolves around ways in which this can be robustly implemented in atomic and recoverable fashion.  The author points out that the overhead in protecting against this class of failures has substantial overhead.  Given that not all files require this level of robustness, he proposes that the file server provide two separate classes of service for data files.  Map blocks are maintained in consistent fashion because they have the file server’s meta-data within them and the consistency of the file server’s control information needs to be preserved.

Much of the detail in the paper at that point involves describing the structure of the meta data and how it is used to implement atomic operations on the file server.  The paper provides a detailed description of how transactions are implemented within this system.  The fact they describe implementing a complete transactional file system, discuss the ramifications of providing user level transactional data storage, and come up with a hybrid model does make this an impressive piece of early work.  We will see journaling file systems more than once as we move forward.

The balance of the paper discusses how this has worked within their systems at Cambridge.   It is interesting and they tie some of the implementation efficiency to the environment of the Cambridge Ring itself.  This is a production file server and the author notes that it is used by a variety of computers (including different operating systems) within their environment successfully.

Its relatively quick response has allowed it to be used to record and play back digitised speech in real time.  The interface provided seems both simple and suitable for a variety of purposes.

Impressive indeed.

WFS: A Simple Shared File System for a Distributed Environment

WFS: A Simple Shared File System for a Distributed Environment
Daniel Swinehart, Gene McDaniel, and David Boggs, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 9-17, 1979, ACM.

This file system was developed at Xerox’s Palo Alto Research Center (PARC), which produced a string of amazing advances in the nascent computer technology area in the 1970s.

Woodstock was “an early office system prototype”.  The authors’ description of Woodstock sound much like early word processing systems, such as those pioneered by Wang Laboratories in the same time frame.  The ability to share data between these systems turns out to be surprisingly important.  Local storage space was used to track the current work, but then centralized storage provides an efficient way to store them and make the work available to others.


This is the environment that gave rise to WFS.  Because Woostock already existed and provided its own hierarchical document directory structure, WFS did not need to provide such a mechanism.  In fact, WFS only provided four classes of operations:

  • I/O operations to read and write blocks of data within files
  • Creating/Destroying resources: file identifiers (FIDs) and storage blocks (pages)
  • Managing file properties, including page allocation data
  • Providing maintenance functions

The actual implementation is surprisingly simple.  Indeed the authors’ state that it took two months to build it.

Figure 1 (from the original paper) describes the format of a request/response packet, showing the basic information exchange model. It is interesting to note that the entire message fits within a small amount of memory and includes an end-to-end checksum.

There are a number of simplifying options with WFS:

  • The namespace for files is flat; there is no hierarchical structure.
  • The file structure is simple (Figure 2).
  • The protocol is stateless and each operation is idempotent.  This simplifies error handling since a lost message can be re-transmitted safely, with no fear that  repeating it will cause problems.
  • Operations are client initiated.  The server never initiates an operation.
  • Clients have limited mutable state.  The server does not permit changing its own state directly from the client.

This simiplicity does limit the generality of WFS, but it also demonstrates an important abstraction that we will see used (and re-used) in subsequent systems: a file can be treated as a block structured device (a “disk”) in an interesting and transparent fashion.

Figure 3 describes the layout of the (stateless) data exchange format used by WFS.

Figure 4 shows the layout of the file directory table which is a contiguous and fixed-size region on disk at a known disk location.  This is a fairly common characteristic of on-disk file system formats, having a known location where meta-data is to be found.

Note that Figure 4 also shows how the file’s allocated storage is described via direct and indirect block references organized into a tree structure.  Again, this will be a recurring model that occurs in file systems; it combines the flexibility of supporting efficient space utilization, ability to describe variable sized files, and efficient utilization of block-addressable storage.

This simple mechanism permits their clients to utilize a flexible storage mechanism without forcing the file server to support any of the mechanisms the client already provides, such as  the hierarchical document name space, management of documents and their structure, etc.  This separation of concerns yields an elegant and simple implementation model for their file server.

There are some interesting implementation details described in the paper:

  • Write operations are validated by reading the data page.  Thus, writes become compare and swap operations that prevents concurrent access from inadvertently overwriting changes made by another client.  It would be rather inefficient to rely upon this mechanism, but it helps prevent out-of-order packet processing in an unreliable network.  The downside to this is they must read the data before they can write it.
  • They use a write-through cache.  Thus, the cache is really for read efficiency, not write efficiency.  This should help mitigate the write inefficiency.
  • Most of their caching is done against meta-data pages (“auxiliary disk pages”) because they are more frequently accessed than client data pages.

Here’s one of the interesting performance results: “In the single-user (lightly loaded) case, WFS improved Woodstock’s average input response time over the local disk’s time for several reasons: WFS’s disks were faster than Woodstock’s local disks, requested pages were sometimes still in the WFS main memory cache, and the amount of arm motion on the local disk was reduced because it no longer had to seek between a code swap-area and the user data area.”

Accessing data over the network was faster than the local disk drive!  Whether this is a statement of how slow disks were versus networks I leave as an exercise to the reader.  One thing we can take away from this: the network often does not impose a significant bottleneck to utilizing remote storage (except, of course, when it does.)

The authors’ follow up their implementation description with an explanation of their design philosophy.  They emphasize the atomic nature of the operations they support, as well as the following properties:

  • Client initiated operations can only access one data page and “a few” auxiliary disk pages.
  • Operations are persistent before WFS returns status to the client.
  • WFS commands are a single internet packet.
  • The WFS protocol is stateless.

They then explain the rationale for these decisions, which relate to simplifying the protocol and server side implementation.

They delve into how clients might use WFS in Section 4.  One explicit take-away here is that they view these “files” as acting like “virtual disks” and this permits the WFS clients to implement their own abstraction on top of the WFS-provided services.  Because WFS doesn’t assume any specific structure for the client data, there is no burden placed upon those client implementations – though they admit at one point that this complicates the client.

The authors are able to point to other systems that utlize WFS besides Woodstock.  They cite to Paxton’s system (A Client-Based Transaction System to Maintain Data Integrity) as being based upon WFS.

The paper discusses security and privacy considerations, admitting their system does not address these issues and suggests various techniques to addressing security using encryption and capabilities.  They round out this section of the paper by discussing other possible enhancements to WFS.

In the end, they provided a simple model for a network file server that permitted a client to implement a range of solutions. As we begin looking at more network file systems, we will see this model extended in various way.

 

Granularity of Locks in a Shared Data Base

Granularity of locks in a shared data base,
Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, in Proceedings of the 1st International Conference on Very Large Data Bases, pp. 428-451. ACM, 1975.

I was originally going to do a write-up of A Client-Based Transaction System to Maintain Data Integrity but realized that I should help motivate the use of locks and transactions better before diving into this topic area.  So to do this, I’ve picked to talk about locks.  This is a critical concept that we utilize in file systems development on a routine basis.  Note that this work actually started out in the database community, not the file systems community.

I will note that this is a long paper (23 pages) and there is considerable detail that I will omit for the sake of brevity – but I do want to touch on the key points that are raised in this paper.

They start off by defining consistency and transaction.  They define consistency first:

We assume the data base consists of a collection of records and constraints defined on these records.  There are physical constraints (ex: in a list of records, if a record A points to record B then record B must exist) as well as logical constraints (ex: conservation of money in a bank checking account application). When all such constraints are satisfied the data base is said to be consistent.

Then they move on to transaction:

transaction is a series of accesses (for read or write operations) to the data base which, applied to a consistent data base, will product a consistent data base.  During the execution of a transaction, the data base may be temporarily inconsistent.  The programs used to perform the transactions assume that they “see” a consistent data base.  So if several transactions are run concurrently, a locking mechanism must be used to insure that one transaction does not see temporarily inconsistent data cause by another transaction.  Also, even if there are no consistency constraints, locks must be used so that the updates of one transaction are not made available to others before the transaction completes.  Otherwise, transaction backup might cascade to other transactions which read or updated the “back up” updates.

This text does not lay out the modern description of transactions that we use (atomic, consistent, isolated, and durable or ACID) but what it describes is a system in which these properties do in fact exist.  This concept is not unique to databases – we see it in any multi actor system sharing resources.  The usual analogy that I use when describing this are traffic intersection based, as the intersection is a shared resource and the individual vehicles separate actors. lock free traffic flow

The optimal configuration for resource sharing is where the need to share is minimized.  If there is no shared data, there is no need for locks. Similarly, if the data is immutable (a concept I know we’ll return to over time) we don’t need to protect it.  Only mutable state needs to use locks.  As it turns out, frequently mutable state doesn’t actually change but because it can change, if we rely upon it, we must protect against change.

This concept of transitioning from one consistent state to another consistent state safely is why we use locks.  The primary contribution of the authors in this paper is not the creation of locks and consistency but rather their creation of a clear language that we still use to describe these concepts.

In Section 2 they then describe what a “lockable object” is.  There are several important observations here, which I summarize:

  • A transaction is sequentially consistent.  This is a very strong consistency guarantee (as we will see in later papers when things get far more general.)
  • The basic lock type is reader-writer.  A reader lock provides shared access to mutable state, along with the guarantee that the mutable state won’t change while the lock is held.  A writer lock provides exclusive access to mutable state, along with the guarantee that the only changes to that state will be made by the owner of the writer lock.
  • Locks may be used to protect non-existent resources, for example, to block the creation of a new object.  Since the object doesn’t exist, you can’t lock it yet (interestingly, I usually tend to think of that as locking the structure in which the new object exists, but their observation that you can lock non-existent items is certainly valid.
  • Locks are requested dynamically.
  • There is a dynamic balance between lock granularity and overhead.  If we lock many small objects, there is a cost associated with each of those lock operations.  In the years since this paper we have made uncontended lock acquisition rather inexpensive, but contended lock acquisition remains an expensive proposition.

The paper then goes on to discuss lock hierarchies.   This is because in any dynamic lock system in which you obtain more than a single lock, you need to ensure that there will never be a situation in which an actor blocks waiting for a lock that will never be released.  The simplest case of this is when an actor blocks waiting for a lock which it owns.  This is clearly a programming error, but it is one that I have seen numerous times over the years.  The more complex case of this is when we have a cycle in lock acquisition.  The paper points out that to prevent this we need to construct a directed acyclic graph showing the order in which locks are acquired.  Usually these are trees, but I have seen environments in which they really are DAGs, due to re-entrant behavior.

The paper describes their model for one type of lock, which is focused on locking within an hierarchical name space.  Thus, they have exclusive (X) access, shared (S) access, and intention (I) locks.  The intention lock is then used to protect each object along the lock hierarchy, with X or S access to the final node in the hierarchy.  The paper goes into greater detail about this; I’ll omit it because it is not really germane to what I found most interesting about this paper.

The final point that I’ll draw out of this paper is that they discuss the idea of lock scheduling, deadlock, and lock thrashing.  The lock hierarchy is intended on preventing deadlock.  Lock scheduling can be used to detect deadlocks.  Thrashing relates to memory contention and calls out to other operating system mechanisms for resolution.

With this behind us, we should have a better basis for understanding how we use transactions and locking to ensure consistency in file systems.

The DEMOS File System

The DEMOS File System
Michael L. Powell, In Proceedings of the sixth ACM symposium on Operating systems principles, pp. 33-42.

This paper delves into the nitty gritty details of constructing physical file systems.  I was surprised that it had relatively few citations (61 according to Google Scholar when I checked) because, having read it, I would hand this paper to someone asking me “what are file systems?”  I suspect that the more frequently cited paper in this area will be “A Fast File System for UNIX,” which cites to this paper.

The target for DEMOS is the CRAY-1 supercomputer, at the time the fastest computer in the world.  As a matter of comparison, modern mobile devices have more computational power (and often more I/O bandwidth) than the CRAY-1 did.

DEMOS Figures 1 and 2The author discusses the design of a new file system for use with a custom operating system for the Los Alamos National Laboratory’s CRAY-1 computer system.  A key for this project was that it seeks to improve performance as much as possible.  After all, why build a super-computer if you then cripple it with features that don’t enhance its performance?

What I find delightful about this paper is that it describes the basic constituent parts of a file system, as well as strategies for optimizing performance.  It does so in a clear and understandable fashion.DEMOS Figures 3 and 4

DEMOS utilizes a UNIX-like hierarchical file system model.  It has directories and files. It does not have the link model from Multics so paths to files are unique in DEMOS.  Files are managed in units of blocks (4096 bytes) but I/O is specified as bytes (interestingly, they specify eight bit bytes as nine bit machines were still in use.)

The authors discuss file sizes.  To the best of my knowledge this is one of the earliest papers covering this common subject  (which is revisited periodically because workloads change and file sizes also change).  One of the common themes I have seen in other work is mirrored here: most files are small.  Figure 1 shows a CDF for file sizes.  We note that the majority of files in their system are small, with approximately 75% being less than 1KB; this is consistent with later work as well.  Their second figure (Figure 2) describes the proportion of transfer sizes and their source.   We see a spike in the 100, perhaps 256 or 512 being “natural block sizes” that applications would use.

demos figure 5They establish lofty performance requirements: “[T]he file system will have to support a bandwidth of 20-60 megabits/second or higher”. Our performance requirements today would be much higher, but this recognizes the reality that then (as now) the I/O bandwidth of storage is often the rate limiting factor.

DEMOS is paired with a centralized storage facility (“Common File System” or CFS) that is to provide the function of what we would now think of as a centralized file server.  While not yet implemented by the time of the paper, their plan was to introduce automatic file migration and staging.

The central bit of the paper then describes the constituent parts of the file system.  This maps rather well onto what I have seen in the typical file system: a “request interpreter” that handles requests from applications.  Even their description is appropriate: “parameter validation and request translation”; a “buffer manager” that handles the allocation of buffer cache space (often virtual cache these days); and a “disk driver” that handles low level data operations, such as filling or storing the contents of buffers.

Figures 3 and 4 capture their insight into the disk manager.  This dovetails with their discussion about efficiency of I/O, including observations about queue management (“shortest seek time first” order for requests, and then sub-sorted by “shortest latency time first”).  This is a clear “hat tip” to the impact that rotational latency and track seek time has on performance.

Speaking of performance, the authors discuss this.  It leads to their observations on improving I/O performance: “I/O operations out to proceed in parallel with computation”.  Their point is that serializing these things decreases overall performance.  Their second observation: “[T]he length of time an I/O operation takes should be reduced as much as possible.”  This seems logical and is one reason why they use their optimized strategy.

There is a section on “file system buffering” that touches on the tradeoffs between using memory for buffer caching versus other possible uses.  Thus, the authors evaluate how increased buffering impacts their CPU utilization – this is in keeping with their goal of parallelizing I/O and computation.  Their observation?  The greatest benefit comes from a small number of buffers, in their analysis eight buffers provides most of the benefit. Building on that Figures 6 and 7, they observe there is a clear limit to the benefit of further buffering.  These days we do not think too much about this because we tend to use virtual caches, so the amount of physical memory is really managed by the virtual memory management code, yet the observation would likely still apply.  There is a limit to the benefit of buffering.

The authors also point out that disk allocation is a challenging.  They employ allocation bit maps, cluster allocations, over-allocate, and even use simplistic predictive read-ahead.  They refer to these as “strategy” routines.

In general, this is a great introduction to the basic structure of a media file system.  There are plenty of details that will be refined in later work.

 

 

 

The Cap File System

The Cap Filing System
R. M. Needham and A.D. Birrell, Symposium on Operating Systems Principles, 1977, Association for Computing Machinery.

I’ve fallen behind this past ten days, working towards a deadline.  My own conference paper is now submitted and I’m working on recovering.  What that means is this week is going to be a busy one as I work on catching up.  Adding to the fun, FAST is going on this week (February 13-16, 2008).  I hope to add some bonus discussions on the current work being presented there.

Let’s get to discussing this paper.

CAP is a capabilities based system.  The capability systems are a parallel idea that has been explored from time to time versus the access control system for Multics/UNIX inspired operating systems.  The latest contender in this space would be Fuschia, an experimental operating system that Google is developing (though it is an open source project under a mixture of licenses).  Interestingly, there are reports it runs on the Pixelbook in addition to the previous hardware it had supported.

At any rate, the idea behind a capability is to use identifiers that encapsulate the access rights within the capability itself.  Thus, it is a name and the virtue of having the name is that it means you have the access inherent in that name.

These capabilities all represent to a program the right to have or do something: the preservation of information from one run of a program to another (a universal operating system requirement) is thus seen by the CAP programmer as the preservation of a capability rather than of an object itself.

The paper describes how capabilities are managed, which interacts with the “filing system” of course, since it has to store persistent information.  They describe the “system internal name” (SIN), which is combined with a disk address to map memory segments to actual storage.  The capability (in a “directory”) then becomes a repository of these persistent objects.  This also creates a reference to the disk block that is in use, ensuring those storage regions are not reused.

One interesting characteristic of their system is that they do not guarantee free disk space will be recycled quickly (“[T]here is no guarantee that inaccessible disk space will be relinquished at the earliest possible moment.”)  Indeed they note that space reclamation is only done when the system reboots and “[T]he filing system is not designed to run forever.”

They discuss how CAP differs from prior work (e.g., OS/360) where only the name matters; there is no concept of a directory for use as part of the capability system.  The directory actually provides them an additional level of control as well, and they use directory capabilities as well as segment capabilities.  Directories may be persistent (so they have a SIN and disk block location) or ephemeral (so they disappear when the program exits) – a sort of “built in” temporary storage concept for ephemeral memory management.

Sharing objects is now trivial – you simply tell someone the correct name; that embeds the capability within it.  They do not describe the mechanisms used for transfer (“[B]y mechanisms which do not concern us in detail here.”)

They also describe how this name space is similar to other systems (UNIX and CAL-TSS) but different:

  • Access is associated with the name which in turn references information in the directory.  It is not an attribute of the file itself.
  • The name space need not be a “strict hierarchy”.  This means that portions could become disconnected, or even be private to a single application.
  • Their use of directories behaves similar to the model of “current directory” (presumably in UNIX) even though CAP expressly does not have a concept of current directory.
  • Directories are not even directed acyclic graphs!

The paper describes how capabilities work, since they are a fine-grained control mechanism.  They explain that the holder of an existing capability (a program) may generate a more restrictive capability to provide to another program.  Since capabilities apply to both individual files as well as directories, it is possible for a program to transfer a set of capabilities by creating a new directory and storing the relevant capabilities to the target program.

The names themselves can be quite ugly, as they incorporate capabilities within them.  They describe some conventions they employ in terms of name management and directory placement, but point out these are conventions and not a hard requirement of the system.

CAP certainly presents a rather different model of a file system than we see in other systems.  The idea of disconnected name spaces that are only visible to particular programs is an intriguing one.  They use the example of the password database, which requires the program have the password file capability.

They discuss security.  The directory manager is a separate module that programs use to interact with the directories to which they have access.  To invoke the directory manager, the caller must have the ENTER capability.

I find this to be exactly the type of thought provoking paper I had hoped to find as I comb through these old papers.  The idea that a file system name space need not be connected, that it could be private to a particular program or set of programs, and embedding access rights (“capabilities”) into the name will give me plenty to think about.

If you would like to know more about CAP there is a paper about it in the prior Symposium on Operating Systems Principles: The Cambridge CAP Computer and its Operating System.  It is not too surprising that this is available from Microsoft Research, as they also built a capability based operating system (or two): Singularity and Midori.

 

Shanghai Bound!

I’m on my way to the the Symposium on Operating Systems Principles, which is a well-known operating systems conference.  This time it is in Shanghai, China so I’m making my way there.  I do not have the privilege of presenting anything there this trip, but hopefully I will be able to do so in the future – it really is just a small matter of coming up with some interesting research that is worthy.

This will be my first time in China.  Definitely looking forward to seeing some of the interesting talks that are scheduled throughout the conference!