Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell, in Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11), pp 61-75, 2011, USENIX.
In this paper the authors turn their attention to data structure considerations for Non-Volatile Memory (NVM). Unlike the previous papers I have covered (Mnemosyne and NV-Heaps) they look at data structures specifically optimized to exploit the capabilities of NVM. From the abstract:
For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allows programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging.
Some other aspects of this paper that stand out:
- They are looking at NVM replacing both DRAM and storage – thus, they view this as a single level store.
- They use versioning to protect their data structures, versus logging.
- They describe how to achieve this without hardware changes.
The paper has a good review of NVM memory technologies that may emerge. Table 1 (from the paper) underscores the dramatic decrease in latency these technologies promise. This is what we have been waiting for, for more than 10 years now. It really does change the platform in ways that we have not yet realized.
But it is not just the speed aspect that matters; the persistence aspect matters as well. The density is much greater for these memories, too. Anyone who has looked at an NVMe M.2 drive can notice how few and how small the components on it are.
Do we treat it as storage? If so, perhaps we should look to the file systems world for some insight. The authors turn to the WAFL shadow page mechanism. They point to BTRFS and its use of this technique with B-trees. They dismiss this approach, concluding that CDDS involves “fewer data-copies”. They also distinguish this work because it is byte addressable, whereas prior work was block oriented (page addressable), so the lessons learned from that work aren’t directly applicable. They do point out that using NVM makes sense in a world of large, persistent, storage-backed data farms. So there is a need, and one they see fulfilled by NVM. It just needs efficient use of that NVM.
Thus, the authors walk their own path.
The speed of NVM is such that direct access is the only mechanism that makes sense. System calls impose too much overhead, doubling the cost of accessing the NVM itself. Thus they posit that access will be direct (and indeed that is what seems to have come to pass).
They observe that one of the challenges for persistent data is that CPUs do not provide mechanisms for ordering persistent writes (strictly speaking they do, but only at the fairly coarse granularity of a fence). So they describe the issues that must be handled:
- Reordering of writes from the caching behavior of the CPU itself as well as a multi-level cache hierarchy.
- Failure semantics of atomic operations across power failures.
They summarize the various approaches available to them for ensuring data has been stored properly in NVM. This includes memory fences, cache writeback and invalidate operations, marking memory as non-cacheable (which forces write-back), cache line flushes, and atomic processor operations. They point out that this is not sufficient for more complex updates, such as a tree rebalancing operation. This leads them to versioning.
I found it interesting that their goals were similar to those I have seen previously: durability, consistency, scalability, and ease-of-use for the programmer. They also note that they focus on physical consistency of the data contents in memory. Logical consistency of higher level meta-data structures is not addressed in the context of this work.
Thus, the authors point out that a CDDS is an abstract idea; they demonstrate how they envision using it by implementing a B-tree (Figure 1, from the paper, shows their B-tree structure).
In versioning, changes are not made in place to the current version; instead, a new version is written. The current version is immutable (at least as long as it is the current version). Atomic operation and copy-on-write techniques are used to make changes persistent. Once done, a new version number is assigned and the new version becomes the current version. The old version can be recycled once its reference count drops to zero.
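To make the versioning scheme concrete, here is a minimal sketch in Python (the names and structure are my own illustration, not the paper's CDDS B-tree): each key keeps a list of entries tagged with the version range in which they are valid, and an update closes out the live entry and appends a new one before the version counter advances.

```python
# Hypothetical sketch of CDDS-style versioning; names are illustrative.
class VersionedStore:
    def __init__(self):
        self.current_version = 0
        # key -> list of [start_version, end_version, value]; end None = live
        self.entries = {}

    def put(self, key, value):
        new_version = self.current_version + 1
        for entry in self.entries.setdefault(key, []):
            if entry[1] is None:          # close out the previously live entry
                entry[1] = new_version
        self.entries[key].append([new_version, None, value])
        # On real NVM this assignment would follow a flush + fence, so the
        # new data is durable before the version number advances atomically.
        self.current_version = new_version

    def get(self, key, version=None):
        version = self.current_version if version is None else version
        for start, end, value in self.entries.get(key, []):
            if start <= version and (end is None or version < end):
                return value
        return None
```

The version-counter update is the commit point: because it only advances after the new entries are durable, a crash beforehand leaves only unreferenced garbage, which is exactly what their recovery step cleans up.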
Failure recovery then becomes a function of “cleaning up” any in-progress operations that had not yet been made durable.
The authors then walk through their B-tree example. They explain their use of versioning, they provide pseudo-code for the various B-tree operations (lookup, insert, delete) as well as the internal operations needed to support them.
They evaluate their solution by simulating NVM on top of DRAM (a common approach, as we have seen). They compare against BerkeleyDB, STX B-Tree, Tembo, Cassandra, and Redis. They were slower than the entirely in-memory STX B-Tree, presumably due to the overhead of ensuring durability. They are much faster than BerkeleyDB (even BerkeleyDB running on a RAM disk.) They also tested using YCSB as their “end to end” benchmark.
In the end, they do demonstrate that it is possible to rebuild existing data structures – preserving the interface – so they work efficiently with NVM. They even point out it does not require processor changes to do so. Given that better processor cache control mechanisms have been introduced since then, I would expect that exploiting them will lead to even better performance.
NV-Heaps: Making Persistent Objects Fast and Safe with Next-Generation, Non-Volatile Memories
Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp, Rajesh K. Gupta, Ranjit Jhala, Steven Swanson
in ASPLOS XVI Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, Pages 105-118, March 5-11, 2011.
This paper was presented at the same conference as Mnemosyne. The authors explore a different use of Non-Volatile Memory (NVM): using it for storing persistent objects. The authors sum up the motivation for this:
Creating these data structures requires a system that is lightweight enough to expose the performance of the underlying memories but also ensures safety in the presence of application and system failures by avoiding familiar bugs such as dangling pointers, multiple free()s, and locking errors. In addition, the system must prevent new types of hard-to-find pointer safety bugs that only arise with persistent objects. These bugs are especially dangerous since any corruption they cause will be permanent.
Thus, their motivation is to enable the use of these nifty “persistent, user-defined objects” that are not practical when backed by disks (“[T]he slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance.”)
The authors make some important observations that are just as applicable today as they were in 2011. These include the observation that persistent objects in NVM cannot reasonably be treated like disk based objects “… because the gap between memory and storage performance drove many design decisions that shaped them.” Nor can they be treated like volatile memory: “To guarantee consistency and durability, non-volatile structures must meet a host of challenges, many of which do not exist for volatile memories.”
They also observe that NVMs greatly expand the possibility of bug sources from having disparate address spaces. In other words, while you have a single address space, part of it is ephemeral and if you store a reference to the ephemeral part in the persistent part, it will be inconsistent after the current process terminates.
I found their observation about the ability of programmers to reason about this also apropos: “[t]rusting the average programmer to “get it right” in meeting these challenges is both unreasonable…” This is consistent with more than 50 years of experience in systems. Personally, I don’t think this is an indictment of the programmer so much as it is a burden on the system (a perspective the authors appear to endorse as well). To make this viable, we need to make it easy to get it right.
Figure 1 shows the general architecture of NV-Heaps: it is envisioned as a library of useful services layered on top of the operating system provided abstractions. One important observation here is that this model completely avoids the need to interact with the operating system in ordinary program execution. Persistence no longer relies upon the standard file system interface.
The authors’ explanation of their goals looks like a veritable “wish list” to me: prevent programmer errors, transactions, referential integrity, performance and scalability, and ease of use. I’m not sure how referential integrity is different from programmer errors, but clearly it is a very important aspect of their persistent system.
Figure 3 shows how they handle one of the complex consistency cases inherent in managing NVM: the need to ensure that operations can be safely restarted. For example, when deleting a large data structure, such as a tree, it must be removed in a way that it can be stopped and restarted (e.g., if the system were to crash, it must then be able to resume removal). To resume after a crash, they use a log of operations and replay it – a classic solution to the problem.
To make referential integrity work properly, they rely on programming language constructs. The authors note they achieve this by using 128 bit pointer values (on a 64 bit system).
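The idea behind such wide pointers can be sketched as follows (a hypothetical illustration, not NV-Heaps' actual 128-bit layout): part of the pointer identifies the heap an object belongs to, which lets the system reject references that would dangle once the process exits.

```python
# Illustrative sketch only: a "wide" pointer carries the identity of the
# heap it belongs to alongside an offset, rather than a raw address.
class WidePointer:
    def __init__(self, heap_id, offset):
        self.heap_id = heap_id   # which heap this object lives in
        self.offset = offset     # position within that heap

class NVHeap:
    def __init__(self, heap_id):
        self.heap_id = heap_id
        self.objects = {}
        self.next_offset = 0

    def alloc(self, value):
        ptr = WidePointer(self.heap_id, self.next_offset)
        self.objects[self.next_offset] = value
        self.next_offset += 1
        return ptr

    def store_ref(self, ptr):
        # Refuse to store a reference into another heap (or into volatile
        # memory): such a reference would be meaningless after a restart.
        if ptr.heap_id != self.heap_id:
            raise ValueError("cross-heap reference rejected")
        return ptr

    def deref(self, ptr):
        assert ptr.heap_id == self.heap_id
        return self.objects[ptr.offset]
```

Storing offsets rather than raw addresses also means the heap can be mapped at a different virtual address on each run without invalidating persistent pointers.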
The paper describes their implementation in considerable detail. Again, as we would expect, the implementation yields substantially better performance than comparable systems backed by disks – this really shouldn’t come as a surprise, given the performance differential between disks and non-volatile memory. Even if they had used solid state disks (which existed but were rare in 2011) their results would still have been notably better. Figure 8 shows their performance information, comparing themselves against several other systems. One thing to note: they did not have actual NVM. They use a memory simulator to model the behavior of the system. The performance figures they provide surprised me: they are substantially faster than I would have expected. For PCM, they used a 67 nanosecond (ns) read time and 215 ns write time. The paper explains how they obtained these values and how they validated them. For STTM (a different NVM technology) they reported 29 ns read and 95 ns write. As a baseline, their DRAM read time was 25 ns, and write time was 35 ns.
While these numbers were lower than I would have expected, the relative ratio is close to what I expected from other things that I have read: PCM memory is about 2.5 times slower for reads, and 10 times slower for writes. This is consistent with what the paper reports. I guess it’s time to update my mental “Jeff Dean” numbers. And indeed, it turns out that DRAM latency is around 15 ns.
The authors were able to modify memcached to use their library for persistence. They report that they were able to get within 8% of the original memcached. That seems like an excellent outcome.
All we need now are NVMs.
Mnemosyne: Lightweight Persistent Memory
Haris Volos, Andres Jaan Tack, Michael M. Swift, ASPLOS ’11 March 5-11, 2011.
The abstract starts us off in this brave new world:
New storage-class memory (SCM) technologies, such as phase-change memory, STT-RAM, and memristors, promise user-level access to non-volatile storage through regular memory instructions. These memory devices enable fast user-mode access to persistence, allowing regular in-memory data structures to survive system crashes.
So faster, doesn’t require privilege, works like memory, and persistent. Pretty fancy stuff.
File systems aren’t really constructed to give user applications direct access to the disk. Generally access is mediated by an I/O interface: open, close, read, and write. But memory isn’t accessed in that fashion at all. So, how does this affect things? What do the services look like? What does it mean to take something everyone thinks of as transient and make it persistent?
Let’s start exploring!
Mnemosyne provides an explicit mechanism for exposing persistent memory to applications. This is done by extending the programming tools so they can declare something should be stored in persistent memory, or so that it can be dynamically allocated with the proviso that it be allocated from this persistent memory.
Thus, the default is that an existing application retains the same behavior – it does not use persistent memory. If an application wishes to use persistent memory it must be modified to do so. Mnemosyne will provide a basic service level, but it won’t change the behavior of existing applications (technical debt really does follow us around in this business).
It’s impressive: “… Mnemosyne can persist data as fast as 3 microseconds.” Existing applications modified to use it become much faster. Figure 1 (from the paper) describes the architecture the authors created for Mnemosyne.
This architecture envisions the persistent memory being exposed to the application through a persistence interface; the motivation for this is that merely having persistent memory is not enough. It requires additional work to ensure that it is crash resistant. In other words, the system can restore the state of the contents in memory to some well-defined consistent state.
This is something file systems routinely handle – the issues of persistence and recoverability. I often try to think about failure: how does failure manifest? How do I know that I can recover the state to a consistent spot and then proceed?
This is an uncommon concept for most application developers: they don’t need to worry about the contents of memory being “consistent” in the face of crashes because when the application crashes, the memory is lost.
Mnemosyne provides a model of consistency for applications by creating an explicit mechanism for providing crash consistency. Note that Mnemosyne won’t define those consistent states – the application must define what it means for its data structures to be consistent. What Mnemosyne offers are certain guarantees about the contents of memory.
The authors’ decision to virtualize their SCM is an interesting one: “[V]irtualization prevents a memory leak in one program from monopolizing a finite amount of SCM.” Thus, they stage SCM content to disk between processes. Consistency of data is provided by “ordering writes”. The authors identify four consistency mechanisms:
- Atomic variable update – update the data in place as a single all-or-nothing operation.
- Append updates – data is not written in place, but rather a new copy is written, such as it might be to the end of a log (such updates are ordered).
- Shadow updates – data is written to a new location and once done, the pointer to the old copy is updated to point to the new copy (e.g., via an atomic variable update). The authors point out there is a potential leak here that must be handled properly.
- In-place updates – used for data structures that can be modified in place; provided the operations are ordered.
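The shadow update pattern, together with the potential leak the authors warn about, can be sketched like this (a simplified illustration; the slot structure is mine, not Mnemosyne's):

```python
# Minimal sketch of a shadow update: the new copy is written out of
# place, then a single pointer switch (the analogue of an atomic
# variable update) makes it the current copy.
class ShadowCell:
    def __init__(self, initial):
        self.slots = {0: initial}
        self.current = 0          # "pointer" to the live copy
        self.next_slot = 1

    def update(self, new_value):
        slot = self.next_slot
        self.next_slot += 1
        self.slots[slot] = new_value   # write new copy; old copy untouched
        old = self.current
        # On real hardware: flush the new copy, fence, then atomically
        # flip the pointer. A crash just before the flip leaks the new
        # slot -- the leak the authors say must be handled properly.
        self.current = slot
        del self.slots[old]            # reclaim the superseded copy

    def read(self):
        return self.slots[self.current]
```

Readers never observe a partially written value: until the pointer flips they see the old copy, and afterwards the complete new one.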
Consistency guarantees for persistent memory are accomplished using processor semantics and mechanisms:
- A write through operation (e.g., a temporal move) that is written directly to memory.
- Memory fences that ensure strict ordering of operations before the fence relative to operations after the fence.
- Cache line flushes. The CPU stores memory inside the processor while it is acting upon it. In fact, a modern CPU has multiple levels of cache. The smallest (and most expensive per byte) is the Level 1 cache; it is also the fastest. L2 cache is larger and slower than L1 cache. L3 cache is typically shared by all cores on the processor; it is the largest and slowest of the caches.
For storage people, some of this is familiar and some of it is different – instead of worrying about storage stack semantics we’re now worrying about processor cache semantics. One upside is that processor semantics are more rigidly enforced than storage semantics (e.g., disk drives that lie and say that the data has been written when it hasn’t.) One downside is that it’s a new failure domain. For anyone used to working with persistent storage, understanding the failure domain is vital. I suspect it is also different for people used to thinking about the processor perspective, since persistence isn’t usually something you have to reason about.
Mnemosyne implemented a persistent heap allocator, a modified version of Intel’s STM Compiler (we’ll see later that others had to move that work to other compilers because it is now abandoned), a logging mechanism, a persistent region mechanism, and a transactional system (based upon TinySTM).
Their results are, of course, good. After all, if they had not been good, they wouldn’t have been published. They outperform BerkeleyDB (for some metrics). They demonstrated a fast and persistent red-black tree implementation. They show the benefits of asynchronous truncation.
Mnemosyne was a useful contribution because it was an early exploration into considering how we should use byte-addressable non-volatile memory. The library they built is used in future work as well, and this is a heavily cited paper.
The Log-Structured Merge-Tree (LSM-Tree)
Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil, in Acta Informatica, Volume 33, Issue 4, pp 351-385, 1996, Springer.
This paper does not relate to non-volatile memory, but we will see Log-Structured Merge Trees (LSMTs) used in quite a few projects. From the abstract:
The log-structured merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components.
So LSMTs originate from concerns about the latency issues around disk drives.
In a nutshell, the challenge with disk drives is that they have mechanical parts that must be moved in order to read the data. Data is written in concentric bands around the center. The angular velocity of the disk platter is the same, but of course the surface velocity is lowest towards the center and fastest towards the outer edge. The disk drive “head” is moved in and out to read from each of those concentric circles. Since it can only read what is beneath the head, it also must wait for the desired data to rotate under the head. This is one reason why faster disk drives (usually measured by the rotations-per-minute number) provide faster response times. On the other hand, faster disk drives generate more heat and are more expensive to build.
Thus, an important consideration for file systems working on rotating media is the latency to perform random access. Tape drives have the highest latency, since we have to reposition the tape to get to another location and read its data (there are other complexities as well, such as the fact that tape benefits most from streaming write). Hard disk drives (HDDs) are not as bad as tape drives in general (though SMR drives act much like tape drives, which is one reason I mention tape here.) Solid state disks (SSDs) are better still, though even for an SSD random access is slower than sequential access – but both are much faster than HDDs.
Some of those papers that I have yet to cover describe the concept of a log-structured file system. One of the things that I learned when working on the Episode File System was that converting random I/O to sequential I/O was definitely a win (so was asynchronous logging). It is this observation – that converting random I/O to sequential I/O pays off – that provides the benefit of using journaling techniques (the “log”).
So LSMTs capitalize upon this advantage. Note that an LSMT is not a single data structure; rather, it is a general technique for working with systems where insert and delete are the common operations, such as meta-data within a file system, or (key,value) tuples in a key-value store. The paper also points out that reading large sequential blocks is generally more efficient; the authors cite IBM when noting that a single page read from the DB2 Database takes approximately 10 milliseconds, while a read of 64 contiguous pages costs about 2 milliseconds per page (128 ms total). So batching I/O operations is also an important mechanism for improving performance.
So what is an LSMT? “An LSM-tree is composed of two or more tree-like component data structures.” From there the authors describe their initial case: where one tree is memory resident and the other is disk resident. Note, however, this is the smallest set for a valid LSMT. Systems with more than two have been built – and one way to use non-volatile memory (NVM) is to add it as another layer to an LSMT.
Figure 2.1 (from the paper) shows the high level structure of the LSMT – a tree-like structure at each level of the storage hierarchy; these levels can have different characteristics, such as being ephemeral (in memory) or persistent (on disk, tape, SSD, etc.) Figure 2.2 provides greater detail, showing how data is merged from one level to the next (“rolling merge”). Figure 3.1 shows a generalization of the LSMT, in which data is migrated from one level of the storage hierarchy to the next.
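A toy version of the two-component case can be sketched as follows (illustrative only; it captures the deferral-and-batching idea, not the paper's actual C0/C1 rolling-merge algorithm):

```python
# Toy two-component LSM: inserts land in an in-memory component; when it
# fills, it is merged with the disk-resident component in one pass,
# reminiscent of merge sort.
import bisect

class TwoLevelLSM:
    def __init__(self, memory_limit=4):
        self.c0 = {}              # memory-resident component
        self.c1 = []              # "disk"-resident component: sorted (key, value)
        self.memory_limit = memory_limit

    def insert(self, key, value):
        self.c0[key] = value
        if len(self.c0) >= self.memory_limit:
            self._merge()

    def _merge(self):
        # Merge the two sorted runs; newer (c0) entries win on conflict.
        merged = dict(self.c1)
        merged.update(self.c0)
        self.c1 = sorted(merged.items())
        self.c0 = {}

    def lookup(self, key):
        if key in self.c0:        # values stay accessible at every level
            return self.c0[key]
        i = bisect.bisect_left(self.c1, (key,))
        if i < len(self.c1) and self.c1[i][0] == key:
            return self.c1[i][1]
        return None
```

The win is that many random inserts are absorbed in memory and then written out as one large sequential batch, exactly the random-to-sequential conversion discussed above.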
Figure 6.1 then helps motivate this work: data that is seldom accessed (which is most of the data) is “cold” and can be stored in lower cost storage. Data that is frequently accessed (which is a small amount of the data) is “hot” and benefits from being stored in faster but more expensive storage. Indeed, there is a substantial body of work at this point that demonstrates how data tends to cycle from being hot to being cold. Thus, there is a period of migration for “warm” data. This also helps explain why having a multi-stage model makes sense. This behavior is quite general, in fact. Disk drives are constructed with caches on them. The cache is for the hot data, the disk storage for the cold data. SSDs are often structured with multiple classes of NVM; a small amount of expensive but fast NVM and then a larger amount of less expensive (often block oriented) NVM. Even CPUs work this way (as I will be discussing ad nauseam), where there are multiple levels of caching: L1 cache is small but very fast, L2 cache is larger and slower (and cheaper), L3 cache is again larger and slower. Then we get to memory (DRAM) which is even slower. That’s all before we get to storage!
This is quite a long paper: they describe how data is merged from one level to the next as well as do an in-depth analysis of cost versus performance. But I will leave ferreting out those details to the interested reader. I got what I came for: a basic description of the tiered nature of LSMTs and how we can use them to make storage more efficient without driving up costs.
The Cambridge File Server
Jeremy Dixon, in ACM SIGOPS Operating Systems Review, Volume 14, Number 4, pp 26-35, 1980, ACM.
Cambridge was certainly a hotbed of systems work in the 1970s (not to say that it still is not). They were looking at very different architectures and approaches to problems than we saw from the various Multics influenced systems.
The introduction to this paper is a testament to the vibrant research work being done there. The author points to the Cambridge Ring, which was their mechanism for implementing a shared computer network and a precursor to the Token Ring networks that followed. The CAP computer was part of this network, and the network included a separate computer that had a vast amount of storage for the time – 150MB. That common storage was used for both “filing systems” and “virtual memory”. This computer ran the Cambridge File Server and implemented the functionality that was explored in the WFS paper.
They identify key characteristics of their file server:
- Substantial crash resistance.
- Capabilities used to control access.
- Atomic file updates.
- Automatic “garbage collection” of storage space.
- Fast transfer to randomly accessed, word-addressable files.
The authors make a point of noting there are only two classes of objects in their system: files and indices. I found this interesting because it echoes the hierarchical file systems models that encouraged me to start this journey in the first place.
They define a file: “… a random access sequence of 16-bit words whose contents can be read or written by client machines using the following operations”. The operations that follow are read and write. They go on to define an index: “… a list of unique identifiers, and is analogous to a C-list in capability machines”. The three operations here are: preserve, retrieve, and delete. This permits entries to be added, found, and removed.
The storage controlled by the file server thus appears to its clients as a directed graph whose nodes are files and indices. Each file or index operation is authorised by quoting the object’s unique identifier to the file server, and UIDs are 64 bits long with 32 random bits. Each client, therefore, can access only some of the nodes in the graph at any time, namely those whose UIDs he knows, and those whose UIDs can be retrieved from accessible indices.
Thus, they actually have a graph file system that may in fact consist of nodes that are not connected – essentially a pool of disconnected trees that can be traversed if you know how to find the tree, but are effectively hidden otherwise. They do point out that the sparse space may not be sufficient protection (though I suspect a small finite delay on an invalid lookup would discourage brute force browsing).
Objects are deleted when they cannot be found from some distinguished root index; the paper describes that each client is given its own entry in the root index, pointing to the client specific index. There is the implication that they will scan the storage looking for such unreferenced objects that can be cleaned up and indeed they refer to a companion paper for a detailed description of this garbage collector.
Their argument for this approach is that it relieves the client of the burden of managing object lifetimes (“… removes from the clients the burden of deciding when to delete an object…”)
Storage space is segregated into “data” and “map” blocks. The data blocks contain object contents. The map blocks contain meta-data. New files are stored as a single data block. As the file grows in size, map blocks are inserted to create a tree of up to three levels deep.
The paper then turns its attention to the atomic nature of the updates to the file server. The author points out that moving from consistent state to consistent state may require multiple distinct changes. Since failures can interrupt you between any two operations, the discussion revolves around ways in which this can be robustly implemented in atomic and recoverable fashion. The author points out that protecting against this class of failures imposes substantial overhead. Given that not all files require this level of robustness, he proposes that the file server provide two separate classes of service for data files. Map blocks are maintained in consistent fashion because they contain the file server’s meta-data, and the consistency of the file server’s control information needs to be preserved.
Much of the detail in the paper at that point involves describing the structure of the meta data and how it is used to implement atomic operations on the file server. The paper provides a detailed description of how transactions are implemented within this system. The fact they describe implementing a complete transactional file system, discuss the ramifications of providing user level transactional data storage, and come up with a hybrid model does make this an impressive piece of early work. We will see journaling file systems more than once as we move forward.
The balance of the paper discusses how this has worked within their systems at Cambridge. It is interesting and they tie some of the implementation efficiency to the environment of the Cambridge Ring itself. This is a production file server and the author notes that it is used by a variety of computers (including different operating systems) within their environment successfully.
Its relatively quick response has allowed it to be used to record and play back digitised speech in real time. The interface provided seems both simple and suitable for a variety of purposes.
WFS: A Simple Shared File System for a Distributed Environment
Daniel Swinehart, Gene McDaniel, and David Boggs, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 9-17, 1979, ACM.
This file system was developed at Xerox’s Palo Alto Research Center (PARC), which produced a string of amazing advances in the nascent computer technology area in the 1970s.
Woodstock was “an early office system prototype”. The authors’ description of Woodstock sounds much like early word processing systems, such as those pioneered by Wang Laboratories in the same time frame. The ability to share data between these systems turns out to be surprisingly important. Local storage space was used to track the current work, while centralized storage provided an efficient way to store documents and make the work available to others.
This is the environment that gave rise to WFS. Because Woodstock already existed and provided its own hierarchical document directory structure, WFS did not need to provide such a mechanism. In fact, WFS only provided four classes of operations:
- I/O operations to read and write blocks of data within files
- Creating/Destroying resources: file identifiers (FIDs) and storage blocks (pages)
- Managing file properties, including page allocation data
- Providing maintenance functions
The actual implementation is surprisingly simple. Indeed, the authors state that it took two months to build it.
Figure 1 (from the original paper) describes the format of a request/response packet, showing the basic information exchange model. It is interesting to note that the entire message fits within a small amount of memory and includes an end-to-end checksum.
There are a number of simplifying design choices in WFS:
- The namespace for files is flat; there is no hierarchical structure.
- The file structure is simple (Figure 2).
- The protocol is stateless and each operation is idempotent. This simplifies error handling since a lost message can be re-transmitted safely, with no fear that repeating it will cause problems.
- Operations are client initiated. The server never initiates an operation.
- Clients have limited mutable state. The server does not permit changing its own state directly from the client.
This simplicity does limit the generality of WFS, but it also demonstrates an important abstraction that we will see used (and re-used) in subsequent systems: a file can be treated as a block structured device (a “disk”) in an interesting and transparent fashion.
Figure 3 describes the layout of the (stateless) data exchange format used by WFS.
Figure 4 shows the layout of the file directory table which is a contiguous and fixed-size region on disk at a known disk location. This is a fairly common characteristic of on-disk file system formats, having a known location where meta-data is to be found.
Note that Figure 4 also shows how the file’s allocated storage is described via direct and indirect block references organized into a tree structure. Again, this will be a recurring model that occurs in file systems; it combines the flexibility of supporting efficient space utilization, ability to describe variable sized files, and efficient utilization of block-addressable storage.
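The direct/indirect mapping idea can be sketched generically (the slot counts here are made up for illustration; WFS's actual layout in Figure 4 differs):

```python
# Generic sketch of direct/indirect block mapping: small files resolve
# through direct slots, larger ones through an indirect block.
NUM_DIRECT = 4            # illustrative sizes, not WFS's real ones
BLOCKS_PER_INDIRECT = 8

def resolve_block(file_entry, logical_block):
    """Map a logical block number within a file to a disk block address."""
    if logical_block < NUM_DIRECT:
        return file_entry["direct"][logical_block]
    index = logical_block - NUM_DIRECT
    if index < BLOCKS_PER_INDIRECT:
        # In a real file system this is an extra disk read to fetch the
        # indirect block before the data block can be located.
        return file_entry["indirect"][index]
    raise ValueError("block beyond maximum file size")
```

This is why the scheme is space-efficient: a small file pays only for its direct slots, while the tree grows extra levels only as the file itself grows.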
This simple mechanism permits their clients to utilize a flexible storage mechanism without forcing the file server to support any of the mechanisms the client already provides, such as the hierarchical document name space, management of documents and their structure, etc. This separation of concerns yields an elegant and simple implementation model for their file server.
There are some interesting implementation details described in the paper:
- Write operations are validated by reading the data page. Thus, writes become compare-and-swap operations that prevent concurrent clients from inadvertently overwriting one another's changes. It would be rather inefficient to rely upon this mechanism alone, but it helps prevent out-of-order packet processing in an unreliable network. The downside is that the server must read the data before it can write it.
- They use a write-through cache. Thus, the cache is really for read efficiency, not write efficiency. This should help mitigate the write inefficiency.
- Most of their caching is done against meta-data pages (“auxiliary disk pages”) because they are more frequently accessed than client data pages.
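The validated-write idea above can be sketched as a compare-and-swap at page granularity. The `Disk` class and method names below are my own illustration of the technique, not the WFS implementation:

```python
# Sketch of a validated write: the server applies the write only if the page
# still matches the copy the client originally read, turning the write into
# a compare-and-swap. A False return means another writer (or a reordered
# packet) got there first.

class Disk:
    def __init__(self, npages, page_size=512):
        self.page_size = page_size
        self.pages = [bytes(page_size)] * npages

    def read(self, page_no):
        return self.pages[page_no]

    def write_if_unchanged(self, page_no, expected, new_data):
        if self.pages[page_no] != expected:
            return False                  # stale copy: reject the write
        self.pages[page_no] = new_data
        return True

disk = Disk(npages=8)
old = disk.read(3)
assert disk.write_if_unchanged(3, old, b"x" * 512)       # first write wins
assert not disk.write_if_unchanged(3, old, b"y" * 512)   # stale copy rejected
```

Note how the cost the post describes shows up directly: every validated write implies a prior read of the page, which is exactly what the write-through cache helps offset.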
Here’s one of the interesting performance results: “In the single-user (lightly loaded) case, WFS improved Woodstock’s average input response time over the local disk’s time for several reasons: WFS’s disks were faster than Woodstock’s local disks, requested pages were sometimes still in the WFS main memory cache, and the amount of arm motion on the local disk was reduced because it no longer had to seek between a code swap-area and the user data area.”
Accessing data over the network was faster than the local disk drive! Whether this is a statement of how slow disks were versus networks I leave as an exercise to the reader. One thing we can take away from this: the network often does not impose a significant bottleneck to utilizing remote storage (except, of course, when it does.)
The authors follow up their implementation description with an explanation of their design philosophy. They emphasize the atomic nature of the operations they support, as well as the following properties:
- Client-initiated operations can only access one data page and “a few” auxiliary disk pages.
- Operations are persistent before WFS returns status to the client.
- WFS commands are a single internet packet.
- The WFS protocol is stateless.
They then explain the rationale for these decisions, which relate to simplifying the protocol and server side implementation.
They delve into how clients might use WFS in Section 4. One explicit take-away here is that they view these “files” as acting like “virtual disks” and this permits the WFS clients to implement their own abstraction on top of the WFS-provided services. Because WFS doesn’t assume any specific structure for the client data, there is no burden placed upon those client implementations – though they admit at one point that this complicates the client.
The authors are able to point to other systems that utilize WFS besides Woodstock. They cite Paxton’s system (A Client-Based Transaction System to Maintain Data Integrity) as being based upon WFS.
The paper discusses security and privacy considerations, admitting that their system does not address these issues, and suggests various techniques for addressing security using encryption and capabilities. They round out this section of the paper by discussing other possible enhancements to WFS.
In the end, they provided a simple model for a network file server that permitted a client to implement a range of solutions. As we begin looking at more network file systems, we will see this model extended in various ways.
A Universal File Server
A. D. Birrell and R. M. Needham, in IEEE Transactions on Software Engineering, Vol SE-6, No. 5, September 1980, pp. 450-453.
One of the challenges in this next group of papers is picking which ones to discuss. The advent of networks saw the blossoming of the idea of centralizing storage and having different computer systems access it via those networks. By the time this paper was published, quite a few network-based file server solutions had been constructed and described within the literature – and we will get to them.
The authors here decided to try to extract generality from these works, so in this paper we step back and look for the common abstractions.
This is a rather short paper – four pages.
The authors describe the division of responsibilities in a file server: the “high-level functions more properly associated with a filing system” and “functions belonging to a backing store server” [emphasis in the original]. When I read this I thought that this made sense: we have a functional layer that creates a name space, attributes, etc. and a storage layer that keeps track of storage blocks.
By splitting out this functionality, the authors then suggest that the backing store server is a point of commonality that can be used to support a range of higher level services. Thus, “[t]he backing store server is the sole agency concerned with allocating and relinquishing space on the storage medium.” To achieve this goal the authors propose a universal system of indexes as shown in Figure 1 (from the original paper).
The authors argue for a master table that presents the per-system name space. For each of these namespaces, there is a corresponding master file directory (MFD) and a collection of user file directories (UFDs) that are used to organize the user’s information into a hierarchy.
We note that the files, UFDs and MFD are all stored in storage elements – segments – that are managed by the file server. Thus the file server is responsible for:
- Keeping track of its initial index
- Preserving the names stored in the MFD and UFDs
- Reclaiming (deleting) entries in the MFD and UFDs when they are removed
- Managing storage space
From this simple model, they note that a broad range of systems can be constructed.
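The index structure of Figure 1 can be sketched very compactly. All identifiers below are illustrative; the point is the shape: a master table of per-system name spaces, each with a master file directory (MFD) mapping users to user file directories (UFDs), which in turn name the stored segments.

```python
# Hypothetical sketch of the universal file server's index structure:
# master table -> MFD -> UFD -> segment. Names are mine, not the paper's.

class Segment:
    """A storage element allocated by the backing store server."""
    def __init__(self, data=b""):
        self.data = data

master_table = {}                                # system name -> MFD

def create_file(system, user, name, data):
    mfd = master_table.setdefault(system, {})    # MFD: user -> UFD
    ufd = mfd.setdefault(user, {})               # UFD: file name -> segment
    ufd[name] = Segment(data)

def delete_file(system, user, name):
    # Reclaim the directory entry; a real server would also return the
    # segment's space to its storage allocator.
    del master_table[system][user][name]
```

Each level of the hierarchy is itself stored in segments managed by the same server, which is what makes the base "universal": higher-level file systems differ only in what they put in these directories.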
The paper spends considerable time (roughly 25% of the paper) discussing “protection”. By this the authors refer to the issues inherent in having shared usage of a common resource, such as the files on the file server. They describe using ACLs on the file server as one means of providing security. They do not touch upon precisely how the file system will authenticate users, though at one point they refer to using encryption for access bits in some circumstances.
Their preferred mechanism for access is the capability. This should not come as a surprise, given that they worked on the CAP file system, which provided capabilities. Their observation is that with a sufficiently sparse handle space, it is impractical for an unauthorized party to find the resource. It probably doesn’t require much to point out that this presumes the inherent integrity of the network itself.
The authors complete their universal file server with an observation that this provides a general base upon which individual file systems can implement their own enhanced functionality. Indeed, this was one of their primary objectives in doing this work. They do point out a number of potential issues in their system, but assert that they will not be problematic.
The authors do a good job of describing a basic, abstract file server. The system they describe may not have achieved broad use but this paper does provide a simple, high level view of how a file server might operate. We’ll turn our attention to actual implementations – and there are many such implementations to discuss in the coming posts.
Weighted Voting for Replicated Data
David K. Gifford, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 150-162, 1979.
I return to distributed systems. Previously I discussed a companion paper at the same conference (Polyvalues) that was essentially ignored in the subsequent literature. This paper, on the other hand, is well-cited and lays the groundwork for a quorum-based replicated data distribution scheme. I can see echoes of a more theoretical paper (“Crumbling Walls“) that I will review at some point in the future.
This work was done while Dave Gifford was at Xerox Palo Alto Research Center (Xerox PARC). At this point, the Xerox PARC team had been working to develop the personal computer. I’m not reviewing it, but another interesting paper from this time period is the Pilot paper (perhaps I should; I see it describes the file systems as large and flat). Thus, the author of this paper is describing an actual working system, not a theoretical model for how one might implement such a system.
The key to this algorithm is the concept of a quorum for replicated data:
In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r + w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file’s voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.
The “votes” assigned to each copy of the file are its weight. This model provides a good way of generalizing replicated data. It could describe a primary/secondary model, or shift the emphasis to ensuring that critical systems have copies of the data. The model even permits caching nodes that have no weight.
The key to this approach is that the read quorum is set up so that it is large enough that at least one copy within the read set will have the current data. This is accomplished by ensuring that the combination of read quorum and write quorum represents a number (weight) that is larger than the total sum of all weights within the system. The challenge in a system like this is that choosing these values will determine the reliability of the system in the face of failure. The author doesn’t go into great detail about the types of failures that can occur, but an obvious one is that one of the replicated copies becomes unavailable: a machine crashes. A more interesting one is where the network partitions so that one group of replicas exist on one side of the partition and a different group exists in a separate partition.
The strategy outlined in this paper would permit at most one partition to proceed. The other partition (or partitions) could continue to make some level of progress if the read quorum limit is low enough, where “low enough” means there are at least that many readable copies available within the given partition.
For example, it may be sufficient for only a single replica to be available in order for the read quorum to be satisfied. In that case, it is consistent because the sum of the read quorum plus write quorum is greater than the number of votes in the system. In other words, it works because with the lowest possible read quorum a write requires recording the changes reliably on every (voting) replicated copy. Such a system provides strong guarantees, but won’t allow any progress when any of the nodes are down, since the write quorum requirement is so high.
Similarly, the other extreme is one in which the read quorum is equal to the number of votes in the system, so that the write quorum is just a single node. This does not seem like a very good option, given that it would cause all the data to become unavailable when any of the replicas became unavailable.
Thus, the pragmatic option here would be a distribution of weights and quorums. For example, if you have three replicas, each with the same weight (say 1 for this discussion), then a workable model is to insist on a read quorum of 2 and a write quorum of 2. In that way, a single failure will not prevent forward progress, but if two nodes go down then the system can no longer make progress.
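The scheme described above can be sketched in a few dozen lines. The `Replica` class and helper names below are illustrative, not Violet's implementation: reads gather r votes and return the highest-version copy, while writes first learn the current version from a read quorum and then install the update at a write quorum, with r + w required to exceed the total votes so the quorums intersect.

```python
# Hypothetical sketch of Gifford-style weighted voting. Each replica carries
# some number of votes and a version number; version numbers identify which
# copies are current.

class Replica:
    def __init__(self, votes, version=0, data=None):
        self.votes, self.version, self.data = votes, version, data

def gather(replicas, needed):
    """Collect replicas (in practice, whichever respond first) until their
    combined votes reach the quorum size."""
    quorum, votes = [], 0
    for rep in replicas:
        quorum.append(rep)
        votes += rep.votes
        if votes >= needed:
            return quorum
    raise RuntimeError("quorum unavailable")

def read(replicas, r):
    """A read quorum always overlaps the last write quorum, so the
    highest-version member holds the current data."""
    return max(gather(replicas, r), key=lambda x: x.version).data

def write(replicas, r, w, total_votes, data):
    assert r + w > total_votes, "read and write quorums must intersect"
    version = max(x.version for x in gather(replicas, r)) + 1
    for rep in gather(replicas, w):
        rep.version, rep.data = version, data

# Three equal replicas with r = w = 2: any single failure is tolerated.
reps = [Replica(1), Replica(1), Replica(1)]
write(reps, r=2, w=2, total_votes=3, data="v1")
assert read(reps, r=2) == "v1"
```

A weak representative falls out naturally: a cached copy is just `Replica(0)`, which can serve data but contributes nothing toward either quorum.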
The author describes the typical environment he envisions for this system: a network of personal computers, connected via a network, file servers, and even wide area networking. Xerox had the personal computers at that point, had defined networking protocols (XNS), and would, in cooperation with Digital and Intel, issue Version 1.0 of the Ethernet specification the following year (1980).
Much of the paper is in fact a fairly detailed description of the system that they had implemented (in Violet). Section 4 does provide insight into a variety of interesting and useful features of the system:
- “Weak representatives” – these are basically cached copies of the data; they do not have any voting rights. The author describes them as a performance optimization, along with a way of marking a copy as invalid so that it must be re-fetched before it can be used.
- Lock optimization – the author points out that they have an optimized lock scheme that permits updates which are compatible with read operations. This is consistent with the observation that as long as ordering of write operations is preserved on persistent storage write back operations are permissible.
- Weak consistency – the original model was serial consistency but the author points out that some problems can be addressed with weaker consistency models. The author does not explore these weak models substantially, but merely mentioning them is indeed a useful insight.
- Object size – the model permits locking on the file level, so the object stored within the file should be “of suitable size”.
- Read lock breaking – if the file system permits breaking read locks as part of conflict resolution (rather than transaction abort) then object version numbers can change during the transaction; the change is detectable since the version number shifts.
- Dynamic reconfiguration – the author describes how additional replicas can be added (and presumably removed) or weights changed. In essence, he uses the same rules for updating the voting configuration data as for the underlying data itself. Thus, changes will be discovered by the time the read quorum has been satisfied.
- Replicated containers – the author explains how replication can be used with (mostly) the same interface as non-replicated storage (just with the benefits of being replicated!)
- Minimizing communications overhead – the author points out that releasing unneeded read locks prior to commit eliminates the need to communicate during commit processing.
- Background update – postponing replication can allow smoothing network utilization over time.
The replication policy is, at its heart, an early consensus protocol. While the author does not discuss this, the approach described does have some scalability challenges that will become apparent (and be addressed) in subsequent work. Overall, this work really does an amazing job of describing so many aspects of modern computer systems: networks, file servers, personal computers, wide area networks, redundancy, consistency, etc.
As We May Think
Vannevar Bush, The Atlantic, July 1945.
I saw this covered by Adrian Colyer recently and unabashedly decided I needed to cover it as well, not because I thought there was anything wrong with his coverage, but rather because this speaks well to my own research interests. As Colyer points out, the concept of trails is something that seems to have been lost.
Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose. If the aggregate time spent in writing scholarly works and in reading them could be evaluated, the ratio between these amounts of time might well be startling. Those who conscientiously attempt to keep abreast of current thought, even in restricted fields, by close and continuous reading might well shy away from an examination calculated to show how much of the previous month’s efforts could be produced on call. Mendel’s concept of the laws of genetics was lost to the world for a generation because his publication did not reach the few who were capable of grasping and extending it; and this sort of catastrophe is undoubtedly being repeated all about us, as truly significant attainments become lost in the mass of the inconsequential.
Bottom line? We can’t find things. I recently described the Polyvalues paper from SOSP 1979. I commented on the fact that there seemed to be useful insights here that just disappeared.
The real heart of the matter of selection, however, goes deeper than a lag in the adoption of mechanisms by libraries, or a lack of development of devices for their use. Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.
This was one of those moments when I realized how deeply embedded our concept of hierarchical organization really is. It wasn’t invented by the operating systems of the 1960s; it was inherited from the more fundamental constraints of paper indexing. Indeed, since reading this article I have gained further insight into how deeply entrenched we are in hierarchical organization.
The first idea, however, to be drawn from the analogy concerns selection. Selection by association, rather than indexing, may yet be mechanized.
“The analogy” here relates to the associative mechanisms inherent in how humans recall information, described in the prior paragraph. The author then describes “trails”. This evocative idea is similar to the modern field of data provenance, although in data provenance the focus is often on reproducibility rather than on finding things. I won’t explore this area further yet, but perhaps it will open up some new perspectives.
All this is conventional, except for the projection forward of present-day mechanisms and gadgetry. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.
The memex is his hypothetical device for capturing and finding this information. At this point the author describes how to build such a system (more or less) using the technology of the day. The terminology is interesting; it is also quite telling that a 73-year-old paper could describe modern systems so well.
At some point reading through this it occurred to me that in some ways we have built a system similar to what he describes: the internet. What it doesn’t do is construct the personalized model of inter-relationships between the various components of the system.
Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.
Doesn’t this sound like a web browser?
For me, the key take-away here is encouraging: my own goal of looking for alternatives to hierarchical file systems is not as crazy as it might first seem. It certainly does not mimic the way in which we tend to organize data, though I have also had people point out to me that nothing prevents us from constructing a higher level layer that can be used to achieve the same goal. Indeed, that has been done before and I will get to that at a later point.