Implementing Atomic Actions on Decentralized Data
David P. Reed, ACM Transactions on Computer Systems, Vol. 1, No. 1, February 1983, pp. 3-23.
This must have been an interesting choice for the first paper of the first issue of ACM Transactions on Computer Systems. It is certainly an interesting work on concurrent activity within a distributed system. It relies upon a basic concept of ordering within a distributed system (“decentralized data”). Reed builds upon the foundations laid down by Leslie Lamport in Time, Clocks, and the Ordering of Events in a Distributed System. While Lamport was concerned with defining the (partial) ordering of events in a distributed system, Reed is concerned with using that ordering to construct useful distributed data updates.
Given that file systems, especially distributed file systems, are concerned with managing data across nodes in a consistent fashion, this work is of particular importance. By 1983 we have seen the emergence of network file systems, which I plan on describing further in coming posts, but they are still fairly primitive. Database systems are further along in allowing distributed data and coordination through things like two-phase commit.
He starts by describing the goals of this work:
The research reported here was begun with the intention of discovering methods for combining programmed actions on data at multiple decentralized computers into coherent actions forming a part of a distributed application program.
His focus is on coordinating concurrent actions across distributed data and ensuring that failures are properly handled. What does it mean to properly handle failures? Essentially, it means that the data is in a consistent state once the system has recovered from the failure. He starts by defining terms that relate to consistency models. For example, he defines an atomic action as a set of operations that execute in different locations and at different times but cannot be further decomposed. A single action starts from a consistent state and ends in a consistent state. Any intermediate state of the system is not visible (what we would call “isolation” now). He formally defines these concepts as well.
He touches on the idea of consistency, in which one starts with a consistent system and then proves each (atomic) operation yields a consistent state. In my experience this aspect of distributed systems is sometimes skipped, often due to the complexity of doing the work required here. In recent years, formal proof methods have been used to automate some aspects of this. I’m sure I will touch upon it in later posts.
One key benefit of this system of atomic actions is that it makes things simpler for application programmers: in general, they need not deal with unplanned concurrency and failure. Indeed, that is one of the key contributions of this work: the process of reasoning about failure and how to handle it. In my experience, handling failure gracefully is one of the substantial challenges inherent in constructing distributed systems. If something can fail, it will.
Achieving atomic actions requires the ability to interlock (“synchronization”) against other actors within the system and the ability to gracefully recover from failure cases. The author goes on to describe what his decentralized system looks like: a message-passing model (via the network, presumably) with nodes containing stable storage and the ability to read and write some finite-sized storage unit atomically (“blocks”).
One class of failure the author explicitly disclaims: a situation in which the system performs an operation but ends up with a different but valid outcome. This makes sense, as it would be difficult to reason in the face of arbitrary changes each time a given operation were requested. He sets forth a series of objectives for his system:
(1) Maximize node autonomy, while allowing multisite atomic actions.
(2) Modular composability of atomic actions.
(3) Support for data-dependent access patterns.
(4) Minimize additional communications.
(5) No critical nodes.
(6) Unilateral aborting of remote requests.
Having laid this groundwork, the author then defines the tools he will use to achieve these objectives. This includes a time-like ordering of events, version information for objects, and the ability to associate intermediate tentative steps together (“possibilities”).
He envisions a versioned object system, where the states of the object correspond to changes made to the object.
At this point I’ll stop and make an observation: one of the challenges for this type of versioning is that the way in which we view objects can greatly complicate things here. For example, if we modify an object in place then this sort of versioning makes sense. However, if we modify an object by creating a new object, writing it, and then replacing the old object with the new object, we have a more complex functional model than might be naively envisioned. This is not an issue clearly addressed by the current paper as it relates mostly to usage. But I wanted to point it out because this sort of behavior will make things more difficult.
One of the important contributions in this work is the discussion about failure recovery. This is, without a doubt, one of the most complex parts of building a distributed system: we must handle partial failures. That is, one node goes offline, one network switch disappears, one data center loses power.
The author thus observes: “If a failure prevents an atomic action from being completed, any WRITE the atomic action had done to shared data should be aborted to satisfy the requirement that no intermediate states of atomic actions are visible outside the atomic action.” Thus, one benefit of the versioned objects is that the pending transaction (“possibility”) can track the updated versions. Abort simply means that the tentative versions of the objects in the transaction are deleted. Committing means that the tentative versions of the objects in the transaction are promoted to being the latest version.
Thus, we see the basic flow of operations: a transaction is started and a possibility is created. Each potential change is described by a token, and tokens are then added to the possibility. While the term is not used here, it appears to be a model for what we refer to as write-ahead logging (sometimes also called intention logging).
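To make the flow concrete, here is a minimal sketch of how possibilities and tokens might be modeled. The class and method names are my own simplification, not Reed's actual data structures (which also carry pseudotime information):

```python
class VersionedObject:
    """An object whose committed history is a list of versions."""
    def __init__(self, initial):
        self.versions = [initial]   # committed versions, oldest first
        self.tentative = None       # at most one pending version in this sketch

    def current(self):
        return self.versions[-1]


class Possibility:
    """Groups tentative writes (tokens) so they commit or abort as a unit."""
    def __init__(self):
        self.tokens = []            # each token is (object, proposed new version)

    def write(self, obj, value):
        obj.tentative = value       # tentative version, not yet visible as current
        self.tokens.append((obj, value))

    def commit(self):
        for obj, value in self.tokens:
            obj.versions.append(value)   # promote tentative version to latest
            obj.tentative = None
        self.tokens.clear()

    def abort(self):
        for obj, _ in self.tokens:
            obj.tentative = None         # tentative versions are simply discarded
        self.tokens.clear()


# Usage: an aborted possibility leaves no trace; a committed one becomes the latest version.
balance = VersionedObject(100)
p = Possibility()
p.write(balance, 90)
p.abort()
assert balance.current() == 100
p.write(balance, 90)
p.commit()
assert balance.current() == 90
```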
Timestamps are introduced in order to provide the partial ordering of events or operations, so that the changes can be reliably reproduced. The author calls this pseudotime and goes into quite an extensive discussion about how the system generates timestamps in a distributed fashion (via a pre-reservation mechanism); this approach ensures that the participants need not communicate in order to properly preserve ordering.
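Reed's pre-reservation scheme is considerably more elaborate than this, but a loose sketch of the flavor follows: each node combines a coarse local clock with its own node id so it can hand out totally ordered pseudotimes without any per-timestamp communication. The encoding here is my illustration, not the paper's.

```python
import time

class PseudotimeSource:
    """Hands out totally ordered timestamps without per-timestamp communication."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.last = 0

    def next(self):
        # Anchor to a coarse local clock, but never move backwards on this node.
        candidate = int(time.time() * 1000)
        self.last = max(self.last, candidate) + 1
        return (self.last, self.node_id)   # tuples compare lexicographically

# Usage: a transaction takes one pseudotime up front and stamps all of its
# reads and writes with it, so ordering is decided without further messages.
ts = PseudotimeSource(node_id=7)
txn_pseudotime = ts.next()
```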
Using these primitives – the pseudotime-ordered operations, read and write operations, possibilities, and tokens – he then constructs his distributed data system. There is detail about how it was implemented, the challenges in doing so, and the benefits of immutable versions.
He admits there are serious issues with the implementation of this system: “For most practical systems, our implementation so far suffers from a serious problem. Since all versions of an object are stored forever, the total storage used by the system will increase at a rate proportional to the update traffic in the system. Consequently, we would like to be able to throw away old versions of the objects in the system. We can do this pruning of versions without much additional mechanism, however.” His discussion of why pruning may not always be desirable is interesting, as it highlights the tension between reclaiming old versions and long-running transactions. He does describe a mechanism for pruning.
After this, the author turns his attention to using atomic transactions to construct new atomic transactions: nested transactions (also “composable transactions” so that we have multiple terms for the same concept!) Again, this is proposed, but not explored fully.
The breadth and scope of this paper is certainly impressive, as the author has described a mechanism by which distributed transactions can be implemented. I would note there is no evaluation of the system he constructed, so we don’t know how efficient it was, but the model is an important contribution to distributed storage.
Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport, Communications of the ACM, July 1978, pp. 558-565.
I had not originally intended to cover this paper, but as I started trying to describe Implementing Atomic Actions on Decentralized Data, I realized that I needed to include it in order to better explain that work. This is one of the papers for which Leslie Lamport was awarded the Turing Award. In it, he is wrestling with the effects of relativity in computer systems.
What does this mean? In essence, as soon as we have two or more distinct computer systems communicating with one another in a network, we must account for the fact that the order of events may now vary for each individual node. Indeed, as it turns out, this effect is not restricted to distributed systems. It can be seen in single computer systems with multiple processors as well, in the study of consistency models (another area where Lamport was significantly involved).
Networks tend to exacerbate these issues because the latencies involved make the ordering discrepancies more apparent. This profoundly impacts distributed systems (including distributed file systems, a notable and classic example of distributed systems), and we continue to wrestle with its manifestations 40 years after this paper.
The paper describes a system (Figure 1) in which three different components, labeled Process P, Process Q, and Process R, are sending messages among themselves. Individual messages are labeled and numbered in sequence, and the time between them is shown. It is worth noting that both the time a message was sent and the time it was received are shown.
This demonstrates that the observed order of events does not necessarily match what an external observer might see as the ordering. Thus, for example, Process R receives messages in a different order than they were sent by Process Q. This behavior might seem surprising initially, but it is in fact a common occurrence: it could be due to queuing, intermediate network components, scheduling delays, etc.
He introduces a nomenclature for describing the “happened before” relationship, establishing an ordering of events within the system. He uses the → character to indicate this relationship: if a comes before b, he writes “a → b“. The relation is also transitive: if a → b and b → c, then a → c. Finally, he defines two events a and b as concurrent if there is no ordering between them. This applies to operations within a single process as well as between processes. He notes in this discussion that sending a message must happen before receiving that message.
While this might sound simple, it helps us begin to reason about these events, since we can pay attention to the events that do have ordering constraints and ignore those that do not. In many cases, we won’t care about the order of two operations because the net effect is the same, regardless of the order. What we do want to do, however, is ensure that we understand ordering for those operations where it does matter.
Figure 2 then takes the same chart, but this time he adds the concept of a clock tick. The dashed lines represent the ticks of a clock; the ticks occur between events. He then defines how the time ordering (“clock function”) relates to the → relationship: if a → b then C(a) < C(b). That is, the clock tick (time) is also well ordered. He observes that we can extend this definition to cover the prior case, where we look at the ordering of operations within a single process, as well as across processes: recall that a → b is required for messages sent between processes, where a is the send event and b is the receive event, and thus C(a) < C(b) in this case as well. He calls this the “Clock Condition”.
He then goes one step further: he flattens out the clock ticks, yielding Figure 3. Now our clock ticks are uniform time and our events are within one of the clock tick intervals. Since we separate dependent operations, we now have a model we can use to order events globally.
This was the point he was trying to make. “Being able to totally order the events can be very useful in implementing a distributed system.” This is a classic understatement, and it remains the focus of much distributed systems research and implementation to this day: how do you ensure that you have a definitive order of events? We will see this when looking at later work.
At this point he focuses on more rigorously defining his system, which allows him to note that it becomes a “distributed algorithm”. He describes the replicated state machine that is being executed by all the nodes within the distributed system. There is no central authority in this model forcing the ordering.
This is an important point not to miss: in Lamport’s distributed system there is no requirement for a centralized service. He does not insist on a total ordering of events in the system and settles for a partial ordering – one in which he doesn’t worry about the events that aren’t sensitive to their ordering. This is a powerful technique: by giving up total determinism, we gain performance through parallelism.
From this article we get the Lamport Clock, which is not so much a clock as a monotonically increasing value that represents relative ordering. Note that a hardware clock will satisfy this invariant requirement as well.
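A Lamport clock is simple enough to sketch in a few lines. This is the standard increment-on-local-event, take-the-maximum-on-receive formulation (the code is mine, but the rules are the ones the paper lays out):

```python
class LamportClock:
    """A counter preserving the Clock Condition: if a -> b then C(a) < C(b)."""
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance for a local event and return its timestamp."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """Merge the sender's timestamp, then advance past it."""
        self.time = max(self.time, msg_time)
        return self.tick()

# Usage: the receive event is always ordered after the send that caused it.
p, q = LamportClock(), LamportClock()
t = p.send()
assert q.receive(t) > t
```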
A Client-Based Transaction System to Maintain Data Integrity
William H. Paxton, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, 1979, pp. 18-23.
Last time I discussed the basis of consistency and locks. I did so because I thought it would help explain this paper more easily. We now move into the nascent world of sharing data over the network (what today we might call distributed systems). One of the challenges of this world is that it involves coordinating changes to information across multiple actors (the client and the server) while maintaining some sort of consistency requirement. The author here uses the term integrity (rather than consistency), but I would suggest the terms are effectively the same for our purposes:
Integrity is a property of a total collection of data. It cannot be maintained simply by using reliable primitives for reading and writing single units — the relations between the units are important also.
This latter point is important. Even given atomic primitives, any time we need to update related state, we need something beyond simple reads and writes of individual units.
The definitions offered by the author are highly consistent with what we saw previously:
- The consistency property: although many clients may be performing transactions simultaneously, each will get a consistent view of the shared data as if the transactions were being executed one at a time.
- The atomic property: for each transaction, either all the writes will be executed or none of them will, independent of crashes in servers or clients.
Prior work had focused on having the server handle these issues. This paper describes how the client can accomplish this task.
There were some surprising gems here. For example:
If a locked file is not accessed for a period of time, the server automatically releases the lock so that a crashed client will not leave files permanently unavailable.
We will see this approach used a number of times in the future; it is a standard technique for handling client-side locks in network file systems. One novelty of the locks presented here, then, is that they have a bounded lifetime, and one of the services required from the server is support for this style of locking.
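As a small illustration of the idea, here is a sketch of a lease-style lock that the server releases automatically if the holder does not renew it in time. The interface and the default lifetime are mine, not the paper's:

```python
import time

class LeasedLock:
    """A server-side lock that expires if the holder stops renewing it."""
    def __init__(self, lifetime=30.0):
        self.lifetime = lifetime
        self.holder = None
        self.expires = 0.0

    def acquire(self, client):
        now = time.time()
        if self.holder is None or now >= self.expires:   # expired leases are reclaimed
            self.holder, self.expires = client, now + self.lifetime
            return True
        return False                                     # someone else holds a live lease

    def renew(self, client):
        if self.holder == client and time.time() < self.expires:
            self.expires = time.time() + self.lifetime
            return True
        return False            # too late: a crashed client simply stops renewing
```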
The author then goes on to propose the use of an intention log (“intention file” in the paper). The idea is that the client computer locks files, computes the set of changes, records them in the intention log, and then commits the changes. Then the actual changes are applied.
To achieve this, the paper sets out six operations (a small sketch of the flow follows the list):
- Begin Transaction – this is what reserves the intention file. Note that the author’s model only permits a single transaction at a time, and the other operations fail if a transaction has not been started.
- Open – this is what opens the actual data file. At this point the file header is read. If that header indicates the file is being used for a transaction, the file is recovered before the open can proceed.
- Read – as it suggests, this reads data from the file.
- Write – this prepares to write data to the file. Since this is a mutation, the new data must be recorded in the intention file at this point; the actual file is not yet modified.
- Abort Transaction – this causes all files to be unlocked and closed. No changes are applied. The transaction becomes “completed”.
- End Transaction – this is how a transaction is recorded and applied. The paper describes how this is handled in some detail. In all successful cases, the changes are applied to the server.
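Pulling those operations together, here is a rough, self-contained sketch of the client-side flow: writes are recorded in an intention log first, and only End Transaction applies them. The in-memory structures stand in for the paper's files and servers, and the names are mine, not Paxton's:

```python
class ClientTransaction:
    def __init__(self, files):
        self.files = files        # stands in for server-held files: {name: bytearray}
        self.intentions = []      # the intention log: (name, offset, data)

    def write(self, name, offset, data):
        # Record the intention; do NOT touch the real file yet.
        self.intentions.append((name, offset, data))

    def end_transaction(self):
        # Once the intention log is complete (and, in the real system, durable),
        # the outcome is decided; now apply the changes to the actual files.
        for name, offset, data in self.intentions:
            f = self.files[name]
            f[offset:offset + len(data)] = data
        self.intentions.clear()

    def abort_transaction(self):
        self.intentions.clear()   # nothing was applied, so discarding the log suffices

# Usage: two related updates either both land or neither does.
files = {"accounts": bytearray(b"A=10 B=00")}
txn = ClientTransaction(files)
txn.write("accounts", 2, b"05")
txn.write("accounts", 7, b"05")
txn.end_transaction()             # abort_transaction() would have left the file untouched
```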
The balance of the paper then describes how crash recovery works. The simplifying assumptions help considerably here – this system only permits a single transaction to proceed at a time. Then the author moves on to a “sketch of correctness proof”. I admit I found that humorous, as an attempt at formalism without actually invoking formalism rarely achieves anything beyond warm, fuzzy feelings.
He wraps up by discussing some complex cases, such as file creation and the challenge of “cleaning up” partially created file state. Having dealt with this issue over the years, I can assure you that it can be complex to get it implemented correctly. He also discusses that some files might have multi-block meta-data headers and explains how those should be handled as well. He concludes with a discussion about handling media failure conditions, which are a class of failures that this system does not claim to protect against.
Granularity of locks in a shared data base
Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, in Proceedings of the 1st International Conference on Very Large Data Bases, pp. 428-451. ACM, 1975.
I was originally going to do a write-up of A Client-Based Transaction System to Maintain Data Integrity but realized that I should help motivate the use of locks and transactions better before diving into this topic area. So to do this, I’ve picked to talk about locks. This is a critical concept that we utilize in file systems development on a routine basis. Note that this work actually started out in the database community, not the file systems community.
I will note that this is a long paper (23 pages) and there is considerable detail that I will omit for the sake of brevity – but I do want to touch on the key points that are raised in this paper.
They start off by defining consistency and transaction. They define consistency first:
We assume the data base consists of a collection of records and constraints defined on these records. There are physical constraints (ex: in a list of records, if a record A points to record B then record B must exist) as well as logical constraints (ex: conservation of money in a bank checking account application). When all such constraints are satisfied the data base is said to be consistent.
Then they move on to transaction:
A transaction is a series of accesses (for read or write operations) to the data base which, applied to a consistent data base, will produce a consistent data base. During the execution of a transaction, the data base may be temporarily inconsistent. The programs used to perform the transactions assume that they “see” a consistent data base. So if several transactions are run concurrently, a locking mechanism must be used to insure that one transaction does not see temporarily inconsistent data caused by another transaction. Also, even if there are no consistency constraints, locks must be used so that the updates of one transaction are not made available to others before the transaction completes. Otherwise, transaction backup might cascade to other transactions which read or updated the “back up” updates.
This text does not lay out the modern description of transactions that we use (atomic, consistent, isolated, and durable, or ACID), but what it describes is a system in which these properties do in fact exist. This concept is not unique to databases – we see it in any multi-actor system sharing resources. The usual analogy I use when describing this is a traffic intersection: the intersection is a shared resource and the individual vehicles are separate actors.
The optimal configuration for resource sharing is one where the need to share is minimized. If there is no shared data, there is no need for locks. Similarly, if the data is immutable (a concept I know we’ll return to over time) we don’t need to protect it. Only mutable state needs locks. As it turns out, mutable state frequently doesn’t actually change, but because it can change, we must protect against change whenever we rely upon it.
This concept of transitioning from one consistent state to another consistent state safely is why we use locks. The primary contribution of the authors in this paper is not the creation of locks and consistency but rather their creation of a clear language that we still use to describe these concepts.
In Section 2 they then describe what a “lockable object” is. There are several important observations here, which I summarize:
- A transaction is sequentially consistent. This is a very strong consistency guarantee (as we will see in later papers, when things get far more general).
- The basic lock type is reader-writer. A reader lock provides shared access to mutable state, along with the guarantee that the mutable state won’t change while the lock is held. A writer lock provides exclusive access to mutable state, along with the guarantee that the only changes to that state will be made by the owner of the writer lock (see the sketch after this list).
- Locks may be used to protect non-existent resources, for example, to block the creation of a new object. Since the object doesn’t exist, you can’t lock it yet (interestingly, I usually tend to think of that as locking the structure in which the new object will exist, but their observation that you can lock non-existent items is certainly valid).
- Locks are requested dynamically.
- There is a dynamic balance between lock granularity and overhead. If we lock many small objects, there is a cost associated with each of those lock operations. In the years since this paper we have made uncontended lock acquisition rather inexpensive, but contended lock acquisition remains an expensive proposition.
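Here is a small sketch of the reader-writer semantics described in the list above. It is mine, not the paper's, and deliberately simple (no fairness, and writers can starve); it just illustrates the two guarantees:

```python
import threading

class ReaderWriterLock:
    def __init__(self):
        self._lock = threading.Lock()
        self._readers_done = threading.Condition(self._lock)
        self._readers = 0

    def acquire_read(self):
        with self._lock:
            self._readers += 1          # many readers may hold the lock at once

    def release_read(self):
        with self._lock:
            self._readers -= 1
            if self._readers == 0:
                self._readers_done.notify_all()

    def acquire_write(self):
        self._lock.acquire()            # blocks other writers; held across the write
        while self._readers > 0:
            self._readers_done.wait()   # wait() drops the lock so readers can drain
                                        # (new readers may sneak in: writer starvation)

    def release_write(self):
        self._lock.release()
```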
The paper then goes on to discuss lock hierarchies. This is because in any dynamic lock system in which you obtain more than a single lock, you need to ensure that there will never be a situation in which an actor blocks waiting for a lock that will never be released. The simplest case of this is when an actor blocks waiting for a lock which it owns. This is clearly a programming error, but it is one that I have seen numerous times over the years. The more complex case of this is when we have a cycle in lock acquisition. The paper points out that to prevent this we need to construct a directed acyclic graph showing the order in which locks are acquired. Usually these are trees, but I have seen environments in which they really are DAGs, due to re-entrant behavior.
The paper describes their model for one type of lock, which is focused on locking within a hierarchical name space. Thus, they have exclusive (X) access, shared (S) access, and intention (I) locks. The intention lock is then used to protect each object along the lock hierarchy, with X or S access to the final node in the hierarchy. The paper goes into greater detail about this; I’ll omit it because it is not really germane to what I found most interesting about this paper.
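To make the hierarchical scheme concrete, here is a small sketch using only the simple X / S / I modes mentioned above (the paper refines these further, e.g., into IS, IX, and SIX modes); the data structures and names are mine:

```python
COMPATIBLE = {("I", "I"), ("S", "S")}       # every other pairing conflicts

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.held = []                      # (owner, mode) locks currently granted

def ancestors(node):
    chain = []
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return list(reversed(chain))            # root first

def try_lock(node, owner, mode):
    """Take I on every ancestor (root down), then `mode` (S or X) on the node itself."""
    plan = [(a, "I") for a in ancestors(node)] + [(node, mode)]
    for step, m in plan:                    # check the whole path before granting anything
        if any(o != owner and (held, m) not in COMPATIBLE for o, held in step.held):
            return False
    for step, m in plan:
        step.held.append((owner, m))
    return True

# Usage: T1 takes X on a file; T2 then cannot take S on the enclosing directory,
# because T1's intention lock on the directory conflicts with S.
root = Node("db"); d = Node("dir", root); f = Node("file", d)
assert try_lock(f, "T1", "X")
assert not try_lock(d, "T2", "S")
```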
The final point that I’ll draw out of this paper is that they discuss the ideas of lock scheduling, deadlock, and lock thrashing. The lock hierarchy is intended to prevent deadlock. Lock scheduling can be used to detect deadlocks. Thrashing relates to memory contention and calls out to other operating system mechanisms for resolution.
With this behind us, we should have a better basis for understanding how we use transactions and locking to ensure consistency in file systems.
The CAP Filing System
R. M. Needham and A.D. Birrell, Symposium on Operating Systems Principles, 1977, Association for Computing Machinery.
I’ve fallen behind these past ten days, working towards a deadline. My own conference paper is now submitted and I’m working on recovering. What that means is this week is going to be a busy one as I work on catching up. Adding to the fun, FAST is going on this week (February 13-16, 2008). I hope to add some bonus discussions on the current work being presented there.
Let’s get to discussing this paper.
CAP is a capability-based system. Capability systems are a parallel idea that has been explored from time to time as an alternative to the access control model of Multics/UNIX-inspired operating systems. The latest contender in this space would be Fuchsia, an experimental operating system that Google is developing (though it is an open source project under a mixture of licenses). Interestingly, there are reports it runs on the Pixelbook in addition to the previous hardware it had supported.
At any rate, the idea behind a capability is to use identifiers that encapsulate the access rights within the capability itself. Thus, a capability is a name, and the virtue of having the name is that you have the access inherent in it.
These capabilities all represent to a program the right to have or do something: the preservation of information from one run of a program to another (a universal operating system requirement) is thus seen by the CAP programmer as the preservation of a capability rather than of an object itself.
The paper describes how capabilities are managed, which interacts with the “filing system” of course, since it has to store persistent information. They describe the “system internal name” (SIN), which is combined with a disk address to map memory segments to actual storage. The “directory” then becomes a repository of capabilities to these persistent objects. This also creates a reference to the disk block that is in use, ensuring those storage regions are not reused.
One interesting characteristic of their system is that they do not guarantee free disk space will be recycled quickly (“[T]here is no guarantee that inaccessible disk space will be relinquished at the earliest possible moment.”) Indeed they note that space reclamation is only done when the system reboots and “[T]he filing system is not designed to run forever.”
They discuss how CAP differs from prior work (e.g., OS/360) where only the name matters; there is no concept of a directory for use as part of the capability system. The directory actually provides them an additional level of control as well, and they use directory capabilities as well as segment capabilities. Directories may be persistent (so they have a SIN and disk block location) or ephemeral (so they disappear when the program exits) – a sort of “built in” temporary storage concept for ephemeral memory management.
Sharing objects is now trivial – you simply tell someone the correct name; that embeds the capability within it. They do not describe the mechanisms used for transfer (“[B]y mechanisms which do not concern us in detail here.”)
They also describe how this name space is similar to other systems (UNIX and CAL-TSS) but different:
- Access is associated with the name which in turn references information in the directory. It is not an attribute of the file itself.
- The name space need not be a “strict hierarchy”. This means that portions could become disconnected, or even be private to a single application.
- Their use of directories behaves similarly to the model of a “current directory” (presumably in UNIX), even though CAP expressly does not have a concept of a current directory.
- Directories are not even required to form a directed acyclic graph!
The paper describes how capabilities work as a fine-grained control mechanism. They explain that the holder of an existing capability (a program) may generate a more restrictive capability to provide to another program. Since capabilities apply to both individual files and directories, it is possible for a program to transfer a set of capabilities by creating a new directory and storing the relevant capabilities in it for the target program.
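A tiny sketch of that restriction operation follows. The representation (an object name plus a set of rights) is my own illustration; CAP itself packs this information into the name:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    obj: str
    rights: frozenset

    def restrict(self, *keep):
        """Derive a capability with a subset of this one's rights, never more."""
        requested = frozenset(keep)
        if not requested <= self.rights:
            raise PermissionError("a derived capability cannot add rights")
        return Capability(self.obj, requested)

# Usage: hand another program read-only access to something we can read and write.
full = Capability("passwords", frozenset({"read", "write"}))
read_only = full.restrict("read")
```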
The names themselves can be quite ugly, as they incorporate capabilities within them. They describe some conventions they employ in terms of name management and directory placement, but point out these are conventions and not a hard requirement of the system.
CAP certainly presents a rather different model of a file system than we see in other systems. The idea of disconnected name spaces that are only visible to particular programs is an intriguing one. They use the example of the password database, which requires the program have the password file capability.
They discuss security. The directory manager is a separate module that programs use to interact with the directories to which they have access. To invoke the directory manager, the caller must have the ENTER capability.
I find this to be exactly the type of thought provoking paper I had hoped to find as I comb through these old papers. The idea that a file system name space need not be connected, that it could be private to a particular program or set of programs, and embedding access rights (“capabilities”) into the name will give me plenty to think about.
If you would like to know more about CAP there is a paper about it in the prior Symposium on Operating Systems Principles: The Cambridge CAP Computer and its Operating System. It is not too surprising that this is available from Microsoft Research, as they also built a capability based operating system (or two): Singularity and Midori.