A Comparison of Two Network-Based File Servers
James G. Mitchell and Jeremy Dion, in Communications of the ACM, April 1982, Volume 25, Number 4.
I previously described the Cambridge File Server (CFS). In this paper (presented at SOSP in 1981 and published in CACM in 1982) the inner details of it and the Xerox Distributed File System (XDFS) are compared. This paper provides an interesting insight into the inner workings of these file servers.
Of course, the scale and scope of a file server in 1982 was vastly smaller than the scale and scope of file servers today. In 1982 the disk drives used for their file servers were as large as 300MB.
This stands in stark contrast to the sheer size of modern SD cards; I think of them as slow, but compared to the disk drives of that era they are quite a bit faster, not to mention smaller. I suspect the authors of this paper might be rather surprised at how the scale has changed, yet many of the basic considerations they were making back in the early 1980s are still important today.
- Access Control (Security) – CFS was, of course, a capability based system. XDFS was an identity based system; most systems today are identity based systems, though we find aspects of both in use.
- Storage Management – the interesting challenge here is how to ensure that storage is not wasted. The naive model is to shift responsibility for proper cleanup to the clients. Of course, the reality is that this is not a good model; even in the simple case of a client that crashes, it is unlikely the client will robustly ensure that space is reclaimed in such circumstances. CFS handles this using a graph file system and performing garbage collection in which an unreachable node is deemed subject to reclamation. XDFS uses the more naive model, but mitigates this by providing a directory service that can handle proper cleanup for clients – thus clients can “do it right” with minimal fuss, but are not constrained to do so.
- Data Consistency – the authors point to the need to have some form of transactional update model. They observe that both CFS and XDFS offer atomic transactions; this represents the strong semantic end of the design spectrum for network file servers and we will observe that one of the most successful designs (Sun’s NFS) went to a much weaker end of the design spectrum. Some of this likely reflects the database background of the authors.
- Network Protocols – I enjoyed this section, since this is very early networking, with CFS using the predecessor of token ring and XDFS using the 3Mb/s version of Ethernet. They discuss the issues inherent in the network communications: flow and error control (so message exchange and exception/error handling) and how the two respective systems handle them.
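The CFS garbage-collection approach mentioned in the storage management bullet can be sketched as a simple reachability computation; this is an illustrative model (names are my own), not the actual CFS implementation:

```python
# Sketch of CFS-style storage reclamation: the file store is a graph of
# nodes, and anything unreachable from the root index is deemed garbage.

def reachable(root, edges):
    """Return the set of nodes reachable from root via the edge map."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, []))
    return seen

def collect_garbage(root, edges, all_nodes):
    """Nodes not reachable from the root are subject to reclamation."""
    return set(all_nodes) - reachable(root, edges)
```

The appeal of this model is that a crashed client cannot leak storage permanently: whatever it left unreferenced eventually becomes unreachable and is reclaimed.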
The authors also compare details of the implementation:
- They describe a scheme in CFS in which small files use direct blocks, and larger files use indirect blocks (blocks of pointers to direct blocks). This means that small files are faster. It is similar to the model that we see in other (later) file systems. XDFS, in contrast, uses a B-tree to track the allocation of blocks to files, plus a bitmap to indicate free/used space.
- They discuss redundancy, with an eye towards handling (partial) disk failures. Like any physical device, the disk drives of that era did wear out and fail.
- They discuss their transaction log and how each system guaranteed consistency: they both use shadow pages, but their implementation of them is different. Ultimately, they both have similar issues, and similar impact. Shadow pages are a technique that we still use.
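The direct/indirect block scheme described in the first bullet can be sketched as follows; the layout, field names, and constants here are illustrative, not the actual CFS on-disk format:

```python
# Direct vs. indirect block addressing: small files resolve a block address
# in one step; larger files pay an extra level of indirection.

BLOCK_POINTERS = 4  # pointers per indirect block (illustrative only)

def lookup_block(inode, logical_block):
    """Map a logical block number to a disk address via the inode."""
    direct = inode["direct"]
    if logical_block < len(direct):
        return direct[logical_block]          # one lookup for small files
    # Larger files: indirect blocks hold pointers to data blocks.
    idx = logical_block - len(direct)
    indirect = inode["indirect"][idx // BLOCK_POINTERS]
    return indirect[idx % BLOCK_POINTERS]     # extra level of indirection
```

This is why small files are faster in such a scheme: they never touch an indirect block at all.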
The evaluation is interesting: it is not so much a measure of performance but rather insights into the strengths and weaknesses of each approach. For XDFS they note that its transaction support has been successful and permits database transactions (in essence, XDFS becomes a form of simple database service). They point to the lack of support for both normal and special files; from their description a special file is one with guaranteed write semantics. They also observe that ownership of files is easily lost, which in turn leads to inefficient storage utilization. Finally, they observe that it is not clear whether the B-tree is a win or a loss for XDFS.
For CFS they point to the performance requirements as being a strength, though it sounds more like a design constraint that forced the CFS developers to make “hard choices” to optimize for performance. Similarly, they observe that the directed graph model of CFS is successful and capabilities are simple to implement. Interestingly, they also point to the index, as well as the string of names and access rights, as being a success point. They also point to the fact that CFS generalizes well (“[t]wo quite different filing systems built in this way coexist on the CFS storage.”) They also point to automatic garbage collection as being a net win for CFS, though they also point out that CFS uses a reference count in addition to the garbage collection model. They list the CFS limitation of transactions to a single file or index as being one of its shortcomings and point to real-world experience porting other operating systems to use CFS as an indicator of the cost of this limitation. They spell out that limitation directly: “… since file directories are implemented as an index with an associated file, it is currently impossible to update both structures in a single transaction.” They conclude by arguing that XDFS has a better data layout, arguing that XDFS’s strategy of page allocation and intention logging is ultimately better than CFS’s cylinder maps: “… the redundancy function of cylinder maps does not seem to be as successful as those of page allocation and intentions logging; the program to reconstruct a corrupted block is not trivial.”
Ensuring correct recovery in a transactional system is certainly challenging in my experience, so I can understand the authors’ concerns about simplicity and scalability.
Overall, it is an interesting read, as I can see many of the issues described here as file systems issues; many of the techniques they describe in this paper show up in subsequent file systems. The distinction between file system and file server also becomes more clearly separated in future work.
Logic and Lattices for Distributed Programming
Neil Conway, William R. Marczak, Peter Alvaro, Joseph M. Hellerstein, and David Maier, in Symposium on Cloud Computing 2012 (SOCC ’12), October 14-17, 2012, San Jose, CA.
This is definitely a different direction than we’ve had in prior papers, though I do have an ulterior motive in presenting this particular paper – we will see it used later.
In recent years there has been interest in achieving application-level consistency criteria without the latency and availability costs of strongly consistent storage infrastructure. A standard technique is to adopt a vocabulary of commutative operations; this avoids the risk of inconsistency due to message reordering. Another approach was recently captured by the CALM theorem, which proves that logically monotonic programs are guaranteed to be eventually consistent. In logic languages such as Bloom, CALM analysis can automatically verify that programs achieve consistency without coordination.

In this paper we present BloomL, an extension to Bloom that takes inspiration from both of these traditions. BloomL generalizes Bloom to support lattices and extends the power of CALM analysis to whole programs containing arbitrary lattices. We show how the Bloom interpreter can be generalized to support efficient evaluation of lattice-based code using well-known strategies from logic programming. Finally, we use BloomL to develop several practical distributed programs, including a key-value store similar to Amazon Dynamo, and show how BloomL encourages the safe composition of small, easy-to-analyze lattices into larger programs.
Notice they do mention key-value stores, so you have another hint on how I’ll be referring back to this work in a future post.
This paper tends more to the theoretical side of systems research. It is not a theory paper (there just isn’t enough formalism, let alone proofs!), but it does have performance graphs, which you certainly expect from a systems paper and not from a theory paper.
The driving factor behind this work is the issue of distributed consistency. At a high level, “distributed consistency” is concerned with ensuring that a group of communicating computers, with some temporal separation, agree on the outcome of operations even when things go wrong. Perhaps the most famous example of distributed consistency is Paxos. These days we refer to these as consensus protocols, and there are several of them: two-phase commit is certainly one of the older ones. Quorum protocols are another (e.g., weighted voting, which I described previously). Viewstamped Replication is another. These days, the popular consensus protocols are Raft and Blockchain.
The paper starts by pointing out that logical monotonicity provides a valuable mechanism for reasoning about distributed consistency. Prior work by the authors establishes that all monotonic programs are “invariant to message reordering and retry”, a property they call confluent. This matters for distributed systems because such a system only moves forward (the operations are durable).
They point out some weaknesses in the prior definition and motivate improving it by explaining one such obvious case that does not fit within the model (a voting quorum in a distributed protocol.)
Hence, they introduce the lattice. They do this within the context of their language (BloomL), which works on top of Ruby. I will not dwell on the details.
The authors define a bounded join semilattice. My reading of what they are saying is that in such a set, there is a unique element that happened first. They define this formally as a set S with a join operator ⊔ (the least upper bound, which induces a partial ordering) and a unique element ⊥ that represents the least element.
From this definition, they construct their model; the paper drops the “bounded join semilattice” qualifier and simply discusses lattices from that point forward, but it is this partial ordering property that imparts the key characteristics to their subsequent operations.
Why is this important? Because it demonstrates that these lattices – which are going to turn out to be equivalent to key operations in distributed systems – have consistency guarantees that are desirable.
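As a concrete illustration (my own sketch in Python, not BloomL code, which is a Ruby-embedded DSL): a bounded join semilattice just needs a least element and a join that is associative, commutative, and idempotent, so merging replica states is insensitive to ordering, duplication, and retry:

```python
# Two simple bounded join semilattices. In each, repeated or reordered
# joins of the same updates always converge to the same state.

class SetLattice:
    """Grow-only set lattice: bottom is the empty set, join is union."""
    def __init__(self, items=()):
        self.items = frozenset(items)

    def join(self, other):
        return SetLattice(self.items | other.items)

class MaxLattice:
    """Max lattice over non-negative ints: bottom is 0, join is max."""
    def __init__(self, value=0):
        self.value = value

    def join(self, other):
        return MaxLattice(max(self.value, other.value))
```

Because the join is associative, commutative, and idempotent, every replica that eventually sees the same set of updates ends in the same state, no matter the delivery order.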
The authors then turn their attention to utilizing lattices for key-value stores. They describe data structure versioning and vector clocks. Vector clocks have a property they desire for lattices: they are partially ordered. They combine this with a quorum voting protocol, to provide the distributed consensus for their system.
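The lattice-friendly property of vector clocks can be sketched as follows; this is an illustrative model (names and structure are my own), where the join is an elementwise max and the induced "happened before" order is partial, so two clocks can be concurrent:

```python
# Vector clocks as a lattice: clocks map node -> counter, the join is an
# elementwise max, and domination gives the happened-before partial order.

def vc_join(a, b):
    """Elementwise max: the least upper bound of two vector clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_leq(a, b):
    """True if a happened-before-or-equals b (b dominates a everywhere)."""
    return all(a[n] <= b.get(n, 0) for n in a)

def vc_concurrent(a, b):
    """Neither clock dominates the other: the events are concurrent."""
    return not vc_leq(a, b) and not vc_leq(b, a)
```

The quorum voting protocol then resolves what the partial order alone cannot: concurrent versions detected this way must be reconciled rather than silently overwritten.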
Figure 8 (from the paper) shows the general structure of their key-value store implementation, which is implemented in BloomL and Ruby. Their sample usage for this is a shopping cart, which they graphically describe in Figure 9 (from the paper).
As one would expect in a distributed system, the key benefit here is that there is no centralized authority deciding on the order of things. They point out that prior work argues shopping carts are non-monotonic and thus cannot be solved in a distributed systems setting. The authors point out that using the lattice structure, they achieve a monotonic ordering, which permits them to implement it without a centralized decision maker; in fact the decision maker in this case is really the client itself, as it has all the information from all the servers sufficient to complete the operation.
While a shopping cart might not be the killer application for a distributed systems technology, this paper does describe a powerful tool for providing distributed consensus in a system that can be implemented in a modest amount of code; compared to Paxos, Raft, or Viewstamped Replication, that is a significant contribution.
It does not appear to have byzantine protection, however, so if you live in a hostile environment it might not be the right protocol. Similarly, if you need stronger consistency guarantees, this might not be the best model either. But for many applications slightly relaxed consistency guarantees are often more than adequate.
We will see how this can be applied in the future.
Weighted Voting for Replicated Data
David K. Gifford, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 150-162, 1979.
I return to distributed systems. Previously I discussed a companion paper at the same conference (Polyvalues) that was essentially ignored in the subsequent literature. This paper, on the other hand, is well-cited and lays the groundwork for a quorum-based replicated data distribution scheme. I can see echoes of a more theoretical paper (“Crumbling Walls“) that I will review at some point in the future.
This work was done while Dave Gifford was at Xerox Palo Alto Research Center (Xerox PARC). At this point, the Xerox PARC team had been working to develop the personal computer. I’m not reviewing it, but another interesting paper from this time period is the Pilot paper (perhaps I should, I see it describes the file systems as large and flat). Thus, the author of this paper is describing an actual working system, not a theoretical model for how one might implement such a system.
The key to this algorithm is the concept of a quorum for replicated data:
In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r + w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file’s voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.
The “votes” assigned to each copy of the file are its weight. This model provides a good way of generalizing replicated data. It could describe a primary/secondary model, or shift the emphasis on ensuring critical systems have copies of the data. The model even permits caching nodes that have no weight.
The key to this approach is that the read quorum is set up so that it is large enough that at least one copy within the read set will have the current data. This is accomplished by ensuring that the combination of read quorum and write quorum represents a number (weight) that is larger than the total sum of all weights within the system. The challenge in a system like this is that choosing these values will determine the reliability of the system in the face of failure. The author doesn’t go into great detail about the types of failures that can occur, but an obvious one is that one of the replicated copies becomes unavailable: a machine crashes. A more interesting one is where the network partitions so that one group of replicas exist on one side of the partition and a different group exists in a separate partition.
The strategy outlined in this paper would permit at most one partition to proceed. The other partition (or partitions) could continue to make some level of progress if the read quorum limit is low enough, where “low enough” means there are at least that many readable copies available within the given partition.
For example, it may be sufficient for only a single replica to be available in order for the read quorum to be satisfied. In that case, it is consistent because the sum of the read quorum plus write quorum is greater than the number of votes in the system. In other words, it works because with the lowest possible read quorum a write requires recording the changes reliably on every (voting) replicated copy. Such a system provides strong guarantees, but won’t allow any progress when any of the nodes are down, since the write quorum requirement is so high.
Similarly, the other extreme is one in which the read quorum is equal to the number of votes in the system, so that the write quorum is just a single node. This does not seem like a very good option, given that it would cause all the data to become unavailable when any of the replicas became unavailable.
Thus, the pragmatic option here would be to have a distribution of weights and quorum. For example, if you have three replicas, each with the same weight (say 1 for this discussion) then a workable model is to insist on a read quorum of 2 and a write quorum of 2. In that way, a single failure will not prevent you from making forward progress, but if two nodes go down then the system can no longer make progress.
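The read side of that three-replica configuration can be sketched as follows; the replica records of (votes, version, value) are hypothetical, but the rule is the one from the paper: gather at least r votes, and because r + w exceeds the total votes, at least one replica in the read quorum holds the current version:

```python
# Sketch of a weighted-voting read: collect replicas until r votes are
# gathered, then trust the highest version number seen in the quorum.

def read_quorum(replicas, r):
    """Collect replicas until r votes are gathered; return the newest value."""
    votes, quorum = 0, []
    for rep in replicas:          # in practice, whichever replicas respond
        quorum.append(rep)
        votes += rep["votes"]
        if votes >= r:
            newest = max(quorum, key=lambda x: x["version"])
            return newest["value"]
    raise RuntimeError("read quorum unavailable")
```

With three replicas of weight 1 and r = w = 2, any read quorum must overlap any write quorum in at least one replica, so the max-version rule always finds the latest committed value.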
The author describes the typical environment he envisions for this system: a network of personal computers, connected via a network, file servers, and even wide area networking. Xerox had the personal computers at that point, and had defined networking protocols (XNS) and would, in cooperation with Digital and Intel issue Version 1.0 of the Ethernet specification the following year (1980).
Much of the paper is in fact a fairly detailed description of the system that they had implemented (in Violet). Section 4 does provide insight into a variety of interesting and useful features of the system:
- “Weak representatives” – these are basically cached copies of the data; they do not have any voting rights. The author describes them as a performance optimization. He describes a way of marking the copy as invalid so it will need to be re-fetched before it can be used.
- Lock optimization – the author points out that they have an optimized lock scheme that permits updates which are compatible with read operations. This is consistent with the observation that as long as ordering of write operations is preserved on persistent storage write back operations are permissible.
- Weak consistency – the original model was serial consistency but the author points out that some problems can be addressed with weaker consistency models. The author does not explore these weak models substantially, but merely mentioning them is indeed a useful insight.
- Object size – the model permits locking on the file level, so the object stored within the file should be “of suitable size”.
- Read lock breaking – if the file system permits breaking read locks as part of conflict resolution (rather than transaction abort) then object version numbers can change during the transaction; the change is detectable since the version number shifts.
- Dynamic reconfiguration – the author describes how additional replicas can be added (and presumably removed) or weights changed. In essence, he uses the same rules for updating the voting configuration data as for the underlying data itself. Thus, changes will be discovered by the time the read quorum has been satisfied.
- Replicated containers – the author explains how replication can be used with (mostly) the same interface as non-replicated storage (just with the benefits of being replicated!)
- Minimizing communications overhead – the author points out that releasing unneeded read locks prior to commit eliminates the need to communicate during commit processing.
- Background update – postponing replication can allow smoothing network utilization over time.
The replication policy is, at its heart, an early consensus protocol. While the author does not discuss this, the approach described does have some scalability challenges that will become apparent (and be addressed) in subsequent work. Overall, this work really does an amazing job of describing so many aspects of modern computer systems: networks, file servers, personal computers, wide area networks, redundancy, consistency, etc.
Polyvalues: A Tool for Implementing Atomic Updates to Distributed Data
Warren A. Montgomery, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 143-149. ACM, 1979.
I found this paper to be surprisingly interesting despite the fact it may be one of the least cited SOSP papers I’ve ever seen (ACM lists one citation to it, and Google Scholar lists two.)
The solution presented is based on the notion of maintaining several potential current values (a polyvalue) for each database item whose exact value is not known, due to failures interrupting atomic updates. A polyvalue represents the possible set of values that an item could have, depending on the outcome of transactions that have been delayed by failures. Transactions may operate on polyvalues, and in many cases a polyvalue may provide sufficient information to allow the results of a transaction to be computed, even though the polyvalue does not specify an exact value. An analysis and simulation of the polyvalue mechanism shows that the mechanism is suitable for databases with reasonable failure rates and recovery times. The polyvalue mechanism is most useful where prompt processing is essential, but the results that must be produced promptly depend only loosely on the database state. Many applications, such as electronic funds transfer, reservations, and process control, have these characteristics.
To me, this seems like a useful insight: sometimes, the correct outcome of a transaction does not depend upon the specific value of some object. For example, if a transaction is checking to see if there are sufficient seats to sell for an airline, the fact that the range of possible seat counts is 37, 39, 40, or 41 doesn’t impact the ability of the system to sell one more seat. There is no hard requirement that we must have an exact value.
In its own way, this is an intriguing manifestation of eventual consistency. Eventually, the system will be able to figure out the correct number of seats available, once the unknown outcomes have been computed. Today, we understand consistency models well because relaxing consistency in distributed systems helps improve performance.
The traditional, lock-based system approach (such as we discussed in Implementing Atomic Actions on Decentralized Data) provides strong consistency guarantees. This was in keeping with the original requirements that transactions lead to a consistent set of state changes. But transactions are there to ensure we move from one consistent state to another consistent state. This concept of being able to proceed even in the face of some level of uncertainty points out that we just need to end up in a consistent state, not the consistent state. We trade off strict determinism for performance and flexibility.
“[T]he failure of a site should not indefinitely delay any transaction that does not access data stored at that site.” This likely seems blindingly obvious, yet in my own experience with distributed systems achieving this is harder than one might think. Leslie Lamport is credited with defining a distributed system: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
Polyvalues work by maintaining a vector of possible outcome values. If the existing possible outcome values are all consistent with allowing a new transaction to proceed, it seems reasonable to permit the new transaction to proceed, versus having it block and wait for a single definitive outcome. After all, regardless of the outcome this transaction can proceed.
The author defines a polyvalue: “a set of pairs <v,c> where v is a simple value and c is a condition which is a predicate.” This introduces the idea of a logical operation that determines the outcome, rather than just a simple record of the data value, and the value of an object as being a range of possible values that have not yet been determined. “A polyvalue is assigned to an item if a failure delays a transaction that is updating that item, or a polyvalue may be produced as one of the results of a transaction that accesses an item that has a polyvalue.”
The author then goes on to explain the logic of polyvalues, and how their inclusion into a transaction converts it to a polytransaction. The implementation here is one in which multiple possible outcomes are described. This approach would certainly seem to limit the use of this technique as otherwise there could be a state space explosion. He describes a mechanism of collapsing these states – the precise number of seats on the plane is a polyvalue, but the decision to sell the ticket for one seat need not be blocked at that point since all the polyvalues lead to the same outcome.
A polytransaction that has possible paths which fail will have to block and pend if the outcome is dependent upon the values of the polyvalues, but if all possible polyvalues yield the same result, the polytransaction can proceed.
The insight here is that in highly distributed databases most transactions can achieve a valid outcome regardless of the intermediate state values. If you look at their example of the bank account withdrawal model, it is clear that this makes sense. The operation of withdrawing funds from your account can complete in any order as long as none of them lead to a negative balance (they use this example in the paper). Thus, it makes no sense to block one in favor of the other.
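A toy sketch of the idea (my own illustrative model, not the paper's implementation): represent a polyvalue as a list of (value, condition) pairs and allow a decision whenever every possible value yields the same answer:

```python
# A polyvalue is a set of <value, condition> pairs for outcomes delayed by
# failures. A transaction whose decision is the same under every possible
# value need not block waiting for those outcomes to resolve.

def can_proceed(polyvalue, predicate):
    """True if the predicate holds for every possible value of the item."""
    return all(predicate(value) for value, _condition in polyvalue)

# Seats remaining could be 37, 39, 40, or 41 depending on delayed
# transactions; selling one more seat is safe under every possibility.
seats = [(37, "txn A aborted"), (39, "txn A committed"),
         (40, "txn B aborted"), (41, "txn B committed")]
```

Selling one seat is safe here because every pair satisfies "at least one seat left"; a query like "are at least 40 seats left?" would have to block until the delayed transactions resolve.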
To evaluate this model, the author defines various terms:
- I – the number of items in the database
- U – the number of updates per second
- F – the failure probability of an update
- R – the recovery rate (per second) from failed operations
- D – the dependency count (average) for new values
- Y – the probability that the new value of the update does not depend upon the previous value
He then predicts the number of polyvalues that will exist in the database (Table 1 from the paper):
Thus, even with somewhat pessimal error and recovery rates, he does not expect more than 51 polyvalues within the database.
Finally, he reports the results of his simulation of the system having 10,000 database entries:
Now with 1% failure rates, very slow (one per 10 seconds) recovery rates, high dependency rates (D=5), and 10 transactions per second, he still only ends up with 20 polyvalues. Thus, this approach seems to help in scaling without a dramatic increase in complexity.
My take-away: strict consistency is not necessary to construct a viable system. Even allowing for some variance in outcomes it is possible to optimize the performance of the overall system at a nominal increase in potential size and complexity.
Useful insights, indeed.
Implementing Atomic Actions on Decentralized Data
David P. Reed, Transactions on Computer Systems, Vol 1. No. 1, February 1983, pp. 3-23.
This certainly must have been an interesting choice to be the first paper of the first ACM Transactions on Computer Systems. It is certainly an interesting work on concurrent activity within a distributed system. It relies upon a basic concept of ordering within a distributed system (“decentralized data”). He builds upon the basics laid down by Leslie Lamport in Time, Clocks, and the Ordering of Events in a Distributed System. While Lamport was concerned about defining the (partial) ordering of events in the distributed system, Reed is concerned about using that ordering to construct useful distributed data updates.
Given that file systems, especially distributed file systems, are concerned with managing data across nodes in a consistent fashion, this work is of particular importance. By 1983 we have seen the emergence of network file systems, which I plan on describing further in coming posts, but they are still fairly primitive. Database systems are further along in allowing distributed data and coordination through things like two-phase commit.
He starts by describing the goals of this work:
The research reported here was begun with the intention of discovering methods for combining programmed actions on data at multiple decentralized computers into coherent actions forming a part of a distributed application program.
His focus is on coordinating concurrent actions across distributed data and ensuring that failures are properly handled. What does it mean to properly handle failures? Essentially, it means that the data is in a consistent state once the system has recovered from the failure. He starts by defining terms that relate to consistency models. For example, he defines an atomic action as being a set of operations that execute in different locations and at different times but cannot be further decomposed. A single action starts from a consistent state and moves to a consistent state at the end. Any intermediate state of the system is not visible (what we would call “isolation” now). He formally defines these concepts as well.
He touches on the idea of consistency, in which one starts with a consistent system and then proves each (atomic) operation yields a consistent state. In my experience this aspect of distributed systems is sometimes skipped, often due to the complexity of doing the work required here. In recent years, formal proof methods have been used to automate some aspects of this. I’m sure I will touch upon it in later posts.
One key benefit of this system of atomic actions is that it makes things simpler for application programmers: in general, they need not deal with unplanned concurrency and failure. Indeed, that is one of the key contributions of this work: the process of reasoning about failure and how to handle it. Indeed, in my experience, handling failure gracefully is one of the substantial challenges inherent in constructing distributed systems. If something can fail, it will.
Achieving atomic action requires the ability to interlock (“synchronization”) against other actors within the system and the ability to gracefully recover from failure cases. The author goes on to describe what his decentralized system looks like: a message passing model (via the network, presumably), with nodes containing stable storage and the ability to read and write some finite-sized storage unit atomically (“blocks”).
One class of failure the author explicitly disclaims: a situation in which the system performs an operation but ends up with a different but valid outcome. This makes sense, as it would be difficult to reason in the face of arbitrary changes each time a given operation were requested. He sets forth a series of objectives for his system:
(1) Maximize node autonomy, while allowing multisite atomic actions
(2) Modular composability of atomic actions.
(3) Support for data-dependent access patterns.
(4) Minimize additional communications.
(5) No critical nodes.
(6) Unilateral aborting of remote requests.
Having laid this groundwork, the author then defines the tools he will use to achieve these objectives. This includes a time-like ordering of events, version information for objects, and the ability to associate intermediate tentative steps together (“possibilities”).
He envisions a versioned object system, where the states of the object correspond to changes made to the object.
At this point I’ll stop and make an observation: one of the challenges for this type of versioning is that the way in which we view objects can greatly complicate things here. For example, if we modify an object in place then this sort of versioning makes sense. However, if we modify an object by creating a new object, writing it, and then replacing the old object with the new object, we have a more complex functional model than might be naively envisioned. This is not an issue clearly addressed by the current paper as it relates mostly to usage. But I wanted to point it out because this sort of behavior will make things more difficult.
One of the important contributions in this work is the discussion about failure recovery. This is, without a doubt, one of the most complex parts of building a distributed system: we must handle partial failures. That is, one node goes offline, one network switch disappears, one data center loses power.
The author thus observes: “If a failure prevents an atomic action from being completed, any WRITE the atomic action had done to shared data should be aborted to satisfy the requirement that no intermediate states of atomic actions are visible outside the atomic action.” Thus, one benefit of the versioned objects is that the pending transaction (a “possibility”) can track the updated versions. Aborting simply means that the tentative versions of the objects in the transaction are deleted; committing means that the tentative versions of the objects in the transaction are promoted to being the latest versions.
Thus, we see the basic flow of operations: a transaction is started and a possibility is created. Each potential change is described by a token, and tokens are then added to the possibility. While the term is not used here, this appears to be a model for what we now refer to as write-ahead logging (sometimes also called intention logging).
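The possibility/token flow above can be sketched in a few lines. This is a minimal illustration of the idea, not Reed's actual implementation: tentative versions of objects are staged inside a possibility and only become the latest version on commit, or vanish on abort.

```python
# Sketch of versioned objects with "possibilities" and "tokens".
# Tentative versions are invisible to readers until commit.

class VersionedObject:
    def __init__(self, initial):
        self.versions = [initial]   # committed version history
        self.tentative = None       # at most one pending tentative version

    def current(self):
        return self.versions[-1]    # readers only ever see committed state


class Possibility:
    """Groups tentative writes (tokens) so they commit or abort together."""
    def __init__(self):
        self.tokens = []            # objects with staged tentative versions

    def write(self, obj, value):
        obj.tentative = value       # stage the new version; not yet visible
        self.tokens.append(obj)

    def commit(self):
        for obj in self.tokens:     # promote tentative versions to latest
            obj.versions.append(obj.tentative)
            obj.tentative = None
        self.tokens.clear()

    def abort(self):
        for obj in self.tokens:     # discard tentative versions
            obj.tentative = None
        self.tokens.clear()
```

Note how a reader calling `current()` mid-transaction still sees the old committed state, which is exactly the “no intermediate states visible” requirement quoted earlier.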
Time stamps are introduced in order to provide a partial ordering of events and operations, so that changes can be reliably reproduced. The author goes into quite an extensive discussion of how the system generates these time stamps in a distributed fashion (via a pre-reservation mechanism); this approach ensures that the participants need not communicate in order to properly preserve ordering. The author calls this pseudotime.
Using his ordered pseudotime operations, his read and write operations, possibilities, and tokens, he then constructs his distributed data system from these primitives. There is detail about how it was implemented, the challenges in doing so, and the benefits of immutable versions.
He admits there are serious issues with the implementation of this system: “For most practical systems, our implementation so far suffers from a serious problem. Since all versions of an object are stored forever, the total storage used by the system will increase at a rate proportional to the update traffic in the system. Consequently, we would like to be able to throw away old versions of the objects in the system. We can do this pruning of versions without much additional mechanism, however.” His discussion of why this may not be desirable is interesting as it discusses the tension between reclaiming things and slow transactions. He does describe a mechanism for pruning.
After this, the author turns his attention to using atomic transactions to construct new atomic transactions: nested transactions (also “composable transactions” so that we have multiple terms for the same concept!) Again, this is proposed, but not explored fully.
The breadth and scope of this paper is certainly impressive, as the author has described a mechanism by which distributed transactions can be implemented. I would note there are no evaluations of the system they constructed, so we don’t know how efficient this was, but the model is an important contribution to distributed storage.
Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport, Communications of the ACM, July 1978, pp. 558-565.
I had not originally intended to cover this paper, but as I started trying to describe Implementing Atomic Actions on Decentralized Data, I realized that I needed to include it in order to better explain that work. This is one of the papers for which Leslie Lamport was awarded the Turing Award. In it, he is wrestling with the effects of relativity in computer systems.
What does this mean? In essence, as soon as we have two or more distinct computer systems communicating with one another over a network, we must account for the fact that the order of events may now vary for each individual node. Indeed, as it turns out, this effect is not restricted to distributed systems. It can be seen in single computer systems with multiple processors as well (another area in which Lamport was significantly involved) in the study of consistency models.
Networks tend to exacerbate these issues because the delays involved are large enough to make the reordering apparent. This profoundly impacts distributed systems (including distributed file systems, a notable and classic example of distributed systems) and we continue to wrestle with its manifestations 40 years after this paper.
The paper describes a system (Figure 1) in which three different components, labeled Process P, Process Q, and Process R, send messages among themselves. Individual messages are labeled and numbered in sequence, and the diagram shows both when each message was sent and when it was received.
This demonstrates that the observed order of events does not necessarily match what an external observer might see as the ordering. Thus, for example, Process R receives messages in a different order than they were sent by Process Q. This behavior might seem surprising initially, but is in fact a common occurrence. It could be due to queuing, intermediate network components, scheduling delays, etc.
Thus, he asks the question: what is the correct ordering of events in this system? How do we capture this and ensure that the system behaves as we expect?
He introduces a nomenclature for describing the “happened before” relationship, establishing an ordering of events within the system. He uses the → character to indicate this relationship: if a comes before b he writes “a → b“. The relation is transitive: if a → b and b → c then a → c. Finally, he defines two events a and b as concurrent if there is no ordering between them. This applies to operations within a single process as well as between processes. He notes in this discussion that sending a message must have happened before receiving that message.
While this might sound simple, it helps us begin to reason about these events, since we can pay attention to the events that do have ordering constraints and ignore those that do not. In many cases, we won’t care about the order of two operations because the net effect is the same, regardless of the order. What we do want to do, however, is ensure that we understand ordering for those operations where it does matter.
Figure 2 then takes the same chart, but this time he adds the concept of a clock tick. The dashed lines represent the ticks of a clock; the ticks occur between events. He then defines how the time ordering (the “clock function” C) relates to the → relationship: if a → b then C(a) < C(b). That is, the clock ticks (times) are also well ordered. He observes that we can extend this definition to cover the prior case, looking at the ordering of operations within a single process as well as across processes: recall that a → b is required for messages sent between processes, where a is the send event and b is the receive event, and thus C(a) < C(b) in this case as well. He calls this the “Clock Condition”.
He then goes one step further: he flattens out the clock ticks, yielding Figure 3. Now our clock ticks are uniform time and our events are within one of the clock tick intervals. Since we separate dependent operations, we now have a model we can use to order events globally.
This was the point he was trying to make. “Being able to totally order the events can be very useful in implementing a distributed system.” This is a classic understatement and remains the focus of much distributed systems research and implementation to this day: how do you ensure that you have a definitive order of events? We will see this when looking at later work.
At this point he focuses on more rigorously defining his system. This allows him to note that this becomes a “distributed algorithm”. He describes the replicated state machine that is being executed by all the nodes within the distributed system. There is no central authority in this model forcing the ordering.
This is an important point not to miss: in Lamport’s distributed system there is no requirement for a centralized service. He does not insist on a total ordering of events in the system and settles for a partial ordering – one in which he doesn’t worry about the events that aren’t sensitive to their ordering. This is a powerful technique: by giving up total determinism, we gain performance through parallelism.
From this article we get the Lamport Clock, which is not so much a clock as a monotonically increasing value that represents relative ordering. Note that a hardware clock will satisfy this requirement as well.
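The Lamport Clock is simple enough to sketch in full. The following is a minimal version of the standard construction: increment the counter on every local event, attach it to outgoing messages, and on receipt jump past the sender's timestamp. This preserves the Clock Condition: if a → b then C(a) < C(b).

```python
# A minimal Lamport clock: one monotonically increasing counter per process.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, advance past the sender's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```

Because a receive always yields a value strictly greater than the attached send timestamp, the send event is guaranteed a smaller clock value than the receive event, exactly as the Clock Condition requires. Concurrent events, by contrast, may end up with equal values; a total order, if needed, is usually obtained by breaking ties with a process identifier.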
A Principle for Resilient Sharing of Distributed Resources
Peter A. Alsberg and John D. Day, in Proceedings of the 2nd International Conference on Software Engineering, pp. 562-570. IEEE Computer Society Press, 1976.
Today I turn my attention to a paper that begins to explore the issues surrounding distributed systems. It sets forth some basic principles for building distributed systems that are worth capturing and pointing out.
They state that a “resilient server” must have four attributes:
- It must be able to detect and recover from some finite number of errors.
- It must be reliable enough that a user doesn’t need to worry about the possibility of service failure.
- Failures just beyond that finite limit are not catastrophic.
- One user’s abuse should not have material impact on any other user.
These all seem somewhat reasonable, though I must admit that I found (3) a bit surprising, as it is not uncommon for some classes of failures to cause catastrophic failure. For example, when using erasure coding it is common for some number of failures to lead to catastrophic failure. Upon reflection, I realized that one could achieve this goal by simply setting the finite limit a bit lower, though I suspect this is not entirely what the original author had in mind.
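The erasure-coding reflection above can be made concrete with a trivial sketch (my own illustration, not from the paper): with k-of-n coding, data survives up to n − k fragment losses, and one more loss is catastrophic. Advertising a limit smaller than n − k is exactly the "set the finite limit a bit lower" trick, leaving a safety margin before the cliff.

```python
# Illustration of the failure cliff in k-of-n erasure coding:
# any k of the n fragments suffice to reconstruct the data.

def recoverable(n, k, failures):
    """True if data encoded into n fragments, any k sufficient,
    survives the given number of fragment losses."""
    return failures <= n - k

# Example: 10 fragments, any 7 reconstruct -> tolerates 3 losses, not 4.
```

The discontinuity is the point: surviving 3 losses says nothing graceful about the 4th, which is why attribute (3) initially seems at odds with such schemes.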
Still, the general concept of resiliency is a good one to capture and consider. The authors point out some of the challenges inherent in this over a network, notably latency. “[A] major consideration when choosing a resource sharing strategy is to reduce, as much as possible, the number of message delays required to effect the sharing of resources.”
In other words, keep the messages to a minimum and try not to make them synchronous. Asynchronous messaging systems turn out to be rather complex, however, and some operations simply require synchronous behavior.
Of course, there has to be a file system tie-in here (however tenuous) and there is! Under examples they list “Network Virtual File Systems which provide directory services and data access services…” Thus, it is clear that the authors are considering network file systems as part of their work.
In 1976 the authors indicate that the cost to send a message is on the order of 100ms, while the cost to acquire a lock on the CPU is 100 microseconds to 1 millisecond. While things are faster now, there is still a large difference between these two paths on modern systems. Thus, we will still be dealing with issues on the same scale.
The bulk of the paper then walks through their model for providing resilient network services: an application host sends a request to a primary server host; that server host then communicates with a secondary server host. The secondary server host can send backup requests to a tertiary server, and so forth. The secondary host confirms with the primary host, and ultimately it is the secondary host that confirms the operation with the application host.
They cover a variety of important concepts, such as the idea that a host may need to record state about the operations, that operations cannot be applied until the other servers have received their messages, etc. This is, in essence, an early consensus protocol. While not efficient, the authors have laid groundwork in helping us think about how we construct such services.
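The message flow just described can be sketched as a toy simulation. This is a heavily simplified illustration of the primary/secondary/tertiary relay (names and structure are mine, not the paper's): each host records the operation before passing it along, and it is the secondary, not the primary, that ultimately answers the application.

```python
# Toy sketch of the primary/secondary/backup relay: every host records
# the operation, and the secondary answers the application host.

class Host:
    def __init__(self, name, backup=None):
        self.name = name
        self.backup = backup
        self.state = []                  # operations recorded for recovery


def submit(primary, op):
    """Application-host request: returns (responding host, ack)."""
    primary.state.append(op)             # primary records the operation
    secondary = primary.backup
    if secondary is None:
        return primary.name, op          # degenerate single-host case
    secondary.state.append(op)           # primary forwards to secondary
    if secondary.backup is not None:
        secondary.backup.state.append(op)  # secondary relays a backup request
    # secondary confirms to the primary and answers the application host
    return secondary.name, op
```

Having the secondary acknowledge the application is a nice property of the scheme: by the time the application hears anything, at least two hosts have recorded the operation, so no single failure can lose it.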
I have included Figure 3 from the paper above. It shows the message flow through the system. The paper also discusses how to handle a number of failure situations and how messages flowing through the system keep track of which nodes are current.
It also touches on one of the most complex issues in distributed systems: network partition. Intriguingly, the authors do not assert that one partition must remain alive as they describe the decision being related to the specific requirements of the environment. Thus, in their model it would be possible to end up with partitions that continue forward but can no longer be easily re-joined after the network partition is resolved. Of course, they don’t require that both sides of a partition progress, though they do not discuss ways to handle this, either.
They assert that the primary/secondary/backup model is optimal in terms of the number of messages that it sends and in how it ensures the messages are strictly ordered. Then they briefly describe a few other possible arrangements that have multiple primaries and/or secondaries. Their final conclusion is that their original strategy is at least as good as the alternatives though this is not demonstrated in any formal way.
Now that they have their replication system, they consider how it will work for the network virtual file system. They conclude that the highest levels of the hierarchy need to be stored on all nodes (which makes them the most expensive to maintain). They partition the name space below that and record location information within the nodes of the name space where the storage has been split out across hosts. Thus, we see a description of a global name space.
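The resulting global name space can be sketched with a small lookup table. The layout here is hypothetical (host names and paths are mine, not the paper's): the top of the hierarchy is replicated on every host, while lower subtrees carry location records naming the host that stores them.

```python
# Toy global name space: the root is replicated everywhere; subtrees
# below it carry location records pointing at the host that stores them.

ROOT = {                                    # replicated on all hosts
    "/": {
        "usr": {"location": "hostA"},       # subtree split out to hostA
        "sys": {"location": "hostB"},       # subtree split out to hostB
    }
}


def resolve(path):
    """Return the host responsible for a path under this toy layout,
    'all' for the replicated root, or None if the name is unknown."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "all"                        # the root itself lives everywhere
    entry = ROOT["/"].get(parts[0])
    return entry["location"] if entry else None
```

The trade-off the authors identify falls directly out of this structure: updating the replicated top levels requires touching every host, while a rename deep inside `/usr` only involves `hostA`.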
Their system is simple, but they identify a number of the important issues. Their suggestion about sharding the file system’s name space is one that we shall see again in the future.
They have laid the groundwork for thinking about how to build distributed file systems.