Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
November 2024
S M T W T F S
 12
3456789
10111213141516
17181920212223
24252627282930

WFS: A Simple Shared File System for a Distributed Environment

WFS: A Simple Shared File System for a Distributed Environment
Daniel Swinehart, Gene McDaniel, and David Boggs, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 9-17, 1979, ACM.

This file system was developed at Xerox’s Palo Alto Research Center (PARC), which produced a string of amazing advances in the nascent computer technology area in the 1970s.

Woodstock was “an early office system prototype”.  The authors’ description of Woodstock sound much like early word processing systems, such as those pioneered by Wang Laboratories in the same time frame.  The ability to share data between these systems turns out to be surprisingly important.  Local storage space was used to track the current work, but then centralized storage provides an efficient way to store them and make the work available to others.


This is the environment that gave rise to WFS.  Because Woostock already existed and provided its own hierarchical document directory structure, WFS did not need to provide such a mechanism.  In fact, WFS only provided four classes of operations:

  • I/O operations to read and write blocks of data within files
  • Creating/Destroying resources: file identifiers (FIDs) and storage blocks (pages)
  • Managing file properties, including page allocation data
  • Providing maintenance functions

The actual implementation is surprisingly simple.  Indeed the authors’ state that it took two months to build it.

Figure 1 (from the original paper) describes the format of a request/response packet, showing the basic information exchange model. It is interesting to note that the entire message fits within a small amount of memory and includes an end-to-end checksum.

There are a number of simplifying options with WFS:

  • The namespace for files is flat; there is no hierarchical structure.
  • The file structure is simple (Figure 2).
  • The protocol is stateless and each operation is idempotent.  This simplifies error handling since a lost message can be re-transmitted safely, with no fear that  repeating it will cause problems.
  • Operations are client initiated.  The server never initiates an operation.
  • Clients have limited mutable state.  The server does not permit changing its own state directly from the client.

This simiplicity does limit the generality of WFS, but it also demonstrates an important abstraction that we will see used (and re-used) in subsequent systems: a file can be treated as a block structured device (a “disk”) in an interesting and transparent fashion.

Figure 3 describes the layout of the (stateless) data exchange format used by WFS.

Figure 4 shows the layout of the file directory table which is a contiguous and fixed-size region on disk at a known disk location.  This is a fairly common characteristic of on-disk file system formats, having a known location where meta-data is to be found.

Note that Figure 4 also shows how the file’s allocated storage is described via direct and indirect block references organized into a tree structure.  Again, this will be a recurring model that occurs in file systems; it combines the flexibility of supporting efficient space utilization, ability to describe variable sized files, and efficient utilization of block-addressable storage.

This simple mechanism permits their clients to utilize a flexible storage mechanism without forcing the file server to support any of the mechanisms the client already provides, such as  the hierarchical document name space, management of documents and their structure, etc.  This separation of concerns yields an elegant and simple implementation model for their file server.

There are some interesting implementation details described in the paper:

  • Write operations are validated by reading the data page.  Thus, writes become compare and swap operations that prevents concurrent access from inadvertently overwriting changes made by another client.  It would be rather inefficient to rely upon this mechanism, but it helps prevent out-of-order packet processing in an unreliable network.  The downside to this is they must read the data before they can write it.
  • They use a write-through cache.  Thus, the cache is really for read efficiency, not write efficiency.  This should help mitigate the write inefficiency.
  • Most of their caching is done against meta-data pages (“auxiliary disk pages”) because they are more frequently accessed than client data pages.

Here’s one of the interesting performance results: “In the single-user (lightly loaded) case, WFS improved Woodstock’s average input response time over the local disk’s time for several reasons: WFS’s disks were faster than Woodstock’s local disks, requested pages were sometimes still in the WFS main memory cache, and the amount of arm motion on the local disk was reduced because it no longer had to seek between a code swap-area and the user data area.”

Accessing data over the network was faster than the local disk drive!  Whether this is a statement of how slow disks were versus networks I leave as an exercise to the reader.  One thing we can take away from this: the network often does not impose a significant bottleneck to utilizing remote storage (except, of course, when it does.)

The authors’ follow up their implementation description with an explanation of their design philosophy.  They emphasize the atomic nature of the operations they support, as well as the following properties:

  • Client initiated operations can only access one data page and “a few” auxiliary disk pages.
  • Operations are persistent before WFS returns status to the client.
  • WFS commands are a single internet packet.
  • The WFS protocol is stateless.

They then explain the rationale for these decisions, which relate to simplifying the protocol and server side implementation.

They delve into how clients might use WFS in Section 4.  One explicit take-away here is that they view these “files” as acting like “virtual disks” and this permits the WFS clients to implement their own abstraction on top of the WFS-provided services.  Because WFS doesn’t assume any specific structure for the client data, there is no burden placed upon those client implementations – though they admit at one point that this complicates the client.

The authors are able to point to other systems that utlize WFS besides Woodstock.  They cite to Paxton’s system (A Client-Based Transaction System to Maintain Data Integrity) as being based upon WFS.

The paper discusses security and privacy considerations, admitting their system does not address these issues and suggests various techniques to addressing security using encryption and capabilities.  They round out this section of the paper by discussing other possible enhancements to WFS.

In the end, they provided a simple model for a network file server that permitted a client to implement a range of solutions. As we begin looking at more network file systems, we will see this model extended in various way.

 

A Universal File Server

A Universal File Server
A. D. Birrell and R. M. Needham, in IEEE Transactions on Software Engineering, Vol SE-6, No. 5, September 1980, pp. 450-453.

One of the challenges in this next group of papers is picking which ones to discuss. The advent of networks saw the blossoming of the idea of centralizing storage and having different computer systems accessing it via those networks.  By the time this paper is published quite a few network based file server solutions had been constructed and described within the literature – and we will get to them.

The authors here decided to try and extract generality from these works.  So in this paper we step back and look for some generality.

This is a rather short paper – four pages.

The authors describe the division of responsibilities in a file server: the “high-level functions more properly associated with a filing system” and “functions belonging to a backing store server” [emphasis in the original].  When I read this I thought that this made sense: we have a functional layer that creates a name space, attributes, etc. and a storage layer that keeps track of storage blocks.

Figure 1

By splitting out this functionality, the authors then suggest that the backing store server is a point of commonality that can be used to support a range of higher level services.  Thus, “[t]he backing store server is the sole agency concerned with allocating and relinquishing space on the storage medium.”  To achieve this goal the authors propose a universal system of indexes as shown in Figure 1 (from the original paper).

The authors argue for a master table that presents the per-system name space.  For each of these namespaces, there is a corresponding master file directory (MFD) and a collection of user file directories (UFDs) that are used to organize the user’s information into a hierarchy.

We note that the files, UFDs and MFD are all stored in storage elements – segments – that are managed by the file server.   Thus the file server is responsible for:

 

  • Keeping track of its initial index
  • Preserve the names stored in the MFD and UFDs
  • Reclaim (delete) the entries in the MFD and UFDs when an entry is deleted
  • Manage storage space

From this simple model, they note that a broad range of systems can be constructed.

The paper spends considerable (25% of the paper) time discussing “protection”.  By this they refer to the issues inherent in having shared usage of a common resource, such as the files on the file server.  The authors describe using ACLs on the file server as one means of providing security.  They do not touch upon precisely how the file system will authenticate the users, though at one point they refer to using encryption for access bits in some circumstances.

Their preferred mechanism for access is the capability.  This should not come as a surprise, given that they worked on the CAP file system, which provided capabilities.  Their observation is that with a sufficiently sparse handle space, it is impractical for an unauthorized party to find the resource.  It probably doesn’t require much to point out that this presumes the inherent integrity of the network itself.

The authors complete their universal file server with an observation that this provides a general base upon which individual file systems can implement their own enhanced functionality. Indeed, this was one of their primary objectives in doing this work.  They do point out a number of potential issues in their system, but assert that they will not be problematic.

The authors do a good job of describing a basic, abstract file server.  The system they describe may not have achieved broad use but this paper does provide a simple, high level view of how a file server might operate.  We’ll turn our attention to actual implementations – and there are many such implementations to discuss in the coming posts.

 

Weighted Voting for Replicated Data

Weighted Voting for Replicated Data
David K. Gifford, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 150-162, 1979.

I return back to distributed systems.  Previously I discussed a companion paper at the same conference (Polyvalues) that was essentially ignored in the subsequent literature.  This paper, on the other hand, is well-cited and lays the groundwork for a quorum-based replicated data distribution scheme.  I can see echos of a more theoretical paper (“Crumbling Walls“) that I will review at some point in the future.

This work was done while Dave Gifford was at Xerox Palo Alto Research Center (Xerox PARC).  At this point, the Xerox PARC team had been working to develop the personal computer.  I’m not reviewing it, but another interesting paper from this time period is the Pilot paper (perhaps I should, I see it describes the file systems as large and flat).  Thus, the author of this paper is describing an actual working system, not a theoretical model for how one might implement such a system.

The key to this algorithm is the concept of a quorum for replicated data:

In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r+ w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file’s voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.

The “votes” assigned to each copy of the file are its weight.  This model provides a good way of generalizing replicated data.  It could describe a primary/secondary model, or shift the emphasis on ensuring critical systems have copies of the data.  The model even permits caching nodes that have no weight.

The key to this approach is that the read quorum is set up so that it is large enough that at least one copy within the read set will have the current data.  This is accomplished by ensuring that the combination of read quorum and write quorum represents a number (weight) that is larger than the total sum of all weights within the system.  The challenge in a system like this is that choosing these values will determine the reliability of the system in the face of failure.  The author doesn’t go into great detail about the types of failures that can occur, but an obvious one is that one of the replicated copies becomes unavailable: a machine crashes.  A more interesting one is where the network partitions so that one group of replicas exist on one side of the partition and a different group exists in a separate partition.

The strategy outlined in this paper would permit at most one partition to proceed.  The other partition (or partitions) could continue to make some level of progress if the read quorum limit is low enough, where “low enough” means there are at least that many readable copies available within the given partition.

Network Environment

For example, it may be sufficient for only a single replica to be available in order for the read quorum to be satisfied.  In that case, it is consistent because the sum of the read quorum plus write quorum is greater than the number of votes in the system.  In other words, it works because with the lowest possible read quorum a write requires recording the changes reliably on every (voting) replicated copy.  Such a system provides strong guarantees, but won’t allow any progress when any of the nodes are down, since the write quorum requirement is so high.

Similarly, the other extreme is one in which the read quorum is equal to the number of votes in the system, so that the write quorum is just a single node.  This does not seem like a very good option, given that it would cause all the data to become unavailable when any of the replicas became unavailable.

Thus, the pragmatic option here would be to have a distribution of weights and quorum.  For example, if you have three replicas, each with the same weight (say 1 for this discussion) then a workable model is to insist on a read quorum of 2 and a write quorum of 2.  In that way, a single failure will not prevent you from making forward progress, but if two nodes go down then the system can no longer make progress.

The author describes the typical environment he envisions for this system: a network of personal computers, connected via a network, file servers, and even wide area networking.  Xerox had the personal computers at that point, and had defined networking protocols (XNS) and would, in cooperation with Digital and Intel issue Version 1.0 of the Ethernet specification the following year (1980).

Much of the paper is in fact a fairly detailed description of the system that they had implemented (in Violet).  Section 4 does provide insight into a variety of interesting and useful features of the system:

  • “Weak representatitves” – these are basically cached copies of the data; they do not have any voting rights. The author describes them as a performance optimization. It indicates a way of marking the copy as invalid so it will need to be re-fetched before it can be used.
  • Lock optimization – the author points out that they have an optimized lock scheme that permits updates which are compatible with read operations.  This is consistent with the observation that as long as ordering of write operations is preserved on persistent storage write back operations are permissible.
  • Weak consistency – the original model was serial consistency but the author points out that some problems can be addressed with weaker consistency models.  The author does not explore these weak models substantially, but merely mentioning them is indeed a useful insight.
  • Object size – the model permits locking on the file level, so the object stored within the file should be “of suitable size”.
  • Read lock breaking – if the file system permits breaking read locks as part of conflict resolution (rather than transaction abort) then object version numbers can change during the transaction; the change is detectable since the version number shifts.
  • Dynamic reconfiguration – the author describes how additional replicas can be added (and presumably removed) or weights changed.  In essence, he uses the same rules for updating the voting configuration data as for the underlying data itself.  Thus, changes will be discovered by the time the read quorum has been satisfied.
  • Replicated containers – the author explains how replication can be used with (mostly) the same interface as non-replicated storage (just with the benefits of being replicated!)
  • Minimizing communications overhead – the author points out that releasing unneeded read locks prior to commit eliminates the need to communicate during commit processing.
  • Background update – postponing replication can allow smoothing network utilization over time.

The replication policy is, at its heart, an early consensus protocol.  While the author does not discuss this, the approach described does have some scalability challenges that will become apparent (and be addressed) in subsequent work.  Overall, this work really does an amazing job of describing so many aspects of modern computer systems: networks, file servers, personal computers, wide area networks, redundancy, consistency, etc.

 

 

 

As We May Think

As We May Think
Vannevar Bush, The Atlantic, July 1945.

I saw this covered by Adrian Colyer recently and unabashedly decided I needed to cover it as well, not because I thought there was anything wrong with his coverage but rather because this speaks well to my own research interests.  As Colyer points out the concept of trails is something that seems to have been lost.

Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose. If the aggregate time spent in writing scholarly works and in reading them could be evaluated, the ratio between these amounts of time might well be startling. Those who conscientiously attempt to keep abreast of current thought, even in restricted fields, by close and continuous reading might well shy away from an examination calculated to show how much of the previous month’s efforts could be produced on call. Mendel’s concept of the laws of genetics was lost to the world for a generation because his publication did not reach the few who were capable of grasping and extending it; and this sort of catastrophe is undoubtedly being repeated all about us, as truly significant attainments become lost in the mass of the inconsequential.

Bottom line? We can’t find things. I recently described the Polyvalues paper from SOSP 1979. I commented on the fact that there seemed to be useful insights here that just disappeared.

The real heart of the matter of selection, however, goes deeper than a lag in the adoption of mechanisms by libraries, or a lack of development of devices for their use. Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

This was one of those moments when I realized how deeply embedded our concept of hierarchical organization really is.  It isn’t embedded in the operating systems of the 1960s.  It was inherited from the more fundamental constraints of paper indexing.  Indeed, since reading this article it has given me further insight into how deeply entrenched we are with hierarchical organization.

The first idea, however, to be drawn from the analogy concerns selection. Selection by association, rather than indexing, may yet be mechanized. 

“The analogy” here relates to the associative mechanisms inherent in how humans recall information, which is described in the prior paragraph.  The author describes “trails”.  This evocative idea is similar to the modern field of data provenance.  In data provenance, the focus is often on reproducibility, not on finding things, yet there are intriguing similarities.  I won’t explore this area further yet, but it seems to be intriguing.  Perhaps it will open up some new perspectives to explore.

All this is conventional, except for the projection forward of present-day mechanisms and gadgetry. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.

The memex is his hypothetical device for capturing and finding this information.  At this point the author describes how to build such a system (more or less) using the technology of the day.  The terminology is interesting, yet also quite telling that a 73 year old paper could describe modern systems so well.

At some point reading through this it occurred to me that in some ways we have built a system similar to what he describes: the internet.  What it doesn’t do is construct the personalized model of inter-relationships between the various components of the system.

Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.

Doesn’t this sound like a web browser?

For me, the key take-away here is encouraging: my own goal of looking for alternatives to hierarchical file systems is not as crazy as it might first seem.  It certainly does not mimic the way in which we tend to organize data, though I have also had people point out to me that nothing prevents us from constructing a higher level layer that can be used to achieve the same goal.  Indeed, that has been done before and I will get to that at a later point.

Polyvalues: A Tool for Implementing Atomic Updates to Distributed Data

Polyvalues: A Tool for Implementing Atomic Updates to Distributed Data
Warren A. Montgomery, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 143-149. ACM, 1979.

I found this paper to be surprisingly interesting despite the fact it may be one of the least cited SOSP papers I’ve ever seen (ACM lists one citation to it, and Google Scholar lists two.)

The solution presented is based on the notion of maintaining several potential current values (a polyvalue) for each database item whose exact value is not known, due to failures interrupting atomic updates.  A polyvalue represents the possible set of values that an item could have, depending on the outcome of transactions that have been delayed by failures.  Transactions may operate on polyvalues, and in many cases a polyvalue may provide sufficient information to allow the results of a transaction to be computed, even though the polyvalue does not specify an exact value.  An analysis and simulation of the polyvalue mechanism shows that the mechanism is suitable for databases with reasonable failure rates and recovery times. The polyvalue mechanism is most useful where prompt processing is essential, but the results that must be produced promptly depend only loosely on the database state.  Many applications, such as electronic funds transfer, reservations, and process control, have these characteristics.

To me, this seems like a useful insight: sometimes, the correct outcome of a transactions does not depend upon the specific value of some object.  For example, if a transaction is checking to see if there are sufficient seats to sell for an airline, the fact that the range of possible seat counts is 37, 39, 40, or 41 doesn’t impact the ability of the system to sell one more seat.  There is no hard requirement that we must have an exact value.

In its own way, this is an intriguing manifestation of eventual consistency.  Eventually, the system will be able to figure out the correct number of seats available, once the unknown outcomes have been computed.  Today, we understand consistency models well because relaxing consistency in distributed systems helps improve performance.

The traditional, lock-based system approach (such as we discussed in Implementing Atomic Actions on Decentralized Data) provides strong consistency guarantees.  This was in keeping with the original requirements that transactions lead to a consistent set of state changes.  But transactions are there to ensure we move from one consistent state to another consistent state.  This concept of being able to proceed even in the face of some level of uncertainty points out that we just need to end up in a consistent state, not the consistent state.  We trade off strict determinism for performance and flexibility.

“[T]he failure of a site should not indefinitely delay any transaction that does not access data stored at that site.”  This likely seems blindingly obvious, yet in my own experience with distributed systems achieving this is harder than one might think.  Leslie Lamport is credited with defining a distributed system: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Polyvalues work by maintaining a vector of possible outcome values.  If the existing possible outcome values are all consistent with allowing a new transaction to proceed, it seems reasonable to permit the new transaction to proceed, versus having it block and wait for a single definitive outcome.  After all, regardless of the outcome this transaction can proceed.

The author defines a polyvalue: “a set of pairs <v,c> where v is a simple value and c is a condition which is a predicate.”  This introduces the idea of a logical operation that determines the outcome, rather than just a simple record of the data value, and the value of an object as being a range of possible values that have not yet been determined.  “A polyvalue is assigned to an item if a failure delays a transaction that is updating that item, or a polyvalue may be produced as one of the results of a transaction that accesses an item that has a polyvalue.”

The author then goes on to explain the logic of polyvalues, and how their inclusion into a transaction converts it to a polytransaction.  The implementation here is one in which multiple possible outcomes are described.  This approach would certainly seem to limit the use of this technique as otherwise there could be a state space explosion.  He describes a mechanism of collapsing these states – the precise number of seats on the plane is a polyvalue, but the decision to sell the ticket for one seat need not be blocked at that point since all the polyvalues lead to the same outcome.

A polytransaction that has possible paths which fail will have to block and pend if the outcome is dependent upon the values of the polyvalues, but if all possible polyvalues yield the same result, the polytransaction can be sold.

The insight here is that in highly distributed databases most transactions can achieve a valid outcome regardless of the intermediate state values.  If you look at their example of the bank account withdrawal model, it is clear that this makes sense.  The operation of withdrawing funds from your account can complete in any order as long as none of them lead to a negative balance (they use this example in the paper). Thus, it makes no sense to block one in favor of the other.

To evaluate this model, the author defines various terms:

  • I – the number of items in the database
  • – the number of updates per second
  • F – the failure probability of an update
  • R – the recovery rate (per second) from failed operations
  • D – the dependency count (average) for new values
  • Y – the probability the new value the update does not depend upon the previous value

He then predicts the number of polyvalues that will exist in the database (Table 1 from the paper):

Table 1

Thus, even with somewhat pessimal error and recovery rates, he does not expect more than 51 polyvalues within the database.

Finally, he reports the results of his simulation of the system having 10,000 database entries:

Table 2

Now with 1% failure rates, very slow (1 per 10 second) recovery rates, high dependency rates (D=5) and 10 transactions per second, he still only ends up with 20 polyvalues. Thus, this approach seems to help in scaling without a dramatic increase in complexity.

My take-away: strict consistency is not necessary to construct a viable system. Even allowing for some variance in outcomes it is possible to optimize the performance of the overall system at a nominal increase in potential size and complexity.

Useful insights, indeed.

Implementing Atomic Actions on Decentralized Data

Implementing Atomic Actions on Decentralized Data
David P. Reed, Transactions on Computer Systems, Vol 1. No. 1, February 1983, pp. 3-23.

This certainly must have been an interesting choice to be the first paper of the first ACM Transactions on Computer Systems.  It is certainly an interesting work on concurrent activity within a distributed system.  It relies upon a basic concept of ordering within a distributed system (“decentralized data”).  He builds upon the basics laid down by Leslie Lamport in Time, Clocks, and the Ordering of Events in a Distributed System. While Lamport was concerned about defining the (partial) ordering of events in the distributed system, Reed is concerned about using that ordering to construct useful distributed data updates.

Given that file systems, especially distributed file systems, are concerned with managing data across nodes in a consistent fashion, this work is of particular importance.  By 1983 we have seen the emergence of network file systems, which I plan on describing further in coming posts, but they are still fairly primitive.  Database systems are further along in allowing distributed data and coordination through things like two-phase commit.

He starts by describing the goals of this work:

The research reported here was begun with the intention of discovering methods for combining programmed actions on data at multiple decentralized computers into coherent actions forming a part of a distributed application program.

His focus is on coordinating concurrent actions across distributed data and ensuring that failures are properly handled.  What does it mean to properly handle failures?  Essentially, it means that the data is in a consistent state once the system has recovered from the failure.  He starts by defining terms that relate to consistency models.  For example, he defines an atomic action as being a set of operations that execute in different locations and at different times but cannot be further decomposed.  A single action starts with a consistent state at the start and moves to a consistent state at the end.  Any intermediate state of the system is not visible (what we would call “isolation” now).  He formally defines these concepts as well.

He touches on the idea of consistency, in which one starts with a consistent system and then proves each (atomic) operation yields a consistent state.  In my experience this aspect of distributed systems is sometimes skipped, often due to the complexity of doing the work required here.  In recent years, formal proof methods have been used to automate some aspects of this.  I’m sure I will touch upon it in later posts.

One key benefit of this system of atomic actions is that it makes things simpler for application programmers: in general, they need not deal with unplanned concurrency and failure.  Indeed, that is one of the key contributions of this work: the process of reasoning about failure and how to handle it.  Indeed, in my experience, handling failure gracefully is one of the substantial challenges inherent in constructing distributed systems.  If something can fail, it will.

Achieving atomic action requires the ability to interlock (“synchronization”) against other actors within the system and the ability to gracefully recover from failure cases.  The author goes on to describe what his decentralized system looks like: a message passing model (via the network, presumably,) with nodes containing stable storage and the ability to read and write some finite sized storage unit atomically (“blocks”).

One class of failure the author explicitly disclaims: a situation in which the system performs an operation but ends up with a different but valid outcome.  This makes sense, as it would be difficult to reason in the face of arbitrary changes each time a given operation were requested.  He sets forth a series of objectives for his system:

(1) Maximize node autonomy, while allowing multisite atomic actions
(2) Modular composability of atomic actions.
(3) Support for data-dependent access patterns.
(4) Minimize additional communications.
(5) No critical nodes.
(6) Unilateral aborting of remote requests.

Having laid this groundwork, the author then defines the tools he will use to achieve these objectives.  This includes a time-like ordering of events, version information for objects, and the ability to associate intermediate tentative steps together (“possibilities”).

He envisions a versioned object system, where the states of the object correspond to changes made to the object.

At this point I’ll stop and make an observation: one of the challenges for this type of versioning is that the way in which we view objects can greatly complicate things here.  For example, if we modify an object in place then this sort of versioning makes sense.  However, if we modify an object by creating a new object, writing it, and then replacing the old object with the new object, we have a more complex functional model than might be naively envisioned.  This is not an issue clearly addressed by the current paper as it relates mostly to usage.  But I wanted to point it out because this sort of behavior will make things more difficult.

One of the important contributions in this work is the discussion about failure recovery.  This is, without a doubt, one of the most complex parts of building a distributed system: we must handle partial failures.  That is, one node goes offline, one network switch disappears, one data center loses power.

The author thus observes: “If a failure prevents an atomic action from being completed, any WRITE the atomic action had done to share data should be aborted to satisfy the requirement that no intermediate states of atomic actions are visible outside the atomic action.  Thus, one benefit of the versioned objects is that the pending transaction (“possibilities”) can track the updated version.  Abort simply means that the tentative versions of the objects in the transaction are deleted.  Committing means that the tentative versions of the object in the transaction are promoted to being the latest version.

Thus, we see the basic flow of operations: a transaction is started and a possibility is created.  Each potential change is described by a token.  Tokens are then added to the possibility.  While the term is not here, is appears to be a model for what we refer to as write-ahead logging (sometimes also called intention logging).

Time stamps are introduced in order to provide the partial ordering of events or operations, so that the changes can be reliably reproduced.  The author goes into quite an extensive discussion about how the system generates time stamps in a distributed fashion (via a pre-reservation mechanism).  This approach ensures that the participants need not communicate in order to properly preserve ordering.  The author calls this pseudotime. He continues on to explain how timestamps are generated.

Using his ordered pseudo-time operations, his read and write operations, possibilities, and tokens, he then constructs his distributed data system using these primitives.  There is detail about how it was implemented, challenges in doing so and the benefits of immutable versions.

He admits there are serious issues with the implementation of this system: “For most practical systems, our implementation so far suffers from a serious problem.  Since all versions of an object are stored forever, the total storage used by the system will increase at a rate proportional to the update traffic in the system.  Consequently, we would like to be able to throw away old versions of the objects in the system.  We can do this pruning of versions without much additional mechanism, however.”  His discussion of why this may not be desirable is interesting as it discusses the tension between reclaiming things and slow transactions. He does describe a mechanism for pruning.

After this, the author turns his attention to using atomic transactions to construct new atomic transactions: nested transactions (also “composable transactions” so that we have multiple terms for the same concept!)  Again, this is proposed, but not explored fully.

The breadth and scope of this paper is certainly impressive, as the author has described a mechanism by which distributed transactions can be implemented.  I would note there are no evaluations of the system they constructed, so we don’t know how efficient this was, but the model is an important contribution to distributed storage.

Time, Clocks, and the Ordering of Events in a Distributed System

Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport, Communications of the ACM, July 1978, pp. 558-565. 

I had not originally intended to cover this paper, but as I started trying to describe Implementing Atomic Actions on Decentralized Data, I realized that I needed to include it in order to better explain that work.  This is one of the papers for which Leslie Lamport was awarded the Turing Award.  In it, he is wrestling with the effects of relativity in computer systems.

What does this mean?  In essence, as soon as we have more than two distinct computer systems communicating with one another in a network, we must account for the fact that the order of events may now vary for each individual note.   Indeed, as it turns out, this effect is not restricted to distributed systems.  It can be seen in single computer systems with multiple processors as well (another area where Lamport was significantly involved) in the study of consistency models.

Networks tend to exacerbate these issues because the times involved often make it more apparent that there are issues. This will profoundly impact distributed systems (including distributed file systems, a notable and classic example of distributed systems) and we continue to wrestle with its manifestations 40 years after this paper.

Figure 1

The paper describes a system (Figure 1) in which three different components, labeled Process P, Process Q, and Process R, are sending messages between them.  Individual messages are labeled and numbered in sequence and the time between them is shown.  It is worth noting that the times are of both when the message was sent as well as when it was received.

This demonstrates that the observed order of events does not necessarily match what an external observer might see as the ordering.  Thus, for example, Process R receives messages in a different order than they were sent by Process Q.  This behavior might seem surprising initially, but is in fact a common occurrence.  This could be due to queue types, intermediate network components, scheduling delays, etc.

Thus, he asks the question: what is the correct ordering of events in this system?  How do we capture this and ensure that the system behaves as we expect? Figure 2

He introduces a nomenclature of describing the “happened before” relationship, establishing an ordering of events within the system.  He uses the → character to indicate this relationship.  Thus, if comes before b he writes “a → b“.  Similarly it is transitive operation: if a → b and b → c then a → c.  Finally he defines an operation as concurrent if there is no ordering between a and b.   This applies to operations within a single process as well as between processes.  He notes in this discussion that sending a message must have happened before receiving a message.

While this might sound simple, it helps us begin to reason about these events, since we can pay attention to the events that do have ordering constraints and ignore those that do not.  In many cases, we won’t care about the order of two operations because the net effect is the same, regardless of the order.  What we do want to do, however, is ensure that we understand ordering for those operations where it does matter.

Figure 2 then takes the same chart, but this time he adds the concept of a clock tick.  The dashed lines represents the tick of a clock; the ticks occurs between events.  He then defines the time ordering (“clock function”) is related to the → relationship.  If a → b then C(a)  < C(b). That is, the clock tick (time) is also well ordered.  He observes then that we can extend this definition to cover the prior case, where we look at the ordering of operations within a single process, as well as across processes: recall that a → b is required for messages sent between processes, where a is the send event and b is the receive event and thus C(a) < C(b) in this case as well.  He calls this the “Clock Condition”.

Figure 3 He then goes one step further: he flattens out the clock ticks, yielding Figure 3.  Now our clock ticks are uniform time and our events are within one of the clock tick intervals.  Since we separate dependent operations, we now have a model we can use to order events globally.

This was the point he was trying to make.  “Being able to totally order the events can be very useful in implementing a distributed system.”  This is a classic understatement and is the focus of much of distributed systems research and implementation to this day: how do you ensure that you have a definitive order of events.  We will see this when looking at later work.

At this point he focuses on more rigorously defining his system.  This allows him to note that this becomes a “distributed algorithm”.  He describes the replicated state machine that is being executed by all the nodes within the distributed system.  There is no central authority in this model forcing the ordering.

This is an important point not to miss: in Lamport’s distributed system there is no requirement for a centralized service.  He does not insist on a total ordering of events in the system and settles for a partial ordering – one in which he doesn’t worry about the events that aren’t sensitive to their ordering.  This is a powerful technique because it gives up total determinism; we gain performance through parallelism.

From this article we get the Lamport Clock which is not so much a clock as a monotonically increasing value that represents relative ordering.  Note that a hardware clock will satisfy this invariant requirement as well.

A Client-Based Transaction System to Maintain Data Integrity

A Client-Based Transaction System to Maintain Data Integrity
William H. Paxton, in Proceedings of the seventh ACM Symposium on Operating Systems Principles, 1979, pp 18-23.

Last time I discussed the basis of consistency and locks. I did so because I thought it would help explain this paper more easily. We now move into the nascent world of sharing data over the network (what today we might call distributed systems).  One of the challenges of this world is that it involves coordinating changes to information that involves multiple actors (the client and server) that has some sort of consistency requirement.  The authors here use the term integrity (rather than consistency) but I would suggest the terms are effectively the same for our purposes:

Integrity is a property of a total collection of data.  It cannot be maintained simply by using reliable primitives for reading and writing single units — the relations between the units are important also.

This latter point is important.  Even given atomic primitives, any time we need to update related state, we need to have some mechanism beyond a simple read or write mechanism.

The definitions offered by the author is highly consistent with what we saw previously:

  1. The consistency property: although many clients may be performing transactions simultaneously, each will get a consistent view of the shared data as if the transactions were being executed one at a time.
  2. The atomic property: for each transaction, either all the writes will be executed or none of them will, independent of crashes in servers or clients.

Prior work had focused on having the server handle these issue.  This paper describes how the client can accomplish this task.

There were some surprising gems here.  For example:

If a locked file is not accessed for a period of time, the server automatically releases the lock so that a crashed client will not leave files permanently unavailable.

We will see this technique used a number of times in the future and it is a standard technique for handling client-side locks in network file systems.  Thus, one novelty to the locks presented here is that they have a bounded lifetimeThus, one of the services required from the server is support for this style of locking.

The author then goes on to propose the use of an intention log (“intention file” in the paper).  The idea is that the client computer locks files, computes the set of changes, records them in the intention log, and then commits the changes.  Then the actual changes are applied.

To achieve this, it sets out six operations:

  • Begin Transaction – this is what reserves the intention file.  Note that the author’s model only permits a single transaction at a time. Note that the other operations fail if a transaction has not been started.
  • Open – this is what opens the actual data file.  At this point the file header is read.  If that header indicates the file is being used for a transaction, the file is recovered before the open can proceed.
  • Read – as it suggests, this reads data from the file.
  • Write – this prepares to write data to the file.  Since this is a mutation, the new data must be recorded in the the intention file at this point.  The actual file is not modified at this point.
  • Abort Transaction – this causes all files to be unlocked and closed.  No changes are applied.  The transaction becomes “completed”.
  • End Transaction – this is how a transaction is recorded and applied.  The paper describes how this is handled in some detail.  In all successful cases, the changes are applied to the server.

The balance of the paper then describes how crash recovery works.  The simplifying assumptions help considerably here – this system only permits a single transaction to proceed at a time.  Then the author moves on to a “sketch of correctness proof”.  I admit I did find that humorous as the attempt at formalism without actually invoking formalism achieves anything other than warm-fuzzy feelings.

He wraps up by discussing some complex cases, such as file creation and the challenge of “cleaning up” partially created file state. Having dealt with this issue over the years, I can assure you that it can be complex to get it implemented correctly. He also discusses that some files might have multi-block meta-data headers and explains how those should be handled as well.  He concludes with a discussion about handling media failure conditions, which are a class of failures that this system does not claim to protect against.

Granularity of Locks in a Shared Data Base

Granularity of locks in a shared data base,
Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, in Proceedings of the 1st International Conference on Very Large Data Bases, pp. 428-451. ACM, 1975.

I was originally going to do a write-up of A Client-Based Transaction System to Maintain Data Integrity but realized that I should help motivate the use of locks and transactions better before diving into this topic area.  So to do this, I’ve picked to talk about locks.  This is a critical concept that we utilize in file systems development on a routine basis.  Note that this work actually started out in the database community, not the file systems community.

I will note that this is a long paper (23 pages) and there is considerable detail that I will omit for the sake of brevity – but I do want to touch on the key points that are raised in this paper.

They start off by defining consistency and transaction.  They define consistency first:

We assume the data base consists of a collection of records and constraints defined on these records.  There are physical constraints (ex: in a list of records, if a record A points to record B then record B must exist) as well as logical constraints (ex: conservation of money in a bank checking account application). When all such constraints are satisfied the data base is said to be consistent.

Then they move on to transaction:

transaction is a series of accesses (for read or write operations) to the data base which, applied to a consistent data base, will product a consistent data base.  During the execution of a transaction, the data base may be temporarily inconsistent.  The programs used to perform the transactions assume that they “see” a consistent data base.  So if several transactions are run concurrently, a locking mechanism must be used to insure that one transaction does not see temporarily inconsistent data cause by another transaction.  Also, even if there are no consistency constraints, locks must be used so that the updates of one transaction are not made available to others before the transaction completes.  Otherwise, transaction backup might cascade to other transactions which read or updated the “back up” updates.

This text does not lay out the modern description of transactions that we use (atomic, consistent, isolated, and durable or ACID) but what it describes is a system in which these properties do in fact exist.  This concept is not unique to databases – we see it in any multi actor system sharing resources.  The usual analogy that I use when describing this are traffic intersection based, as the intersection is a shared resource and the individual vehicles separate actors. lock free traffic flow

The optimal configuration for resource sharing is where the need to share is minimized.  If there is no shared data, there is no need for locks. Similarly, if the data is immutable (a concept I know we’ll return to over time) we don’t need to protect it.  Only mutable state needs to use locks.  As it turns out, frequently mutable state doesn’t actually change but because it can change, if we rely upon it, we must protect against change.

This concept of transitioning from one consistent state to another consistent state safely is why we use locks.  The primary contribution of the authors in this paper is not the creation of locks and consistency but rather their creation of a clear language that we still use to describe these concepts.

In Section 2 they then describe what a “lockable object” is.  There are several important observations here, which I summarize:

  • A transaction is sequentially consistent.  This is a very strong consistency guarantee (as we will see in later papers when things get far more general.)
  • The basic lock type is reader-writer.  A reader lock provides shared access to mutable state, along with the guarantee that the mutable state won’t change while the lock is held.  A writer lock provides exclusive access to mutable state, along with the guarantee that the only changes to that state will be made by the owner of the writer lock.
  • Locks may be used to protect non-existent resources, for example, to block the creation of a new object.  Since the object doesn’t exist, you can’t lock it yet (interestingly, I usually tend to think of that as locking the structure in which the new object exists, but their observation that you can lock non-existent items is certainly valid.
  • Locks are requested dynamically.
  • There is a dynamic balance between lock granularity and overhead.  If we lock many small objects, there is a cost associated with each of those lock operations.  In the years since this paper we have made uncontended lock acquisition rather inexpensive, but contended lock acquisition remains an expensive proposition.

The paper then goes on to discuss lock hierarchies.   This is because in any dynamic lock system in which you obtain more than a single lock, you need to ensure that there will never be a situation in which an actor blocks waiting for a lock that will never be released.  The simplest case of this is when an actor blocks waiting for a lock which it owns.  This is clearly a programming error, but it is one that I have seen numerous times over the years.  The more complex case of this is when we have a cycle in lock acquisition.  The paper points out that to prevent this we need to construct a directed acyclic graph showing the order in which locks are acquired.  Usually these are trees, but I have seen environments in which they really are DAGs, due to re-entrant behavior.

The paper describes their model for one type of lock, which is focused on locking within an hierarchical name space.  Thus, they have exclusive (X) access, shared (S) access, and intention (I) locks.  The intention lock is then used to protect each object along the lock hierarchy, with X or S access to the final node in the hierarchy.  The paper goes into greater detail about this; I’ll omit it because it is not really germane to what I found most interesting about this paper.

The final point that I’ll draw out of this paper is that they discuss the idea of lock scheduling, deadlock, and lock thrashing.  The lock hierarchy is intended on preventing deadlock.  Lock scheduling can be used to detect deadlocks.  Thrashing relates to memory contention and calls out to other operating system mechanisms for resolution.

With this behind us, we should have a better basis for understanding how we use transactions and locking to ensure consistency in file systems.

The DEMOS File System

The DEMOS File System
Michael L. Powell, In Proceedings of the sixth ACM symposium on Operating systems principles, pp. 33-42.

This paper delves into the nitty gritty details of constructing physical file systems.  I was surprised that it had relatively few citations (61 according to Google Scholar when I checked) because, having read it, I would hand this paper to someone asking me “what are file systems?”  I suspect that the more frequently cited paper in this area will be “A Fast File System for UNIX,” which cites to this paper.

The target for DEMOS is the CRAY-1 supercomputer, at the time the fastest computer in the world.  As a matter of comparison, modern mobile devices have more computational power (and often more I/O bandwidth) than the CRAY-1 did.

DEMOS Figures 1 and 2The author discusses the design of a new file system for use with a custom operating system for the Los Alamos National Laboratory’s CRAY-1 computer system.  A key for this project was that it seeks to improve performance as much as possible.  After all, why build a super-computer if you then cripple it with features that don’t enhance its performance?

What I find delightful about this paper is that it describes the basic constituent parts of a file system, as well as strategies for optimizing performance.  It does so in a clear and understandable fashion.DEMOS Figures 3 and 4

DEMOS utilizes a UNIX-like hierarchical file system model.  It has directories and files. It does not have the link model from Multics so paths to files are unique in DEMOS.  Files are managed in units of blocks (4096 bytes) but I/O is specified as bytes (interestingly, they specify eight bit bytes as nine bit machines were still in use.)

The authors discuss file sizes.  To the best of my knowledge this is one of the earliest papers covering this common subject  (which is revisited periodically because workloads change and file sizes also change).  One of the common themes I have seen in other work is mirrored here: most files are small.  Figure 1 shows a CDF for file sizes.  We note that the majority of files in their system are small, with approximately 75% being less than 1KB; this is consistent with later work as well.  Their second figure (Figure 2) describes the proportion of transfer sizes and their source.   We see a spike in the 100, perhaps 256 or 512 being “natural block sizes” that applications would use.

demos figure 5They establish lofty performance requirements: “[T]he file system will have to support a bandwidth of 20-60 megabits/second or higher”. Our performance requirements today would be much higher, but this recognizes the reality that then (as now) the I/O bandwidth of storage is often the rate limiting factor.

DEMOS is paired with a centralized storage facility (“Common File System” or CFS) that is to provide the function of what we would now think of as a centralized file server.  While not yet implemented by the time of the paper, their plan was to introduce automatic file migration and staging.

The central bit of the paper then describes the constituent parts of the file system.  This maps rather well onto what I have seen in the typical file system: a “request interpreter” that handles requests from applications.  Even their description is appropriate: “parameter validation and request translation”; a “buffer manager” that handles the allocation of buffer cache space (often virtual cache these days); and a “disk driver” that handles low level data operations, such as filling or storing the contents of buffers.

Figures 3 and 4 capture their insight into the disk manager.  This dovetails with their discussion about efficiency of I/O, including observations about queue management (“shortest seek time first” order for requests, and then sub-sorted by “shortest latency time first”).  This is a clear “hat tip” to the impact that rotational latency and track seek time has on performance.

Speaking of performance, the authors discuss this.  It leads to their observations on improving I/O performance: “I/O operations out to proceed in parallel with computation”.  Their point is that serializing these things decreases overall performance.  Their second observation: “[T]he length of time an I/O operation takes should be reduced as much as possible.”  This seems logical and is one reason why they use their optimized strategy.

There is a section on “file system buffering” that touches on the tradeoffs between using memory for buffer caching versus other possible uses.  Thus, the authors evaluate how increased buffering impacts their CPU utilization – this is in keeping with their goal of parallelizing I/O and computation.  Their observation?  The greatest benefit comes from a small number of buffers, in their analysis eight buffers provides most of the benefit. Building on that Figures 6 and 7, they observe there is a clear limit to the benefit of further buffering.  These days we do not think too much about this because we tend to use virtual caches, so the amount of physical memory is really managed by the virtual memory management code, yet the observation would likely still apply.  There is a limit to the benefit of buffering.

The authors also point out that disk allocation is a challenging.  They employ allocation bit maps, cluster allocations, over-allocate, and even use simplistic predictive read-ahead.  They refer to these as “strategy” routines.

In general, this is a great introduction to the basic structure of a media file system.  There are plenty of details that will be refined in later work.