Category Archives: File Systems
Weighted Voting for Replicated Data
David K. Gifford, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 150-162, 1979.
I return to distributed systems. Previously I discussed a companion paper from the same conference (Polyvalues) that was essentially ignored in the subsequent literature. This paper, on the other hand, is well-cited and lays the groundwork for a quorum-based replicated data distribution scheme. I can see echoes of a more theoretical paper (“Crumbling Walls”) that I will review at some point in the future.
This work was done while Dave Gifford was at Xerox Palo Alto Research Center (Xerox PARC). At this point, the Xerox PARC team had been working to develop the personal computer. I’m not reviewing it, but another interesting paper from this time period is the Pilot paper (perhaps I should, I see it describes the file systems as large and flat). Thus, the author of this paper is describing an actual working system, not a theoretical model for how one might implement such a system.
The key to this algorithm is the concept of a quorum for replicated data:
In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r + w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file’s voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.
The “votes” assigned to each copy of the file are its weight. This model provides a good way of generalizing replicated data. It could describe a primary/secondary model, or shift the emphasis to ensuring that critical systems have copies of the data. The model even permits caching nodes that have no weight.
The key to this approach is that the read quorum is large enough that at least one copy within the read set will have the current data. This is accomplished by ensuring that the combined weight of the read quorum and the write quorum is larger than the total weight of all copies in the system. The challenge in a system like this is that choosing these values determines the reliability of the system in the face of failure. The author doesn’t go into great detail about the types of failures that can occur, but an obvious one is that one of the replicated copies becomes unavailable: a machine crashes. A more interesting one is where the network partitions, so that one group of replicas exists on one side of the partition and a different group exists in a separate partition.
The strategy outlined in this paper would permit at most one partition to proceed. The other partition (or partitions) could continue to make some level of progress if the read quorum limit is low enough, where “low enough” means there are at least that many readable copies available within the given partition.
For example, it may be sufficient for only a single replica to be available in order for the read quorum to be satisfied. In that case, it is consistent because the sum of the read quorum plus write quorum is greater than the number of votes in the system. In other words, it works because with the lowest possible read quorum a write requires recording the changes reliably on every (voting) replicated copy. Such a system provides strong guarantees, but won’t allow any progress when any of the nodes are down, since the write quorum requirement is so high.
Similarly, the other extreme is one in which the read quorum is equal to the number of votes in the system, so that the write quorum is just a single node. This does not seem like a very good option, given that it would cause all the data to become unavailable when any of the replicas became unavailable.
Thus, the pragmatic option here would be to have a distribution of weights and quorum. For example, if you have three replicas, each with the same weight (say 1 for this discussion) then a workable model is to insist on a read quorum of 2 and a write quorum of 2. In that way, a single failure will not prevent you from making forward progress, but if two nodes go down then the system can no longer make progress.
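To make the arithmetic concrete, here is a minimal sketch (my own Python, with hypothetical names; this is not code from the paper or from Violet) of how a weighted voting configuration can be checked and how a read quorum picks the current copy by version number:

```python
# Minimal sketch of Gifford-style weighted voting (hypothetical names, not the
# paper's implementation). Each replica has a weight (its votes) and a version
# number; a configuration is valid when r + w exceeds the total votes.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    votes: int
    version: int
    available: bool = True

def check_quorum_config(replicas, r, w):
    """Every read quorum must overlap every write quorum: r + w > total votes."""
    total = sum(rep.votes for rep in replicas)
    return r + w > total

def collect_quorum(replicas, needed):
    """Greedily gather available replicas until their votes reach the quorum
    threshold; return None if the quorum cannot be formed."""
    chosen, votes = [], 0
    for rep in replicas:
        if rep.available:
            chosen.append(rep)
            votes += rep.votes
            if votes >= needed:
                return chosen
    return None

def read_current(replicas, r):
    """A read quorum is guaranteed to contain at least one current copy:
    pick the member with the highest version number."""
    quorum = collect_quorum(replicas, r)
    if quorum is None:
        raise RuntimeError("read quorum unavailable")
    return max(quorum, key=lambda rep: rep.version)

# Example: three equally weighted replicas with r = w = 2 tolerate the loss of
# any single copy, matching the discussion above.
reps = [Replica("a", 1, 7), Replica("b", 1, 7), Replica("c", 1, 6, available=False)]
assert check_quorum_config(reps, r=2, w=2)
print(read_current(reps, r=2).name)
```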
The author describes the typical environment he envisions for this system: a network of personal computers, file servers, and even wide area networking. Xerox had the personal computers at that point, had defined networking protocols (XNS), and would, in cooperation with Digital and Intel, issue Version 1.0 of the Ethernet specification the following year (1980).
Much of the paper is in fact a fairly detailed description of the system that they had implemented (in Violet). Section 4 does provide insight into a variety of interesting and useful features of the system:
- “Weak representatives” – these are basically cached copies of the data; they do not have any voting rights. The author describes them as a performance optimization, and describes a way of marking the copy as invalid so it will need to be re-fetched before it can be used.
- Lock optimization – the author points out that they have an optimized lock scheme that permits updates which are compatible with read operations. This is consistent with the observation that as long as the ordering of write operations is preserved on persistent storage, write-back operations are permissible.
- Weak consistency – the original model was serial consistency but the author points out that some problems can be addressed with weaker consistency models. The author does not explore these weak models substantially, but merely mentioning them is indeed a useful insight.
- Object size – the model permits locking on the file level, so the object stored within the file should be “of suitable size”.
- Read lock breaking – if the file system permits breaking read locks as part of conflict resolution (rather than transaction abort) then object version numbers can change during the transaction; the change is detectable since the version number shifts.
- Dynamic reconfiguration – the author describes how additional replicas can be added (and presumably removed) or weights changed. In essence, he uses the same rules for updating the voting configuration data as for the underlying data itself. Thus, changes will be discovered by the time the read quorum has been satisfied.
- Replicated containers – the author explains how replication can be used with (mostly) the same interface as non-replicated storage (just with the benefits of being replicated!)
- Minimizing communications overhead – the author points out that releasing unneeded read locks prior to commit eliminates the need to communicate during commit processing.
- Background update – postponing replication can allow smoothing network utilization over time.
The replication policy is, at its heart, an early consensus protocol. While the author does not discuss this, the approach described does have some scalability challenges that will become apparent (and be addressed) in subsequent work. Overall, this work really does an amazing job of describing so many aspects of modern computer systems: networks, file servers, personal computers, wide area networks, redundancy, consistency, etc.
As We May Think
Vannevar Bush, The Atlantic, July 1945.
I saw this covered by Adrian Colyer recently and unabashedly decided I needed to cover it as well, not because I thought there was anything wrong with his coverage but rather because this speaks well to my own research interests. As Colyer points out, the concept of trails is something that seems to have been lost.
Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose. If the aggregate time spent in writing scholarly works and in reading them could be evaluated, the ratio between these amounts of time might well be startling. Those who conscientiously attempt to keep abreast of current thought, even in restricted fields, by close and continuous reading might well shy away from an examination calculated to show how much of the previous month’s efforts could be produced on call. Mendel’s concept of the laws of genetics was lost to the world for a generation because his publication did not reach the few who were capable of grasping and extending it; and this sort of catastrophe is undoubtedly being repeated all about us, as truly significant attainments become lost in the mass of the inconsequential.
Bottom line? We can’t find things. I recently described the Polyvalues paper from SOSP 1979. I commented on the fact that there seemed to be useful insights here that just disappeared.
The real heart of the matter of selection, however, goes deeper than a lag in the adoption of mechanisms by libraries, or a lack of development of devices for their use. Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.
This was one of those moments when I realized how deeply embedded our concept of hierarchical organization really is. It wasn’t invented by the operating systems of the 1960s; it was inherited from the more fundamental constraints of paper indexing. Indeed, reading this article has given me further insight into how deeply entrenched we are with hierarchical organization.
The first idea, however, to be drawn from the analogy concerns selection. Selection by association, rather than indexing, may yet be mechanized.
“The analogy” here relates to the associative mechanisms inherent in how humans recall information, which is described in the prior paragraph. The author describes “trails”. This evocative idea is similar to the modern field of data provenance. In data provenance, the focus is often on reproducibility rather than on finding things, yet there are intriguing similarities. I won’t explore this area further yet, but it is intriguing and may open up some new perspectives to explore.
All this is conventional, except for the projection forward of present-day mechanisms and gadgetry. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.
The memex is his hypothetical device for capturing and finding this information. At this point the author describes how to build such a system (more or less) using the technology of the day. The terminology is interesting, and it is quite telling that a 73-year-old article could describe modern systems so well.
At some point reading through this it occurred to me that in some ways we have built a system similar to what he describes: the internet. What it doesn’t do is construct the personalized model of inter-relationships between the various components of the system.
Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.
Doesn’t this sound like a web browser?
For me, the key take-away here is encouraging: my own goal of looking for alternatives to hierarchical file systems is not as crazy as it might first seem. It certainly does not mimic the way in which we tend to organize data, though I have also had people point out to me that nothing prevents us from constructing a higher level layer that can be used to achieve the same goal. Indeed, that has been done before and I will get to that at a later point.
Implementing Atomic Actions on Decentralized Data
David P. Reed, ACM Transactions on Computer Systems, Vol. 1, No. 1, February 1983, pp. 3-23.
This certainly must have been an interesting choice to be the first paper of the first ACM Transactions on Computer Systems. It is certainly an interesting work on concurrent activity within a distributed system. It relies upon a basic concept of ordering within a distributed system (“decentralized data”). He builds upon the basics laid down by Leslie Lamport in Time, Clocks, and the Ordering of Events in a Distributed System. While Lamport was concerned about defining the (partial) ordering of events in the distributed system, Reed is concerned about using that ordering to construct useful distributed data updates.
Given that file systems, especially distributed file systems, are concerned with managing data across nodes in a consistent fashion, this work is of particular importance. By 1983 we have seen the emergence of network file systems, which I plan on describing further in coming posts, but they are still fairly primitive. Database systems are further along in allowing distributed data and coordination through things like two-phase commit.
He starts by describing the goals of this work:
The research reported here was begun with the intention of discovering methods for combining programmed actions on data at multiple decentralized computers into coherent actions forming a part of a distributed application program.
His focus is on coordinating concurrent actions across distributed data and ensuring that failures are properly handled. What does it mean to properly handle failures? Essentially, it means that the data is in a consistent state once the system has recovered from the failure. He starts by defining terms that relate to consistency models. For example, he defines an atomic action as a set of operations that execute in different locations and at different times but cannot be further decomposed. An atomic action begins in a consistent state and ends in a consistent state, and any intermediate state of the system is not visible (what we would call “isolation” now). He formally defines these concepts as well.
He touches on the idea of consistency, in which one starts with a consistent system and then proves each (atomic) operation yields a consistent state. In my experience this aspect of distributed systems is sometimes skipped, often due to the complexity of doing the work required here. In recent years, formal proof methods have been used to automate some aspects of this. I’m sure I will touch upon it in later posts.
One key benefit of this system of atomic actions is that it makes things simpler for application programmers: in general, they need not deal with unplanned concurrency and failure. Indeed, that is one of the key contributions of this work: the process of reasoning about failure and how to handle it. Indeed, in my experience, handling failure gracefully is one of the substantial challenges inherent in constructing distributed systems. If something can fail, it will.
Achieving atomic action requires the ability to interlock (“synchronization”) against other actors within the system and the ability to gracefully recover from failure cases. The author goes on to describe what his decentralized system looks like: a message passing model (via the network, presumably), with nodes containing stable storage and the ability to read and write some finite sized storage unit atomically (“blocks”).
One class of failure the author explicitly disclaims: a situation in which the system performs an operation but ends up with a different but valid outcome. This makes sense, as it would be difficult to reason in the face of arbitrary changes each time a given operation were requested. He sets forth a series of objectives for his system:
(1) Maximize node autonomy, while allowing multisite atomic actions.
(2) Modular composability of atomic actions.
(3) Support for data-dependent access patterns.
(4) Minimize additional communications.
(5) No critical nodes.
(6) Unilateral aborting of remote requests.
Having laid this groundwork, the author then defines the tools he will use to achieve these objectives. This includes a time-like ordering of events, version information for objects, and the ability to associate intermediate tentative steps together (“possibilities”).
He envisions a versioned object system, where the states of the object correspond to changes made to the object.
At this point I’ll stop and make an observation: one of the challenges for this type of versioning is that the way in which we view objects can greatly complicate things here. For example, if we modify an object in place then this sort of versioning makes sense. However, if we modify an object by creating a new object, writing it, and then replacing the old object with the new object, we have a more complex functional model than might be naively envisioned. This is not an issue clearly addressed by the current paper as it relates mostly to usage. But I wanted to point it out because this sort of behavior will make things more difficult.
One of the important contributions in this work is the discussion about failure recovery. This is, without a doubt, one of the most complex parts of building a distributed system: we must handle partial failures. That is, one node goes offline, one network switch disappears, one data center loses power.
The author thus observes: “If a failure prevents an atomic action from being completed, any WRITE the atomic action had done to shared data should be aborted to satisfy the requirement that no intermediate states of atomic actions are visible outside the atomic action.” Thus, one benefit of the versioned objects is that the pending transaction (“possibility”) can track the updated versions. Abort simply means that the tentative versions of the objects in the transaction are deleted. Committing means that the tentative versions of the objects in the transaction are promoted to being the latest versions.
Thus, we see the basic flow of operations: a transaction is started and a possibility is created. Each potential change is described by a token, and tokens are then added to the possibility. While the term is not used here, it appears to be a model for what we refer to as write-ahead logging (sometimes also called intention logging).
Time stamps are introduced in order to provide the partial ordering of events or operations, so that the changes can be reliably reproduced. The author goes into quite an extensive discussion about how the system generates time stamps in a distributed fashion (via a pre-reservation mechanism). This approach ensures that the participants need not communicate in order to properly preserve ordering. The author calls this pseudotime. He continues on to explain how timestamps are generated.
Using his ordered pseudo-time operations, his read and write operations, possibilities, and tokens, he then constructs his distributed data system using these primitives. There is detail about how it was implemented, challenges in doing so and the benefits of immutable versions.
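To make the flow easier to visualize, here is a rough Python sketch (the names and structure are my own invention, not Reed’s interfaces) of versioned objects updated through a possibility, with a simple counter standing in for the distributed pseudotime generator:

```python
# Rough sketch of versioned objects updated through a "possibility": tentative
# versions are grouped as tokens and only become visible when the possibility
# commits. A plain counter stands in for the paper's distributed pseudotime.

import itertools

_pseudotime = itertools.count(1)

class VersionedObject:
    def __init__(self, value):
        self.versions = [(0, value)]   # committed (timestamp, value) pairs

    def read(self):
        return self.versions[-1][1]    # latest committed version only

class Possibility:
    """Groups the tentative writes (tokens) of one atomic action."""
    def __init__(self):
        self.tokens = []               # (object, timestamp, tentative value)

    def write(self, obj, value):
        self.tokens.append((obj, next(_pseudotime), value))

    def commit(self):
        # Promote every tentative version to a committed version.
        for obj, ts, value in self.tokens:
            obj.versions.append((ts, value))
        self.tokens.clear()

    def abort(self):
        # Tentative versions are simply discarded; nothing was ever visible.
        self.tokens.clear()

balance = VersionedObject(100)
p = Possibility()
p.write(balance, 90)
print(balance.read())   # still 100: the intermediate state is not visible
p.commit()
print(balance.read())   # now 90
```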
He admits there are serious issues with the implementation of this system: “For most practical systems, our implementation so far suffers from a serious problem. Since all versions of an object are stored forever, the total storage used by the system will increase at a rate proportional to the update traffic in the system. Consequently, we would like to be able to throw away old versions of the objects in the system. We can do this pruning of versions without much additional mechanism, however.” His discussion of why this may not be desirable is interesting as it discusses the tension between reclaiming things and slow transactions. He does describe a mechanism for pruning.
After this, the author turns his attention to using atomic transactions to construct new atomic transactions: nested transactions (also called “composable transactions,” so we have multiple terms for the same concept!) Again, this is proposed, but not explored fully.
The breadth and scope of this paper is certainly impressive, as the author has described a mechanism by which distributed transactions can be implemented. I would note there are no evaluations of the system they constructed, so we don’t know how efficient this was, but the model is an important contribution to distributed storage.
A Client-Based Transaction System to Maintain Data Integrity
William H. Paxton, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, 1979, pp. 18-23.
Last time I discussed the basis of consistency and locks. I did so because I thought it would help explain this paper more easily. We now move into the nascent world of sharing data over the network (what today we might call distributed systems). One of the challenges of this world is coordinating changes to information across multiple actors (the client and the server) under some sort of consistency requirement. The author here uses the term integrity (rather than consistency), but I would suggest the terms are effectively the same for our purposes:
Integrity is a property of a total collection of data. It cannot be maintained simply by using reliable primitives for reading and writing single units — the relations between the units are important also.
This latter point is important. Even given atomic primitives, any time we need to update related state, we need to have some mechanism beyond a simple read or write mechanism.
The definitions offered by the author are highly consistent with what we saw previously:
- The consistency property: although many clients may be performing transactions simultaneously, each will get a consistent view of the shared data as if the transactions were being executed one at a time.
- The atomic property: for each transaction, either all the writes will be executed or none of them will, independent of crashes in servers or clients.
Prior work had focused on having the server handle these issues. This paper describes how the client can accomplish this task.
There were some surprising gems here. For example:
If a locked file is not accessed for a period of time, the server automatically releases the lock so that a crashed client will not leave files permanently unavailable.
We will see this technique used a number of times in the future and it is a standard technique for handling client-side locks in network file systems. Thus, one novelty to the locks presented here is that they have a bounded lifetime. Thus, one of the services required from the server is support for this style of locking.
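As a hypothetical illustration of such bounded-lifetime locks (not the paper’s actual mechanism), the server-side check might look something like this:

```python
# Hypothetical illustration of a bounded-lifetime lock: if the holder has not
# touched the file recently, the server treats the lock as expired, so a
# crashed client cannot leave files permanently unavailable.

import time

class Lock:
    def __init__(self, holder, timeout=30.0):
        self.holder = holder
        self.timeout = timeout
        self.last_touched = time.monotonic()

    def touch(self):
        # Any access by the holder renews the lock's lifetime.
        self.last_touched = time.monotonic()

    def expired(self):
        return time.monotonic() - self.last_touched > self.timeout

lock = Lock("client-a", timeout=30.0)
print(lock.expired())   # False while the client keeps accessing the file
```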
The author then goes on to propose the use of an intention log (“intention file” in the paper). The idea is that the client computer locks files, computes the set of changes, records them in the intention log, and then commits the changes. Then the actual changes are applied.
To achieve this, it sets out six operations (a rough sketch of how they might fit together follows the list):
- Begin Transaction – this is what reserves the intention file. The author’s model only permits a single transaction at a time, and the other operations fail if a transaction has not been started.
- Open – this is what opens the actual data file. At this point the file header is read. If that header indicates the file is being used for a transaction, the file is recovered before the open can proceed.
- Read – as it suggests, this reads data from the file.
- Write – this prepares to write data to the file. Since this is a mutation, the new data must be recorded in the intention file; the actual file is not modified at this point.
- Abort Transaction – this causes all files to be unlocked and closed. No changes are applied. The transaction becomes “completed”.
- End Transaction – this is how a transaction is recorded and applied. The paper describes how this is handled in some detail. In all successful cases, the changes are applied to the server.
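Here is the promised sketch of how these operations might fit together on the client. The names and simplifications are mine; the paper’s actual interfaces, locking, and file-header handling are richer than this:

```python
# Minimal client-side sketch of the six operations (hypothetical names).
# Changes are recorded in an intention log and only applied at End Transaction.

class Transaction:
    def __init__(self):
        self.open_files = {}      # file name -> {block number: data}
        self.intentions = []      # (file name, block number, new data)
        self.active = True

    def open(self, name, contents):
        # In the real system the file header is checked here and the file is
        # recovered if a previous transaction left it mid-update.
        self.open_files[name] = contents

    def read(self, name, block):
        # Reads see this transaction's pending writes first.
        for fname, blk, data in reversed(self.intentions):
            if fname == name and blk == block:
                return data
        return self.open_files[name][block]

    def write(self, name, block, data):
        # The mutation goes to the intention log; the file is untouched.
        self.intentions.append((name, block, data))

    def abort(self):
        self.intentions.clear()   # nothing is applied
        self.active = False       # files would be unlocked and closed here

    def end(self):
        # Commit: apply the logged intentions to the actual files.
        for name, block, data in self.intentions:
            self.open_files[name][block] = data
        self.intentions.clear()
        self.active = False

t = Transaction()
t.open("ledger", {0: "old"})
t.write("ledger", 0, "new")
print(t.read("ledger", 0))   # "new" within the transaction
t.end()
```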
The balance of the paper then describes how crash recovery works. The simplifying assumptions help considerably here – this system only permits a single transaction to proceed at a time. Then the author moves on to a “sketch of correctness proof”. I admit I did find that humorous, as an attempt at formalism without actually invoking formalism achieves little more than warm-fuzzy feelings.
He wraps up by discussing some complex cases, such as file creation and the challenge of “cleaning up” partially created file state. Having dealt with this issue over the years, I can assure you that it can be complex to get it implemented correctly. He also discusses that some files might have multi-block meta-data headers and explains how those should be handled as well. He concludes with a discussion about handling media failure conditions, which are a class of failures that this system does not claim to protect against.
The CAP Filing System
R. M. Needham and A. D. Birrell, Symposium on Operating Systems Principles, 1977, Association for Computing Machinery.
I’ve fallen behind these past ten days, working towards a deadline. My own conference paper is now submitted and I’m working on recovering. What that means is this week is going to be a busy one as I work on catching up. Adding to the fun, FAST is going on this week (February 13-16, 2018). I hope to add some bonus discussions on the current work being presented there.
Let’s get to discussing this paper.
CAP is a capability-based system. Capability systems are a parallel idea that has been explored from time to time as an alternative to the access control model of Multics/UNIX-inspired operating systems. The latest contender in this space would be Fuchsia, an experimental operating system that Google is developing (though it is an open source project under a mixture of licenses). Interestingly, there are reports it runs on the Pixelbook in addition to the previous hardware it had supported.
At any rate, the idea behind a capability is to use identifiers that encapsulate the access rights within the capability itself. Thus, it is a name and the virtue of having the name is that it means you have the access inherent in that name.
These capabilities all represent to a program the right to have or do something: the preservation of information from one run of a program to another (a universal operating system requirement) is thus seen by the CAP programmer as the preservation of a capability rather than of an object itself.
The paper describes how capabilities are managed, which interacts with the “filing system” of course, since it has to store persistent information. They describe the “system internal name” (SIN), which is combined with a disk address to map memory segments to actual storage. The capability (in a “directory”) then becomes a repository of these persistent objects. This also creates a reference to the disk block that is in use, ensuring those storage regions are not reused.
One interesting characteristic of their system is that they do not guarantee free disk space will be recycled quickly (“[T]here is no guarantee that inaccessible disk space will be relinquished at the earliest possible moment.”) Indeed they note that space reclamation is only done when the system reboots and “[T]he filing system is not designed to run forever.”
They discuss how CAP differs from prior work (e.g., OS/360) where only the name matters; there is no concept of a directory for use as part of the capability system. The directory actually provides them an additional level of control as well, and they use directory capabilities as well as segment capabilities. Directories may be persistent (so they have a SIN and disk block location) or ephemeral (so they disappear when the program exits) – a sort of “built in” temporary storage concept for ephemeral memory management.
Sharing objects is now trivial – you simply tell someone the correct name; that embeds the capability within it. They do not describe the mechanisms used for transfer (“[B]y mechanisms which do not concern us in detail here.”)
They also describe how this name space is similar to other systems (UNIX and CAL-TSS) but different:
- Access is associated with the name which in turn references information in the directory. It is not an attribute of the file itself.
- The name space need not be a “strict hierarchy”. This means that portions could become disconnected, or even be private to a single application.
- Their use of directories behaves similarly to the model of “current directory” (presumably in UNIX) even though CAP expressly does not have a concept of current directory.
- Directories are not even directed acyclic graphs!
The paper describes how capabilities work, since they are a fine-grained control mechanism. They explain that the holder of an existing capability (a program) may generate a more restrictive capability to provide to another program. Since capabilities apply to both individual files as well as directories, it is possible for a program to transfer a set of capabilities by creating a new directory and storing the relevant capabilities to the target program.
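As an illustrative sketch (not CAP’s actual representation of capabilities), deriving a more restrictive capability amounts to intersecting rights while keeping the reference to the same underlying object:

```python
# Illustrative sketch of capability attenuation (hypothetical names, not the
# CAP system's representation): a holder can derive a capability with fewer
# rights, never more, and hand that to another program.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Capability:
    object_id: int        # stand-in for the system internal name (SIN)
    rights: frozenset     # e.g. {"read", "write", "enter"}

def restrict(cap, allowed):
    """Return a new capability whose rights are the intersection of the
    original rights and the allowed set; rights can only be removed."""
    return replace(cap, rights=cap.rights & frozenset(allowed))

segment = Capability(object_id=42, rights=frozenset({"read", "write"}))
read_only = restrict(segment, {"read"})
print(read_only.rights)   # frozenset({'read'})
```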
The names themselves can be quite ugly, as they incorporate capabilities within them. They describe some conventions they employ in terms of name management and directory placement, but point out these are conventions and not a hard requirement of the system.
CAP certainly presents a rather different model of a file system than we see in other systems. The idea of disconnected name spaces that are only visible to particular programs is an intriguing one. They use the example of the password database, which requires the program have the password file capability.
They discuss security. The directory manager is a separate module that programs use to interact with the directories to which they have access. To invoke the directory manager, the caller must have the ENTER capability.
I find this to be exactly the type of thought provoking paper I had hoped to find as I comb through these old papers. The idea that a file system name space need not be connected, that it could be private to a particular program or set of programs, and embedding access rights (“capabilities”) into the name will give me plenty to think about.
If you would like to know more about CAP there is a paper about it in the prior Symposium on Operating Systems Principles: The Cambridge CAP Computer and its Operating System. It is not too surprising that this is available from Microsoft Research, as they also built a capability based operating system (or two): Singularity and Midori.
A Principle for Resilient Sharing of Distributed Resources
Peter A. Alsberg and John D. Day, in Proceedings of the 2nd International Conference on Software Engineering, pp. 562-570, IEEE Computer Society Press, 1976.
Today I turn my attention to a paper that begins to explore the issues surrounding distributed systems. This paper sets forth some basic principles for distributed systems that are worth capturing and pointing out.
They state that a “resilient server” must have four attributes:
- It must be able to detect and recover from some finite number of errors.
- It must be reliable enough that a user doesn’t need to worry about the possibility of service failure.
- Failures just beyond the finite limit are not catastrophic.
- One user’s abuse should not have material impact on any other user.
These all seem somewhat reasonable, though I must admit that I found (3) a bit surprising, as it is not uncommon for some classes of failures to cause catastrophic failure. For example, when using erasure coding it is common for some number of failures to lead to catastrophic failure. Upon reflection, I realized that one could achieve this goal by simply setting the finite limit a bit lower, though I suspect this is not entirely what the original author had in mind.
Still, the general concept of resiliency is a good one to capture and consider. The authors point out some of the challenges inherent in this over a network, notably latency. “[A] major consideration when choosing a resource sharing strategy is to reduce, as much as possible, the number of message delays required to effect the sharing of resources.”
In other words, keep the messages to a minimum, and try not to make them synchronous. Asynchronous messaging systems will turn out to be rather complex, however, and sometimes there are just operations that require synchronous behavior.
Of course, there has to be a file system tie-in here (however tenuous) and there is! Under examples they list “Network Virtual File Systems which provide directory services and data access services…” Thus, it is clear that the authors are considering network file systems as part of their work.
In 1976 the authors indicate that the cost to send a message is on the order of 100ms, while the cost to acquire a lock on the CPU is 100 microseconds to 1 millisecond. While things are faster now, there is still a large difference between these two paths on modern systems. Thus, we will still be dealing with issues on the same scale.
The bulk of the paper then walks through the description of their model for providing resilient network services: an application host sends a request to a primary server host; that server host then communicates with a secondary server host. That secondary server host can send backup requests to a tertiary server, and so forth. The secondary host confirms with the primary host, and ultimately it is the secondary host that confirms the operation with the application host.
They cover a variety of important concepts, such as the idea that a host may need to record state about the operations, that operations cannot be applied until the other servers have received their messages, etc. This is, in essence, an early consensus protocol. While not efficient, the authors have laid groundwork in helping us think about how we construct such services.
I have included Figure 3 from the paper above. It shows the message flow through the system. The paper also discusses how to handle a number of failure situations and how messages flowing through the system keep track of which nodes are current.
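A toy rendering of that message flow (the names and structure here are mine, not the paper’s) might look like this:

```python
# Toy sketch of the resilient request path: the application talks to the
# primary, the primary forwards the request to the secondary, the secondary
# propagates backup copies down the chain and is the one that finally
# acknowledges the application.

class Host:
    def __init__(self, name):
        self.name = name
        self.state = []          # updates this host has recorded

def submit(request, application, primary, secondary, backups):
    primary.state.append(request)        # application -> primary
    secondary.state.append(request)      # primary -> secondary
    for backup in backups:               # secondary -> tertiary, and so on
        backup.state.append(request)
    print(f"{secondary.name} confirms {request!r} with {primary.name}")
    print(f"{secondary.name} acknowledges {request!r} to {application}")

submit("update x", "application-host",
       Host("primary"), Host("secondary"), [Host("tertiary")])
```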
It also touches on one of the most complex issues in distributed systems: network partition. Intriguingly, the authors do not assert that one partition must remain alive as they describe the decision being related to the specific requirements of the environment. Thus, in their model it would be possible to end up with partitions that continue forward but can no longer be easily re-joined after the network partition is resolved. Of course, they don’t require that both sides of a partition progress, though they do not discuss ways to handle this, either.
They assert that the primary/secondary/backup model is optimal in terms of the number of messages that it sends and in how it ensures the messages are strictly ordered. Then they briefly describe a few other possible arrangements that have multiple primaries and/or secondaries. Their final conclusion is that their original strategy is at least as good as the alternatives though this is not demonstrated in any formal way.
Now that they have their replication system, they view how it will work for the network virtual file system. They conclude that the highest levels of the hierarchy need to be stored on all nodes (which makes them the most expensive to maintain). They partition the name space below that and record location information within the nodes of the name space where the storage has been split out across hosts. Thus, we see a description of a global name space.
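A purely illustrative sketch of that global name space, with the replicated top level mapping each subtree to the host that stores it, might look like this:

```python
# Toy rendering of the global name space described above: the top of the
# hierarchy is replicated on every node, and each top-level entry records
# which host holds the subtree beneath it. Names are illustrative only.

REPLICATED_ROOT = {            # stored on all nodes, hence expensive to change
    "projects": "server-1",
    "users": "server-2",
}

def locate(path):
    """Resolve the first path component against the replicated root to find
    the host responsible for the remainder of the path."""
    top, _, rest = path.strip("/").partition("/")
    return REPLICATED_ROOT[top], rest

print(locate("/users/alice/notes"))   # ('server-2', 'alice/notes')
```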
Their system is simple, but they identify a number of the important issues. Their suggestion about sharding the file system’s name space is one that we shall see again in the future.
They have laid the groundwork for thinking about how to build distributed file systems.
Some Observations about Decentralization of File Systems
Jerome H. Saltzer, 1971.
This paper caught my eye because it leads in a different direction than the other file systems papers I’ve been looking at. Instead of talking about file systems on a single computer, it has the audacity of suggesting that maybe we want to have remotely accessible storage.
The author frames this in the context of networks. The first (of two) references in this paper is about networks and helps frame the conversation about “decentralized” file systems:
Computer network development to achieve resource sharing
Lawrence G. Roberts and Barry D. Wessler, AFIPS ’70 (Spring), Proceedings of the May 5-7, 1970, Spring Joint Computer Conference, pp. 543-549.
The authors’ affiliation is the Advanced Research Projects Agency (ARPA), and what they describe in this paper is the ARPA Network. I don’t want to get too far into this, but I did want to include this wonderful bandwidth/cost chart – something I have definitely seen in various guises since then.
In this time frame, they are discussing the creation of a network for sharing data across geographically dispersed sites, including sites scattered around the United States. Connection speeds in this time frame are 50 Kb/s. The paper estimates the entire bandwidth consumption of the network will be between 200-800 Kb/s by mid-1971, with twenty nodes connected to the entire network.
The point of this diagram is to discuss costs, where it points out that the least expensive way to move large quantities of data is to encode it on a magnetic tape and send it in the mail (air mail, of course).
Why is this important? Because it helps set the stage for the conversation about resource sharing. This is before Ethernet exists (Version 1.0 of the Ethernet specification appears in 1980). Thus, the networks that do exist are hardware specific. The amount of data being moved is remarkably small by modern standards.
This is, however, where we start considering what is involved in supporting network file systems – decentralized systems of storage that can communicate with one another.
The author stakes out his position in the abstract:
This short note takes the position that the inherent complexity of a decentralized and a centralized information storage system are by nature essentially the same.
This defines the starting point of what will be a decades-long conversation on this fundamental issue. The author argues that in fact the real issue is one of engineering, not models:
The consequence of this claim, if substantiated, is that the technical choice between a centralized or decentralized organization is one of engineering tradeoffs pertaining to maintainability, economics, equipment available, and the problem being solved, rather than one of functional properties or fundamental differences in complexity.
The discussion then points out that in some cases, adding a 20-40 millisecond network delay on top of the usual 20-50 millisecond disk delay is not dramatically different. They explore other areas where the timing might make a substantial difference. Intriguingly, they discuss a form of disaggregation – where they envision compute being in one location, memory in another, and storage in yet another. They point out that this turns back into a computer (albeit one with long latency to access memory, for example).
They then discuss caching (“buffer memory”) of information, but point out that the challenge is now that the system has multiple copies of what need to be the same data – it “has the problem of systematic management of multiple copies of information”. Surprisingly, they make a leap here equating this situation between centralized and decentralized systems: this is a problem of modeling access patterns to shared information and then inventing algorithms for “having the information on the right storage device at the right time”!
With decades of hindsight, this looks surprisingly naive. Indeed, even the author constructs a clear path for retreat here: “… this is not to say that there are no differences of significance”.
They conclude by opining that storage is challenging:
The complexity is the inevitable consequence of the objectives of the information storage system: information protection, sharing, privacy, accessibility, reliability, capacity, and so on. What is needed in both cases is a methodical engineering basis for designing an information storage system to specification. On the other hand a decentralized organization offers the potential for both more administrative flexibility, and also more administrative chaos.
What surprised me about this paper is that the issues around sharing data between computers were already being considered at this point. The idea that network file systems and local file systems really should be “more or less” the same is thus planted early. I’m sure we will see divergence on this point as we continue moving forward.
For now, we have a base upon which to build this new direction of “decentralized” file systems.
The UNIX Time-Sharing Operating System
Dennis M. Ritchie and Ken Thompson, Bell Labs, Communications of the ACM, July 1974, Volume 17, Number 7, pp. 365-375.
This paper describes Version 3 of UNIX. It was Version 6 that became the basis of the Berkeley Software Distribution (BSD) version of UNIX. The only other operating system in CS history to date that has had so much impact on operating systems development is MULTICS, and UNIX is a direct descendant. Developed by several Bell Labs researchers who had been involved in MULTICS, its goal was to try and build a smaller operating system that retained what they viewed as the key benefits of MULTICS.
Much of UNIX Version 3 was written in the C programming language, itself derived (via B) from BCPL, a language that had been used on the MULTICS project.
“The most important job of UNIX is to provide a file system.“
These words leave little doubt about the role of file systems in UNIX and the importance assigned to them. The paper then goes on to describe files, in similar terms to MULTICS: files, directories, and special files (devices). We see the hierarchical file system of MULTICS reflected back in the description of the system. They talk about file names being 14 characters or less, the formation of paths, and the iterative walk of names through the file system name space to find other directories, as well as the terminal file nodes.
“The directory structure is constrained to have the form of a rooted tree.“
This is what I am looking for – the why of hierarchical file systems. I found the answer here, unsurprising yet ironically disappointing:
“The reason for this is to simplify the writing of programs which visit subtrees of the directory structure, and more important, to avoid the separation of portions of the hierarchy.“
Not surprising, yet not precisely what I had expected. I had expected the reason to be the simplicity of the operating system (though they do allude to this by discussing the difficulty of knowing when it is safe to delete a directory). They describe links to files, however, so their file system is not really a tree. Rather, it is more like a directed acyclic graph (DAG). Files do not have pointers back to their directories, but directories have pointers back to their parent. Thus we have the distinction: the namespace is a DAG. Files don’t really live in the name space directly; they are referenced from the namespace and carry a reference count.
Oddly, with that, I found what I came for, at least in terms of insight for my own research area. There is a certain satisfaction in being able to point to this seminal document and say “this is why we got to where we are now.”
But if I stopped at this point, I would be leaving out the bits I had not expected to find.
First, the mundane: they discuss removable file systems and the fact that this is in fact a collection of name spaces, combining persistent name spaces with one another using a non-persistent mechanism (mounting). There is a simple description of how the file system is itself implemented. They describe the i-number (now the inode number) as an index into the file table. Thus, a directory entry is where the name lives; it merely refers to the file using its i-number. The file table entries are called i-nodes. The i-node contains information about the owner of the file, the protection bits governing the file, the location information for where logical data is physically stored on the medium, the size of the file, its timestamps, its attribute bits, and the number of directory entries referencing the given i-node.
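As a schematic rendering (in Python, purely for illustration; the real structures are in C and differ in detail), the split between the directory entry and the i-node looks roughly like this:

```python
# Schematic rendering of the split the paper describes: the directory entry
# holds just the name and i-number, while the i-node holds everything else
# about the file. Field names here are my own, for illustration only.

from dataclasses import dataclass
from typing import List

@dataclass
class DirectoryEntry:
    name: str            # at most 14 characters in this era of UNIX
    i_number: int        # index into the file table (the i-list)

@dataclass
class INode:
    owner: int
    protection: int              # permission bits
    block_addresses: List[int]   # where the file's data lives on the device
    size: int
    timestamps: dict
    link_count: int              # directory entries referencing this i-node

# Two directory entries can reference the same i-node: that is a link, and it
# is why the namespace is a DAG rather than a strict tree.
itable = {7: INode(owner=0, protection=0o644, block_addresses=[120, 121],
                   size=680, timestamps={}, link_count=2)}
a = DirectoryEntry("notes", 7)
b = DirectoryEntry("notes.bak", 7)
print(itable[a.i_number] is itable[b.i_number])   # True
```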
Surprisingly, not all that different than it is now, 45 years later. Implementation details have changed, as we no longer limit files to 10MB in size.
They describe buffering, and they describe sector-sized I/O and how it is more efficient for a program to do sector-sized I/O operations.
Much of the paper has nothing to do with file systems; I leave it to the interested reader to explore beyond that.
There are two interesting tidbits remaining:
- They lost data once, on a hard disk that failed. The backup was 3 days old.
- They considered the permuted index application as one of the “major programs available”.
The fact they considered the permuted index important at this early stage was an interesting insight to me. Clearly, the ability to “find our stuff” is one that’s been around since the dawn of time. Maybe this research direction of mine does make sense.