Monthly Archives: March 2018

The Cambridge File Server

Jeremy Dixon, in ACM SIGOPS Operating Systems Review,  Volume 14, Number 4, pp 26-35, 1980, ACM.

Cambridge was certainly a hotbed of systems work in the 1970s (not to say that it still is not).  They were looking at very different architectures and approaches to problems than we saw from the various Multics influenced systems.

The introduction to this paper is a testament to the vibrant research work being done there.  The author points to the Cambridge Ring, which was their mechanism for implementing a shared computer network and a precursor to the Token Ring networks that followed.  The CAP computer was part of this network, and the network included a separate computer that had a vast amount of storage for the time: 150MB.  That common storage was used for both “filing systems” and “virtual memory”.  This computer ran the Cambridge File Server and implemented the functionality that was explored in the WFS paper.

They identify key characteristics of their file server:

  • Substantial crash resistance.
  • Capabilities used to control access.
  • Atomic file updates.
  • Automatic “garbage collection” of storage space
  • Fast transfer to random access, word-addressable files.

The author makes a point of noting that there are only two classes of objects in the system: files and indices.  I found this interesting because it echoes the hierarchical file system models that encouraged me to start this journey in the first place.

They define a file: “… a random access sequence of 16-bit words whose contents can be read or written by client machines using the following operations”.  The operations that follow are read and write.  They go on to define an index: “… a list of unique identifiers, and is analogous to a C-list in capability machines”.  The three operations here are: preserve, retrieve, and delete.  These permit entries to be added, found, and removed.

The storage controlled by the file server thus appears to its clients as a directed graph whose nodes are files and indices.  Each file or index operation is authorised by quoting the object’s unique identifier to the file server, and UIDs are 64 bits long with 32 random bits. Each client, therefore, can access only some of the nodes in the graph at any time, namely those whose UIDs he knows, and those whose UIDs can be retrieved from accessible indices.
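To make the two object classes concrete, here is a minimal in-memory sketch in Python.  The class name `ToyFileServer`, the slot-numbered index entries, and the UID layout (a serial number in the high 32 bits, random bits in the low 32) are all my own assumptions for illustration; none of these details are taken from the paper.

```python
import secrets

WORD_MASK = 0xFFFF  # files are sequences of 16-bit words


class ToyFileServer:
    """Toy model: files and indices, each reachable only by quoting its UID."""

    def __init__(self):
        self._files = {}    # UID -> list of 16-bit words
        self._indices = {}  # UID -> {slot number: UID}
        self._next_serial = 1

    def _new_uid(self):
        # Assumed layout: serial number in the high half, 32 random bits below.
        uid = (self._next_serial << 32) | secrets.randbits(32)
        self._next_serial += 1
        return uid

    def create_file(self):
        uid = self._new_uid()
        self._files[uid] = []
        return uid

    def create_index(self):
        uid = self._new_uid()
        self._indices[uid] = {}
        return uid

    # File operations: read and write.
    def write(self, uid, offset, words):
        data = self._files[uid]  # an unknown UID raises KeyError: no access
        shortfall = offset + len(words) - len(data)
        if shortfall > 0:
            data.extend([0] * shortfall)
        for i, word in enumerate(words):
            data[offset + i] = word & WORD_MASK

    def read(self, uid, offset, count):
        return self._files[uid][offset:offset + count]

    # Index operations: preserve, retrieve, and delete.
    def preserve(self, index_uid, slot, uid):
        self._indices[index_uid][slot] = uid

    def retrieve(self, index_uid, slot):
        return self._indices[index_uid][slot]

    def delete(self, index_uid, slot):
        del self._indices[index_uid][slot]
```

A client that holds only the UID of one index can reach exactly the objects whose UIDs are preserved in it (and, transitively, in any indices found there), which is the directed graph described above.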

Thus, they actually have a graph file system that may in fact consist of nodes that are not connected – essentially a pool of disconnected trees that can be traversed if you know how to find the tree, but is effectively hidden otherwise.  They do point out that the sparse space may not be sufficient protection (though I suspect a small finite delay on an invalid lookup would discourage brute force browsing).
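A quick back-of-the-envelope calculation supports that suspicion.  The sketch below assumes an attacker already knows everything about a UID except its 32 random bits, and that the server imposes a fixed delay on each invalid lookup; the 10 millisecond delay is purely a hypothetical figure of mine, not something the paper proposes.

```python
RANDOM_BITS = 32
SECONDS_PER_DAY = 86_400


def expected_search_days(delay_seconds, random_bits=RANDOM_BITS):
    """Approximate expected time, in days, to guess one UID's random bits."""
    # On average about half of the space is searched before a hit.
    expected_guesses = 2 ** random_bits / 2
    return expected_guesses * delay_seconds / SECONDS_PER_DAY


# With a hypothetical 10 ms penalty per failed lookup:
print(round(expected_search_days(0.01)))  # about 249 days
```

Under those assumptions a single successful guess takes most of a year on average, so even a small delay makes casual browsing of the UID space impractical.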

Objects are deleted when they cannot be found from some distinguished root index; the paper describes that each client is given its own entry in the root index, pointing to the client specific index.  There is the implication that they will scan the storage looking for such unreferenced objects that can be cleaned up and indeed they refer to a companion paper for a detailed description of this garbage collector.

Their argument for omitting explicit object deletion is that it relieves the client of the burden of managing object lifetimes (“… removes from the clients the burden of deciding when to delete an object…”).
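The reachability rule can be sketched as a generic mark-and-sweep pass.  This is only an illustration of the idea, not the collector from the companion paper; the function names and the dictionary representation of indices are assumptions of mine.

```python
def reachable(indices, root_uid):
    """Return every UID reachable from the root index.

    `indices` maps an index UID to the UIDs it holds; any UID that is
    not a key of `indices` is treated as a file (a leaf of the graph).
    """
    seen = set()
    stack = [root_uid]
    while stack:
        uid = stack.pop()
        if uid in seen:
            continue  # already visited; this also copes with cycles
        seen.add(uid)
        stack.extend(indices.get(uid, ()))
    return seen


def garbage(all_uids, indices, root_uid):
    """Return the UIDs that can no longer be found from the root index."""
    return set(all_uids) - reachable(indices, root_uid)


indices = {
    "root": ["client_a"],
    "client_a": ["file_1", "client_a"],  # an index may refer to itself
    "orphan_index": ["file_2"],          # nothing refers to this index
}
everything = {"root", "client_a", "file_1", "orphan_index", "file_2"}
print(sorted(garbage(everything, indices, "root")))  # ['file_2', 'orphan_index']
```

Note that a client never asks for `file_2` to be deleted; it becomes garbage simply because the last index entry leading to it has gone.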

Storage space is segregated into “data” and “map” blocks.  The data blocks contain object contents.  The map blocks contain meta-data. New files are stored as a single data block. As the file grows in size, map blocks are inserted to create a tree up to three levels deep.

The paper then turns its attention to the atomic nature of the updates to the file server.  The author points out that moving from consistent state to consistent state may require multiple distinct changes. Since failures can interrupt you between any two operations, the discussion revolves around ways in which this can be robustly implemented in atomic and recoverable fashion.  The author points out that protecting against this class of failures carries substantial overhead.  Given that not all files require this level of robustness, he proposes that the file server provide two separate classes of service for data files.  Map blocks are always maintained in consistent fashion because they hold the file server’s meta-data, and the consistency of the file server’s control information needs to be preserved.

Much of the detail in the paper at that point involves describing the structure of the meta-data and how it is used to implement atomic operations on the file server.  The paper provides a detailed description of how transactions are implemented within this system.  The fact that they describe implementing a complete transactional file system, discuss the ramifications of providing user level transactional data storage, and come up with a hybrid model does make this an impressive piece of early work.  We will see journaling file systems more than once as we move forward.
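The general shape of an atomic multi-block update can be sketched as follows.  To be clear, this is a generic shadow-copy illustration and not the mechanism the paper describes; the class `AtomicFile` and its method names are hypothetical.

```python
class AtomicFile:
    """Toy file whose multi-block updates become visible all at once."""

    def __init__(self, blocks):
        self._committed = tuple(blocks)  # the last consistent state
        self._pending = None             # working copy of an open update

    def begin(self):
        self._pending = list(self._committed)

    def write_block(self, number, data):
        self._pending[number] = data     # touches only the working copy

    def commit(self):
        # A single reference assignment stands in for the one atomic step
        # that moves the file from one consistent state to the next.
        self._committed = tuple(self._pending)
        self._pending = None

    def crash(self):
        # A failure before commit simply discards the working copy.
        self._pending = None

    def read_block(self, number):
        return self._committed[number]


f = AtomicFile(["a", "b"])
f.begin()
f.write_block(0, "A")
f.crash()                 # interrupted between two writes
print(f.read_block(0))    # a
```

The copy made in `begin` hints at where the overhead comes from, which is why offering a cheaper, non-atomic class of service for files that do not need this protection is attractive.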

The balance of the paper discusses how this has worked within their systems at Cambridge.   It is interesting and they tie some of the implementation efficiency to the environment of the Cambridge Ring itself.  This is a production file server and the author notes that it is used by a variety of computers (including different operating systems) within their environment successfully.

Its relatively quick response has allowed it to be used to record and play back digitised speech in real time.  The interface provided seems both simple and suitable for a variety of purposes.

Impressive indeed.

WFS: A Simple Shared File System for a Distributed Environment

Daniel Swinehart, Gene McDaniel, and David Boggs, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 9-17, 1979, ACM.

This file system was developed at Xerox’s Palo Alto Research Center (PARC), which produced a string of amazing advances in the nascent computer technology area in the 1970s.

Woodstock was “an early office system prototype”.  The authors’ description of Woodstock sounds much like early word processing systems, such as those pioneered by Wang Laboratories in the same time frame.  The ability to share data between these systems turns out to be surprisingly important.  Local storage space was used to track the current work, while centralized storage provided an efficient way to store documents and make the work available to others.

This is the environment that gave rise to WFS.  Because Woodstock already existed and provided its own hierarchical document directory structure, WFS did not need to provide such a mechanism.  In fact, WFS only provided four classes of operations:

  • I/O operations to read and write blocks of data within files
  • Creating/Destroying resources: file identifiers (FIDs) and storage blocks (pages)
  • Managing file properties, including page allocation data
  • Providing maintenance functions

The actual implementation is surprisingly simple.  Indeed, the authors state that it took two months to build it.

Figure 1 (from the original paper) describes the format of a request/response packet, showing the basic information exchange model. It is interesting to note that the entire message fits within a small amount of memory and includes an end-to-end checksum.

There are a number of simplifying options with WFS:

  • The namespace for files is flat; there is no hierarchical structure.
  • The file structure is simple (Figure 2).
  • The protocol is stateless and each operation is idempotent.  This simplifies error handling since a lost message can be re-transmitted safely, with no fear that  repeating it will cause problems.
  • Operations are client initiated.  The server never initiates an operation.
  • Clients have limited mutable state.  The server does not permit changing its own state directly from the client.
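The value of idempotent, client-initiated operations is easy to see in a small simulation.  The request tuple format, `PageServer`, and `send_with_retry` below are hypothetical names of mine, not the WFS packet layout or API; the point is only that re-sending a request whose reply was lost leaves the server in the same state.

```python
class PageServer:
    """Toy server: every request carries all the context it needs."""

    def __init__(self):
        self.pages = {}  # (file id, page number) -> page contents

    def handle(self, request):
        op, fid, page_no, data = request
        if op == "write":
            self.pages[(fid, page_no)] = data  # same result if repeated
            return ("ok", None)
        if op == "read":
            return ("ok", self.pages.get((fid, page_no)))
        return ("error", None)


def send_with_retry(server, request, reply_lost, max_tries=5):
    """Re-send the identical request until a reply gets through.

    `reply_lost(attempt)` simulates the network: it returns True when the
    reply to that attempt is dropped, so the client must retransmit.
    """
    for attempt in range(1, max_tries + 1):
        reply = server.handle(request)  # the server may execute it again
        if not reply_lost(attempt):
            return reply, attempt
    raise TimeoutError("no reply after %d attempts" % max_tries)


server = PageServer()
reply, tries = send_with_retry(server, ("write", 7, 0, b"hello"),
                               reply_lost=lambda attempt: attempt < 3)
print(reply, tries, server.pages)
```

The write is executed three times here, yet the server ends up holding exactly one copy of the page, so the client needs no sequence numbers or session state to recover from the loss.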

This simplicity does limit the generality of WFS, but it also demonstrates an important abstraction that we will see used (and re-used) in subsequent systems: a file can be treated as a block structured device (a “disk”) in an interesting and transparent fashion.

Figure 3 describes the layout of the (stateless) data exchange format used by WFS.

Figure 4 shows the layout of the file directory table which is a contiguous and fixed-size region on disk at a known disk location.  This is a fairly common characteristic of on-disk file system formats, having a known location where meta-data is to be found.

Note that Figure 4 also shows how the file’s allocated storage is described via direct and indirect block references organized into a tree structure.  Again, this will be a recurring model that occurs in file systems; it combines the flexibility of supporting efficient space utilization, ability to describe variable sized files, and efficient utilization of block-addressable storage.
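As an illustration of how direct and indirect references describe files of varying size, the sketch below maps a page number to the path of references needed to reach it.  The constants `N_DIRECT` and `PTRS_PER_BLOCK` are hypothetical values I picked for the example; they are not the WFS on-disk parameters.

```python
N_DIRECT = 8          # hypothetical: page references held in the directory entry
PTRS_PER_BLOCK = 128  # hypothetical: references that fit in one indirect block


def locate(page_no):
    """Return the path of references leading to a page of a file."""
    if page_no < 0:
        raise ValueError("page number must be non-negative")
    if page_no < N_DIRECT:
        return ("direct", page_no)
    page_no -= N_DIRECT
    if page_no < PTRS_PER_BLOCK:
        return ("indirect", page_no)
    page_no -= PTRS_PER_BLOCK
    if page_no < PTRS_PER_BLOCK ** 2:
        # First pick an indirect block, then a slot within that block.
        return ("double-indirect",
                page_no // PTRS_PER_BLOCK, page_no % PTRS_PER_BLOCK)
    raise ValueError("page number beyond the largest describable file")


print(locate(3))    # ('direct', 3)
print(locate(8))    # ('indirect', 0)
print(locate(136))  # ('double-indirect', 0, 0)
```

Small files pay for no indirect blocks at all, while large files cost at most a couple of extra page reads per access, which is the trade-off that makes this layout so persistent in later file systems.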

This simple mechanism permits their clients to utilize a flexible storage mechanism without forcing the file server to support any of the mechanisms the client already provides, such as  the hierarchical document name space, management of documents and their structure, etc.  This separation of concerns yields an elegant and simple implementation model for their file server.

There are some interesting implementation details described in the paper:

  • Write operations are validated by reading the data page.  Thus, writes become compare-and-swap operations that prevent concurrent access from inadvertently overwriting changes made by another client.  It would be rather inefficient to rely upon this mechanism, but it helps guard against out-of-order packet processing in an unreliable network.  The downside is that they must read the data before they can write it.
  • They use a write-through cache.  Thus, the cache is really for read efficiency, not write efficiency.  This should help mitigate the write inefficiency.
  • Most of their caching is done against meta-data pages (“auxiliary disk pages”) because they are more frequently accessed than client data pages.
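A write-through cache is simple enough to sketch in a few lines.  The class below is a generic illustration with hypothetical names, and it omits eviction entirely (a real cache would be bounded); it shows only why such a cache helps reads but not writes.

```python
class WriteThroughCache:
    """Toy page cache: reads may hit memory, writes always reach the disk."""

    def __init__(self, disk):
        self.disk = disk       # any dict-like store standing in for the disk
        self.cache = {}
        self.disk_reads = 0
        self.disk_writes = 0

    def read(self, key):
        if key not in self.cache:
            self.disk_reads += 1
            self.cache[key] = self.disk[key]
        return self.cache[key]

    def write(self, key, value):
        # Write-through: the disk is updated before the call returns,
        # so a crash never loses an acknowledged write.
        self.disk[key] = value
        self.disk_writes += 1
        self.cache[key] = value


disk = {"page-1": b"old"}
cache = WriteThroughCache(disk)
cache.read("page-1")
cache.read("page-1")            # served from memory
cache.write("page-1", b"new")   # goes straight to the disk
print(cache.disk_reads, cache.disk_writes, disk["page-1"])
```

Because every write costs a disk operation regardless of the cache, the benefit shows up in the read that precedes a validated write and in repeated access to hot meta-data pages.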

Here’s one of the interesting performance results: “In the single-user (lightly loaded) case, WFS improved Woodstock’s average input response time over the local disk’s time for several reasons: WFS’s disks were faster than Woodstock’s local disks, requested pages were sometimes still in the WFS main memory cache, and the amount of arm motion on the local disk was reduced because it no longer had to seek between a code swap-area and the user data area.”

Accessing data over the network was faster than the local disk drive!  Whether this is a statement of how slow disks were versus networks I leave as an exercise to the reader.  One thing we can take away from this: the network often does not impose a significant bottleneck to utilizing remote storage (except, of course, when it does.)

The authors follow up their implementation description with an explanation of their design philosophy.  They emphasize the atomic nature of the operations they support, as well as the following properties:

  • Client initiated operations can only access one data page and “a few” auxiliary disk pages.
  • Operations are persistent before WFS returns status to the client.
  • WFS commands are a single internet packet.
  • The WFS protocol is stateless.

They then explain the rationale for these decisions, which relate to simplifying the protocol and server side implementation.

They delve into how clients might use WFS in Section 4.  One explicit take-away here is that they view these “files” as acting like “virtual disks” and this permits the WFS clients to implement their own abstraction on top of the WFS-provided services.  Because WFS doesn’t assume any specific structure for the client data, there is no burden placed upon those client implementations – though they admit at one point that this complicates the client.

The authors are able to point to other systems that utilize WFS besides Woodstock.  They cite Paxton’s system (A Client-Based Transaction System to Maintain Data Integrity) as being based upon WFS.

The paper discusses security and privacy considerations, admitting that the system does not address these issues and suggesting various techniques for addressing security using encryption and capabilities.  They round out this section of the paper by discussing other possible enhancements to WFS.

In the end, they provided a simple model for a network file server that permitted a client to implement a range of solutions. As we begin looking at more network file systems, we will see this model extended in various ways.


A Universal File Server

A. D. Birrell and R. M. Needham, in IEEE Transactions on Software Engineering, Vol SE-6, No. 5, September 1980, pp. 450-453.

One of the challenges in this next group of papers is picking which ones to discuss. The advent of networks saw the blossoming of the idea of centralizing storage and having different computer systems access it via those networks.  By the time this paper was published, quite a few network-based file server solutions had been constructed and described within the literature, and we will get to them.

The authors here step back from the individual systems and try to extract what is general from these works.

This is a rather short paper – four pages.

The authors describe the division of responsibilities in a file server: the “high-level functions more properly associated with a filing system” and “functions belonging to a backing store server” [emphasis in the original].  When I read this I thought that this made sense: we have a functional layer that creates a name space, attributes, etc. and a storage layer that keeps track of storage blocks.

Figure 1

By splitting out this functionality, the authors then suggest that the backing store server is a point of commonality that can be used to support a range of higher level services.  Thus, “[t]he backing store server is the sole agency concerned with allocating and relinquishing space on the storage medium.”  To achieve this goal the authors propose a universal system of indexes as shown in Figure 1 (from the original paper).

The authors argue for a master table that presents the per-system name space.  For each of these namespaces, there is a corresponding master file directory (MFD) and a collection of user file directories (UFDs) that are used to organize the user’s information into a hierarchy.

We note that the files, UFDs and MFD are all stored in storage elements – segments – that are managed by the file server.   Thus the file server is responsible for:


  • Keeping track of its initial index
  • Preserving the names stored in the MFD and UFDs
  • Reclaiming (deleting) the entries in the MFD and UFDs when an entry is deleted
  • Managing storage space

From this simple model, they note that a broad range of systems can be constructed.

The paper spends considerable time (about 25% of the paper) discussing “protection”.  By this they refer to the issues inherent in having shared usage of a common resource, such as the files on the file server.  The authors describe using ACLs on the file server as one means of providing security.  They do not touch upon precisely how the file system will authenticate the users, though at one point they refer to using encryption for access bits in some circumstances.

Their preferred mechanism for access is the capability.  This should not come as a surprise, given that they worked on the CAP file system, which provided capabilities.  Their observation is that with a sufficiently sparse handle space, it is impractical for an unauthorized party to find the resource.  It probably doesn’t require much to point out that this presumes the inherent integrity of the network itself.

The authors complete their universal file server with an observation that this provides a general base upon which individual file systems can implement their own enhanced functionality. Indeed, this was one of their primary objectives in doing this work.  They do point out a number of potential issues in their system, but assert that they will not be problematic.

The authors do a good job of describing a basic, abstract file server.  The system they describe may not have achieved broad use but this paper does provide a simple, high level view of how a file server might operate.  We’ll turn our attention to actual implementations – and there are many such implementations to discuss in the coming posts.


Weighted Voting for Replicated Data

David K. Gifford, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 150-162, 1979.

I return to distributed systems.  Previously I discussed a companion paper at the same conference (Polyvalues) that was essentially ignored in the subsequent literature.  This paper, on the other hand, is well-cited and lays the groundwork for a quorum-based replicated data distribution scheme.  I can see echoes of a more theoretical paper (“Crumbling Walls”) that I will review at some point in the future.

This work was done while Dave Gifford was at Xerox Palo Alto Research Center (Xerox PARC).  At this point, the Xerox PARC team had been working to develop the personal computer.  I’m not reviewing it, but another interesting paper from this time period is the Pilot paper (perhaps I should, I see it describes the file systems as large and flat).  Thus, the author of this paper is describing an actual working system, not a theoretical model for how one might implement such a system.

The key to this algorithm is the concept of a quorum for replicated data:

In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r + w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file’s voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.

The “votes” assigned to each copy of the file are its weight.  This model provides a good way of generalizing replicated data.  It could describe a primary/secondary model, or shift the emphasis toward ensuring that critical systems have copies of the data.  The model even permits caching nodes that have no weight.

The key to this approach is that the read quorum is set up so that it is large enough that at least one copy within the read set will have the current data.  This is accomplished by ensuring that the combination of read quorum and write quorum represents a number (weight) that is larger than the total sum of all weights within the system.  The challenge in a system like this is that choosing these values will determine the reliability of the system in the face of failure.  The author doesn’t go into great detail about the types of failures that can occur, but an obvious one is that one of the replicated copies becomes unavailable: a machine crashes.  A more interesting one is where the network partitions so that one group of replicas exists on one side of the partition and a different group exists in a separate partition.
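The interplay of quorums and version numbers can be sketched in a few dozen lines.  This is a deliberately simplified, single-threaded illustration: the names `Replica`, `gather`, `quorum_read`, and `quorum_write` are mine, and the sketch leaves out the locking and transaction machinery that the real system relies on.

```python
class Replica:
    def __init__(self, name, votes):
        self.name = name
        self.votes = votes
        self.version = 0
        self.value = None
        self.up = True


def gather(replicas, needed):
    """Collect available voting replicas until `needed` votes are reached."""
    chosen, total = [], 0
    for rep in replicas:
        if rep.up and rep.votes > 0:
            chosen.append(rep)
            total += rep.votes
            if total >= needed:
                return chosen
    return None  # not enough votes are reachable


def quorum_read(replicas, r):
    quorum = gather(replicas, r)
    if quorum is None:
        raise RuntimeError("read quorum unavailable")
    # Because r + w exceeds the total votes, at least one member of any
    # read quorum took part in the latest write; it has the highest version.
    current = max(quorum, key=lambda rep: rep.version)
    return current.version, current.value


def quorum_write(replicas, r, w, value):
    version, _ = quorum_read(replicas, r)  # learn the current version first
    quorum = gather(replicas, w)
    if quorum is None:
        raise RuntimeError("write quorum unavailable")
    for rep in quorum:
        rep.version = version + 1
        rep.value = value


replicas = [Replica("A", 1), Replica("B", 1), Replica("C", 1)]
quorum_write(replicas, r=2, w=2, value="x")  # lands on A and B
replicas[0].up = False                       # A crashes
print(quorum_read(replicas, r=2))            # (1, 'x'), found via B
```

Replica C never saw the write, yet any read quorum that includes C must also include A or B, and the version number tells the reader which copy to believe.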

The strategy outlined in this paper would permit at most one partition to proceed.  The other partition (or partitions) could continue to make some level of progress if the read quorum limit is low enough, where “low enough” means there are at least that many readable copies available within the given partition.

Network Environment

For example, it may be sufficient for only a single replica to be available in order for the read quorum to be satisfied.  In that case, it is consistent because the sum of the read quorum plus write quorum is greater than the number of votes in the system.  In other words, it works because with the lowest possible read quorum a write requires recording the changes reliably on every (voting) replicated copy.  Such a system provides strong guarantees, but won’t allow any writes to make progress when any of the voting nodes is down, since the write quorum requirement is so high.

Similarly, the other extreme is one in which the read quorum is equal to the number of votes in the system, so that the write quorum is just a single node.  This does not seem like a very good option, given that it would cause all the data to become unavailable when any of the replicas became unavailable.

Thus, the pragmatic option here would be to have a distribution of weights and quorums.  For example, if you have three replicas, each with the same weight (say 1 for this discussion) then a workable model is to insist on a read quorum of 2 and a write quorum of 2.  In that way, a single failure will not prevent you from making forward progress, but if two nodes go down then the system can no longer make progress.
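The trade-offs between these configurations can be checked mechanically.  The helper names below are hypothetical; the only rule taken from the paper is that r + w must exceed the total number of votes.

```python
def is_valid(votes, r, w):
    """Every read quorum must overlap every write quorum."""
    return r + w > sum(votes)


def can_reach(votes, quorum, failed):
    """Can `quorum` votes still be collected when the `failed` replicas are down?

    `votes` lists the weight of each replica; `failed` is a set of indices.
    """
    remaining = sum(v for i, v in enumerate(votes) if i not in failed)
    return remaining >= quorum


votes = [1, 1, 1]

# Read quorum 2, write quorum 2: survives one failure, but not two.
print(is_valid(votes, r=2, w=2))       # True
print(can_reach(votes, 2, {0}))        # True
print(can_reach(votes, 2, {0, 1}))     # False

# Read quorum 1, write quorum 3: one failure blocks writes, not reads.
print(can_reach(votes, 1, {0}), can_reach(votes, 3, {0}))  # True False

# Read quorum 1, write quorum 2 is rejected: quorums might not overlap.
print(is_valid(votes, r=1, w=2))       # False
```

The same two functions work unchanged for unequal weights, for example `[2, 1, 1]`, which is how the scheme expresses a preference for a particular replica.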

The author describes the typical environment he envisions for this system: personal computers and file servers connected via a network, and even wide area networking.  Xerox had the personal computers at that point, had defined networking protocols (XNS), and would, in cooperation with Digital and Intel, issue Version 1.0 of the Ethernet specification the following year (1980).

Much of the paper is in fact a fairly detailed description of the system that they had implemented (in Violet).  Section 4 does provide insight into a variety of interesting and useful features of the system:

  • “Weak representatives” – these are basically cached copies of the data; they do not have any voting rights. The author describes them as a performance optimization. He indicates a way of marking the copy as invalid so it will need to be re-fetched before it can be used.
  • Lock optimization – the author points out that they have an optimized lock scheme that permits updates which are compatible with read operations.  This is consistent with the observation that as long as ordering of write operations is preserved on persistent storage write back operations are permissible.
  • Weak consistency – the original model was serial consistency but the author points out that some problems can be addressed with weaker consistency models.  The author does not explore these weak models substantially, but merely mentioning them is indeed a useful insight.
  • Object size – the model permits locking on the file level, so the object stored within the file should be “of suitable size”.
  • Read lock breaking – if the file system permits breaking read locks as part of conflict resolution (rather than transaction abort) then object version numbers can change during the transaction; the change is detectable since the version number shifts.
  • Dynamic reconfiguration – the author describes how additional replicas can be added (and presumably removed) or weights changed.  In essence, he uses the same rules for updating the voting configuration data as for the underlying data itself.  Thus, changes will be discovered by the time the read quorum has been satisfied.
  • Replicated containers – the author explains how replication can be used with (mostly) the same interface as non-replicated storage (just with the benefits of being replicated!)
  • Minimizing communications overhead – the author points out that releasing unneeded read locks prior to commit eliminates the need to communicate during commit processing.
  • Background update – postponing replication can allow smoothing network utilization over time.

The replication policy is, at its heart, an early consensus protocol.  While the author does not discuss this, the approach described does have some scalability challenges that will become apparent (and be addressed) in subsequent work.  Overall, this work really does an amazing job of describing so many aspects of modern computer systems: networks, file servers, personal computers, wide area networks, redundancy, consistency, etc.




As We May Think

Vannevar Bush, The Atlantic, July 1945.

I saw this covered by Adrian Colyer recently and unabashedly decided I needed to cover it as well, not because I thought there was anything wrong with his coverage but rather because this speaks well to my own research interests.  As Colyer points out the concept of trails is something that seems to have been lost.

Professionally our methods of transmitting and reviewing the results of research are generations old and by now are totally inadequate for their purpose. If the aggregate time spent in writing scholarly works and in reading them could be evaluated, the ratio between these amounts of time might well be startling. Those who conscientiously attempt to keep abreast of current thought, even in restricted fields, by close and continuous reading might well shy away from an examination calculated to show how much of the previous month’s efforts could be produced on call. Mendel’s concept of the laws of genetics was lost to the world for a generation because his publication did not reach the few who were capable of grasping and extending it; and this sort of catastrophe is undoubtedly being repeated all about us, as truly significant attainments become lost in the mass of the inconsequential.

Bottom line? We can’t find things. I recently described the Polyvalues paper from SOSP 1979. I commented on the fact that there seemed to be useful insights here that just disappeared.

The real heart of the matter of selection, however, goes deeper than a lag in the adoption of mechanisms by libraries, or a lack of development of devices for their use. Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

This was one of those moments when I realized how deeply embedded our concept of hierarchical organization really is.  It did not originate with the operating systems of the 1960s; it was inherited from the more fundamental constraints of paper indexing.  Indeed, reading this article has given me further insight into how deeply entrenched we are in hierarchical organization.

The first idea, however, to be drawn from the analogy concerns selection. Selection by association, rather than indexing, may yet be mechanized. 

“The analogy” here relates to the associative mechanisms inherent in how humans recall information, which is described in the prior paragraph.  The author describes “trails”.  This evocative idea is similar to the modern field of data provenance.  In data provenance, the focus is often on reproducibility, not on finding things, yet there are intriguing similarities.  I won’t explore this area further yet, but it seems to be intriguing.  Perhaps it will open up some new perspectives to explore.

All this is conventional, except for the projection forward of present-day mechanisms and gadgetry. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.

The memex is his hypothetical device for capturing and finding this information.  At this point the author describes how to build such a system (more or less) using the technology of the day.  The terminology is interesting, and it is quite telling that a 73-year-old article could describe modern systems so well.

At some point reading through this it occurred to me that in some ways we have built a system similar to what he describes: the internet.  What it doesn’t do is construct the personalized model of inter-relationships between the various components of the system.

Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified.

Doesn’t this sound like a web browser?

For me, the key take-away here is encouraging: my own goal of looking for alternatives to hierarchical file systems is not as crazy as it might first seem.  It certainly does not mimic the way in which we tend to organize data, though I have also had people point out to me that nothing prevents us from constructing a higher level layer that can be used to achieve the same goal.  Indeed, that has been done before and I will get to that at a later point.

Polyvalues: A Tool for Implementing Atomic Updates to Distributed Data

Warren A. Montgomery, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 143-149. ACM, 1979.

I found this paper to be surprisingly interesting despite the fact it may be one of the least cited SOSP papers I’ve ever seen (ACM lists one citation to it, and Google Scholar lists two.)

The solution presented is based on the notion of maintaining several potential current values (a polyvalue) for each database item whose exact value is not known, due to failures interrupting atomic updates.  A polyvalue represents the possible set of values that an item could have, depending on the outcome of transactions that have been delayed by failures.  Transactions may operate on polyvalues, and in many cases a polyvalue may provide sufficient information to allow the results of a transaction to be computed, even though the polyvalue does not specify an exact value.  An analysis and simulation of the polyvalue mechanism shows that the mechanism is suitable for databases with reasonable failure rates and recovery times. The polyvalue mechanism is most useful where prompt processing is essential, but the results that must be produced promptly depend only loosely on the database state.  Many applications, such as electronic funds transfer, reservations, and process control, have these characteristics.

To me, this seems like a useful insight: sometimes, the correct outcome of a transaction does not depend upon the specific value of some object.  For example, if a transaction is checking to see if there are sufficient seats to sell for an airline, the fact that the range of possible seat counts is 37, 39, 40, or 41 doesn’t impact the ability of the system to sell one more seat.  There is no hard requirement that we must have an exact value.

In its own way, this is an intriguing manifestation of eventual consistency.  Eventually, the system will be able to figure out the correct number of seats available, once the unknown outcomes have been computed.  Today, we understand consistency models well because relaxing consistency in distributed systems helps improve performance.

The traditional, lock-based system approach (such as we discussed in Implementing Atomic Actions on Decentralized Data) provides strong consistency guarantees.  This was in keeping with the original requirements that transactions lead to a consistent set of state changes.  But transactions are there to ensure we move from one consistent state to another consistent state.  This concept of being able to proceed even in the face of some level of uncertainty points out that we just need to end up in a consistent state, not the consistent state.  We trade off strict determinism for performance and flexibility.

“[T]he failure of a site should not indefinitely delay any transaction that does not access data stored at that site.”  This likely seems blindingly obvious, yet in my own experience with distributed systems achieving this is harder than one might think.  Leslie Lamport is credited with defining a distributed system: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

Polyvalues work by maintaining a vector of possible outcome values.  If the existing possible outcome values are all consistent with allowing a new transaction to proceed, it seems reasonable to permit the new transaction to proceed, versus having it block and wait for a single definitive outcome.  After all, regardless of the outcome this transaction can proceed.

The author defines a polyvalue: “a set of pairs <v,c> where v is a simple value and c is a condition which is a predicate.”  This introduces the idea of a logical operation that determines the outcome, rather than just a simple record of the data value, and the value of an object as being a range of possible values that have not yet been determined.  “A polyvalue is assigned to an item if a failure delays a transaction that is updating that item, or a polyvalue may be produced as one of the results of a transaction that accesses an item that has a polyvalue.”
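To make the definition concrete, here is a minimal sketch of a polyvalue as a set of <v,c> pairs. It is not the paper's implementation: the `Polyvalue` class, the `delayed_update` helper, and the choice to model a condition as a Python function over a dictionary of transaction outcomes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

# Hypothetical representation: transaction id -> True if it committed.
Outcomes = Dict[str, bool]
Condition = Callable[[Outcomes], bool]


@dataclass(frozen=True)
class Polyvalue:
    """A set of <v, c> pairs: v is a simple value, c is a predicate."""
    pairs: Tuple[Tuple[int, Condition], ...]

    def possible_values(self) -> Set[int]:
        return {value for value, _ in self.pairs}

    def resolve(self, outcomes: Outcomes) -> int:
        """Collapse to a simple value once the delayed outcomes are known."""
        matches = [value for value, cond in self.pairs if cond(outcomes)]
        if len(matches) != 1:
            raise ValueError("conditions must select exactly one value")
        return matches[0]


def delayed_update(old: int, new: int, txn: str) -> Polyvalue:
    """Polyvalue for an item whose update by `txn` was interrupted."""
    return Polyvalue((
        (new, lambda outcomes: outcomes[txn]),      # txn committed
        (old, lambda outcomes: not outcomes[txn]),  # txn aborted
    ))


seats = delayed_update(old=40, new=39, txn="T1")
print(sorted(seats.possible_values()))  # [39, 40]
print(seats.resolve({"T1": True}))      # 39
```

Until the fate of the delayed transaction is known, readers of the item see both candidate values; once it is known, exactly one condition holds and the item collapses back to a simple value.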

The author then goes on to explain the logic of polyvalues, and how their inclusion into a transaction converts it to a polytransaction.  The implementation here is one in which multiple possible outcomes are described.  This approach would certainly seem to limit the use of this technique as otherwise there could be a state space explosion.  He describes a mechanism of collapsing these states – the precise number of seats on the plane is a polyvalue, but the decision to sell the ticket for one seat need not be blocked at that point since all the polyvalues lead to the same outcome.

A polytransaction whose outcome depends upon which of the possible values turns out to be correct will have to block and pend, but if all possible values yield the same result, the polytransaction can proceed.
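The block-or-proceed rule can be sketched in a few lines. This is an illustration under my own assumptions, not the paper's algorithm: the function name `decide`, the three string results, and the idea of expressing the transaction's test as a single predicate over one value are all hypothetical.

```python
from typing import Callable, Iterable


def decide(possible_values: Iterable[int], check: Callable[[int], bool]) -> str:
    """Evaluate the transaction's test against every possible value."""
    results = {check(value) for value in possible_values}
    if results == {True}:
        return "proceed"  # every alternative allows the transaction
    if results == {False}:
        return "reject"   # every alternative forbids it
    return "block"        # the answer depends on the unresolved outcome


seat_counts = [37, 39, 40, 41]
print(decide(seat_counts, lambda seats: seats >= 1))   # proceed
print(decide(seat_counts, lambda seats: seats >= 40))  # block
print(decide(seat_counts, lambda seats: seats >= 50))  # reject
```

Selling a single seat succeeds under every alternative, so there is nothing to wait for. A request for 40 seats succeeds under some alternatives and fails under others, so it has to wait until the delayed outcomes are known.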

The insight here is that in highly distributed databases most transactions can achieve a valid outcome regardless of the intermediate state values.  If you look at the paper's example of the bank account withdrawal model, it is clear that this makes sense.  Withdrawals from your account can complete in any order as long as none of them leads to a negative balance. Thus, it makes no sense to block one in favor of the other.
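A small sketch of the withdrawal case, assuming a balance left uncertain by one delayed transaction. The `withdraw` function, the condition labels, and the amounts are hypothetical, and the dictionary is only a stand-in for the <v,c> pairs described earlier.

```python
from typing import Dict, Optional


def withdraw(balance: Dict[str, int], amount: int) -> Optional[Dict[str, int]]:
    """Apply a withdrawal to every alternative of an uncertain balance.

    Returns the new set of alternatives, or None when at least one
    alternative would go negative and the withdrawal cannot proceed yet.
    """
    if any(value < amount for value in balance.values()):
        return None
    return {cond: value - amount for cond, value in balance.items()}


# Hypothetical state: a pending withdrawal of 30 (T1) has not been resolved.
balance = {"T1 committed": 70, "T1 aborted": 100}
print(withdraw(balance, 50))  # {'T1 committed': 20, 'T1 aborted': 50}
print(withdraw(balance, 80))  # None
```

The withdrawal of 50 is safe whichever way the delayed transaction resolves, so it proceeds and leaves behind a new polyvalue. The withdrawal of 80 is only safe under one alternative, so it cannot complete until the outcome is known.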

To evaluate this model, the author defines various terms:

  • I – the number of items in the database
  • – the number of updates per second
  • F – the failure probability of an update
  • R – the recovery rate (per second) from failed operations
  • D – the dependency count (average) for new values
  • Y – the probability that the new value written by an update does not depend upon the previous value

He then predicts the number of polyvalues that will exist in the database (Table 1 from the paper):

Table 1

Thus, even with somewhat pessimal error and recovery rates, he does not expect more than 51 polyvalues within the database.

Finally, he reports the results of his simulation of the system having 10,000 database entries:

Table 2

Now with 1% failure rates, very slow recovery rates (one per 10 seconds), high dependency rates (D=5) and 10 transactions per second, he still only ends up with 20 polyvalues. Thus, this approach seems to help in scaling without a dramatic increase in complexity.

My take-away: strict consistency is not necessary to construct a viable system. Even allowing for some variance in outcomes, it is possible to optimize the performance of the overall system at a nominal increase in potential size and complexity.

Useful insights, indeed.