Home » Database
Category Archives: Database
Exploiting Hardware Transactional Memory in Main-Memory Databases
Exploiting Hardware Transactional Memory in Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann in 2014 IEEE 30th International Conference on Data Engineering, pp 580-591.
I have not spent much time discussing transactional memory previously, though I have touched upon it in prior work. By the time this paper was presented, transactional memory had been fairly well explored from a more theoretical perspective. Intel hardware with transactional memory support was just starting to emerge around the time this paper was released. I would note that Intel had substantial challenges in getting hardware transactional memory (HTM) correct as they released and then pulled support for it in several different CPU releases. Note that HTM was not new, as it had been described in the literature to an extent that earlier papers (e.g., Virtualizing Transactional Memory, which I have decided not to write about further, discusses the limitations of HTM back in 2005).
Logically, it extends the functionality of the processor cache by tracking what is accessed by the processor (and driven by the program code). Cache lines are read from memory, any changed made to the cache line, and then written back to memory. This is in turn all managed by the cache coherency protocol, which provides a variety of levels of coherency.
The idea behind HTM is that sometimes you want to change more than a single element of memory. For example, you might use a mutual exclusion, then add something to a linked list, and increment a counter indicating how many elements are in the linked list before you release the mutual exclusion. Even if there is no contention for the lock, you will pay the lock cost. If the platform requires a fence operation (to ensure memory has been flushed properly) you will also stall while the memory is written back. In a surprising number of cases, you need to do multiple fences to ensure that operations are sequentially consistent (which is a very strong form of consistency).
With HTM you can do this all speculatively: start the transaction, add something to the linked list, increment the counter, then commit the transaction. Once this has been followed with an appropriate fence, the change is visible to all other CPUs in the system. The goal then is to avoid doing any memory operations unless absolutely necessary.
The authors point out that the fastest option is partitioning (ignoring hot spots). They graphically demonstrate this in Figure 1 (from the paper). HTM has some overhead, but it tracks with partitioning fairly linearly. This difference is the overhead of HTM.
They compare this to serial execution, which just means performing them one at a time. The traditional mechanism for doing this kind of paralleism is the two phase commit protocol. That’s the lock/work/unlock paradigm.
If we only considered this diagram, we’d stick with strong partitioning – and we’re going to see this observation reflected again in future work. Of course the reason we don’t do this is because it turns out that the database (and it shows up in file systems as well) is not being uniformly accessed. Instead, we have hot spots. This was a particular concern in the MassTree paper, where they supported novel data structures to spread the load around in a rather interesting fashion. There’s quite a bit of discussion about this problem in the current paper – “[A] good partitioning scheme is often hard to find, in particular when workloads may shift over time.” Thus, their observation is: “we have to deal with this problem”.
So, how can HTM be exploited to provide robust scalability without partitioning. The authors do a good job of explaining how HTM works on Intel platforms. Figure 4 (from the paper) shows a fairly standard description of how this is done on the Intel platform: it has a bus snooping cache, an on-chip memory management unit (MMU), a shared Level 3 cache, and per core Level 1 and Level 2 caches (in case you are interested, the two caches do have somewhat different roles and characteristics.) Level 1 cache is the fastest to access, but the most expensive to provide. Level 2 cache is slower than Level 1, but because it is also cheaper we can have more of it on the CPU. Level 3 cache might be present on the CPU, in which case it is shared between all three cores. Note that none of this is required. It just happens to be how CPUs are constructed now.
The benefit of HTM then is that it exploits the cache in an interesting new way. Changes that are made inside a transaction are pinned inside the cache so they are not visible outside the current core. Note, however, that this could mean just the L1 cache. In fact, the functional size permitted is even smaller than that, as shown in Figure 5 (from the paper). Transactions below 8KB have a low probability of aborting (and if it aborts, the operation failed so it must be tried again, either using HTM or the fallback mechanism with software). That probability approaches 100% as the size goes above above 8KB. Interestingly, the primary reason for this is not so much the size of the cache as the associativity of the cache. What that means is the cache uses some bits from the address to figure out where to store data from that particular cache line. The paper points out that 6 bits (7-12) are used for determining the cache location, and each cache location (so each unique value of bits 7 through 12) are has a fixed number of cache lines (e.g., 8 entries in the Haswell chips the authors are evaluating). If we need to use a ninth we evict one of the existing pages in the cache.
Similarly, when the duration of the transaction goes up, the probability of it aborting also rises. This is shown in Figure 6 (from the paper). This is because the chance that various systems events will occur, which cause the transaction to abort. This includes various types of interrupts: hardware and software.
Thus, these two graphically demonstrate that to exploit HTM effectively we need to keep our transactions small in both duration and the number of cache lines modified by them.
We also note that we should take steps to minimize the amount of sharing of data structures that might be required – the point that not sharing things is more efficient. The authors discuss a variety of approaches to this issue: segmenting data structures, removing unnecessary conflict points (e.g., counters), and appropriate choice of data structures.
Recall the Trie structures from MassTree? These authors offer us Adaptive Radix Trees, which seem to have a similar goal: they are “[A]n efficient ordered indexing structure for main memory databases.” They combine this with a spin lock; the benefit now is that HTM doesn’t require the spin lock normally, so even if some parts of the tree are being read shared, the lock is not being acquired and thus it does not force a transactional abort for other (unrelated) nodes.
They put all of this insight together and that forms the basis for their evaluation. Figure 11 in the paper makes the point that HTM scales much better than traditional locking for small lookups (4 byte keys) with a uniform distribution once there is more than one thread.
Figure 12 (from the paper) evaluates the TPC-C Benchmark against their resulting system to demonstrate that it scales well . Note they stick with four threads, which are all likely on a single physical CPU, so there are no NUMA considerations in this aspect of the evaluation. They address this a bit later in the paper.
Figure 13 (from the paper) compares their performance against a partitioned system. Because they cannot prevent such cross-partition access, they must “live with” the inherent slowdown. One of the amazing benefits of HTM is thus revealed: as more operations cross partition boundaries, HTM continues to provide a constant performance. This seems to be one of the key lessons: no sharing is great, but once you find that you must share, synchronizing optimistically works surprisingly well.
Figure 14 (from the paper) attempts to address my comment earlier abut Figure 12: they really don’t have a multiprocessor system under evaluation. They admit as much in the paper: the hardware just isn’t available to them. They provide their simulation results to defend their contention that this does continue to scale, projecting almost 800,000 transactions per second with 32 cores.
Figure 15 (from the paper) finally demonstrates the reproducibility of HTM abort operations. If an HTM is retried, many will complete with one or two tries. Thus, it seems that even with multiple threads, they tend to converge towards the hardware limitations.
Bottom line: hardware transactional memory can be a key aspect of improving performance in a shared memory systems with classical synchronization.
Weighted Voting for Replicated Data
Weighted Voting for Replicated Data
David K. Gifford, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 150-162, 1979.
I return back to distributed systems. Previously I discussed a companion paper at the same conference (Polyvalues) that was essentially ignored in the subsequent literature. This paper, on the other hand, is well-cited and lays the groundwork for a quorum-based replicated data distribution scheme. I can see echos of a more theoretical paper (“Crumbling Walls“) that I will review at some point in the future.
This work was done while Dave Gifford was at Xerox Palo Alto Research Center (Xerox PARC). At this point, the Xerox PARC team had been working to develop the personal computer. I’m not reviewing it, but another interesting paper from this time period is the Pilot paper (perhaps I should, I see it describes the file systems as large and flat). Thus, the author of this paper is describing an actual working system, not a theoretical model for how one might implement such a system.
The key to this algorithm is the concept of a quorum for replicated data:
In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of r votes to read a file, and a write quorum of w votes to write a file, such that r+ w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file’s voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.
The “votes” assigned to each copy of the file are its weight. This model provides a good way of generalizing replicated data. It could describe a primary/secondary model, or shift the emphasis on ensuring critical systems have copies of the data. The model even permits caching nodes that have no weight.
The key to this approach is that the read quorum is set up so that it is large enough that at least one copy within the read set will have the current data. This is accomplished by ensuring that the combination of read quorum and write quorum represents a number (weight) that is larger than the total sum of all weights within the system. The challenge in a system like this is that choosing these values will determine the reliability of the system in the face of failure. The author doesn’t go into great detail about the types of failures that can occur, but an obvious one is that one of the replicated copies becomes unavailable: a machine crashes. A more interesting one is where the network partitions so that one group of replicas exist on one side of the partition and a different group exists in a separate partition.
The strategy outlined in this paper would permit at most one partition to proceed. The other partition (or partitions) could continue to make some level of progress if the read quorum limit is low enough, where “low enough” means there are at least that many readable copies available within the given partition.
For example, it may be sufficient for only a single replica to be available in order for the read quorum to be satisfied. In that case, it is consistent because the sum of the read quorum plus write quorum is greater than the number of votes in the system. In other words, it works because with the lowest possible read quorum a write requires recording the changes reliably on every (voting) replicated copy. Such a system provides strong guarantees, but won’t allow any progress when any of the nodes are down, since the write quorum requirement is so high.
Similarly, the other extreme is one in which the read quorum is equal to the number of votes in the system, so that the write quorum is just a single node. This does not seem like a very good option, given that it would cause all the data to become unavailable when any of the replicas became unavailable.
Thus, the pragmatic option here would be to have a distribution of weights and quorum. For example, if you have three replicas, each with the same weight (say 1 for this discussion) then a workable model is to insist on a read quorum of 2 and a write quorum of 2. In that way, a single failure will not prevent you from making forward progress, but if two nodes go down then the system can no longer make progress.
The author describes the typical environment he envisions for this system: a network of personal computers, connected via a network, file servers, and even wide area networking. Xerox had the personal computers at that point, and had defined networking protocols (XNS) and would, in cooperation with Digital and Intel issue Version 1.0 of the Ethernet specification the following year (1980).
Much of the paper is in fact a fairly detailed description of the system that they had implemented (in Violet). Section 4 does provide insight into a variety of interesting and useful features of the system:
- “Weak representatitves” – these are basically cached copies of the data; they do not have any voting rights. The author describes them as a performance optimization. It indicates a way of marking the copy as invalid so it will need to be re-fetched before it can be used.
- Lock optimization – the author points out that they have an optimized lock scheme that permits updates which are compatible with read operations. This is consistent with the observation that as long as ordering of write operations is preserved on persistent storage write back operations are permissible.
- Weak consistency – the original model was serial consistency but the author points out that some problems can be addressed with weaker consistency models. The author does not explore these weak models substantially, but merely mentioning them is indeed a useful insight.
- Object size – the model permits locking on the file level, so the object stored within the file should be “of suitable size”.
- Read lock breaking – if the file system permits breaking read locks as part of conflict resolution (rather than transaction abort) then object version numbers can change during the transaction; the change is detectable since the version number shifts.
- Dynamic reconfiguration – the author describes how additional replicas can be added (and presumably removed) or weights changed. In essence, he uses the same rules for updating the voting configuration data as for the underlying data itself. Thus, changes will be discovered by the time the read quorum has been satisfied.
- Replicated containers – the author explains how replication can be used with (mostly) the same interface as non-replicated storage (just with the benefits of being replicated!)
- Minimizing communications overhead – the author points out that releasing unneeded read locks prior to commit eliminates the need to communicate during commit processing.
- Background update – postponing replication can allow smoothing network utilization over time.
The replication policy is, at its heart, an early consensus protocol. While the author does not discuss this, the approach described does have some scalability challenges that will become apparent (and be addressed) in subsequent work. Overall, this work really does an amazing job of describing so many aspects of modern computer systems: networks, file servers, personal computers, wide area networks, redundancy, consistency, etc.
Polyvalues: A Tool for Implementing Atomic Updates to Distributed Data
Polyvalues: A Tool for Implementing Atomic Updates to Distributed Data
Warren A. Montgomery, in Proceedings of the seventh ACM symposium on Operating systems principles, pp. 143-149. ACM, 1979.
I found this paper to be surprisingly interesting despite the fact it may be one of the least cited SOSP papers I’ve ever seen (ACM lists one citation to it, and Google Scholar lists two.)
The solution presented is based on the notion of maintaining several potential current values (a polyvalue) for each database item whose exact value is not known, due to failures interrupting atomic updates. A polyvalue represents the possible set of values that an item could have, depending on the outcome of transactions that have been delayed by failures. Transactions may operate on polyvalues, and in many cases a polyvalue may provide sufficient information to allow the results of a transaction to be computed, even though the polyvalue does not specify an exact value. An analysis and simulation of the polyvalue mechanism shows that the mechanism is suitable for databases with reasonable failure rates and recovery times. The polyvalue mechanism is most useful where prompt processing is essential, but the results that must be produced promptly depend only loosely on the database state. Many applications, such as electronic funds transfer, reservations, and process control, have these characteristics.
To me, this seems like a useful insight: sometimes, the correct outcome of a transactions does not depend upon the specific value of some object. For example, if a transaction is checking to see if there are sufficient seats to sell for an airline, the fact that the range of possible seat counts is 37, 39, 40, or 41 doesn’t impact the ability of the system to sell one more seat. There is no hard requirement that we must have an exact value.
In its own way, this is an intriguing manifestation of eventual consistency. Eventually, the system will be able to figure out the correct number of seats available, once the unknown outcomes have been computed. Today, we understand consistency models well because relaxing consistency in distributed systems helps improve performance.
The traditional, lock-based system approach (such as we discussed in Implementing Atomic Actions on Decentralized Data) provides strong consistency guarantees. This was in keeping with the original requirements that transactions lead to a consistent set of state changes. But transactions are there to ensure we move from one consistent state to another consistent state. This concept of being able to proceed even in the face of some level of uncertainty points out that we just need to end up in a consistent state, not the consistent state. We trade off strict determinism for performance and flexibility.
“[T]he failure of a site should not indefinitely delay any transaction that does not access data stored at that site.” This likely seems blindingly obvious, yet in my own experience with distributed systems achieving this is harder than one might think. Leslie Lamport is credited with defining a distributed system: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
Polyvalues work by maintaining a vector of possible outcome values. If the existing possible outcome values are all consistent with allowing a new transaction to proceed, it seems reasonable to permit the new transaction to proceed, versus having it block and wait for a single definitive outcome. After all, regardless of the outcome this transaction can proceed.
The author defines a polyvalue: “a set of pairs <v,c> where v is a simple value and c is a condition which is a predicate.” This introduces the idea of a logical operation that determines the outcome, rather than just a simple record of the data value, and the value of an object as being a range of possible values that have not yet been determined. “A polyvalue is assigned to an item if a failure delays a transaction that is updating that item, or a polyvalue may be produced as one of the results of a transaction that accesses an item that has a polyvalue.”
The author then goes on to explain the logic of polyvalues, and how their inclusion into a transaction converts it to a polytransaction. The implementation here is one in which multiple possible outcomes are described. This approach would certainly seem to limit the use of this technique as otherwise there could be a state space explosion. He describes a mechanism of collapsing these states – the precise number of seats on the plane is a polyvalue, but the decision to sell the ticket for one seat need not be blocked at that point since all the polyvalues lead to the same outcome.
A polytransaction that has possible paths which fail will have to block and pend if the outcome is dependent upon the values of the polyvalues, but if all possible polyvalues yield the same result, the polytransaction can be sold.
The insight here is that in highly distributed databases most transactions can achieve a valid outcome regardless of the intermediate state values. If you look at their example of the bank account withdrawal model, it is clear that this makes sense. The operation of withdrawing funds from your account can complete in any order as long as none of them lead to a negative balance (they use this example in the paper). Thus, it makes no sense to block one in favor of the other.
To evaluate this model, the author defines various terms:
- I – the number of items in the database
- U – the number of updates per second
- F – the failure probability of an update
- R – the recovery rate (per second) from failed operations
- D – the dependency count (average) for new values
- Y – the probability the new value the update does not depend upon the previous value
He then predicts the number of polyvalues that will exist in the database (Table 1 from the paper):
Thus, even with somewhat pessimal error and recovery rates, he does not expect more than 51 polyvalues within the database.
Finally, he reports the results of his simulation of the system having 10,000 database entries:
Now with 1% failure rates, very slow (1 per 10 second) recovery rates, high dependency rates (D=5) and 10 transactions per second, he still only ends up with 20 polyvalues. Thus, this approach seems to help in scaling without a dramatic increase in complexity.
My take-away: strict consistency is not necessary to construct a viable system. Even allowing for some variance in outcomes it is possible to optimize the performance of the overall system at a nominal increase in potential size and complexity.
Useful insights, indeed.
Implementing Atomic Actions on Decentralized Data
Implementing Atomic Actions on Decentralized Data
David P. Reed, Transactions on Computer Systems, Vol 1. No. 1, February 1983, pp. 3-23.
This certainly must have been an interesting choice to be the first paper of the first ACM Transactions on Computer Systems. It is certainly an interesting work on concurrent activity within a distributed system. It relies upon a basic concept of ordering within a distributed system (“decentralized data”). He builds upon the basics laid down by Leslie Lamport in Time, Clocks, and the Ordering of Events in a Distributed System. While Lamport was concerned about defining the (partial) ordering of events in the distributed system, Reed is concerned about using that ordering to construct useful distributed data updates.
Given that file systems, especially distributed file systems, are concerned with managing data across nodes in a consistent fashion, this work is of particular importance. By 1983 we have seen the emergence of network file systems, which I plan on describing further in coming posts, but they are still fairly primitive. Database systems are further along in allowing distributed data and coordination through things like two-phase commit.
He starts by describing the goals of this work:
The research reported here was begun with the intention of discovering methods for combining programmed actions on data at multiple decentralized computers into coherent actions forming a part of a distributed application program.
His focus is on coordinating concurrent actions across distributed data and ensuring that failures are properly handled. What does it mean to properly handle failures? Essentially, it means that the data is in a consistent state once the system has recovered from the failure. He starts by defining terms that relate to consistency models. For example, he defines an atomic action as being a set of operations that execute in different locations and at different times but cannot be further decomposed. A single action starts with a consistent state at the start and moves to a consistent state at the end. Any intermediate state of the system is not visible (what we would call “isolation” now). He formally defines these concepts as well.
He touches on the idea of consistency, in which one starts with a consistent system and then proves each (atomic) operation yields a consistent state. In my experience this aspect of distributed systems is sometimes skipped, often due to the complexity of doing the work required here. In recent years, formal proof methods have been used to automate some aspects of this. I’m sure I will touch upon it in later posts.
One key benefit of this system of atomic actions is that it makes things simpler for application programmers: in general, they need not deal with unplanned concurrency and failure. Indeed, that is one of the key contributions of this work: the process of reasoning about failure and how to handle it. Indeed, in my experience, handling failure gracefully is one of the substantial challenges inherent in constructing distributed systems. If something can fail, it will.
Achieving atomic action requires the ability to interlock (“synchronization”) against other actors within the system and the ability to gracefully recover from failure cases. The author goes on to describe what his decentralized system looks like: a message passing model (via the network, presumably,) with nodes containing stable storage and the ability to read and write some finite sized storage unit atomically (“blocks”).
One class of failure the author explicitly disclaims: a situation in which the system performs an operation but ends up with a different but valid outcome. This makes sense, as it would be difficult to reason in the face of arbitrary changes each time a given operation were requested. He sets forth a series of objectives for his system:
(1) Maximize node autonomy, while allowing multisite atomic actions
(2) Modular composability of atomic actions.
(3) Support for data-dependent access patterns.
(4) Minimize additional communications.
(5) No critical nodes.
(6) Unilateral aborting of remote requests.
Having laid this groundwork, the author then defines the tools he will use to achieve these objectives. This includes a time-like ordering of events, version information for objects, and the ability to associate intermediate tentative steps together (“possibilities”).
He envisions a versioned object system, where the states of the object correspond to changes made to the object.
At this point I’ll stop and make an observation: one of the challenges for this type of versioning is that the way in which we view objects can greatly complicate things here. For example, if we modify an object in place then this sort of versioning makes sense. However, if we modify an object by creating a new object, writing it, and then replacing the old object with the new object, we have a more complex functional model than might be naively envisioned. This is not an issue clearly addressed by the current paper as it relates mostly to usage. But I wanted to point it out because this sort of behavior will make things more difficult.
One of the important contributions in this work is the discussion about failure recovery. This is, without a doubt, one of the most complex parts of building a distributed system: we must handle partial failures. That is, one node goes offline, one network switch disappears, one data center loses power.
The author thus observes: “If a failure prevents an atomic action from being completed, any WRITE the atomic action had done to share data should be aborted to satisfy the requirement that no intermediate states of atomic actions are visible outside the atomic action. Thus, one benefit of the versioned objects is that the pending transaction (“possibilities”) can track the updated version. Abort simply means that the tentative versions of the objects in the transaction are deleted. Committing means that the tentative versions of the object in the transaction are promoted to being the latest version.
Thus, we see the basic flow of operations: a transaction is started and a possibility is created. Each potential change is described by a token. Tokens are then added to the possibility. While the term is not here, is appears to be a model for what we refer to as write-ahead logging (sometimes also called intention logging).
Time stamps are introduced in order to provide the partial ordering of events or operations, so that the changes can be reliably reproduced. The author goes into quite an extensive discussion about how the system generates time stamps in a distributed fashion (via a pre-reservation mechanism). This approach ensures that the participants need not communicate in order to properly preserve ordering. The author calls this pseudotime. He continues on to explain how timestamps are generated.
Using his ordered pseudo-time operations, his read and write operations, possibilities, and tokens, he then constructs his distributed data system using these primitives. There is detail about how it was implemented, challenges in doing so and the benefits of immutable versions.
He admits there are serious issues with the implementation of this system: “For most practical systems, our implementation so far suffers from a serious problem. Since all versions of an object are stored forever, the total storage used by the system will increase at a rate proportional to the update traffic in the system. Consequently, we would like to be able to throw away old versions of the objects in the system. We can do this pruning of versions without much additional mechanism, however.” His discussion of why this may not be desirable is interesting as it discusses the tension between reclaiming things and slow transactions. He does describe a mechanism for pruning.
After this, the author turns his attention to using atomic transactions to construct new atomic transactions: nested transactions (also “composable transactions” so that we have multiple terms for the same concept!) Again, this is proposed, but not explored fully.
The breadth and scope of this paper is certainly impressive, as the author has described a mechanism by which distributed transactions can be implemented. I would note there are no evaluations of the system they constructed, so we don’t know how efficient this was, but the model is an important contribution to distributed storage.
Granularity of Locks in a Shared Data Base
Granularity of locks in a shared data base,
Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, in Proceedings of the 1st International Conference on Very Large Data Bases, pp. 428-451. ACM, 1975.
I was originally going to do a write-up of A Client-Based Transaction System to Maintain Data Integrity but realized that I should help motivate the use of locks and transactions better before diving into this topic area. So to do this, I’ve picked to talk about locks. This is a critical concept that we utilize in file systems development on a routine basis. Note that this work actually started out in the database community, not the file systems community.
I will note that this is a long paper (23 pages) and there is considerable detail that I will omit for the sake of brevity – but I do want to touch on the key points that are raised in this paper.
They start off by defining consistency and transaction. They define consistency first:
We assume the data base consists of a collection of records and constraints defined on these records. There are physical constraints (ex: in a list of records, if a record A points to record B then record B must exist) as well as logical constraints (ex: conservation of money in a bank checking account application). When all such constraints are satisfied the data base is said to be consistent.
Then they move on to transaction:
A transaction is a series of accesses (for read or write operations) to the data base which, applied to a consistent data base, will product a consistent data base. During the execution of a transaction, the data base may be temporarily inconsistent. The programs used to perform the transactions assume that they “see” a consistent data base. So if several transactions are run concurrently, a locking mechanism must be used to insure that one transaction does not see temporarily inconsistent data cause by another transaction. Also, even if there are no consistency constraints, locks must be used so that the updates of one transaction are not made available to others before the transaction completes. Otherwise, transaction backup might cascade to other transactions which read or updated the “back up” updates.
This text does not lay out the modern description of transactions that we use (atomic, consistent, isolated, and durable or ACID) but what it describes is a system in which these properties do in fact exist. This concept is not unique to databases – we see it in any multi actor system sharing resources. The usual analogy that I use when describing this are traffic intersection based, as the intersection is a shared resource and the individual vehicles separate actors.
The optimal configuration for resource sharing is where the need to share is minimized. If there is no shared data, there is no need for locks. Similarly, if the data is immutable (a concept I know we’ll return to over time) we don’t need to protect it. Only mutable state needs to use locks. As it turns out, frequently mutable state doesn’t actually change but because it can change, if we rely upon it, we must protect against change.
This concept of transitioning from one consistent state to another consistent state safely is why we use locks. The primary contribution of the authors in this paper is not the creation of locks and consistency but rather their creation of a clear language that we still use to describe these concepts.
In Section 2 they then describe what a “lockable object” is. There are several important observations here, which I summarize:
- A transaction is sequentially consistent. This is a very strong consistency guarantee (as we will see in later papers when things get far more general.)
- The basic lock type is reader-writer. A reader lock provides shared access to mutable state, along with the guarantee that the mutable state won’t change while the lock is held. A writer lock provides exclusive access to mutable state, along with the guarantee that the only changes to that state will be made by the owner of the writer lock.
- Locks may be used to protect non-existent resources, for example, to block the creation of a new object. Since the object doesn’t exist, you can’t lock it yet (interestingly, I usually tend to think of that as locking the structure in which the new object exists, but their observation that you can lock non-existent items is certainly valid.
- Locks are requested dynamically.
- There is a dynamic balance between lock granularity and overhead. If we lock many small objects, there is a cost associated with each of those lock operations. In the years since this paper we have made uncontended lock acquisition rather inexpensive, but contended lock acquisition remains an expensive proposition.
The paper then goes on to discuss lock hierarchies. This is because in any dynamic lock system in which you obtain more than a single lock, you need to ensure that there will never be a situation in which an actor blocks waiting for a lock that will never be released. The simplest case of this is when an actor blocks waiting for a lock which it owns. This is clearly a programming error, but it is one that I have seen numerous times over the years. The more complex case of this is when we have a cycle in lock acquisition. The paper points out that to prevent this we need to construct a directed acyclic graph showing the order in which locks are acquired. Usually these are trees, but I have seen environments in which they really are DAGs, due to re-entrant behavior.
The paper describes their model for one type of lock, which is focused on locking within an hierarchical name space. Thus, they have exclusive (X) access, shared (S) access, and intention (I) locks. The intention lock is then used to protect each object along the lock hierarchy, with X or S access to the final node in the hierarchy. The paper goes into greater detail about this; I’ll omit it because it is not really germane to what I found most interesting about this paper.
The final point that I’ll draw out of this paper is that they discuss the idea of lock scheduling, deadlock, and lock thrashing. The lock hierarchy is intended on preventing deadlock. Lock scheduling can be used to detect deadlocks. Thrashing relates to memory contention and calls out to other operating system mechanisms for resolution.
With this behind us, we should have a better basis for understanding how we use transactions and locking to ensure consistency in file systems.
Database versus File System
A general purpose programming system for random access memories
Proceedings of the October 27-29, 1964 Fall Joint Computer Conference, Part 1, pp. 411-422, ACM, 1964.
Bachman, Charles W., and S. B. Williams.
When I discussed MULTICS file systems I disingenuously glossed over an important point they make there: namely, that file systems are an ordered sequence of elements. Today, we think of those elements as bytes, but in fact their definition was more general than just bytes.
Thus, I step back in time and look deeper. What happens if the elements are not individual bytes? Where does that lead us? In this paper, it discusses records. Thus, we see before the clear division of unstructured data (byte oriented files) and structured data (record oriented data). These two fields are inextricably linked in history and while we will see them diverge from one another, we will see the similarity as techniques from one are often used by the other.
Thus we see the early concept of a database emerging. While I won’t talk extensively about databases, it seems useful to point them out because we will find ourselves returning to some of these ideas over time.
Thus, they define a record: it has multiple data fields, with chained information. They have structure – “describing an event, a thing, plan, [or] status”. If we have multiple records that are logically indexed, we have a file system – at least based upon the MULTICS definition.
Of course, the astute reader might point out that this paper precedes the MULTICS paper by a year. Thus, I would suggest that in MULTICS they didn’t wish to exclude the idea of record oriented systems.
Today, we tend to think of files as being a logical array of bytes. This doesn’t have the inherent concept of a relationship between the fields (bytes) because in a byte oriented model there is no relationship. But I’m getting ahead of the conversation at this point.
What this paper is describing is a logical association of information elements that are internally linked together in some fashion. There is structure to this information; the idea is that whatever we are describing requires some finite amount of storage.
Here is where things get interesting in this paper (at least to me). They show us this wonderful diagram (Figure 3 from the paper) that views our storage space as being nothing more than an extension of memory – these records can be stored in “core memory” or they can be stored in “disk memory”. This concept of thinking about disks as being a form of memory might seem strange today, but there is a certain logic to it – and we see modern systems dealing with the hierarchical nature of storage. Again, I’m getting ahead of myself.
So what other interesting insights lurk in this paper? May of its details are about their model for creating associations between records into a chain. These chains are intriguing, as they form a closed loop in their model. One benefit of this approach is that when you construct chains, you can traverse them from beginning to end, regardless of where you start. The authors build on top of this model to construct an interesting system in which data is shared across different record chains. They point out this avoids inconsistency and saves space.
Today, we look at this and think “but of course, that’s how we construct data structures”. Thus, one of the contributions of this paper: an idea we now take for granted. We’ve moved far beyond this concept, but it is rooted here.
The paper links the insertion and retrieval of these records with the terminology of then modern languages COBOL and FORTRAN: they PUT and GET individual records (much like we do with the HTTP protocol).
The paper discusses the details of moving data between core and disk of how to determine if something is in core, and even the need to try and keep frequently accessed data in core, while infrequently accessed data is transferred to disk.
This is a common concern for file systems and databases, and we see it described in this paper. In essence, they are describing the least recently used algorithm.
The rest of this paper then uses a purchase ordering system as the basis of describing the interactions of the records and how they are used to implement the necessary functionality. We don’t tend to think of purchasing systems, or any database interaction, as being a systems level activity these days, but in 1964 it certainly was.
Why is this structure useful? “It yields efficient programs with buffered operation of the disc file.”
We will return to these issues in future papers as well. My goal here was just to touch upon this divide between file systems and databases that forms in this early time period.
Recent Comments