
Category Archives: Conferences Journals and Workshops


The Cambridge File Server

The Cambridge File Server
Jeremy Dion, in ACM SIGOPS Operating Systems Review, Volume 14, Number 4, pp. 26-35, 1980, ACM.

Cambridge was certainly a hotbed of systems work in the 1970s (not to say that it still is not).  They were looking at very different architectures and approaches to problems than we saw from the various Multics-influenced systems.

The introduction to this paper is a testament to the vibrant research work being done there.  The author points to the Cambridge Ring, which was their mechanism for implementing a shared computer network and a precursor to the Token Ring networks that followed.  The CAP computer was part of this network, and the network included a separate computer that had a vast amount of storage for the time – 150MB.  That common storage was used both for “filing systems” and for “virtual memory”.  This computer ran the Cambridge File Server and implemented the functionality that was explored in the WFS paper.

They identify key characteristics of their file server:

  • Substantial crash resistance.
  • Capabilities used to control access.
  • Atomic file updates.
  • Automatic “garbage collection” of storage space.
  • Fast transfer to randomly accessed, word-addressable files.

The authors make a point of noting there are only two classes of objects in their system: files and indices.  I found this interesting because it echoes the hierarchical file system models that encouraged me to start this journey in the first place.

They define a file: “… a random access sequence of 16-bit words whose contents can be read or written by client machines using the following operations”.  The operations that follow are read and write.  They go on to define an index: “… a list of unique identifiers, and is analogous to a C-list in capability machines”.  The three operations here are: preserve, retrieve, and delete. This permits entries to be added, found, and removed.

The storage controlled by the file server thus appears to its clients as a directed graph whose nodes are files and indices.  Each file or index operation is authorised by quoting the object’s unique identifier to the file server, and UIDs are 64 bits long with 32 random bits. Each client, therefore, can access only some of the nodes in the graph at any time, namely those whose UIDs he knows, and those whose UIDs can be retrieved from accessible indices.

Thus, they actually have a graph file system that may in fact consist of nodes that are not connected – essentially a pool of disconnected trees that can be traversed if you know how to find them, but are effectively hidden otherwise.  They do point out that the sparse space may not be sufficient protection (though I suspect a small finite delay on an invalid lookup will discourage brute force browsing).
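To illustrate the kind of mechanism being discussed, here is a minimal sketch (my own, in C, not the paper’s code) of 64-bit UIDs containing 32 random bits, with a small fixed delay on failed lookups to discourage brute-force guessing; the object table and delay value are assumptions.

    #include <stdint.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Illustrative sketch only: CFS-style 64-bit UIDs with 32 random bits,
     * plus a fixed delay on failed lookups to discourage guessing.  The
     * object table and the delay value are assumptions, not the paper's. */

    static uint64_t make_uid(uint32_t sequence_number) {
        uint32_t random_bits = ((uint32_t)(random() & 0xFFFF) << 16) |
                                (uint32_t)(random() & 0xFFFF);
        return ((uint64_t)sequence_number << 32) | random_bits;
    }

    static int lookup_object(uint64_t uid, const uint64_t *table, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (table[i] == uid)
                return (int)i;          /* valid UID: index of the object */
        usleep(10000);                  /* invalid UID: small fixed delay */
        return -1;
    }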

Objects are deleted when they cannot be found from some distinguished root index; the paper describes that each client is given its own entry in the root index, pointing to the client specific index.  There is the implication that they will scan the storage looking for such unreferenced objects that can be cleaned up and indeed they refer to a companion paper for a detailed description of this garbage collector.

Their argument for omitting explicit deletion is that it relieves the client of the burden of managing object lifetimes (“… removes from the clients the burden of deciding when to delete an object…”).

Storage space is segregated into “data” and “map” blocks.  The data blocks contain object contents.  The map blocks contain meta-data. New files are stored as a single data block. As a file grows, map blocks are inserted to create a tree up to three levels deep.

The paper then turns its attention to the atomic nature of the updates to the file server.  The author points out that moving from consistent state to consistent state may require multiple distinct changes. Since failures can interrupt you between any two operations, the discussion revolves around ways in which this can be robustly implemented in an atomic and recoverable fashion.  The author points out that protecting against this class of failures carries substantial overhead.  Given that not all files require this level of robustness, he proposes that the file server provide two separate classes of service for data files.  Map blocks are maintained in consistent fashion because they contain the file server’s meta-data, and the consistency of the file server’s control information needs to be preserved.

Much of the detail in the paper at that point involves describing the structure of the meta-data and how it is used to implement atomic operations on the file server.  The paper provides a detailed description of how transactions are implemented within this system.  The fact that they describe implementing a complete transactional file system, discuss the ramifications of providing user level transactional data storage, and come up with a hybrid model does make this an impressive piece of early work.  We will see journaling file systems more than once as we move forward.

The balance of the paper discusses how this has worked within their systems at Cambridge.   It is interesting and they tie some of the implementation efficiency to the environment of the Cambridge Ring itself.  This is a production file server and the author notes that it is used by a variety of computers (including different operating systems) within their environment successfully.

Its relatively quick response has allowed it to be used to record and play back digitised speech in real time.  The interface provided seems both simple and suitable for a variety of purposes.

Impressive indeed.

WFS: A Simple Shared File System for a Distributed Environment

WFS: A Simple Shared File System for a Distributed Environment
Daniel Swinehart, Gene McDaniel, and David Boggs, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 9-17, 1979, ACM.

This file system was developed at Xerox’s Palo Alto Research Center (PARC), which produced a string of amazing advances in the nascent computer technology area in the 1970s.

Woodstock was “an early office system prototype”.  The authors’ description of Woodstock sounds much like early word processing systems, such as those pioneered by Wang Laboratories in the same time frame.  The ability to share data between these systems turns out to be surprisingly important.  Local storage space was used to track the current work, while centralized storage provided an efficient way to store documents and make the work available to others.


This is the environment that gave rise to WFS.  Because Woodstock already existed and provided its own hierarchical document directory structure, WFS did not need to provide such a mechanism.  In fact, WFS only provided four classes of operations:

  • I/O operations to read and write blocks of data within files
  • Creating/Destroying resources: file identifiers (FIDs) and storage blocks (pages)
  • Managing file properties, including page allocation data
  • Providing maintenance functions

The actual implementation is surprisingly simple.  Indeed, the authors state that it took two months to build.

Figure 1 (from the original paper) describes the format of a request/response packet, showing the basic information exchange model. It is interesting to note that the entire message fits within a small amount of memory and includes an end-to-end checksum.

There are a number of simplifying design decisions in WFS:

  • The namespace for files is flat; there is no hierarchical structure.
  • The file structure is simple (Figure 2).
  • The protocol is stateless and each operation is idempotent.  This simplifies error handling, since a lost message can be re-transmitted safely with no fear that repeating it will cause problems (a minimal retry sketch follows this list).
  • Operations are client initiated.  The server never initiates an operation.
  • Clients hold limited mutable state.  The server does not permit a client to change the server’s own state directly.
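Because the protocol is stateless and every operation is idempotent, a client’s error handling can be nothing more than retransmitting until it hears a reply.  Here is a minimal sketch of that retry loop, with invented request/reply types and a hypothetical transport function; WFS’s actual packet format is the one shown in Figures 1 and 3.

    /* Illustrative sketch of retrying over an idempotent, stateless protocol.
     * The types and send_and_wait() are invented for the example; they are
     * not the WFS packet format. */

    struct request { unsigned op, fid, page; unsigned char data[512]; };
    struct reply   { int status;             unsigned char data[512]; };

    /* Hypothetical transport: returns 0 on a received reply, -1 on timeout. */
    int send_and_wait(const struct request *rq, struct reply *rp);

    int do_operation(const struct request *rq, struct reply *rp) {
        for (int attempt = 0; attempt < 5; attempt++) {
            if (send_and_wait(rq, rp) == 0)
                return rp->status;
            /* Request or reply was lost: because the operation is idempotent,
             * simply retransmitting it is safe. */
        }
        return -1;      /* give up after a few attempts */
    }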

This simplicity does limit the generality of WFS, but it also demonstrates an important abstraction that we will see used (and re-used) in subsequent systems: a file can be treated as a block structured device (a “disk”) in an interesting and transparent fashion.

Figure 3 describes the layout of the (stateless) data exchange format used by WFS.

Figure 4 shows the layout of the file directory table which is a contiguous and fixed-size region on disk at a known disk location.  This is a fairly common characteristic of on-disk file system formats, having a known location where meta-data is to be found.

Note that Figure 4 also shows how the file’s allocated storage is described via direct and indirect block references organized into a tree structure.  Again, this will be a recurring model that occurs in file systems; it combines the flexibility of supporting efficient space utilization, ability to describe variable sized files, and efficient utilization of block-addressable storage.
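The direct/indirect tree in Figure 4 maps a logical page number within a file to a disk block.  A minimal sketch of that style of lookup follows; the field names, sizes, and single level of indirection are assumptions for illustration, not WFS’s exact layout.

    #include <stdint.h>

    /* Illustrative sketch of direct/indirect block mapping, in the spirit
     * of WFS Figure 4.  Field names and sizes are assumptions. */

    #define NDIRECT   8          /* direct block pointers        */
    #define NPERBLK 128          /* pointers per indirect block  */

    struct file_entry {
        uint32_t direct[NDIRECT];          /* first NDIRECT pages          */
        uint32_t indirect;                 /* block holding more pointers  */
    };

    /* Hypothetical helper that reads one indirect block from disk. */
    int read_block(uint32_t blkno, uint32_t out[NPERBLK]);

    /* Map a logical page number to a disk block number; 0 means "no block". */
    uint32_t map_page(const struct file_entry *fe, uint32_t page) {
        if (page < NDIRECT)
            return fe->direct[page];
        page -= NDIRECT;
        if (page < NPERBLK && fe->indirect != 0) {
            uint32_t ptrs[NPERBLK];
            if (read_block(fe->indirect, ptrs) == 0)
                return ptrs[page];
        }
        return 0;  /* beyond mapped range or read failure */
    }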

This simple mechanism permits their clients to utilize a flexible storage mechanism without forcing the file server to support any of the mechanisms the client already provides, such as  the hierarchical document name space, management of documents and their structure, etc.  This separation of concerns yields an elegant and simple implementation model for their file server.

There are some interesting implementation details described in the paper:

  • Write operations are validated by reading the data page.  Thus, writes become compare-and-swap operations that prevent concurrent access from inadvertently overwriting changes made by another client.  It would be rather inefficient to rely upon this mechanism, but it helps prevent out-of-order packet processing in an unreliable network.  The downside is that they must read the data before they can write it (see the sketch following this list).
  • They use a write-through cache.  Thus, the cache is really for read efficiency, not write efficiency.  This should help mitigate the write inefficiency.
  • Most of their caching is done against meta-data pages (“auxiliary disk pages”) because they are more frequently accessed than client data pages.
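To make the validated write concrete: read the page, check that it still holds what the client last saw, and only then overwrite it, which is a compare-and-swap at page granularity.  A minimal sketch of the idea, in my own formulation with hypothetical page I/O helpers, not the WFS code:

    #include <string.h>

    #define PAGESIZE 512

    /* Hypothetical page I/O helpers for the sketch. */
    int read_page(unsigned fid, unsigned page, unsigned char buf[PAGESIZE]);
    int write_page(unsigned fid, unsigned page, const unsigned char buf[PAGESIZE]);

    /* Write 'newdata' only if the page still holds 'expected' -- a
     * compare-and-swap at page granularity.  Returns 1 on success, 0 if
     * the page changed underneath us, -1 on I/O error. */
    int validated_write(unsigned fid, unsigned page,
                        const unsigned char expected[PAGESIZE],
                        const unsigned char newdata[PAGESIZE]) {
        unsigned char current[PAGESIZE];
        if (read_page(fid, page, current) != 0)
            return -1;
        if (memcmp(current, expected, PAGESIZE) != 0)
            return 0;   /* someone else modified the page; caller must re-read */
        return write_page(fid, page, newdata) == 0 ? 1 : -1;
    }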

Here’s one of the interesting performance results: “In the single-user (lightly loaded) case, WFS improved Woodstock’s average input response time over the local disk’s time for several reasons: WFS’s disks were faster than Woodstock’s local disks, requested pages were sometimes still in the WFS main memory cache, and the amount of arm motion on the local disk was reduced because it no longer had to seek between a code swap-area and the user data area.”

Accessing data over the network was faster than the local disk drive!  Whether this is a statement of how slow disks were versus networks I leave as an exercise to the reader.  One thing we can take away from this: the network often does not impose a significant bottleneck to utilizing remote storage (except, of course, when it does.)

The authors follow up their implementation description with an explanation of their design philosophy.  They emphasize the atomic nature of the operations they support, as well as the following properties:

  • Client initiated operations can only access one data page and “a few” auxiliary disk pages.
  • Operations are persistent before WFS returns status to the client.
  • WFS commands are a single internet packet.
  • The WFS protocol is stateless.

They then explain the rationale for these decisions, which relate to simplifying the protocol and server side implementation.

They delve into how clients might use WFS in Section 4.  One explicit take-away here is that they view these “files” as acting like “virtual disks” and this permits the WFS clients to implement their own abstraction on top of the WFS-provided services.  Because WFS doesn’t assume any specific structure for the client data, there is no burden placed upon those client implementations – though they admit at one point that this complicates the client.

The authors are able to point to other systems that utilize WFS besides Woodstock.  They cite Paxton’s system (A Client-Based Transaction System to Maintain Data Integrity) as being based upon WFS.

The paper discusses security and privacy considerations, admitting that their system does not address these issues and suggesting various techniques for addressing security using encryption and capabilities.  They round out this section of the paper by discussing other possible enhancements to WFS.

In the end, they provided a simple model for a network file server that permitted a client to implement a range of solutions. As we begin looking at more network file systems, we will see this model extended in various ways.


Granularity of Locks in a Shared Data Base

Granularity of locks in a shared data base,
Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, in Proceedings of the 1st International Conference on Very Large Data Bases, pp. 428-451. ACM, 1975.

I was originally going to do a write-up of A Client-Based Transaction System to Maintain Data Integrity but realized that I should help motivate the use of locks and transactions better before diving into that topic area.  So, to do this, I’ve chosen to talk about locks.  This is a critical concept that we utilize in file systems development on a routine basis.  Note that this work actually started out in the database community, not the file systems community.

I will note that this is a long paper (23 pages) and there is considerable detail that I will omit for the sake of brevity – but I do want to touch on the key points that are raised in this paper.

They start off by defining consistency and transaction.  They define consistency first:

We assume the data base consists of a collection of records and constraints defined on these records.  There are physical constraints (ex: in a list of records, if a record A points to record B then record B must exist) as well as logical constraints (ex: conservation of money in a bank checking account application). When all such constraints are satisfied the data base is said to be consistent.

Then they move on to transaction:

A transaction is a series of accesses (for read or write operations) to the data base which, applied to a consistent data base, will produce a consistent data base.  During the execution of a transaction, the data base may be temporarily inconsistent.  The programs used to perform the transactions assume that they “see” a consistent data base.  So if several transactions are run concurrently, a locking mechanism must be used to insure that one transaction does not see temporarily inconsistent data caused by another transaction.  Also, even if there are no consistency constraints, locks must be used so that the updates of one transaction are not made available to others before the transaction completes.  Otherwise, transaction backup might cascade to other transactions which read or updated the “back up” updates.

This text does not lay out the modern description of transactions that we use (atomic, consistent, isolated, and durable or ACID), but what it describes is a system in which these properties do in fact exist.  This concept is not unique to databases – we see it in any multi-actor system sharing resources.  The usual analogy that I use when describing this is traffic-intersection based, as the intersection is a shared resource and the individual vehicles are separate actors.

The optimal configuration for resource sharing is one where the need to share is minimized.  If there is no shared data, there is no need for locks. Similarly, if the data is immutable (a concept I know we’ll return to over time) we don’t need to protect it.  Only mutable state needs to use locks.  As it turns out, mutable state frequently does not actually change, but because it can change, anything that relies upon it must protect against change.

This concept of transitioning from one consistent state to another consistent state safely is why we use locks.  The primary contribution of the authors in this paper is not the creation of locks and consistency but rather their creation of a clear language that we still use to describe these concepts.

In Section 2 they then describe what a “lockable object” is.  There are several important observations here, which I summarize:

  • A transaction is sequentially consistent.  This is a very strong consistency guarantee (as we will see in later papers when things get far more general.)
  • The basic lock type is reader-writer.  A reader lock provides shared access to mutable state, along with the guarantee that the mutable state won’t change while the lock is held.  A writer lock provides exclusive access to mutable state, along with the guarantee that the only changes to that state will be made by the owner of the writer lock (see the sketch following this list).
  • Locks may be used to protect non-existent resources, for example, to block the creation of a new object.  Since the object doesn’t exist, you can’t lock it yet (interestingly, I usually tend to think of that as locking the structure in which the new object will exist, but their observation that you can lock non-existent items is certainly valid).
  • Locks are requested dynamically.
  • There is a dynamic balance between lock granularity and overhead.  If we lock many small objects, there is a cost associated with each of those lock operations.  In the years since this paper we have made uncontended lock acquisition rather inexpensive, but contended lock acquisition remains an expensive proposition.
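The reader-writer discipline described in the list above maps directly onto modern primitives.  A minimal sketch using POSIX rwlocks; the shared counter is simply an example of mutable state, not anything from the paper.

    #include <pthread.h>

    /* Illustrative sketch: a reader-writer lock protecting mutable state.
     * Readers share access; a writer gets exclusive access. */

    static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
    static long shared_value;            /* the mutable state being protected */

    long read_value(void) {
        pthread_rwlock_rdlock(&lock);    /* shared: value cannot change here  */
        long v = shared_value;
        pthread_rwlock_unlock(&lock);
        return v;
    }

    void update_value(long v) {
        pthread_rwlock_wrlock(&lock);    /* exclusive: only we may modify it  */
        shared_value = v;
        pthread_rwlock_unlock(&lock);
    }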

The paper then goes on to discuss lock hierarchies.   This is because in any dynamic lock system in which you obtain more than a single lock, you need to ensure that there will never be a situation in which an actor blocks waiting for a lock that will never be released.  The simplest case of this is when an actor blocks waiting for a lock which it owns.  This is clearly a programming error, but it is one that I have seen numerous times over the years.  The more complex case of this is when we have a cycle in lock acquisition.  The paper points out that to prevent this we need to construct a directed acyclic graph showing the order in which locks are acquired.  Usually these are trees, but I have seen environments in which they really are DAGs, due to re-entrant behavior.

The paper describes their model for one type of lock, which is focused on locking within a hierarchical name space.  Thus, they have exclusive (X) access, shared (S) access, and intention (I) locks.  The intention lock is then used to protect each object along the lock hierarchy, with X or S access to the final node in the hierarchy.  The paper goes into greater detail about this; I’ll omit it because it is not really germane to what I found most interesting about this paper.
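To make the mode interaction concrete, here is a small sketch of a lock-compatibility check.  The three-mode matrix below (intention compatible only with intention, shared only with shared) is a simplification I am assuming for illustration; the paper ultimately refines intention into finer modes.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative sketch of hierarchical lock modes in the spirit of the
     * paper: exclusive (X), shared (S), and intention (I).  This simplified
     * compatibility matrix is an assumption for illustration only. */

    enum mode { MODE_I, MODE_S, MODE_X };

    static const bool compatible[3][3] = {
        /*            I      S      X   */
        /* I */   { true,  false, false },
        /* S */   { false, true,  false },
        /* X */   { false, false, false },
    };

    /* To lock a node in S or X, first acquire I on each ancestor. */
    bool can_grant(enum mode requested, enum mode held) {
        return compatible[requested][held];
    }

    int main(void) {
        printf("I vs I: %s\n", can_grant(MODE_I, MODE_I) ? "grant" : "wait");
        printf("S vs X: %s\n", can_grant(MODE_S, MODE_X) ? "grant" : "wait");
        return 0;
    }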

The final point that I’ll draw out of this paper is that they discuss the idea of lock scheduling, deadlock, and lock thrashing.  The lock hierarchy is intended to prevent deadlock.  Lock scheduling can be used to detect deadlocks.  Thrashing relates to memory contention and calls out to other operating system mechanisms for resolution.

With this behind us, we should have a better basis for understanding how we use transactions and locking to ensure consistency in file systems.

The DEMOS File System

The DEMOS File System
Michael L. Powell, in Proceedings of the Sixth ACM Symposium on Operating Systems Principles, pp. 33-42, 1977, ACM.

This paper delves into the nitty gritty details of constructing physical file systems.  I was surprised that it had relatively few citations (61 according to Google Scholar when I checked) because, having read it, I would hand this paper to someone asking me “what are file systems?”  I suspect that the more frequently cited paper in this area will be “A Fast File System for UNIX,” which cites this paper.

The target for DEMOS is the CRAY-1 supercomputer, at the time the fastest computer in the world.  As a matter of comparison, modern mobile devices have more computational power (and often more I/O bandwidth) than the CRAY-1 did.

[Figure: DEMOS Figures 1 and 2]

The author discusses the design of a new file system for use with a custom operating system for the Los Alamos National Laboratory’s CRAY-1 computer system.  A key goal of the project was to improve performance as much as possible.  After all, why build a super-computer if you then cripple it with features that don’t enhance its performance?

What I find delightful about this paper is that it describes the basic constituent parts of a file system, as well as strategies for optimizing performance.  It does so in a clear and understandable fashion.

[Figure: DEMOS Figures 3 and 4]

DEMOS utilizes a UNIX-like hierarchical file system model.  It has directories and files. It does not have the link model from Multics, so paths to files are unique in DEMOS.  Files are managed in units of blocks (4096 bytes), but I/O is specified in bytes (interestingly, they specify eight-bit bytes, as nine-bit machines were still in use).

The authors discuss file sizes.  To the best of my knowledge this is one of the earliest papers covering this common subject (which is revisited periodically because workloads change and file sizes also change).  One of the common themes I have seen in other work is mirrored here: most files are small.  Figure 1 shows a CDF for file sizes.  We note that the majority of files in their system are small, with approximately 75% being less than 1KB; this is consistent with later work as well.  Their second figure (Figure 2) describes the proportion of transfer sizes and their source.  We see a spike at small transfer sizes, perhaps because 256 or 512 bytes are “natural block sizes” that applications would use.

[Figure: DEMOS Figure 5]

They establish lofty performance requirements: “[T]he file system will have to support a bandwidth of 20-60 megabits/second or higher”. Our performance requirements today would be much higher, but this recognizes the reality that then (as now) the I/O bandwidth of storage is often the rate limiting factor.

DEMOS is paired with a centralized storage facility (“Common File System” or CFS) that is to provide the function of what we would now think of as a centralized file server.  While not yet implemented by the time of the paper, their plan was to introduce automatic file migration and staging.

The central bit of the paper then describes the constituent parts of the file system.  This maps rather well onto what I have seen in the typical file system: a “request interpreter” that handles requests from applications (even their description is apt: “parameter validation and request translation”); a “buffer manager” that handles the allocation of buffer cache space (often a virtual cache these days); and a “disk driver” that handles low level data operations, such as filling or storing the contents of buffers.

Figures 3 and 4 capture their insight into the disk manager.  This dovetails with their discussion about efficiency of I/O, including observations about queue management (“shortest seek time first” order for requests, and then sub-sorted by “shortest latency time first”).  This is a clear “hat tip” to the impact that rotational latency and track seek time have on performance.
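A minimal sketch of that ordering policy: among queued requests, choose the one whose cylinder is nearest the current head position, breaking ties by the smaller rotational position.  The structures and names are my own illustration, not the DEMOS code.

    /* Illustrative sketch of shortest-seek-time-first request selection,
     * in the spirit of the DEMOS disk manager discussion. */

    struct disk_request {
        unsigned cylinder;      /* target cylinder                 */
        unsigned rotational;    /* sector/angle, for tie-breaking  */
    };

    /* Pick the queued request closest to the current head position;
     * among equal seek distances, the one with the smaller rotational value. */
    int pick_next(const struct disk_request *q, int n, unsigned head_cyl) {
        int best = -1;
        unsigned best_seek = 0, best_rot = 0;
        for (int i = 0; i < n; i++) {
            unsigned seek = q[i].cylinder > head_cyl ? q[i].cylinder - head_cyl
                                                     : head_cyl - q[i].cylinder;
            if (best < 0 || seek < best_seek ||
                (seek == best_seek && q[i].rotational < best_rot)) {
                best = i;
                best_seek = seek;
                best_rot = q[i].rotational;
            }
        }
        return best;   /* index of next request, or -1 if the queue is empty */
    }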

Speaking of performance, the authors discuss this.  It leads to their observations on improving I/O performance: “I/O operations ought to proceed in parallel with computation”.  Their point is that serializing these things decreases overall performance.  Their second observation: “[T]he length of time an I/O operation takes should be reduced as much as possible.”  This seems logical and is one reason why they use their optimized strategy.

There is a section on “file system buffering” that touches on the tradeoffs between using memory for buffer caching versus other possible uses.  Thus, the authors evaluate how increased buffering impacts their CPU utilization – this is in keeping with their goal of parallelizing I/O and computation.  Their observation?  The greatest benefit comes from a small number of buffers; in their analysis, eight buffers provide most of the benefit. Building on that with Figures 6 and 7, they observe there is a clear limit to the benefit of further buffering.  These days we do not think too much about this because we tend to use virtual caches, so the amount of physical memory is really managed by the virtual memory management code, yet the observation would likely still apply.  There is a limit to the benefit of buffering.

The authors also point out that disk allocation is challenging.  They employ allocation bitmaps, clustered allocations, over-allocation, and even simplistic predictive read-ahead.  They refer to these as “strategy” routines.
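As an illustration of the bitmap side of such a “strategy” routine, here is a sketch that scans a free-block bitmap for a run of consecutive free blocks so that an allocation comes out clustered (and could deliberately over-allocate by asking for a longer run).  This is a generic example, not the DEMOS implementation.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative sketch: find a run of 'want' consecutive free blocks in
     * an allocation bitmap and mark them allocated.  Generic illustration,
     * not the DEMOS strategy routines. */

    static int  bit_is_set(const uint8_t *bm, size_t i) { return bm[i / 8] & (1u << (i % 8)); }
    static void set_bit(uint8_t *bm, size_t i)          { bm[i / 8] |= (uint8_t)(1u << (i % 8)); }

    /* Returns the first block of the run, or -1 if no such run is available. */
    long alloc_cluster(uint8_t *bitmap, size_t nblocks, size_t want) {
        size_t run = 0;
        for (size_t i = 0; i < nblocks; i++) {
            run = bit_is_set(bitmap, i) ? 0 : run + 1;   /* set bit = in use */
            if (run == want) {
                size_t start = i + 1 - want;
                for (size_t j = start; j <= i; j++)
                    set_bit(bitmap, j);                  /* mark allocated   */
                return (long)start;
            }
        }
        return -1;
    }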

In general, this is a great introduction to the basic structure of a media file system.  There are plenty of details that will be refined in later work.


The Cap File System

The Cap Filing System
R. M. Needham and A.D. Birrell, Symposium on Operating Systems Principles, 1977, Association for Computing Machinery.

I’ve fallen behind these past ten days, working towards a deadline.  My own conference paper is now submitted and I’m working on recovering.  What that means is this week is going to be a busy one as I work on catching up.  Adding to the fun, FAST is going on this week (February 13-16, 2018).  I hope to add some bonus discussions on the current work being presented there.

Let’s get to discussing this paper.

CAP is a capabilities based system.  Capability systems are a parallel idea that has been explored from time to time as an alternative to the access control model of Multics/UNIX inspired operating systems.  The latest contender in this space would be Fuchsia, an experimental operating system that Google is developing (though it is an open source project under a mixture of licenses).  Interestingly, there are reports it runs on the Pixelbook in addition to the hardware it had previously supported.

At any rate, the idea behind a capability is to use identifiers that encapsulate the access rights within the capability itself.  Thus, it is a name and the virtue of having the name is that it means you have the access inherent in that name.

These capabilities all represent to a program the right to have or do something: the preservation of information from one run of a program to another (a universal operating system requirement) is thus seen by the CAP programmer as the preservation of a capability rather than of an object itself.

The paper describes how capabilities are managed, which interacts with the “filing system” of course, since it has to store persistent information.  They describe the “system internal name” (SIN), which is combined with a disk address to map memory segments to actual storage.  The capability (in a “directory”) then becomes a repository of these persistent objects.  This also creates a reference to the disk block that is in use, ensuring those storage regions are not reused.
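Conceptually, then, a persistent directory entry must carry enough to find the object again: at minimum the system internal name and a disk location.  A small sketch of that pairing; the field names and sizes are assumptions, not CAP’s actual layout.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch: a persistent directory entry pairing a system
     * internal name (SIN) with a disk location.  Field names and sizes are
     * assumptions, not the actual CAP on-disk layout. */

    struct dir_entry {
        char     name[32];      /* textual name within this directory   */
        uint64_t sin;           /* system internal name of the object   */
        uint32_t disk_address;  /* where the object's storage begins    */
        uint32_t rights;        /* access rights carried by the entry   */
    };

    /* Look up a name in an in-memory directory; returns the entry or NULL. */
    const struct dir_entry *dir_lookup(const struct dir_entry *dir, int n,
                                       const char *name) {
        for (int i = 0; i < n; i++)
            if (strncmp(dir[i].name, name, sizeof dir[i].name) == 0)
                return &dir[i];
        return 0;
    }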

One interesting characteristic of their system is that they do not guarantee free disk space will be recycled quickly (“[T]here is no guarantee that inaccessible disk space will be relinquished at the earliest possible moment.”)  Indeed they note that space reclamation is only done when the system reboots and “[T]he filing system is not designed to run forever.”

They discuss how CAP differs from prior work (e.g., OS/360) where only the name matters; there is no concept of a directory for use as part of the capability system.  The directory actually provides them an additional level of control as well, and they use directory capabilities as well as segment capabilities.  Directories may be persistent (so they have a SIN and disk block location) or ephemeral (so they disappear when the program exits) – a sort of “built in” temporary storage concept for ephemeral memory management.

Sharing objects is now trivial – you simply tell someone the correct name; that embeds the capability within it.  They do not describe the mechanisms used for transfer (“[B]y mechanisms which do not concern us in detail here.”)

They also describe how this name space is similar to other systems (UNIX and CAL-TSS) but different:

  • Access is associated with the name which in turn references information in the directory.  It is not an attribute of the file itself.
  • The name space need not be a “strict hierarchy”.  This means that portions could become disconnected, or even be private to a single application.
  • Their use of directories behaves similarly to the model of a “current directory” (presumably in UNIX) even though CAP expressly does not have a concept of current directory.
  • Directories are not even directed acyclic graphs!

The paper describes how capabilities work, since they are a fine-grained control mechanism.  They explain that the holder of an existing capability (a program) may generate a more restrictive capability to provide to another program.  Since capabilities apply to individual files as well as directories, it is possible for a program to transfer a set of capabilities by creating a new directory, storing the relevant capabilities in it, and giving that directory to the target program.
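A sketch of the restriction step: the holder of a capability mints a new one that names the same object but carries only a subset of the rights.  The bitmask representation here is my own illustration, not CAP’s encoding.

    #include <stdint.h>

    /* Illustrative sketch of capability restriction: derive a capability
     * with a subset of rights.  Representation is invented for the example. */

    enum rights { R_READ = 1, R_WRITE = 2, R_ENTER = 4 };

    struct capability {
        uint64_t object;    /* which object this capability names  */
        uint32_t rights;    /* what the holder may do with it      */
    };

    /* Return a new capability for the same object with only the requested
     * rights, and only those the holder already possesses. */
    struct capability restrict_cap(struct capability cap, uint32_t wanted) {
        struct capability out = cap;
        out.rights = cap.rights & wanted;    /* never amplify rights */
        return out;
    }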

The names themselves can be quite ugly, as they incorporate capabilities within them.  They describe some conventions they employ in terms of name management and directory placement, but point out these are conventions and not a hard requirement of the system.

CAP certainly presents a rather different model of a file system than we see in other systems.  The idea of disconnected name spaces that are only visible to particular programs is an intriguing one.  They use the example of the password database, which requires the program have the password file capability.

They discuss security.  The directory manager is a separate module that programs use to interact with the directories to which they have access.  To invoke the directory manager, the caller must have the ENTER capability.

I find this to be exactly the type of thought provoking paper I had hoped to find as I comb through these old papers.  The idea that a file system name space need not be connected, that it could be private to a particular program or set of programs, and embedding access rights (“capabilities”) into the name will give me plenty to think about.

If you would like to know more about CAP there is a paper about it in the prior Symposium on Operating Systems Principles: The Cambridge CAP Computer and its Operating System.  It is not too surprising that this is available from Microsoft Research, as they also built a capability based operating system (or two): Singularity and Midori.


Shanghai Bound!

I’m on my way to the Symposium on Operating Systems Principles, which is a well-known operating systems conference.  This time it is in Shanghai, China, so I’m making my way there.  I do not have the privilege of presenting anything there this trip, but hopefully I will be able to do so in the future – it really is just a small matter of coming up with some interesting research that is worthy.

This will be my first time in China.  Definitely looking forward to seeing some of the interesting talks that are scheduled throughout the conference!

Decision Made

The decision deadline was last weekend and I confirmed my decision, accepting one of the offers and declining the other.  I’m excited that I’ll be starting my new program officially near the end of summer.

I’m already starting to think about the fact that this journey is really about to begin.  I’d be lying if I said that I weren’t still in a bit of shock – euphoric shock, for sure, but still shock.  Despite everything that I had going against me, I managed to make this happen – and no amount of actually thinking about it makes me view it as anything less than a (wonderful) surprise.

For example, if I hadn’t been in the position to take the research assistant position with UBC, I suspect I’d have been less likely to have been offered admission there.  Of course, had I not been summarily dismissed from my job just a couple of weeks earlier, I wouldn’t have been in a position to do so.  Pretty exciting, indeed!