
The DEMOS File System

The DEMOS File System
Michael L. Powell, In Proceedings of the sixth ACM symposium on Operating systems principles, pp. 33-42.

This paper delves into the nitty gritty details of constructing physical file systems.  I was surprised that it had relatively few citations (61 according to Google Scholar when I checked) because, having read it, I would hand this paper to someone asking me “what are file systems?”  I suspect that the more frequently cited paper in this area will be “A Fast File System for UNIX,” which cites to this paper.

The target for DEMOS is the CRAY-1 supercomputer, at the time the fastest computer in the world.  As a matter of comparison, modern mobile devices have more computational power (and often more I/O bandwidth) than the CRAY-1 did.

DEMOS Figures 1 and 2

The author discusses the design of a new file system for use with a custom operating system for the Los Alamos National Laboratory's CRAY-1 computer system.  A key goal for this project was to improve performance as much as possible.  After all, why build a supercomputer if you then cripple it with features that don't enhance its performance?

What I find delightful about this paper is that it describes the basic constituent parts of a file system, as well as strategies for optimizing performance.  It does so in a clear and understandable fashion.

DEMOS Figures 3 and 4

DEMOS utilizes a UNIX-like hierarchical file system model.  It has directories and files.  It does not have the link model from Multics, so paths to files are unique in DEMOS.  Files are managed in units of blocks (4096 bytes) but I/O is specified in bytes (interestingly, they specify eight-bit bytes, as nine-bit machines were still in use).

The authors discuss file sizes.  To the best of my knowledge this is one of the earliest papers covering this common subject (which is revisited periodically because workloads change and file sizes also change).  One of the common themes I have seen in other work is mirrored here: most files are small.  Figure 1 shows a CDF for file sizes.  We note that the majority of files in their system are small, with approximately 75% being less than 1KB; this is consistent with later work as well.  Their second figure (Figure 2) describes the distribution of transfer sizes and their sources.  We see spikes at particular sizes, perhaps 256 or 512 bytes being "natural block sizes" that applications would use.

DEMOS Figure 5

They establish lofty performance requirements: "[T]he file system will have to support a bandwidth of 20-60 megabits/second or higher".  Our performance requirements today would be much higher, but this recognizes the reality that then (as now) the I/O bandwidth of storage is often the rate-limiting factor.

DEMOS is paired with a centralized storage facility (“Common File System” or CFS) that is to provide the function of what we would now think of as a centralized file server.  While not yet implemented by the time of the paper, their plan was to introduce automatic file migration and staging.

The central bit of the paper then describes the constituent parts of the file system.  This maps rather well onto what I have seen in the typical file system: a "request interpreter" that handles requests from applications (even their description, "parameter validation and request translation", is apt); a "buffer manager" that handles the allocation of buffer cache space (often a virtual cache these days); and a "disk driver" that handles low-level data operations, such as filling or storing the contents of buffers.
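
To make that decomposition concrete, here is a minimal sketch of how those three layers might be expressed as interfaces in C.  The names and signatures are mine, invented for illustration – nothing here is taken from DEMOS.

```c
/* A minimal sketch of the three-layer split the paper describes:
 * request interpreter -> buffer manager -> disk driver.
 * All names and signatures are illustrative, not taken from DEMOS. */
#include <stdio.h>

/* Lowest layer: moves whole blocks between a buffer and the device. */
typedef struct {
    int (*read_block)(unsigned blockno, void *buf);
    int (*write_block)(unsigned blockno, const void *buf);
} disk_driver;

/* Middle layer: owns a pool of block-sized buffers and decides which
 * blocks stay cached in memory. */
typedef struct {
    void *(*get_block)(unsigned blockno);        /* returns a cached buffer */
    void  (*release_block)(void *buf, int dirty);
} buffer_manager;

/* Top layer: validates application requests (file id, offset, length)
 * and translates byte-oriented I/O into block operations below. */
typedef struct {
    long (*read)(int file_id, void *dst, long offset, long length);
    long (*write)(int file_id, const void *src, long offset, long length);
} request_interpreter;

int main(void) {
    printf("layers: request interpreter -> buffer manager -> disk driver\n");
    return 0;
}
```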

Figures 3 and 4 capture their insight into the disk manager.  This dovetails with their discussion about efficiency of I/O, including observations about queue management ("shortest seek time first" order for requests, sub-sorted by "shortest latency time first").  This is a clear "hat tip" to the impact that rotational latency and track seek time have on performance.
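
A hedged sketch of that ordering policy: the comparator below sorts a request queue by seek distance from an assumed current head position, breaking ties by estimated rotational delay.  The structure and field names are illustrative, not DEMOS's actual code.

```c
/* Queue ordering as described: shortest seek time first, ties broken by
 * shortest rotational latency first.  Fields are illustrative. */
#include <stdio.h>
#include <stdlib.h>

struct request {
    int cylinder;          /* target cylinder (seek distance component) */
    int rotational_delay;  /* estimated time until the sector passes the head */
};

static int current_cylinder = 100;   /* assumed current head position */

static int cmp_requests(const void *a, const void *b) {
    const struct request *ra = a, *rb = b;
    int seek_a = abs(ra->cylinder - current_cylinder);
    int seek_b = abs(rb->cylinder - current_cylinder);
    if (seek_a != seek_b)
        return seek_a - seek_b;                          /* shortest seek first */
    return ra->rotational_delay - rb->rotational_delay;  /* then shortest latency */
}

int main(void) {
    struct request q[] = { {90, 7}, {150, 2}, {90, 3}, {101, 9} };
    size_t n = sizeof q / sizeof q[0];
    qsort(q, n, sizeof q[0], cmp_requests);
    for (size_t i = 0; i < n; i++)
        printf("cylinder %d, rotational delay %d\n", q[i].cylinder, q[i].rotational_delay);
    return 0;
}
```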

Speaking of performance, the authors discuss it directly, which leads to their observations on improving I/O performance: "I/O operations ought to proceed in parallel with computation".  Their point is that serializing these things decreases overall performance.  Their second observation: "[T]he length of time an I/O operation takes should be reduced as much as possible."  This seems logical and is one reason why they use their optimized queuing strategy.

There is a section on "file system buffering" that touches on the tradeoffs between using memory for buffer caching versus other possible uses.  Thus, the authors evaluate how increased buffering impacts their CPU utilization – this is in keeping with their goal of parallelizing I/O and computation.  Their observation?  The greatest benefit comes from a small number of buffers; in their analysis, eight buffers provide most of the benefit.  Building on that, in Figures 6 and 7 they observe there is a clear limit to the benefit of further buffering.  These days we do not think too much about this because we tend to use virtual caches, so the amount of physical memory is really managed by the virtual memory management code, yet the observation would likely still apply.  There is a limit to the benefit of buffering.

The authors also point out that disk allocation is challenging.  They employ allocation bit maps, cluster their allocations, over-allocate, and even use simplistic predictive read-ahead.  They refer to these as "strategy" routines.
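
Here is a small illustrative sketch of what such a strategy routine might look like: a bit-map allocator that tries to hand out a contiguous cluster and speculatively over-allocates a few extra blocks in anticipation of sequential growth.  The sizes and policy are my assumptions, not the paper's.

```c
/* Block allocation with a bit map, handing out a contiguous cluster and
 * over-allocating a few extra blocks.  Illustrative only. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 1024
static unsigned char bitmap[NBLOCKS / 8];   /* 1 bit per block; 1 = in use */

static int block_in_use(int b) { return bitmap[b / 8] & (1 << (b % 8)); }
static void mark_block(int b)  { bitmap[b / 8] |= (unsigned char)(1 << (b % 8)); }

/* Allocate `want` contiguous blocks, plus up to `extra` speculative ones.
 * Returns the first block number, or -1 if no run is long enough. */
static int alloc_cluster(int want, int extra) {
    for (int start = 0; start < NBLOCKS; start++) {
        int run = 0;
        while (start + run < NBLOCKS && !block_in_use(start + run) &&
               run < want + extra)
            run++;
        if (run >= want) {
            for (int i = 0; i < run; i++)
                mark_block(start + i);
            return start;
        }
        start += run;   /* skip past the too-short run */
    }
    return -1;
}

int main(void) {
    memset(bitmap, 0, sizeof bitmap);
    int first = alloc_cluster(8, 4);   /* ask for 8 blocks, over-allocate up to 4 */
    printf("cluster starts at block %d\n", first);
    return 0;
}
```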

In general, this is a great introduction to the basic structure of a media file system.  There are plenty of details that will be refined in later work.

The CAP File System

The CAP Filing System
R. M. Needham and A.D. Birrell, Symposium on Operating Systems Principles, 1977, Association for Computing Machinery.

I've fallen behind these past ten days, working towards a deadline.  My own conference paper is now submitted and I'm working on recovering.  What that means is this week is going to be a busy one as I work on catching up.  Adding to the fun, FAST is going on this week (February 13-16, 2008).  I hope to add some bonus discussions on the current work being presented there.

Let’s get to discussing this paper.

CAP is a capabilities-based system.  Capability systems are a parallel idea that has been explored from time to time as an alternative to the access control model of Multics/UNIX-inspired operating systems.  The latest contender in this space would be Fuchsia, an experimental operating system that Google is developing (though it is an open source project under a mixture of licenses).  Interestingly, there are reports it runs on the Pixelbook in addition to the previous hardware it had supported.

At any rate, the idea behind a capability is to use identifiers that encapsulate the access rights within the capability itself.  Thus, it is a name and the virtue of having the name is that it means you have the access inherent in that name.

These capabilities all represent to a program the right to have or do something: the preservation of information from one run of a program to another (a universal operating system requirement) is thus seen by the CAP programmer as the preservation of a capability rather than of an object itself.

The paper describes how capabilities are managed, which interacts with the "filing system" of course, since it has to store persistent information.  They describe the "system internal name" (SIN), which is combined with a disk address to map memory segments to actual storage.  A capability stored in a "directory" then becomes the persistent handle on such an object, and the directory acts as a repository of these persistent objects.  Storing a capability also creates a reference to the disk blocks in use, ensuring those storage regions are not reused.
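
As a rough illustration of that arrangement – a capability naming an object via its SIN and carrying the access rights inline, stored in a directory – consider the sketch below.  The field names and rights bits are my assumptions, not CAP's actual encoding.

```c
/* Illustrative sketch: a capability names an object (via its SIN) and
 * carries access rights; a directory is a store of such capabilities. */
#include <stdio.h>

#define RIGHT_READ   0x1u
#define RIGHT_WRITE  0x2u
#define RIGHT_ENTER  0x4u   /* e.g., the right to invoke the directory manager */

struct capability {
    unsigned long sin;        /* system internal name of the object */
    unsigned long disk_addr;  /* where the persistent object lives */
    unsigned int  rights;     /* access rights carried by the name itself */
};

struct directory_entry {
    char name[32];
    struct capability cap;    /* a directory is a repository of capabilities */
};

/* Possession of the capability is the access check. */
static int can_write(const struct capability *c) {
    return (c->rights & RIGHT_WRITE) != 0;
}

int main(void) {
    struct directory_entry e = { "passwords", { 42, 0x1000, RIGHT_READ } };
    printf("%s writable? %s\n", e.name, can_write(&e.cap) ? "yes" : "no");
    return 0;
}
```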

One interesting characteristic of their system is that they do not guarantee free disk space will be recycled quickly (“[T]here is no guarantee that inaccessible disk space will be relinquished at the earliest possible moment.”)  Indeed they note that space reclamation is only done when the system reboots and “[T]he filing system is not designed to run forever.”

They discuss how CAP differs from prior work (e.g., OS/360) where only the name matters; there is no concept of a directory for use as part of the capability system.  The directory actually provides them an additional level of control as well, and they use directory capabilities as well as segment capabilities.  Directories may be persistent (so they have a SIN and disk block location) or ephemeral (so they disappear when the program exits) – a sort of “built in” temporary storage concept for ephemeral memory management.

Sharing objects is now trivial – you simply tell someone the correct name; that embeds the capability within it.  They do not describe the mechanisms used for transfer (“[B]y mechanisms which do not concern us in detail here.”)

They also describe how this name space is similar to other systems (UNIX and CAL-TSS) but different:

  • Access is associated with the name which in turn references information in the directory.  It is not an attribute of the file itself.
  • The name space need not be a “strict hierarchy”.  This means that portions could become disconnected, or even be private to a single application.
  • Their use of directories behaves similarly to the model of "current directory" (presumably in UNIX) even though CAP expressly does not have a concept of current directory.
  • Directories are not even directed acyclic graphs!

The paper describes how capabilities work, since they are a fine-grained control mechanism.  They explain that the holder of an existing capability (a program) may generate a more restrictive capability to provide to another program.  Since capabilities apply to both individual files and directories, it is possible for a program to transfer a set of capabilities by creating a new directory, storing the relevant capabilities in it, and providing that directory to the target program.
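
A minimal sketch of that refinement rule, assuming rights can be represented as a bit mask (my simplification for illustration, not necessarily CAP's mechanism):

```c
/* Refinement: a holder may mint a new capability for the same object whose
 * rights are a subset of its own.  Illustrative only. */
#include <stdio.h>

struct capability {
    unsigned long sin;
    unsigned int  rights;
};

/* Derive a capability with no more rights than we already hold. */
static struct capability restrict_cap(struct capability have, unsigned int wanted) {
    struct capability out = have;
    out.rights = have.rights & wanted;   /* can only drop rights, never add them */
    return out;
}

int main(void) {
    struct capability mine = { 42, 0x3 /* read | write */ };
    struct capability readonly = restrict_cap(mine, 0x1 /* read */);
    printf("derived rights: 0x%x\n", readonly.rights);
    return 0;
}
```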

The names themselves can be quite ugly, as they incorporate capabilities within them.  They describe some conventions they employ in terms of name management and directory placement, but point out these are conventions and not a hard requirement of the system.

CAP certainly presents a rather different model of a file system than we see in other systems.  The idea of disconnected name spaces that are only visible to particular programs is an intriguing one.  They use the example of the password database, which requires the program have the password file capability.

They discuss security.  The directory manager is a separate module that programs use to interact with the directories to which they have access.  To invoke the directory manager, the caller must have the ENTER capability.

I find this to be exactly the type of thought provoking paper I had hoped to find as I comb through these old papers.  The idea that a file system name space need not be connected, that it could be private to a particular program or set of programs, and embedding access rights (“capabilities”) into the name will give me plenty to think about.

If you would like to know more about CAP there is a paper about it in the prior Symposium on Operating Systems Principles: The Cambridge CAP Computer and its Operating System.  It is not too surprising that this is available from Microsoft Research, as they also built a capability based operating system (or two): Singularity and Midori.

 

A Principle for Resilient Sharing of Distributed Resources

A Principle for Resilient Sharing of Distributed Resources
Peter A. Alsberg and John D. Day, In Proceedings of the 2nd international conference on Software engineering, pp. 562-570. IEEE Computer Society Press, 1976.

Today I turn my attention to a paper that begins to explore the issues surrounding distributed systems.  This paper sets forth some basic principles for distributed systems that are worth capturing and pointing out.

They state that a “resilient server” must have four attributes:

  1. It must be able to detect and recover from some finite number of errors.
  2. It must be reliable enough that a user doesn’t need to worry about the possibility of service failure.
  3. Failures just beyond that finite limit are not catastrophic.
  4. One user’s abuse should not have material impact on any other user.

These all seem somewhat reasonable, though I must admit that I found (3) a bit surprising, as it is not uncommon for some classes of failures to cause catastrophic failure.  For example, when using erasure coding it is common for some number of failures to lead to catastrophic failure.  Upon reflection, I realized that one could achieve this goal by simply setting the finite limit a bit lower, though I suspect this is not entirely what the original authors had in mind.

Figure 3

Still, the general concept of resiliency is a good one to capture and consider.  The authors point out some of the challenges inherent in this over a network, notably latency.  “[A] major consideration when choosing a resource sharing strategy is to reduce, as much as possible, the number of message delays required to effect the sharing of resources.”

In other words, keep the messages to a minimum and try not to make them synchronous.  Asynchronous messaging systems will turn out to be rather complex, however, and sometimes there are just operations that require synchronous behavior.

Of course, there has to be a file system tie-in here (however tenuous) and there is!  Under examples they list “Network Virtual File Systems which provide directory services and data access services…”  Thus, it is clear that the authors are considering network file systems as part of their work.

In 1976 the authors indicate that the cost to send a message is on the order of 100ms, while the cost to acquire a lock on the CPU is 100 microseconds to 1 millisecond.  While things are faster now, there is still a large difference between these two paths on modern systems.  Thus, we will still be dealing with issues on the same scale.

The bulk of the paper then walks through the description of their model for providing resilient network services: an application host sends a request to a primary server host; the primary then communicates with a secondary server host.  That secondary host can send backup requests to a tertiary server, and so forth.  The secondary host confirms with the primary host, and ultimately it is the secondary host that confirms the operation with the application host.
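
As I read it, the message flow looks roughly like the toy trace below; the roles and message contents are simplified from the paper's description and the names are mine.

```c
/* A toy trace of the described flow: application -> primary -> secondary
 * (-> further backups), with the secondary confirming back to the primary
 * and to the application.  Purely illustrative. */
#include <stdio.h>

static void send_msg(const char *from, const char *to, const char *what) {
    printf("%-12s -> %-12s : %s\n", from, to, what);
}

int main(void) {
    send_msg("application", "primary",     "request");
    send_msg("primary",     "secondary",   "forwarded request");
    send_msg("secondary",   "tertiary",    "backup copy of request");
    send_msg("secondary",   "primary",     "confirmation (primary may now apply)");
    send_msg("secondary",   "application", "final acknowledgement");
    return 0;
}
```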

They cover a variety of important concepts, such as the idea that a host may need to record state about the operations, that operations cannot be applied until the other servers have received their messages, etc.  This is, in essence, an early consensus protocol. While not efficient, the authors have laid groundwork in helping us think about how we construct such services.

I have included Figure 3 from the paper above.  It shows the message flow through the system.  The paper also discusses how to handle a number of failure situations and how messages flowing through the system keep track of which nodes are current.

It also touches on one of the most complex issues in distributed systems: network partition.  Intriguingly, the authors do not assert that one partition must remain alive as they describe the decision being related to the specific requirements of the environment.  Thus, in their model it would be possible to end up with partitions that continue forward but can no longer be easily re-joined after the network partition is resolved.  Of course, they don’t require that both sides of a partition progress, though they do not discuss ways to handle this, either.

They assert that the primary/secondary/backup model is optimal in terms of the number of messages that it sends and in how it ensures the messages are strictly ordered.  Then they briefly describe a few other possible arrangements that have multiple primaries and/or secondaries.  Their final conclusion is that their original strategy is at least as good as the alternatives though this is not demonstrated in any formal way.

Now that they have their replication system, they consider how it will work for the network virtual file system.  They conclude that the highest levels of the hierarchy need to be stored on all nodes (which makes them the most expensive to maintain).  They partition the name space below that and record location information within the nodes of the name space where the storage has been split out across hosts.  Thus, we see a description of a global name space.
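
A rough sketch of how such a name space might be resolved is below.  The replicated depth, the hashing, and the path syntax are all my assumptions for illustration, not the paper's design details.

```c
/* Top levels of the hierarchy replicated everywhere; deeper subtrees carry
 * a location record naming the host that stores them.  Illustrative only. */
#include <stdio.h>
#include <string.h>

#define REPLICATED_DEPTH 1   /* assumed: the top level lives on every host */
#define NHOSTS 4

/* Pick a host for the subtree rooted at this path component. */
static int shard_for(const char *component) {
    unsigned h = 0;
    while (*component)
        h = h * 31 + (unsigned char)*component++;
    return (int)(h % NHOSTS);
}

static void resolve(const char *path) {
    char copy[256];
    strncpy(copy, path, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';

    int depth = 0;
    for (char *tok = strtok(copy, "/"); tok != NULL; tok = strtok(NULL, "/")) {
        depth++;
        if (depth == REPLICATED_DEPTH + 1) {
            /* The first component below the replicated levels carries a
             * location record naming the host that stores that subtree. */
            printf("%s -> stored on host %d\n", path, shard_for(tok));
            return;
        }
    }
    printf("%s -> replicated on all hosts\n", path);
}

int main(void) {
    resolve("/projects/demos/notes.txt");
    resolve("/projects");
    return 0;
}
```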

Their system is simple, but they identify a number of the important issues.  Their suggestion about sharding the file system's name space is one that we shall see again in the future.

They have laid the groundwork for thinking about how to build distributed file systems.

Some Observations about Decentralization of File Systems

Some Observations about Decentralization of File Systems.
Jerome H. Saltzer, 1971.

This paper caught my eye because it leads in a different direction than the other file systems papers I’ve been looking at.  Instead of talking about file systems on a single computer, it has the audacity of suggesting that maybe we want to have remotely accessible storage.

The author frames this in the context of networks.  The first (of two) references in this paper is about networks and helps frame the conversation about "decentralized" file systems:

Computer network development to achieve resource sharing
Lawrence G. Roberts and Barry D. Wessler, AFIPS ’70 (Spring) Proceedings of the May 5-7, 1970, spring joint computer conference, pp 543-549.

The authors' affiliation is the Advanced Research Projects Agency (ARPA) and what they describe in this paper is the ARPA Network.  I don't want to get too far into this, but I did want to include this wonderful bandwidth/cost chart – something I have definitely seen in various guises since then.

ARPA Network 1970 Performance Cost Comparison

In this time frame, they are discussing the creation of a network for sharing data across geographically dispersed sites, including sites scattered around the United States.  Connection speeds in this time frame are 50Kb/s.  The paper estimates the entire bandwidth consumption of the network will be between 200-800Kb/s by mid-1971, with twenty nodes connected to the entire network.

The point of this diagram is to discuss costs, where it points out that the least expensive way to move large quantities of data is to encode it on a magnetic tape and send it in the mail (air mail, of course).

Why is this important?  Because it helps set the stage for the conversation about resource sharing.  This is before Ethernet exists (Version 1.0 of the Ethernet specification appears in 1980).  Thus, the networks that do exist are hardware specific.  The amount of data being moved is remarkably small by modern standards.

This is, however, where we start considering what is involved in supporting network file systems – decentralized systems of storage that can communicate with one another.

The author stakes out his position in the abstract:

This short note takes the position that the inherent complexity of a decentralized and a centralized information storage system are by nature essentially the same.

This defines the starting point of what will be a decades-long conversation on this fundamental issue.  The author argues that in fact the real issue is one of engineering, not models:

The consequence of this claim, if substantiated, is that the technical choice between a centralized or decentralized organization is one of engineering tradeoffs pertaining to maintainability, economics, equipment available, and the problem being solved, rather than one of functional properties or fundamental differences in complexity.

The discussion then points out that in some cases, adding a 20-40 millisecond delay on top of the usual 20-50 millisecond disk delay is not dramatically different.  They explore other areas where the timing might make a substantial difference.  Intriguingly, they discuss a form of disaggregation – where they envision compute being in one location, memory in another, storage in yet another.  They point out that this turns back into a computer (albeit one with long latency to access memory, for example).

They then discuss caching ("buffer memory") of information but point out the challenge is now that the system has multiple copies of what need to be the same data – it "has the problem of systematic management of multiple copies of information".  Surprisingly, they make a leap here equating this situation between centralized and decentralized systems: this is a problem of modeling access patterns to shared information and then inventing algorithms for "having the information on the right storage device at the right time"!

With decades of hindsight, this looks surprisingly naive.  Indeed, even the author constructs a clear path for retreat here: "… this is not to say that there are no differences of significance".

They conclude by opining that storage is challenging:

The complexity is the inevitable consequence of the objectives of the information storage system: information protection, sharing, privacy, accessibility, reliability, capacity, and so on.  What is needed in both cases is a methodical engineering basis for designing an information storage system to specification.  On the other hand a decentralized organization offers the potential for both more administrative flexibility, and also more administrative chaos.

What surprised me about this paper is that the issues around sharing data between computers were already being considered at this point.  The idea that network file systems and local file systems really should be "more or less" the same is thus planted early.  I'm sure we will see divergence on this point as we continue moving forward.

For now, we have a base upon which to build this new direction of “decentralized” file systems.

An Experimental Implementation of the Kernel/Domain Architecture

An Experimental Implementation of the Kernel/Domain Architecture
Michael J. Spier, Thomas N. Hastings, David N. Cutler,  ACM SIGOPS Operating Systems Review, vol. 7, no. 4, pp. 8-21. ACM, 1973.

While looking for file systems papers, one of the approaches that I used was to walk through the successive years of conference proceedings from the Symposium on Operating Systems Principles (SOSP).  This biennial conference has been going on for more than 50 years at this point and contains a wealth of interesting and influential papers from the operating systems domain.  It is common to find file systems papers at these conferences.

This particular gem comes from the fourth such symposium (1973).  It discusses various aspects of operating systems, which makes it quite interesting in its own right, but it also discusses file systems within the context of this experimental operating system in ways that may prove insightful.

This paragraph caught my eye:

On our proposed architecture with its many monitor-like domains we envisioned the possibility of supporting the concurrent existence of similar purpose redundant supervisory services (e.g., 3 file systems, 7 file nomenclature hierarchies, 4 login responders, etc.) which, for all practical intents and purposes, would provide the external appearance of concurrent operating systems of disparate natures.

This is the first time I have seen a reference to the concept of having multiple file systems simultaneously as an explicit part of the operating system model.  It also describes this intriguing concept of "concurrent operating systems of disparate natures".  One of the authors (David N. Cutler) goes on to work on the construction of several different operating systems, including what is now Windows – and its architecture was designed to provide precisely this type of concurrent, disparate operating system environment.  Windows also supports multiple dynamically loaded file systems, which no longer looks so novel, as essentially all modern operating systems do the same.

Indeed, what they go on to describe is the "domain machine" in which the monolithic operating system has been divided up into distinct components, which constitute distinct supervisory domains.  One of these domains represents the small set of functionality necessary to manage the actual hardware.  The goal was to keep it small and bug-free.

Complicating this was the authors' observation that the system must be re-entrant in order to avoid deadlock.  This was a deliberate goal, and they even go so far as to point out this behavior when involving the file system – the kernel may need to invoke the file system to handle demand loading of a domain, but the file system must invoke the kernel in order to access the disk where the data is stored.

Another of their observations intrigued me: they separate the “traditional file system” into a storage layer (“disk space manager” as they call it) and a “name space manager”.  They admit the possibility that this name space might be hierarchical.  They also explicitly note they could support “differently flavored file systems”.

In scheduling, they touch on the fact that I/O bound processes (e.g., the file system) benefit from being given real-time scheduling priority on traditional operating systems; in this domain system they do not need to do so.

This paper definitely surprised me, as I see the germs of ideas here that will be clearly realized in later systems work.  Some of these ideas clearly survive and emerge later, including those that affect file systems.

As we will see later, this logical divide of file systems actually naturally develops along the storage line, though the name space is not so clearly delineated.  Perhaps it should be.

While not directly germane to file systems, Dave Cutler goes on to work on RSX-11M and VMS (at Digital Equipment Corporation), as well as Windows NT (at Microsoft).  The latter became the basis of the Windows OS beginning with Windows XP.

 

The UNIX Time-Sharing Operating System

The UNIX Time-Sharing Operating System
Dennis M. Ritchie and Ken Thompson, Bell Labs, Communications of the ACM, July 1974, Volume 17, Number 7, pp. 365-375.

This paper describes Version 3 of UNIX.  It was Version 6 that became the basis of the Berkeley Software Distribution (BSD) version of UNIX.  The only other operating system in CS history to date that has had so much impact on operating systems development is MULTICS, and UNIX is a direct descendant.  Developed by several Bell Labs researchers who had been involved in MULTICS, its goal was to try and build a smaller operating system that retained what they viewed as the key benefits of MULTICS.

Much of UNIX Version 3 was written in the C programming language, itself derived from BCPL, a language that had been used on the MULTICS project.

The most important job of UNIX is to provide a file system.

These words leave little doubt about the role of file systems in UNIX and the importance assigned to them.  The paper then goes on to describe files, in similar terms to MULTICS: files, directories, and special files (devices).  We see the hierarchical file system of MULTICS reflected back in the description of the system.  They talk about file names being 14 characters or less, the formation of paths, and the iterative walk of names through the file system name space to find other directories, as well as the terminal file nodes.

The directory structure is constrained to have the form of a rooted tree.

This is what I am looking for – the why of hierarchical file systems.  I found the answer here, unsurprising yet ironically disappointing:

The reason for this is to simplify the writing of programs which visit subtrees of the directory structure, and more important, to avoid the separation of portions of the hierarchy.

Not surprising, yet not precisely what I had expected.  I had expected the reason to be the simplicity of the operating system (though they do allude to this by discussing the difficulty of knowing when it is safe to delete a directory).  They describe links to files, however, so their file system is not really a tree.  Rather, it is more like a directed acyclic graph (DAG).  Files do not have pointers back to their directories, but directories have pointers back to their parents.  Thus we have the distinction: the name space is a DAG.  Files don't really live in the name space directly; they are referenced from the name space and carry a reference count.

Oddly, with that, I found what I came for, at least in terms of insight for my own research area.  There is a certain satisfaction in being able to point to this seminal document and say “this is why we got to where we are now.”

But if I stopped at this point, I would be leaving out the bits I had not expected to find.

First, the mundane: they discuss removable file systems and the fact that this is in fact a collection of name spaces, combining persistent name spaces with one another using a non-persistent mechanism (mounting).  There is a simple description of how the file system is itself implemented.  They describe the i-number (now the inode number) as an index into the file table.  Thus, a directory entry is where the name lives; it merely refers to the file using its i-number.  The entries of that file table are the i-nodes.  The i-node contains information about the owner of the file, the protection bits governing the file, the location information for where logical data is physically stored on the medium, the size of the file, its timestamps, its attribute bits, and the number of directory entries referencing the given i-node.
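
Written out as a modern C struct, the described i-node and directory entry look roughly like the sketch below.  Types, sizes, and the NADDR constant are illustrative; the actual 1974 on-disk layout differed.

```c
/* The fields the paper lists for an i-node, expressed as modern C.
 * Illustrative only; not the historical on-disk format. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define NADDR 8   /* assumed number of block addresses held in the i-node */

struct inode {
    uint16_t owner;           /* user id of the file's owner */
    uint16_t mode;            /* protection bits, plus attribute flags */
    uint32_t size;            /* file size in bytes */
    uint32_t addr[NADDR];     /* where the logical data lives on the medium */
    time_t   mtime;           /* timestamps (modification shown here) */
    uint16_t nlink;           /* number of directory entries referencing it */
};

/* A directory entry holds only the name and the i-number; everything else
 * about the file lives in the i-node it refers to. */
struct dirent14 {
    uint16_t inumber;         /* index into the i-list ("file table") */
    char     name[14];        /* names were limited to 14 characters */
};

int main(void) {
    printf("inode: %zu bytes, directory entry: %zu bytes (illustrative sizes)\n",
           sizeof(struct inode), sizeof(struct dirent14));
    return 0;
}
```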

Surprisingly, this is not all that different from how it is now, 45 years later.  Implementation details have changed, as we no longer limit files to 10MB in size.

They describe buffering and sector-sized I/O, noting that it is more efficient for a program to do sector-sized I/O operations.

Much of the paper has nothing to do with file systems.  I leave it to the interested reader to explore beyond that.

There are two interesting tidbits remaining:

  • They lost data once, on a hard disk that failed.  The backup was 3 days old.
  • They considered the permuted index application as one of the “major programs available”.

The fact they considered the permuted index important at this early stage was an interesting insight to me.  Clearly, the ability to “find our stuff” is one that’s been around since the dawn of time.  Maybe this research direction of mine does make sense.

MULTICS I/O

Feiertag, Richard J., and Elliott I. Organick. “The Multics input/output system.” In Proceedings of the third ACM symposium on Operating systems principles, pp. 35-41. ACM, 1971.

In this paper we return to MULTICS but this time the focus is on the I/O system.  Much of the content is not directly related to the file system, but it certainly does touch on familiar concepts including the file system.

By this time MULTICS had been released and Bell Labs had dropped out of the project (to build the “castrated” version of MULTICS that became UNICS and ultimately UNIX.)  Thus, this is more a reflection of the system that has been built (versus the earlier paper, where it is a prospective description of the system).

So what is this I/O System?  The description is of a system with a flexible interface, from a highly general (and easy-to-use) interface, to a highly specific (and challenging-to-use) interface to devices.  The idea seems to be that devices have a common interface – the general level interface – but may have specific features that require device-specific knowledge to exploit – the specialized interface.

There are some interesting insights in this paper, again, that reflect the way we do things now rather closely.  For example, the MULTICS I/O model puts the file system as a separate service, but then shows it interacting with an I/O system component.  Thus, applications can be seen to directly interact with the file system, via its own API, or they can do so through the more general Device I/O Module (DIM).

This abstraction of devices is one we see in many subsequent operating systems (and all modern ones).  File systems gained their idiosyncratic behavior model fairly early, then, where they work closely with, but not subserviently to, the I/O subsystem.

Indeed, much of this paper relates to how I/O from files may be redirected for other purposes, such as for the input to an application instead of a more common interface device, such as a teletype (tty) machine.  Of course, MULTICS had already indicated that files were treated logically as memory and their contents could be pulled in “as needed” (demand paging).

Multics IO System Figure 2

This paper also describes the interface to the file system, albeit in general terms.  It describes the idea that logical data flows (streams, as they call them) can be separated from the actual data storage mechanism used beneath them.  It is this mechanism that permits them to redirect the tty.

This paper thus introduces some surprising concepts around I/O: a semi-abstract interface; the idea that file systems were a consumer of that interface (and, in theory, a provider of it); and the ability to extend the OS for new device types because of this generalized interface.  After reading it, I can see its influence on both Linux (via UNIX) and Windows (presumably via RSX-11M).

 

File Structure

Complex Information Processing: A File Structure for the Complex, The Changing, and the Indeterminate
ACM ’65 Proceedings of the 1965 20th national conference, Pages 84-100
Nelson, T. H.

This paper has been a bit more challenging to incorporate.  I originally picked it from the intriguing title and lead-in.  It represents very early thought on how information should be organized.  In my final analysis, I concluded more that this relates to the structure of information than it does specific file systems work.

There are some useful insights within this paper.  For me, the key one is that users desire complex structure for their information – they don’t really want a pure hierarchical model.  Even the MULTICS paper hints at this (where their name space is hierarchical but they have links that violate their hierarchy.)

THE KINDS OF FILE structures required if we are to use the computer for personal files and as an adjunct to creativity are wholly different in character from those customary in business and scientific data processing. They need to provide the capacity for intricate and idiosyncratic arrangements, total modifiability, undecided alternatives, and thorough internal documentation.

This is the opening paragraph for this paper. There is no abstract and the organization is even more casual than seems common for this time period. But the author’s challenge is an interesting one: unstructured data is different than the structured data of the business world.  My initial reading of this was that this recognizes the distinction between structured data – the path that leads to databases, and unstructured data – the path that leads to file systems.

But then what it goes on to describe is more like some sort of interactive editor environment, rather than a classic file system.  It focuses on the way in which the information itself would be managed in this environment:

The original problem was to specify a computer system for personal information retrieval and documentation, able to do some rather complicated things in clear and simple ways. The investigation gathered generality, however, and has eventuated in a number of ideas. These are an information structure, a file structure, and a file language, each progressively more complicated.

Thus, the author gives us important data structures, such as the “zippered list”.  I note it here more as a curiosity than anything else – it gives me a sense of why I struggled a bit with this particular paper.

This paper reinforces the idea that human users want some concept of temporal relationship.  Much of the text of the paper relates to the way in which an author creates works of authorship, starting from the initial text, where ideas and information migrate around as the structure and organization of the document changes.

Thus my observation that at some level this is a paper about how to build a word processor; yet it also provides insight into how users think about the organization of their data – and it is definitely not hierarchical in nature! 
Figure three actually reminds me more of the record oriented structure that we saw for IDS.  So while the author started off distinguishing his usage pattern from industrial patterns, his diagrams certainly do suggest this is remarkably similar to the relationship one sees in data within a database.

Thus, it raises a question in my own mind: is this distinction between structured and unstructured data a fixed one, or is it more like one end of a complex design space?

The paper describes three distinct components of this envisioned system: the zippered lists (data relationship maps), the structure of the files (ELF), and a proposed “file language” PRIDE.  In fact, much of the paper seems rather problem specific, despite the author’s attempt to generalize this.

To recap, here are the insights I obtained from this paper:

  • The structure of real information is a graph of relationships;
  • Data relationships change over time – thus what we might view as a data move is often a change in the relationship of pieces of the information
  • Temporal information can be important.  A history of operations (for example) or versioning.
  • How information should be organized is often domain specific.

If I fall back to think about the current distinction – file systems store unstructured information, databases store structured information – it occurs to me that there is more of a continuum here.  File systems do store some structured information – file attributes, for example, or the file name.  Databases do store some unstructured information (albeit fixed size).  Thus, perhaps the distinction is more artificial than I first realized.

Clearly we have more to read.

Database versus File System

A general purpose programming system for random access memories
Proceedings of the October 27-29, 1964 Fall Joint Computer Conference, Part 1, pp. 411-422, ACM, 1964.
Bachman, Charles W., and S. B. Williams.

When I discussed MULTICS file systems I disingenuously glossed over an important point they make there: namely, that file systems are an ordered sequence of elements.  Today, we think of those elements as bytes, but in fact their definition was more general than just bytes.

Thus, I step back in time and look deeper.  What happens if the elements are not individual bytes?  Where does that lead us?  This paper discusses records.  Thus, we see the period before the clear division between unstructured data (byte-oriented files) and structured data (record-oriented data).  These two fields are inextricably linked in history, and while we will see them diverge from one another, we will also see their similarity, as techniques from one are often used by the other.

Thus we see the early concept of a database emerging.  While I won’t talk extensively about databases, it seems useful to point them out because we will find ourselves returning to some of these ideas over time.

Thus, they define a record: it has multiple data fields, with chained information.  They have structure – “describing an event, a thing, plan, [or] status”.  If we have multiple records that are logically indexed, we have a file system – at least based upon the MULTICS definition.

Of course, the astute reader might point out that this paper precedes the MULTICS paper by a year. Thus, I would suggest that in MULTICS they didn’t wish to exclude the idea of record oriented systems.

Today, we tend to think of files as being a logical array of bytes.  This doesn’t have the inherent concept of a relationship between the fields (bytes) because in a byte oriented model there is no relationship.  But I’m getting ahead of the conversation at this point.

What this paper is describing is a logical association of information elements that are internally linked together in some fashion.  There is structure to this information; the idea is that whatever we are describing requires some finite amount of storage.

Here is where things get interesting in this paper (at least to me).  They show us this wonderful diagram (Figure 3 from the paper) that views our storage space as being nothing more than an extension of memory – these records can be stored in “core memory” or they can be stored in “disk memory”.  This concept of thinking about disks as being a form of memory might seem strange today, but there is a certain logic to it – and we see modern systems dealing with the hierarchical nature of storage.  Again, I’m getting ahead of myself.

So what other interesting insights lurk in this paper?  Many of its details are about their model for creating associations between records in a chain.  These chains are intriguing, as they form a closed loop in their model.  One benefit of this approach is that when you construct chains, you can traverse them from beginning to end, regardless of where you start.  The authors build on top of this model to construct an interesting system in which data is shared across different record chains.  They point out this avoids inconsistency and saves space.
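
A small sketch of such a closed-loop chain is below.  The record contents are invented; the point is the traversal property that a walk can start anywhere and still visit every member.

```c
/* Closed-loop record chains: each record points to the next, and the last
 * points back to the first.  Illustrative contents. */
#include <stdio.h>

struct record {
    const char *data;
    struct record *next_in_chain;
};

static void walk_chain(struct record *start) {
    struct record *r = start;
    do {
        printf("%s\n", r->data);
        r = r->next_in_chain;
    } while (r != start);          /* the loop closes back on itself */
}

int main(void) {
    struct record a = { "order header", NULL };
    struct record b = { "line item 1", NULL };
    struct record c = { "line item 2", NULL };
    a.next_in_chain = &b;
    b.next_in_chain = &c;
    c.next_in_chain = &a;          /* close the loop */

    walk_chain(&b);                /* start from the middle: still sees all */
    return 0;
}
```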

Today, we look at this and think “but of course, that’s how we construct data structures”. Thus, one of the contributions of this paper: an idea we now take for granted.  We’ve moved far beyond this concept, but it is rooted here.

The paper links the insertion and retrieval of these records with the terminology of then modern languages COBOL and FORTRAN: they PUT and GET individual records (much like we do with the HTTP protocol).

The paper discusses the details of moving data between core and disk: how to determine if something is in core, and even the need to try to keep frequently accessed data in core while infrequently accessed data is transferred to disk.

This is a common concern for file systems and databases, and we see it described in this paper.  In essence, they are describing the least recently used algorithm.
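
For illustration, here is a minimal least-recently-used sketch: a fixed set of in-core slots stamped with a logical clock, evicting the oldest on a miss.  This shows the policy being gestured at, not the paper's actual mechanism.

```c
/* Minimal LRU: fixed in-core slots, stamp on use, evict the oldest stamp
 * when a record must be brought in from disk.  Illustrative only. */
#include <stdio.h>

#define NSLOTS 4

struct slot {
    int  record_id;    /* -1 = empty */
    long last_used;    /* logical clock value of the most recent access */
};

static struct slot core[NSLOTS] = { {-1, 0}, {-1, 0}, {-1, 0}, {-1, 0} };
static long clock_tick = 0;

static int access_record(int id) {
    int victim = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (core[i].record_id == id) {       /* already in core: just restamp */
            core[i].last_used = ++clock_tick;
            return i;
        }
        if (core[i].last_used < core[victim].last_used)
            victim = i;                      /* remember the oldest slot seen */
    }
    /* Not in core: evict the least recently used slot and load into it. */
    core[victim].record_id = id;
    core[victim].last_used = ++clock_tick;
    return victim;
}

int main(void) {
    int ids[] = { 1, 2, 3, 1, 4, 5 };
    for (int i = 0; i < 6; i++)
        printf("record %d -> slot %d\n", ids[i], access_record(ids[i]));
    return 0;
}
```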

The rest of this paper then uses a purchase ordering system as the basis of describing the interactions of the records and how they are used to implement the necessary functionality.  We don’t tend to think of purchasing systems, or any database interaction, as being a systems level activity these days, but in 1964 it certainly was.

Why is this structure useful?  “It yields efficient programs with buffered operation of the disc file.”

We will return to these issues in future papers as well.  My goal here was just to touch upon this divide between file systems and databases that forms in this early time period.

TENEX

Tenex, a Paged Time Sharing System for the PDP-10
Communications of the ACM, March 1972, Volume 15, Number 3
Daniel G. Bobrow, Jerry D. Burchfiel, Daniel L. Murphy, and Raymond S. Tomlinson, Bolt Beranek and Newman Inc.

TENEX is a new time sharing system implemented on a DEC PDP-10 augmented by special paging hardware developed at BBN. This report specifies a set of goals which are important for any time sharing system. It describes how the TENEX design and implementation achieve these goals. These include specifications for a powerful multiprocess large memory virtual machine, intimate terminal interaction, comprehensive uniform file and I/O capabilities, and clean flexible system structure. Although the implementation described here required some compromise to achieve a system operational within six months of hardware checkout, TENEX has met its major goals and provided reliable service at several sites and through the ARPA network.

Storage organization and management in TENEX
Daniel L. Murphy
AFIPS ’72 (Fall, part I) Proceedings of the December 5-7, 1972, fall joint computer conference, part I
Pages 23-32, Anaheim, California — December 05 – 07, 1972

The first of these two papers discusses TENEX; much of the paper is not about file systems, but there is about a page and a half about the TENEX file system.  The second paper goes into greater detail about storage – including the file system – for TENEX.  I have picked these papers for several reasons:

  • They demonstrate the impact of the MULTICS work on the systems that follow (certainly beyond the obvious UNIX work).
  • They introduce the concept of virtual integration with the file system
  • They introduce the concept of copy-on-write
  • They show the fundamental drive to maintain backwards compatibility
  • They introduce the concept of a suffix (or extension) as a means of identifying the purpose of a file
  • They delve into the details of open file state management

A TENEX file has a compound name structure:

A powerful and versatile directory and file naming facility is provided in which a particular file is identified by a fixed-depth path which includes device, directory name, file name, extension, and version.

The fixed-depth path is a limitation the TENEX developers chose to implement for backwards compatibility with existing PDP-10 programs, an early example of how application compatibility is often a critical concern for operating systems development.  The authors do note they are considering expanding upon this to make it arbitrary depth – a feature of MULTICS.

Both papers also discuss the Job concept, the idea of a set of related processes.  The implication is that processes within a single job can share resources, thus providing more of that "balance between sharing and isolation" that operating systems have to handle.  When a file is opened successfully, a Job File Number (JFN) is created in a table.  It encapsulates the information about how to find the given file, and subsequent calls use the index value instead – in other words, a file descriptor or file handle.  "Once the initial association of JFN and file has been established, the JFN is used for all ensuing operations on the file, including sequential reading and writing, opening, closing, etc."

TENEX then allows random access to the file by combining the JFN with an index identifying the desired element.  The authors point out that this is more flexible than previous systems in which the file was not random access.

TENEX File to Process Mapping

This becomes even more flexible when combined with the page map for a given process.  The Process Map points from an entry in the virtual address space to a corresponding JFN and index (offset).  Thus, the contents of that page can be retrieved on demand from the underlying file system.
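
To picture the two tables involved, here is an illustrative sketch: a per-job file table handing out JFNs, and a process map whose entries name a (JFN, file page) pair for demand fetching.  All structures and the path string are my assumptions, not TENEX data layouts.

```c
/* Per-job file table (JFNs) plus a process page map whose entries name a
 * (JFN, file page index) pair.  Illustrative structures only. */
#include <stdio.h>

#define MAX_FILES 16
#define NPAGES    8

struct jfn_entry {             /* one slot per open file in the job */
    int         in_use;
    const char *path;          /* device, directory, name, extension, version */
};

struct map_entry {             /* one slot per page of the process map */
    int jfn;                   /* which open file backs this page (-1 = none) */
    int file_page;             /* index of the page within that file */
};

static struct jfn_entry job_file_table[MAX_FILES];
static struct map_entry process_map[NPAGES];

static int open_file(const char *path) {
    for (int jfn = 0; jfn < MAX_FILES; jfn++)
        if (!job_file_table[jfn].in_use) {
            job_file_table[jfn].in_use = 1;
            job_file_table[jfn].path = path;
            return jfn;        /* all later operations use this small index */
        }
    return -1;
}

int main(void) {
    for (int i = 0; i < NPAGES; i++)
        process_map[i].jfn = -1;

    int jfn = open_file("device:<directory>name.ext;version");  /* schematic path */
    process_map[3].jfn = jfn;          /* map file page 5 at process page 3 */
    process_map[3].file_page = 5;

    printf("page 3 -> JFN %d (%s), file page %d\n",
           process_map[3].jfn, job_file_table[jfn].path,
           process_map[3].file_page);
    return 0;
}
```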

None of this should look particularly surprising to anyone familiar with modern operating systems, of course.  This just happens to be part of the path to get to where we are today.  The papers actually go into greater detail here, including access control, but that isn't germane to my file systems focus.

Since the file path names identify files over the domain of all jobs in the system, it is evident that our naming and mapping procedures readily provide a means for sharing storage. Using the appropriate path names (including legality checks), processes in two or more different jobs can identify the same file, and each can obtain a JFN for it. Nothing in the mapping procedures specified above requires that either process be aware of the other’s access, and so each process constructs an identifier and places it in its process map (Figure 4).

In other words, the contents of regions of a file can be shared across processes.  This is, in fact, transparent to the processes.

Sharing at this level would be particularly important because of the limited address space and desire to share code – the papers discuss this, and point out the benefits of this form of sharing.

This leads to their inclusion of copy-on-write.  "One other important TENEX feature which facilitates sharing is a type of page access called copy-on-write. To our knowledge, this facility was first developed and used on the BBN-LISP system for the XDS 940."  Thus, while not original to TENEX, this is a logical extension beyond what MULTICS had described.  Copy-on-write is a mainstay of both modern operating systems and some file systems.
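
A compact user-space illustration of the copy-on-write idea (a simplified model, not TENEX's paging hardware): the page stays shared until the first write, at which point the writer quietly receives a private copy.

```c
/* Copy-on-write in miniature: two processes map the same page; the page is
 * shared until one writes, at which point that writer gets a private copy. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 16

struct mapping {
    char *page;        /* the data this "process" currently sees */
    int   shared_cow;  /* 1 = still sharing, copy needed before writing */
};

static void write_byte(struct mapping *m, int off, char value) {
    if (m->shared_cow) {                      /* "fault": first write to a shared page */
        char *copy = malloc(PAGE_SIZE);
        memcpy(copy, m->page, PAGE_SIZE);     /* private copy made lazily */
        m->page = copy;
        m->shared_cow = 0;
    }
    m->page[off] = value;
}

int main(void) {
    char original[PAGE_SIZE] = "shared page";
    struct mapping p1 = { original, 1 };
    struct mapping p2 = { original, 1 };

    write_byte(&p1, 0, 'S');                  /* p1 copies, then writes */
    printf("p1 sees: %s\n", p1.page);
    printf("p2 sees: %s\n", p2.page);         /* p2 still sees the original */
    return 0;
}
```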

Interestingly, TENEX seems to implement a rudimentary page cache as well:

To implement the file sequential monitor calls (e.g., byte-in, byte-out) the monitor maintains a number of “window” pages in a separate map invisible to the user process. For each file with sequential operations in progress, the monitor maps the file page which is to receive or provide the next byte. Each call from the user causes one or more bytes to be loaded from or stored into this page, and a count updated to determine if a new page should be mapped. Movement through the file is accomplished by mapping successive pages, and the sequential access module does not have to be aware of the physical device on which the page resides nor interface with I/O driver modules to read or write it. This modularity is very satisfying from an operating system design point of view.

Thus, byte level access to block level devices is managed via this window page mechanism.  The files are not strictly memory mapped, though, so this is more like a buffer cache than a page cache.

They also use the file system to implement inter-process communications – a form of file-backed shared memory.

Page management is tightly tied to this implementation as well, though the description involves what we would likely consider the memory management unit and page fault handling logic as well as the page to file/offset mapping necessary to provide the system’s demand paging.

Two other interesting aspects of their file system model include a pair of extra mapping layers: one for mapping from logical storage address to physical storage location, and the other mapping from multiple distinct page references to a single storage block.

The underlying rationale here is that this permits relocating the storage to different locations, typically from higher speed storage (when warm/hot) to slower speed storage (when cold).

This mechanism doesn’t involve changing the actual description of the storage and instead moves to a logical storage addressing model.  It was interesting to me to see this level of indirection added in such an early system, but clearly the mismatch in speeds between various types of storage dictated the importance of this scheme.  Once again, it is interesting to see how little the problems we face have actually changed.

The data sharing model also uses an extra level of indirection.  I’m familiar with this model from my own work in Windows, where shared memory is indirectly mapped in a similar fashion.  That this mechanism was around in the early 1970s is once again a reminder of how little operating systems have fundamentally changed.

There are many aspects of this paper that I have glossed over, in no small part because they don't really apply to modern systems – we don't have to worry about drum memories, for example, any more than we need to worry about punch card readers.  These two papers, however, clearly lay out a deeper realization of the file system than I have seen in prior work.

TENEX differed from MULTICS in a number of ways and the two systems remained competitors for many years.  TENEX ultimately would become TOPS-20 and in turn be supported by Digital Equipment Corporation.  It was an important part of the early (pre-VAX) ARPANET and survived for many years as a viable system.

If you would like to read more about this, I'd recommend Dan Murphy's Origins and Development of TOPS-20 post.  It provides further fascinating background on how TENEX and the systems around it evolved.  I leave you with the final words from that post:

Although this book is about DEC’s 36-bit architecture, it is clear now that hardware CPU architectures are of declining importance in shaping software. For a long time, instruction set architectures drove the creation of new software. A new architecture would be introduced, and new software systems would be written for it. The 36-bit architecture was large in comparison to most other systems which permitted interactive use. It was also lower in cost than most other systems of that size. These factors were important in the creation of the kind of software for which the architecture is known.

Now, new architectures are coming along with increasing frequency, but they are simply “slid in” under the software. The software systems are far too large to be rewritten each time, and a system which cannot adapt to new architectures will eventually suffer declining interest and loss of competitive hardware platforms. TOPS-20 didn’t pass that test, although it did contribute a number of ideas to the technology of interactive systems. How far these ideas ultimately spread is a story yet to be told.

There is considerable insight in this for me, particularly the admonishment “software systems are far too large to be rewritten each time” as it resonates with (one of) my own research directions.