Home » File Systems » Network File Systems

Category Archives: Network File Systems

June 2025
S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

What is the Optimal Location for Storing Metadata

November 6, 2022

The past month has included both a very interesting talk from someone at a major storage vendor and an in-depth discussion about my work and how it might be applicable to an issue that confronts the Metaverse community. I haven’t been at the keyboard much (at least not for my research) but I have been mulling this over as I have worked to try and explain these insights. Each iteration helps me refine my mental model by considering what else I have learned. Fortunately, this latest round doesn’t impact the work that I have done, but it has provided me with a model that I think could be useful in explaining this work to others.

I have previously talked about a type of metadata that I call activity context. Of course, there is quite a lot of metadata that is involved in managing storage and I have been using a model in which the metadata I am collecting is not at the point of storage but rather at the point of analysis. In my case, the point of analysis is on (or near) my local device ecosystem. As I learned more about the needs of the emerging metaverse field (by speaking with my friend Royal O’Brien, who is the general manager for the Open 3D Foundation, which is part of the Linux Foundation) and combined some of what I learned there with insights I gained from a recent talk given to my research group I observed what I think are some useful insights:

Storage vendors have no mechanism for capturing all the kinds of activity data that I envision using as the basis for activity context.
Some high-performance data consumers need to maintain replicated data and use metadata about that data to make critical decisions.
Metadata needs to be close to where it will be consumed.
Metadata needs to be produced where the information is available and optimally where it is least expensive to do so.

That isn’t a long list, but it is one that requires a bit more unpacking. So I’m going to dive deeper, step by step. This probably isn’t the right order, but I will start here and worry about (re)-organizing it later.

Metadata Production

I had not really considered the depth of the question about where to produce the meta-data until I started mulling over the myriad of questions that have arisen recently. The cost of producing metadata can be a critical factor. Agents that extract semantic information about the data (e.g., its content) need to be close to the data. However, it is important to note that is not the same as “the final location of the data” but rather “a current location of the data.” Yet, even that isn’t quite right: metadata might be extracted from something other than the data, like something from the running system, or even an external source. For example, the activity data that I have been focused on collecting (see System Activity) largely arises on the system where the data itself is accessed. The metaverse model is one where the user has considerable insight (ah, but a bit more on this later) and since I’ve always envisioned an extensible metadata management system it makes sense to permit a specialized application to contribute to the overall body of metadata.

Thus, the insight here is that it makes sense to generate metadata at the “lowest cost” point to do so. For example, the activity data on my local machine can’t be collected by a cloud storage engine. It could be collected by an agent on the local machine and sent to the cloud storage engine, but that runs into a separate cost that I’ll touch on when I describe where we should be storing metadata. For example, extracting semantic content makes sense to do at the point of production and again at the point of storage. Activity data, which is related to “what else is happening” can’t be extracted at the point of storage. Even causal data (e.g., the kinds of activity information we convert into provenance data to represent causal relationships) can’t easily be replicated at the storage engine. There’s another subtle point here to consider: if I’m willing to pay the cost of producing metadata it seems intuitively obvious that it is probably worth storing the results of that metadata. For example, I find that I often end up doing repetitive searches – this past week, working on a project completely unrelated to my research, I found myself repeatedly doing searches over the same data set using the same or similar terms. For example, if I want to find files that have both the term “customer” and “order” in them and then repeat that with “customer” and “device_id” I have to do complex compound searches that can take 5-10 minutes to produce. I suspect this can be made more efficient (though I don’t know if this is really a useful test case – I just keep wondering how I could support this sort of functionality, which would enable us to figure out if it is useful.)

So, back to producing metadata. Another cost to consider is the cost to fetch the data. For example, if I want to compute the checksum of a file, it is probably most efficient to do so when it is in the memory of the original device creating it or possibly on the device where it is stored (e.g., a remote storage server.) Even if it is the same cost I need to keep in mind that I will be using devices that don’t compute the same checksum. That lack of service uniformity helps me better understand the actual cost: if the storage device does not support the generation of the metadata that I want then my cost rises dramatically because now I have to pull the data back from the storage server so I can compute the checksum I want to use. Thus, I think what drives this question is where we store that metadata, which is leading to my next rambling thought process in the next section.

In the case where the metadata is being provided externally, I probably don’t care where it is produced – that’s their problem. So, for the metaverse data storage challenge I really need to focus more on where I am storing the metadata rather than where it is generated (at least for now.)

Medata Storage

One question I’ve been handwaving is the “where do you store the metadata?” I started thinking about this because the real answer is ugly. Some of that metadata will be stored on the underlying storage, e.g., a file system is going to store timestamps and length information in some form regardless of specific issues like time epochs. However, as I was mulling over some of the issues involved in object management needs for metaverse platforms (ugh, a tongue-twister with the “metaverse” buzzword) I realized that one of the challenges described to me (namely the cost associated with fetching data) is really important to me as well:

To be useful, this metadata needs to be present everywhere it is analyzed – it is impractical for us to be fetching data across the network if we want this to have decent performance. I can certainly handwave some of this away (“oh, we’ll just use eventually consistent replication of the metadata”) but I don’t expect that’s terribly realistic to add to a prototype system. What probably does make sense is to think that this will be stored on a system that is “close to” the sources that generate the metadata. It might be possible to construct a cloud-based metadata service, but that case has additional considerations that I’m mulling over (and plan on capturing in a future blog post – this one is already too long!) Thus, I suspect that this is a restricted implementation of the replication problem.
Metadata does not need to be close to the data. In fact, one of the interesting advantages of having the metadata close to where it is needed is that it helps overcome a major challenge in using distributed storage: the farther away the data storage is from the data consumer, the higher the cost of fetching that data. In turn, the benefits of having more metadata is that it helps improve the efficiency of fetching data, since fetching data that we don’t need is wasteful. In other words, a cost benefit associated with having more metadata is that we can work to minimize unnecessary data fetching. Indeed, this could be a solid metric for determining the efficiency of metadata and search algorithms that use the metadata: the “false fetch rate.” The benefits of this are definitely related to the cost of retrieving data. Imagine (for example) that you are looking through data that is expensive to retrieve, such as Azure Cold Blob Storage or Amazon Glacial Storage. The reason that people use these slow storage services is that they are extremely cost efficient: this is data that is unlikely to be needed. While this is an extreme example, it also makes it easier to understand why additional metadata is broadly beneficial, since any fetch of data from a remote system is that is not useful is a complete waste of resources. Again, my inspiration here was the discussion with Royal about multiple different instantiations of the same object that appear in the metaverse. I will touch on this when I get into that metaverse conversation. For now, I note that these instantiations of a single digital object might be stored in different locations. The choice of a specific instance of this is typically bounded by several costs involved, including the fetch cost (latency + bandwidth) and any transformation costs (e.g., CPU cost.) This becomes quite interesting in mobile networks where the network could impose surge pricing as well and there are capacity limitations combined with the hard requirements that these objects need to be available for use quickly (another aspect of cost.)

My sense is there is probably more to say here, but I captured some key ideas and I will consider how to build on this in the future.

Metaverse Data Needs

That conversation with Royal was quite interesting. I’ve known him for more than a decade and some of what I learned from him about the specialized needs of the game industry led me to question things that I learned from decades of building storage systems. That background in game development has positioned him to point out that many of the challenges in metaverse construction have already been addressed in the game development area. One interesting aspect of this is in the world of “asset management.” An asset in a game is pretty much anything that the game uses to create the game world. Similarly, a metaverse must also combine assets to permit 3D scaling as it renders the world for each participant of that world. He explained to me by way of example, that one type of graphical object is often computed at different resolutions. While it is possible for our devices to scale these, the size of the objects and the computational cost of scaling is high. In addition, the cost of fetching these objects can be high as well; he was telling me that you might need 200 objects in order to render the current state of the world for an individual user. If their average size is 60MB it becomes easy to see how this is not terribly practical. In fact, what is usually required are a few of these very high-resolution graphical objects and lower resolution versions of the others. For example, objects that are “far away in the distance” need not have the same resolution. While he didn’t point it out, I know that I have seen games where sometimes objects have low resolution and are later repainted with higher resolution images. I am now wondering if I saw this exact type of behavior already being practiced.

Let’s combine this with the need to distribute these objects broadly and to realize there is a high degree of locality involved. Metaverse participants interacting with each other in a 5G or 6G network are likely to be accessing many of the same objects. Thus, we are likely to see a high degree of correlation across edge nodes within the mobile network. Similarly, it moves to a very distributed storage model, where data objects are not necessarily being retrieved from a central storage server but rather edge storage servers or even peer clients. One benefit of using strong checksums is that it allows easy to verify replication in untrusted networks – something like bittorrent or even IPFS do with their own checksums. As long as the checksum comes from a trusted source, the data retrieved can be verified.

In this case the metadata would correspond to something very different than I’d been considering:

An identifier of the object itself
A list of one or more specific instances of that objects with a set of properties
A list of where each of these instances might be stored (I’m choosing to use an optimistic list here because the reality is sources will appear and disappear.)

Independent of this would be information about the constraints involved: the deadline required for receiving the data to be timely, the cost for retrieving the various versions, etc. With this information both the edge and end devices can make decisions: which versions to fetch and from where as well as placement, caching, and pre-fetching decisions. All of these are challenging and none of them are new so I’m not going to dive in further. What is new is the idea that we could embed the necessary metadata within a more general-purpose metadata management system overlaying disparate storage systems. This is a fairly specialized need, but it is also one Royal observed needs to be solved.

Oh, one final number that sticks out in my mind: Royal told me that a single asset could consist of around 200 different versions, including different resolutions and different formats required by the various devices. I was quite surprised at this, but it also helped me understand the magnitude of the problem.

While I have considered versioning as a desirable feature, I had never considered parallel versions quite like this. Having these kinds of conversations helps me better understand new perspectives and broaden my own thinking.

I left that conversation knowing that I had just barely started to wrap my head around the specific needs of this area. I capture those thoughts here in hopes I can foster further thought about them, including more conversations with others.

Storage Vendors

A couple weeks ago we had a guest speaker from a storage vendor talking about his thoughts along the future for his company and their products. There were specific aspects of that talk that really stood out to me:

Much of what he talked about was inward focused. In other words, it was about the need for better semantic understanding. I realized that the ideas on which I’m working – of using extrinsic information to find relationships between files was not even on his horizon, yet could be very beneficial to him – or to any large storage vendor.
He acknowledged many of the challenges that are arising as the sheer volume of storage continues to grow. Indeed, each time I think about this I remember that for all the emphasis on fast access storage (e.g., NVRAM and SSDs) the slower storage tiers continue to expand as well: hard disks now play more of an archival role. Microsoft Research’s Holographic Storage Device, for example, offers a potential higher capacity device for data center use. Libraries of recordable optical storage or even high capacity linear tape also exist and are used to keep vast amounts of data.
During that time I’d been also thinking about how to protect sensitive information from being exploited or mined. In other words, as a user of these services, how can I store data and/or metadata with them that doesn’t divulge information. After the talk I realized that the approach I’d been considering (basically providing labels the meaning of which requires a separate decoder ring) could be quite useful to a storage vendor: such sanitized information could still be used to better understand the relationships – ML driven pattern recognition (e.g., clustering) without requiring that the storage vendor understand what those patterns mean. Even providing that information to the end user could minimize the amount of extra data being fetched which in turn would improve the use of their own storage products. Again, I don’t think this is fully fleshed out, but it does seem to provide some argument for storage vendors to consider supporting enhanced metadata services.

I admit, I like the idea of enabling storage vendors to provide optimization services that do not require they understand the innards of the data itself. This would allow customers with highly sensitive data to store it in a public cloud service (for example) in fully encrypted form and still provide indexing information for it. The “secret decoder rings” can be maintained by the data owner yet the storage vendor can provide useful value-added services at enterprise scale. Why? Because, as I noted earlier, the right place to store metadata is as close as possible to the place where it is consumed. At enterprise scale, that would logically be someplace that is accessible throughout the enterprise.

At this point I realized that our propensity to store the metadata with the data really does not make sense when we think of multiple storage silos – it’s the wrong location. Separating the metadata service, placing it close to where the metadata is being absorbed, and using strategically located agents for generating the various types of metadata, including activity context and semantic information, all make sense because the owner of that data is really “closest” to where that metadata is used. A “file system” that maintains no metadata is really little more than a key-value store, as the metadata server can be maintained separately. Of course, that potentially creates other issues (e.g., space reuse.) I don’t think I need to solve such issues because in the end that consideration is not important at this point in my own research.

A Comparison of Two Network-Based File Servers

July 11, 2019

A Comparison of Two Network-Based File Servers
James G. Mitchell and Jeremy Dion, in Communications of the ACM, April 1982, Volume 25, Number 4.

PAir of File Servers

I previously described the Cambridge File Server (CFS). In this 1981 SOSP paper the inner details of it and the Xerox Distributed File System (XDFS) are compared. This paper provides an interesting insight into the inner workings of these file servers.

Of course, the scale and scope of a file server in 1982 was vastly smaller than the scale and scope of file servers today. In 1982 the disk drives used for their file servers were as large as 300MB.

SD Cards

This stands in stark contract to the sheer size of modern SD cards; I think of them as slow but compared to the disk drives of that era they are quite a bit faster not to mention smaller. I suspect the authors of this paper might be rather surprised at how the scale has changed, yet many of the basic considerations they were making back in the early 1980s are still important today.

Access Control (Security) – CFS was, of course, a capability based system. XDFS was an identity based system; most systems today are identity based systems, though we find aspects of both in use.
Storage Management – the interesting challenge here is how to ensure that storage is not wasted. The naive model is to shift responsibility for proper cleanup to the clients. Of course, the reality is that this is not a good model; even in the simple case of a client that crashes, it is unlikely the client will robustly ensure that space is reclaimed in such circumstances. CFS handles this using a graph file system and performing garbage collection in which an unreachable node is deemed subject to reclamation. XDFS uses the more naive model, but mitigates this by providing a directory service that can handle proper cleanup for clients – thus clients can “do it right” with minimal fuss, but are not constrained to do so.
Data Consistency – the authors point to the need to have some form of transactional update model. They observe that both CFS and XDFS offer atomic transactions; this represents the strong semantic end of the design spectrum for network file servers and we will observe that one of the most successful designs (Sun’s NFS) went to a much weaker end of the design spectrum. Some of this likely reflects the database background of the authors.
Network Protocols – I enjoyed this section, since this is very early networking, with CFS using the predecessor of token ring and XDFS using the 3Mb/s version of Ethernet. They discuss the issues inherent in the network communcations: flow and error control (so message exchange and exception/error handling) and how the two respective systems handle them

The authors also compare details of the implementation:

They describe a scheme in CFS in which small files use a direct block, and larger files use indirect blocks (blocks of pointers to direct blocks). This means that small files are faster. It is similar to the model that we see in other (later) file systems, while XDFS uses binary tree, used to track allocation of blocks to files, and a bitmap, used to indicate free/used space information.
They discuss redundancy, with an eye towards handling (partial) disk failures. Like any physical device, the disk drives of that era did wear out and fail.
They discuss their transaction log and how each system guaranteed consistency: they both use shadow pages, but their implementation of them is different. Ultimately, they both have similar issues, and similar impact. Shadow pages are a technique that we still use.

The evaluation is interesting: it is not so much a measure of performance but rather insights into the strengths and weaknesses of each approach. For XDFS they note that their transaction support has been successful and it permits database transactions (in essence, XDFS becomes a form of simple database service). They point to the lack of support for both normal and special files; from their description a special file is one with guaranteed write semantics. They also observe that ownership of files is easily lost, which in turn leads to inefficient storage utilization. They observe that it is not clear if the B-tree is win or lose of XDFS.

For CFS they point to the performance requirements as being a strength, though it sounds more like a design constraint that forced the CFS developers to make “hard choices” to optimize for performance. Similarly, they observe that the directed graph model of CFS is successful and capabilities are simple to implement. Interestingly, they also point to the index as well as string of names and access rights as being a success point. They also point to the fact that CFS generalizes well (“[t]wo quite different filling systems built in this way coexist on the CFS storage.”) They also point to automatic garbage collection as being a net win for CFS, though they also point out that CFS uses a reference count in addition to the garbage collection model. They list the CFS limitation of transactions to a single file or index as being one of its shortcomings and point to real-world experience porting other operating systems to use CFS as an indicator of the cost of this limitation. Interestingly, the limitation they point to (“… since file directories are implemented as an index with an associated file, it is currently impossible to update both structures in a single transaction.”) They conclude by arguing that XDFS has a better data layout, arguing that XDFS’s strategy of page allocation and intention logging is ultimately better than CFS’s cylinder maps: “… the redundancy function of cylinder maps does not seem to be as successful as those of page allocation and intentions logging; the program to reconstruct a corrupted block is not trivial.”

Ensuring correct recovery in a transactional system certainly challenging in my experience, so I can understand the authors’ concerns about simplicity and scalability.

Overall, it is an interesting read as I can see may of the issues described here as being file systems issues; many of the techniques they describe in this paper show up in subsequent file systems. The distinction between file system and file server also becomes more clearly separated in future work.

The Cambridge File Server

March 20, 2018

The Cambridge File Server
Jeremy Dixon, in ACM SIGOPS Operating Systems Review, Volume 14, Number 4, pp 26-35, 1980, ACM.

Cambridge was certainly a hotbed of systems work in the 1970s (not to say that it still is not). They were looking at very different architectures and approaches to problems than we saw from the various Multics influenced systems.

The introduction to this paper is a testament to the vibrant research work being done here. They author points to the Cambridge ring, which was their mechanism for implementing a shared computer network and a precursor to the Token Ring networks that followed. The CAP computer was part of this network, and the network included a separate computer that had a vast amount of storage for the time – 150MB. That common storage was used for both “filing systems” as well as “virtual memory”. This computer ran the Cambridge File Server and implemented the functionality that was explored in the WFS paper.

They identify key characteristics of their file server:

Substantial crash resistance.
Capabilities used to control access.
Atomic file updates.
Automatic “garbage collection” of storage space
Fast transfer to random accessed, word-addressable files.

The authors make a point of noting there are only two classes of objects in their system: files and indices. I found this interesting because it echos the hierarchical file systems models that encouraged me to start this journey in the first place.

They define a file: “… a random access sequence of 16-bit words whose contents can be read or written by client machines using the following operations”. The operations that follow are read and write. They go on to define an index: “… a list of unique identifiers, and is analogous to a C-list in capability machines”. The three operations here are: preserve, retrieve, and delete. This permits entries to be added, found, and removed.

The storage controlled by the file server thus appears to its clients as a directed graph whose nodes are files and indices. Each file or index operation is authorised by quoting the object’s unique identifier to the file server, and UIDs are 64 bits long with 32 random bits. Each client, therefore, can access only some of the nodes in the graph at any time, namely those whose UIDs he knows, an dthose whose UIDs can be retrieved from accessible indices.

Thus, they actually have a graph file system that may in fact consist of nodes that are not connected – essentially a pool of disconnected trees that can be traversed if you know how to find the tree, but is effectively hidden otherwise. They do point out that the sparse space may not be sufficient protection (though I suspect a small finite delay on an invalid lookup with discourage brute force browsing).

Objects are deleted when they cannot be found from some distinguished root index; the paper describes that each client is given its own entry in the root index, pointing to the client specific index. There is the implication that they will scan the storage looking for such unreferenced objects that can be cleaned up and indeed they refer to a companion paper for a detailed description of this garbage collector.

Their argument for this omission is that it relieves the client of the burden of managing object lifetimes (“… removes from the clients the burden of deciding when to delete an object…”)

Storage space is segregated into “data” and “map” blocks. The data blocks contain object contents. The map blocks contain meta-data. New files are stored as a single data block. As the file grows in size, map blocks are inserted to create a tree of up to three levels deep.

The paper then turns its attention to the atomic nature of the updates to the file server. The author points out that moving from consistent state to consistent state may require multiple distinct changes. Since failures can interrupt you between any two operations, the discussion revolves around ways in which this can be robustly implemented in atomic and recoverable fashion. The author points out that the overhead in protecting against this class of failures has substantial overhead. Given that not all files require this level of robustness, he proposes that the file server provide two separate classes of service for data files. Map blocks are maintained in consistent fashion because they have the file server’s meta-data within them and the consistency of the file server’s control information needs to be preserved.

Much of the detail in the paper at that point involves describing the structure of the meta data and how it is used to implement atomic operations on the file server. The paper provides a detailed description of how transactions are implemented within this system. The fact they describe implementing a complete transactional file system, discuss the ramifications of providing user level transactional data storage, and come up with a hybrid model does make this an impressive piece of early work. We will see journaling file systems more than once as we move forward.

The balance of the paper discusses how this has worked within their systems at Cambridge. It is interesting and they tie some of the implementation efficiency to the environment of the Cambridge Ring itself. This is a production file server and the author notes that it is used by a variety of computers (including different operating systems) within their environment successfully.

Its relatively quick response has allowed it to be used to record and play back digitised speech in real time. The interface provided seems both simple and suitable for a variety of purposes.

Impressive indeed.

WFS: A Simple Shared File System for a Distributed Environment

March 13, 2018

WFS: A Simple Shared File System for a Distributed Environment
Daniel Swinehart, Gene McDaniel, and David Boggs, in Proceedings of the Seventh ACM Symposium on Operating Systems Principles, pp. 9-17, 1979, ACM.

This file system was developed at Xerox’s Palo Alto Research Center (PARC), which produced a string of amazing advances in the nascent computer technology area in the 1970s.

Woodstock was “an early office system prototype”. The authors’ description of Woodstock sound much like early word processing systems, such as those pioneered by Wang Laboratories in the same time frame. The ability to share data between these systems turns out to be surprisingly important. Local storage space was used to track the current work, but then centralized storage provides an efficient way to store them and make the work available to others.

This is the environment that gave rise to WFS. Because Woostock already existed and provided its own hierarchical document directory structure, WFS did not need to provide such a mechanism. In fact, WFS only provided four classes of operations:

I/O operations to read and write blocks of data within files
Creating/Destroying resources: file identifiers (FIDs) and storage blocks (pages)
Managing file properties, including page allocation data
Providing maintenance functions

The actual implementation is surprisingly simple. Indeed the authors’ state that it took two months to build it.

Figure 1 (from the original paper) describes the format of a request/response packet, showing the basic information exchange model. It is interesting to note that the entire message fits within a small amount of memory and includes an end-to-end checksum.

There are a number of simplifying options with WFS:

The namespace for files is flat; there is no hierarchical structure.
The file structure is simple (Figure 2).
The protocol is stateless and each operation is idempotent. This simplifies error handling since a lost message can be re-transmitted safely, with no fear that repeating it will cause problems.
Operations are client initiated. The server never initiates an operation.
Clients have limited mutable state. The server does not permit changing its own state directly from the client.

This simiplicity does limit the generality of WFS, but it also demonstrates an important abstraction that we will see used (and re-used) in subsequent systems: a file can be treated as a block structured device (a “disk”) in an interesting and transparent fashion.

Figure 3 describes the layout of the (stateless) data exchange format used by WFS.

Figure 4 shows the layout of the file directory table which is a contiguous and fixed-size region on disk at a known disk location. This is a fairly common characteristic of on-disk file system formats, having a known location where meta-data is to be found.

Note that Figure 4 also shows how the file’s allocated storage is described via direct and indirect block references organized into a tree structure. Again, this will be a recurring model that occurs in file systems; it combines the flexibility of supporting efficient space utilization, ability to describe variable sized files, and efficient utilization of block-addressable storage.

This simple mechanism permits their clients to utilize a flexible storage mechanism without forcing the file server to support any of the mechanisms the client already provides, such as the hierarchical document name space, management of documents and their structure, etc. This separation of concerns yields an elegant and simple implementation model for their file server.

There are some interesting implementation details described in the paper:

Write operations are validated by reading the data page. Thus, writes become compare and swap operations that prevents concurrent access from inadvertently overwriting changes made by another client. It would be rather inefficient to rely upon this mechanism, but it helps prevent out-of-order packet processing in an unreliable network. The downside to this is they must read the data before they can write it.
They use a write-through cache. Thus, the cache is really for read efficiency, not write efficiency. This should help mitigate the write inefficiency.
Most of their caching is done against meta-data pages (“auxiliary disk pages”) because they are more frequently accessed than client data pages.

Here’s one of the interesting performance results: “In the single-user (lightly loaded) case, WFS improved Woodstock’s average input response time over the local disk’s time for several reasons: WFS’s disks were faster than Woodstock’s local disks, requested pages were sometimes still in the WFS main memory cache, and the amount of arm motion on the local disk was reduced because it no longer had to seek between a code swap-area and the user data area.”

Accessing data over the network was faster than the local disk drive! Whether this is a statement of how slow disks were versus networks I leave as an exercise to the reader. One thing we can take away from this: the network often does not impose a significant bottleneck to utilizing remote storage (except, of course, when it does.)

The authors’ follow up their implementation description with an explanation of their design philosophy. They emphasize the atomic nature of the operations they support, as well as the following properties:

Client initiated operations can only access one data page and “a few” auxiliary disk pages.
Operations are persistent before WFS returns status to the client.
WFS commands are a single internet packet.
The WFS protocol is stateless.

They then explain the rationale for these decisions, which relate to simplifying the protocol and server side implementation.

They delve into how clients might use WFS in Section 4. One explicit take-away here is that they view these “files” as acting like “virtual disks” and this permits the WFS clients to implement their own abstraction on top of the WFS-provided services. Because WFS doesn’t assume any specific structure for the client data, there is no burden placed upon those client implementations – though they admit at one point that this complicates the client.

The authors are able to point to other systems that utlize WFS besides Woodstock. They cite to Paxton’s system (A Client-Based Transaction System to Maintain Data Integrity) as being based upon WFS.

The paper discusses security and privacy considerations, admitting their system does not address these issues and suggests various techniques to addressing security using encryption and capabilities. They round out this section of the paper by discussing other possible enhancements to WFS.

In the end, they provided a simple model for a network file server that permitted a client to implement a range of solutions. As we begin looking at more network file systems, we will see this model extended in various way.

A Universal File Server

March 12, 2018

A Universal File Server
A. D. Birrell and R. M. Needham, in IEEE Transactions on Software Engineering, Vol SE-6, No. 5, September 1980, pp. 450-453.

One of the challenges in this next group of papers is picking which ones to discuss. The advent of networks saw the blossoming of the idea of centralizing storage and having different computer systems accessing it via those networks. By the time this paper is published quite a few network based file server solutions had been constructed and described within the literature – and we will get to them.

The authors here decided to try and extract generality from these works. So in this paper we step back and look for some generality.

This is a rather short paper – four pages.

The authors describe the division of responsibilities in a file server: the “high-level functions more properly associated with a filing system” and “functions belonging to a backing store server” [emphasis in the original]. When I read this I thought that this made sense: we have a functional layer that creates a name space, attributes, etc. and a storage layer that keeps track of storage blocks.

By splitting out this functionality, the authors then suggest that the backing store server is a point of commonality that can be used to support a range of higher level services. Thus, “[t]he backing store server is the sole agency concerned with allocating and relinquishing space on the storage medium.” To achieve this goal the authors propose a universal system of indexes as shown in Figure 1 (from the original paper).

The authors argue for a master table that presents the per-system name space. For each of these namespaces, there is a corresponding master file directory (MFD) and a collection of user file directories (UFDs) that are used to organize the user’s information into a hierarchy.

We note that the files, UFDs and MFD are all stored in storage elements – segments – that are managed by the file server. Thus the file server is responsible for:

Keeping track of its initial index
Preserve the names stored in the MFD and UFDs
Reclaim (delete) the entries in the MFD and UFDs when an entry is deleted
Manage storage space

From this simple model, they note that a broad range of systems can be constructed.

The paper spends considerable (25% of the paper) time discussing “protection”. By this they refer to the issues inherent in having shared usage of a common resource, such as the files on the file server. The authors describe using ACLs on the file server as one means of providing security. They do not touch upon precisely how the file system will authenticate the users, though at one point they refer to using encryption for access bits in some circumstances.

Their preferred mechanism for access is the capability. This should not come as a surprise, given that they worked on the CAP file system, which provided capabilities. Their observation is that with a sufficiently sparse handle space, it is impractical for an unauthorized party to find the resource. It probably doesn’t require much to point out that this presumes the inherent integrity of the network itself.

The authors complete their universal file server with an observation that this provides a general base upon which individual file systems can implement their own enhanced functionality. Indeed, this was one of their primary objectives in doing this work. They do point out a number of potential issues in their system, but assert that they will not be problematic.

The authors do a good job of describing a basic, abstract file server. The system they describe may not have achieved broad use but this paper does provide a simple, high level view of how a file server might operate. We’ll turn our attention to actual implementations – and there are many such implementations to discuss in the coming posts.

Implementing Atomic Actions on Decentralized Data

February 25, 2018

Implementing Atomic Actions on Decentralized Data
David P. Reed, Transactions on Computer Systems, Vol 1. No. 1, February 1983, pp. 3-23.

This certainly must have been an interesting choice to be the first paper of the first ACM Transactions on Computer Systems. It is certainly an interesting work on concurrent activity within a distributed system. It relies upon a basic concept of ordering within a distributed system (“decentralized data”). He builds upon the basics laid down by Leslie Lamport in Time, Clocks, and the Ordering of Events in a Distributed System. While Lamport was concerned about defining the (partial) ordering of events in the distributed system, Reed is concerned about using that ordering to construct useful distributed data updates.

Given that file systems, especially distributed file systems, are concerned with managing data across nodes in a consistent fashion, this work is of particular importance. By 1983 we have seen the emergence of network file systems, which I plan on describing further in coming posts, but they are still fairly primitive. Database systems are further along in allowing distributed data and coordination through things like two-phase commit.

He starts by describing the goals of this work:

The research reported here was begun with the intention of discovering methods for combining programmed actions on data at multiple decentralized computers into coherent actions forming a part of a distributed application program.

His focus is on coordinating concurrent actions across distributed data and ensuring that failures are properly handled. What does it mean to properly handle failures? Essentially, it means that the data is in a consistent state once the system has recovered from the failure. He starts by defining terms that relate to consistency models. For example, he defines an atomic action as being a set of operations that execute in different locations and at different times but cannot be further decomposed. A single action starts with a consistent state at the start and moves to a consistent state at the end. Any intermediate state of the system is not visible (what we would call “isolation” now). He formally defines these concepts as well.

He touches on the idea of consistency, in which one starts with a consistent system and then proves each (atomic) operation yields a consistent state. In my experience this aspect of distributed systems is sometimes skipped, often due to the complexity of doing the work required here. In recent years, formal proof methods have been used to automate some aspects of this. I’m sure I will touch upon it in later posts.

One key benefit of this system of atomic actions is that it makes things simpler for application programmers: in general, they need not deal with unplanned concurrency and failure. Indeed, that is one of the key contributions of this work: the process of reasoning about failure and how to handle it. Indeed, in my experience, handling failure gracefully is one of the substantial challenges inherent in constructing distributed systems. If something can fail, it will.

Achieving atomic action requires the ability to interlock (“synchronization”) against other actors within the system and the ability to gracefully recover from failure cases. The author goes on to describe what his decentralized system looks like: a message passing model (via the network, presumably,) with nodes containing stable storage and the ability to read and write some finite sized storage unit atomically (“blocks”).

One class of failure the author explicitly disclaims: a situation in which the system performs an operation but ends up with a different but valid outcome. This makes sense, as it would be difficult to reason in the face of arbitrary changes each time a given operation were requested. He sets forth a series of objectives for his system:

(1) Maximize node autonomy, while allowing multisite atomic actions
(2) Modular composability of atomic actions.
(3) Support for data-dependent access patterns.
(4) Minimize additional communications.
(5) No critical nodes.
(6) Unilateral aborting of remote requests.

Having laid this groundwork, the author then defines the tools he will use to achieve these objectives. This includes a time-like ordering of events, version information for objects, and the ability to associate intermediate tentative steps together (“possibilities”).

He envisions a versioned object system, where the states of the object correspond to changes made to the object.

At this point I’ll stop and make an observation: one of the challenges for this type of versioning is that the way in which we view objects can greatly complicate things here. For example, if we modify an object in place then this sort of versioning makes sense. However, if we modify an object by creating a new object, writing it, and then replacing the old object with the new object, we have a more complex functional model than might be naively envisioned. This is not an issue clearly addressed by the current paper as it relates mostly to usage. But I wanted to point it out because this sort of behavior will make things more difficult.

One of the important contributions in this work is the discussion about failure recovery. This is, without a doubt, one of the most complex parts of building a distributed system: we must handle partial failures. That is, one node goes offline, one network switch disappears, one data center loses power.

The author thus observes: “If a failure prevents an atomic action from being completed, any WRITE the atomic action had done to share data should be aborted to satisfy the requirement that no intermediate states of atomic actions are visible outside the atomic action. Thus, one benefit of the versioned objects is that the pending transaction (“possibilities”) can track the updated version. Abort simply means that the tentative versions of the objects in the transaction are deleted. Committing means that the tentative versions of the object in the transaction are promoted to being the latest version.

Thus, we see the basic flow of operations: a transaction is started and a possibility is created. Each potential change is described by a token. Tokens are then added to the possibility. While the term is not here, is appears to be a model for what we refer to as write-ahead logging (sometimes also called intention logging).

Time stamps are introduced in order to provide the partial ordering of events or operations, so that the changes can be reliably reproduced. The author goes into quite an extensive discussion about how the system generates time stamps in a distributed fashion (via a pre-reservation mechanism). This approach ensures that the participants need not communicate in order to properly preserve ordering. The author calls this pseudotime. He continues on to explain how timestamps are generated.

Using his ordered pseudo-time operations, his read and write operations, possibilities, and tokens, he then constructs his distributed data system using these primitives. There is detail about how it was implemented, challenges in doing so and the benefits of immutable versions.

He admits there are serious issues with the implementation of this system: “For most practical systems, our implementation so far suffers from a serious problem. Since all versions of an object are stored forever, the total storage used by the system will increase at a rate proportional to the update traffic in the system. Consequently, we would like to be able to throw away old versions of the objects in the system. We can do this pruning of versions without much additional mechanism, however.” His discussion of why this may not be desirable is interesting as it discusses the tension between reclaiming things and slow transactions. He does describe a mechanism for pruning.

After this, the author turns his attention to using atomic transactions to construct new atomic transactions: nested transactions (also “composable transactions” so that we have multiple terms for the same concept!) Again, this is proposed, but not explored fully.

The breadth and scope of this paper is certainly impressive, as the author has described a mechanism by which distributed transactions can be implemented. I would note there are no evaluations of the system they constructed, so we don’t know how efficient this was, but the model is an important contribution to distributed storage.

A Client-Based Transaction System to Maintain Data Integrity

February 15, 2018

A Client-Based Transaction System to Maintain Data Integrity
William H. Paxton, in Proceedings of the seventh ACM Symposium on Operating Systems Principles, 1979, pp 18-23.

Last time I discussed the basis of consistency and locks. I did so because I thought it would help explain this paper more easily. We now move into the nascent world of sharing data over the network (what today we might call distributed systems). One of the challenges of this world is that it involves coordinating changes to information that involves multiple actors (the client and server) that has some sort of consistency requirement. The authors here use the term integrity (rather than consistency) but I would suggest the terms are effectively the same for our purposes:

Integrity is a property of a total collection of data. It cannot be maintained simply by using reliable primitives for reading and writing single units — the relations between the units are important also.

This latter point is important. Even given atomic primitives, any time we need to update related state, we need to have some mechanism beyond a simple read or write mechanism.

The definitions offered by the author is highly consistent with what we saw previously:

The consistency property: although many clients may be performing transactions simultaneously, each will get a consistent view of the shared data as if the transactions were being executed one at a time.

The atomic property: for each transaction, either all the writes will be executed or none of them will, independent of crashes in servers or clients.

Prior work had focused on having the server handle these issue. This paper describes how the client can accomplish this task.

There were some surprising gems here. For example:

If a locked file is not accessed for a period of time, the server automatically releases the lock so that a crashed client will not leave files permanently unavailable.

We will see this technique used a number of times in the future and it is a standard technique for handling client-side locks in network file systems. Thus, one novelty to the locks presented here is that they have a bounded lifetime. Thus, one of the services required from the server is support for this style of locking.

The author then goes on to propose the use of an intention log (“intention file” in the paper). The idea is that the client computer locks files, computes the set of changes, records them in the intention log, and then commits the changes. Then the actual changes are applied.

To achieve this, it sets out six operations:

Begin Transaction – this is what reserves the intention file. Note that the author’s model only permits a single transaction at a time. Note that the other operations fail if a transaction has not been started.
Open – this is what opens the actual data file. At this point the file header is read. If that header indicates the file is being used for a transaction, the file is recovered before the open can proceed.
Read – as it suggests, this reads data from the file.
Write – this prepares to write data to the file. Since this is a mutation, the new data must be recorded in the the intention file at this point. The actual file is not modified at this point.
Abort Transaction – this causes all files to be unlocked and closed. No changes are applied. The transaction becomes “completed”.
End Transaction – this is how a transaction is recorded and applied. The paper describes how this is handled in some detail. In all successful cases, the changes are applied to the server.

The balance of the paper then describes how crash recovery works. The simplifying assumptions help considerably here – this system only permits a single transaction to proceed at a time. Then the author moves on to a “sketch of correctness proof”. I admit I did find that humorous as the attempt at formalism without actually invoking formalism achieves anything other than warm-fuzzy feelings.

He wraps up by discussing some complex cases, such as file creation and the challenge of “cleaning up” partially created file state. Having dealt with this issue over the years, I can assure you that it can be complex to get it implemented correctly. He also discusses that some files might have multi-block meta-data headers and explains how those should be handled as well. He concludes with a discussion about handling media failure conditions, which are a class of failures that this system does not claim to protect against.

A Principle for Resilient Sharing of Distributed Resources

January 30, 2018

A Principle for Resilient Sharing of Distributed Resources
Peter A. Alsberg and John D. Day, In Proceedings of the 2nd international conference on Software engineering, pp. 562-570. IEEE Computer Society Press, 1976.

Today I turn my attention to a paper that begins to explore the issues surrounding distributed systems. This paper sets forth some basic principles that should be considered as part of distributed systems that are worth capturing and pointing out.

They state that a “resilient server” must have four attributes:

It must be able to detect and recover from some finite number of errors.
It must be reliable enough that a user doesn’t need to worry about the possibility of service failure.
Failure just beyond the finite limit are not catastrophic.
One user’s abuse should not have material impact on any other user.

These all seem somewhat reasonable, though I must admit that I found (3) a bit surprising, as it is not uncommon for some classes of failures to cause catastrophic failure. For example, when using erasure coding it is common for some number of failures to lead to catastrophic failure. Upon reflection, I realized that one could achieve this goal by simply setting the finite limit a bit lower, though I suspect this is not entirely what the original author had in mind.

Still, the general concept of resiliency is a good one to capture and consider. The authors point out some of the challenges inherent in this over a network, notably latency. “[A] major consideration when choosing a resource sharing strategy is to reduce, as much as possible, the number of message delays required to effect the sharing of resources.”

In other words, keep the messages to a minimum, try not to make them synchronous. Asynchronous messaging systems will turn out to be rather complex, however, and sometimes there are just operations that require synchronous behavior.

Of course, there has to be a file system tie-in here (however tenuous) and there is! Under examples they list “Network Virtual File Systems which provide directory services and data access services…” Thus, it is clear that the authors are considering network file systems as part of their work.

In 1976 the authors indicate that the cost to send a message is on the order of 100ms, while the cost to acquire a lock on the CPU is 100 microseconds to 1 millisecond. While things are faster now, there is still a large difference between these two paths on modern systems. Thus, we will still be dealing with issues on the same scale.

The bulk of the paper then walks through the description of their model for providing resilient network services – an application host, sends a request to a primary server host; that server host then communicates with a secondary server host. That secondary server host can send backup requests to a tertiary server, and so forth. The secondary host confirms with the primary host, and ultimately it is the secondary host that confirms the operation with the application host.

They cover a variety of important concepts, such as the idea that a host may need to record state about the operations, that operations cannot be applied until the other servers have received their messages, etc. This is, in essence, an early consensus protocol. While not efficient, the authors have laid groundwork in helping us think about how we construct such services.

I have included Figure 3 from the paper above. It shows the message flow through the system. The paper also discusses how to handle a number of failure situations and how messages flowing through the system keep track of which nodes are current.

It also touches on one of the most complex issues in distributed systems: network partition. Intriguingly, the authors do not assert that one partition must remain alive as they describe the decision being related to the specific requirements of the environment. Thus, in their model it would be possible to end up with partitions that continue forward but can no longer be easily re-joined after the network partition is resolved. Of course, they don’t require that both sides of a partition progress, though they do not discuss ways to handle this, either.

They assert that the primary/secondary/backup model is optimal in terms of the number of messages that it sends and in how it ensures the messages are strictly ordered. Then they briefly describe a few other possible arrangements that have multiple primaries and/or secondaries. Their final conclusion is that their original strategy is at least as good as the alternatives though this is not demonstrated in any formal way.

Now that they have their replication system, they view how it will work for the network virtual file system. They conclude that the highest levels of the hierarchy need to be stored on all nodes (which makes them the most expensive to maintain). They partition the names space below that and record location information within the nodes of the name space where the storage has been split out across hosts. Thus, we see a description of a global name space.

Their system is simple, but they identify a number of the important issues. Their suggestion about sharding the file systems name space is one that we shall see again in the future.

They have laid the groundwork for thinking about how to build distributed file systems.

Some Observations about Decentralization of File Systems

January 25, 2018

Some Observations about Decentralization of File Systems.
Jerome H. Saltzer, 1971.

This paper caught my eye because it leads in a different direction than the other file systems papers I’ve been looking at. Instead of talking about file systems on a single computer, it has the audacity of suggesting that maybe we want to have remotely accessible storage.

The author frames this in the context of networks. The first (of two) references in this paper are about networks and help frame the conversation about “decentralized” file systems:

Computer network development to achieve resource sharing
Lawrence G. Roberts and Barry D. Wessler, AFIPS ’70 (Spring) Proceedings of the May 5-7, 1970, spring joint computer conference, pp 543-549.

The authors’ affiliation is the Advance Research Project Agency (ARPA) and what they describe in this paper is the ARPA Network. I don’t want to get too far into this, but I did want to include this wonderful bandwidth/cost chart – something I have definitely seen in various guises since then.

ARPANet Performance — ARPA Network 1970 Performance Cost Comparison

In this time frame, they are discussing the creation of a network for sharing data across geographically dispersed sites, including sites scattered around the United States. Connection speeds in this time frame are 50Kb/s. It estimates the entire bandwidth consumption of the network will be between 200-800Kb/s by mid-1971, with twenty nodes connected to the entire network.

The point of this diagram is to discuss costs, where it points out that the least expensive way to move large quantities of data is to encode it on a magnetic tape and sent it in the mail (air mail, of course.)

Why is this important? Because it helps set the stage for the conversation about resource sharing. This is before Ethernet exists (Version 1.0 of the Ethernet specification appears in 1980). Thus, the networks that do exist are hardware specific. The amount of data being moved is remarkably small by modern standards.

This is, however, where we start considering what is involved in supporting network file systems – decentralized systems of storage that can communicate with one another.

The author’s stake out their position in the abstract:

This short note takes the position that the inherent complexity of a decentralized and a centralized information storage system are by nature essentially the same.

This defines the starting point of what will be a decades long conversation on this fundamental issue. The authors’ argue that in fact the real issue is one of engineering, not models:

The consequence of this claim, if substantiated, is that the technical choice between a centralized or decentralized organization is one of engineering tradeoffs pertaining to maintainability, economics, equipment available, and the problem being solved, rather than one of functional properties or fundamental differences in complexity.

The discussion then points out that in some cases, such as adding a 20-40 millisecond delay on top of the usual 20-50 millisecond disk delay is not dramatically different. They explore other areas where the timing might make a substantial difference. Intriguingly, they discuss a form of disaggregation – where they envision compute being in one location, memory in another, storage in yet another. They point out that this turns back into a computer (albeit one with long latency to access memory, for example.)

They then discuss caching (“buffer memory”) of information but point out the challenge is now that the system has multiple copies of what need to be the same data – it “has the problem of systematic management of multiple copies of information”. Surprisingly they make a leap here equating this situation between centralized and decentralized systems: this is a problem of modeling access patterns to shared information and then invent algorithms for “having the information on the right storage device at the right time”!

With decades of hindsight, this looks surprisingly naive. Indeed, even the authors’ construct a clear path for retreat here: “… this is not to say that there are no differences of significance”.

They conclude by opining that storage is challenging:

The complexity is the inevitable consequence of the objectives of the information storage system: information protection, sharing, privacy, accessibility, reliability, capacity, and so on. What is needed in both cases is a methodical engineering basis for designing an information storage system to specification. On the other hand a decentralized organization offers the potential for both more administrative flexibility, and also more administrative chaos.

What surprised me about this paper is that the issues around sharing data between computers was already being considered at this point. The idea that network file systems and local file systems really should be “more or less” the same is thus planted early. I’m sure we will see divergence on this point as we continue moving forward.

For now, we have a base upon which to build this new direction of “decentralized” file systems.

On the history of File Systems

January 1, 2018

I have made it a goal for 2018 to answer a question several people have asked me: what papers should I read to learn more about file systems. So I’ve decided to attempt to copy a format that I’ve found useful – The Morning Paper. I admit, I am not sure I’ll be able to keep up the frenetic pace of a paper each day that he maintains, but I do know there are plenty of papers to read.

My motivation for doing this is simple: there are quite a few papers on this topic, I certainly haven’t read them all (or even the majority of them) and reading through them, along with my own interpretation of why the paper matters to File Systems will be useful for me. It also gives me a set of information that I can point out to people that ask me “so what should I read to learn more about this topic.”

Why File Systems? For me it’s been largely a quirk of fate. I’ve been working on operating systems for many years now, and file systems happens to be the area in which I’ve spent more time than any other. For something that is conceptually so simple – after all, it just maps files to blocks on your disk, it has considerable complexity. File systems are also the gateway to one of the more challenging parts of the operating system: the bit that stores persistent state. Errors in file systems are often not transient. That makes them challenging. The bar is set high because we don’t want to lose anyone’s data.

File systems are part of the plumbing of the operating system. They are essential to proper operation but when everything is working properly, nobody really notices they are there. Only when things go wrong does anyone notice. So it is within this world that I will delve.

The place to start is to even attempt to define file systems. While you might think this is simple, it turns out to be surprisingly challenging. So I’ll approach it by providing some examples and then delving into what means to be a file system.

Media File Systems

The easiest place to start is in the concept of the media file system. “Media” in this context means any tangible medium on which we can systematically record and/or read persistent information. Examples would include disks, tapes, non-volatile memory, or optical media. Whether or not RAM file systems fall into this category is an interesting question – and it helps demonstrate my point that pinning down this concept is more challenging than one might otherwise think.

Disk drives are typically what most people think of first for this category, though we now often view solid-state disk drives in the same category. File systems reside on top of media devices and keep track of the organization of the data itself. Some file systems can span multiple disk, or use transactional journals, or provide resilience against errors in the media itself. The continuum here is surprisingly broad, but in the end the purpose of the file system is to map from the vagaries of the media to a (mostly) uniform model of behavior so application programs “just work”.

Network File Systems

As we added computers to networks, we found it useful to be able to transfer information stored on one computer to another. This permitted applications to access data, regardless of where it was actually stored. By constructing a file system that uses a remote access protocol, we can present the remote storage device to the local system as if it were just a local file system – or rather, almost the same.

Good examples of this include the Network File System (NFS) originally developed by SUN Microsystems in the 1980s and the Andrew File System (AFS) originally developed by Carnegie-Mellon University in the same time period. These days many people also use the Server Message Block (SMB) based network file systems; the roots of this are also back in the 1980s. All three of these remain in use today, in fact, though they have evolved from the earliest versions.

Key-Value Stores

There is considerable overlap between the world of file systems and databases. Several early file systems were really just single level lookup mechanisms, rather than the hierarchical name space that is so common these days. It is actually common to implement file systems so that hierarchy is added to a flat indexed name space (a “flat” file system). These are in essence one form of key-value store. Some file systems permit addressing by using keys, whether explicitly assigned or implicitly generated. A subsection of the literature focus more on this type of file system and I will make sure to cover several of these.

Name Spaces

Sometimes a file system is more about the namespace than anything else. For example the proc file system does not actually reside on top of any sort of storage. The name space provides a convenient way to find information that is generated on demand. In these cases, it is the name space that really matters, not how the information is stored.

This conflation of storage, presentation, and name-space add to the richness and complexity of work being done in file systems. My goal is to explore these issues, by reviewing the literature.