Category Archives: Media File Systems

What is the Optimal Location for Storing Metadata

November 6, 2022

The past month has included both a very interesting talk from someone at a major storage vendor and an in-depth discussion about my work and how it might be applicable to an issue that confronts the Metaverse community. I haven’t been at the keyboard much (at least not for my research) but I have been mulling this over as I have worked to try and explain these insights. Each iteration helps me refine my mental model by considering what else I have learned. Fortunately, this latest round doesn’t impact the work that I have done, but it has provided me with a model that I think could be useful in explaining this work to others.

I have previously talked about a type of metadata that I call activity context. Of course, there is quite a lot of metadata that is involved in managing storage and I have been using a model in which the metadata I am collecting is not at the point of storage but rather at the point of analysis. In my case, the point of analysis is on (or near) my local device ecosystem. I learned more about the needs of the emerging metaverse field by speaking with my friend Royal O’Brien, who is the general manager for the Open 3D Foundation, which is part of the Linux Foundation. Combining what I learned there with insights I gained from a recent talk given to my research group, I arrived at what I think are some useful observations:

Storage vendors have no mechanism for capturing all the kinds of activity data that I envision using as the basis for activity context.
Some high-performance data consumers need to maintain replicated data and use metadata about that data to make critical decisions.
Metadata needs to be close to where it will be consumed.
Metadata needs to be produced where the information is available and optimally where it is least expensive to do so.

That isn’t a long list, but it is one that requires a bit more unpacking. So I’m going to dive deeper, step by step. This probably isn’t the right order, but I will start here and worry about (re)-organizing it later.

Metadata Production

I had not really considered the depth of the question about where to produce the metadata until I started mulling over the myriad of questions that have arisen recently. The cost of producing metadata can be a critical factor. Agents that extract semantic information about the data (e.g., its content) need to be close to the data. However, it is important to note that this is not the same as “the final location of the data” but rather “a current location of the data.” Yet, even that isn’t quite right: metadata might be extracted from something other than the data, like something from the running system, or even an external source. For example, the activity data that I have been focused on collecting (see System Activity) largely arises on the system where the data itself is accessed. The metaverse model is one where the user has considerable insight (ah, but a bit more on this later) and since I’ve always envisioned an extensible metadata management system it makes sense to permit a specialized application to contribute to the overall body of metadata.

Thus, the insight here is that it makes sense to generate metadata at the “lowest cost” point to do so. For example, the activity data on my local machine can’t be collected by a cloud storage engine. It could be collected by an agent on the local machine and sent to the cloud storage engine, but that runs into a separate cost that I’ll touch on when I describe where we should be storing metadata. Extracting semantic content, by contrast, makes sense to do at the point of production and again at the point of storage. Activity data, which is related to “what else is happening,” can’t be extracted at the point of storage. Even causal data (e.g., the kinds of activity information we convert into provenance data to represent causal relationships) can’t easily be replicated at the storage engine. There’s another subtle point here to consider: if I’m willing to pay the cost of producing metadata, it seems intuitively obvious that it is probably worth storing the resulting metadata. I find that I often end up doing repetitive searches – this past week, working on a project completely unrelated to my research, I found myself repeatedly doing searches over the same data set using the same or similar terms. If I want to find files that contain both “customer” and “order” and then repeat that search with “customer” and “device_id” I have to run complex compound searches that can take 5-10 minutes to complete. I suspect this can be made more efficient (though I don’t know if this is really a useful test case – I just keep wondering how I could support this sort of functionality, which would enable us to figure out if it is useful.)
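
To make the repeated-search case a little more concrete, here is a minimal sketch of what “storing the results” could look like: an inverted index built once and then reused for each compound query. This is not what I actually run; the function names (build_index, find_all) and the sample file contents are hypothetical, and a real index would live on disk and be updated incrementally.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        # Tokenize on letters, digits, and underscores so "device_id" stays whole.
        for term in set(re.findall(r"[a-z0-9_]+", text.lower())):
            index[term].add(name)
    return index


def find_all(index, *terms):
    """Return the names of documents that contain every one of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical sample data standing in for a real data set.
docs = {
    "orders.sql": "SELECT customer, order FROM sales",
    "devices.py": "customer = lookup(device_id)",
    "notes.txt": "remember to call the customer",
}
index = build_index(docs)   # the expensive step, paid once
```

With the index in hand, find_all(index, "customer", "order") and find_all(index, "customer", "device_id") are both set intersections rather than fresh scans of the data.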

So, back to producing metadata. Another cost to consider is the cost to fetch the data. For example, if I want to compute the checksum of a file, it is probably most efficient to do so when it is in the memory of the original device creating it or possibly on the device where it is stored (e.g., a remote storage server.) Even if it is the same cost I need to keep in mind that I will be using devices that don’t compute the same checksum. That lack of service uniformity helps me better understand the actual cost: if the storage device does not support the generation of the metadata that I want then my cost rises dramatically because now I have to pull the data back from the storage server so I can compute the checksum I want to use. Thus, I think what drives this question is where we store that metadata, which is leading to my next rambling thought process in the next section.

In the case where the metadata is being provided externally, I probably don’t care where it is produced – that’s their problem. So, for the metaverse data storage challenge I really need to focus more on where I am storing the metadata rather than where it is generated (at least for now.)

Metadata Storage

One question I’ve been handwaving is the “where do you store the metadata?” I started thinking about this because the real answer is ugly. Some of that metadata will be stored on the underlying storage, e.g., a file system is going to store timestamps and length information in some form regardless of specific issues like time epochs. However, as I was mulling over some of the issues involved in object management needs for metaverse platforms (ugh, a tongue-twister with the “metaverse” buzzword) I realized that one of the challenges described to me (namely the cost associated with fetching data) is really important to me as well:

To be useful, this metadata needs to be present everywhere it is analyzed – it is impractical for us to be fetching data across the network if we want this to have decent performance. I can certainly handwave some of this away (“oh, we’ll just use eventually consistent replication of the metadata”) but I don’t expect that’s terribly realistic to add to a prototype system. What probably does make sense is to think that this will be stored on a system that is “close to” the sources that generate the metadata. It might be possible to construct a cloud-based metadata service, but that case has additional considerations that I’m mulling over (and plan on capturing in a future blog post – this one is already too long!) Thus, I suspect that this is a restricted implementation of the replication problem.
Metadata does not need to be close to the data. In fact, one of the interesting advantages of having the metadata close to where it is needed is that it helps overcome a major challenge in using distributed storage: the farther away the data storage is from the data consumer, the higher the cost of fetching that data. In turn, the benefit of having more metadata is that it helps improve the efficiency of fetching data, since fetching data that we don’t need is wasteful. In other words, more metadata lets us minimize unnecessary data fetching. Indeed, this could be a solid metric for determining the efficiency of metadata and of the search algorithms that use it: the “false fetch rate.” The size of this benefit depends on the cost of retrieving data. Imagine (for example) that you are looking through data that is expensive to retrieve, such as data kept in Azure Cold Blob Storage or Amazon S3 Glacier. The reason that people use these slow storage services is that they are extremely cost efficient for data that is unlikely to be needed. While this is an extreme example, it also makes it easier to understand why additional metadata is broadly beneficial, since any fetch of data from a remote system that turns out not to be useful is a complete waste of resources. Again, my inspiration here was the discussion with Royal about multiple different instantiations of the same object that appear in the metaverse. I will touch on this when I get into that metaverse conversation. For now, I note that these instantiations of a single digital object might be stored in different locations. The choice of a specific instance is typically bounded by several costs, including the fetch cost (latency + bandwidth) and any transformation costs (e.g., CPU cost.)
This becomes quite interesting in mobile networks where the network could impose surge pricing as well and there are capacity limitations combined with the hard requirements that these objects need to be available for use quickly (another aspect of cost.)
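
I have not pinned down a formal definition of the “false fetch rate,” so the following is only one plausible reading: the fraction of fetched objects that turned out not to be needed. The function name and the example identifiers are assumptions made for illustration.

```python
def false_fetch_rate(fetched, useful):
    """Fraction of fetched objects that were not actually needed.

    `fetched` is every object pulled from remote storage; `useful` is the
    subset that the consumer really used. Returns 0.0 when nothing was fetched.
    """
    fetched = set(fetched)
    if not fetched:
        return 0.0
    wasted = fetched - set(useful)
    return len(wasted) / len(fetched)


# Hypothetical run: four objects fetched, two of them actually used.
rate = false_fetch_rate(["a", "b", "c", "d"], ["a", "c"])
```

A weighted variant that multiplies each wasted fetch by its retrieval cost would probably be more informative for cold storage tiers, where a single bad fetch is expensive.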

My sense is there is probably more to say here, but I captured some key ideas and I will consider how to build on this in the future.

Metaverse Data Needs

That conversation with Royal was quite interesting. I’ve known him for more than a decade and some of what I learned from him about the specialized needs of the game industry led me to question things that I learned from decades of building storage systems. That background in game development has positioned him to point out that many of the challenges in metaverse construction have already been addressed in the game development area. One interesting aspect of this is in the world of “asset management.” An asset in a game is pretty much anything that the game uses to create the game world. Similarly, a metaverse must also combine assets to permit 3D scaling as it renders the world for each participant of that world. He explained to me, by way of example, that one type of graphical object is often computed at different resolutions. While it is possible for our devices to scale these, the size of the objects and the computational cost of scaling are high. In addition, the cost of fetching these objects can be high as well; he was telling me that you might need 200 objects in order to render the current state of the world for an individual user. If their average size is 60MB, that is roughly 12GB for a single view of the world, and it becomes easy to see how this is not terribly practical. In fact, what is usually required are a few of these very high-resolution graphical objects and lower resolution versions of the others. For example, objects that are “far away in the distance” need not have the same resolution. While he didn’t point it out, I know that I have seen games where sometimes objects have low resolution and are later repainted with higher resolution images. I am now wondering if I saw this exact type of behavior already being practiced.

Let’s combine this with the need to distribute these objects broadly, and recognize that there is a high degree of locality involved. Metaverse participants interacting with each other in a 5G or 6G network are likely to be accessing many of the same objects. Thus, we are likely to see a high degree of correlation across edge nodes within the mobile network. This also pushes toward a very distributed storage model, where data objects are not necessarily being retrieved from a central storage server but rather from edge storage servers or even peer clients. One benefit of using strong checksums is that they make it easy to verify replicas in untrusted networks, much as BitTorrent and IPFS do with their own checksums. As long as the checksum comes from a trusted source, the data retrieved can be verified.
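
As a small illustration of that last point, here is a sketch of fetching from untrusted sources and accepting only data that matches a checksum obtained from a trusted source. SHA-256 is my assumption for the “strong checksum”; the fetch callables and the helper names are hypothetical stand-ins for edge servers or peers.

```python
import hashlib


def matches(data: bytes, trusted_sha256_hex: str) -> bool:
    """True when the SHA-256 digest of `data` equals the trusted digest."""
    return hashlib.sha256(data).hexdigest() == trusted_sha256_hex


def fetch_verified(sources, trusted_sha256_hex):
    """Try each untrusted source in turn; return the first payload that verifies."""
    for fetch in sources:
        data = fetch()
        if matches(data, trusted_sha256_hex):
            return data
    raise ValueError("no source returned data matching the trusted checksum")


# Hypothetical example: the first peer returns tampered bytes, the second is honest.
original = b"asset bytes"
trusted_digest = hashlib.sha256(original).hexdigest()
peers = [lambda: b"tampered bytes", lambda: original]
verified = fetch_verified(peers, trusted_digest)
```

The point of the design is that only the digest needs to travel over a trusted path; the bulk data can come from wherever is cheapest.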

In this case the metadata would correspond to something very different than I’d been considering:

An identifier of the object itself
A list of one or more specific instances of that object with a set of properties
A list of where each of these instances might be stored (I’m choosing to use an optimistic list here because the reality is sources will appear and disappear.)

Independent of this would be information about the constraints involved: the deadline required for receiving the data to be timely, the cost for retrieving the various versions, etc. With this information both the edge and end devices can make decisions: which versions to fetch and from where as well as placement, caching, and pre-fetching decisions. All of these are challenging and none of them are new so I’m not going to dive in further. What is new is the idea that we could embed the necessary metadata within a more general-purpose metadata management system overlaying disparate storage systems. This is a fairly specialized need, but it is also one Royal observed needs to be solved.
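
To see how those three pieces of metadata and the constraints might fit together, here is a rough sketch of one possible decision rule: pick the highest-resolution instance that some known location can deliver before the deadline. Everything in it is an assumption for illustration, including the class names, the use of resolution as the only property, the bandwidth and latency figures, and the simple latency-plus-transfer cost model.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    resolution: int      # hypothetical property, e.g. longest edge in pixels
    size_bytes: int
    # Optimistic list of (location name, bandwidth in bytes/s, latency in s).
    locations: list = field(default_factory=list)


@dataclass
class ObjectRecord:
    object_id: str
    instances: list


def fetch_time(size_bytes, bandwidth, latency):
    """Simplified fetch cost: fixed latency plus transfer time."""
    return latency + size_bytes / bandwidth


def choose(record, deadline_s):
    """Return (instance, location, seconds) for the best instance that meets the deadline.

    Higher resolution wins; ties go to the faster location. Returns None when
    no instance can arrive in time.
    """
    best = None
    for inst in record.instances:
        for name, bandwidth, latency in inst.locations:
            t = fetch_time(inst.size_bytes, bandwidth, latency)
            if t > deadline_s:
                continue
            if best is None or (inst.resolution, -t) > (best[0].resolution, -best[2]):
                best = (inst, name, t)
    return best


# Hypothetical asset with a 60 MB high-resolution version and a 2 MB low-resolution one.
record = ObjectRecord(
    object_id="tree-17",
    instances=[
        Instance(4096, 60_000_000, [("origin", 10_000_000, 0.1), ("edge", 100_000_000, 0.01)]),
        Instance(512, 2_000_000, [("peer", 20_000_000, 0.02)]),
    ],
)
```

With a one second deadline the sketch picks the high-resolution version from the edge node; tighten the deadline to half a second and it falls back to the low-resolution copy on the peer. Placement, caching, and pre-fetching are deliberately left out.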

Oh, one final number that sticks out in my mind: Royal told me that a single asset could consist of around 200 different versions, including different resolutions and different formats required by the various devices. I was quite surprised at this, but it also helped me understand the magnitude of the problem.

While I have considered versioning as a desirable feature, I had never considered parallel versions quite like this. Having these kinds of conversations helps me better understand new perspectives and broaden my own thinking.

I left that conversation knowing that I had just barely started to wrap my head around the specific needs of this area. I capture those thoughts here in hopes I can foster further thought about them, including more conversations with others.

Storage Vendors

A couple weeks ago we had a guest speaker from a storage vendor talking about his thoughts on the future of his company and its products. There were specific aspects of that talk that really stood out to me:

Much of what he talked about was inward focused. In other words, it was about the need for better semantic understanding. I realized that the ideas on which I’m working, using extrinsic information to find relationships between files, were not even on his horizon, yet could be very beneficial to him – or to any large storage vendor.
He acknowledged many of the challenges that are arising as the sheer volume of storage continues to grow. Indeed, each time I think about this I remember that for all the emphasis on fast access storage (e.g., NVRAM and SSDs) the slower storage tiers continue to expand as well: hard disks now play more of an archival role. Microsoft Research’s Holographic Storage Device, for example, offers a potential higher capacity device for data center use. Libraries of recordable optical storage or even high capacity linear tape also exist and are used to keep vast amounts of data.
During that time I’d also been thinking about how to protect sensitive information from being exploited or mined. In other words, as a user of these services, how can I store data and/or metadata with them without divulging information? After the talk I realized that the approach I’d been considering (basically providing labels the meaning of which requires a separate decoder ring) could be quite useful to a storage vendor: such sanitized information could still be used to better understand the relationships, for example through ML-driven pattern recognition (e.g., clustering), without requiring that the storage vendor understand what those patterns mean. Even providing that information to the end user could minimize the amount of extra data being fetched, which in turn would improve the use of their own storage products. Again, I don’t think this is fully fleshed out, but it does seem to provide some argument for storage vendors to consider supporting enhanced metadata services.

I admit, I like the idea of enabling storage vendors to provide optimization services that do not require they understand the innards of the data itself. This would allow customers with highly sensitive data to store it in a public cloud service (for example) in fully encrypted form and still provide indexing information for it. The “secret decoder rings” can be maintained by the data owner yet the storage vendor can provide useful value-added services at enterprise scale. Why? Because, as I noted earlier, the right place to store metadata is as close as possible to the place where it is consumed. At enterprise scale, that would logically be someplace that is accessible throughout the enterprise.
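
One way the “secret decoder ring” idea could be realized is with keyed hashes: the data owner turns each meaningful label into an opaque token using a secret key, and the storage vendor groups objects by token without learning what the token means. This is a sketch under my own assumptions (HMAC-SHA256 as the keyed hash, made-up labels and blob names), not a worked-out design.

```python
import hashlib
import hmac
from collections import defaultdict


def opaque_label(secret_key: bytes, label: str) -> str:
    """Owner side: derive an opaque token from a meaningful label."""
    return hmac.new(secret_key, label.encode("utf-8"), hashlib.sha256).hexdigest()


def group_by_token(pairs):
    """Vendor side: cluster blobs that share a token, with no idea what it means."""
    groups = defaultdict(set)
    for blob_id, token in pairs:
        groups[token].add(blob_id)
    return dict(groups)


# Hypothetical owner data: the key never leaves the owner.
key = b"owner-held secret"
labelled = [("blob-1", "project-apollo"), ("blob-2", "payroll"), ("blob-3", "project-apollo")]
uploaded = [(blob_id, opaque_label(key, label)) for blob_id, label in labelled]
groups = group_by_token(uploaded)
```

The vendor can see that blob-1 and blob-3 belong together, which is enough to drive co-location or pre-fetching. Deterministic tokens do still reveal which objects share a label and how common each label is, so this is a trade-off rather than full secrecy.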

At this point I realized that our propensity to store the metadata with the data really does not make sense when we think of multiple storage silos – it’s the wrong location. Separating the metadata service, placing it close to where the metadata is being absorbed, and using strategically located agents for generating the various types of metadata, including activity context and semantic information, all make sense because the owner of that data is really “closest” to where that metadata is used. A “file system” that maintains no metadata is really little more than a key-value store, as the metadata server can be maintained separately. Of course, that potentially creates other issues (e.g., space reuse.) I don’t think I need to solve such issues because in the end that consideration is not important at this point in my own research.

MRAMFS: A compressing file system for non-volatile RAM

April 20, 2018

MRAMFS: A compressing file system for non-volatile RAM
Nathan K. Edel, Deepa Tuteja, Ethan L. Miller, and Scott A. Brandt in Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2004), Volendam, Netherlands, October 2004.

This paper lets me cover a file systems paper and also look at an interesting approach to byte-addressable non-volatile memory (NVM).

We have developed a prototype in-memory file system which utilizes data compression on inodes, and which has preliminary support for compression of file blocks. Our file system, mramfs, is also based on data structures tuned for storage efficiency in non-volatile memory.

One of the interesting aspects of NVM is that it has characteristics of storage (persistence) and memory (byte-addressability). Storage people are used to having vast amounts of time to do things: it is quite difficult, though not impossible, for any computation on the data to become a significant factor once it is combined with the overhead of I/O latency to disk drives. In-memory algorithms worry about optimal cache line usage and efficient usage of the processor, but they don’t need to worry about what happens when the power goes off.

Bringing these two things together requires re-thinking things. NVM isn’t as fast as DRAM. Storage people aren’t used to worrying about CPU cache effects on data resilience.

So mramfs looks at this from a very file systems centric perspective: how do we exploit this nifty new memory to build a new kind of RAM disk? It’s still RAM but now it’s persistent. NVRAM is slower than DRAM and hence it makes sense to compress the data to increase the effective data transfer rate (though it is not clear if that really will be the case.)

I didn’t find a strong motivation for compression, though I can see the viability of it now, in a world in which we want to pack as much as we can into a 64 byte cache line. The authors point out that one of the previous systems (Conquest) settled on a 53 byte inode size. The authors studied existing systems and found they could actually compress down to 20 bytes (or less) for a single inode. They achieved this using a combination of gamma compression and compressing common file patterns (mode, uid, and gid). Another reason for this approach is they did not wish to burden their file system with a computationally expensive compression scheme.
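
For readers who have not run into gamma codes, here is a small sketch. I am assuming that “gamma compression” refers to Elias gamma coding, in which small positive integers get short codes; I have not reproduced the paper’s exact field layout, and the bit strings below are only for readability (a real implementation would pack bits).

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]                 # binary digits, most significant bit first
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits: str):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":           # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


# Hypothetical inode-like fields: small values dominate, so the codes stay short.
fields = [1, 5, 9, 300]
encoded = "".join(gamma_encode(value) for value in fields)
```

The value 1 takes a single bit and 5 takes five bits, which is why this kind of code suits fields that are usually small. It is also cheap to compute, which fits the authors’ wish to avoid an expensive compression scheme.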

In Figure 1 (from the paper) the authors provide a graphic description of their data structures. This depicts a fairly traditional UNIX style file system, with an inode table, name space (directories), references from directory entries to the inodes. Inodes then point to control structures that eventually map to the actual data blocks.

The actual memory is managed by the file system from a single chunk of non-volatile memory; the memory is virtually addressed and the paper points out that they don’t actually care how that mapping is achieved.

Multiple inodes are allocated together in inode blocks with each block consisting of 16 (variable length) inodes. The minimum size of a block is 256 bytes. Inodes are rewritten in place whenever possible, which can lead to slack space. If an inode doesn’t fit within its existing space, the entire block is reconstructed and then written to a new block. Afterwards, the block pointer is changed to point to the new block. Then the old block is freed.
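
The update rule is easier to follow as a toy model. The sketch below is mine, not the paper’s code: it uses Python dictionaries in place of non-volatile memory, ignores the 256 byte minimum, and assumes that a rebuilt block is repacked without slack. The class and method names are hypothetical.

```python
INODES_PER_BLOCK = 16


class InodeStore:
    """Toy model: block_ptr maps a block number to an address in `memory`."""

    def __init__(self):
        self.memory = {}        # address -> list of (capacity, inode bytes)
        self.block_ptr = {}     # block number -> address of the current block
        self._next_address = 0

    def _write_new_block(self, slots):
        address = self._next_address
        self._next_address += 1
        self.memory[address] = slots
        return address

    def create_block(self, block_no, inodes):
        if len(inodes) > INODES_PER_BLOCK:
            raise ValueError("too many inodes for one block")
        slots = [(len(data), data) for data in inodes]
        self.block_ptr[block_no] = self._write_new_block(slots)

    def update(self, block_no, slot, data):
        old_address = self.block_ptr[block_no]
        slots = self.memory[old_address]
        capacity, _ = slots[slot]
        if len(data) <= capacity:
            # Fits: rewrite in place; capacity - len(data) bytes become slack.
            slots[slot] = (capacity, data)
            return "in-place"
        # Does not fit: rebuild the whole block and write it somewhere new.
        rebuilt = [(len(existing), existing) for _, existing in slots]
        rebuilt[slot] = (len(data), data)
        new_address = self._write_new_block(rebuilt)
        self.block_ptr[block_no] = new_address   # switch the block pointer
        del self.memory[old_address]             # then free the old block
        return "relocated"

    def read(self, block_no, slot):
        return self.memory[self.block_ptr[block_no]][slot][1]
```

Writing the new block before switching the pointer means a reader following the pointer always sees a complete block, old or new.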

One thing that is missing from this is much reasoning about crash consistency, which surprised me.

The authors have an extensive evaluation section, comparing to ext2fs, ramfs, and jffs2 (all over RAM disk). Their test was a create/unlink micro-benchmark, thus optimizing the metadata insertion/deletion case. They then questioned their entire testing mechanism by pointing out that the time was also comparable to what they achieved using tmpfs building the openssl package from source. Their final evaluation was done without the compression code enabled (“[U]nfortunately, the data compression code is not yet reliable enough to complete significant runs of Postmark or of large builds…”). They said they were getting about 20-25% of the speed without compression.

Despite this finding, their conclusion was “We have shown that both metadata and file data blocks are highly compressible with little increase in code complexity. By using tuned compression techniques, we can save more than 60% of the inode space required by previous NVRAM file systems, and with little impact on performance.”

My take-away? This was an early implementation of a file system on NVM. It demonstrates one of the risks of thinking too much in file systems terms. We’ll definitely have to do better.

The DEMOS File System

February 13, 2018

The DEMOS File System
Michael L. Powell, In Proceedings of the sixth ACM symposium on Operating systems principles, pp. 33-42.

This paper delves into the nitty gritty details of constructing physical file systems. I was surprised that it had relatively few citations (61 according to Google Scholar when I checked) because, having read it, I would hand this paper to someone asking me “what are file systems?” I suspect that the more frequently cited paper in this area will be “A Fast File System for UNIX,” which cites this paper.

The target for DEMOS is the CRAY-1 supercomputer, at the time the fastest computer in the world. As a matter of comparison, modern mobile devices have more computational power (and often more I/O bandwidth) than the CRAY-1 did.

The author discusses the design of a new file system for use with a custom operating system for the Los Alamos National Laboratory’s CRAY-1 computer system. A key goal for this project was to improve performance as much as possible. After all, why build a super-computer if you then cripple it with features that don’t enhance its performance?

What I find delightful about this paper is that it describes the basic constituent parts of a file system, as well as strategies for optimizing performance. It does so in a clear and understandable fashion.

DEMOS utilizes a UNIX-like hierarchical file system model. It has directories and files. It does not have the link model from Multics so paths to files are unique in DEMOS. Files are managed in units of blocks (4096 bytes) but I/O is specified as bytes (interestingly, they specify eight bit bytes as nine bit machines were still in use.)

The author discusses file sizes. To the best of my knowledge this is one of the earliest papers covering this common subject (which is revisited periodically because workloads change and file sizes also change). One of the common themes I have seen in other work is mirrored here: most files are small. Figure 1 shows a CDF for file sizes. The majority of files in the system are small, with approximately 75% being less than 1KB; this is consistent with later work as well. The second figure (Figure 2) describes the proportion of transfer sizes and their source. We see a spike in the hundreds of bytes, perhaps at 256 or 512, which would be “natural block sizes” for applications to use.

They establish lofty performance requirements: “[T]he file system will have to support a bandwidth of 20-60 megabits/second or higher”. Our performance requirements today would be much higher, but this recognizes the reality that then (as now) the I/O bandwidth of storage is often the rate limiting factor.

DEMOS is paired with a centralized storage facility (“Common File System” or CFS) that is to provide the function of what we would now think of as a centralized file server. While not yet implemented by the time of the paper, their plan was to introduce automatic file migration and staging.

The central bit of the paper then describes the constituent parts of the file system. This maps rather well onto what I have seen in the typical file system: a “request interpreter” that handles requests from applications (even their description is appropriate: “parameter validation and request translation”); a “buffer manager” that handles the allocation of buffer cache space (often virtual cache these days); and a “disk driver” that handles low level data operations, such as filling or storing the contents of buffers.

Figures 3 and 4 capture their insight into the disk manager. This dovetails with their discussion about efficiency of I/O, including observations about queue management (“shortest seek time first” order for requests, and then sub-sorted by “shortest latency time first”). This is a clear “hat tip” to the impact that rotational latency and track seek time have on performance.
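
The two-level ordering is simple to sketch. The version below is my own simplification rather than the DEMOS algorithm: it greedily picks the pending request with the shortest seek, breaks ties by the shortest rotational delay, and pretends the platter does not rotate while the head is seeking. The Request type and the disk geometry are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    cylinder: int
    sector: int


def schedule(requests, cylinder, sector, sectors_per_track):
    """Order requests by shortest seek first, then shortest rotational delay first."""
    pending = list(requests)
    order = []
    while pending:
        nxt = min(
            pending,
            key=lambda r: (
                abs(r.cylinder - cylinder),               # seek distance
                (r.sector - sector) % sectors_per_track,  # sectors until target passes
            ),
        )
        pending.remove(nxt)
        order.append(nxt)
        # Simplification: the head ends up exactly at the serviced request.
        cylinder, sector = nxt.cylinder, nxt.sector
    return order


# Hypothetical queue with the head at cylinder 50, sector 0, 8 sectors per track.
queue = [Request(52, 3), Request(50, 6), Request(50, 2), Request(10, 1)]
plan = schedule(queue, cylinder=50, sector=0, sectors_per_track=8)
```

Here the two requests on the current cylinder go first, nearest sector first, then the short seek to cylinder 52, and the long seek to cylinder 10 comes last.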

On performance, the author makes two observations about improving I/O. The first: “I/O operations ought to proceed in parallel with computation”. The point is that serializing these things decreases overall performance. The second: “[T]he length of time an I/O operation takes should be reduced as much as possible.” This seems logical and is one reason for the paper’s optimized strategy.

There is a section on “file system buffering” that touches on the tradeoffs between using memory for buffer caching versus other possible uses. The author evaluates how increased buffering impacts CPU utilization, which is in keeping with the goal of parallelizing I/O and computation. The observation? The greatest benefit comes from a small number of buffers; in the paper’s analysis, eight buffers provide most of the benefit. Building on that, Figures 6 and 7 show there is a clear limit to the benefit of further buffering. These days we do not think too much about this because we tend to use virtual caches, so the amount of physical memory is really managed by the virtual memory management code, yet the observation would likely still apply: there is a limit to the benefit of buffering.

The author also points out that disk allocation is challenging. The design employs allocation bit maps, cluster allocation, over-allocation, and even simplistic predictive read-ahead. The paper refers to these as “strategy” routines.
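
As a reminder of what bitmap allocation with clustering looks like in practice, here is a minimal first-fit sketch. It is not taken from the paper: the function name is hypothetical, the bitmap is a plain Python list of booleans, and real allocators add hints, over-allocation, and better search structures.

```python
def allocate_run(bitmap, count):
    """Find the first run of `count` free blocks, mark it used, and return its start.

    `bitmap[i]` is True when block i is in use. Returns None if no run is long enough.
    """
    run = 0
    for i, used in enumerate(bitmap):
        run = 0 if used else run + 1
        if run == count:
            start = i - count + 1
            for j in range(start, i + 1):
                bitmap[j] = True
            return start
    return None


# Hypothetical seven-block disk: blocks 1, 3, 4, and 5 are free.
disk = [True, False, True, False, False, False, True]
first = allocate_run(disk, 2)
```

Asking for contiguous runs is what keeps a file’s blocks together on disk, which in turn is what makes read-ahead and the seek-ordered queue worthwhile.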

In general, this is a great introduction to the basic structure of a media file system. There are plenty of details that will be refined in later work.

TENEX

January 4, 2018

Tenex, a Paged Time Sharing System for the PDP-10
Communications of the ACM, March 1972, Volume 15, Number 3
Daniel G. Bobrow, Jerry D. Burchfiel, Daniel L. Murphy, and Raymond S. Tomlinson, Bolt Beranek and Newman Inc.

TENEX is a new time sharing system implemented on a DEC PDP-10 augmented by special paging hardware developed at BBN. This report specifies a set of goals which are important for any time sharing system. It describes how the TENEX design and implementation achieve these goals. These include specifications for a powerful multiprocess large memory virtual machine, intimate terminal interaction, comprehensive uniform file and I/O capabilities, and clean flexible system structure. Although the implementation described here required some compromise to achieve a system operational within six months of hardware checkout, TENEX has met its major goals and provided reliable service at several sites and through the ARPA network.

Storage organization and management in TENEX
Daniel L. Murphy
AFIPS ’72 (Fall, part I) Proceedings of the December 5-7, 1972, fall joint computer conference, part I
Pages 23-32, Anaheim, California — December 05 – 07, 1972

The first of these two papers discusses TENEX; much of the paper is not about file systems, but there are about a page and a half about the TENEX file system. The second paper goes into greater detail about storage – including the file system – for TENEX. I have picked these papers for several reasons:

They demonstrate the impact of the MULTICS work on the systems that follow (certainly beyond the obvious UNIX work).
They introduce the concept of virtual memory integration with the file system
They introduce the concept of copy-on-write
They show the fundamental drive to maintain backwards compatibility
They introduce the concept of a suffix (or extension) as a means of identifying the purpose of a file
They delve into the details of open file state management

A TENEX file has a compound name structure:

A powerful and versatile directory and file naming facility is provided in which a particular file is identified by a fixed-depth path which includes device, directory name, file name, extension, and version.

The fixed-depth path is a limitation the TENEX developers chose to implement for backwards compatibility with existing PDP-10 programs, an early example of how application compatibility is often a critical concern for operating systems development. The authors do note they are considering expanding upon this to make it arbitrary depth – a feature of MULTICS.

Both papers also discuss the Job concept, the idea of a set of related processes. The implication is that processes within a single job can share resources, thus providing more of that “balance between sharing and isolation” that operating systems have to handle. When a file is opened successfully, a Job File Number (JFN) is created in a table. The table entry encapsulates the information about how to find the given file, and the program instead uses an index value: in other words, a file descriptor or file handle. “Once the initial association of JFN and file has been established, the JFN is used for all ensuing operations on the file, including sequential reading and writing, opening, closing, etc.”

TENEX then allows random access to the file by combining the JFN with an index identifying the desired element. The authors point out that this is more flexible than previous systems in which the file was not random access.

This becomes flexible when describing the page map for a given process. The Process Map points from an entry in the virtual address space to a corresponding JFN and index (offset). Thus, the contents of that page can be retrieved on demand from the underlying file system.
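
A toy model may help tie the JFN and the process map together. This is my own sketch and not TENEX code: the class names, the dictionary standing in for the file system, and the four-byte page size are all assumptions chosen to keep the example short.

```python
PAGE_SIZE = 4   # toy value so the example is easy to follow


class Job:
    """Holds the JFN table for a set of related processes."""

    def __init__(self, file_system):
        self.file_system = file_system   # name -> bytes, stand-in for real storage
        self.jfn_table = []              # index (the JFN) -> file name

    def get_jfn(self, name):
        if name not in self.file_system:
            raise FileNotFoundError(name)
        self.jfn_table.append(name)
        return len(self.jfn_table) - 1

    def read_page(self, jfn, page_index):
        data = self.file_system[self.jfn_table[jfn]]
        start = page_index * PAGE_SIZE
        return data[start:start + PAGE_SIZE]


class ProcessMap:
    """Maps virtual pages to (JFN, file page) and loads contents on first touch."""

    def __init__(self, job):
        self.job = job
        self.entries = {}    # virtual page -> (jfn, file page index)
        self.resident = {}   # virtual page -> page contents once loaded

    def map_page(self, virtual_page, jfn, page_index):
        self.entries[virtual_page] = (jfn, page_index)

    def touch(self, virtual_page):
        if virtual_page not in self.resident:     # simulated page fault
            jfn, page_index = self.entries[virtual_page]
            self.resident[virtual_page] = self.job.read_page(jfn, page_index)
        return self.resident[virtual_page]


# Hypothetical file of two pages, with its second page mapped at virtual page 7.
job = Job({"report": b"abcdefgh"})
jfn = job.get_jfn("report")
process_map = ProcessMap(job)
process_map.map_page(7, jfn, 1)
```

Nothing is read when the page is mapped; the contents arrive only when the process first touches virtual page 7.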

None of this should look particularly surprising to anyone familiar with modern operating systems, of course. This just happens to be part of the path to get to where we are today. The papers go into greater detail here, including access control, but that isn’t germane to my file systems focus.

Since the file path names identify files over the domain of all jobs in the system, it is evident that our naming and mapping procedures readily provide a means for sharing storage. Using the appropriate path names (including legality checks), processes in two or more different jobs can identify the same file, and each can obtain a JFN for it. Nothing in the mapping procedures specified above requires that either process be aware of the other’s access, and so each process constructs an identifier and places it in its process map (Figure 4).

In other words, the contents of regions of a file can be shared across processes. This is, in fact, transparent to the processes.

Sharing at this level would be particularly important because of the limited address space and desire to share code – the papers discuss this, and point out the benefits of this form of sharing.

This leads to their inclusion of copy-on-write. “One other important TENEX feature which facilitates sharing is a type of page access called copy-on-write. To our knowledge, this facility was first developed and used on the BBN-LISP system for the XDS-940.” Thus, while not original to TENEX, this is a logical extension beyond what MULTICS had described. Copy-on-write is a mainstay of both modern operating systems and some file systems.
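For readers who have not met copy-on-write before, the following is a minimal sketch of the idea. It is not how TENEX implemented it (the real mechanism lives in the paging hardware and monitor), and `SharedPage` and `CowMapping` are hypothetical names.

```python
class SharedPage:
    """A page of data that several mappings may reference at once."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)


class CowMapping:
    """One process's view of a page that starts out shared."""

    def __init__(self, page: SharedPage):
        self.page = page
        self.private = False

    def read(self) -> bytes:
        # Reads never copy; all sharers see the same underlying page.
        return bytes(self.page.data)

    def write(self, offset: int, value: int) -> None:
        if not self.private:
            # First write: take a private copy so other sharers are unaffected.
            self.page = SharedPage(bytes(self.page.data))
            self.private = True
        self.page.data[offset] = value
```

Two mappings built over the same `SharedPage` share storage until one of them writes, at which point only the writer pays for a copy.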

Interestingly, TENEX seems to implement a rudimentary page cache as well:

To implement the file sequential monitor calls (e.g., byte-in, byte-out) the monitor maintains a number of “window” pages in a separate map invisible to the user process. For each file with sequential operations in progress, the monitor maps the file page which is to receive or provide the next byte. Each call from the user causes one or more bytes to be loaded from or stored into this page, and a count updated to determine if a new page should be mapped. Movement through the file is accomplished by mapping successive pages, and the sequential access module does not have to be aware of the physical device on which the page resides nor interface with I/O driver modules to read or write it. This modularity is very satisfying from an operating system design point of view.

Thus, byte level access to block level devices is managed via this window page mechanism. The files are not strictly memory mapped, though, so this is more like a buffer cache than a page cache.
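The window page mechanism can be approximated with a small sketch. This is an assumption-laden illustration: the file is modeled as a list of in-memory pages, and `SequentialReader`, `byte_in`, and the `maps` counter are names chosen here, not the monitor's real interface.

```python
class SequentialReader:
    """Serves bytes one at a time from a 'window' onto the current file page."""

    def __init__(self, pages: list):
        self.pages = pages                     # one bytes object per file page
        self.page_index = 0
        self.window = pages[0] if pages else b""
        self.count = 0                         # bytes consumed from the window
        self.maps = 1 if pages else 0          # how many pages were mapped

    def byte_in(self):
        if self.count == len(self.window):
            # Window exhausted: map the next page, or report end of file.
            self.page_index += 1
            if self.page_index >= len(self.pages):
                return None
            self.window = self.pages[self.page_index]
            self.count = 0
            self.maps += 1
        value = self.window[self.count]
        self.count += 1
        return value
```

The caller only ever asks for the next byte; page boundaries and whatever device sits behind the pages stay hidden inside the reader, which is the modularity the quoted passage is praising.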

They also use the file system to implement inter-process communications – a form of file-backed shared memory.

Page management is tightly tied to this implementation as well, though the description involves what we would likely consider the memory management unit and page fault handling logic as well as the page to file/offset mapping necessary to provide the system’s demand paging.

Two other interesting aspects of their file system model are a pair of extra mapping layers: one for mapping from logical storage address to physical storage location, and the other mapping from multiple distinct page references to a single storage block.

The underlying rationale here is that this permits relocating the storage to different locations, typically from higher speed storage (when warm/hot) to slower speed storage (when cold).

This mechanism doesn’t involve changing the actual description of the storage and instead moves to a logical storage addressing model. It was interesting to me to see this level of indirection added in such an early system, but clearly the mismatch in speeds between various types of storage dictated the importance of this scheme. Once again, it is interesting to see how little the problems we face have actually changed.
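The value of the logical addressing layer is easiest to see in a toy model. The sketch below is an illustration, not the TENEX structure; `StorageMap`, the tier names, and the slot numbers are all made up for the example.

```python
class StorageMap:
    """Indirection from a logical storage address to a physical location."""

    def __init__(self):
        self.location = {}  # logical address -> (tier, physical slot)

    def place(self, logical: int, tier: str, slot: int) -> None:
        self.location[logical] = (tier, slot)

    def resolve(self, logical: int) -> tuple:
        return self.location[logical]

    def migrate(self, logical: int, new_tier: str, new_slot: int) -> None:
        # Only this table changes; anything holding the logical address
        # (a file's page list, for instance) is left untouched.
        self.location[logical] = (new_tier, new_slot)


# A file's description refers to logical addresses only.
file_pages = [100, 101]
storage = StorageMap()
storage.place(100, "fast", 7)
storage.place(101, "fast", 8)
storage.migrate(101, "slow", 4242)  # page went cold, move it
```

After the migration, `file_pages` is unchanged, yet resolving address 101 now leads to the slower tier.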

The data sharing model also uses an extra level of indirection. I’m familiar with this model from my own work in Windows, where shared memory is indirectly mapped in a similar fashion. That this mechanism was around in the early 1970s is once again a reminder of how little operating systems have fundamentally changed.

There are many aspects of this paper that I have glossed over, in no small part because they don’t really apply to modern systems – we don’t have to worry about drum memories, for example, any more than we need to worry about punch card readers. These two papers, however, clearly lay out a deeper realization of the file system than I have seen in prior work.

TENEX differed from MULTICS in a number of ways and the two systems remained competitors for many years. TENEX ultimately would become TOPS-20 and in turn be supported by Digital Equipment Corporation. It was an important part of the early (pre-VAX) ARPANET and survived for many years as a viable system.

If you would like to read more about this, I’d recommend Dan Murphy’s Origins and Development of TOPS-20 post. It provides further fascinating background on how TENEX evolved and how systems evolved. I leave you with the final words from that post:

Although this book is about DEC’s 36-bit architecture, it is clear now that hardware CPU architectures are of declining importance in shaping software. For a long time, instruction set architectures drove the creation of new software. A new architecture would be introduced, and new software systems would be written for it. The 36-bit architecture was large in comparison to most other systems which permitted interactive use. It was also lower in cost than most other systems of that size. These factors were important in the creation of the kind of software for which the architecture is known.

Now, new architectures are coming along with increasing frequency, but they are simply “slid in” under the software. The software systems are far too large to be rewritten each time, and a system which cannot adapt to new architectures will eventually suffer declining interest and loss of competitive hardware platforms. TOPS-20 didn’t pass that test, although it did contribute a number of ideas to the technology of interactive systems. How far these ideas ultimately spread is a story yet to be told.

There is considerable insight in this for me, particularly the admonishment “software systems are far too large to be rewritten each time” as it resonates with (one of) my own research directions.

MULTICS

January 2, 2018

A General-Purpose File System For Secondary Storage
R. C. Daley and P.G. Neumann
Published in the Proceedings of the American Federation of Information Processing Societies 1965, Fall Joint Computer Conference, vol. 27, pp. 213-229.

This is the seminal paper discussing how file systems were envisioned within the MULTICS operating system. While you can still run MULTICS, it is a curiosity at this point. However, virtually all operating systems we use today descended from MULTICS and thus, its design profoundly influenced their development.

This paper is a delightfully easy read, written at the dawn of this new field of multi-programming. Prior to this time computers were essentially single user. The introduction of the idea of sharing a computer with other users was nascent. Thus, the experts working in the field at the time had to begin thinking about things like organization, security, and sharing.

Indeed, a common tension in operating systems literature in general is between isolation and sharing. Isolation is great from a security perspective, but is inefficient. Each user of the system often uses the same programs, for example, but we do not want to keep a distinct copy of the same thing for each user as that would be wasteful. This profoundly impacts the file systems work because the file system is the point of persistence, the level at which shared resources become manifest.

But I’m jumping ahead at this point. Let’s start with the simple question: What is a file system? As we will find during this journey, its meaning and usage is far richer than one might think upon first reading. While this paper is not the first paper to discuss storage and file systems, it is a good example of the state of the art in 1965.

This paper offers us some useful definitions:

A file is simply an ordered sequence of elements, where an element could be a machine word, a character, or a bit, depending upon the implementation.

That seems to be a rather broad definition, but it is a reasonable place for us to start. This does not impose structure on the content itself, which proves to be one reason why this abstraction ultimately turns out to be a powerful one. “At the level of the file system, a file is formatless.”

This paper also establishes the name space abstractions:

As far as a particular user is concerned, a file has one name, and that name is symbolic. (Symbolic names may be arbitrarily long, and may have syntax of their own. For example, they may consist of several parts, some of which are relevant to the nature of the file, e.g., ALPHA FAP DEBUG.)

This again paints a rather broad abstraction. The name has meaning to the user, but otherwise is just symbolic data for the file system. The paper goes on to define the now classic name space specific abstraction:

A directory is a special file which is maintained by the file system, and which contains a list of entries. To a user, an entry appears to be a file and is accessed in terms of its symbolic entry name, which is the user’s file name. An entry name need be unique only within the directory in which it occurs.

Thus, this paper lays out the quintessential aspect of modern file system name spaces: they are hierarchically organized. The paper describes this in greater detail and refers to links and branches.

The authors describe how users might work in different parts of the hierarchical name space. They observe that this then creates a situation in which sharing of files might be an issue, and thus they introduce the concept of links to resolve this.

A link in this context is an entry in a directory that refers to an existing file within the file system. Thus, we see the genesis of links, though the paper does not clearly delineate symbolic links from hard links. This does help motivate why these features show up in UNIX a number of years later, however.
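A small sketch may help show how directories, per-directory names, and links fit together. It assumes a link stores the path of another entry, which is one plausible reading and is an assumption of this example; `File`, `Directory`, `Link`, and `resolve` are hypothetical names.

```python
class File:
    def __init__(self, contents: str):
        self.contents = contents


class Directory:
    def __init__(self):
        self.entries = {}  # entry name -> File, Directory, or Link


class Link:
    """An entry that refers to an existing entry elsewhere in the hierarchy."""

    def __init__(self, target_path: list):
        self.target_path = target_path  # entry names, starting at the root


def resolve(root: Directory, path: list, depth: int = 0):
    """Walk the hierarchy one entry name at a time, following links."""
    if depth > 8:
        raise RuntimeError("too many nested links")
    node = root
    for name in path:
        node = node.entries[name]
        if isinstance(node, Link):
            node = resolve(root, node.target_path, depth + 1)
    return node


root = Directory()
alice, bob = Directory(), Directory()
root.entries["alice"] = alice
root.entries["bob"] = bob
alice.entries["report"] = File("draft 1")
bob.entries["report"] = File("unrelated")          # same name, other directory
bob.entries["shared"] = Link(["alice", "report"])  # bob's view of alice's file
```

Both users can have an entry called `report` because names only need to be unique within a directory, while `shared` lets one user reach the other's file without a second copy.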

The paper goes into greater detail about how they envision this hierarchical name space functioning, including traversal, working directory, and links.

From there the authors then turn their attention to a problem inherent in having a single shared file system name space with data contents belonging to different users, namely managing access to the individual files. Thus, they introduce access controls. They note that a file system could default to either permissive or restrictive access within this model. From this point they incorporate the access control list, the access mode for a given file, and the concept of access attributes. Of the five attributes listed, four of them are familiar: read, execute, write, and append access have analogs in modern file systems. The fifth, trap is interesting, in that it defines an explicit exception mechanism for access control that requires external validation for access – an interesting generality that is not typically present in modern file systems.
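The five attributes and the permissive-versus-restrictive default can be sketched as follows. The trap behaviour here is a loose interpretation (defer the decision to an external callback) and is an assumption of this example, as are the names `Mode`, `AccessControlList`, and `trap_handler`.

```python
from enum import Flag, auto


class Mode(Flag):
    READ = auto()
    EXECUTE = auto()
    WRITE = auto()
    APPEND = auto()
    TRAP = auto()


class AccessControlList:
    def __init__(self, default: Mode = Mode(0)):
        self.entries = {}           # user name -> granted Mode flags
        self.default = default      # Mode(0) is restrictive: deny unknown users
        self.trap_handler = None    # optional callable(user, wanted) -> bool

    def check(self, user: str, wanted: Mode) -> bool:
        granted = self.entries.get(user, self.default)
        if Mode.TRAP in granted:
            # Defer the decision to external validation, if any is installed.
            return bool(self.trap_handler and self.trap_handler(user, wanted))
        return (granted & wanted) == wanted


acl = AccessControlList()
acl.entries["alice"] = Mode.READ | Mode.WRITE
acl.entries["carol"] = Mode.TRAP
acl.trap_handler = lambda user, wanted: wanted == Mode.READ
```

Passing a different `default`, such as `Mode.READ`, turns the same structure into the permissive variant the authors mention.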

They also describe file sharing, introducing the concept of explicit operations to lock access to a given file, or to unlock the file. They suggest that a locked file would require the user to provide a designated key to permit accessing the file; I have seen this approach in some file systems, actually, though it is fairly uncommon.

The paper describes access control at length with no real surprises otherwise, other than perhaps the fact that many of these features disappeared from later operating systems, only to be resurrected and added many years later.

There is quite a bit in the paper about backup and restore processing. Much of the detail here is interesting historically but does not really add much to my exploration of file systems. If you are looking for more information about the history of backup, I do encourage you to read those sections. Having done magnetic tape backups in the past, I’m content with leaving them be.

There is one observation I will point out from this section, however. The authors actually discuss a recurring theme in file systems – the fact that storage itself ends up being a multi-level media management challenge.

In most cases a user does not need to know how or where a file is stored by the file system. A user’s primary concern is that the file be readily available to him when he needs it. In general, only the file system knows on which device a file resides.

The file system is designed to accommodate any configuration of secondary storage devices. These devices may cover a wide range of speeds and capacities. All considerations of speed and efficiency of storage devices are left to the file system. Thus all user programs and all other system programs are independent of the particular configuration of secondary storage.

They go on to describe migration of data from hot to cold storage as needed and point out these are functions of the file system. I found this an interesting insight since even today we routinely deal with these sorts of situations, such as the Strata paper from SOSP 2017 (“Strata: a Cross Media File System“).

The remainder of the paper focuses on how the file system is implemented in MULTICS.

I must admit, I found this section of the paper detailed yet fascinating because of the broad, sweeping way in which the authors lay down fundamental ideas that we see in modern operating systems. Some of it is not really in the scope of file systems. The section covers “segment management” (which appears to equate to shared executables, such as programs and shared libraries in modern systems), the concept of demand segment loading (which presages demand paging), the concept of file system search, mechanisms for managing file systems, memory recycling (reminiscent of page reclamation in modern virtual memory systems, based upon its description), device management, and I/O queueing. They finish up by describing their “multilevel storage management module”, backup system, and utility functions. The latter has (now) amusing functions like “file to cards” and “tape to file”.

So these give us asynchronous operation, paging, backup, hierarchical storage management, security, file sharing, and directory sharing. Most of these concepts survive to this day. Indeed, what I found most surprising after reading this paper is how few of these ideas disappeared: traps and file system search are the two that spring to my mind.

Thus the lesson of today’s paper: in 1965 the MULTICS team more or less laid down a model for how file systems were to work in virtually all modern operating systems. The subsequent work will provide us with greater insight into the details, but the basic shape of our file systems has not strayed far from this early vision.

On the history of File Systems

January 1, 2018

I have made it a goal for 2018 to answer a question several people have asked me: what papers should I read to learn more about file systems? So I’ve decided to attempt to copy a format that I’ve found useful – The Morning Paper. I admit, I am not sure I’ll be able to keep up the frenetic pace of a paper each day that he maintains, but I do know there are plenty of papers to read.

My motivation for doing this is simple: there are quite a few papers on this topic, I certainly haven’t read them all (or even the majority of them) and reading through them, along with my own interpretation of why the paper matters to File Systems will be useful for me. It also gives me a set of information that I can point out to people that ask me “so what should I read to learn more about this topic.”

Why File Systems? For me it’s been largely a quirk of fate. I’ve been working on operating systems for many years now, and file systems happen to be the area in which I’ve spent more time than any other. For something that is conceptually so simple (after all, it just maps files to blocks on your disk), it has considerable complexity. File systems are also the gateway to one of the more challenging parts of the operating system: the bit that stores persistent state. Errors in file systems are often not transient. That makes them challenging. The bar is set high because we don’t want to lose anyone’s data.

File systems are part of the plumbing of the operating system. They are essential to proper operation but when everything is working properly, nobody really notices they are there. Only when things go wrong does anyone notice. So it is within this world that I will delve.

The place to start is to attempt to define file systems. While you might think this is simple, it turns out to be surprisingly challenging. So I’ll approach it by providing some examples and then delving into what it means to be a file system.

Media File Systems

The easiest place to start is in the concept of the media file system. “Media” in this context means any tangible medium on which we can systematically record and/or read persistent information. Examples would include disks, tapes, non-volatile memory, or optical media. Whether or not RAM file systems fall into this category is an interesting question – and it helps demonstrate my point that pinning down this concept is more challenging than one might otherwise think.

Disk drives are typically what most people think of first for this category, though we now often view solid-state disk drives in the same category. File systems reside on top of media devices and keep track of the organization of the data itself. Some file systems can span multiple disks, or use transactional journals, or provide resilience against errors in the media itself. The continuum here is surprisingly broad, but in the end the purpose of the file system is to map from the vagaries of the media to a (mostly) uniform model of behavior so application programs “just work”.

Network File Systems

As we added computers to networks, we found it useful to be able to transfer information stored on one computer to another. This permitted applications to access data, regardless of where it was actually stored. By constructing a file system that uses a remote access protocol, we can present the remote storage device to the local system as if it were just a local file system – or rather, almost the same.

Good examples of this include the Network File System (NFS) originally developed by Sun Microsystems in the 1980s and the Andrew File System (AFS) originally developed by Carnegie Mellon University in the same time period. These days many people also use the Server Message Block (SMB) based network file systems; the roots of this are also back in the 1980s. All three of these remain in use today, in fact, though they have evolved from the earliest versions.

Key-Value Stores

There is considerable overlap between the world of file systems and databases. Several early file systems were really just single level lookup mechanisms, rather than the hierarchical name space that is so common these days. It is actually common to implement file systems so that hierarchy is added to a flat indexed name space (a “flat” file system). These are in essence one form of key-value store. Some file systems permit addressing by using keys, whether explicitly assigned or implicitly generated. A subsection of the literature focuses more on this type of file system and I will make sure to cover several of these.
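As a minimal sketch of hierarchy layered on a flat key space, consider a store whose only real structure is a dictionary keyed by full path. This is an illustration under that assumption rather than a description of any particular file system, and `FlatStore` and its methods are hypothetical names.

```python
class FlatStore:
    """A flat key-value store; 'directories' exist only as key prefixes."""

    def __init__(self):
        self.objects = {}  # full path string -> contents

    def put(self, key: str, value: bytes) -> None:
        self.objects[key] = value

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def list_dir(self, directory: str) -> list:
        # Derive the immediate children of a directory from the keys alone.
        prefix = directory.rstrip("/") + "/"
        children = set()
        for key in self.objects:
            if key.startswith(prefix):
                children.add(key[len(prefix):].split("/", 1)[0])
        return sorted(children)


store = FlatStore()
store.put("/home/a/x.txt", b"x")
store.put("/home/a/y.txt", b"y")
store.put("/home/b/z.txt", b"z")
```

Nothing in the store records that `/home` is a directory; the hierarchy is an interpretation applied to the keys at listing time.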

Name Spaces

Sometimes a file system is more about the namespace than anything else. For example the proc file system does not actually reside on top of any sort of storage. The name space provides a convenient way to find information that is generated on demand. In these cases, it is the name space that really matters, not how the information is stored.
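The idea of a name space with no storage behind it can be sketched in a few lines. This is not how the proc file system is implemented; `SyntheticNamespace`, the registered paths, and the generator functions are all invented for the example.

```python
import os


class SyntheticNamespace:
    """Names map to functions; contents are produced at read time."""

    def __init__(self):
        self.generators = {}  # path -> zero-argument callable returning str

    def register(self, path: str, generator) -> None:
        self.generators[path] = generator

    def read(self, path: str) -> str:
        # Nothing is stored; each read regenerates the content.
        return self.generators[path]()


reads = {"count": 0}


def read_counter() -> str:
    reads["count"] += 1
    return str(reads["count"])


ns = SyntheticNamespace()
ns.register("/self/pid", lambda: str(os.getpid()))
ns.register("/stats/reads", read_counter)
```

Reading `/stats/reads` twice returns different values, since the "file" is just a name bound to a computation.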

This conflation of storage, presentation, and name space adds to the richness and complexity of work being done in file systems. My goal is to explore these issues by reviewing the literature.
