MRAMFS: A compressing file system for non-volatile RAM
Nathan K. Edel, Deepa Tuteja, Ethan L. Miller, and Scott A. Brandt in Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2004), Volendam, Netherlands, October 2004.
This paper allows me to provide both a file systems paper and look at an interesting approach to byte-addressable non-volatile memory (NVM).
We have developed a prototype in-memory file system which utilizes data compression on inodes, and which has preliminary support for compression of file blocks. Our file system, mramfs, is also based on data structures tuned for storage efficiency in non-volatile memory.
One of the interesting aspects of NVM is that it has characteristics of storage (persistence) and memory (byte-addressability). Storage people are used to having vast amounts of time to do things: it is quite difficult, though not impossible, to do anything computationally with data that will be an important factor when it is combined with the overhead of I/O latency to disk drives. In-memory algorithms worry about optimal cache line usage and efficient usage of the processor, but they don’t need to worry about what happens when the power goes off.
Bringing these two things together requires re-thinking things. NVM isn’t as fast as DRAM. Storage people aren’t used to worrying about CPU cache effects on data resilience.
So mramfs looks at this from a very file systems centric perspective: how do we exploit this nifty new memory to build a new kind of RAM disk: it’s still RAM but now it’s persistent. NVRAM is slower than DIMM and hence it makes sense to compress it to increase the effective data transfer rate (though it is not clear if that really will be the case.)
I didn’t find a strong motivation for compression, though I can see the viability of it now, in a world in which we want to pack as much as we can into a 64 byte cache line. The authors point out that one of the previous systems (Conquest) settled on a 53 byte inode size. The authors studied existing systems and found they could actually compress down to 20 bytes (or less) for a single inode. They achieved this using a combination of gamma compression and compressing common file patterns (mode, uid, and gid). Another reason for this approach is they did not wish to burden their file system with a computationally expensive compression scheme.
In Figure 1 (from the paper) the authors provide a graphic description of their data structures. This depicts a fairly traditional UNIX style file system, with an inode table, name space (directories), references from directory entries to the inodes. Inodes then point to control structures that eventually map to the actual data blocks.
The actual memory is managed by the file system from a single chunk of non-volatile memory; the memory is virtually addressed and the paper points out that they don’t actually care how that mapping is achieved.
Multiple inodes are allocated together in inode blocks with each block consisting of 16 (variable length) inodes. The minimum size of a block is 256 bytes. inodes are rewritten in place whenever possible, which can lead to slack space. If an inode doesn’t fit within its existing space, the entire block is reconstructed and then written to a new block. Aftewards, the block pointer is changed to point to the new block. Then the old block is freed.
One thing that is missing from this is much reasoning about crash consistency, which surprised me.
The authors have an extensive evaluation section, comparing to ext2fs, ramfs, and jffs2 (all over RAM disk). Their test was a create/unlink micro-benchmark, thus optimizing the meta-data insertion/deletion case. They then questioned their entire testing mechanism by pointing out that the time was also comparable to what they achieved using tmpfs building the openssl package from source. Their final evaluation was done without the compression code enabled (“[U]nfortuantely, the data compression code is not yet reliable enough to complete significant runs of Postmark or of large builds…”). They said they were getting about 20-25% of the speed without compression.
Despite this finding, their conclusion was “We have shown that both metadata and file data blocks are highly compressisble with little increase in code complexity. By using tuned compression techniques, we can save more than 60% of the inode space required by previous NVRAM file systems, and with little impact on performance.”
My take-away? This was an early implementation of a file system on NVM. It demonstrates one of the risks of thinking too much in file systems terms. We’ll definitely have to do better.
A General-Purpose File System For Secondary Storage
R. C. Daley and P.G. Neumann
Published in the Proceedings of the American Federation of Information Processing Societies 1965, Fall Joint Computer Conference, vol. 27, pp. 213-229.
This is the seminal paper discussing how file systems were envisioned within the MULTICS operating system. While you can still run MULTICS, it is a curiosity at this point. However, virtually all operating systems we now use today descended from MULTICS and thus, its design profoundly influenced their development.
This paper is a delightfully easy read, written at the dawn of this new field of multi-programming. Prior to this time computers were essentially single user. The introduction of the idea of sharing a computer with other users was nascent. Thus, the experts working in the field at the time had to begin thinking about things like organization, security, and sharing.
Indeed, a common tension in operating systems literature in general is between isolation and sharing. Isolation is great from a security perspective, but is inefficient. Each user of the system often uses the same programs, for example, but we do not want to keep a distinct copy of the same thing for each user as that would be wasteful. This profoundly impacts the file systems work because the file system is the point of persistence, the level at which shared resources become manifest.
But I’m jumping ahead at this point. Let’s start with the simple question: What is a file system? As we will find during this journey, its meaning and usage is far richer than one might think upon first reading. While this paper is not the first paper to discuss storage and file systems, it is a good example of the state of the art in 1965.
This paper offers us some useful definitions:
A file is simply an ordered sequence of elements, where an element could be a machine word, a character, or a bit, depending upon the implementation.
That seems to be a rather broad definition, but it is a reasonable place for us to start. This does not impose structure on the content itself, which proves to be one reason why this abstraction ultimately turns out to be a powerful one. “At the level of the file system, a file is formatless.”
This paper also establishes the name space abstractions as well:
As far as a particular user is concerned, a file has one name, and that name is symbolic. (Symbolic names may be arbitrarily long, and may have syntax of their own. For example, they may consist of several parts, some of which are relevant to the nature of the file, e.g., ALPHA FAP DEBUG.)
This again paints a rather broad abstraction. The name has meaning to the user, but otherwise is just symbolic data for the file system. The paper goes on to define the now classic name space specific abstraction:
A directory is a special file which is maintained by the file system, and which contains a list of entries. To a user, an entry appears to be a file and is accessed in terms of its symbolic entry name, which is the user’s file name. An entry name need be unique only within the directory in which it occurs.
Thus, this paper lays out the quintessential aspect of modern file system name spaces: they are hierarchically organized. The paper describes this in greater detail and refers to links and branches.
The authors describe how users might work in different parts of the hierarchical name space. They observe that this then creates a situation in which sharing of files might be an issue, and thus they introduce the concept of links to resolve this.
A link in this context is an entry in a directory that refers to an existing file within the file system. Thus, we see the genesis of links, though the paper does not clearly delineate symbolic links from hard links. This does help motivate why these features show up in UNIX a number of years later, however.
The paper goes into greater detail about how they envision this hierarchical name space functioning, including traversal, working directory, and links.
From there the authors then turn their attention to a problem inherent in having a single shared file system name space with data contents belonging to different users, namely managing access to the individual files. Thus, they introduce access controls. They note that a file system could default to either permissive or restrictive access within this model. From this point they incorporate the access control list, the access mode for a given file, and the concept of access attributes. Of the five attributes listed, four of them are familiar: read, execute, write, and append access have analogs in modern file systems. The fifth, trap is interesting, in that it defines an explicit exception mechanism for access control that requires external validation for access – an interesting generality that is not typically present in modern file systems.
They also describe file sharing, introducing the concept of explicit operations to lock access to a given file, or to unlock the file. They suggest that a locked file would require the user provide a designated key to permit accessing the file; I have seen this approach in some file systems, actually, though it is fairly uncommon.
The paper describes access control at length with no real surprises otherwise, other than perhaps the fact that many of these features disappeared from later operating systems, only to be resurrected and added many years later.
There is quite a bit in the paper about backup and restore processing. Much of the detail here is interesting historically but does not really add much to my exploration of file systems. If you are looking for more information about the history of backup, I do encourage you to read those sections. Having done magnetic tape backups in the past, I’m content with leaving them be.
There is one observation I will point out from this section, however. The authors actually discuss a recurring theme in file systems – the fact that storage itself ends up being a multi-level media management challenge.
In most cases a user does not need to know how or where a file is stored by the file system. A user’s primary concern is that the file be readily available to him when he needs it. In general, only the file system knows on which device a file resides.
The file system is designed to accommodate any configuration of secondary storage devices. These devices may cover a wide range of speeds and capacities. All considerations of speed and efficiency of storage devices are left to the file system. Thus all user programs and all other system programs are independent of the particular configuration of secondary storage.
They go on to describe migration of data from hot to cold storage as needed and point out these are functions of the file system. I found this an interesting insight since even today we routinely deal with these sorts of situations, such as the Strata paper from SOSP 2017 (“Strata: a Cross Media File System“).
The remainder of the paper focuses on how the file system is implemented in MULTICS
I must admit, I found this section of the paper both detailed and yet fascinating because of the broad sweeping nature in which the authors lay down fundamental ideas that we see in modern operating systems. Some of it is not really in the scope of file systems (“segment management” which appears to equate to shared executables, such as programs and shared libraries in modern systems,) the concept of demand segment loading (which presages demand paging,) the concept of file system search, mechanisms for managing file systems, memory recycling (reminiscent of page reclamation in modern virtual memory systems based upon its description,) device management, and I/O queueing. They finish up by describing their “multilevel storage management module”, backup system, and utility functions. The latter has (now) amusing functions like “file to cards” and “tape to file”.
So these give us asynchronous operation, paging, backup, hierarchical storage management, security, file sharing, and directory sharing. Most of these concepts survive to this day. Indeed, what I found most surprising after reading this paper is how few of these ideas disappeared: traps and file system search are the two that spring to my mind.
Thus the lesson of today’s paper: in 1965 the MULTICS team more or less laid down a model for how file systems were to work in virtually all modern operating systems. The subsequent work will provide us with greater insight into the details, but the basic shape of our file systems has not strayed far from this early vision.
I have made it a goal for 2018 to answer a question several people have asked me: what papers should I read to learn more about file systems. So I’ve decided to attempt to copy a format that I’ve found useful – The Morning Paper. I admit, I am not sure I’ll be able to keep up the frenetic pace of a paper each day that he maintains, but I do know there are plenty of papers to read.
My motivation for doing this is simple: there are quite a few papers on this topic, I certainly haven’t read them all (or even the majority of them) and reading through them, along with my own interpretation of why the paper matters to File Systems will be useful for me. It also gives me a set of information that I can point out to people that ask me “so what should I read to learn more about this topic.”
Why File Systems? For me it’s been largely a quirk of fate. I’ve been working on operating systems for many years now, and file systems happens to be the area in which I’ve spent more time than any other. For something that is conceptually so simple – after all, it just maps files to blocks on your disk, it has considerable complexity. File systems are also the gateway to one of the more challenging parts of the operating system: the bit that stores persistent state. Errors in file systems are often not transient. That makes them challenging. The bar is set high because we don’t want to lose anyone’s data.
File systems are part of the plumbing of the operating system. They are essential to proper operation but when everything is working properly, nobody really notices they are there. Only when things go wrong does anyone notice. So it is within this world that I will delve.
The place to start is to even attempt to define file systems. While you might think this is simple, it turns out to be surprisingly challenging. So I’ll approach it by providing some examples and then delving into what means to be a file system.
Media File Systems
The easiest place to start is in the concept of the media file system. “Media” in this context means any tangible medium on which we can systematically record and/or read persistent information. Examples would include disks, tapes, non-volatile memory, or optical media. Whether or not RAM file systems fall into this category is an interesting question – and it helps demonstrate my point that pinning down this concept is more challenging than one might otherwise think.
Disk drives are typically what most people think of first for this category, though we now often view solid-state disk drives in the same category. File systems reside on top of media devices and keep track of the organization of the data itself. Some file systems can span multiple disk, or use transactional journals, or provide resilience against errors in the media itself. The continuum here is surprisingly broad, but in the end the purpose of the file system is to map from the vagaries of the media to a (mostly) uniform model of behavior so application programs “just work”.
Network File Systems
As we added computers to networks, we found it useful to be able to transfer information stored on one computer to another. This permitted applications to access data, regardless of where it was actually stored. By constructing a file system that uses a remote access protocol, we can present the remote storage device to the local system as if it were just a local file system – or rather, almost the same.
Good examples of this include the Network File System (NFS) originally developed by SUN Microsystems in the 1980s and the Andrew File System (AFS) originally developed by Carnegie-Mellon University in the same time period. These days many people also use the Server Message Block (SMB) based network file systems; the roots of this are also back in the 1980s. All three of these remain in use today, in fact, though they have evolved from the earliest versions.
There is considerable overlap between the world of file systems and databases. Several early file systems were really just single level lookup mechanisms, rather than the hierarchical name space that is so common these days. It is actually common to implement file systems so that hierarchy is added to a flat indexed name space (a “flat” file system). These are in essence one form of key-value store. Some file systems permit addressing by using keys, whether explicitly assigned or implicitly generated. A subsection of the literature focus more on this type of file system and I will make sure to cover several of these.
Sometimes a file system is more about the namespace than anything else. For example the proc file system does not actually reside on top of any sort of storage. The name space provides a convenient way to find information that is generated on demand. In these cases, it is the name space that really matters, not how the information is stored.
This conflation of storage, presentation, and name-space add to the richness and complexity of work being done in file systems. My goal is to explore these issues, by reviewing the literature.