ZUFS

After one of my earlier posts on FUSE file system performance, someone mentioned this project to me – the Zero copy Userspace File System project (ZUFS) which appears to be a NetApp sponsored project.

Sometimes Zero is best.

There have been a variety of talks about this project, including one at the Linux Plumbers Conference (which was held next door to me – I can see the venue from my window as I write this), as well as one at the SNIA Persistent Memory Summit in 2018. The NetApp repositories on GitHub contain both a file system reflector (zufs-zuf), which appears to be similar to the FUSE kernel driver, and the user mode server (zufs-zus), which handles dispatching the kernel level requests to the user mode file system implementations.

Their concern appears to be eliminating the copy of any data between kernel and user mode, which makes sense given their objective of supporting persistent memory, such as the new Intel Optane DC Persistent Memory that has recently become commercially available.

Persistent memory benefits from a direct access model, in which traditional file data caching is eschewed in favor of direct access. Thus, data is read or written directly from the underlying persistent memory, rather than copied from a buffer cache.
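The difference between the two access models can be sketched in a few lines of Python. This is only an illustration under loud assumptions: an ordinary temporary file stands in for a persistent memory region, the helper names are hypothetical, and on a regular file system the mapping is still served from the page cache rather than from a true direct-access device. The point is the shape of the code: one path hands a buffer to the kernel to be copied, the other stores straight into the mapped region.

```python
import mmap
import os
import tempfile


def make_scratch_file(size: int) -> str:
    """Create a zero-filled temp file standing in for a persistent memory region."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * size)
    return path


def write_via_copy(path: str, offset: int, data: bytes) -> None:
    """Traditional path: write() copies our buffer into the kernel's cache."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def write_via_mapping(path: str, offset: int, data: bytes) -> None:
    """Direct-access style: store into the mapped region of the file itself."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as region:
            region[offset:offset + len(data)] = data
            # Stand-in for whatever step makes the stores durable on real hardware.
            region.flush()
```

Both helpers leave the same bytes in the file; what differs is whether the data passes through an intermediate copy on the way.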

There are a few persistent memory file systems, including UCSD’s NOVA file system, though usually they were developed using emulation of persistent memory. In such systems, there is no benefit to copying the data from persistent memory into DRAM and back; indeed, it is a significant performance impediment.

What is not currently present in the NetApp repository is an implementation of a user mode persistent file system (they have a dummy file system implementation, which appears to be the base from which one could build a real file system). This definitely presents an interesting alternative to using traditional FUSE.

FUSE vs ZUFS Performance (from NetApp SNIA presentation)

I have not had an opportunity to play with this new system yet, but it certainly does seem to be intriguing – and the performance graph from the SNIA presentation is rather compelling, given the massive improvement in scalable performance.

There sure are quite a few alternatives to traditional FUSE to consider…

A Comparison of Two Network-Based File Servers

A Comparison of Two Network-Based File Servers
James G. Mitchell and Jeremy Dion, in Communications of the ACM, April 1982, Volume 25, Number 4.

Pair of File Servers

I previously described the Cambridge File Server (CFS).  In this 1981 SOSP paper the authors compare its inner details with those of the Xerox Distributed File System (XDFS).  The paper provides an interesting insight into the inner workings of these file servers.

Of course, the scale and scope of a file server in 1982 were vastly smaller than the scale and scope of file servers today.  In 1982 the disk drives used for their file servers were as large as 300MB.

SD Cards

This stands in stark contrast to the sheer size of modern SD cards; I think of them as slow, but compared to the disk drives of that era they are quite a bit faster, not to mention smaller.  I suspect the authors of this paper might be rather surprised at how the scale has changed, yet many of the basic considerations they were making back in the early 1980s are still important today.

 

  • Access Control (Security) – CFS was, of course, a capability based system. XDFS was an identity based system; most systems today are identity based systems, though we find aspects of both in use.
  • Storage Management – the interesting challenge here is how to ensure that storage is not wasted. The naive model is to shift responsibility for proper cleanup to the clients. Of course, the reality is that this is not a good model; even in the simple case of a client that crashes, it is unlikely the client will robustly ensure that space is reclaimed in such circumstances. CFS handles this using a graph file system and performing garbage collection in which an unreachable node is deemed subject to reclamation. XDFS uses the more naive model, but mitigates this by providing a directory service that can handle proper cleanup for clients – thus clients can “do it right” with minimal fuss, but are not constrained to do so.
  • Data Consistency – the authors point to the need to have some form of transactional update model. They observe that both CFS and XDFS offer atomic transactions; this represents the strong semantic end of the design spectrum for network file servers and we will observe that one of the most successful designs (Sun’s NFS) went to a much weaker end of the design spectrum. Some of this likely reflects the database background of the authors.
  • Network Protocols – I enjoyed this section, since this is very early networking, with CFS using the predecessor of token ring and XDFS using the 3Mb/s version of Ethernet. They discuss the issues inherent in the network communications: flow and error control (so message exchange and exception/error handling) and how the two respective systems handle them.
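The storage management idea attributed to CFS above, where an unreachable node is deemed subject to reclamation, can be sketched as a reachability walk over a directed graph. This is a minimal illustration, not the CFS algorithm itself: the dictionary representation, the node names, and both function names are assumptions made for the example.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """Return every node that can be reached by following edges from root."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def collect_garbage(graph: Dict[str, List[str]], root: str) -> Set[str]:
    """Delete every node not reachable from root and return the reclaimed set."""
    garbage = set(graph) - reachable(graph, root)
    for node in garbage:
        del graph[node]
    return garbage
```

A client that crashes after creating a file but before linking it into an index leaves exactly the kind of orphan this walk finds, which is why the server does not need to trust clients to clean up.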

The authors also compare details of the implementation:

  • They describe a scheme in CFS in which small files use a direct block, and larger files use indirect blocks (blocks of pointers to direct blocks). This means that small files are faster to access. It is similar to the model that we see in other (later) file systems. XDFS, by contrast, uses a B-tree to track the allocation of blocks to files, and a bitmap to record free/used space information.
  • They discuss redundancy, with an eye towards handling (partial) disk failures. Like any physical device, the disk drives of that era did wear out and fail.
  • They discuss their transaction log and how each system guaranteed consistency: they both use shadow pages, but their implementation of them is different. Ultimately, they both have similar issues, and similar impact. Shadow pages are a technique that we still use.
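The direct versus indirect block scheme in the first bullet is easy to illustrate. The sketch below is an assumption-laden toy, not the CFS on-disk format: the block size is tiny on purpose, there is only one level of indirection, and the class and function names are hypothetical. It counts block reads to show why small files are cheaper to access.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

BLOCK_SIZE = 4  # bytes per block; deliberately tiny for the example


@dataclass
class DataBlock:
    payload: bytes


@dataclass
class IndirectBlock:
    pointers: List[DataBlock]


def store(data: bytes) -> Union[DataBlock, IndirectBlock]:
    """Small files fit in one direct block; larger files get a block of pointers."""
    if len(data) <= BLOCK_SIZE:
        return DataBlock(data)
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return IndirectBlock([DataBlock(chunk) for chunk in chunks])


def read_byte(root: Union[DataBlock, IndirectBlock], offset: int) -> Tuple[int, int]:
    """Return (byte value, number of blocks that had to be read)."""
    if isinstance(root, DataBlock):
        return root.payload[offset], 1
    block = root.pointers[offset // BLOCK_SIZE]  # one read for the pointer block
    return block.payload[offset % BLOCK_SIZE], 2  # plus one for the data block
```

Reading from a small file touches one block, while reading from a large file touches two, which is the performance difference the bullet alludes to.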

The evaluation is interesting: it is not so much a measure of performance but rather insights into the strengths and weaknesses of each approach. For XDFS they note that their transaction support has been successful and it permits database transactions (in essence, XDFS becomes a form of simple database service). They point to the lack of support for both normal and special files; from their description a special file is one with guaranteed write semantics. They also observe that ownership of files is easily lost, which in turn leads to inefficient storage utilization. They observe that it is not clear whether the B-tree is a win or a loss for XDFS.

For CFS they point to the performance requirements as being a strength, though it sounds more like a design constraint that forced the CFS developers to make “hard choices” to optimize for performance. Similarly, they observe that the directed graph model of CFS is successful and capabilities are simple to implement. Interestingly, they also point to the index as well as string of names and access rights as being a success point. They note that CFS generalizes well (“[t]wo quite different filing systems built in this way coexist on the CFS storage.”) They count automatic garbage collection as a net win for CFS, though they also point out that CFS uses a reference count in addition to the garbage collection model. They list the CFS limitation of transactions to a single file or index as being one of its shortcomings and point to real-world experience porting other operating systems to use CFS as an indicator of the cost of this limitation. The specific limitation they describe is telling: “… since file directories are implemented as an index with an associated file, it is currently impossible to update both structures in a single transaction.” They conclude by arguing that XDFS has a better data layout, arguing that XDFS’s strategy of page allocation and intention logging is ultimately better than CFS’s cylinder maps: “… the redundancy function of cylinder maps does not seem to be as successful as those of page allocation and intentions logging; the program to reconstruct a corrupted block is not trivial.”

Ensuring correct recovery in a transactional system is certainly challenging in my experience, so I can understand the authors’ concerns about simplicity and scalability.
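Since both servers rely on shadow pages for consistency, a small sketch of the general technique may help. It is not either system's implementation: the class, its methods, and the in-memory "store" are all hypothetical. The essential property is that a transaction never overwrites a committed page in place; it writes new pages and then switches a single page table reference, so an abort or a crash before the switch leaves the old state intact.

```python
from typing import Dict, List, Sequence, Tuple


class ShadowPagedFile:
    """Toy file whose updates go to shadow pages until commit."""

    def __init__(self, pages: Sequence[bytes]) -> None:
        self._store: Dict[int, bytes] = {}  # page id -> contents
        self._next_id = 0
        # The committed page table: the only thing readers ever consult.
        self._page_table: Tuple[int, ...] = tuple(self._allocate(p) for p in pages)

    def _allocate(self, payload: bytes) -> int:
        page_id = self._next_id
        self._next_id += 1
        self._store[page_id] = payload
        return page_id

    def read(self, index: int) -> bytes:
        return self._store[self._page_table[index]]

    def begin(self) -> List[int]:
        """Start a transaction by taking a private copy of the page table."""
        return list(self._page_table)

    def write(self, shadow_table: List[int], index: int, payload: bytes) -> None:
        """Write to a freshly allocated page; the committed page is untouched."""
        shadow_table[index] = self._allocate(payload)

    def commit(self, shadow_table: List[int]) -> None:
        old_pages = set(self._page_table)
        self._page_table = tuple(shadow_table)  # the single switch-over point
        for page_id in old_pages - set(shadow_table):
            del self._store[page_id]  # reclaim pages that were replaced

    def abort(self, shadow_table: List[int]) -> None:
        for page_id in set(shadow_table) - set(self._page_table):
            del self._store[page_id]  # discard the shadow pages only
```

Recovery is simple in this model because there is nothing to undo: whichever page table was last switched in describes a consistent file.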

Overall, it is an interesting read, as I can see many of the issues described here as being file system issues; many of the techniques they describe in this paper show up in subsequent file systems. The distinction between file system and file server also becomes more clearly separated in future work.

Extension Framework for File Systems in User space

Extension Framework for File Systems in User space, Ashish Bijlani and Umakishore Ramachandran, USENIX Annual Technical Conference, 2019.

Useful Extensions

The idea of improving FUSE performance has become a common theme. This paper, which will be presented this week at USENIX ATC 2019 in Renton, WA, is one more to explore how we can improve FUSE performance.

One bit of feedback I received from the last FUSE performance paper I reviewed (last week) suggested that people do want to build file systems in user space for a variety of reasons, not the least of which is because they want to move that complexity out of the kernel environment. Thus, the argument is that the reason people build kernel file systems is because of performance. While I remain unconvinced that performance is the only impediment to a broader adoption of FUSE file systems, I will save that for a future discussion.

The approach the authors take this time does seem to try and bridge the gap: their proposal is to add kernel extensions that permit user mode file systems developers to add small modular components to the file system to optimize performance critical aspects. They address the increased security considerations inherent in allowing “kernel extensions” by sandboxing those extensions into an “in-kernel Virtual Machine (VM) runtime that safely executes the extensions”.

Their description of FUSE is quite a bit different than what I got from the FUSE performance paper at FAST 2017 – this paper describes FUSE as a “simple interposition layer”; the earlier description made it sound more complex than that. They do point out that FUSE file systems in production are becoming more common and point to Gluster, Ceph, and even Android’s SD card file system. For network file systems the overhead of FUSE is unlikely to have a material impact in all but the most performance sensitive environments because the overhead of the network likely dominates. Similarly, SD card media is typically slow so once again the rate-limiting overhead is likely not the FUSE library and driver.

In addition to proposing an extension model, the authors also point out that there is a class of “unneeded” operations that are difficult to omit because the level of control offered by FUSE presently is not sufficiently fine-grained; the authors propose enhancing FUSE to address these issues as well.

They set forth an interesting set of design considerations:

  • Compatibility – their observation is that the extension model must be something that works with existing file systems without requiring redesign or extensive coding.
  • Extensibility – the features offered by ExtFuse must allow adding specific features in a clean, minimalistic fashion, so that a FUSE file system developer can pick the specific features needed for their use case.
  • Safe and Performant – these are competing goals; the primary purpose of their work is to improve performance but they cannot do so at the expense of sacrificing security.
  • Correctness – they point out the challenge of having two operational paths (the “fast” path and the “slow” path, where the latter corresponds to the legacy path).
(Figure 1 from Paper)

The authors provide a graphical description of the architecture of their system in Figure 1 of the paper, which I have reproduced here. It shows that there are dual paths: the traditional FUSE path, as well as their accelerated path.

They move on to describe the extensions they implemented to demonstrate the range of functionality with their extension model:

  • Meta-data caching – the idea is that VFS itself cannot do effective caching due to the nature of its interface; the tighter interface between the extension and the user mode file system makes this more practical.
  • I/O stacking – the concept here is that data may have multiple processing layers, such as logging, or union file systems. By permitting the extension to handle this, the overhead is minimized; indeed, this reminded me of the Scout Operating Systems work, which focuses on constructing optimized pipelines for such work.
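The meta-data caching extension, together with the fast path and slow path split from the design considerations, can be sketched as follows. This is plain Python rather than anything that runs in the kernel, and every name in it (`UserspaceFS`, `FastPath`, `slow_calls`) is hypothetical. It only shows the contract: serve attribute lookups from a cache when possible, fall back to the user mode file system otherwise, and invalidate on change so the two paths never disagree.

```python
from typing import Any, Dict


class UserspaceFS:
    """Stand-in for the user mode file system reached via the slow path."""

    def __init__(self) -> None:
        self.attrs: Dict[str, Dict[str, Any]] = {"/a": {"size": 10, "mode": 0o644}}

    def getattr(self, path: str) -> Dict[str, Any]:
        return dict(self.attrs[path])

    def setattr(self, path: str, **changes: Any) -> None:
        self.attrs[path].update(changes)


class FastPath:
    """Stand-in for an extension that answers getattr from an attribute cache."""

    def __init__(self, fs: UserspaceFS) -> None:
        self._fs = fs
        self._cache: Dict[str, Dict[str, Any]] = {}
        self.slow_calls = 0  # how often we had to cross into user space

    def getattr(self, path: str) -> Dict[str, Any]:
        if path in self._cache:
            return self._cache[path]  # fast path: no round trip
        self.slow_calls += 1
        attrs = self._fs.getattr(path)  # slow path: ask the file system
        self._cache[path] = attrs
        return attrs

    def setattr(self, path: str, **changes: Any) -> None:
        self._fs.setattr(path, **changes)  # the file system stays authoritative
        self._cache.pop(path, None)  # invalidate so the next lookup is fresh
```

The invalidation in `setattr` is the correctness concern in miniature: skip it, and the fast path would keep returning the stale size.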

Their evaluation focuses on a handful of critical operations: getattr, setattr, getxattr, and read/write. They looked at a mix of optimization models: the use of a smart attribute cache is clearly a win based upon their performance analysis. FUSE remains slower than a native file system in many scenarios however (e.g., they use EXT4 as a benchmark comparison) though the performance seems to be much closer than we’ve seen in prior work.

They also ported multiple different file systems to their extension library: StackFS, BindFS, Android’s sdcard file system, MergerFS, and LoggedFS. None of them required even 1,000 lines of new code for the kernel extensions. While the authors do discuss some of the observed performance improvements for those file systems, they do not provide us with general benchmark comparisons.

Overall, this is an interesting paper, which combines a number of ideas together into an intriguing package. It will be interesting to see if this gains traction in the FUSE community.

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support, Yue Zhu, Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, Muhib Khan, and Weikuan Yu, in Proceedings of the 8th International Workshop on Runtime Operating Systems for Supercomputers, page 6, 2018.

Modern Fuse, circuit breakers instead of actual fuses.

There are quite a few papers that discuss the performance of the FUSE model. I already discussed a recent paper that explored the performance of FUSE on Linux and that paper observed that I/O performance for FUSE is reasonably good due to the optimization work that has been done to minimize the data copy overhead that can occur with a naive implementation.

What I do find surprising is the emphasis on FUSE performance; this leads me to think that people look to user mode file systems as something viable for implementing production file systems. Of course, one motivation for this is that building a FUSE file system is generally simpler than implementing an in-kernel file system. Some of this is environmental – the kernel is a harsh development environment, in which the smallest bugs lead to the system crashing.

Of course, virtual machine technologies have done quite a lot to minimize this overhead, as the “machine” that crashes is now more like an application. If you are developing code for the UNIX, Linux, or Windows kernel you are likely to be developing using C, the most commonly used systems language these days. It is possible to bravely branch out and use other languages, but then you inherit other interesting restrictions and frequently find that you are developing the tools as much as you are developing the file system.

Thus, one benefit of the user space file systems model is that you can use other development tools – FUSE file system implementations use a much larger range of programming languages than is normally found in kernel file systems. The FUSE model also permits fairly rapid development of a prototypical file system.

Today’s paper touches on these traditional issues and points out that sometimes what you need isn’t a general-purpose file system but rather something that is specifically crafted to solve the problem at hand. For the HPC community, performance is an important driver for the specialized file systems of choice. The authors use an optimized library, libsysio, that provides a POSIX-like interface which intercepts I/O operations to a remote file system – in essence, a sort of automated mechanism for turning I/O calls into something reminiscent of RPC.

The emphasis of the authors is in eliminating the overhead of system calls. Their approach is certainly focused: this solution works for a single application that requires high performance operations.

They start off by evaluating the cost overhead of using FUSE. Because their emphasis is on I/O, that is what they evaluate. Thus, unlike the earlier FUSE analysis, which indicated that meta-data operations were the most significant bottleneck, this work concludes there is still substantial impact on I/O performance as well.

They take an existing library from Sandia Labs, libsysio. I found multiple different versions of this library available on the Internet and was interested to find that it has been integrated into other file systems, including Lustre, with which I have some familiarity from past work. The authors don’t discuss whether their approach is better than using other HPC file systems, focusing instead on improving the performance of their specific use case.

One interesting design consideration for Direct-FUSE is they seek to support multiple FUSE file systems from a single application, using the same high performance communications approach. This is not usually an issue for applications with pure FUSE file systems because to the application the FUSE file system appears to be functionally equivalent to every other file system. This is, however, an issue that can arise when incorporating multiple I/O library based models into a single application; something they address in Direct-FUSE.

They describe their implementation model for supporting multiple distinct file systems, differentiating between file systems via a prefix matching model, and then forwarding name based requests as appropriate. File handle based operations work by using an indirection table for encapsulating the additional state needed to determine which file system should be used to satisfy requests against the particular file handle.
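The two mechanisms described above, prefix matching for name based requests and an indirection table for handle based requests, can be sketched together. This is a toy under stated assumptions rather than the Direct-FUSE code: `Backend` and `Dispatcher` are hypothetical names, and choosing the longest matching prefix is my assumption about how overlapping prefixes would be resolved.

```python
from typing import Dict, Tuple


class Backend:
    """Stand-in for one FUSE file system linked into the application."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._files: Dict[int, str] = {}
        self._next = 0

    def open(self, path: str) -> int:
        handle = self._next
        self._next += 1
        self._files[handle] = path
        return handle

    def read(self, handle: int) -> str:
        return f"{self.name}:{self._files[handle]}"


class Dispatcher:
    """Routes paths by prefix and handles by an indirection table."""

    def __init__(self) -> None:
        self._mounts: Dict[str, Backend] = {}
        self._handles: Dict[int, Tuple[Backend, int]] = {}
        self._next = 0

    def mount(self, prefix: str, backend: Backend) -> None:
        self._mounts[prefix.rstrip("/") + "/"] = backend

    def _resolve(self, path: str) -> Tuple[Backend, str]:
        matches = [p for p in self._mounts if path.startswith(p)]
        if not matches:
            raise FileNotFoundError(path)
        prefix = max(matches, key=len)  # assumption: longest prefix wins
        return self._mounts[prefix], path[len(prefix) - 1:]

    def open(self, path: str) -> int:
        backend, relative = self._resolve(path)
        fd = self._next
        self._next += 1
        # Remember which backend owns this handle and what it calls the file.
        self._handles[fd] = (backend, backend.open(relative))
        return fd

    def read(self, fd: int) -> str:
        backend, inner = self._handles[fd]
        return backend.read(inner)
```

The indirection table matters because each backend numbers its own handles independently; without it, two backends could both hand the application handle 0.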

Much of the paper focuses on the evaluation of their solution. In keeping with their focus on raw I/O performance, the evaluation is all about bandwidth at various I/O sizes. Their results indicate that they are able to achieve performance that is comparable to similar native file systems (they use ext4 and tmpfs implementations for these benchmarks).

They also compare their performance in the distributed file systems arena using FusionFS, an existing FUSE file system. They show comparable performance for read I/O bandwidth (including scalability to multiple nodes) as well as improved write I/O bandwidth.

They then evaluate the context switch difference between the two solutions (FUSE and Direct-FUSE) and observe that they have eliminated the context switch overhead.

Bottom line, they have found a way to improve performance over traditional FUSE file systems. They do not compare to other HPC oriented file systems (e.g., Lustre) and thus it is difficult for me to tell if this is a viable contender for larger scale distributed file systems work. Nevertheless, they do point out the impact of the context switch costs inherent in the traditional FUSE model.

I am left asking myself “is the goal to make FUSE performance close enough to native kernel file systems that it makes sense to simply implement in FUSE?” Since they only focus on I/O bandwidth, I am not sure if they will achieve this goal for broader benchmarks.

Windows Filesystems: File Object Relationships

A complex intersection (Shanghai)

One of the challenges of developing file systems in Windows is related to the complex relationships that exist between various data structures in the operating system that are part of the file systems domain. In this post I want to discuss one aspect of this complex relationship because it leads to behavior that makes sense when you understand it, but leads to very counter-intuitive behavior if you do not.

Each time a user opens a file the I/O Manager creates a new file object (I described this in Create previously.) A Windows File system is responsible for initializing specific fields of the File Object including the SectionObjectPointers field. This is a peculiar field – the file system allocates the space, but does not set the fields within it. There is a many-to-one relationship between File Objects and this field. The information here is used by two other parts of the operating system: the Memory Manager and the Cache Manager.

The Memory Manager is responsible for managing virtual memory. Each process has its own unique address space. An address space defines what code executing within that process will see. Code running within the address space works using virtual addresses. The CPU and operating system work together to use and manage this data. The details of how this labor is divided are a function of the CPU architecture, and they do vary across different architectures. The operating system must be modified to support new CPU architectures and this is one of the areas that can require additional programming effort.

There are quite a few distinct parts to the virtual memory system. For example, the CPU typically contains a special cache of recently used virtual to physical addresses (a translation lookaside buffer or TLB). When a virtual address needs to be interpreted by the processor it first looks in the TLB to see if it knows the correct mapping for the given virtual address. If it does, it uses that without accessing memory. This is quite important because when the processor does not find the entry in the TLB (a TLB “miss”) it must then convert the virtual to physical address using the data in memory. Memory, while fast, is much slower than the TLB. Modern TLBs can be accessed within about 0.5 nanoseconds (ns) while accessing memory can often require 75 ns. Plus, converting a virtual to physical address can require access to multiple physical memory pages, which further drives up the time required.
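The numbers above make for a quick back-of-the-envelope calculation. The sketch below uses the 0.5 ns and 75 ns figures from the text; the default of four memory accesses per page table walk is my assumption (it matches conventional four-level paging on x86-64), and the model ignores the caches that real processors use to shorten page walks.

```python
TLB_NS = 0.5      # TLB lookup time, from the text
MEMORY_NS = 75.0  # one memory access, from the text


def translation_cost_ns(hit_rate: float, walk_levels: int = 4) -> float:
    """Average cost of translating one virtual address.

    A hit costs one TLB lookup. A miss costs the failed lookup plus one
    memory access per page table level (walk_levels is an assumption).
    """
    miss_cost = TLB_NS + walk_levels * MEMORY_NS
    return hit_rate * TLB_NS + (1.0 - hit_rate) * miss_cost
```

Under these assumptions a miss costs 300.5 ns, so even a 99% hit rate raises the average translation cost from 0.5 ns to about 3.5 ns, a sevenfold increase from one miss in a hundred.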

One reason for having a virtual memory system is that the operating system can set up a range of virtual addresses without assigning physical memory to those addresses. As necessary, the hardware will invoke the operating system to allocate physical memory and then assign it to the relevant virtual memory location. This process is typically called a page fault because memory is managed in small units called pages. In order to fill in the newly allocated physical page properly, the virtual memory system needs to keep track, for every virtual address, where its corresponding data is presently stored. This is why File Systems are closely involved: they manage data. Thus, there is a symbiotic relationship between the Windows Memory Manager and the File System.
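Demand paging as described above can be modeled in a few lines. This is a toy, not how any operating system is structured: the `AddressSpace` class is hypothetical, a `bytes` object stands in for the file that backs the address range, and the 4096-byte page size is an assumption for the example.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size for the example


class AddressSpace:
    """Virtual range backed by data that is only loaded when first touched."""

    def __init__(self, backing: bytes) -> None:
        self._backing = backing  # where the data presently lives
        self._resident: Dict[int, bytearray] = {}  # page number -> physical page
        self.faults = 0

    def read(self, address: int) -> int:
        page, offset = divmod(address, PAGE_SIZE)
        if page not in self._resident:
            self._page_fault(page)
        return self._resident[page][offset]

    def _page_fault(self, page: int) -> None:
        """Allocate a page and fill it from the backing store."""
        self.faults += 1
        start = page * PAGE_SIZE
        data = self._backing[start:start + PAGE_SIZE]
        self._resident[page] = bytearray(data.ljust(PAGE_SIZE, b"\0"))
```

The `_backing` field is the part that ties the virtual memory system to file systems: to fill a page, something has to know where its contents are stored.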

File Systems in Windows also offer a non-block oriented interface for storing and retrieving content: a read/write interface. Such byte oriented interfaces are a mismatch with the block oriented interfaces of most storage devices and the File System converts between the two of them. In Windows, file systems typically do so by memory mapping a region of the file into memory. Windows provides another component – the Cache Manager – to assist in managing these mapped regions. Thus, we have a tight collaboration between the three components: Memory Manager, Cache Manager, and File System. Each has a distinct role to play, but relies upon the other to accomplish its task.

The following diagram attempts to capture the interesting subset of relationships that I am describing here:

Object Relationships in Windows File Systems

Let’s start by describing the individual components:

  • File Handle – this is an abstract value that Windows gives to an application to represent a file. Any time we use a file handle in the kernel, it must be validated since we do not trust applications not to have bugs or nefarious intent.
  • Section Handle – this is similar to the file handle, in that it is an abstract value that Windows gives to an application. In this case it represents a section, which is used to permit memory mapping of some or all of a file into the address space of an application.
  • File Object – this is the internal Windows kernel data structure that is used to represent an open file; we use file in its broadest sense here because it could be a file, a directory, a device interface, a communications endpoint, or pretty much anything else that “behaves like a file”.
  • Section Object – this is the internal Windows kernel data structure that is used to represent anything that can be mapped into memory. Note that this is distinct from the actual mapping. There are two kinds of section objects with which a File System is concerned: the image section object and the data section object. One key difference here is that something mapped using an image section object will be copy-on-write, so changes are not written back to disk, while something mapped using a data section object will permit reading and writing and changes to it are normally written back to storage. Both support the concept of sharing memory – in which two or more processes using the same section will also use the same physical memory.
  • Shared Cache Map – this is a control structure used by the Windows Cache Manager to track regions of files mapped into memory for use by the file systems to handle read and write calls (though file systems may choose to not use the cache). This structure belongs to the Cache Manager and shouldn’t be used by any other component (though it is useful to us, as file systems developers, to look at it in the kernel debugger).
  • Section Object Pointers – this structure is allocated by the file system from non-pageable memory. It is large enough to store three pointer values: a pointer to the ImageSectionObject, a pointer to the DataSectionObject, and a pointer to the SharedCacheMap. These map to, as you might expect, section objects and the shared cache map.

With this background, I can now describe my diagram. When a File System is initializing a new File Object, one of its responsibilities is to set up storage for the Section Object Pointers structure. Typically, this is the same for each new open instance of the same file. One reason to set them up to be different is if you want to construct a separate view of a given file. For example, I have used distinct views in the past to manage encrypted files: one view is the clear view of the file, while another view is the encrypted view of the file. Similarly, I could do this if I were supporting file versioning.

If an application opens a file that has not been previously opened, the section object pointers value will point to cleared memory (all values are set to zero). If the File System then invokes the Cache Manager to set up caching (CcInitializeCacheMap), upon return, the DataSectionObject and SharedCacheMap fields will have been initialized.

It turns out that the Cache Manager needs to have a File Object to implement some of its functionality. Given that the only File Object it has to use is the one that is passed into the cache map initialization function, it bumps the reference count on this object and stores a pointer to it.

It is rather common for a File System to defer initializing the cache until the first read, since many files are opened but never used for read or write. If an application memory maps a file, it does this in two steps: first, it opens the file, then it opens the section.

Since the Memory Manager “owns” sections, the call to open a section is sent to it, along with the file handle for the file to be mapped. The Memory Manager converts the file handle to a File Object. Then it looks inside the section object pointers value to see if there is an existing DataSectionObject value. If there is, it uses that section object to complete the call. There is no interaction with the file system required. If there is not an existing DataSectionObject, it will create a new one. This turns out to require some level of interaction with the File System because the Memory Manager needs to know how big the file is – after all, the size of the section and file need to be the same. Once the new section object is created, a pointer to it is stored in the SectionObjectPointers block associated with the file. Since the Memory Manager may need to retrieve data from the given file, it maintains a reference to the File Object.

At this point it is worth noting that if a file is memory mapped there is no shared cache map. Recall that when the File System calls the Cache Manager it passes a File Object, which the Cache Manager will use.

Thus, if a file is first memory mapped and then subsequently opened again and used for cached read/write operations, we find a situation where the Memory Manager is using one File Object, and the Cache Manager a second File Object. If we do this in the reverse order, the Cache Manager and Memory Manager use the same File Object for their references because the Cache Manager creates a section object using the same File Object that was passed to it by the File System.
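The ordering effect described above is easier to see in a small model. The sketch below is a deliberately simplified simulation in Python, not Windows code: the two functions are hypothetical stand-ins for the Memory Manager creating a data section and for what happens inside CcInitializeCacheMap, and the classes only keep the fields needed to show which File Object each component ends up referencing.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(eq=False)
class SectionObjectPointers:
    """Shared by every File Object that represents the same file."""
    DataSectionObject: Optional["Section"] = None
    SharedCacheMap: Optional["CacheMap"] = None
    ImageSectionObject: Optional["Section"] = None


@dataclass(eq=False)
class FileObject:
    name: str
    sop: SectionObjectPointers
    references: int = 1


@dataclass(eq=False)
class Section:
    file_object: FileObject  # the File Object the Memory Manager holds on to


@dataclass(eq=False)
class CacheMap:
    file_object: FileObject  # the File Object the Cache Manager holds on to


def mm_create_data_section(fo: FileObject) -> Section:
    """Model: reuse an existing data section, else create one referencing fo."""
    if fo.sop.DataSectionObject is None:
        fo.references += 1
        fo.sop.DataSectionObject = Section(fo)
    return fo.sop.DataSectionObject


def cc_initialize_cache_map(fo: FileObject) -> CacheMap:
    """Model: set up caching with fo, creating a section with the same fo."""
    if fo.sop.SharedCacheMap is None:
        fo.references += 1
        fo.sop.SharedCacheMap = CacheMap(fo)
        mm_create_data_section(fo)
    return fo.sop.SharedCacheMap


def scenario(map_first: bool) -> Tuple[str, str]:
    """Return (section's File Object, cache map's File Object) by name."""
    sop = SectionObjectPointers()
    first = FileObject("first open", sop)
    second = FileObject("second open", sop)
    if map_first:
        mm_create_data_section(first)
        cc_initialize_cache_map(second)
    else:
        cc_initialize_cache_map(first)
        mm_create_data_section(second)
    return (sop.DataSectionObject.file_object.name,
            sop.SharedCacheMap.file_object.name)
```

Calling `scenario(True)` yields two different File Objects, while `scenario(False)` yields the same one twice, matching the two orderings in the paragraph above.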

So, when might we have a third distinct File Object? When the file is an executable image. Suppose you copy an executable program. The copy program opens the target file for data access; it then writes the contents into the file. Unless this is done without memory mapping or caching, there is now a Data Section Object backed by this file (and thus a File Object). Next you decide to execute it: this time when the file is loaded for execution the Memory Manager will need an Image Section Object. If one does not exist, it creates one. This is one way you end up with two section objects for a single file. If you combine that with the memory mapping versus read/write usage described earlier, you can end up with three distinct File Objects for a single file that is only in use by one application.

But for a File System there are some more interesting effects as a result of this shuffling around:

  • A paging write operation may be performed against a File Object that was opened for read. This is because the Memory Manager uses the File Object stored in the section object; it does not know how the original file was opened. A subsequent open of the file for write continues to use the File Object that was opened for read and passed to the Memory Manager when the section was created. I will note there is now an API for switching this file object out, as this behavior can be challenging for some file systems to handle.
  • Paging I/O Request Packets (IRPs) can be sent to the file system with any of the three File Objects; it will depend upon the reason the paging request is being sent.
  • The lifetime of these three File Objects is now decoupled from the user file handle. What this means is that when the user application closes the file, the File Object’s reference count will not drop to zero. So the IRP_MJ_CLEANUP will not be immediately followed by an IRP_MJ_CLOSE. It also leads to a peculiar situation where the IRP_MJ_CLOSE is sent inside the cleanup handler, but I’ll talk about that another time.
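The decoupling in the last bullet comes down to two separate counts: cleanup follows the last handle being closed, while close follows the last reference being dropped. The toy model below illustrates this with a hypothetical `TrackedFileObject` class; it records the names of the IRPs the file system would see rather than modelling the I/O Manager itself.

```python
from typing import List


class TrackedFileObject:
    """Toy File Object with separate handle and reference counts."""

    def __init__(self) -> None:
        self.handle_count = 1      # the application's open handle
        self.reference_count = 1   # the reference held on behalf of that handle
        self.events: List[str] = []

    def reference(self) -> None:
        """Another component (e.g. Memory Manager) keeps the object alive."""
        self.reference_count += 1

    def dereference(self) -> None:
        self.reference_count -= 1
        if self.reference_count == 0:
            self.events.append("IRP_MJ_CLOSE")

    def close_handle(self) -> None:
        self.handle_count -= 1
        if self.handle_count == 0:
            self.events.append("IRP_MJ_CLEANUP")
        self.dereference()  # drop the reference that belonged to the handle
```

With no outside references, closing the handle produces cleanup and close back to back; once a section or cache map holds a reference, the close arrives only when that component lets go.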

These relationships, while complex, make sense in the context of the interactions between these components. Hopefully this basic description will help those writing Windows File Systems better understand the interaction patterns.

To FUSE or Not to FUSE: Performance of User-Space File Systems

To FUSE or Not to FUSE: Performance of User-Space File Systems
Bharath Kumar Reddy Vangoor, Vasily Tarasov, and Erez Zadok,
in The 15th USENIX Conference on File and Storage Technologies (FAST ’17),
February 27 – March 2, 2017, Santa Clara, CA, USA.

Previously, I discussed some of the rationale behind FUSE and gave a basic introduction to why we use it. This paper is a detailed analysis of the performance of FUSE. I have spent a fair bit of time reading – and re-reading – this paper, in order to understand what some of the bottlenecks are within FUSE itself. One area I have previously explored is possible ways of improving its performance; indeed, this is one of my current projects.

Figure 1 from original Vangoor Paper

Figure 1 in the paper is actually a basic block diagram providing an interface model for how FUSE fits into the file systems layer. FUSE consists of:

  • The kernel mode FUSE driver; on Linux this is part of the kernel.
  • The user mode FUSE library. This handles interactions between the FUSE file system driver and the user mode file system process (referred to as a “daemon” or “background process” in the diagram).
  • The user mode FUSE file system itself – this is the code that implements the FUSE library interface (one of them, since there are two different interfaces).

Applications can then access this FUSE file system without knowing any details of the implementation.

FUSE kernel driver queuing model.

In Figure 2 from the paper, the authors show the internal structure of how the FUSE kernel driver manages various internal data structures that handle events between the user mode file system and the kernel mode file system support. This includes the messages between the kernel and user mode application (requests and their corresponding replies) as well as synchronous and asynchronous file system operations and the cache invalidation mechanisms (“forgets”).

Caching is an essential part of the system because it allows the system to respond quickly to repetitive events. The downside to caches is that they can become invalid because the underlying state changes. Thus, there is a mechanism for invalidating the cache itself.

The authors describe the model within the FUSE library and the two interfaces it exposes: the high-level API, in which the file system works with path names, and the low-level API, which provides greater control to the user mode file system at the cost of more state management – specifically the mapping of a path name to the inode (index node).
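To make that extra state management concrete, here is a minimal sketch of the bookkeeping a low-level file system ends up doing. It is a toy model, not libfuse code: the class InodeTable and its method names are hypothetical, and it never checks whether a name actually exists in an underlying file system. It assumes the libfuse convention that each successful lookup reply adds one to a per-inode lookup count, and that a forget request subtracts the count the kernel passes in.

```python
class InodeTable:
    """Toy path <-> inode map with lookup counts (hypothetical, not libfuse)."""

    ROOT_INO = 1  # FUSE reserves inode number 1 for the root directory

    def __init__(self):
        self._next_ino = self.ROOT_INO + 1
        self._ino_by_path = {"/": self.ROOT_INO}
        self._path_by_ino = {self.ROOT_INO: "/"}
        self._nlookup = {}  # the root is never forgotten, so it is not counted

    def lookup(self, parent_ino, name):
        """Resolve a name under a parent inode; each success bumps the count."""
        parent = self._path_by_ino[parent_ino]
        path = parent.rstrip("/") + "/" + name
        ino = self._ino_by_path.get(path)
        if ino is None:
            # First time we see this path: hand out a fresh inode number.
            ino = self._next_ino
            self._next_ino += 1
            self._ino_by_path[path] = ino
            self._path_by_ino[ino] = path
        self._nlookup[ino] = self._nlookup.get(ino, 0) + 1
        return ino

    def forget(self, ino, nlookup):
        """Drop nlookup references; free the mapping once none remain."""
        remaining = self._nlookup[ino] - nlookup
        if remaining > 0:
            self._nlookup[ino] = remaining
            return
        del self._nlookup[ino]
        path = self._path_by_ino.pop(ino)
        del self._ino_by_path[path]

    def path_of(self, ino):
        """Return the path for an inode number, or None if it was forgotten."""
        return self._path_by_ino.get(ino)
```

With the high-level API the library keeps a table like this on the file system's behalf, which is why that interface is simpler to write against but gives the file system less control.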

The authors describe how they constructed a new FUSE file system, StackFS, for evaluating the behavior of the system. StackFS sits on top of an existing file system and attempts to minimize the amount of mapping it performs, since the goal of the authors is to evaluate the performance of FUSE itself.

Table 3 from the paper

The authors summarize their findings in Table 3, using both a hard disk and an SSD, the two most common types of storage media used on modern computer systems.

These results were both surprising and interesting to me. The performance bottlenecks were not for I/O as much as they were for meta-data operations. Random write performance (which is what we see with databases, for example) is not ideal, but their optimizations did a good job of addressing this, bringing the I/O overhead of the FUSE model down substantially relative to the native file system.

Bottom line: the challenge in improving FUSE performance now moves squarely into the arena of meta-data operations. Creating and deleting files is quite expensive in FUSE.

The authors conclude by pointing out that there is further room for improvement; they suggest some potential future directions. I have been looking at ways to improve this as well and I will discuss those in a future post.

FUSE: File Systems in User Space

File systems are notoriously difficult to implement: of all the pieces that appear in an operating system, they have the highest quality bar and are often called upon more than almost any other part of the operating system; virtual memory management may be called upon more.

Of course, the fact that modern operating systems tend to make the boundaries of file systems and virtual memory a bit fuzzy doesn’t really diminish their difficulty.

So, what makes building a file system challenging?

  • Persistence – file systems do things “for keeps”. If you build an application program, you can quit when things go bad. The user can just restart from scratch. A crashing application might leave its own files damaged and require recovery. When a file system gets it wrong (particularly a physical media file system) it can wipe out all the files.
  • Multi-threading – modern operating systems are heavily multi-threaded and often performing I/O. Frequently, that I/O is going through the file system. A multi-threaded application needs to worry about its own data. A file system needs to worry about its own data that’s being accessed by numerous threads across numerous applications.
  • Security – modern operating systems are written with the idea of multi-tenancy in mind. We can use lots of different isolation techniques to help mitigate some of these problems but file systems are fundamentally a layer that revolves around sharing data. I could (and have and probably will again) argue that sharing data might not always be what we want to do. Remember the CAP file system? One of its interesting features was that it did attempt to hide data and only expose it via capabilities.
  • Performance – file systems are extremely performance sensitive. Hard disks are slow, with high latencies and modest bandwidth. Of course, SSDs have fixed some of that, with lower latencies and higher bandwidth. They do introduce their own issues that file systems have had to adapt to handle.
  • Features – each feature you add in a file system tends to create an interference pattern with other features. For example, NTFS implements a file caching scheme for applications called oplocks (opportunistic locks). It also implements byte range locks. The two were not compatible. Then a new set of oplocks were added and they became compatible. The old oplocks are supported, the new oplocks are supported. The good news is that very few applications use byte range locks. It’s not the only example, it’s just one example.

In my experience, the file systems development cycle starts out with a bold new design: we’re going to fix all of the ills of prior file systems and/or implement bold, new features. We’ll be faster and more capable. We’ve studied the work that’s already been done and we know that we can do it better. Then you build your bold new design and begin to construct your file system. Eventually, you get to a point where it starts to work. Each new feature you add has side-effects that ripple through the code base. You begin to evaluate your file system and you realize it is slow. So you go through some cycles of iterative tuning. These improve performance at the cost of complexity.

You find out about failure cases you hadn’t really understood before; maybe you read about them. You experience failures that you’d never heard of before – only to realize when you’re reading some other paper that they’re describing the same problem. That’s happened to me – the other paper said “we had these weird deadlocks, so we just increased the number of buffers we used and it went away.” We actually built a robust reservation scheme to ensure we’d never hit that deadlock.

Deadlocks in file systems are just part of life. You wanted performance so you added fine-grained data structure locking. Then you realized you had special cases where you had re-entrant calls. You find that you can have a thread moving file a to b at the same time that some other thread is moving file b to a. How do you lock that properly? In the past, I’ve built entire mechanisms for tracking and enforcing lock hierarchies, dealing with re-entrant calls, and ensuring we don’t deadlock.
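One common answer to the “a to b while b to a” problem is a lock hierarchy: every thread acquires the locks it needs in the same global order, so two threads can never each hold the lock the other is waiting for. The sketch below is a toy illustration of that idea and not the mechanism I built; the Directory class, its order field, and the rename function are hypothetical names, and it models the two names as entries moving between two directories.

```python
import threading


class Directory:
    """Toy directory: a name table guarded by its own lock (hypothetical)."""

    _next_order = 0
    _order_lock = threading.Lock()

    def __init__(self, name):
        self.name = name
        self.entries = {}
        self.lock = threading.Lock()
        # Give every directory a unique rank; locks are always taken by rank.
        with Directory._order_lock:
            self.order = Directory._next_order
            Directory._next_order += 1


def rename(src_dir, src_name, dst_dir, dst_name):
    """Move an entry, always taking directory locks in ascending rank order."""
    if src_dir is dst_dir:
        locks = [src_dir.lock]
    else:
        first, second = sorted((src_dir, dst_dir), key=lambda d: d.order)
        locks = [first.lock, second.lock]
    for lock in locks:
        lock.acquire()
    try:
        dst_dir.entries[dst_name] = src_dir.entries.pop(src_name)
    finally:
        for lock in reversed(locks):
            lock.release()
```

If rename instead locked the source first and the destination second, two threads moving entries in opposite directions could each take one lock and wait forever for the other. Sorting by rank removes that cycle, though it does nothing for re-entrant calls, which need their own handling.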

So you performance tune, find and fix bugs, increase your parallelism, and continue your relentless march to victory. You learn about reference counting bugs (the ABA problem, for example, which I’ve seen in practice when trying to decrement and delete the reference count). You create interesting solutions that allow parallelism in all but the cases where it really matters.

You get old and grey in the process. You learn how to look at a damaged meta-data structure on disk and in your head theorize how that might happen, then you go look at the code and see if you can find that path.

All you wanted to do was build a file system. Moving a file system into user space is a very micro-kernel like thing to do; I’ve worked on micro-kernels where the file system was in user space. If it crashes, it doesn’t bring the machine down. The only threads it has to really worry about are those it creates. Maybe it isn’t as fast. Maybe it isn’t as challenging to build.

File Systems in User Space (FUSE) is a framework in which a kernel component interacts with an application program – the user-mode file system – and presents it to applications so that it looks much like a file system.

FUSE doesn’t fix all the challenges of building file systems, but it does address some of them. Security is addressed by restricting the file system process and the applications to belonging to the same security entity (“user”). The debugging tools available in user space are often easier to use, and testing in user space is simpler due to the availability of test harnesses. Kernel file systems can be run in user space for testing purposes as well (I’ve done it before), since most of the file system logic isn’t tied to any specific operating mode.

FUSE is a wildly popular interface. The last time I looked on Github.com, FUSE file systems numbered in the hundreds. At one point I was working on cataloguing them, more out of curiosity than anything else. Indeed, that might make a good write-up for some future post.

Applications transparently use a FUSE file system because the file system supports the standard file systems interfaces. To the applications, it really does just look like another file system, mounted somewhere in the name space of the operating system itself, such as mounted on a directory in Linux or UNIX. Windows uses mount points and symbolic links to make file systems visible to applications. In either case, applications are oblivious to the fact that these file systems are implemented in user space.

The biggest advantage of this model is that building a new file system via FUSE is simpler. It isn’t going to win any performance records, it won’t be bootable, and its usage model is much more limited than a “general purpose” file system, but often that’s all that is required. For example, there are multiple implementations of a file system on top of Amazon’s Simple Storage Service (S3) – mostly because it provides a simple-to-use interface that works with existing tools. It certainly is not a performance-oriented approach, but often performance is not actually that important for a specialized service.

I will discuss some of the issues with FUSE, and some potential solutions, in a future post.

https://github.com/libfuse/libfuse
https://github.com/osxfuse
https://github.com/billziss-gh/winfsp

File System Driver: Create

The road ahead.

The usual place to start when building a file system is to think about the Create operation. This may also be referred to as an open operation, but that conflates the object with the handle.

I think of Create as being “create a handle to the object”. The creation of the object itself can be a side effect of creating the handle; it does not make much sense to create a handle to something that doesn’t exist.

POSIX actually does have a create system call (spelled creat, without the trailing “e”, for obscure historical reasons). Now you can call open and specify O_CREAT and it will perform the same operation. In Windows the native call is NtCreateFile, though there is also an NtOpenFile; both ultimately invoke the same internal kernel operation (IoCreateFile or its successor IoCreateFileEx – the Ex thing is one way of saying “oh, we need to pass in more parameters, so sorry”). I’ll just talk about IoCreateFile since it is shorter, and if you need the other version for some peculiar reason you can figure it out. Remember, it’s an operating system internal call, so it only matters to people writing software to execute in the kernel.
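The “handle first, object as a side effect” view is easy to see from user mode. The sketch below uses Python’s os.open, which is a thin wrapper over the POSIX open call; the helper name create_handle is hypothetical. For reference, POSIX defines creat(path, mode) as equivalent to open(path, O_WRONLY | O_CREAT | O_TRUNC, mode).

```python
import os
import tempfile


def create_handle(path, create_if_missing):
    """Return a descriptor for path, creating the file only if asked to."""
    flags = os.O_WRONLY
    if create_if_missing:
        # Creating the object is a side effect of creating the handle.
        flags |= os.O_CREAT
    return os.open(path, flags, 0o644)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "example.txt")
        try:
            create_handle(path, create_if_missing=False)
        except FileNotFoundError:
            print("no object exists, so no handle can be created")
        fd = create_handle(path, create_if_missing=True)
        os.close(fd)
        print("file exists after create:", os.path.exists(path))
```

The same call either fails or succeeds depending only on whether the caller allowed the object to be created, which is the distinction between the object and the handle.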

Inside Windows, IoCreateFile is actually a rather large, complicated function: while the concept of creating a new file object is simple, the details are complicated because, as much as we like to pretend files are nothing more than byte streams, there are so many special cases and exceptions that our illusion is thin, at best. UNIX suffers from this as well, with symbolic links and special files. So, files are byte streams, except when they aren’t.

Since I’m writing this article as a description of how file systems on Windows work, I’ll stick with talking about Windows behavior and leave talking about other operating systems behavior to another time.

Windows Flow of Control for Create
Windows Flow of Control for Create (Simplified)

The animation shows the basic flow of a request through the system:

  1. An application opens (or creates) a file. Depending upon the implementation for the subsystem, it will call through the subsystem itself; often this is just implemented inside a shared library (dynamic link library or DLL).
  2. Since this requires a system call, it will invoke the relevant system call interface inside ntdll.dll, which is mapped into every process in the system. It will format the request as appropriate for the platform and then issue a system call (syscall/sysenter on Intel platforms, or swi on ARM platforms).
  3. The Windows system call dispatcher will forward this to the I/O Manager, since it handles files.
  4. The I/O Manager is presented with a name but it does not yet know which device and driver will handle this specific request. Thus, the I/O Manager will ask the Object Manager (ObLookupObject) to parse the name until it finds a relevant device. Assuming the name does reference a specific device, the Object Manager will then invoke the I/O Manager, because DeviceObjects (and FileObjects) point to a function registered by the I/O Manager with the Object Manager. For example, for a device this will invoke IopParseDevice. The I/O Manager now has a DeviceObject and the balance of the name, so it can allocate an I/O Request Packet (IRP) and set it up. In the case of a physical file system, however, the I/O Manager must first reference the volume parameter block (VPB) and retrieve the relevant file system’s device object from it. It asks the Object Manager to create a new FileObject. The I/O Manager will complete initialization of the FileObject and format the IRP with the relevant parameters. Create is a complex I/O request and the fields are scattered, unlike other I/O requests. Once formatted, the I/O Manager will invoke the file system driver (via IoCallDriver).
  5. The file system driver will receive the IRP. It will process the request, which may involve creating a new file as a side effect of creating the file object. It will parse the balance of the name. It may check security, sharing, allocate space for new files, validate options. It must handle symbolic links at this point as well (including reparse points) and format return information appropriately. It might attach ExtraCreateParameters to the I/O request. It could perform the operation within the context of a kernel transaction if the FileObject is part of a transaction. It may have special cases for volumes, directories, alternate data streams, or other file system specific behavior. While the Windows I/O Model allows any IRP to be processed asynchronously, the I/O Manager will block and wait for completion of the request. The file system may also need to perform additional operations for oplocks, which even have a case where the create is completed, even though the file is not yet usable.
  6. Once the file system is done processing the request – successfully or not – it will complete the request by calling IoCompleteRequest. The I/O Manager will unwind the I/O request stack (in case there are filters, which there almost always are now) and once done, it will copy results from the kernel to the user address space and return control to the system call dispatcher.
  7. The system call dispatcher will restore state, set the return code in the return register, and complete the system call.
  8. In user mode, ntdll will process the request, may raise exceptions if needed, and return to the application caller.
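The name handling in steps 4 and 5 can be sketched as a toy model: walk the name one component at a time until a device is found, then hand the remainder (the “balance of the name”) to that device’s parse routine. This is only an illustration in Python and not how the Object Manager is implemented; the Device class, the nested dictionary namespace, and parse_name are all hypothetical.

```python
class Device:
    """Stand-in for a device object; its parse routine gets the rest of the name."""

    def __init__(self, name):
        self.name = name

    def parse(self, remaining):
        # A real file system would receive this remainder in the create request
        # and resolve it against its own on-disk structures.
        return (self.name, remaining)


def parse_name(namespace, path):
    """Walk path through nested dicts until a Device takes over parsing."""
    node = namespace
    rest = path
    while rest.startswith("\\"):
        component, separator, tail = rest[1:].partition("\\")
        node = node[component]  # KeyError models "object name not found"
        rest = separator + tail
        if isinstance(node, Device):
            return node.parse(rest)
    raise ValueError("name does not resolve to a device")


if __name__ == "__main__":
    namespace = {"Device": {"HarddiskVolume1": Device("HarddiskVolume1")}}
    device, remainder = parse_name(
        namespace, r"\Device\HarddiskVolume1\Users\me\notes.txt"
    )
    print(device, remainder)
```

The point of the split is that the namespace walker knows nothing about directories inside the volume; everything after the device name belongs to the file system.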

For a Windows file system Create is often one of the most complex routines – I have seen file systems with almost 20% of their code in this path. There are numerous edge conditions: open by file ID, for example, as well as oplocks, not to mention creating new files, creating in-memory data structures and managing the many-to-one relationship between file objects and the file system control structures. Each open of a file creates a new FileObject. I will discuss why this becomes complicated in a future post, because it leads to unexpected behavior for the unwary.

What About POSIX?

Geriatric Care Sign
Has POSIX ceased to be relevant?

The POSIX specification was originally written to codify existing practice in UNIX. While it was called the PORTABLE Operating System Interface, it really set out to create that interface by documenting existing UNIX practice. While POSIX has certainly evolved, it has done so slowly.

But I’m not trying to dive into another polemic on the weaknesses of POSIX. That’s been done before.

Similarly, more tempered discussions of this subject have also been done before. For example, in USENIX ;login:, Vaggelis Atlidakis et al. explored this very issue in the article POSIX Has Become Outdated. The authors make some useful points:

  • High-level frameworks now drive the use of POSIX. Since portability was a strong, motivating factor, the fact that applications now work against a different set (or sets) of APIs dilutes this motivation.
  • POSIX is missing abstractions. This is observed because the use of ioctl is high and that usage is precisely to provide functionality that isn’t present in POSIX.
  • The new abstractions are not converging. In other words, the three operating systems they studied are introducing new abstractions but they aren’t coalescing to a common set of abstractions. Of the three key points the authors make, this one is likely the most damning since it undermines the original intent of POSIX, namely portability.

Note: the underlying paper was presented at EuroSys 2016 (POSIX Abstractions in Modern Operating Systems: The Old, the New, and the Missing).

I find it interesting that they didn’t even evaluate Windows, which I am sure would even further strengthen their findings of missing abstractions and divergence. While people might not realize it, Microsoft has had some level of POSIX compatibility throughout the lifetime of Windows (POSIX.1 specifically, which is the basic OS functionality level.) I often see people claim Windows isn’t “POSIX compliant” but I know that this isn’t strictly true. Back in 1992 when I first started looking at Windows NT, I was quite surprised at how closely the security model they had implemented mapped to what we had done in DCE/DFS for security as well – and our work was based on the draft POSIX security specification at the time.

Yet, I would also note that people continue to look at building new POSIX operating systems. For example, last year at OSDI, Cutler et al. presented a paper on building an operating system in the Go language: The benefits and costs of writing a POSIX kernel in a high-level language.

One reason I care about this is because my own work in considering alternative file systems structures, e.g., the idea of having rich name spaces, has led me into this exact area multiple times. The existing interfaces aren’t adequate to the task. FUSE, which is a popular tool for file systems prototyping, has added support for ioctl, which provides a popular generic mechanism for adding such functionality. But of course the need to use a generic extension mechanism just underscores the weaknesses of the interface.

Jeff Darcy wrote an insightful piece back in 2016 entitled Updating POSIX, in which he looked at file systems specific portions of POSIX, described weaknesses, and made insightful observations about how POSIX doesn’t match what we need these days:

  • Rename: I certainly understand this morass since it is something with which I’ve been wrestling for most of my career. When we were designing the Episode file system (1989/1990) one reason we ultimately agreed on using a transaction log was because there was no correct order of operations that would yield a consistent file system. Given that I’m the one who designed and implemented the transaction log, I’m glad we did because our decision to use a log and the way we ultimately implemented it was certainly a strength of the final product.
  • Fsync: This is how application programs ensure their data is committed to disk and thus presumably durable. Very few applications worry about durability; one reason is that the cost is so high for ensuring it. One thing I really liked about Jeff’s discussion here is his observation that POSIX focuses on consistency but not on durability. Of course, I just mentioned that durability was one of the reasons applications used this API. The cost to check file system consistency was a serious pain point in the 1980s. At Transarc, when a file system would crash it could take hours before a file server could restart because the entire file system was scanned to ensure it was consistent. Durability wasn’t so much a file systems concern.
  • Readdir: This is interesting to me because one of my pain points when I first started working with Windows NT file systems was their decision to shift directory enumeration filtering into the kernel. Yet, his perspective is a good one: applications often are only interested in looking for a subset of files. If you have a directory of 35,000 files and you are looking for just two of them, that’s a lot of overhead moving data between the kernel and the user application. So, perhaps the pain of filtering in the kernel is worth it. There are plenty of other issues with respect to readdir as well. There’s quite a bit of impedance mismatch between the POSIX approach and the Windows NT approach. For example, NTFS actually maintains duplicate copies of file meta-data (timestamps and sizes) between the directory and the file itself. When I taught file systems classes I would sometimes show people the impact of the implementation, where directory enumerations could provide stale results under certain situations. Of course, readdir is of interest to me because as I look at converting the file system to a graph I think quite a lot about what “readdir” looks like when it comes to such a system.
  • Chmod: In fact, Jeff’s discussion here is broader than just chmod, it really deals with the complexity of modern security. There’s no simple mechanism for dealing with multiple security domains – try implementing NFS on Windows some time and find out what a pain point it is. His argument of using capabilities is an interesting one, and resonates with some of my own observations in this area. For example, in a graph I can easily conceptualize of disjoint clusters of the graph, where nobody else can find the stuff I’ve created until I give them a magic cookie for it (a capability). There’s no concept of capabilities in POSIX file systems.
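The rename and fsync points come together in the pattern applications commonly use to replace a file’s contents safely: write a temporary file, fsync it, rename it over the old name, then fsync the containing directory. The sketch below is a minimal version of that pattern and not something taken from Jeff’s piece; the function name replace_durably is hypothetical, and what it actually guarantees still depends on the file system underneath.

```python
import os
import tempfile


def replace_durably(path, data):
    """Replace the contents of path with data using write, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory so that the rename
    # stays within one file system. Note that mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the new contents to stable storage
        os.replace(tmp_path, path)  # rename over the old name
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory so the rename itself is durable. Opening a directory
    # like this is a POSIX idiom, so it is skipped elsewhere.
    if os.name == "posix":
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

That an application needs four steps and two fsync calls to say “replace this file” is a fair illustration of the gap between what POSIX offers and what applications want.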

I can point to other aspects of the file systems APIs in POSIX that cause grief as well. For example, the fact that the file handle embeds a current offset is definitely a pain point. I see this every semester as students in CS 6200 struggle to add multi-threading support and then have to learn that the read and write calls are not thread safe. What amazes me is that this mistake was carried over into Windows NT: the CurrentByteOffset is an attribute of the kernel FILE_OBJECT. When I write a Windows file system, I’m responsible for updating this field correctly (where “correctly” itself is related to the particular operation being performed) even though I don’t use that field. I suspect that it is there because it was needed for POSIX compatibility.
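The embedded offset problem is easy to demonstrate from user mode. In the sketch below, the first helper does a seek followed by a read, two steps against the one offset shared by every thread using the descriptor, so another thread can move the offset in between. The second uses pread, which takes the offset as an argument and leaves the shared one alone. The helper names and the four-byte block size are hypothetical, and os.pread is only available on Unix-like systems.

```python
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4  # bytes per record in this toy layout


def read_block_shared_offset(fd, index):
    """Racy across threads: lseek and read both use the descriptor's one offset."""
    os.lseek(fd, index * BLOCK, os.SEEK_SET)
    return os.read(fd, BLOCK)


def read_block_positional(fd, index):
    """Safe across threads: pread does not touch the shared offset."""
    return os.pread(fd, BLOCK, index * BLOCK)


def read_all_parallel(path, count):
    """Read count blocks from one shared descriptor using several threads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(lambda i: read_block_positional(fd, i), range(count)))
    finally:
        os.close(fd)
```

Swapping read_block_shared_offset into read_all_parallel is the bug in question: it may appear to work for a long time and then return the wrong block under load.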

So my take-aways: supporting POSIX is useful because it preserves existing applications, but it is definitely in need of revision and/or rethinking. More ideas in that direction are best saved for a future conversation, though.