Home » File Systems (Page 2)

Category Archives: File Systems

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 204 other subscribers
April 2024
S M T W T F S
 123456
78910111213
14151617181920
21222324252627
282930  

Extension Framework for File Systems in User space

Extension Framework for File Systems in User space, Ashish Bijlani and Umakishore Ramachandran, USENIX Annual Technical Conference, 2019.

Useful Extensions

The idea of improving FUSE performance has become a common theme. This paper, which will be presented this week at USENIX ATC 2019 in Renton, WA, is one more to explore how we can improve FUSE performance.

One bit of feedback I received from the last FUSE performance paper I reviewed (last week) suggested that people do want to build file systems in user space for a variety of reasons, not the least of which is because they want to move that complexity out of the kernel environment. Thus, the argument is that the reason people build kernel file systems is because of performance. While I remain unconvinced that this is not the only impediment to a broader adoption of FUSE file systems, I will save that for a future discussion.

The approach the authors take this time does seem to try and bridge the gap: they’re proposal is to add kernel extensions that permit user mode file systems developers to add small modular components to the file system to optimize performance critical aspects. They address the increased security considerations inherent in allowing “kernel extensions” by sandboxing those extensions into an “in-kernel Virtual Machine (VM) runtime that safely executes the extensions”.

Their description of FUSE is quite a bit different than what I got from the FUSE performance paper at FAST 2018 – this paper describes FUSE as a “simple interposition layer”; the earlier description made it sound more complex than that. They do point out that FUSE file systems in production are becoming more common and point to Gluster, Ceph, and even Android’s SD card file system. For network file systems the overhead of FUSE is unlikely to have a material impact all but the most performance sensitive environments because the overhead of the network likely dominates. Similarly, SD card media is typically slow so once again the rate-limiting overhead is likely not the FUSE library and driver.

In addition to proposing an extension model, the authors also point out that there are a class of “unneeded” operations that are difficult to omit because the level of control offered by FUSE presently is not sufficiently fine grained enough; the authors propose enhancing FUSE to address these issues as well.

They set forth an interesting set of design considerations:

  • Compatibility – their observation is that the extension model must be something that works with existing file systems without requiring redesign or extensive coding.
  • Extensibility – the features offered by ExtFuse must allow adding specific features in a clean, minimalistic fashion, so that a FUSE file system developer can pick the specific features needed for their use case.
  • Safe and Performant – these are competing goals; the primary purpose of their work is to improve performance but they cannot do so at the expense of sacrificing security.
  • Correctness – they point out the challenge of having two operational paths (the “fast” path and the “slow” path, where the latter corresponds to the legacy path)
(Figure 1 from Paper)

The authors’ provide a graphical description of the architecture of their system in Figure 1 of the paper, which I have reproduced here. It shows the fact there are dual paths: the traditional FUSE path, as well as their accelerated path.

They move on to describe the extensions they implemented to demonstrate the range of functionality with their extension model:

  • Meta-data caching – the idea is that VFS itself cannot do effective caching due to the nature of its interface; the tighter interface between the extension and the user mode file system make this more practical.
  • I/O stacking – the concept here is that data may have multiple processing layers, such as logging, or union file systems. By permitting the extension to handle this, the overhead is minimized; indeed, this reminded me of the Scout Operating Systems work, which focuses on constructing optimized pipelines for such work.

Their evaluation focuses on a handful of critical operations: getattr, setattr, getxattr, and read/write. They looked at a mix of optimization models: the use of a smart attribute cache is clearly a win based upon their performance analysis. FUSE remains slower than a native file system in many scenarios however (e.g., they use EXT4 as a benchmark comparison) though the performance seems to be much closer than we’ve seen in prior work.

They also ported multiple different file systems to their extension library: StackFS, BindFS, Android’s sdcard file system, MergerFS, and LoggedFS. None of them required even 1,000 lines of new code for the kernel extensions. While the authors do discuss some of the observed performance improvements for those file systems, they do not provide us with general benchmark comparisons.

Overall, this is an interesting paper, which combines a number of ideas together into an intriguing package. It will be interesting to see if this gains traction in the FUSE community.

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support, Yue Zhu, Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, Muhib Khan, and Weikuan Yu, in Proceedings of the 8th International Workshop on Runtime Operating Systems for Supercomputers, page 6, 2018.

Modern Fuse, circuit breakers instead of actual fuses.
Modern Fuse

There are quite a few papers that discuss the performance of the FUSE model. I already discussed a recent paper that explored the performance of FUSE on Linux and that paper observed that I/O performance for FUSE is reasonably good due to the optimization work that has been done to minimize the data copy overhead that can occur with a naive implementation.

What I do find surprising is the emphasis on FUSE performance; this leads me to think that people look to user mode file systems as something viable for implementing production file systems. Of course, one motivation for this is that building a FUSE file system is generally simpler than implementing an in-kernel file system. Some of this is environmental – the kernel is a harsh development environment, in which the smallest bugs lead to the system crashing.

Of course, virtual machine technologies have done quite a lot to minimize this overhead, as the “machine” that crashes is now more like an application. If you are developing code for the UNIX, Linux, or Windows kernel you are likely to be developing using C, the most commonly used systems language these days. It is possible to bravely branch out and use other languages, but then you inherit other interesting restrictions and frequently find that you are developing the tools as much as you are developing the file system.

Thus, one benefit of the user space file systems model is that you can use other development tools – FUSE file system implementations us a much larger range of programming languages than is normally found in kernel file systems. The FUSE model also permits fairly rapid development of a prototypical file system.

Today’s paper touches on these traditional issues and points out that sometimes what you need isn’t a general-purpose file system but rather something that is specifically crafted to solve the problem at hand. For the HPC community, performance is an important driver for the specialized file systems of choice. The authors’ use an optimized library, libsysio, that provides a POSIX-like interface which intercepts I/O operations to a remote file system – in essence, a sort of automated mechanism for turning I/O calls into something reminiscent of RPC.

The emphasis of the authors is in eliminating the overhead of system calls. Their approach is certainly focused: this solution works for a single application that requires high performance operations.

They start off by evaluating the cost overhead of using FUSE. Because their emphasis is on I/O, that is what they evaluate. Thus, unlike the earlier FUSE analysis, which indicated that meta-data operations were the most significant bottleneck, this work concludes there is still substantial impact on I/O performance as well.

They take an existing library from Sandia Labs, libsysio. I found multiple different versions of this library available on the Internet and was interested to find that it has been integrated into other file systems, including Lustre, with which I have some familiarity from past work. The authors’ don’t discuss if their approach is better than using other HPC file systems, focusing on improving the performance of their specific use case.

One interesting design consideration for Direct-FUSE is they seek to support multiple FUSE file systems from a single application, using the same high performance communications approach. This is not usually an issue for applications with pure FUSE file systems because to the application the FUSE file system appears to be functionally equivalent to every other file system. This is, however, an issue that can arise when incorporating multiple I/O library based models into a single application; something they address in Direct-FUSE.

They describe their implementation model for supporting multiple distinct file systems, differentiating between file systems via a prefix matching model, and then forwarding name based requests as appropriate. File handle based operations work by using an indirection table for encapsulating the additional state needed to determine which file system should be used to satisfy requests against the particular file handle.

Much of the paper focuses on the evaluation of their solution. In keeping with their focus on raw I/O performance, the evaluation is all about bandwidth at various I/O sizes. Their results indicate that they are able to achieve performance that is comparable to similar native file systems (they use ext4 and tmpfs implementations for these benchmarks). Thus, they demonstrate that their approach has comparable performance to the native ext4 and tmpfs implementations.

They also compare their performance in the distributed file systems arena using FusionFS, an existing FUSE file system. They show comparable performance for read I/O bandwidth (including scalability to multiple nodes) as well as improved write I/O bandwidth.

They then evaluate the context switch difference between the two solutions (FUSE and Direct-FUSE) and observe that they have eliminated the context switch overhead.

Bottom line, they have found a way to improve performance over traditional FUSE file systems. They do not compare to other HPC oriented file system (e.g., Lustre) and thus it is difficult for me to tell if this is a viable contender for larger scale distributed file systems work. Nevertheless, they do point out the impact of the context switch costs inherent in the traditional FUSE model.

I am left asking myself “is the goal to make FUSE performance close enough to native kernel file systems that it makes sense to simply implement in FUSE?” Since they only focus on I/O bandwidth, I am not sure if they will achieve this goal for broader benchmarks.

Windows Filesystems: File Object Relationships

A complex intersection in Shanghai
A complex intersection (Shanghai)

One of the challenges of developing file systems in Windows is related to the complex relationships that exist between various data structures in the operating system that are part of the file systems domain. In this post I want to discuss one aspect of this complex relationship because it leads to behavior that makes sense when you understand it, but leads to very counter-intuitive behavior if you do not.

Each time a user opens a file the I/O Manager creates a new file object (I described this in Create previously.) A Windows File system is responsible for initializing specific fields of the File Object including the SectionObjectPointers field. This is a peculiar field – the file system allocates the space, but does not set the fields within it. There is a many-to-one relationship between File Objects and this field. The information here is used by two other parts of the operating system: the Memory Manager and the Cache Manager.

The Memory Manager is responsible for managing virtual memory. Each process has its own unique address space. An address space defines what code executing within that process will see. Code running within the address space works using virtual addresses. The CPU and operating system work together to use and manage this data. The details of how this labor is divided is a function of the CPU architecture and the details do vary across different architectures. The operating system must be modified to support new CPU architectures and this is one of the areas that can require additional programming effort.

There are quite a few distinct parts to the virtual memory system. For example, the CPU typically contains a special cache of recently used virtual to physical addresses (a translation lookaside buffer or TLB). When a virtual address needs to be interpreted by the processor it first looks in the TLB to see if it knows the correct mapping for the given virtual address. If it does, it uses that without accessing memory. This is quite important because when the processor does not find the entry in the TLB (a TLB “miss”) it must then convert the virtual to physical address using the data in memory. Memory, while fast, is much slower than the TLB. Modern TLBs can be accessed within about 0.5 nanoseconds (ns) while accessing memory can often require 75 ns. Plus, converting a virtual to physical address can require access to multiple physical memory pages, which further drives up the time required.

One reason for having a virtual memory system is that the operating system can set up a range of virtual addresses without assigning physical memory to those addresses. As necessary, the hardware will invoke the operating system to allocate physical memory and then assign it to the relevant virtual memory location. This process is typically called a page fault because memory is managed in small units called pages. In order to fill in the newly allocated physical page properly, the virtual memory system needs to keep track, for every virtual address, where its corresponding data is presently stored. This is why File Systems are closely involved: they manage data. Thus, there is a symbiotic relationship between the Windows Memory Manager and the File System.

File Systems in Windows also offer a non-block oriented interface for storing and retrieving content: a read/write interface. Such byte oriented interfaces are a mismatch with the block oriented interfaces of most storage devices and the File System converts between the two of them. In Windows, file systems typically do so by memory mapping a region of the file into memory. Windows provides another component – the Cache Manager – to assist in managing these mapped regions. Thus, we have a tight collaboration between the three components: Memory Manager, Cache Manager, and File System. Each has a distinct role to play, but relies upon the other to accomplish its task.

The following diagram attempts to capture the interesting subset of relationships that I am describing here:

Object Relationships in Windows File Systems
Object Relationships in Windows File Systems

Let’s start by describing the individual components:

  • File Handle – this is an abstract value that Windows gives to an application to represent a file. Any time we use a file handle in the kernel, it must be validated since we do not trust applications not to have bugs or nefarious intent.
  • Section Handle – this is similar to the file handle, in that it is an abstract value that Windows gives to an application. In this case it represents a section, which is used to permit memory mapping of some or all of a file into the address space of an application.
  • File Object – this is the internal Windows kernel data structure that is used to represent an open file; we use file in its broadest sense here because it could be a file, a directory, a device interface, a communications endpoint, or pretty much anything else that “behaves like a file”.
  • Section Object – this is the internal Windows kernel data structure that is used to represent anything that can be mapped into memory. Note that this is distinct from the actual mapping. There are two kinds of section objects with which a File System is concerned: the image section object and the data section object. One key difference here is that something mapped using an image section object will be copy-on-write, so changes are not written back to disk, while something mapped using a data section object will permit reading and writing and changes to it are normally written back to storage. Both support the concept of sharing memory – in which two or more processes using the same section will also use the same physical memory.
  • Shared Cache Map – this is a control structure used by the Windows Cache Manager to track regions of files mapped into memory for use by the file systems to handle read and write calls (though file systems may choose to not use the cache). This structure belongs to the Cache Manager and shouldn’t be used by any other component (though it is useful to us, as file systems developers, to look at it _in the kernel debugger).
  • Section Object Pointers – this structure is allocated by the file system from non-pageable memory. It is large enough to store three pointer values: a pointer to the ImageSectionObject, a pointer to the DataSectionObject, and a pointer to the SharedCacheMap. These map to, as you might expect, section objects and the shared cache map.

With this background, I can now describe my diagram. When a File System is initializing a new File Object, one of its responsibilities is to set up storage for the Section Object Pointers structure. Typically, this is the same for each new open instance of the same file. One reason to set them up to be different is if you want to construct a separate view of a given file. For example, I have used distinct views in the past to manage encrypted files: one view is the clear view of the file, while another view is the encrypted view of the file. Similarly, I could do this if I were supporting file versioning.

If an application opens a file that has not been previously opened, the section object pointers value will point to cleared memory (all values are set to zero). If the File System then invokes the Cache Manager to set up caching (CcInitializeCacheMap), upon return, the DataSectionObject and SharedCacheMap fields will have been initialized.

It turns out that the Cache Manager needs to have a File Object to implement some of its functionality. Given that the only File Object it has to use is the one that is passed into the cache map initialization function, it bumps the reference count on this object and stores a pointer to it.

It is rather common for a File System to defer initializing the cache until the first read, since many files are opened but never used for read or write. If an application memory maps a file, it does this in two steps: first, it opens the file, then it opens the section.

Since the Memory Manager “owns” sections, the call to open a section is sent to it, along with the file handle for the file to be mapped. The Memory Manager converts the file handle to a File Object. Then it looks inside the section object pointers value to see if there is an existing DataSectionObject value. If there is, it uses that section object to complete the call. There is no interaction with the file system required. If there is not an existing Data SectionObject, it will create a new one. This turns out to require some level of interaction with the File System because the Memory Manager needs to know how big the file is – after all, the size of the section and file need to be the same. Once the new section object is created, a pointer to it is stored in the SectionObjectPointers block associated with the file. Since the Memory Manager may need to retrieve data from the given file, it maintains a reference to the File Object.

At this point it is worth noting that if a file is memory mapped there is no shared cache map. Recall that when the File System calls the Cache Manager it passes a File Object, which the Cache Manager will use.

Thus, if a file is first memory mapped and then subsequently opened again and used for cached read/write operations, we find a situation where the Memory Manager is using one File Object, and the Cache Manager a second File Object. If we do this in the reverse order, the Cache Manager and Memory Manager use the same File Object for their references because the Cache Manager creates a section object using the same File Object that was passed to it by the File System.

So, when might we have a third distinct File Object? When the file is an executable image. Suppose you copy an executable program. The copy program opens the target file for data access; it then writes the contents into the file. Unless this is done without memory mapping or caching, there is now a Data Section Object backed by this file (and thus a File Object). Next you decide to execute it: this time when the file is loaded for execution the Memory Manager will need an Image Section Object. If one does not exist, it creates one. This is one way how you end up with two section objects for a single file. If you combine that with the memory mapping versus read/write usage described earlier, you can end up with three distinct File Objects for a single file that is only in use by one application.

But for a File System there are some more interesting effects as a result of this shuffling around:

  • A paging write operation may be performed against a File Object that was opened for read. This is because the Memory Manager uses the File Object stored in the section object; it does not know how the original file was opened. A subsequent open of the file for write continues to use the File Object that was opened for read and passed to the Memory Manager when the section was created. I will note there is now an API for switching this file object out, as this behavior can be challenging for some file systems to handle.
  • Paging I/O Request Packets (IRPs) can be sent to the file system with any of the three File Objects; it will depend upon the reason the paging request is being sent.
  • The lifetime of these three File Objects is now decoupled from the user file handle. What this means is that when the user application closes the file, the File Object’s reference count will not drop to zero. So the IRP_MJ_CLEANUP will not be immediately followed by an IRP_MJ_CLOSE. It also leads to a peculiar situation where the IRP_MJ_CLOSE is sent inside the cleanup handler, but I’ll talk about that another time.

These relationships, while complex, make sense in the context of the interactions between these components. Hopefully this basic description will help those writing Windows File Systems better understand the interaction patterns.

To FUSE or Not to FUSE: Performance of User-Space File Systems

To FUSE or Not to FUSE: Performance of User-Space File Systems
Bharath Kumar Reddy Vangoor, Vasily Tarasov, and Erez Zadok,
in The 15th USENIX Conference on File and Storage Technologies (FAST ’17),
February 27 – March 2, 2017, Santa Clara, CA, USA.

Previously, I discussed some of the rationale behind FUSE and a basic introduction to why we use it. This paper is actually a detailed analysis of the performance of FUSE. I have spent a fair bit of time reading – and re-reading – this paper, in order to understand what some of the bottlenecks are within FUSE itself. One area I have previously explored are possible ways of improving its performance; indeed, this is one of my current projects.

Figure 1 from original Vangoor Paper

Figure 1 in the paper is actually a basic block diagram providing an interface model for how FUSE fits into the file systems layer. FUSE consists of:

  • The kernel mode FUSE driver; on Linux this is part of the kernel; and
  • The user mode FUSE library. This handles interactions between the FUSE file system driver and the user mode file system process (referred to as a “daemon” or “background process”) in the diagram.
  • The user mode FUSE file system itself – this is the code that implements the FUSE library interface (one of them, since there are two different interfaces).

Applications can then access this FUSE file system without knowing any details of the implementation.

FUSE kernel driver queuing model.

In Figure 2 from the paper, the authors show the internal structure of how the FUSE kernel driver manages various internal data structures that handle events between the user mode file system and the kernel mode file system support. This includes the messages between the kernel and user mode application (requests and their corresponding replies) as well as synchronous and asynchronous file system operations and the cache invalidation mechanisms (“forgets”).

Caching is an essential part of the system because it allows the system to quickly respond to repetitive events. The downside to caches are that they can become invalid because the underlying state changes. Thus, there is a mechanism for invalidating the cache itself.

The author describes the model within the FUSE library and the two interfaces it exposes: the low level API, which provides greater control to the user mode file system, at the cost of more state management – specifically the mapping of a path name to the inode (index node).

The authors describe how they constructed a new FUSE file system, StackFS, for evaluating the behavior of the system. StackFS sits on top of an existing file system and attempts to minimize the amount of mapping it performs, since the goal of the authors is to evaluate the performance of FUSE itself.

Table 3 from the paper

The authors summarize their findings in Table 3, using both a hard disk and an SSD, the two most common types of storage media used on modern computer systems.

These results were both surprising and interesting to me because some of them surprised me. The performance bottlenecks were not for I/O as much as they were for meta-data operations. Random write performance (which is what we see with databases, for example) is not ideal, but their optimizations did a good job of addressing this, bringing the I/O overhead of the FUSE model down substantially relative to the native file system.

Bottom line: the challenge in improving FUSE performance now moves squarely into the arena of meta-data operations. Creating and deleting files is quite expensive in FUSE.

The authors conclude by pointing out that there is further room for improvement; they suggest some potential future directions. I have been looking at ways to improve this as well and I will discuss those in a future post.

FUSE: File Systems in User Space

File systems are notoriously difficult to implement: of all the pieces that appear in an operating system, they have the highest quality bar and are often called upon more than almost any other part of the operating system; virtual memory management may be called upon more.

Of course, the fact that modern operating systems tend to make the boundaries of file systems and virtual memory a bit fuzzy doesn’t really diminish their difficulty.

So, what makes building a file system challenging?

  • Persistence – file systems do things “for keeps”. If you build an application program, you can quit when things go bad. The user can just restart from scratch. A crashing application might leave its own files damaged and require recovery. When a file system gets it wrong (particularly a physical media file system) it can wipe out all the files.
  • Multi-threading – modern operating systems are heavily multi-threaded and often performing I/O. Frequently, that I/O is going through the file system. A multi-threaded application needs to worry about its own data. A file system needs to worry about its own data that’s being accessed by numerous threads across numerous applications.
  • Security – modern operating systems are written with the idea of multi-tenancy in mind. We can use lots of different isolation techniques to help mitigate some of these problems but file systems are fundamentally a layer that revolves around sharing data. I could (and have and probably will again) argue that sharing data might not always be what we want to do. Remember the CAP file system? One of its interesting features was that it did attempt to hide data and only expose it via capabilities.
  • Performance – file systems are extremely performance sensitive. Hard disks are slow, with high latencies and modest bandwidth. Of course, SSDs have fixed some of that, with lower latencies and higher bandwidth. They do introduce their own issues that file systems have had to adapt to handle.
  • Features – each feature you add in a file system tends to create an interference pattern with other features. For example, NTFS implements a file caching scheme for applications called oplocks (opportunistic locks). It also implements byte range locks. The two were not compatible. Then a new set of oplocks were added and they became compatible. The old oplocks are supported, the new oplocks are supported. The good news is that very few applications use byte range locks. It’s not the only example, it’s just one example.

In my experience, the file systems development cycle starts out with a bold new design: we’re going to fix all of the ills of prior file systems and/or implement bold, new features. We’ll be faster and more capable. We’ve studied the work that’s already been done and we know that we can do it better. Then you build your bold new design and begin to construct your file system. Eventually, you get to a point where it starts to work. Each new feature you add has side-effects that ripple through the code base. You begin to evaluate your file system and you realize it is slow. So you go through some cycles of iterative tuning. These introduce performance at the cost of complexity.

You find out about failure cases you hadn’t really understood before; maybe you read about them. You experience failures that you’d never heard of before – only to realize when you’re reading some other paper that they’re describing the same problem. That’s happened to me – the other paper said “we had these weird deadlocks, so we just increased the number of buffers we used and it went away.” We actually built a robust reservation scheme to ensure we’d never hit that deadlock.

Deadlocks in file systems are just part of life. You wanted performance so you added fine-grained data structure locking. Then you realized you had special cases where you had re-entrant calls. You find that you can have a thread moving file a to b at the same time that some other thread is moving file b to a. How do you lock that properly? In the past, I’ve built entire mechanisms for tracking and enforcing lock hierarchies, dealing with re-entrant calls, and ensuring we don’t deadlock.

So you performance tune, find and fix bugs, increase your parallelism, and continue your relentless march to victory. You learn about reference counting bugs (the ABA problem, for example, which I’ve seen in practice when trying to decrement and delete the reference count). You create interesting solutions that allow parallelism in all but the cases where it really matters.

You get old and grey in the process. You learn how to look at a damaged meta-data structure on disk and in your head theorize how that might happen, then you go look at the code and see if you can find that path.

All you wanted to do was build a file system. Moving a file system into user space is a very micro-kernel like thing to do; I’ve worked on micro-kernels where the file system was in user space. If it crashes, it doesn’t bring the machine down. The only threads it has to really worry about are those it creates. Maybe it isn’t as fast. Maybe it isn’t as challenging to build.

File Systems in User Space (FUSE) is a framework in which a kernel component interacts with an application program – the user-mode file system – and presents it to applications so that it looks much like a file system.

FUSE doesn’t fix all the challenges of building file systems, but it does address some of them. Security is addressed by restricting the file system process and the applications to belonging to the same security entity (“user”). The tools available are often easier for the debugging process; testing in user space is simpler due to the availability of test harnesses (kernel file systems can be run for testing purposes in user space as well, as I’ve done it before. Most of the file system logic isn’t tied to any specific operating mode.)

FUSE is a wildly popular interface. The last time I looked on Github.com the number of FUSE file systems numbered in the hundreds. At one point I was working on cataloguing them, more out of curiosity than anything else. Indeed, that might make a good write-up for some future post

Applications transparently use a FUSE file system because the file system supports the standard file systems interfaces. To the applications, it really does just look like another file system, mounted somewhere in the name space of the operating system itself, such as mounted on a directory in Linux or UNIX. Windows uses mount points and symbolic links to make file systems visible to applications. In either case, applications are oblivious to the fact that these file systems are implemented in user space.

The biggest advantage of this model is that building a new file system via FUSE is simpler. It isn’t going to win any performance records, it won’t be bootable, and its usage model is much more limited than a “general purpose” file system, but often that’s all that is required. For example, there are multiple implementations of a file system on top of Amazon’s Simple Storage Service (S3) – mostly because it provides a simple-to-use interface that works with existing tools. It certainly is not a performance-oriented approach, but often performance is not actually that important for a specialized service.

I will discuss some of the issues with fuse, and some potential solutions, in a future post.

https://github.com/libfuse/libfuse
https://github.com/osxfuse
https://github.com/billziss-gh/winfsp

File System Driver: Create

The road ahead.

The usual place to start when building a file system is to think about the Create operation. This may also be referred to as an open operation, but that conflates the object with the handle.

I think of Create as being “create a handle to the object”. The creation of the object itself can be a side effect of creating the handle; it does not make much sense to create a handle to something that doesn’t exist.

POSIX actually does have a create system call (and at one point it was creat for obscure historical reassons). Now you can call open and specify O_CREATE and it will perform the same operation. In Windows the native call is NtCreateFile, though there is also an NtOpenFile, but they both ultimately invoke the same internal kernel operation (IoCreateFile or its successor IoCreateFileEx – the Ex thing is one way of saying “oh, we need to pass in more parameters, so sorry.” I’ll just talk about IoCreateFile since it is shorter and if you need the other version for some peculiar reason you can figure it out. Remember, it’s an operating system internal call, so it only matters to people writing software to execute in the kernel.

Inside Windows, IoCreateFile is actually a rather large, complicated function: while the concept of creating a new file object is simple, the details are complicated because as much as we like to pretend files are nothing more than byte streams, there are so many special cases and exceptions to this that our illusion thin, at best. UNIX suffers from this as well, with symbolic links and special files. So, files are byte streams, except when they aren’t.

Since I’m writing this article as a description of how file systems on Windows work, I’ll stick with talking about Windows behavior and leave talking about other operating systems behavior to another time.

Windows Flow of Control for Create
Windows Flow of Control for Create (Simplified)

The animation shows the basic flow of a request through the system:

  1. An application opens (or creates) a file. Depending upon the implementation for the subsystem, it will call through the subsystem itself; often this is just implemented inside a shared library (dynamic link library or DLL).
  2. Since this requires a system call, it will invoke the relevant system call interface inside ntdll.dll, which is mapped into every process in the system. It will format the request as appropriate for the platform and then issue a system call (syscall/sysenter on Intel platforms, or swi on ARM platforms).
  3. The Windows system call dispatcher will forward this to the I/O Manager, since it handles files.
  4. The I/O Manager is presented with a name but it does not yet know which device and driver will handle this specific request. Thus, the I/O Manager will ask the Object Manager (ObLookupObject) to parse the name until it finds a relevant device. Assuming it does reference a specific device, the Object Manager will then invoke the I/O Manager, because DeviceObjects (and FileObjects) point to a function registered by the I/O Manager with the Object Manager. For example, this will invoke IopParseDevice. At this point the I/O Manager now has a DeviceObject and the balance of the name. At this point it can allocate an I/O Request Packet (IRP) and set it up. In the case of a physical file system, however, the I/O Manager must reference the volume parameter block (vpb). Thus, it retrieves the relevant file system’s device object from the vpb. It asks the Object Manager to create a new FileObject. The I/O Manager will complete initialization of the FileObject and format the IRP with the relevant parameters. Create is a complex I/O request and the fields are scattered, unlike other I/O requests. Once formatted, the I/O Manager will invoke the file system driver (via IoCallDriver).
  5. The file system driver will receive the IRP. It will process the request, which may involve creating a new file as a side effect of creating the file object. It will parse the balance of the name. It may check security, sharing, allocate space for new files, validate options. It must handle symbolic links at this point as well (including reparse points) and format return information appropriately. It might attach ExtraCreateParameters to the I/O request. It could perform the operation within the context of a kernel transaction if the FileObject is part of a transaction. It may have special cases for volumes, directories, alternate data streams, or other file system specific behavior. While the Windows I/O Model allows any IRP to be processed asynchronously, the I/O Manager will block and wait for completion of the request. The file system may also need to perform additional operations for oplocks, which even have a case where the create is completed, even though the file is not yet usable.
  6. Once the file system is done processing the request – successfully or not – it will complete the request by calling IoCompleteRequest. The I/O Manager will unwind the I/O request stack (in case there are filters, which there almost always are now) and once done, it will copy results from the kernel to the user address space and return control to the system call dispatcher.
  7. The system call dispatcher will restore state, set the return code in the return register, and complete the system call.
  8. In user mode, ntdll will process the request, may raise exceptions if needed, and return to the application caller.

For a Windows file system Create is often one of the most complex routines – I have seen file systems with almost 20% of their code in this path. There are numerous edge conditions: open by file ID, for example, as well as oplocks, not to mention creating new files, creating in-memory data structures and managing the many-to-one relationship between file objects and the file system control structures. Each open of a file creates a new FileObject. I will discuss why this becomes complicated in a future post, because it leads to unexpected behavior for the unwary.

What About POSIX?

Geriatric Care Sign
Has POSIX ceased to be relevant?

The POSIX specification was originally written to codify existing practice in UNIX. While it was called the PORTABLE Operating Systems Interface, it really set out to create that interface by documenting existing UNIX practice. While POSIX has certainly evolved, it has done so slowly.

But I’m not trying to dive into another polemic on the weaknesses of POSIX. That’s been done before.

Similarly, more tempered discussions of this subject have also been done before. For example, in USENIX login; Vaggelis Atlidakis et. al., explored this very issue in the article Posix Has Become Outdated. The authors make some useful points:

  • High-level frameworks now drive the use of POSIX. Since portability was a strong, motivating factor, the fact that applications now work against a different set (or sets) of APIs dilutes this motivation.
  • POSIX is missing abstractions. This is observed because the use of ioctl is high and that usage is precisely to provide functionality that isn’t present in POSIX.
  • The new abstractions are not converging. In other words, the three operating systems they studied are introducing new abstractions but they aren’t coalescing to a common set of abstractions. Of the three key points the authors’ make, this one is likely the most damning since it undermines the original intent of POSIX, namely portability.

Note: the underlying paper was presented at Eurosys 2013 (POSIX Abstractions in Modern OperatingSystems: The Old, the New, and the Missing).

I find it interesting that they didn’t even evaluate Windows, which I am sure would even further strengthen their findings of missing abstractions and divergence. While people might not realize it, Microsoft has had some level of POSIX compatibility throughout the lifetime of Windows (POSIX.1 specifically, which is the basic OS functionality level.) I often see people claim Windows isn’t “POSIX compliant” but I know that this isn’t strictly true. Back in 1992 when I first started looking at Windows NT, I was quite surprised at how closely the security model they had implemented mapped to what we had done in DCE/DFS for security as well – and our work was based on the draft POSIX security specification at the time.

Yet, I would also note that people continue to look at building new POSIX operating systems. For example, last year at OSDI, Cutler, et. al., presented a paper on building an operating system in the Go languages: The benefits and costs of writing a POSIX kernel in a high-level language.

One reason I care about this is because my own work in considering alternative file systems structures, e.g., the idea of having rich name spaces, has led me into this exact area multiple times. The existing interfaces aren’t adequate to the task. FUSE, which is a popular tool for file systems prototyping, has added support for ioctl, which provides a popular generic mechanism for adding such functionality. But of course the need to use a generic extension mechanism just underscores the weaknesses of the interface.

Jeff Darcy wrote an insightful piece back in 2016 entitled Updating POSIX, in which he looked at file systems specific portions of POSIX, described weaknesses, and made insightful observations about how POSIX doesn’t match what we need these days:

  • Rename: I certainly understand this morass since it is something with which I’ve been wrestling for most of my career. When we were designing the Episode file system (1989/1990) one reason we ultimately agreed on using a transaction log was because there was no correct order of operations that would yield a consistent file system. Given that I’m the one who designed and implemented the transaction log, I’m glad we did because our decision to use a log and the way we ultimately implemented it was certainly a strength of the final product.
  • Fsync: This is how application programs ensure their data is committed to disk and thus presumably durable. Very few applications worry about durability; one reason is that the cost is so high for ensuring it. One thing I really liked about Jeff’s discussion here is his observation that POSIX focuses on consistency but not on durability. Of course, I just mentioned that durability was one of the reasons applications used this API. The cost to check file system consistency was a serious pain point in the 1980s. At Transarc, when a file system would crash it could take hours before a file server could restart because the entire file system was scanned to ensure it was consistent. Durability wasn’t so much a file systems concern.
  • Readdir: This is interesting to me because one of my pain points when I first started working with Windows NT file systems is their decision to shift filtering directory enumeration into the kernel. Yet, his perspective is a good one: applications often are only interested in looking for a subset of files. If you have a directory of 35,000 files and you are looking for just two of them, that’s a lot of overhead moving data between the kernel and the user application. So, perhaps the pain of filtering in the kernel is worth it. There are plenty of other issues with respect to readdir as well. There’s quite a bit of impedence mismatch between the POSIX approach and the Windows NT approach. For example, NTFS actually maintains duplicate copies of file meta-data (timestamps and sizes) between the directory and the file itself. When I taught file systems classes I would sometimes show people the impact of the implementation, where directory enumerations could provide stale results under certain situations. Of course, readdir is of interest to me because as I look at converting the file system to a graph I think quite a lot about what “readdir” looks like when it comes to such a system.
  • Chmod: In fact, Jeff’s discussion here is broader than just chmod, it really deals with the complexity of modern security. There’s no simple mechanism for dealing with multiple security domains – try implementing NFS on Windows some time and find out what a pain point it is. His argument of using capabilities is an interesting one, and resonates with some of my own observations in this area. For example, in a graph I can easily conceptualize of disjoint clusters of the graph, where nobody else can find the stuff I’ve created until I give them a magic cookie for it (a capability). There’s no concept of capabilities in POSIX file systems.

I can point to other aspects of the file systems APIs in POSIX that cause grief as well. For example, the fact that the file handle embeds a current offset is definitely a pain point. I see this every semester as students in CS 6200 struggle to add multi-threading support and then have to learn that the read and write calls are not thread safe. What amazes me is that this mistake was carried over into Windows NT: the CurrentByteOffset is an attribute of the kernel FILE_OBJECT. When I write a Windows file system, I’m responsible for updating this field correctly (where “correctly” itself is related to the particular operation being performed) even though I don’t use that field. I suspect that it is there because it was needed for POSIX compatibility.

So my take-aways: supporting POSIX is useful because it preserves existing applications, but it is definitely in need of revision and/or rethinking. More ideas in that direction are best saved for a future conversation, though.

What is a File System?

I don’t know if I’ve discussed this previously, but if so, feel free to skip it. I had a meeting with one of my supervisors this morning and he observed that something I pointed out “would make a great HotOS paper…” I’m pretty much off the hook on that one, since there won’t be another HotOS for almost two years.

Neutron star collision
Copyright 2016 Skyworks Digital, Inc.

I said “we conflate name spaces and storage in file systems”. Or something equivalent. Here’s my real question: why do we mingle storage with indexing? I realize that indexing needs to be able to ultimately get us what we’re looking for, but traditionally we smash these together and call them a “file system”.

The Google File System paper certainly annoys me at some level each time I read it. But why does it disturb me? Because what they build isn’t a file system, it’s a key-value store. Yet over the years I have come back to it, repeatedly, and observed that there is an important insight in it that really does affect the way we build file systems: applications do not need hierarchical name spaces. What they need is a way to get the object they want.

Indeed, Amazon’s S3 service really is the same thing: an object store in which the “hierarchical name” is nothing more than an arbitrary key. I don’t know how it is organized internally, but the documentation is quite clear that the “hierarchical name space” is really just the way the key is encoded.

An Amazon S3 bucket has no directory hierarchy such as you would find in a typical computer file system. You can, however, create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, you can name it photos/2006/February/sample.jpg.

GetObject

For me, this is a useful insight because at one point I considered inverting the traditional directory hierarchy, so that rather than have a directory point to it’s children, I would have each file maintain a list of the directories to which it points. Then “enumerating a directory” becomes “finding all the files with a specific attribute”. One motivation for considering this approach was to avoid the scaling issues inherent in large directories (which led me to an interesting looking PhD thesis Scale and Concurrency of Massive File System Directories – more for the reading list!) This is an important concern in a world in which I’m suggesting adding multiple different indices.

But back to the more fundamental question: why do we conflate them? Recall that the original thinking around file systems mirrored the physical nature of existing physical filing systems (I discussed that somewhat back when I talked about my Eurosys presentation) rather than the more free-flowing linkage present back in the Memex model. In a world where the physical copy is located by using its naming information, this does make sense.

Consider how a modern library is organized (yes, some people still collect books!) The location of the object (book) is important, but the categories of the object are equally important – it’s properties. A card catalog could be organized by topic, and would then give you information allowing you to find the work itself.

Applications don’t need hierarchical name spaces – those are something that humans use for managing things. Applications need persistent location information. By using a mutable location key with application we create a situation in which things break just by reorganizing them.

S3 does not fix this issue, since “moving a file” is equivalent to “changing its key”. In that way, it misses my point here. However, I suspect that those keys don’t actually change very often. Plus, S3 has other useful concepts, such as versioning.

What you do lose here is the ability to “check directory-wise security”. That will likely bother some people, but for the vast majority of situations, it really doesn’t matter.

Typical file systems actually do use key-value stores, although local file systems do it with simple keys (e.g., “i-node numbers”). These keys turn out to be useful: NFS looks up files using that i-node number, which is embedded inside the file handle. We did the same thing in AFS, going so far as to add an “open by i-node number” system call so the server could be implemented in user mode. Windows does the same thing with CDFS and NTFS, and I’ve implemented file systems that support “open by ID” for both file and object IDs.

Thus, my point: we merge the name space and the storage management together because “that’s the way we’ve always done it”. The POSIX interface, which codified how things were done in UNIX, embeds this further as there is no open by i-node number and the security model requires a path-wise walk.

Ah, but the failings of POSIX are something that I’ll save for another post.

File System Driver: Structure

I thought I would discuss the structure of my driver and why I chose to structure it the way that I have done. Of course, I could change the structure of the driver again, but I will need to see a compelling reason to do so.

First, I will note that the driver has changed since I last wrote about it. I was not entirely happy with what I saw as a bit too much complexity, some of which is just me fighting with the vagaries of C++ versus C. So of course, I have refactored it (though honestly, my first post has not yet appeared as I write this, so you can probably ignore the refactoring).

I chose to structure my driver so that each device object can have its own request handlers. My rationale for this was fairly straight-forward: it is quite common in these drivers to have a control driver and then a file system driver. Usually these get conflated together, so I thought I would try to set up my driver so that I could more easily split out that functionality.

As of this writing, I now have a control object. The driver creates a control instance as part of DriverEntry. I’ve also moved the project over to the larger project of which it is a part (ergo, the reason I decided to build a new file system driver in the first place). You can find that project on GitHub – Project Araneae, a name that relates to spiders because the goal of the project is to create a relational web across files. Much of that project won’t have anything to do with Windows file systems, but I’m going to keep working on my file system because parts of it will eventually need just such a file system.

Fortunately, as soon as I started up Visual Studio 2019 it told me that had an update – so I’m using the time while that installs to write more about my work.

Since this is now the Tarantula project (a spider name), I have renamed things somewhat. The base device object extension class that I use is TNativeDevice. I’ve also started to try using the Windows Implementation Library (WIL) which has a nominal amount of Windows kernel support. I poked at this a bit and it is tantalizing to try using it, but rather than spend too much time on meta-programming (templates) for now, I’ll curb my enthusiasm and keep poking at it when I have more time to do so.

Trying to get C++ to use an array of function pointers to methods for each major function code turned into quite a bit of hassle so I decided to just encode the functions as a virtual interface; each one is set up to take the IRP as its input parameter, the default implementation does the right thing, and I moved the new operator into the class itself (I was getting ugly conflicts with WIL and truthfully, I prefer specializing new because I like having unique tags. The last time I did this (several years ago) I used a template technique for the operator overrides and constant tag values. This time I’ve just manually coded them and moved the tags into a common file (which has advantages later when I want to find all the tags that I’m using).

I created a specialization of the TNativeDevice for the control object and aptly named it TNControlDevice. It has support for Create, Cleanup, Close, and DeviceIoControl. The other functions are handled appropriately. I have kept things simple – I am not auto-creating the control device in the constructor, for example, and instead create it explicitly in DriverEntry. I still use the DeviceExtension for storing my control object (the class instance). When I load the driver, I see the control object appear in the object manager namespace. When I unload the driver, I see the control object disappear. I did notice that somehow I broke the inf script, so I’ll have to debug that at some point (though that is never fun).

I turned my attention to constructing code I’ve written previously. One is to load the registry parameters. I’ve tried a couple of approaches to this over the years and my favorite is to load all the values and then support a query interface. In the past I’ve even spawned a thread to monitor changes and reload things, but at this point I have not done so. This allows me to add things, such as the registry-controlled initial driver breakpoint (a BreakOnEntry value) as that’s proven useful over the years. It isn’t something that would be enabled in production. I expect to also add information about what pseudo file system volumes to add – but that’s really the point here, I can use that registry information however I see fit.

I have also added a library to wrap the ERESOURCE objects that are the lifeblood of the file systems locking mechanisms. I have not gone so far as to reconstruct the lock ranking package that I previously built, but I suspect I will do so at some point; the last version of that was quite good at validating the ranking of locks within the hierarchy and enforcing proper ordering, which in turn avoided deadlocks. It also allowed invocation of an optional function for computing a hash value of the protected object, which permitted me to detect when a data structure was modified without the lock being held. It would not tell me where that happened, but at least it would alert me to the need to do a code review (or set an access breakpoint and find where it was happening).

Doesn’t sound like a lot of file systems code yet, does it? So far, it isn’t. I will need another specialization of TNDeviceObject for my pseudo file system, which means I want to know what drive letter to use when creating it and the logical place to get that is from the registry – hence why I was working on the registry package in the first place.

Thinking through that point, I realized I need to construct a TNFileObject class as well. I suspect at some point I will want to have specializations of this, but I am fine with waiting until I need it to create one. The TNFileObject will contain at least two separate memory allocations: one is for the non-paged objects that must live in kernel memory, such as the SectionObjectPointers structure. There is always a fair bit of state that is safe to page out, and thus there will be a second memory allocation for the blocks of memory that can be safely paged out. Indeed, I suspect most of the state for my driver will be safely pageable.

One final note for those of you who might be reading and just getting started writing drivers for Windows: make sure you turn on driver verifier. This is done using the verifier application, which is installed on Windows. Make sure you enable verification for your drivers at least. Then learn how the !verifier command works inside WinDBG, because you will need it.

I’ll keep posting about this periodically as well. Feel free to also check out the code online. Let me know if you have questions or suggestions.