Home » Operating Systems

Category Archives: Operating Systems

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
January 2025
S M T W T F S
 1234
567891011
12131415161718
19202122232425
262728293031  

ZUFS

After one of my earlier posts on FUSE file system performance, someone mentioned this project to me – the Zero copy Userspace File System project (ZUFS) which appears to be a NetApp sponsored project.

Sometimes Zero is best
Sometimes Zero is best.

There have been a variety of talks about this project, including the Linux Plumber’s Conference (which was held next door to me – I can see the venue from my window as I write this), as well as the SNIA Persistent Memory Summit in 2018. The NetApp repositories on Github.com contain both a file system reflector (zufs-zuf), which appears to be similar to the FUSE kernel driver, as well as the user mode server (zufs-zus) which handles dispatching the kernel level requests to the user mode file system implementations.

Their concern appears to be eliminating the copy of any data between kernel and user mode, which makes sense given their objective of supporting persistent memory, such as the new Intel Optane DC Persistent Memory that has recently become commercially available.

Persistent memory benefits from a direct access model, in which traditional file data caching is eschewed in favor of direct access. Thus, data is read or written directly from the underlying persistent memory, rather than copied from a buffer cache.

There are a few persistent memory file systems, including UCSD’s NOVA file system, though usually they were developed using emulation of persistent memory. In such systems, there is no benefit to copying the data from persistent memory into DRAM and back; indeed, it is a significant performance impediment.

What is not currently present in the NetApp repository is an implementation of a user mode persistent file system (they have a dummy file system implementation, which appears to be the base from which one could build a real file system). This definitely presents an interesting alternative to using traditional FUSE.

Fuze vs ZUFS
FUSE vs ZUFS Performance (from NetApp SNIA presentation)

I have not had an opportunity to play with this new system yet, but it certainly does seem to be intriguing – and the performance graph from the SNIA presentation is rather compelling, given the massive improvement in scalable performance.

There sure are quite a few alternatives to traditional FUSE to consider…

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support

Direct-FUSE: Removing the Middleman for High-Performance FUSE File System Support, Yue Zhu, Teng Wang, Kathryn Mohror, Adam Moody, Kento Sato, Muhib Khan, and Weikuan Yu, in Proceedings of the 8th International Workshop on Runtime Operating Systems for Supercomputers, page 6, 2018.

Modern Fuse, circuit breakers instead of actual fuses.
Modern Fuse

There are quite a few papers that discuss the performance of the FUSE model. I already discussed a recent paper that explored the performance of FUSE on Linux and that paper observed that I/O performance for FUSE is reasonably good due to the optimization work that has been done to minimize the data copy overhead that can occur with a naive implementation.

What I do find surprising is the emphasis on FUSE performance; this leads me to think that people look to user mode file systems as something viable for implementing production file systems. Of course, one motivation for this is that building a FUSE file system is generally simpler than implementing an in-kernel file system. Some of this is environmental – the kernel is a harsh development environment, in which the smallest bugs lead to the system crashing.

Of course, virtual machine technologies have done quite a lot to minimize this overhead, as the “machine” that crashes is now more like an application. If you are developing code for the UNIX, Linux, or Windows kernel you are likely to be developing using C, the most commonly used systems language these days. It is possible to bravely branch out and use other languages, but then you inherit other interesting restrictions and frequently find that you are developing the tools as much as you are developing the file system.

Thus, one benefit of the user space file systems model is that you can use other development tools – FUSE file system implementations us a much larger range of programming languages than is normally found in kernel file systems. The FUSE model also permits fairly rapid development of a prototypical file system.

Today’s paper touches on these traditional issues and points out that sometimes what you need isn’t a general-purpose file system but rather something that is specifically crafted to solve the problem at hand. For the HPC community, performance is an important driver for the specialized file systems of choice. The authors’ use an optimized library, libsysio, that provides a POSIX-like interface which intercepts I/O operations to a remote file system – in essence, a sort of automated mechanism for turning I/O calls into something reminiscent of RPC.

The emphasis of the authors is in eliminating the overhead of system calls. Their approach is certainly focused: this solution works for a single application that requires high performance operations.

They start off by evaluating the cost overhead of using FUSE. Because their emphasis is on I/O, that is what they evaluate. Thus, unlike the earlier FUSE analysis, which indicated that meta-data operations were the most significant bottleneck, this work concludes there is still substantial impact on I/O performance as well.

They take an existing library from Sandia Labs, libsysio. I found multiple different versions of this library available on the Internet and was interested to find that it has been integrated into other file systems, including Lustre, with which I have some familiarity from past work. The authors’ don’t discuss if their approach is better than using other HPC file systems, focusing on improving the performance of their specific use case.

One interesting design consideration for Direct-FUSE is they seek to support multiple FUSE file systems from a single application, using the same high performance communications approach. This is not usually an issue for applications with pure FUSE file systems because to the application the FUSE file system appears to be functionally equivalent to every other file system. This is, however, an issue that can arise when incorporating multiple I/O library based models into a single application; something they address in Direct-FUSE.

They describe their implementation model for supporting multiple distinct file systems, differentiating between file systems via a prefix matching model, and then forwarding name based requests as appropriate. File handle based operations work by using an indirection table for encapsulating the additional state needed to determine which file system should be used to satisfy requests against the particular file handle.

Much of the paper focuses on the evaluation of their solution. In keeping with their focus on raw I/O performance, the evaluation is all about bandwidth at various I/O sizes. Their results indicate that they are able to achieve performance that is comparable to similar native file systems (they use ext4 and tmpfs implementations for these benchmarks). Thus, they demonstrate that their approach has comparable performance to the native ext4 and tmpfs implementations.

They also compare their performance in the distributed file systems arena using FusionFS, an existing FUSE file system. They show comparable performance for read I/O bandwidth (including scalability to multiple nodes) as well as improved write I/O bandwidth.

They then evaluate the context switch difference between the two solutions (FUSE and Direct-FUSE) and observe that they have eliminated the context switch overhead.

Bottom line, they have found a way to improve performance over traditional FUSE file systems. They do not compare to other HPC oriented file system (e.g., Lustre) and thus it is difficult for me to tell if this is a viable contender for larger scale distributed file systems work. Nevertheless, they do point out the impact of the context switch costs inherent in the traditional FUSE model.

I am left asking myself “is the goal to make FUSE performance close enough to native kernel file systems that it makes sense to simply implement in FUSE?” Since they only focus on I/O bandwidth, I am not sure if they will achieve this goal for broader benchmarks.

Windows Filesystems: File Object Relationships

A complex intersection in Shanghai
A complex intersection (Shanghai)

One of the challenges of developing file systems in Windows is related to the complex relationships that exist between various data structures in the operating system that are part of the file systems domain. In this post I want to discuss one aspect of this complex relationship because it leads to behavior that makes sense when you understand it, but leads to very counter-intuitive behavior if you do not.

Each time a user opens a file the I/O Manager creates a new file object (I described this in Create previously.) A Windows File system is responsible for initializing specific fields of the File Object including the SectionObjectPointers field. This is a peculiar field – the file system allocates the space, but does not set the fields within it. There is a many-to-one relationship between File Objects and this field. The information here is used by two other parts of the operating system: the Memory Manager and the Cache Manager.

The Memory Manager is responsible for managing virtual memory. Each process has its own unique address space. An address space defines what code executing within that process will see. Code running within the address space works using virtual addresses. The CPU and operating system work together to use and manage this data. The details of how this labor is divided is a function of the CPU architecture and the details do vary across different architectures. The operating system must be modified to support new CPU architectures and this is one of the areas that can require additional programming effort.

There are quite a few distinct parts to the virtual memory system. For example, the CPU typically contains a special cache of recently used virtual to physical addresses (a translation lookaside buffer or TLB). When a virtual address needs to be interpreted by the processor it first looks in the TLB to see if it knows the correct mapping for the given virtual address. If it does, it uses that without accessing memory. This is quite important because when the processor does not find the entry in the TLB (a TLB “miss”) it must then convert the virtual to physical address using the data in memory. Memory, while fast, is much slower than the TLB. Modern TLBs can be accessed within about 0.5 nanoseconds (ns) while accessing memory can often require 75 ns. Plus, converting a virtual to physical address can require access to multiple physical memory pages, which further drives up the time required.

One reason for having a virtual memory system is that the operating system can set up a range of virtual addresses without assigning physical memory to those addresses. As necessary, the hardware will invoke the operating system to allocate physical memory and then assign it to the relevant virtual memory location. This process is typically called a page fault because memory is managed in small units called pages. In order to fill in the newly allocated physical page properly, the virtual memory system needs to keep track, for every virtual address, where its corresponding data is presently stored. This is why File Systems are closely involved: they manage data. Thus, there is a symbiotic relationship between the Windows Memory Manager and the File System.

File Systems in Windows also offer a non-block oriented interface for storing and retrieving content: a read/write interface. Such byte oriented interfaces are a mismatch with the block oriented interfaces of most storage devices and the File System converts between the two of them. In Windows, file systems typically do so by memory mapping a region of the file into memory. Windows provides another component – the Cache Manager – to assist in managing these mapped regions. Thus, we have a tight collaboration between the three components: Memory Manager, Cache Manager, and File System. Each has a distinct role to play, but relies upon the other to accomplish its task.

The following diagram attempts to capture the interesting subset of relationships that I am describing here:

Object Relationships in Windows File Systems
Object Relationships in Windows File Systems

Let’s start by describing the individual components:

  • File Handle – this is an abstract value that Windows gives to an application to represent a file. Any time we use a file handle in the kernel, it must be validated since we do not trust applications not to have bugs or nefarious intent.
  • Section Handle – this is similar to the file handle, in that it is an abstract value that Windows gives to an application. In this case it represents a section, which is used to permit memory mapping of some or all of a file into the address space of an application.
  • File Object – this is the internal Windows kernel data structure that is used to represent an open file; we use file in its broadest sense here because it could be a file, a directory, a device interface, a communications endpoint, or pretty much anything else that “behaves like a file”.
  • Section Object – this is the internal Windows kernel data structure that is used to represent anything that can be mapped into memory. Note that this is distinct from the actual mapping. There are two kinds of section objects with which a File System is concerned: the image section object and the data section object. One key difference here is that something mapped using an image section object will be copy-on-write, so changes are not written back to disk, while something mapped using a data section object will permit reading and writing and changes to it are normally written back to storage. Both support the concept of sharing memory – in which two or more processes using the same section will also use the same physical memory.
  • Shared Cache Map – this is a control structure used by the Windows Cache Manager to track regions of files mapped into memory for use by the file systems to handle read and write calls (though file systems may choose to not use the cache). This structure belongs to the Cache Manager and shouldn’t be used by any other component (though it is useful to us, as file systems developers, to look at it _in the kernel debugger).
  • Section Object Pointers – this structure is allocated by the file system from non-pageable memory. It is large enough to store three pointer values: a pointer to the ImageSectionObject, a pointer to the DataSectionObject, and a pointer to the SharedCacheMap. These map to, as you might expect, section objects and the shared cache map.

With this background, I can now describe my diagram. When a File System is initializing a new File Object, one of its responsibilities is to set up storage for the Section Object Pointers structure. Typically, this is the same for each new open instance of the same file. One reason to set them up to be different is if you want to construct a separate view of a given file. For example, I have used distinct views in the past to manage encrypted files: one view is the clear view of the file, while another view is the encrypted view of the file. Similarly, I could do this if I were supporting file versioning.

If an application opens a file that has not been previously opened, the section object pointers value will point to cleared memory (all values are set to zero). If the File System then invokes the Cache Manager to set up caching (CcInitializeCacheMap), upon return, the DataSectionObject and SharedCacheMap fields will have been initialized.

It turns out that the Cache Manager needs to have a File Object to implement some of its functionality. Given that the only File Object it has to use is the one that is passed into the cache map initialization function, it bumps the reference count on this object and stores a pointer to it.

It is rather common for a File System to defer initializing the cache until the first read, since many files are opened but never used for read or write. If an application memory maps a file, it does this in two steps: first, it opens the file, then it opens the section.

Since the Memory Manager “owns” sections, the call to open a section is sent to it, along with the file handle for the file to be mapped. The Memory Manager converts the file handle to a File Object. Then it looks inside the section object pointers value to see if there is an existing DataSectionObject value. If there is, it uses that section object to complete the call. There is no interaction with the file system required. If there is not an existing Data SectionObject, it will create a new one. This turns out to require some level of interaction with the File System because the Memory Manager needs to know how big the file is – after all, the size of the section and file need to be the same. Once the new section object is created, a pointer to it is stored in the SectionObjectPointers block associated with the file. Since the Memory Manager may need to retrieve data from the given file, it maintains a reference to the File Object.

At this point it is worth noting that if a file is memory mapped there is no shared cache map. Recall that when the File System calls the Cache Manager it passes a File Object, which the Cache Manager will use.

Thus, if a file is first memory mapped and then subsequently opened again and used for cached read/write operations, we find a situation where the Memory Manager is using one File Object, and the Cache Manager a second File Object. If we do this in the reverse order, the Cache Manager and Memory Manager use the same File Object for their references because the Cache Manager creates a section object using the same File Object that was passed to it by the File System.

So, when might we have a third distinct File Object? When the file is an executable image. Suppose you copy an executable program. The copy program opens the target file for data access; it then writes the contents into the file. Unless this is done without memory mapping or caching, there is now a Data Section Object backed by this file (and thus a File Object). Next you decide to execute it: this time when the file is loaded for execution the Memory Manager will need an Image Section Object. If one does not exist, it creates one. This is one way how you end up with two section objects for a single file. If you combine that with the memory mapping versus read/write usage described earlier, you can end up with three distinct File Objects for a single file that is only in use by one application.

But for a File System there are some more interesting effects as a result of this shuffling around:

  • A paging write operation may be performed against a File Object that was opened for read. This is because the Memory Manager uses the File Object stored in the section object; it does not know how the original file was opened. A subsequent open of the file for write continues to use the File Object that was opened for read and passed to the Memory Manager when the section was created. I will note there is now an API for switching this file object out, as this behavior can be challenging for some file systems to handle.
  • Paging I/O Request Packets (IRPs) can be sent to the file system with any of the three File Objects; it will depend upon the reason the paging request is being sent.
  • The lifetime of these three File Objects is now decoupled from the user file handle. What this means is that when the user application closes the file, the File Object’s reference count will not drop to zero. So the IRP_MJ_CLEANUP will not be immediately followed by an IRP_MJ_CLOSE. It also leads to a peculiar situation where the IRP_MJ_CLOSE is sent inside the cleanup handler, but I’ll talk about that another time.

These relationships, while complex, make sense in the context of the interactions between these components. Hopefully this basic description will help those writing Windows File Systems better understand the interaction patterns.

File System Driver: Create

The road ahead.

The usual place to start when building a file system is to think about the Create operation. This may also be referred to as an open operation, but that conflates the object with the handle.

I think of Create as being “create a handle to the object”. The creation of the object itself can be a side effect of creating the handle; it does not make much sense to create a handle to something that doesn’t exist.

POSIX actually does have a create system call (and at one point it was creat for obscure historical reassons). Now you can call open and specify O_CREATE and it will perform the same operation. In Windows the native call is NtCreateFile, though there is also an NtOpenFile, but they both ultimately invoke the same internal kernel operation (IoCreateFile or its successor IoCreateFileEx – the Ex thing is one way of saying “oh, we need to pass in more parameters, so sorry.” I’ll just talk about IoCreateFile since it is shorter and if you need the other version for some peculiar reason you can figure it out. Remember, it’s an operating system internal call, so it only matters to people writing software to execute in the kernel.

Inside Windows, IoCreateFile is actually a rather large, complicated function: while the concept of creating a new file object is simple, the details are complicated because as much as we like to pretend files are nothing more than byte streams, there are so many special cases and exceptions to this that our illusion thin, at best. UNIX suffers from this as well, with symbolic links and special files. So, files are byte streams, except when they aren’t.

Since I’m writing this article as a description of how file systems on Windows work, I’ll stick with talking about Windows behavior and leave talking about other operating systems behavior to another time.

Windows Flow of Control for Create
Windows Flow of Control for Create (Simplified)

The animation shows the basic flow of a request through the system:

  1. An application opens (or creates) a file. Depending upon the implementation for the subsystem, it will call through the subsystem itself; often this is just implemented inside a shared library (dynamic link library or DLL).
  2. Since this requires a system call, it will invoke the relevant system call interface inside ntdll.dll, which is mapped into every process in the system. It will format the request as appropriate for the platform and then issue a system call (syscall/sysenter on Intel platforms, or swi on ARM platforms).
  3. The Windows system call dispatcher will forward this to the I/O Manager, since it handles files.
  4. The I/O Manager is presented with a name but it does not yet know which device and driver will handle this specific request. Thus, the I/O Manager will ask the Object Manager (ObLookupObject) to parse the name until it finds a relevant device. Assuming it does reference a specific device, the Object Manager will then invoke the I/O Manager, because DeviceObjects (and FileObjects) point to a function registered by the I/O Manager with the Object Manager. For example, this will invoke IopParseDevice. At this point the I/O Manager now has a DeviceObject and the balance of the name. At this point it can allocate an I/O Request Packet (IRP) and set it up. In the case of a physical file system, however, the I/O Manager must reference the volume parameter block (vpb). Thus, it retrieves the relevant file system’s device object from the vpb. It asks the Object Manager to create a new FileObject. The I/O Manager will complete initialization of the FileObject and format the IRP with the relevant parameters. Create is a complex I/O request and the fields are scattered, unlike other I/O requests. Once formatted, the I/O Manager will invoke the file system driver (via IoCallDriver).
  5. The file system driver will receive the IRP. It will process the request, which may involve creating a new file as a side effect of creating the file object. It will parse the balance of the name. It may check security, sharing, allocate space for new files, validate options. It must handle symbolic links at this point as well (including reparse points) and format return information appropriately. It might attach ExtraCreateParameters to the I/O request. It could perform the operation within the context of a kernel transaction if the FileObject is part of a transaction. It may have special cases for volumes, directories, alternate data streams, or other file system specific behavior. While the Windows I/O Model allows any IRP to be processed asynchronously, the I/O Manager will block and wait for completion of the request. The file system may also need to perform additional operations for oplocks, which even have a case where the create is completed, even though the file is not yet usable.
  6. Once the file system is done processing the request – successfully or not – it will complete the request by calling IoCompleteRequest. The I/O Manager will unwind the I/O request stack (in case there are filters, which there almost always are now) and once done, it will copy results from the kernel to the user address space and return control to the system call dispatcher.
  7. The system call dispatcher will restore state, set the return code in the return register, and complete the system call.
  8. In user mode, ntdll will process the request, may raise exceptions if needed, and return to the application caller.

For a Windows file system Create is often one of the most complex routines – I have seen file systems with almost 20% of their code in this path. There are numerous edge conditions: open by file ID, for example, as well as oplocks, not to mention creating new files, creating in-memory data structures and managing the many-to-one relationship between file objects and the file system control structures. Each open of a file creates a new FileObject. I will discuss why this becomes complicated in a future post, because it leads to unexpected behavior for the unwary.

File System Driver: Structure

I thought I would discuss the structure of my driver and why I chose to structure it the way that I have done. Of course, I could change the structure of the driver again, but I will need to see a compelling reason to do so.

First, I will note that the driver has changed since I last wrote about it. I was not entirely happy with what I saw as a bit too much complexity, some of which is just me fighting with the vagaries of C++ versus C. So of course, I have refactored it (though honestly, my first post has not yet appeared as I write this, so you can probably ignore the refactoring).

I chose to structure my driver so that each device object can have its own request handlers. My rationale for this was fairly straight-forward: it is quite common in these drivers to have a control driver and then a file system driver. Usually these get conflated together, so I thought I would try to set up my driver so that I could more easily split out that functionality.

As of this writing, I now have a control object. The driver creates a control instance as part of DriverEntry. I’ve also moved the project over to the larger project of which it is a part (ergo, the reason I decided to build a new file system driver in the first place). You can find that project on GitHub – Project Araneae, a name that relates to spiders because the goal of the project is to create a relational web across files. Much of that project won’t have anything to do with Windows file systems, but I’m going to keep working on my file system because parts of it will eventually need just such a file system.

Fortunately, as soon as I started up Visual Studio 2019 it told me that had an update – so I’m using the time while that installs to write more about my work.

Since this is now the Tarantula project (a spider name), I have renamed things somewhat. The base device object extension class that I use is TNativeDevice. I’ve also started to try using the Windows Implementation Library (WIL) which has a nominal amount of Windows kernel support. I poked at this a bit and it is tantalizing to try using it, but rather than spend too much time on meta-programming (templates) for now, I’ll curb my enthusiasm and keep poking at it when I have more time to do so.

Trying to get C++ to use an array of function pointers to methods for each major function code turned into quite a bit of hassle so I decided to just encode the functions as a virtual interface; each one is set up to take the IRP as its input parameter, the default implementation does the right thing, and I moved the new operator into the class itself (I was getting ugly conflicts with WIL and truthfully, I prefer specializing new because I like having unique tags. The last time I did this (several years ago) I used a template technique for the operator overrides and constant tag values. This time I’ve just manually coded them and moved the tags into a common file (which has advantages later when I want to find all the tags that I’m using).

I created a specialization of the TNativeDevice for the control object and aptly named it TNControlDevice. It has support for Create, Cleanup, Close, and DeviceIoControl. The other functions are handled appropriately. I have kept things simple – I am not auto-creating the control device in the constructor, for example, and instead create it explicitly in DriverEntry. I still use the DeviceExtension for storing my control object (the class instance). When I load the driver, I see the control object appear in the object manager namespace. When I unload the driver, I see the control object disappear. I did notice that somehow I broke the inf script, so I’ll have to debug that at some point (though that is never fun).

I turned my attention to constructing code I’ve written previously. One is to load the registry parameters. I’ve tried a couple of approaches to this over the years and my favorite is to load all the values and then support a query interface. In the past I’ve even spawned a thread to monitor changes and reload things, but at this point I have not done so. This allows me to add things, such as the registry-controlled initial driver breakpoint (a BreakOnEntry value) as that’s proven useful over the years. It isn’t something that would be enabled in production. I expect to also add information about what pseudo file system volumes to add – but that’s really the point here, I can use that registry information however I see fit.

I have also added a library to wrap the ERESOURCE objects that are the lifeblood of the file systems locking mechanisms. I have not gone so far as to reconstruct the lock ranking package that I previously built, but I suspect I will do so at some point; the last version of that was quite good at validating the ranking of locks within the hierarchy and enforcing proper ordering, which in turn avoided deadlocks. It also allowed invocation of an optional function for computing a hash value of the protected object, which permitted me to detect when a data structure was modified without the lock being held. It would not tell me where that happened, but at least it would alert me to the need to do a code review (or set an access breakpoint and find where it was happening).

Doesn’t sound like a lot of file systems code yet, does it? So far, it isn’t. I will need another specialization of TNDeviceObject for my pseudo file system, which means I want to know what drive letter to use when creating it and the logical place to get that is from the registry – hence why I was working on the registry package in the first place.

Thinking through that point, I realized I need to construct a TNFileObject class as well. I suspect at some point I will want to have specializations of this, but I am fine with waiting until I need it to create one. The TNFileObject will contain at least two separate memory allocations: one is for the non-paged objects that must live in kernel memory, such as the SectionObjectPointers structure. There is always a fair bit of state that is safe to page out, and thus there will be a second memory allocation for the blocks of memory that can be safely paged out. Indeed, I suspect most of the state for my driver will be safely pageable.

One final note for those of you who might be reading and just getting started writing drivers for Windows: make sure you turn on driver verifier. This is done using the verifier application, which is installed on Windows. Make sure you enable verification for your drivers at least. Then learn how the !verifier command works inside WinDBG, because you will need it.

I’ll keep posting about this periodically as well. Feel free to also check out the code online. Let me know if you have questions or suggestions.

Setting up Debugging

No matter how many times I do this over the years, I find it to be a slow and painful process – install a clean virtual machine image, set up kernel debugging, install the (test signed) driver.

It never ceases to amaze me how non-intuitive it is to set up talk to the new VM image that I’ve set up on my computer. This time was certainly no exception. Since I’m running on Windows 10, I choose to use Hyper-V (otherwise I can’t use the Windows 10 docker support, for example). I downloaded the latest and greatest version of Windows 10 (1903 Enterprise) and installed it inside a Generation 2 Hypervisor. There’s the usual challenges (oh right, I reset the machine, so of course none of my Hyper-V networking choices worked, so of course I don’t have an external network switch set up). So I set up the VM with no networking. Windows is most displeased with this option and warns me that a machine without networking is like a Bloody Mary without vodka; one nice upside to this was that it didn’t prompt me to log into an internet account, so I gleefully set up a local account.

Then of course I had to figure out (again) how to get things transferred between the local machine and the virtual machine. I made the usual changes to enable ICMP (“advanced settings” in the firewall, and then it is buried under “File and Printer Sharing” because obviously supporting ICMP is all about file and printer sharing….

I was amused to find that I could ping from my laptop to the virtual machine, but not the reverse. Repeating this process on the laptop of changing the File and Printer sharing options fixed that issue. Next trick was to get SMB working since I have to move files between the two machines. I enabled other File and Printer sharing options (for the private/domain network options – I don’t want to share on a public network, after all) but that still didn’t work. So I turned off the firewall on the virtual machine; not best practice vis-a-vis security but it is a test machine.

Voila! I can map a share. So now I copy kdnet from the host to the debug target. It of course tells me that I have to turn off secure boot to continue. I first tried with PowerShell but it balked at me so I sighed and shut down the VM image, changed it inside the Hyper-V management console, rebooted and restarted. I could then run kdnet successfully to enable boot debugging (admittedly, that is a nice improvement, since I’m used to dealing with bcdedit to enable kernel debugging). I fire up the debugger and reboot the VM – and it hangs.

So I start poking around on the host machine. Much to my annoyance, it seems that Hyper-V has installed a virtual adapter on top of my wireless and marked it as a public network. Well of course that isn’t going to work so well. The option to change the network type just silently doesn’t show up on the laptop (ironically it did show up inside the VM). After some internet searching and poking and prodding I get the network type changed to private using PowerShell – and then the debugger attached and the VM finished starting. Life is good!

I’ve now copied my test driver, inf script, and certificate over to the test machine. I went through the steps of installing the test certificate as well.

While doing all of this I also did a quick bit of research as to the least expensive code signing certificate. At the time I write this (May 2019) the best option (if your primary concern is cost) is Digicert. They list a price of $99 per year for an EV code signing cert. Of course EV certs are a bit of a pain to get, since they actually do some small amount of work to verify that you’re legit; I need to do a bit more work to ensure that my phone number is independently verifiable (one of their requirements) as I’m not quite sure how they will validate it. But I don’t expect I’ll need that for a while yet, since that’s really needed when I want to hand the driver to someone who isn’t a developer and wants to test the driver.

I installed the driver and… the machine crashed. So I know that my configuration is now working and I can debug the driver. Progress!

Starting the Skeleton Driver

Screen shot of my newly created skeleton driver

In my last post, I installed the WDK (and described why I want to build a file system driver). I started up Visual Studio 2019, said I wanted to create a new project, narrowed down the options to “WDK” related projects, and scrolled down to the WDM driver option.

I’m not building a WDM driver, but it is the closest project type to what I want to do. It creates a solution with the name specified and then creates a project with the same name. The only file that it pre-constructs for me is an inf file. I will need to do work on that before I can use it, but I’ll leave that for later. File system installation files are surprisingly uncomplicated, since all we really need to do to install a file system driver is set up a few registry keys.

Since I had just installed Visual Studio 2019, I’ll need to tune things to my working environment. I started by enabling git integration, since I will be using github.com for my source code repository (winskel).

That took more time than I anticipated – I installed the Github integration into Visual Studio, which restarted Visual Studio. I was then told “there is an update to GitHub extension for Visual Studio”. Thus, I installed the update next. That required another restart to install the update. I hope the Visual Studio folks take a lesson from the VS Code team, since I install VS Code extension updates all the time with just a refresh, not a full restart. Of course, I used that time to continue adding to my post here, so it wasn’t entirely wasted time. Still, it is stunning that they construct a restore point just for installing new extensions.

The new helpful error message upon restarting!

I really liked the fact that Visual Studio 2019 suggested to me that I could make startup faster by disabling the WDK extension – how helpful, given that the reason I’m running Visual Studio 2019 is because I want to use the WDK. It makes me long for the days of SOURCES files and command line program building. I know it is possible to develop without using Visual Studio and perhaps I’ll explore that again at some point, but I’d rather be writing code for my new driver rather than fussing with the tools and environment at this point.

I enable code analysis – it can be annoying, but it also finds bugs.

Since this is a new project, I’m going to enable the static code analysis tools. While not required, I choose the “All Rules” option because it is the most restrictive setting available. Note that I am applying this to all the configurations (debug and release) as well as the platforms for which I have installed the compiler tools (I did not install the ARM compiler tools, so I cannot include them).

Having enabled the checks, I built my simple file with just DriverEntry (and an error return). Of course, as I expected, the static analysis tools are now reporting issues, so I add annotations (DRIVER_INITIALIZE DriverEntry; for example) and modify my code (the static analyzer points out that both the DriverObject and RegistryPath can be set as const pointers). Since I will be changing DriverObject I suppress the warning. I don’t expect to change the RegistryPath, so I mark it as const.

I also had a warning that while the spectre/meltdown mitigation option has been selected for the compiler, the libraries with the needed mitigations are not installed. So back I went into the installer and installed the missing libraries. Things now build well, and I have my super-minimal driver. It won’t do much, since the DriverEntry function returns a failure code, which means it will load and then unload.

However, this is enough for me to make the inf file work, so I will do that next.

This is the default INF file that Visual Studio provided to me.

Visual Studio generated a default INF file for me. This isn’t quite enough for me to install a working driver, so I’ll need to modify it. Plus, Microsoft changed some details about INF files for Windows 10 1903 and created a new primitive driver type with rules that need to be followed if you want the driver to be properly (test) signed.

So I worked through the INF file issues and I now have a working INF file, with a driver that (of course) won’t actually do anything yet.

Next, I turned my attention to pulling together a C++ runtime so that I can use C++ if I want. Basically, there are several things that need to be done to make this work:

  • I need memory management functions
  • I need initializer support
  • I have to wrap the standard functionality (DriverEntry) and coordinate the Unload function so it calls the cleanup logic.

In the past, I’ve added a template layer above the allocators, which permits me to specify (on a per-object type) what the pool type and pool tag are for the allocations. Unlike in user mode memory, where we normally don’t worry about these things, in the kernel we do need to worry about whether memory is pageable or not. Plus, we have to provide some mechanism for finding memory leaks since there is no automatic garbage collection. Note that my goal isn’t to port STL into the Windows kernel (though I did see one project where it looked like someone had done that). Similarly, I don’t plan on supporting C++ structured exception handling. So it will provide me with most C++ code features, but I’ll eschew those that require specialized run-time support.

As I wrap this up for the day, I have the allocation routines plumbed. The next step is to get the initializer code written – it revolves around walking through some memory locations where global and static constructors need to be called – the Microsoft C++ compiler embeds some magic information in memory to do this. I also need to construct a list of things to be called when terminating the runtime.

Once that’s done, I’ll move on to adding basic functionality. One thing that will greatly simplify this initial effort is that I don’t have to worry about integration with the memory manager or cache manager because I can defer I/O management to the native file system. Perhaps, once we’ve proven the viability of this approach, I can look further at integration.

I will continue describing my progress and updating the repository as I work through this project over the coming months.


Let’s Build a Windows File System

It’s been a while since I upgraded my Windows kernel development tools, so I thought I’d write about the steps I’m taking to do so. How you build a Windows file system has changed over the years but the basic structure of the file system driver itself has not.

At the time I’m writing this, the standard tools to do this are the Microsoft Visual Studio Integrated Development Environment and the Windows Driver Kit (WDK). The WDK download page (which could move, I just search for “Microsoft WDK” when I need to find it) actually describes the basic steps needed to develop drivers for Windows. As I write this, Visual Studio 2019 is downloading. I do not need many of the parts of it, so I just installed the “Windows Desktop Apps” components. Don’t forget to install the Windows SDK as well (it’s an optional component as part of the Visual Studio installation).

Once Visual Studio is installed, I’ll install the WDK. It will add the WDK integration components into Visual Studio and those will permit me to build Windows drivers. That will also install one of the most important tools: the debugger. I have been working with WinDBG for several decades and it is now an indispensable tool for kernel debugging, including both a graphical user interface as well as a command line version. It has support for debugging over a variety of transports (it used to support debugging over modems, but I haven’t done that in many years, so it may no longer be supported) including serial ports, USB ports, networks, and synthetic debugging (over a virtual device) for virtual machines. Since I will be using a virtual machine for my development, I will set up synthetic debugging. Easing this, it seems Microsoft has automated the process. I have never done it automatically, so I look forward to seeing if that works.

I still need to write about what I am building, but I will defer that for the moment because it deserves a post of its own. I can explain why I want to build this as a file system, though. I hope this is useful for anyone reading it who is thinking of building a file system.

The file systems interface, whether it is UNIX, Linux, MacOS, or Windows, is a clearly delineated boundary at which I can implement functionality that becomes available to anyone using the file system itself. The specifics of that interface vary somewhat across operating systems, but there is a high degree of commonality. One reason for this is that mainstream operating systems have a common heritage. Perhaps that will change in the future (one recent paper suggests that our OS architecture assumes that I/O is slow relative to processors and memory, an assumption that is no longer true) but it seems unlikely to change in the near-term.

The decision to implement functionality at the file systems interface is because essentially all applications for our operating systems know how to use it. New functionality – provided it conforms to the file systems interface – can be exposed to all those applications by supporting the file systems interface, which means that it seamlessly integrates into the existing system. If the new functionality is substantially different than what can be provided by the file systems interface, this will not be a good solution. Since my goal is to provide both compatibility and new functionality, I am using the file systems interface but will also look at ways in which I can augment it further. Indeed, one interesting aspect of the GFS paper I described earlier is they tried to implement functionality within the constraints of the existing interface. This has made me think about how I would go beyond what they did, without changing the interface, though I do expect at some point I will need to augment the existing interface.

I decided to do this on Windows for several reasons:

  • I am familiar with Windows file systems development; I am comfortable in the kernel environment, and I know how to integrate user mode services and kernel mode drivers together to provide my desired functionality;
  • I am looking at how to extend the namespace management. By doing this as a file system, I know it will be visible in standard applications, which in turn will make it easier for me to gauge how effective the changes are for ordinary users.
  • More than a decade ago Microsoft incorporated quite a few interesting features into their user interface in anticipation of the new Windows File System (WinFS). While WinFS was never released, its goals of augmenting existing file management mechanisms were partially integrated into Windows.
  • The NTFS file system in Windows supports a change journal that will be a good place to start for prototyping capturing some relationship data. I expect that we will ultimately provide additional mechanisms for doing this, but this should provide a good head start.
  • User mode file systems, while convenient for prototyping, are a dead end if I need a kernel mode file system. User mode file systems are known to be much slower, and this is amplified for meta-data operations. Since I expect to be implementing primarily meta-data operations, I expect it is highly likely that I will not be able to demonstrate acceptable performance if I were to use FUSE for Windows, for example.
  • Windows already has a supported mechanism for accessing files via a file identifier. We know that applications do not need hierarchical name spaces – that is one of the lessons from the Google File System paper. What applications need is a key. This is hardly a new observation: both NFS and AFS employed file identifiers in their implementations. Indeed, one reason that CDFS and NTFS on Windows have long support “open by file ID” is because they needed it to support AFP for the Services for Macintosh support in Windows NT. While the NTFS source code is not generally available, the CDFS file system source code for Windows is in the WDK – and demonstrates how CDFS supports “open by file ID”. Note that NTFS is more permissive in its support, as it allows absolute path name opens, while CDFS does not.

Now that I’ve stalled a bit, my Visual Studio 2019 installation has finished and I went to install the WDK; that installation was unhappy that it could not find the right version of the Windows SDK to install and it pointed me off to another web page to download the correct version. Installation still is not a seamless experience, it would seem.

So now, as I install the SDK, I can return to discussing my project in a bit more detail, even though this is probably unfair since I haven’t really told you much about what I am trying to achieve.

I have opined about the challenges of managing hundreds of thousands of files across multiple devices. I pointed to projects that have explored alternative ways of managing the name space (e.g., QMDS and GFS) and they have done a good job of laying the groundwork. Prior discussions I have had with others have focused on search as a paradigm, but the survey paper on File Management Research I recently described helped me better understand that navigation seems to be the approach users prefer over search.

One possible reason for this preference for navigation is the nature of the task itself. When we search on the Internet, we are seeking an answer to our question. When we look for something within our personal files, we are seeking the answer to our question. We fall back to search when navigation fails us.

The SDK finally finished installing… I’ll try the WDK again.

Over the years, I have constructed a number of virtual or pseudo file systems. For example, I once constructed a file system that didn’t store any data, it just presented an ephemeral name space. I crafted it to support a broad range of meta-data features: object IDs, file IDs, extended attributes, alternate data streams, and access control lists (ACLs). But read operations were satisfied by zero-filling the memory buffer, and writes were discarded – thus, the only data that was visible is data which persisted in the virtual memory file data cache. This was a fun file system to build, not the last of which because it really was only the namespace and meta-data management parts of the file system. My motivation for doing this was performance testing for a file systems construction framework but I realized at some point that it could also be a good baseline from which to construct a new file system.

Thus, my project direction: building an abstract file systems interface that has rich support for meta-data. Actual file I/O can then be redirected to the underlying file, within its own native file system

The WDK installation is now finished! My next steps are:

  • Constructing a virtual machine image that I can use for debugging; the wonderful thing about most file systems work is we aren’t dependent on hardware, so it is easy to do development inside the virtual machine environment.
  • Sketch out my skeleton file system model.
  • Begin to implement it.

I don’t expect to build production quality code. Over the years I have learned that the bar for production quality file systems code is quite high. Thus, my goal is to construct a working prototype and then, from that, begin building my new file system namespace.

Moving to the now

I’ve been gone for a bit; real life keeping me busy.  Part of that has been exploring the current edge of my technology space.  So I’m going to take a break from the historical paper review (but it will come back) and instead start talking about new storage technologies and the interesting things that we can do with them.

Specifically, I have been looking at the brave new world of “non-volatile dual inline memory modules”.  What this means is that we now have persistent memory that is directly accessed via processor primitives.  That means a load or a store instruction can be used to access this kind of memory.  We have actually been talking about persistent memory since the dawn of time.  Many of the early operating systems papers think of disk storage and memory as being different forms of memory.

The big change here was when computers moved away from magnetic core memory into solid state memory.  Intel’s first product was dynamic RAM (though it was invented by IBM).  DRAM was cheaper and faster than core – and much faster than magnetic storage.  Thus the abstraction of memory and disk being part of the same continuum we began to think of them as being distinct.  DRAM was transient; disks and tapes were persistent (note that magnetic core was persistent until someone read it, at which point it lost its state and had to be rewritten – even across power cycles).

Disk storage was often used to contain data from memory, such as for paging or segmentation.  Over the years the operating systems community learned many ways to make disks seem fast: caching, read-ahead, asynchronous I/O, reordering operations, etc.

Computers – and memory – were able to experience vast increases in speed.  Disk drives did improve, but only modestly in comparison.

Memory technologies have continued to evolve.  Persistent memory options increased as well; flash memory is one of the most common today and is used inside solid state drives and NVMe disks, both of which are vastly faster than rotating media disks.

Another memory technology that has been discussed for more than 20 years has been persistent RAM.  One way this has been done in “real products” has been to simply add a backup power source: some sort of battery.  Then if the power is lost, the data contents of volatile memory (DRAM) can be written to persistent storage (e.g., Flash).  This approach has been a stop-gap on the way to the actual persistent memory solution that has been promised for at least a decade: persistent memory that acts like DRAM.  Intel is now starting to ship their new Optane memory products. Samsung has announced their new Z-NAND products.  Other technologies remain “in development” as well.

Why does this matter?  Storage is suddenly getting faster.  We went from 10 millisecond access times to 0.2 millisecond access times (HDDs to SSDs).  Now we are looking at going from 0.2 milliseconds to 200 nanoseconds – six orders of magnitude faster.  This sort of change is profound.  We’ve been talking about it for many years, trying to reason about it.  It is now materializing.  Over the next several years the other promise of NVM is going to materialize: it supports higher density than DRAM.  It is slower than DRAM; where DRAM access times are on the order of 50-100 nanoseconds, NVM access times are on the order of 125-800 nanoseconds (writes being slower than reads).

File systems that have been optimized for working on hard disks or SSDs don’t necessarily make sense on NVM.

Thus, the area I’ve been looking at: expanding my own understanding of NVM.  Since this is still related to my own exploration of file systems, I’ll use this as my soapbox for exploring the space.

Let’s see where this journey goes.  And I promise, I’ll come back to the old file systems papers after this diversion.

 

 

The DEMOS File System

The DEMOS File System
Michael L. Powell, In Proceedings of the sixth ACM symposium on Operating systems principles, pp. 33-42.

This paper delves into the nitty gritty details of constructing physical file systems.  I was surprised that it had relatively few citations (61 according to Google Scholar when I checked) because, having read it, I would hand this paper to someone asking me “what are file systems?”  I suspect that the more frequently cited paper in this area will be “A Fast File System for UNIX,” which cites to this paper.

The target for DEMOS is the CRAY-1 supercomputer, at the time the fastest computer in the world.  As a matter of comparison, modern mobile devices have more computational power (and often more I/O bandwidth) than the CRAY-1 did.

DEMOS Figures 1 and 2The author discusses the design of a new file system for use with a custom operating system for the Los Alamos National Laboratory’s CRAY-1 computer system.  A key for this project was that it seeks to improve performance as much as possible.  After all, why build a super-computer if you then cripple it with features that don’t enhance its performance?

What I find delightful about this paper is that it describes the basic constituent parts of a file system, as well as strategies for optimizing performance.  It does so in a clear and understandable fashion.DEMOS Figures 3 and 4

DEMOS utilizes a UNIX-like hierarchical file system model.  It has directories and files. It does not have the link model from Multics so paths to files are unique in DEMOS.  Files are managed in units of blocks (4096 bytes) but I/O is specified as bytes (interestingly, they specify eight bit bytes as nine bit machines were still in use.)

The authors discuss file sizes.  To the best of my knowledge this is one of the earliest papers covering this common subject  (which is revisited periodically because workloads change and file sizes also change).  One of the common themes I have seen in other work is mirrored here: most files are small.  Figure 1 shows a CDF for file sizes.  We note that the majority of files in their system are small, with approximately 75% being less than 1KB; this is consistent with later work as well.  Their second figure (Figure 2) describes the proportion of transfer sizes and their source.   We see a spike in the 100, perhaps 256 or 512 being “natural block sizes” that applications would use.

demos figure 5They establish lofty performance requirements: “[T]he file system will have to support a bandwidth of 20-60 megabits/second or higher”. Our performance requirements today would be much higher, but this recognizes the reality that then (as now) the I/O bandwidth of storage is often the rate limiting factor.

DEMOS is paired with a centralized storage facility (“Common File System” or CFS) that is to provide the function of what we would now think of as a centralized file server.  While not yet implemented by the time of the paper, their plan was to introduce automatic file migration and staging.

The central bit of the paper then describes the constituent parts of the file system.  This maps rather well onto what I have seen in the typical file system: a “request interpreter” that handles requests from applications.  Even their description is appropriate: “parameter validation and request translation”; a “buffer manager” that handles the allocation of buffer cache space (often virtual cache these days); and a “disk driver” that handles low level data operations, such as filling or storing the contents of buffers.

Figures 3 and 4 capture their insight into the disk manager.  This dovetails with their discussion about efficiency of I/O, including observations about queue management (“shortest seek time first” order for requests, and then sub-sorted by “shortest latency time first”).  This is a clear “hat tip” to the impact that rotational latency and track seek time has on performance.

Speaking of performance, the authors discuss this.  It leads to their observations on improving I/O performance: “I/O operations out to proceed in parallel with computation”.  Their point is that serializing these things decreases overall performance.  Their second observation: “[T]he length of time an I/O operation takes should be reduced as much as possible.”  This seems logical and is one reason why they use their optimized strategy.

There is a section on “file system buffering” that touches on the tradeoffs between using memory for buffer caching versus other possible uses.  Thus, the authors evaluate how increased buffering impacts their CPU utilization – this is in keeping with their goal of parallelizing I/O and computation.  Their observation?  The greatest benefit comes from a small number of buffers, in their analysis eight buffers provides most of the benefit. Building on that Figures 6 and 7, they observe there is a clear limit to the benefit of further buffering.  These days we do not think too much about this because we tend to use virtual caches, so the amount of physical memory is really managed by the virtual memory management code, yet the observation would likely still apply.  There is a limit to the benefit of buffering.

The authors also point out that disk allocation is a challenging.  They employ allocation bit maps, cluster allocations, over-allocate, and even use simplistic predictive read-ahead.  They refer to these as “strategy” routines.

In general, this is a great introduction to the basic structure of a media file system.  There are plenty of details that will be refined in later work.