
Windows Filesystems: File Object Relationships

A complex intersection (Shanghai)

One of the challenges of developing file systems in Windows is related to the complex relationships that exist between various data structures in the operating system that are part of the file systems domain. In this post I want to discuss one aspect of this complex relationship because it leads to behavior that makes sense when you understand it, but leads to very counter-intuitive behavior if you do not.

Each time a user opens a file, the I/O Manager creates a new file object (I described this in Create previously.) A Windows file system is responsible for initializing specific fields of the File Object, including the SectionObjectPointers field. This is a peculiar field: the file system allocates the space but does not set the fields within it. There is a many-to-one relationship between File Objects and this field – many File Objects for the same file point at a single instance of it. The information here is used by two other parts of the operating system: the Memory Manager and the Cache Manager.

The Memory Manager is responsible for managing virtual memory. Each process has its own unique address space; an address space defines what code executing within that process will see. Code running within the address space works using virtual addresses. The CPU and operating system work together to use and manage this data. The details of how this labor is divided are a function of the CPU architecture, and they do vary across architectures. The operating system must be modified to support new CPU architectures, and this is one of the areas that can require additional programming effort.

There are quite a few distinct parts to the virtual memory system. For example, the CPU typically contains a special cache of recently used virtual-to-physical address translations (a translation lookaside buffer, or TLB). When a virtual address needs to be interpreted, the processor first looks in the TLB to see if it knows the correct mapping for the given virtual address. If it does, it uses that without accessing memory. This is quite important because when the processor does not find the entry in the TLB (a TLB “miss”) it must then convert the virtual address to a physical address using the data in memory. Memory, while fast, is much slower than the TLB: modern TLBs can be accessed within about 0.5 nanoseconds (ns), while accessing memory can often require 75 ns. Plus, converting a virtual to a physical address can require access to multiple physical memory pages, which further drives up the time required.
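To make those numbers concrete, here is a rough back-of-the-envelope calculation (illustrative, not measured): assume a 99% TLB hit rate and a miss that requires a four-level page table walk, touching four memory locations.

    average translation cost ≈ (0.99 × 0.5 ns) + (0.01 × 4 × 75 ns)
                             ≈ 0.5 ns + 3 ns
                             ≈ 3.5 ns

Even a 1% miss rate makes the average translation roughly seven times slower than a TLB hit, which is why TLB behavior matters so much for performance.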

One reason for having a virtual memory system is that the operating system can set up a range of virtual addresses without assigning physical memory to those addresses. As necessary, the hardware will invoke the operating system to allocate physical memory and then assign it to the relevant virtual memory location. This process is typically called a page fault because memory is managed in small units called pages. In order to fill in the newly allocated physical page properly, the virtual memory system needs to keep track, for every virtual address, of where its corresponding data is presently stored. This is why file systems are closely involved: they manage data. Thus, there is a symbiotic relationship between the Windows Memory Manager and the File System.

File systems in Windows also offer a non-block-oriented interface for storing and retrieving content: a read/write interface. Such byte-oriented interfaces are a mismatch with the block-oriented interfaces of most storage devices, and the file system converts between the two. In Windows, file systems typically do so by memory mapping a region of the file into memory. Windows provides another component – the Cache Manager – to assist in managing these mapped regions. Thus, we have a tight collaboration between the three components: Memory Manager, Cache Manager, and File System. Each has a distinct role to play, but relies upon the others to accomplish its task.

The following diagram attempts to capture the interesting subset of relationships that I am describing here:

Object Relationships in Windows File Systems

Let’s start by describing the individual components:

  • File Handle – this is an abstract value that Windows gives to an application to represent a file. Any time we use a file handle in the kernel, it must be validated since we do not trust applications not to have bugs or nefarious intent.
  • Section Handle – this is similar to the file handle, in that it is an abstract value that Windows gives to an application. In this case it represents a section, which is used to permit memory mapping of some or all of a file into the address space of an application.
  • File Object – this is the internal Windows kernel data structure that is used to represent an open file; we use file in its broadest sense here because it could be a file, a directory, a device interface, a communications endpoint, or pretty much anything else that “behaves like a file”.
  • Section Object – this is the internal Windows kernel data structure that is used to represent anything that can be mapped into memory. Note that this is distinct from the actual mapping. There are two kinds of section objects with which a File System is concerned: the image section object and the data section object. One key difference here is that something mapped using an image section object will be copy-on-write, so changes are not written back to disk, while something mapped using a data section object will permit reading and writing and changes to it are normally written back to storage. Both support the concept of sharing memory – in which two or more processes using the same section will also use the same physical memory.
  • Shared Cache Map – this is a control structure used by the Windows Cache Manager to track regions of files mapped into memory for use by the file systems to handle read and write calls (though file systems may choose to not use the cache). This structure belongs to the Cache Manager and shouldn’t be used by any other component (though it is useful to us, as file systems developers, to look at it in the kernel debugger).
  • Section Object Pointers – this structure is allocated by the file system from non-pageable memory. It is large enough to store three pointer values: a pointer to the ImageSectionObject, a pointer to the DataSectionObject, and a pointer to the SharedCacheMap. These map to, as you might expect, the section objects and the shared cache map; its layout is shown after this list.
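For reference, this is the structure as it appears in the WDK headers:

    typedef struct _SECTION_OBJECT_POINTERS {
        PVOID DataSectionObject;
        PVOID SharedCacheMap;
        PVOID ImageSectionObject;
    } SECTION_OBJECT_POINTERS, *PSECTION_OBJECT_POINTERS;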

With this background, I can now describe my diagram. When a File System is initializing a new File Object, one of its responsibilities is to set up storage for the Section Object Pointers structure. Typically, this is the same for each new open instance of the same file. One reason to set them up to be different is if you want to construct a separate view of a given file. For example, I have used distinct views in the past to manage encrypted files: one view is the clear view of the file, while another view is the encrypted view of the file. Similarly, I could do this if I were supporting file versioning.

If an application opens a file that has not been previously opened, the section object pointers value will point to cleared memory (all values are set to zero). If the File System then invokes the Cache Manager to set up caching (CcInitializeCacheMap), upon return, the DataSectionObject and SharedCacheMap fields will have been initialized.
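As a sketch of how a file system typically wires this up in its create path (the FCB type and field names here are my own illustration, not a real API):

    // One FCB (file control block) per underlying file; every File Object
    // for that file points at the same embedded structure, which is the
    // many-to-one relationship described earlier.
    typedef struct _MY_FCB {
        SECTION_OBJECT_POINTERS SectionObjectPointers; // must be non-pageable
        // ... file sizes, locks, names, and other per-file state ...
    } MY_FCB, *PMY_FCB;

    // In the IRP_MJ_CREATE handler, when the FCB is first created:
    RtlZeroMemory(&Fcb->SectionObjectPointers, sizeof(SECTION_OBJECT_POINTERS));

    // For every open of the file, new or not:
    FileObject->SectionObjectPointer = &Fcb->SectionObjectPointers;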

It turns out that the Cache Manager needs to have a File Object to implement some of its functionality. Given that the only File Object it has to use is the one that is passed into the cache map initialization function, it bumps the reference count on this object and stores a pointer to it.

It is rather common for a File System to defer initializing the cache until the first read, since many files are opened but never used for read or write. If an application memory maps a file, it does this in two steps: first, it opens the file, then it opens the section.
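The deferred initialization commonly looks something like this sketch in the read path (the callbacks structure and FCB header fields are typical, but treat the details as illustrative):

    // Cached (non-paging) IRP_MJ_READ path:
    if (FileObject->PrivateCacheMap == NULL) {
        // First cached access through this File Object: start caching now.
        // Note: the Cache Manager references FileObject here and keeps it.
        CcInitializeCacheMap(FileObject,
                             (PCC_FILE_SIZES)&Fcb->Header.AllocationSize,
                             FALSE,                     // no pinned access
                             &MyCacheManagerCallbacks,  // lazy writer callbacks
                             Fcb);                      // callback context
    }

    if (!CcCopyRead(FileObject, &ByteOffset, Length, Wait, Buffer,
                    &Irp->IoStatus)) {
        // Could not be satisfied without blocking; post the request.
    }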

Since the Memory Manager “owns” sections, the call to open a section is sent to it, along with the file handle for the file to be mapped. The Memory Manager converts the file handle to a File Object. Then it looks inside the section object pointers value to see if there is an existing DataSectionObject value. If there is, it uses that section object to complete the call; no interaction with the file system is required. If there is not an existing DataSectionObject, it will create a new one. This turns out to require some level of interaction with the File System because the Memory Manager needs to know how big the file is – after all, the size of the section and the file need to be the same. Once the new section object is created, a pointer to it is stored in the SectionObjectPointers block associated with the file. Since the Memory Manager may need to retrieve data from the given file, it maintains a reference to the File Object.

At this point it is worth noting that if a file is memory mapped there is no shared cache map. Recall that when the File System calls the Cache Manager it passes a File Object, which the Cache Manager will use.

Thus, if a file is first memory mapped and then subsequently opened again and used for cached read/write operations, we find a situation where the Memory Manager is using one File Object, and the Cache Manager a second File Object. If we do this in the reverse order, the Cache Manager and Memory Manager use the same File Object for their references because the Cache Manager creates a section object using the same File Object that was passed to it by the File System.

So, when might we have a third distinct File Object? When the file is an executable image. Suppose you copy an executable program. The copy program opens the target file for data access and then writes the contents into the file. Unless this is done without memory mapping or caching, there is now a data section object backed by this file (and thus a File Object). Next you decide to execute it: this time, when the file is loaded for execution, the Memory Manager will need an image section object. If one does not exist, it creates one. This is one way you can end up with two section objects for a single file. If you combine that with the memory mapping versus read/write usage described earlier, you can end up with three distinct File Objects for a single file that is only in use by one application.

But for a File System there are some more interesting effects as a result of this shuffling around:

  • A paging write operation may be performed against a File Object that was opened for read. This is because the Memory Manager uses the File Object stored in the section object; it does not know how the original file was opened. A subsequent open of the file for write continues to use the File Object that was opened for read and passed to the Memory Manager when the section was created. I will note there is now an API for switching this file object out (see the sketch after this list), as this behavior can be challenging for some file systems to handle.
  • Paging I/O Request Packets (IRPs) can be sent to the file system with any of the three File Objects; it will depend upon the reason the paging request is being sent.
  • The lifetime of these three File Objects is now decoupled from the user file handle. What this means is that when the user application closes the file, the File Object’s reference count will not drop to zero. So the IRP_MJ_CLEANUP will not be immediately followed by an IRP_MJ_CLOSE. It also leads to a peculiar situation where the IRP_MJ_CLOSE is sent inside the cleanup handler, but I’ll talk about that another time.
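I believe the file-object-swap API mentioned in the first bullet is FsRtlChangeBackingFileObject (available since Windows 7); here is a hedged sketch of a call, with the parameters as I understand them from ntifs.h:

    // Swap the file object that backs the data section to one that was
    // opened with write access. If CurrentFileObject is supplied, the
    // change only occurs if it is still the current backing file object.
    NTSTATUS status = FsRtlChangeBackingFileObject(
                          CurrentFileObject,     // optional
                          NewFileObject,
                          ChangeDataControlArea, // FSRTL_CHANGE_BACKING_TYPE
                          0);                    // flags (reserved)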

These relationships, while complex, make sense in the context of the interactions between these components. Hopefully this basic description will help those writing Windows File Systems better understand the interaction patterns.

File System Driver: Create

The road ahead.

The usual place to start when building a file system is to think about the Create operation. This may also be referred to as an open operation, but that conflates the object with the handle.

I think of Create as being “create a handle to the object”. The creation of the object itself can be a side effect of creating the handle; it does not make much sense to create a handle to something that doesn’t exist.

POSIX actually does have a create system call (at one point it was creat, for obscure historical reasons). Now you can call open with O_CREAT and it will perform the same operation. In Windows the native call is NtCreateFile, though there is also an NtOpenFile; both ultimately invoke the same internal kernel operation, IoCreateFile (or its successor IoCreateFileEx – the Ex suffix is one way of saying “oh, we need to pass in more parameters, so sorry”). I’ll just talk about IoCreateFile since it is shorter, and if you need the other version for some peculiar reason you can figure it out. Remember, it’s an operating system internal call, so it only matters to people writing software to execute in the kernel.

Inside Windows, IoCreateFile is actually a rather large, complicated function: while the concept of creating a new file object is simple, the details are messy because, as much as we like to pretend files are nothing more than byte streams, there are so many special cases and exceptions to this that our illusion is thin, at best. UNIX suffers from this as well, with symbolic links and special files. So, files are byte streams, except when they aren’t.

Since I’m writing this article as a description of how file systems on Windows work, I’ll stick with talking about Windows behavior and leave other operating systems’ behavior for another time.

Windows Flow of Control for Create (Simplified)

The animation shows the basic flow of a request through the system:

  1. An application opens (or creates) a file. Depending upon the implementation for the subsystem, it will call through the subsystem itself; often this is just implemented inside a shared library (dynamic link library or DLL).
  2. Since this requires a system call, it will invoke the relevant system call interface inside ntdll.dll, which is mapped into every process in the system. It will format the request as appropriate for the platform and then issue a system call (syscall/sysenter on Intel platforms, or svc on ARM platforms).
  3. The Windows system call dispatcher will forward this to the I/O Manager, since it handles files.
  4. The I/O Manager is presented with a name, but it does not yet know which device and driver will handle this specific request. Thus, the I/O Manager will ask the Object Manager (ObLookupObject) to parse the name until it finds a relevant device. Assuming the name does reference a specific device, the Object Manager will then invoke the I/O Manager, because DeviceObjects (and FileObjects) point to a parse function registered by the I/O Manager with the Object Manager – in this case IopParseDevice. The I/O Manager now has a DeviceObject and the balance of the name, and it can allocate an I/O Request Packet (IRP) and set it up. In the case of a physical file system, however, the I/O Manager must reference the volume parameter block (VPB); it retrieves the relevant file system’s device object from the VPB. It asks the Object Manager to create a new FileObject, completes initialization of the FileObject, and formats the IRP with the relevant parameters. Create is a complex I/O request, and its fields are scattered, unlike those of other I/O requests. Once the IRP is formatted, the I/O Manager will invoke the file system driver (via IoCallDriver).
  5. The file system driver will receive the IRP and process the request, which may involve creating a new file as a side effect of creating the file object. It will parse the balance of the name. It may check security and sharing, allocate space for new files, and validate options. It must handle symbolic links at this point as well (including reparse points) and format return information appropriately. It might attach extra create parameters (ECPs) to the I/O request. It could perform the operation within the context of a kernel transaction if the FileObject is part of a transaction. It may have special cases for volumes, directories, alternate data streams, or other file system specific behavior. While the Windows I/O model allows any IRP to be processed asynchronously, the I/O Manager will block and wait for completion of this request. The file system may also need to perform additional operations for oplocks, which even have a case where the create is completed although the file is not yet usable.
  6. Once the file system is done processing the request – successfully or not – it will complete the request by calling IoCompleteRequest. The I/O Manager will unwind the I/O request stack (in case there are filters, which there almost always are now) and once done, it will copy results from the kernel to the user address space and return control to the system call dispatcher.
  7. The system call dispatcher will restore state, set the return code in the return register, and complete the system call.
  8. In user mode, ntdll will process the request, may raise exceptions if needed, and return to the application caller.
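To ground steps 5 and 6, here is a minimal sketch of a create dispatch routine; everything interesting is elided, and the shape is generic rather than taken from any particular file system:

    // Registered in DriverEntry:
    //   DriverObject->MajorFunction[IRP_MJ_CREATE] = FsdCreate;
    NTSTATUS FsdCreate(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PIO_STACK_LOCATION irpSp = IoGetCurrentIrpStackLocation(Irp);
        PFILE_OBJECT fileObject = irpSp->FileObject;

        // The real work goes here: parse the remaining name, check
        // security and share access, find or create the per-file control
        // structures, and initialize fileObject->FsContext, FsContext2,
        // and SectionObjectPointer.
        NTSTATUS status = STATUS_SUCCESS;

        Irp->IoStatus.Status = status;
        Irp->IoStatus.Information = FILE_OPENED; // or FILE_CREATED, etc.
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return status;
    }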

For a Windows file system Create is often one of the most complex routines – I have seen file systems with almost 20% of their code in this path. There are numerous edge conditions: open by file ID, for example, as well as oplocks, not to mention creating new files, creating in-memory data structures and managing the many-to-one relationship between file objects and the file system control structures. Each open of a file creates a new FileObject. I will discuss why this becomes complicated in a future post, because it leads to unexpected behavior for the unwary.

File System Driver: Structure

I thought I would discuss the structure of my driver and why I chose to structure it the way that I have done. Of course, I could change the structure of the driver again, but I will need to see a compelling reason to do so.

First, I will note that the driver has changed since I last wrote about it. I was not entirely happy with what I saw as a bit too much complexity, some of which is just me fighting with the vagaries of C++ versus C. So of course, I have refactored it (though honestly, my first post has not yet appeared as I write this, so you can probably ignore the refactoring).

I chose to structure my driver so that each device object can have its own request handlers. My rationale for this was fairly straightforward: it is quite common in these drivers to have a control driver and then a file system driver. Usually these get conflated together, so I thought I would try to set up my driver so that I could more easily split out that functionality.

As of this writing, I now have a control object. The driver creates a control instance as part of DriverEntry. I’ve also moved the project over to the larger project of which it is a part (ergo, the reason I decided to build a new file system driver in the first place). You can find that project on GitHub – Project Araneae, a name that relates to spiders because the goal of the project is to create a relational web across files. Much of that project won’t have anything to do with Windows file systems, but I’m going to keep working on my file system because parts of it will eventually need just such a file system.

Fortunately, as soon as I started up Visual Studio 2019 it told me that it had an update – so I’m using the time while that installs to write more about my work.

Since this is now the Tarantula project (a spider name), I have renamed things somewhat. The base device object extension class that I use is TNativeDevice. I’ve also started to try using the Windows Implementation Library (WIL) which has a nominal amount of Windows kernel support. I poked at this a bit and it is tantalizing to try using it, but rather than spend too much time on meta-programming (templates) for now, I’ll curb my enthusiasm and keep poking at it when I have more time to do so.

Trying to get C++ to use an array of function pointers to methods for each major function code turned into quite a bit of hassle, so I decided to just encode the functions as a virtual interface; each one is set up to take the IRP as its input parameter, the default implementation does the right thing, and I moved the new operator into the class itself (I was getting ugly conflicts with WIL and, truthfully, I prefer specializing new because I like having unique tags). The last time I did this (several years ago) I used a template technique for the operator overrides and constant tag values. This time I’ve just manually coded them and moved the tags into a common file (which has advantages later when I want to find all the tags that I’m using).
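A rough sketch of the resulting shape (simplified from the real code; the tag value is just an example):

    class TNativeDevice {
    public:
        // One virtual method per major function code; the default
        // implementations complete the IRP appropriately.
        virtual NTSTATUS Create(PIRP Irp);
        virtual NTSTATUS Cleanup(PIRP Irp);
        virtual NTSTATUS Close(PIRP Irp);
        virtual NTSTATUS DeviceControl(PIRP Irp);
        virtual ~TNativeDevice() = default;

        // Class-specific new/delete so each allocation carries its own
        // pool tag; all the tags live together in one common header.
        void* operator new(size_t size)
        {
            return ExAllocatePoolWithTag(NonPagedPoolNx, size, 'vDNT');
        }
        void operator delete(void* ptr)
        {
            if (ptr != nullptr) {
                ExFreePoolWithTag(ptr, 'vDNT');
            }
        }
    };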

I created a specialization of the TNativeDevice for the control object and aptly named it TNControlDevice. It has support for Create, Cleanup, Close, and DeviceIoControl. The other functions are handled appropriately. I have kept things simple – I am not auto-creating the control device in the constructor, for example, and instead create it explicitly in DriverEntry. I still use the DeviceExtension for storing my control object (the class instance). When I load the driver, I see the control object appear in the object manager namespace. When I unload the driver, I see the control object disappear. I did notice that somehow I broke the inf script, so I’ll have to debug that at some point (though that is never fun).

I then turned my attention to reconstructing code I’ve written previously. One piece is loading the registry parameters. I’ve tried a couple of approaches to this over the years, and my favorite is to load all the values and then support a query interface. In the past I’ve even spawned a thread to monitor changes and reload things, but at this point I have not done so. This allows me to add things such as the registry-controlled initial driver breakpoint (a BreakOnEntry value), as that’s proven useful over the years. It isn’t something that would be enabled in production. I expect to also add information about what pseudo file system volumes to add – but that’s really the point here: I can use that registry information however I see fit.
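Here is roughly what the BreakOnEntry load looks like with RtlQueryRegistryValues (a sketch; error handling trimmed, and RegistryPath is the string DriverEntry receives):

    ULONG breakOnEntry = 0;

    RTL_QUERY_REGISTRY_TABLE query[2] = {}; // zeroed last entry ends table
    query[0].Flags = RTL_QUERY_REGISTRY_DIRECT;
    query[0].Name = L"BreakOnEntry";
    query[0].EntryContext = &breakOnEntry;
    query[0].DefaultType = REG_DWORD;
    query[0].DefaultData = &breakOnEntry;   // stays 0 if value is missing
    query[0].DefaultLength = sizeof(breakOnEntry);

    NTSTATUS status = RtlQueryRegistryValues(RTL_REGISTRY_ABSOLUTE,
                                             RegistryPath->Buffer,
                                             query, nullptr, nullptr);

    if (NT_SUCCESS(status) && breakOnEntry != 0) {
        DbgBreakPoint(); // development aid; never enabled in production
    }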

I have also added a library to wrap the ERESOURCE objects that are the lifeblood of the file system’s locking mechanisms. I have not gone so far as to reconstruct the lock ranking package that I previously built, but I suspect I will do so at some point; the last version of that was quite good at validating the ranking of locks within the hierarchy and enforcing proper ordering, which in turn avoided deadlocks. It also allowed invocation of an optional function for computing a hash value of the protected object, which permitted me to detect when a data structure was modified without the lock being held. It would not tell me where that happened, but at least it would alert me to the need to do a code review (or set an access breakpoint and find where it was happening).
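The wrapper itself is simple; the sketch below shows just the core (the lock-ranking and hash-validation layers described above are omitted). Note that an ERESOURCE must live in non-paged memory and be acquired inside a critical region:

    class TNResource {
        ERESOURCE m_resource; // the containing object must be non-paged

    public:
        TNResource() { ExInitializeResourceLite(&m_resource); }
        ~TNResource() { ExDeleteResourceLite(&m_resource); }

        void AcquireExclusive()
        {
            KeEnterCriticalRegion(); // disable kernel APCs first
            ExAcquireResourceExclusiveLite(&m_resource, TRUE);
        }

        void AcquireShared()
        {
            KeEnterCriticalRegion();
            ExAcquireResourceSharedLite(&m_resource, TRUE);
        }

        void Release()
        {
            ExReleaseResourceLite(&m_resource);
            KeLeaveCriticalRegion();
        }
    };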

Doesn’t sound like a lot of file systems code yet, does it? So far, it isn’t. I will need another specialization of TNDeviceObject for my pseudo file system, which means I want to know what drive letter to use when creating it and the logical place to get that is from the registry – hence why I was working on the registry package in the first place.

Thinking through that point, I realized I need to construct a TNFileObject class as well. I suspect at some point I will want to have specializations of this, but I am fine with waiting until I need it to create one. The TNFileObject will contain at least two separate memory allocations: one is for the non-paged objects that must live in kernel memory, such as the SectionObjectPointers structure. There is always a fair bit of state that is safe to page out, and thus there will be a second memory allocation for the blocks of memory that can be safely paged out. Indeed, I suspect most of the state for my driver will be safely pageable.

One final note for those of you who might be reading and just getting started writing drivers for Windows: make sure you turn on driver verifier. This is done using the verifier application, which is installed on Windows. Make sure you enable verification for your drivers at least. Then learn how the !verifier command works inside WinDBG, because you will need it.
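For example, from an elevated command prompt on the test machine (the driver name is a placeholder, and a reboot is required for the settings to take effect):

    verifier /standard /driver tarantula.sys

Then, once attached with the kernel debugger, !verifier shows the current verification state and statistics.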

I’ll keep posting about this periodically as well. Feel free to also check out the code online. Let me know if you have questions or suggestions.

Setting up Debugging

No matter how many times I do this over the years, I find it to be a slow and painful process – install a clean virtual machine image, set up kernel debugging, install the (test signed) driver.

It never ceases to amaze me how non-intuitive it is to set up communications with a new VM image on my own computer. This time was certainly no exception. Since I’m running on Windows 10, I chose to use Hyper-V (otherwise I can’t use the Windows 10 Docker support, for example). I downloaded the latest and greatest version of Windows 10 (1903 Enterprise) and installed it inside a Generation 2 hypervisor. There were the usual challenges (oh right, I reset the machine, so of course none of my Hyper-V networking choices worked, and of course I don’t have an external network switch set up). So I set up the VM with no networking. Windows is most displeased with this option and warns me that a machine without networking is like a Bloody Mary without vodka; one nice upside was that it didn’t prompt me to log into an internet account, so I gleefully set up a local account.

Then of course I had to figure out (again) how to get things transferred between the local machine and the virtual machine. I made the usual changes to enable ICMP (“advanced settings” in the firewall, where it is buried under “File and Printer Sharing” – because obviously supporting ICMP is all about file and printer sharing).

I was amused to find that I could ping from my laptop to the virtual machine, but not the reverse. Changing the same File and Printer Sharing options on the laptop fixed that issue. The next trick was to get SMB working, since I have to move files between the two machines. I enabled other File and Printer Sharing options (for the private/domain network options – I don’t want to share on a public network, after all) but that still didn’t work. So I turned off the firewall on the virtual machine; not best practice vis-à-vis security, but it is a test machine.

Voila! I can map a share. So now I copy kdnet from the host to the debug target. It of course tells me that I have to turn off secure boot to continue. I first tried with PowerShell but it balked at me so I sighed and shut down the VM image, changed it inside the Hyper-V management console, rebooted and restarted. I could then run kdnet successfully to enable boot debugging (admittedly, that is a nice improvement, since I’m used to dealing with bcdedit to enable kernel debugging). I fire up the debugger and reboot the VM – and it hangs.
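For anyone following along, the kdnet step looks roughly like this; the address and port are illustrative:

    REM On the target VM (elevated), pointing at the debugger host:
    kdnet 192.168.1.2 50005

    REM kdnet prints the host-side command to use, something like:
    windbg -k net:port=50005,key=<generated key>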

So I start poking around on the host machine. Much to my annoyance, it seems that Hyper-V has installed a virtual adapter on top of my wireless and marked it as a public network. Well of course that isn’t going to work so well. The option to change the network type just silently doesn’t show up on the laptop (ironically it did show up inside the VM). After some internet searching and poking and prodding I get the network type changed to private using PowerShell – and then the debugger attached and the VM finished starting. Life is good!

I’ve now copied my test driver, inf script, and certificate over to the test machine. I went through the steps of installing the test certificate as well.

While doing all of this I also did a quick bit of research as to the least expensive code signing certificate. At the time I write this (May 2019) the best option (if your primary concern is cost) is Digicert. They list a price of $99 per year for an EV code signing cert. Of course EV certs are a bit of a pain to get, since they actually do some small amount of work to verify that you’re legit; I need to do a bit more work to ensure that my phone number is independently verifiable (one of their requirements) as I’m not quite sure how they will validate it. But I don’t expect I’ll need that for a while yet, since that’s really needed when I want to hand the driver to someone who isn’t a developer and wants to test the driver.

I installed the driver and… the machine crashed. So I know that my configuration is now working and I can debug the driver. Progress!

Starting the Skeleton Driver

Screen shot of my newly created skeleton driver

In my last post, I installed the WDK (and described why I want to build a file system driver). I started up Visual Studio 2019, said I wanted to create a new project, narrowed down the options to “WDK” related projects, and scrolled down to the WDM driver option.

I’m not building a WDM driver, but it is the closest project type to what I want to do. It creates a solution with the name specified and then creates a project with the same name. The only file that it pre-constructs for me is an inf file. I will need to do work on that before I can use it, but I’ll leave that for later. File system installation files are surprisingly uncomplicated, since all we really need to do to install a file system driver is set up a few registry keys.

Since I had just installed Visual Studio 2019, I’ll need to tune things to my working environment. I started by enabling git integration, since I will be using github.com for my source code repository (winskel).

That took more time than I anticipated – I installed the Github integration into Visual Studio, which restarted Visual Studio. I was then told “there is an update to GitHub extension for Visual Studio”. Thus, I installed the update next. That required another restart to install the update. I hope the Visual Studio folks take a lesson from the VS Code team, since I install VS Code extension updates all the time with just a refresh, not a full restart. Of course, I used that time to continue adding to my post here, so it wasn’t entirely wasted time. Still, it is stunning that they construct a restore point just for installing new extensions.

The new helpful error message upon restarting!

I really liked the fact that Visual Studio 2019 suggested to me that I could make startup faster by disabling the WDK extension – how helpful, given that the reason I’m running Visual Studio 2019 is because I want to use the WDK. It makes me long for the days of SOURCES files and command line program building. I know it is possible to develop without using Visual Studio and perhaps I’ll explore that again at some point, but I’d rather be writing code for my new driver rather than fussing with the tools and environment at this point.

I enable code analysis – it can be annoying, but it also finds bugs.

Since this is a new project, I’m going to enable the static code analysis tools. While not required, I choose the “All Rules” option because it is the most restrictive setting available. Note that I am applying this to all the configurations (debug and release) as well as the platforms for which I have installed the compiler tools (I did not install the ARM compiler tools, so I cannot include them).

Having enabled the checks, I built my simple file with just DriverEntry (and an error return). Of course, as I expected, the static analysis tools are now reporting issues, so I add annotations (DRIVER_INITIALIZE DriverEntry; for example) and modify my code (the static analyzer points out that both the DriverObject and RegistryPath can be set as const pointers). Since I will be changing DriverObject I suppress the warning. I don’t expect to change the RegistryPath, so I mark it as const.
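After those changes, the skeleton looks roughly like this (simplified; I’ve left out the const and suppression details mentioned above):

    #include <ntddk.h>

    DRIVER_INITIALIZE DriverEntry; // the annotation the analyzer wants

    _Use_decl_annotations_
    NTSTATUS
    DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
    {
        UNREFERENCED_PARAMETER(DriverObject);
        UNREFERENCED_PARAMETER(RegistryPath);

        // Deliberately fail for now: the driver loads, then unloads.
        return STATUS_UNSUCCESSFUL;
    }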

I also had a warning that while the spectre/meltdown mitigation option has been selected for the compiler, the libraries with the needed mitigations are not installed. So back I went into the installer and installed the missing libraries. Things now build well, and I have my super-minimal driver. It won’t do much, since the DriverEntry function returns a failure code, which means it will load and then unload.

However, this is enough for me to make the inf file work, so I will do that next.

This is the default INF file that Visual Studio provided to me.

Visual Studio generated a default INF file for me. This isn’t quite enough for me to install a working driver, so I’ll need to modify it. Plus, Microsoft changed some details about INF files for Windows 10 1903 and created a new primitive driver type with rules that need to be followed if you want the driver to be properly (test) signed.

So I worked through the INF file issues and I now have a working INF file, with a driver that (of course) won’t actually do anything yet.

Next, I turned my attention to pulling together a C++ runtime so that I can use C++ if I want. Basically, there are several things that need to be done to make this work:

  • I need memory management functions
  • I need initializer support
  • I have to wrap the standard functionality (DriverEntry) and coordinate the Unload function so it calls the cleanup logic.

In the past, I’ve added a template layer above the allocators, which permits me to specify (on a per-object-type basis) what the pool type and pool tag are for the allocations. Unlike in user mode, where we normally don’t worry about these things, in the kernel we do need to worry about whether memory is pageable or not. Plus, we have to provide some mechanism for finding memory leaks, since there is no automatic garbage collection. Note that my goal isn’t to port the STL into the Windows kernel (though I did see one project where it looked like someone had done that). Similarly, I don’t plan on supporting C++ exception handling. So it will provide me with most C++ code features, but I’ll eschew those that require specialized run-time support.
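As a sketch of that template layer (the names and tag are illustrative; pool tags display reversed in pool-tracking tools, so 'cbFT' shows up as TFbc):

    // Base class that fixes the pool type and tag for each object type.
    template <POOL_TYPE PoolType, ULONG PoolTag>
    struct TPoolObject {
        void* operator new(size_t size)
        {
            return ExAllocatePoolWithTag(PoolType, size, PoolTag);
        }
        void operator delete(void* ptr)
        {
            if (ptr != nullptr) {
                ExFreePoolWithTag(ptr, PoolTag);
            }
        }
    };

    // Example: a pageable per-file context type with its own tag.
    class TNFileContext : public TPoolObject<PagedPool, 'cbFT'> {
        // ... state that is safe to page out ...
    };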

As I wrap this up for the day, I have the allocation routines plumbed. The next step is to get the initializer code written – it revolves around walking through some memory locations where global and static constructors need to be called – the Microsoft C++ compiler embeds some magic information in memory to do this. I also need to construct a list of things to be called when terminating the runtime.
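The mechanics, as I understand them for the Microsoft compiler, look like the sketch below: the compiler emits pointers to the dynamic initializers into .CRT$XC* sections, and bracketing those sections lets the runtime walk and invoke them (this mirrors what the user-mode CRT does):

    typedef void (__cdecl *PVFV)(void);

    #pragma section(".CRT$XCA", read)
    #pragma section(".CRT$XCZ", read)
    __declspec(allocate(".CRT$XCA")) static PVFV CrtBegin[] = { nullptr };
    __declspec(allocate(".CRT$XCZ")) static PVFV CrtEnd[]   = { nullptr };

    // Call every global/static C++ constructor; the linker sorts the
    // .CRT$XC* sections alphabetically, so everything the compiler
    // emitted lands between our two markers.
    static void TNRunGlobalConstructors()
    {
        for (PVFV* entry = CrtBegin; entry < CrtEnd; ++entry) {
            if (*entry != nullptr) {
                (*entry)();
            }
        }
    }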

Once that’s done, I’ll move on to adding basic functionality. One thing that will greatly simplify this initial effort is that I don’t have to worry about integration with the memory manager or cache manager because I can defer I/O management to the native file system. Perhaps, once we’ve proven the viability of this approach, I can look further at integration.

I will continue describing my progress and updating the repository as I work through this project over the coming months.


Let’s Build a Windows File System

It’s been a while since I upgraded my Windows kernel development tools, so I thought I’d write about the steps I’m taking to do so. How you build a Windows file system has changed over the years but the basic structure of the file system driver itself has not.

At the time I’m writing this, the standard tools to do this are the Microsoft Visual Studio Integrated Development Environment and the Windows Driver Kit (WDK). The WDK download page (which could move, I just search for “Microsoft WDK” when I need to find it) actually describes the basic steps needed to develop drivers for Windows. As I write this, Visual Studio 2019 is downloading. I do not need many of the parts of it, so I just installed the “Windows Desktop Apps” components. Don’t forget to install the Windows SDK as well (it’s an optional component as part of the Visual Studio installation).

Once Visual Studio is installed, I’ll install the WDK. It will add the WDK integration components into Visual Studio and those will permit me to build Windows drivers. That will also install one of the most important tools: the debugger. I have been working with WinDBG for several decades and it is now an indispensable tool for kernel debugging, including both a graphical user interface as well as a command line version. It has support for debugging over a variety of transports (it used to support debugging over modems, but I haven’t done that in many years, so it may no longer be supported) including serial ports, USB ports, networks, and synthetic debugging (over a virtual device) for virtual machines. Since I will be using a virtual machine for my development, I will set up synthetic debugging. Easing this, it seems Microsoft has automated the process. I have never done it automatically, so I look forward to seeing if that works.

I still need to write about what I am building, but I will defer that for the moment because it deserves a post of its own. I can explain why I want to build this as a file system, though. I hope this is useful for anyone reading it who is thinking of building a file system.

The file systems interface, whether it is UNIX, Linux, MacOS, or Windows, is a clearly delineated boundary at which I can implement functionality that becomes available to anyone using the file system itself. The specifics of that interface vary somewhat across operating systems, but there is a high degree of commonality. One reason for this is that mainstream operating systems have a common heritage. Perhaps that will change in the future (one recent paper suggests that our OS architecture assumes that I/O is slow relative to processors and memory, an assumption that is no longer true) but it seems unlikely to change in the near-term.

The reason to implement functionality at the file systems interface is that essentially all applications for our operating systems know how to use it. New functionality – provided it conforms to the file systems interface – can be exposed to all those applications by supporting that interface, which means it seamlessly integrates into the existing system. If the new functionality is substantially different from what can be provided by the file systems interface, this will not be a good solution. Since my goal is to provide both compatibility and new functionality, I am using the file systems interface but will also look at ways in which I can augment it further. Indeed, one interesting aspect of the GFS paper I described earlier is that they tried to implement functionality within the constraints of the existing interface. This has made me think about how I would go beyond what they did without changing the interface, though I do expect at some point I will need to augment the existing interface.

I decided to do this on Windows for several reasons:

  • I am familiar with Windows file systems development; I am comfortable in the kernel environment, and I know how to integrate user mode services and kernel mode drivers together to provide my desired functionality;
  • I am looking at how to extend the namespace management. By doing this as a file system, I know it will be visible in standard applications, which in turn will make it easier for me to gauge how effective the changes are for ordinary users.
  • More than a decade ago Microsoft incorporated quite a few interesting features into their user interface in anticipation of the new Windows File System (WinFS). While WinFS was never released, its goals of augmenting existing file management mechanisms were partially integrated into Windows.
  • The NTFS file system in Windows supports a change journal that will be a good place to start for prototyping capturing some relationship data. I expect that we will ultimately provide additional mechanisms for doing this, but this should provide a good head start.
  • User mode file systems, while convenient for prototyping, are a dead end if I need a kernel mode file system. User mode file systems are known to be much slower, and this is amplified for meta-data operations. Since I expect to be implementing primarily meta-data operations, I expect it is highly likely that I will not be able to demonstrate acceptable performance if I were to use FUSE for Windows, for example.
  • Windows already has a supported mechanism for accessing files via a file identifier. We know that applications do not need hierarchical name spaces – that is one of the lessons from the Google File System paper. What applications need is a key. This is hardly a new observation: both NFS and AFS employed file identifiers in their implementations. Indeed, one reason that CDFS and NTFS on Windows have long supported “open by file ID” is because they needed it to support AFP for the Services for Macintosh support in Windows NT. While the NTFS source code is not generally available, the CDFS file system source code for Windows is in the WDK – and demonstrates how CDFS supports “open by file ID”. Note that NTFS is more permissive in its support, as it allows absolute path name opens, while CDFS does not. (A sketch of an open-by-file-ID call follows this list.)
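Here is a sketch of an open-by-file-ID call through the native API (from user mode this means NtCreateFile via ntdll; in the kernel the equivalent is ZwCreateFile). The volume handle and file ID come from elsewhere, and the access mask and options are illustrative:

    // fileId: the 8-byte identifier; volumeHandle: an open handle to the
    // volume (for example \??\C:). The "name" is just the raw ID bytes.
    LONGLONG fileId = 0; // obtained from the change journal, a query, etc.

    UNICODE_STRING name;
    name.Buffer = (PWSTR)&fileId;
    name.Length = name.MaximumLength = sizeof(fileId);

    OBJECT_ATTRIBUTES oa;
    InitializeObjectAttributes(&oa, &name, OBJ_CASE_INSENSITIVE,
                               volumeHandle,   // ID is relative to the volume
                               NULL);

    IO_STATUS_BLOCK iosb;
    HANDLE handle;
    NTSTATUS status = NtCreateFile(&handle, FILE_GENERIC_READ, &oa, &iosb,
                                   NULL,                  // allocation size
                                   FILE_ATTRIBUTE_NORMAL,
                                   FILE_SHARE_READ,
                                   FILE_OPEN,             // open, never create
                                   FILE_OPEN_BY_FILE_ID |
                                       FILE_SYNCHRONOUS_IO_NONALERT,
                                   NULL, 0);              // no EA buffer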

Now that I’ve stalled a bit, my Visual Studio 2019 installation has finished and I went to install the WDK; that installation was unhappy that it could not find the right version of the Windows SDK to install and it pointed me off to another web page to download the correct version. Installation still is not a seamless experience, it would seem.

So now, as I install the SDK, I can return to discussing my project in a bit more detail, even though this is probably unfair since I haven’t really told you much about what I am trying to achieve.

I have opined about the challenges of managing hundreds of thousands of files across multiple devices. I pointed to projects that have explored alternative ways of managing the name space (e.g., QMDS and GFS) and they have done a good job of laying the groundwork. Prior discussions I have had with others have focused on search as a paradigm, but the survey paper on File Management Research I recently described helped me better understand that navigation seems to be the approach users prefer over search.

One possible reason for this preference for navigation is the nature of the task itself. When we search on the Internet, we are seeking an answer to our question. When we look for something within our personal files, we are usually seeking a specific item we know exists. We fall back to search when navigation fails us.

The SDK finally finished installing… I’ll try the WDK again.

Over the years, I have constructed a number of virtual or pseudo file systems. For example, I once constructed a file system that didn’t store any data; it just presented an ephemeral name space. I crafted it to support a broad range of meta-data features: object IDs, file IDs, extended attributes, alternate data streams, and access control lists (ACLs). But read operations were satisfied by zero-filling the memory buffer, and writes were discarded – thus, the only data that was visible was data that persisted in the virtual memory file data cache. This was a fun file system to build, not least because it really was only the namespace and meta-data management parts of a file system. My motivation for doing this was performance testing for a file systems construction framework, but I realized at some point that it could also be a good baseline from which to construct a new file system.

Thus, my project direction: building an abstract file systems interface that has rich support for meta-data. Actual file I/O can then be redirected to the underlying file, within its own native file system.

The WDK installation is now finished! My next steps are:

  • Constructing a virtual machine image that I can use for debugging; the wonderful thing about most file systems work is we aren’t dependent on hardware, so it is easy to do development inside the virtual machine environment.
  • Sketch out my skeleton file system model.
  • Begin to implement it.

I don’t expect to build production quality code. Over the years I have learned that the bar for production quality file systems code is quite high. Thus, my goal is to construct a working prototype and then, from that, begin building my new file system namespace.

US Patent 9,830,329 Issued

US Patent 9,830,329 issued November 28, 2017.

It is about a technique for utilizing a pre-existing tunneling mechanism to invoke remote functionality. The original inspiration for this was the need to communicate from a component on a client system to utilize functionality available on the server.

The first case for this was when I wanted to obtain meta-data that was maintained on the server, but not available to me on the client. By exploiting a pass-through mechanism in the Windows SMB client, I could send a query to the server, have a component on the server get the answer, and respond back to my software on the client. This turns out to be useful in several cases – and doing it through the existing SMB infrastructure allowed a clean, functional solution.

Windows and “reserved names”

I’ve fallen into the habit of answering some questions on Quora, and those involving Windows file systems seem to garner the most interest and the most disagreement.

Yesterday I answered a question entitled Why can’t I save a folder name “con” in Windows?  I decided to respond to this because the other answers that I read said that Windows reserved this name.   My response demonstrated how you could easily do this with bash running in WSL.  Some comments seemed to think that using bash was cheating, so I sat down and wrote a simple program to create folders with reserved names.  It isn’t bound to WSL and executes as a standard console program (so it is executed from the command line).  I’ve published the source code on github.
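I won’t reproduce that program here, but the essence of it is small. As a sketch of the trick (mine, not necessarily the published code): the \\?\ path prefix tells Win32 to skip DOS-device name parsing, so “con” becomes an ordinary name:

    #include <windows.h>
    #include <stdio.h>

    int wmain(void)
    {
        // The \\?\ prefix bypasses the Win32 reserved-name handling.
        if (CreateDirectoryW(L"\\\\?\\C:\\con", NULL)) {
            wprintf(L"Created C:\\con\n");
        } else {
            wprintf(L"CreateDirectoryW failed: %lu\n", GetLastError());
        }
        return 0;
    }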

It’s been a few months since I’ve posted anything about Windows file systems and I realized that I should start doing so again.  So, that little application program inspired me.

The difficulty that I have with assertions that “Windows doesn’t do that” or “Windows doesn’t allow that” is that they often conflate the limitations of applications, libraries, or subsystems as being limitations of Windows.  Windows NT was developed to be a general purpose host operating system.  The expectation was that it would have subsystems which implemented specific execution environments. While this includes Win32 (the subsystem most often used by applications) it has never been the only subsystem available.  Windows NT 3.1 shipped with Win32, NTVDM, OS/2, POSIX, and native subsystems (though native is more “not bound to a subsystem”).  I had also heard rumors of a UNIX System V Release 4 subsystem as well, but to the best of my knowledge it was never released and I never saw it.

Windows dropped support for OS/2 in Windows XP.  The POSIX implementation changed as it became more like a library that called Win32 and native as needed.  What remained of POSIX was dropped in Windows 8.1.  The Windows Subsystem for Linux (WSL) was added in Windows 10. It is not based upon the old POSIX subsystem, but it certainly exploits functionality that Windows includes because of the POSIX subsystem’s requirements.

Win32 does not have a concept of fork, but Windows includes all the support necessary to create a duplicate address space, either via NtCreateProcess or NtCreateUserProcess (you can see an example of the latter, including protected processes, here).  WSL uses this support to implement fork.

NTFS has always supported the full range of POSIX-compliant names.  The downside to using them is that Win32 applications (like Explorer) can have problems accessing them.  That’s not a limitation of Windows; it is a limitation of the application.

I’ve heard several people tell me that “Windows” must include everything installed, not just the operating system.  I disagree with this characterization, but I’ll address the why of that in a future post.

Bottom line: Windows itself does not have restricted names based upon MS-DOS.  Win32 applications usually do, however.  I would also note that Explorer cannot handle long path names – somewhere within the code is a hard-coded buffer that is smaller than the OS-level maximum path length (which happens to be 32,767 16-bit UNICODE characters).  NTFS limits each name component (directory or file) to 255 characters, but ReFS allows component names up to 32,767 characters (and this has inspired me in another area I’ll talk about at some point as well).  Other file systems may have their own limits, too.

That’s enough for now.  I’ll find more interesting things to discuss when it comes to Windows file systems, though.