What About POSIX?
The POSIX specification was originally written to codify existing practice in UNIX. While it is called the PORTABLE Operating System Interface, it really set out to create that interface by documenting existing UNIX practice. While POSIX has certainly evolved, it has done so slowly.
But I’m not trying to dive into another polemic on the weaknesses of POSIX. That’s been done before.
Similarly, more tempered discussions of this subject have also been done before. For example, in USENIX ;login:, Vaggelis Atlidakis et al. explored this very issue in the article POSIX Has Become Outdated. The authors make some useful points:
- High-level frameworks now drive the use of POSIX. Since portability was a strong, motivating factor, the fact that applications now work against a different set (or sets) of APIs dilutes this motivation.
- POSIX is missing abstractions. This is observed because the use of ioctl is high and that usage is precisely to provide functionality that isn’t present in POSIX.
- The new abstractions are not converging. In other words, the three operating systems they studied are introducing new abstractions, but they aren’t coalescing into a common set of abstractions. Of the three key points the authors make, this one is likely the most damning since it undermines the original intent of POSIX, namely portability.
Note: the underlying paper was presented at Eurosys 2013 (POSIX Abstractions in Modern Operating Systems: The Old, the New, and the Missing).
I find it interesting that they didn’t even evaluate Windows, which I am sure would have further strengthened their findings of missing abstractions and divergence. While people might not realize it, Microsoft has had some level of POSIX compatibility throughout the lifetime of Windows (POSIX.1 specifically, which is the basic OS functionality level). I often see people claim Windows isn’t “POSIX compliant” but I know that this isn’t strictly true. Back in 1992 when I first started looking at Windows NT, I was quite surprised at how closely the security model they had implemented mapped to what we had done in DCE/DFS for security as well – our work was based on the draft POSIX security specification at the time.
Yet, I would also note that people continue to look at building new POSIX operating systems. For example, last year at OSDI, Cutler et al. presented a paper on building an operating system in the Go language: The benefits and costs of writing a POSIX kernel in a high-level language.
One reason I care about this is that my own work in considering alternative file system structures, e.g., the idea of having rich name spaces, has led me into this exact area multiple times. The existing interfaces aren’t adequate to the task. FUSE, which is a popular tool for file systems prototyping, has added support for ioctl, which provides a popular generic mechanism for adding such functionality. But of course the need to fall back on a generic extension mechanism just underscores the weaknesses of the interface.
Jeff Darcy wrote an insightful piece back in 2016 entitled Updating POSIX, in which he looked at the file-system-specific portions of POSIX, described their weaknesses, and made astute observations about how POSIX doesn’t match what we need these days:
- Rename: I certainly understand this morass since it is something with which I’ve been wrestling for most of my career. When we were designing the Episode file system (1989/1990) one reason we ultimately agreed on using a transaction log was because there was no correct order of operations that would yield a consistent file system. Given that I’m the one who designed and implemented the transaction log, I’m glad we did because our decision to use a log and the way we ultimately implemented it was certainly a strength of the final product.
- Fsync: This is how application programs ensure their data is committed to disk and thus presumably durable. Very few applications worry about durability; one reason is that the cost of ensuring it is so high. One thing I really liked about Jeff’s discussion here is his observation that POSIX focuses on consistency but not on durability – even though durability is precisely why applications call this API (a sketch of what a durable write actually requires appears after this list). The cost of checking file system consistency was a serious pain point in the 1980s. At Transarc, when a file system crashed it could take hours before a file server could restart, because the entire file system was scanned to ensure it was consistent. Durability wasn’t so much a file system concern.
- Readdir: This is interesting to me because one of my pain points when I first started working with Windows NT file systems was their decision to shift filtering of directory enumeration into the kernel. Yet his perspective is a good one: applications are often only interested in a subset of files. If you have a directory of 35,000 files and you are looking for just two of them, that’s a lot of overhead moving data between the kernel and the user application. So perhaps the pain of filtering in the kernel is worth it. There are plenty of other issues with respect to readdir as well. There’s quite a bit of impedance mismatch between the POSIX approach and the Windows NT approach. For example, NTFS actually maintains duplicate copies of file meta-data (timestamps and sizes) between the directory and the file itself. When I taught file systems classes I would sometimes show people the impact of that implementation, where directory enumerations could return stale results under certain situations. Of course, readdir is of interest to me because as I look at converting the file system to a graph I think quite a lot about what “readdir” looks like in such a system.
- Chmod: In fact, Jeff’s discussion here is broader than just chmod, it really deals with the complexity of modern security. There’s no simple mechanism for dealing with multiple security domains – try implementing NFS on Windows some time and find out what a pain point it is. His argument of using capabilities is an interesting one, and resonates with some of my own observations in this area. For example, in a graph I can easily conceptualize of disjoint clusters of the graph, where nobody else can find the stuff I’ve created until I give them a magic cookie for it (a capability). There’s no concept of capabilities in POSIX file systems.
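To make the consistency-versus-durability point concrete, here is roughly what a “durable” file update looks like under POSIX. This is a minimal sketch with abbreviated error handling, not production code; note that the fsync of the containing directory is needed because the new directory entry is metadata of the directory, not of the file.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Sketch: write a buffer to "path" and make both the data and the new
// directory entry durable. Error handling is abbreviated.
bool durable_write(const char* dirpath, const char* path,
                   const void* buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    if (write(fd, buf, len) != static_cast<ssize_t>(len)) { close(fd); return false; }

    // Push the file's data (and its metadata) down to the device.
    if (fsync(fd) < 0) { close(fd); return false; }
    close(fd);

    // The new directory entry is not necessarily durable until the
    // containing directory has been synced as well.
    int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return false;
    bool ok = (fsync(dfd) == 0);
    close(dfd);
    return ok;
}
```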
I can point to other aspects of the file systems APIs in POSIX that cause grief as well. For example, the fact that the file handle embeds a current offset is definitely a pain point. I see this every semester as students in CS 6200 struggle to add multi-threading support and then have to learn that the read and write calls are not thread safe. What amazes me is that this mistake was carried over into Windows NT: the CurrentByteOffset is an attribute of the kernel FILE_OBJECT. When I write a Windows file system, I’m responsible for updating this field correctly (where “correctly” itself is related to the particular operation being performed) even though I don’t use that field. I suspect that it is there because it was needed for POSIX compatibility.
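For the curious, the usual fix for the shared-offset problem is to avoid the implicit offset entirely: read() and write() advance an offset hidden in the shared file description, while pread() and pwrite() take the offset as an explicit argument. A minimal sketch:

```cpp
#include <sys/types.h>
#include <unistd.h>

// Safe to call concurrently from multiple threads sharing one descriptor:
// the offset is an explicit argument rather than hidden shared state that
// read()/write() would silently advance.
ssize_t read_record(int fd, void* buf, size_t len, off_t record_offset)
{
    return pread(fd, buf, len, record_offset);
}
```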
So my take-aways: supporting POSIX is useful because it preserves existing applications, but it is definitely in need of revision and/or rethinking. More ideas in that direction are best saved for a future conversation, though.
What is a File System?
I don’t know if I’ve discussed this previously, but if so, feel free to skip it. I had a meeting with one of my supervisors this morning and he observed that something I pointed out “would make a great HotOS paper…” I’m pretty much off the hook on that one, since there won’t be another HotOS for almost two years.
I said “we conflate name spaces and storage in file systems”. Or something equivalent. Here’s my real question: why do we mingle storage with indexing? I realize that indexing needs to be able to ultimately get us what we’re looking for, but traditionally we smash these together and call them a “file system”.
The Google File System paper certainly annoys me at some level each time I read it. But why does it disturb me? Because what they build isn’t a file system, it’s a key-value store. Yet over the years I have come back to it, repeatedly, and observed that there is an important insight in it that really does affect the way we build file systems: applications do not need hierarchical name spaces. What they need is a way to get the object they want.
Indeed, Amazon’s S3 service really is the same thing: an object store in which the “hierarchical name” is nothing more than an arbitrary key. I don’t know how it is organized internally, but the documentation is quite clear that the “hierarchical name space” is really just the way the key is encoded.
An Amazon S3 bucket has no directory hierarchy such as you would find in a typical computer file system. You can, however, create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, you can name it photos/2006/February/sample.jpg.
For me, this is a useful insight because at one point I considered inverting the traditional directory hierarchy, so that rather than have a directory point to its children, I would have each file maintain a list of the directories to which it belongs. Then “enumerating a directory” becomes “finding all the files with a specific attribute”. One motivation for considering this approach was to avoid the scaling issues inherent in large directories (which led me to an interesting-looking PhD thesis, Scale and Concurrency of Massive File System Directories – more for the reading list!) This is an important concern in a world in which I’m suggesting adding multiple different indices.
But back to the more fundamental question: why do we conflate them? Recall that the original thinking around file systems mirrored existing physical filing systems (I discussed that somewhat back when I talked about my Eurosys presentation) rather than the more free-flowing linkage present in the Memex model. In a world where the physical copy is located by using its naming information, this does make sense.
Consider how a modern library is organized (yes, some people still collect books!) The location of the object (book) is important, but the categories of the object – its properties – are equally important. A card catalog could be organized by topic, and would then give you information allowing you to find the work itself.
Applications don’t need hierarchical name spaces – those are something that humans use for managing things. Applications need persistent location information. By giving applications a mutable location key, we create a situation in which things break simply because we reorganize them.
S3 does not fix this issue, since “moving a file” is equivalent to “changing its key”. In that way, it misses my point here. However, I suspect that those keys don’t actually change very often. Plus, S3 has other useful concepts, such as versioning.
What you do lose here is the ability to “check directory-wise security”. That will likely bother some people, but for the vast majority of situations, it really doesn’t matter.
Typical file systems actually do use key-value stores, although local file systems do it with simple keys (e.g., “i-node numbers”). These keys turn out to be useful: NFS looks up files using that i-node number, which is embedded inside the file handle. We did the same thing in AFS, going so far as to add an “open by i-node number” system call so the server could be implemented in user mode. Windows does the same thing with CDFS and NTFS, and I’ve implemented file systems that support “open by ID” for both file and object IDs.
Thus, my point: we merge the name space and the storage management together because “that’s the way we’ve always done it”. The POSIX interface, which codified how things were done in UNIX, embeds this further as there is no open by i-node number and the security model requires a path-wise walk.
Ah, but the failings of POSIX are something that I’ll save for another post.
File System Driver: Structure
I thought I would discuss the structure of my driver and why I chose to structure it the way that I have done. Of course, I could change the structure of the driver again, but I will need to see a compelling reason to do so.
First, I will note that the driver has changed since I last wrote about it. I was not entirely happy with what I saw as a bit too much complexity, some of which is just me fighting with the vagaries of C++ versus C. So of course, I have refactored it (though honestly, my first post has not yet appeared as I write this, so you can probably ignore the refactoring).
I chose to structure my driver so that each device object can have its own request handlers. My rationale for this was fairly straightforward: it is quite common in these drivers to have a control driver and then a file system driver. Usually these get conflated together, so I thought I would try to set up my driver so that I could more easily split out that functionality.
As of this writing, I now have a control object. The driver creates a control instance as part of DriverEntry. I’ve also moved the project over to the larger project of which it is a part (ergo, the reason I decided to build a new file system driver in the first place). You can find that project on GitHub – Project Araneae, a name that relates to spiders because the goal of the project is to create a relational web across files. Much of that project won’t have anything to do with Windows file systems, but I’m going to keep working on my file system because parts of it will eventually need just such a file system.
Fortunately, as soon as I started up Visual Studio 2019 it told me that it had an update – so I’m using the time while that installs to write more about my work.
Since this is now the Tarantula project (a spider name), I have renamed things somewhat. The base device object extension class that I use is TNativeDevice. I’ve also started to try using the Windows Implementation Library (WIL) which has a nominal amount of Windows kernel support. I poked at this a bit and it is tantalizing to try using it, but rather than spend too much time on meta-programming (templates) for now, I’ll curb my enthusiasm and keep poking at it when I have more time to do so.
Trying to get C++ to use an array of pointers to member functions for each major function code turned into quite a bit of hassle, so I decided to just encode the functions as a virtual interface: each handler takes the IRP as its input parameter and the default implementation does the right thing. I also moved the new operator into the class itself; I was getting ugly conflicts with WIL, and truthfully I prefer specializing new because I like having unique tags. The last time I did this (several years ago) I used a template technique for the operator overrides and constant tag values. This time I have just coded them manually and moved the tags into a common file (which has advantages later when I want to find all the tags I’m using).
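To give a feel for the shape of this – and this is an illustrative sketch, not the actual Tarantula code; the names and tag values beyond TNativeDevice itself are made up – the base class ends up looking something like the following, with the driver’s dispatch entry points reduced to trivial forwarders into the instance stored in the device extension.

```cpp
#include <ntddk.h>

// Sketch: one virtual per major function, a default that fails the request,
// and class-local new/delete carrying a distinct (illustrative) pool tag.
class TNativeDevice {
public:
    virtual ~TNativeDevice() = default;

    virtual NTSTATUS Create(PIRP Irp)          { return Default(Irp); }
    virtual NTSTATUS Cleanup(PIRP Irp)         { return Default(Irp); }
    virtual NTSTATUS Close(PIRP Irp)           { return Default(Irp); }
    virtual NTSTATUS DeviceIoControl(PIRP Irp) { return Default(Irp); }
    // ... remaining major functions follow the same pattern ...

    // Class-local allocator so every instance carries a recognizable tag.
    // (No exceptions in the kernel, so failure is reported as nullptr.)
    void* operator new(size_t size) {
        return ExAllocatePoolWithTag(NonPagedPoolNx, size, 'vDNT');
    }
    void operator delete(void* p) {
        if (p) ExFreePoolWithTag(p, 'vDNT');
    }

protected:
    // Default behavior: fail the request in a well-defined way.
    NTSTATUS Default(PIRP Irp) {
        Irp->IoStatus.Status = STATUS_INVALID_DEVICE_REQUEST;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_INVALID_DEVICE_REQUEST;
    }
};

// The dispatch routine just forwards to the object in the device extension.
extern "C" NTSTATUS TNDispatchCreate(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    auto device = reinterpret_cast<TNativeDevice*>(DeviceObject->DeviceExtension);
    return device->Create(Irp);
}
```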
I created a specialization of the TNativeDevice for the control object and aptly named it TNControlDevice. It has support for Create, Cleanup, Close, and DeviceIoControl. The other functions are handled appropriately. I have kept things simple – I am not auto-creating the control device in the constructor, for example, and instead create it explicitly in DriverEntry. I still use the DeviceExtension for storing my control object (the class instance). When I load the driver, I see the control object appear in the object manager namespace. When I unload the driver, I see the control object disappear. I did notice that somehow I broke the inf script, so I’ll have to debug that at some point (though that is never fun).
I then turned my attention to reconstructing code I have written previously. The first piece loads the registry parameters. I’ve tried a couple of approaches to this over the years and my favorite is to load all the values up front and then support a query interface. In the past I’ve even spawned a thread to monitor changes and reload things, but at this point I have not done so. This allows me to add things such as a registry-controlled initial driver breakpoint (a BreakOnEntry value), which has proven useful over the years; it isn’t something that would be enabled in production. I expect to also add information about which pseudo file system volumes to create – but that’s really the point here: I can use that registry information however I see fit.
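As an illustration of the simplest version of this, here is one way to pull a single DWORD (the BreakOnEntry value mentioned above) out of the driver’s service key at DriverEntry time. This is a sketch, not my actual code; the real implementation loads everything once and answers queries from an in-memory copy, and may read from a Parameters subkey instead of the service key itself.

```cpp
#include <ntddk.h>

// Sketch: query a single REG_DWORD value, defaulting to 0 if it is absent.
static ULONG QueryBreakOnEntry(_In_ PUNICODE_STRING RegistryPath)
{
    ULONG value = 0;          // result (default: do not break)
    ULONG defaultValue = 0;

    RTL_QUERY_REGISTRY_TABLE query[2] = {};   // query[1] stays zeroed to terminate
    query[0].Flags         = RTL_QUERY_REGISTRY_DIRECT;
    query[0].Name          = const_cast<PWSTR>(L"BreakOnEntry");
    query[0].EntryContext  = &value;
    query[0].DefaultType   = REG_DWORD;
    query[0].DefaultData   = &defaultValue;
    query[0].DefaultLength = sizeof(defaultValue);

    RtlQueryRegistryValues(RTL_REGISTRY_ABSOLUTE, RegistryPath->Buffer,
                           query, nullptr, nullptr);
    return value;
}
```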
I have also added a library to wrap the ERESOURCE objects that are the lifeblood of the file system locking mechanisms. I have not gone so far as to reconstruct the lock ranking package that I previously built, but I suspect I will do so at some point; the last version of that was quite good at validating the ranking of locks within the hierarchy and enforcing proper ordering, which in turn avoided deadlocks. It also allowed invocation of an optional function for computing a hash value of the protected object, which permitted me to detect when a data structure was modified without the lock being held. It would not tell me where that happened, but at least it would alert me to the need for a code review (or to set an access breakpoint and find where it was happening).
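A stripped-down sketch of such a wrapper, with the ranking and hash-validation machinery omitted, looks something like this. The ERESOURCE itself must live in non-paged memory, and acquisition has to be bracketed by critical-region calls.

```cpp
#include <ntddk.h>

// Minimal ERESOURCE wrapper sketch (no lock ranking, no hash validation).
class TNResource {
public:
    TNResource()  { ExInitializeResourceLite(&m_resource); }
    ~TNResource() { ExDeleteResourceLite(&m_resource); }

    void AcquireExclusive() {
        // ERESOURCE acquisition must be protected against normal kernel APCs.
        KeEnterCriticalRegion();
        ExAcquireResourceExclusiveLite(&m_resource, TRUE);
    }

    void AcquireShared() {
        KeEnterCriticalRegion();
        ExAcquireResourceSharedLite(&m_resource, TRUE);
    }

    void Release() {
        ExReleaseResourceLite(&m_resource);
        KeLeaveCriticalRegion();
    }

private:
    ERESOURCE m_resource;   // the containing object must be non-paged
};
```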
Doesn’t sound like a lot of file systems code yet, does it? So far, it isn’t. I will need another specialization of TNDeviceObject for my pseudo file system, which means I want to know what drive letter to use when creating it and the logical place to get that is from the registry – hence why I was working on the registry package in the first place.
Thinking through that point, I realized I need to construct a TNFileObject class as well. I suspect at some point I will want specializations of this, but I am fine with waiting to create one until I need it. The TNFileObject will contain at least two separate memory allocations: one for the non-paged state that must live in kernel memory, such as the SectionObjectPointers structure, and a second for the fair bit of state that is safe to page out. Indeed, I suspect most of the state for my driver will be safely pageable.
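The split would look roughly like this – again an illustrative sketch with hypothetical names and tags, not the final layout:

```cpp
#include <ntddk.h>

// Paged portion: the bulk of the driver's per-file bookkeeping.
struct TNFileObjectPaged {
    UNICODE_STRING Name;
    // ... everything that is safe to page out lives here ...
};

// Non-paged portion: state the memory manager / cache manager may touch.
struct TNFileObject {
    SECTION_OBJECT_POINTERS SectionObjectPointers;
    TNFileObjectPaged*      Paged;
};

static TNFileObject* TNCreateFileObject()
{
    auto fo = static_cast<TNFileObject*>(
        ExAllocatePoolWithTag(NonPagedPoolNx, sizeof(TNFileObject), 'oFNT'));
    if (!fo) return nullptr;
    RtlZeroMemory(fo, sizeof(*fo));

    fo->Paged = static_cast<TNFileObjectPaged*>(
        ExAllocatePoolWithTag(PagedPool, sizeof(TNFileObjectPaged), 'pFNT'));
    if (!fo->Paged) { ExFreePoolWithTag(fo, 'oFNT'); return nullptr; }
    RtlZeroMemory(fo->Paged, sizeof(*fo->Paged));
    return fo;
}
```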
One final note for those of you who might be reading and just getting started writing drivers for Windows: make sure you turn on driver verifier. This is done using the verifier application, which is installed on Windows. Make sure you enable verification for your drivers at least. Then learn how the !verifier command works inside WinDBG, because you will need it.
I’ll keep posting about this periodically as well. Feel free to also check out the code online. Let me know if you have questions or suggestions.
Visualization
Last post I discussed relationships. But relationships really are not enough. Another key to this puzzle is visualization. In other words, how do we present the information to users so that it is useful?
But first, let me step back and point to a larger problem: information overload. If we present users with a list of 100,000 options, they won’t necessarily be able to find what they’re actually seeking. For example, one of the challenges of using an Internet search engine is that it can return millions of “answers” to my basic query. In fact, I just typed “What is the meaning of life?” into a search engine and it responded with 1.1 billion possible answers, all in under a second. Search engines are a marvel at backing up the dump truck and offering to inundate me with answers. When is the last time you went through even a handful of these, let alone a large percentage of them?
I suspect that if I ask the HCI folks they will be able to tell me what works for ordinary mortals, but I assure you that computers are capable of returning more information than one can possibly process – many years ago I received a bug report about a file system directory that contained over 700,000 files and would not display with the file system kit I had constructed and the company sold. I had knowingly made decisions about resource limits when I built it, but we lifted that restriction and supported much larger directories than that. I’m quite sure that no mere human would look at such a directory in any meaningful sense. Maybe it was a collection of log files, in which case the names likely embed semantic information about the files themselves. Indeed, in Burrito the authors noted that scientists often embed the schema of their data within the file names. I know that I have done the same, and when I’m looking for specific data I am often scripting code to sift through the pile and find the subset that is useful to me. The point remains the same: huge listings of files within a directory don’t work for humans.
One potential area for considering navigation is faceted search, a technique for making vast quantities of data searchable. Indeed, this fits well with the graph file system idea because what we expect to find in our graphs are clusters of related files. The graph is likely to be sparse because most files are unlikely to share common features with one another. This suggests that at least one model for this data visualization problem is rolling data up into these clusters, with an iterative approach to breaking it down and displaying it further. Of course, we might be able to do that within the confines of the existing hierarchical structure, which would be great for retrofitting this into the vast array of existing applications; still, my hope is that we can also provide novel new variations (or rather, that someone more clever than I am at these things will do so). The challenge of building something that enables this means I need some level of understanding of what kinds of information are needed to do so.
For example, I was wondering about first order approximations – those that mimic an existing hierarchical file system. Such a file system would present one or more views of the data. Maybe we have a Time view, and the time view then shows you all the files in time order. But if you have 1,000,000 files, that is going to be a mighty big list. One option might be to divide it up into ever smaller chunks of time – eventually we would get to a point where you could see a few dozen files in some sort of time order (a toy sketch of this follows). Of course, strict sharding of time might not make sense either: why should two files that are separated in time by only a small interval end up in separate locations?
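Here is the toy sketch I had in mind: recursively halve the time range until no bucket holds more than a handful of entries. It is purely illustrative – a real implementation would also avoid splitting near-coincident files across bucket boundaries, which is exactly the objection just raised.

```cpp
#include <vector>
#include <ctime>
#include <cstddef>

struct TimeBucket {
    std::time_t start, end;             // half-open interval [start, end)
    std::vector<std::time_t> files;     // timestamps of files in the bucket
};

// Recursively split a sorted set of timestamps until each bucket is small.
static void Split(std::vector<std::time_t> sorted, std::time_t start, std::time_t end,
                  std::size_t maxPerBucket, std::vector<TimeBucket>& out)
{
    if (sorted.size() <= maxPerBucket || end - start <= 1) {
        out.push_back({start, end, std::move(sorted)});
        return;
    }
    std::time_t mid = start + (end - start) / 2;
    std::vector<std::time_t> lower, upper;
    for (auto t : sorted) (t < mid ? lower : upper).push_back(t);
    Split(std::move(lower), start, mid, maxPerBucket, out);
    Split(std::move(upper), mid, end, maxPerBucket, out);
}
```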
One possible option would be to consider a time slider that controls the view. This is something that we can find in other temporal data tools. Thus, creating a time slider might help make visualization easier to perform – in some ways this is similar to the timeline feature that was recently added to Windows 10. I suspect the file system doesn’t do much to facilitate it. From my own use of this feature it has some interesting limitations, not the least of which is that if you want full functionality they want to export data “to the cloud” for further analysis. That sounds like whatever they are doing is data intensive, which in turn suggests to me that the file system is not helping. If the answer really is “sorry, but you have to spend lots of computational cycles to mine this data” then my research is unlikely to be fruitful. I push forward because I don’t think this really is the case – the database community has done marvellous things that permit a vast treasure of relationships to be mined.
If we combine this with a faceted search area, I can envision a filtered timeline model – in essence, this seems to be what the new Windows feature is doing, albeit by filtering the things they have deemed to be of interest. I suspect they will extend this capability over time, but a time-ordered view is just one of the relationships I consider important. I don’t know what the full set of relationships will be, but I do think that having a pre-defined list of them will be self-limiting. Perhaps it will be enough; I proceed on the basis that I expect it will not be.
One possible visualization I’ve been considering – and part of the motivation for deciding I need to start building a file system – is that this sort of namespace might be achievable on our existing hierarchical model if I just add “an extra level of indirection”. Suppose we construct a file system in which, in place of each existing file name, there is a directory of the same name. In turn, that directory contains relationships associated with other objects. One of those other objects could be the actual file (so we can still access it), but we could also have a “temporal” directory that displays a list of files that were created, modified, or accessed around the same timeframe. We could store information about which web sites were visited around the times the file was in use. We could keep track of the music that was playing at the same time. We could point to files that are similar to that specific file. Such a visualization could be achieved easily and remain compatible with existing tools. Rather than being an endpoint, this is more an intermediate staging area – a way of mocking up the concepts and ideas, and of seeing what works for people and what does not.
Thus the desire for a file system. I think we can construct a static namespace by using an existing file system and symbolic links (so you can eventually get back to the real files) by mining existing data sources. But eventually, we are going to want dynamic support here. We can stub that out with FUSE (for example) but in my experience (and based upon the literature) FUSE is slow for meta-data operations, which is really all I will be doing. Building file systems is hard work, but it is something I’ve been doing for years, so that aspect doesn’t really scare me. Visualization on the other hand is an area in which I don’t have a lot of experience.
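As a sketch of what the static mock-up could look like (paths and layout purely illustrative, and assuming C++17’s std::filesystem), the mining step would emit a per-file directory containing a link back to the real file plus relationship subdirectories:

```cpp
#include <filesystem>
namespace fs = std::filesystem;

// Sketch: build one entry of the static "graph view" for a real file.
static void MockGraphEntry(const fs::path& realFile, const fs::path& viewRoot)
{
    // Directory named after the file; ignores name collisions for simplicity.
    fs::path entry = viewRoot / realFile.filename();
    fs::create_directories(entry / "temporal");

    // Link back to the actual file so existing tools can still open it.
    fs::create_symlink(realFile, entry / "content");

    // A real implementation would now mine timestamps, browser history, etc.,
    // and populate entry/"temporal" (and other relationship directories)
    // with symlinks to the related objects.
}
```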
I’m certainly open to ideas…
Relationships
I recently described two file systems (QMDS and GFS) that attempted to capture additional context for files to improve their usability. At Eurosys, I argued (somewhat successfully) that a distinguishing characteristic of my proposed work is to capture relationships between files, something that goes beyond mere isolated analysis of such files.
Index servers, which are now ubiquitous on mainstream platforms, attempt to solve this problem by focusing on the specific attributes of the given file. This is useful, and indeed it seems to be consonant with the general approach of semantic file systems, which attempt to classify files based upon their semantic content. For example, this is how modern Internet search engines work – they classify a document based upon its content.
I pointed out in my recent discussion of personal information management (PIM) that there is a difference between navigation and search. Thinking about this further, I realize that this is more nuanced and reflects the way that we humans look for things: we go to where we expect something to be (navigation) and if it is not there we then go to other places where we think it might be. For example, when I’m looking for my keys, I have a list of places I look first. When I don’t find them there I begin to systematically search for them.
Thus, the natural progression for finding the item of interest is navigation and then search. Index servers are basically the search part of the equation, and they do not tend to be where we start. Instead, they are the fall-back.
As humans, we tend to create associations between events. When I can’t find my keys I start thinking about the last time I saw them (“I know I had them because I was able to get into the apartment. After I got back I went and checked my e-mail…”) These are temporal clues but they are associations between other unrelated events and the object for which we are searching.
Another observation that Margo Seltzer made in our discussions on this topic is that when we use a web search engine, we are looking for an answer, but when we are searching for something that we already have, we are looking for the answer. This is a subtle, but important, difference. I don’t want to find any set of keys, I want to find a specific set of keys. Internet search is notoriously unreliable in this regard; how many times have you gone back a few days later to find some interesting article, only to realize that the list of results coming back from your search engine is different than it was before? In this way, Internet search engines exploit the fact that you are (usually) just looking for an answer – they don’t have to give you definitive, reproducible results.
Yesterday I had an interesting conversation with Sasha Fedorova and towards the end of it she suggested that I might do better promoting this work by focusing on solving the needs of a particular community and she suggested the software engineering community, partially because she has worked with them before and also because she could see the kinds of relationships that might be useful to that community. Further, that community is used to testing out experimental approaches that promise to improve effectiveness and productivity. In that same conversation she pointed out the Stack Overflow community as being one of those places software engineers search for answers.
This morning, on the way to the gym, I realized that the Stack Overflow community is also an example of how we organically create relationships: people ask questions and get back specific answers of varying quality. The community rates the responses and preserves the answers. This has organically created a web of connections across topics and people.
Why is this important to understand? Because the work I’m doing proposes going beyond simple semantic analysis of individual files, or even clustering them based upon specific characteristics, to establishing a set of interrelationships across files and across applications. Focusing on solving this issue for a single application is much akin to looking only at the semantic content of a single file. Moving beyond this to recognize that we humans create associations in our brains means we need to find ways of capturing those associations across applications, so that they help us better navigate the vast trove of information we accumulate.
For example, Sasha suggested that it would be interesting for the software development community if there were an association between the web pages we accessed and the code we were editing. This makes sense to me: I tend to look up documentation or explanations of specific things as I write code. If we capture this relationship across applications, we can motivate why this is a systems problem and not an application problem – operating systems provide services that are common to applications (not necessarily all applications, but something that is of broader interest than a single or a few applications). One benefit of focusing on the software engineering community is that it permits us to identify relationships of interest to that community and then mine those relationships.
Another fascinating conversation (late last week) was when I was discussing my research direction with Ghita Berrada (who is visiting us this month) and we ended up having a discussion of generalized concept analysis. I was familiar with Formal Concept Analysis (FCA) because it was used by Benjamin Martin in his work more than a decade ago (his dissertation “Formal Concept Analysis and Semantic File Systems” is something I have found delightfully insightful and I think it is time for me to read it again) but she pointed me to Temporal Concept Analysis and Relational Concept Analysis. Temporal Concept Analysis is fairly recent, having only been first formally described in 2000, and uses FCA as its basis, but it focuses on temporal events. Relational Concept Analysis is even more recent (2007) and is definitely germane to my research direction since it focuses on relationships and how to handle them within the context of FCA. Given my own focus on relationships across files, it definitely seems pertinent.
All of these are in turn based upon the mathematical model of lattices – partially ordered sets. Lattice theory has been around for quite some time and shows up in a broad range of areas, such as the CRDTs used in distributed systems. For example, they are used in Anna, a key-value store I read about last year (surprisingly, I didn’t write it up – I should have), which in turn pointed to earlier work on lattices in distributed systems that I did describe previously. This has prompted me to go off and read about lattices; it is definitely challenging, as my formal math skills are quite rusty, but I have been systematically working my way through the book on lattice theory I picked up. It has been interesting trying to construct an intuitive understanding of the concepts from the more formal language describing them; hopefully that knowledge will prove useful.
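To make the lattice/CRDT connection concrete with the smallest example I know of (my own illustration, not from any of the papers mentioned): a grow-only set is a join-semilattice where the join, the least upper bound of two replicas, is simply set union – idempotent, commutative, and associative, which is exactly what lets replicas merge in any order and still converge.

```cpp
#include <set>
#include <string>

using GSet = std::set<std::string>;

// Join (least upper bound) of two grow-only set replicas is set union.
static GSet Merge(const GSet& a, const GSet& b)
{
    GSet result = a;
    result.insert(b.begin(), b.end());
    return result;
}
```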
This is turning into a long post and I still haven’t reached the point I wanted (yet).
A challenge with this work is identifying relationships that we want to be able to support. Of course, I don’t want to restrict these relationships, but rather use a sample of such relationships to motivate the work (or “Why would anyone care about my research?“) One of the challenges of doing good research is motivating that research: there are numerous questions to answer, but which questions deserve being answered?
File systems presently support two basic relationships:
- Contains – this is the directory metaphor. A “directory” is a container of other directories or files. It is the basis of the hierarchical relationship to which we have become accustomed.
- Points to – this is the function of a symbolic link, which is supported by most file systems.
It is relatively easy to focus on properties as well. In theory, clustering files based upon a shared characteristic is one (loose) form of relationship. For example, we routinely associate specific file names with a corresponding application (e.g., via the suffix of the file). Windows exhibits a strong relationship model here, in which applications register interest in particular types of files, based on suffix, and the operating system then uses that information to invoke the relevant application. Apple’s Mac used to use an embedded meta-data component (the “resource fork”) for associating the file with the application; resource forks still exist for compatibility, but they are not commonly used any more.
Other types of file properties include:
- Timestamp – these capture temporal properties of a file. The most common are the creation time, last update time, and last access time. Last access time is often omitted these days because constantly updating it turns out to be expensive. For example, Windows NTFS does not update last access time by default. My recollection is that they did this because they found updating this field accounted for something like a 6% performance cost in NTFS. Thus, there is a question as to how reliable these values are.
- Size – we know how large each file is.
- Name – I’ve already mentioned using the suffix, which is an aspect of the name. Is it possible to exploit this in other ways?
Then there are application relationships: what application created this file? It occurs to me that it might be useful to distinguish registered names (suffixes) from all files created by an application. For example, I don’t usually want to see the build artifacts of my software development environment. Microsoft Visual Studio handles this by allowing artifacts to be separated from source files, for example. Could we achieve something comparable within the file system by understanding these relationships? Would it be useful?
Another suggestion from Sasha was that we might want to record what music was playing when we were doing something specific, because our brains may create an association across these seemingly unrelated events. This suggests another potential relationship: concurrent application execution. This is a sort of temporal relationship, but one that becomes more interesting when we consider it in a Memex-style model of associations. How can I capture these relationships across different applications? Perhaps some sort of “current context” or “current activity” can be associated with a given application, and then queried and added to files as we create them. These types of dynamic relationships are certainly more intriguing.
What kinds of relationships can you envision being useful when you are searching for that elusive file you know that you have but you don’t know where it lives in the hierarchy?
Setting up Debugging
No matter how many times I do this over the years, I find it to be a slow and painful process – install a clean virtual machine image, set up kernel debugging, install the (test signed) driver.
It never ceases to amaze me how non-intuitive it is to set up communications with the new VM image I’ve created on my computer. This time was certainly no exception. Since I’m running on Windows 10, I chose to use Hyper-V (otherwise I can’t use the Windows 10 Docker support, for example). I downloaded the latest and greatest version of Windows 10 (1903 Enterprise) and installed it inside a Generation 2 VM. There were the usual challenges (oh right, I reset the machine, so of course none of my Hyper-V networking choices worked, so of course I didn’t have an external network switch set up). So I set up the VM with no networking. Windows is most displeased with this option and warns me that a machine without networking is like a Bloody Mary without vodka; one nice upside was that it didn’t prompt me to log into an internet account, so I gleefully set up a local account.
Then of course I had to figure out (again) how to get things transferred between the local machine and the virtual machine. I made the usual changes to enable ICMP (“advanced settings” in the firewall, where it is buried under “File and Printer Sharing”, because obviously supporting ICMP is all about file and printer sharing…).
I was amused to find that I could ping from my laptop to the virtual machine, but not the reverse. Repeating the process of changing the File and Printer Sharing options, this time on the laptop, fixed that issue. The next trick was to get SMB working, since I have to move files between the two machines. I enabled other File and Printer Sharing options (for the private/domain network profiles – I don’t want to share on a public network, after all) but that still didn’t work. So I turned off the firewall on the virtual machine; not best practice vis-à-vis security, but it is a test machine.
Voila! I can map a share. So now I copy kdnet from the host to the debug target. It of course tells me that I have to turn off secure boot to continue. I first tried doing that with PowerShell, but it balked at me, so I sighed, shut down the VM, changed the setting inside the Hyper-V management console, and restarted. I could then run kdnet successfully to enable boot debugging (admittedly, that is a nice improvement, since I’m used to dealing with bcdedit to enable kernel debugging). I fire up the debugger and reboot the VM – and it hangs.
So I start poking around on the host machine. Much to my annoyance, it seems that Hyper-V has installed a virtual adapter on top of my wireless and marked it as a public network. Well of course that isn’t going to work so well. The option to change the network type just silently doesn’t show up on the laptop (ironically it did show up inside the VM). After some internet searching and poking and prodding I get the network type changed to private using PowerShell – and then the debugger attached and the VM finished starting. Life is good!
I’ve now copied my test driver, inf script, and certificate over to the test machine. I went through the steps of installing the test certificate as well.
While doing all of this I also did a quick bit of research as to the least expensive code signing certificate. At the time I write this (May 2019) the best option (if your primary concern is cost) is Digicert. They list a price of $99 per year for an EV code signing cert. Of course EV certs are a bit of a pain to get, since they actually do some small amount of work to verify that you’re legit; I need to do a bit more work to ensure that my phone number is independently verifiable (one of their requirements) as I’m not quite sure how they will validate it. But I don’t expect I’ll need that for a while yet, since that’s really needed when I want to hand the driver to someone who isn’t a developer and wants to test the driver.
I installed the driver and… the machine crashed. So I know that my configuration is now working and I can debug the driver. Progress!
Starting the Skeleton Driver
In my last post, I installed the WDK (and described why I want to build a file system driver). I started up Visual Studio 2019, said I wanted to create a new project, narrowed down the options to “WDK” related projects, and scrolled down to the WDM driver option.
I’m not building a WDM driver, but it is the closest project type to what I want to do. It creates a solution with the name specified and then creates a project with the same name. The only file that it pre-constructs for me is an inf file. I will need to do work on that before I can use it, but I’ll leave that for later. File system installation files are surprisingly uncomplicated, since all we really need to do to install a file system driver is set up a few registry keys.
Since I had just installed Visual Studio 2019, I’ll need to tune things to my working environment. I started by enabling git integration, since I will be using github.com for my source code repository (winskel).
That took more time than I anticipated – I installed the GitHub integration into Visual Studio, which restarted Visual Studio. I was then told “there is an update to GitHub extension for Visual Studio”, so I installed the update next, which required yet another restart. I hope the Visual Studio folks take a lesson from the VS Code team, since I install VS Code extension updates all the time with just a refresh, not a full restart. Of course, I used that time to continue adding to this post, so it wasn’t entirely wasted. Still, it is stunning that they construct a restore point just for installing new extensions.
I really liked the fact that Visual Studio 2019 suggested to me that I could make startup faster by disabling the WDK extension – how helpful, given that the reason I’m running Visual Studio 2019 is because I want to use the WDK. It makes me long for the days of SOURCES files and command line program building. I know it is possible to develop without using Visual Studio and perhaps I’ll explore that again at some point, but I’d rather be writing code for my new driver rather than fussing with the tools and environment at this point.
Since this is a new project, I’m going to enable the static code analysis tools. While not required, I choose the “All Rules” option because it is the most restrictive setting available. Note that I am applying this to all the configurations (debug and release) as well as the platforms for which I have installed the compiler tools (I did not install the ARM compiler tools, so I cannot include them).
Having enabled the checks, I built my simple file with just DriverEntry (and an error return). Of course, as I expected, the static analysis tools are now reporting issues, so I add annotations (DRIVER_INITIALIZE DriverEntry; for example) and modify my code (the static analyzer points out that both the DriverObject and RegistryPath can be set as const pointers). Since I will be changing DriverObject I suppress the warning. I don’t expect to change the RegistryPath, so I mark it as const.
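For reference, the skeleton at this stage amounts to little more than the following sketch (minus the analyzer suppression mentioned above); the role-type declaration is what satisfies the static analyzer, and the top-level const on RegistryPath does not change the function’s type.

```cpp
#include <ntddk.h>

// Role-type declaration tells the static analyzer what DriverEntry is.
DRIVER_INITIALIZE DriverEntry;

_Use_decl_annotations_
NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING const RegistryPath)
{
    UNREFERENCED_PARAMETER(DriverObject);
    UNREFERENCED_PARAMETER(RegistryPath);
    return STATUS_UNSUCCESSFUL;   // fail for now: the driver loads, then unloads
}
```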
I also had a warning that while the spectre/meltdown mitigation option has been selected for the compiler, the libraries with the needed mitigations are not installed. So back I went into the installer and installed the missing libraries. Things now build well, and I have my super-minimal driver. It won’t do much, since the DriverEntry function returns a failure code, which means it will load and then unload.
However, this is enough for me to make the inf file work, so I will do that next.
Visual Studio generated a default INF file for me. This isn’t quite enough for me to install a working driver, so I’ll need to modify it. Plus, Microsoft changed some details about INF files for Windows 10 1903 and created a new primitive driver type with rules that need to be followed if you want the driver to be properly (test) signed.
So I worked through the INF file issues and I now have a working INF file, with a driver that (of course) won’t actually do anything yet.
Next, I turned my attention to pulling together a C++ runtime so that I can use C++ if I want. Basically, there are several things that need to be done to make this work:
- I need memory management functions
- I need initializer support
- I have to wrap the standard functionality (DriverEntry) and coordinate the Unload function so it calls the cleanup logic.
In the past, I’ve added a template layer above the allocators, which permits me to specify (per object type) what the pool type and pool tag are for the allocations. Unlike in user mode, where we normally don’t worry about these things, in the kernel we do need to worry about whether memory is pageable or not. Plus, we have to provide some mechanism for finding memory leaks, since there is no automatic garbage collection. Note that my goal isn’t to port the STL into the Windows kernel (though I did see one project where it looked like someone had done that). Similarly, I don’t plan on supporting C++ structured exception handling. So this runtime will provide me with most C++ language features, but I’ll eschew those that require specialized run-time support.
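That older template layer looked roughly like this – a sketch with illustrative names and tags, not the original code. Because each object type carries its own tag, leaks show up per-tag in tools such as poolmon or WinDbg’s !poolused.

```cpp
#include <ntddk.h>

// Mixin that fixes the pool type and tag for a given object type.
template <POOL_TYPE PoolType, ULONG Tag>
struct TaggedAllocation {
    void* operator new(size_t size) {
        return ExAllocatePoolWithTag(PoolType, size, Tag);
    }
    void operator delete(void* p) {
        if (p) ExFreePoolWithTag(p, Tag);
    }
};

// Each object type picks its own pool and tag (values here are made up).
struct VolumeContext : TaggedAllocation<PagedPool, 'lvXW'> {
    // ... pageable per-volume state ...
};

struct IrpContext : TaggedAllocation<NonPagedPoolNx, 'prXW'> {
    // ... state that may be touched at raised IRQL ...
};
```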
As I wrap this up for the day, I have the allocation routines plumbed. The next step is to get the initializer code written – it revolves around walking through some memory locations where global and static constructors need to be called – the Microsoft C++ compiler embeds some magic information in memory to do this. I also need to construct a list of things to be called when terminating the runtime.
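The usual technique, sketched below rather than offered as drop-in code, is to bracket the compiler-generated .CRT$XC* pointer sections with our own markers and walk everything in between; the MSVC compiler places a pointer to each global constructor thunk in those sections, and the linker sorts them alphabetically.

```cpp
// Sketch of the global-constructor walk for a kernel C++ runtime.
typedef void(__cdecl* PVFV)(void);

#pragma section(".CRT$XCA", read)
#pragma section(".CRT$XCZ", read)
__declspec(allocate(".CRT$XCA")) static PVFV CrtCtorsBegin[] = { nullptr };
__declspec(allocate(".CRT$XCZ")) static PVFV CrtCtorsEnd[]   = { nullptr };

static void RunGlobalConstructors()
{
    // Skip our own begin marker; stop at the end marker.
    for (PVFV* entry = CrtCtorsBegin + 1; entry < CrtCtorsEnd; ++entry) {
        if (*entry != nullptr) {
            (*entry)();
        }
    }
}
```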
Once that’s done, I’ll move on to adding basic functionality. One thing that will greatly simplify this initial effort is that I don’t have to worry about integration with the memory manager or cache manager because I can defer I/O management to the native file system. Perhaps, once we’ve proven the viability of this approach, I can look further at integration.
I will continue describing my progress and updating the repository as I work through this project over the coming months.
Let’s Build a Windows File System
It’s been a while since I upgraded my Windows kernel development tools, so I thought I’d write about the steps I’m taking to do so. How you build a Windows file system has changed over the years but the basic structure of the file system driver itself has not.
At the time I’m writing this, the standard tools to do this are the Microsoft Visual Studio Integrated Development Environment and the Windows Driver Kit (WDK). The WDK download page (which could move, I just search for “Microsoft WDK” when I need to find it) actually describes the basic steps needed to develop drivers for Windows. As I write this, Visual Studio 2019 is downloading. I do not need many of the parts of it, so I just installed the “Windows Desktop Apps” components. Don’t forget to install the Windows SDK as well (it’s an optional component as part of the Visual Studio installation).
Once Visual Studio is installed, I’ll install the WDK. It adds the WDK integration components into Visual Studio, which permit me to build Windows drivers. That also installs one of the most important tools: the debugger. I have been working with WinDBG for several decades and it is an indispensable tool for kernel debugging, with both a graphical user interface and a command line version. It supports debugging over a variety of transports (it used to support debugging over modems, but I haven’t done that in many years, so that may no longer be supported) including serial ports, USB ports, networks, and synthetic debugging (over a virtual device) for virtual machines. Since I will be using a virtual machine for my development, I will set up synthetic debugging. Easing this, it seems Microsoft has automated the process; I have never done it automatically, so I look forward to seeing if that works.
I still need to write about what I am building, but I will defer that for the moment because it deserves a post of its own. I can explain why I want to build this as a file system, though. I hope this is useful for anyone reading it who is thinking of building a file system.
The file systems interface, whether it is UNIX, Linux, MacOS, or Windows, is a clearly delineated boundary at which I can implement functionality that becomes available to anyone using the file system itself. The specifics of that interface vary somewhat across operating systems, but there is a high degree of commonality. One reason for this is that mainstream operating systems have a common heritage. Perhaps that will change in the future (one recent paper suggests that our OS architecture assumes that I/O is slow relative to processors and memory, an assumption that is no longer true) but it seems unlikely to change in the near-term.
I decided to implement functionality at the file systems interface because essentially all applications for our operating systems know how to use it. New functionality – provided it conforms to the file systems interface – is thus exposed to all of those applications and seamlessly integrates into the existing system. If the new functionality is substantially different from what can be provided through the file systems interface, this will not be a good solution. Since my goal is to provide both compatibility and new functionality, I am using the file systems interface but will also look at ways in which I can augment it further. Indeed, one interesting aspect of the GFS paper I described earlier is that they tried to implement functionality within the constraints of the existing interface. This has made me think about how I would go beyond what they did without changing the interface, though I do expect that at some point I will need to augment it.
I decided to do this on Windows for several reasons:
- I am familiar with Windows file systems development; I am comfortable in the kernel environment, and I know how to integrate user mode services and kernel mode drivers together to provide my desired functionality;
- I am looking at how to extend the namespace management. By doing this as a file system, I know it will be visible in standard applications, which in turn will make it easier for me to gauge how effective the changes are for ordinary users.
- More than a decade ago Microsoft incorporated quite a few interesting features into their user interface in anticipation of the new Windows File System (WinFS). While WinFS was never released, its goals of augmenting existing file management mechanisms were partially integrated into Windows.
- The NTFS file system in Windows supports a change journal that will be a good place to start for prototyping capturing some relationship data. I expect that we will ultimately provide additional mechanisms for doing this, but this should provide a good head start.
- User mode file systems, while convenient for prototyping, are a dead end if I need a kernel mode file system. User mode file systems are known to be much slower, and this is amplified for meta-data operations. Since I expect to be implementing primarily meta-data operations, I expect it is highly likely that I will not be able to demonstrate acceptable performance if I were to use FUSE for Windows, for example.
- Windows already has a supported mechanism for accessing files via a file identifier. We know that applications do not need hierarchical name spaces – that is one of the lessons from the Google File System paper. What applications need is a key. This is hardly a new observation: both NFS and AFS employed file identifiers in their implementations. Indeed, one reason that CDFS and NTFS on Windows have long supported “open by file ID” is that they needed it to support AFP for the Services for Macintosh feature in Windows NT. While the NTFS source code is not generally available, the CDFS source code is in the WDK – and demonstrates how CDFS supports “open by file ID”. Note that NTFS is more permissive in its support, as it also allows absolute path name opens, while CDFS does not. (A small user-mode illustration of opening a file by ID follows this list.)
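Here is the user-mode illustration promised above – a sketch of opening a file by its 64-bit file ID via the Win32 OpenFileById API, given any open handle on the same volume to use as a hint:

```cpp
#include <windows.h>

// Sketch: open a file by its NTFS file ID, never touching its path.
HANDLE OpenByFileId(HANDLE anyHandleOnVolume, LONGLONG fileId)
{
    FILE_ID_DESCRIPTOR desc = {};
    desc.dwSize = sizeof(desc);
    desc.Type = FileIdType;
    desc.FileId.QuadPart = fileId;

    return OpenFileById(anyHandleOnVolume, &desc,
                        GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE,
                        nullptr, 0);
}
```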
Now that I’ve stalled a bit, my Visual Studio 2019 installation has finished and I went to install the WDK; that installation was unhappy that it could not find the right version of the Windows SDK to install and it pointed me off to another web page to download the correct version. Installation still is not a seamless experience, it would seem.
So now, as I install the SDK, I can return to discussing my project in a bit more detail, even though this is probably unfair since I haven’t really told you much about what I am trying to achieve.
I have opined about the challenges of managing hundreds of thousands of files across multiple devices. I pointed to projects that have explored alternative ways of managing the name space (e.g., QMDS and GFS) and they have done a good job of laying the groundwork. Prior discussions I have had with others have focused on search as a paradigm, but the survey paper on File Management Research I recently described helped me better understand that navigation seems to be the approach users prefer over search.
One possible reason for this preference for navigation is the nature of the task itself. When we search on the Internet, we are seeking an answer to our question. When we look for something within our personal files, we are seeking the answer to our question. We fall back to search when navigation fails us.
The SDK finally finished installing… I’ll try the WDK again.
Over the years, I have constructed a number of virtual or pseudo file systems. For example, I once constructed a file system that didn’t store any data; it just presented an ephemeral name space. I crafted it to support a broad range of meta-data features: object IDs, file IDs, extended attributes, alternate data streams, and access control lists (ACLs). But read operations were satisfied by zero-filling the memory buffer, and writes were discarded – thus, the only data that was visible was data that persisted in the virtual memory file data cache. This was a fun file system to build, not least because it really was only the namespace and meta-data management parts of a file system. My motivation for doing this was performance testing for a file systems construction framework, but I realized at some point that it could also be a good baseline from which to construct a new file system.
Thus, my project direction: building an abstract file systems interface that has rich support for meta-data. Actual file I/O can then be redirected to the underlying file, within its own native file system.
The WDK installation is now finished! My next steps are:
- Constructing a virtual machine image that I can use for debugging; the wonderful thing about most file systems work is we aren’t dependent on hardware, so it is easy to do development inside the virtual machine environment.
- Sketch out my skeleton file system model.
- Begin to implement it.
I don’t expect to build production quality code. Over the years I have learned that the bar for production quality file systems code is quite high. Thus, my goal is to construct a working prototype and then, from that, begin building my new file system namespace.
GFS: a Graph-based File System Enhanced with Semantic Features
GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.
In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.
I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected; one reviewer specifically cited this paper (as well as the QMDS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing it from our work.
The abstract does a good job of explaining how this work differs from what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files, rather than just characteristics of the files themselves.
The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”
Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.
“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”
In my own way, I have been looking at this question for quite some time. Over a year ago I was working on a way to support the classic file system interfaces while augmenting them with new features, without requiring invasive operating-system-level changes. While I expect that a successful demonstration of new interfaces will ultimately lead to OS-level changes, it makes more sense to explore which interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux, FUSE for Windows, and quite a few other OS platforms), and the other is a message-passing mechanism that routes requests directly from the application to the user-mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!
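Conceptually, the hybrid model funnels both delivery paths into the same user-mode handler. The sketch below is only that, a sketch; the types and names are mine, not the FUSE library’s, and the transports are deliberately abstract.

```c
/* Conceptual sketch (hypothetical names, not the libfuse API): the same
 * user-mode request handler is reached either through the kernel
 * reflector (the classic /dev/fuse-style path) or through a direct
 * message-passing channel from the application that bypasses the kernel. */
#include <stddef.h>
#include <stdio.h>

typedef struct fs_request {
    unsigned    opcode;        /* lookup, getattr, read, write, ... */
    const void *payload;
    size_t      payload_len;
} fs_request_t;

typedef enum {
    TRANSPORT_KERNEL_REFLECTOR,   /* unmarshalled from the kernel driver */
    TRANSPORT_DIRECT_MESSAGE      /* application -> user-mode library */
} transport_t;

/* The file system implementation: one entry point, however the request
 * arrived. A real implementation would switch on the opcode. */
static int handle_request(const fs_request_t *req)
{
    printf("handling opcode %u (%zu payload bytes)\n",
           req->opcode, req->payload_len);
    return 0;
}

int dispatch(transport_t how, const fs_request_t *req)
{
    /* The transport only affects how the request was received (kernel
     * device read vs. shared memory or a socket); the handling is shared. */
    (void)how;
    return handle_request(req);
}
```

The appeal of this split is that applications willing to use the direct path get new functionality without kernel changes, while unmodified applications keep working through the reflector.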
So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.
One of the strengths of this paper is its emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Arguing that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user, which means helping them in the way they use the system now. I suspect the ideal will be to enhance the current way of working as well as provide better search tools; in other words, navigation and search are not mutually exclusive approaches to the problem.
The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system, though they separate the semantic spaces from the hierarchical spaces. My concern is that this makes the semantic spaces second-class citizens (though this system does push them to the front of the bus). One thing that surprised me is their comment about returning semantic information before regular directory information: in my experience, application programs sort the results of directory enumerations themselves and do not rely upon the order in which entries are returned.
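For instance, a typical listing application does its own sorting, as this small POSIX example using scandir() and alphasort() shows; whatever order the file system returns entries in is invisible to the user.

```c
/* Typical application behavior: enumerate a directory and sort the
 * entries locally, so the order the file system returns them in does
 * not matter to the user. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const char *dir = (argc > 1) ? argv[1] : ".";
    struct dirent **entries;

    int n = scandir(dir, &entries, NULL, alphasort);   /* sorted locally */
    if (n < 0) {
        perror("scandir");
        return 1;
    }
    for (int i = 0; i < n; i++) {
        printf("%s\n", entries[i]->d_name);
        free(entries[i]);
    }
    free(entries);
    return 0;
}
```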
The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of applying multiple tags at once, which avoids repeated calls into the file systems interface. Copy also needs to be optimized to deal with its inherently non-atomic nature, and rename and unlink have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as a real-world counter-example to the idea that you need path-based security for things to work properly: while NTFS supports path-based security, the OS default is to grant the traverse right to all users on the system. POSIX-compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior at the boundary between POSIX and the extension. That’s certainly a useful lesson.
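The cycle problem with copy is easy to see in a sketch. The types below are hypothetical (mine, not the paper’s): the moment the namespace is a graph rather than a tree, a recursive copy has to remember what it has already visited, or it will loop forever on a cycle and duplicate shared nodes.

```c
/* Why copy gets harder in a graph namespace: a recursive copy must track
 * visited nodes, or a cycle will recurse forever and shared nodes will be
 * copied more than once. Hypothetical types for illustration. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct node node_t;
struct node {
    const char *name;
    node_t    **children;
    size_t      child_count;
    bool        visited;        /* must be cleared before each copy */
};

/* Stand-in for cloning a single object's metadata and (eventually) data. */
static void copy_one(const node_t *src)
{
    printf("copying %s\n", src->name);
}

static void copy_graph(node_t *src)
{
    if (src->visited)
        return;                  /* cycle or shared node: already handled */
    src->visited = true;
    copy_one(src);
    for (size_t i = 0; i < src->child_count; i++)
        copy_graph(src->children[i]);
}

int main(void)
{
    node_t a = { "a", NULL, 0, false };
    node_t b = { "b", NULL, 0, false };
    node_t *a_children[] = { &b };
    node_t *b_children[] = { &a };   /* deliberate cycle: a -> b -> a */
    a.children = a_children; a.child_count = 1;
    b.children = b_children; b.child_count = 1;
    copy_graph(&a);                  /* terminates despite the cycle */
    return 0;
}
```

A plain tree copy never needs the visited flag; rename and unlink raise similar reachability questions once the structure is a graph.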
In the end, this paper focuses on using tags for files and creating namespace extensions that identify those files. It is a short (4 page) paper, and there is no evaluation of what the authors constructed or how effective it was. Still, it presents one point in the design space, and it is certainly useful to consider as I stake out my own.
Graph File Systems
We submitted a paper to HotOS 2019 in which we (unsuccessfully) made the argument that organizing file systems as hierarchies is hobbling our ability to enhance the usability of file systems.
One of the blind reviews pointed to a pair of papers, one of which I’ve already reviewed (I’ll review the other, but I didn’t consider it to be the same thing, except vaguely in name, since it ends up being a semantic tagging system). This paper actually pre-dates the work I submitted to Eurosys and profoundly influenced that work.
Five pages isn’t really much space to explore this area. Further, it was about a week before the deadline that I found out that HotOS, while an ACM workshop now, uses an older template for its format, with wider margins and larger text, so the five-page draft we had became 6.5 pages! After surgery, it was back down to five pages, but missing some useful discussions.
After submission of the HotOS paper, someone pointed me to a Stack Overflow article describing a 1958 paper (An Information Filing and Retrieval System for the Engineering and Management Records of a Large-Scale Computer Development Project) that may be the earliest record of hierarchical file structure (Figure 1).
This is certainly not “new knowledge” as it has been extensively discussed in prior work – hierarchical structure fits the model in which physical filing was actually done.
This becomes clear by the time we get to Multics (Figure 2). We now have a model of directories and files organized in a strict hierarchical fashion.
In my experience, when one presents a model and then finds it necessary to “hack” the model to make it usable, it suggests that the model is wrong, or at least not optimal. In the same paper the Multics authors observe that they found it useful to augment the hierarchy with links. But the introduction of links converts their hierarchy into a directed acyclic graph. Similar, yet not the same.
In all fairness to the Multics folks, this was a reasonable choice at the time. They faced substantial limitations that would have made graph processing impractical (indeed, some are likely to question whether graph processing at this level is practical even now).
So what is it I envision? In Figure 4 I’ve started with a simplified graph model. The model I’m envisioning (please keep in mind that this is a work in progress and quite likely to change) has a clear separation between the name space (which is the graph) and the storage manager (which figures out how to store and retrieve the data).
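As a very rough sketch of that separation (my own names, purely illustrative and certain to change): the namespace layer holds nodes and labelled relationship edges, while each node carries only an opaque handle that the storage manager knows how to resolve into data.

```c
/* Work-in-progress sketch (hypothetical names): the namespace is a graph
 * of nodes and labelled relationship edges; a node's data is just an
 * opaque handle that only the storage manager knows how to resolve. */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t storage_handle_t;     /* opaque to the namespace layer */

typedef struct ns_edge {
    struct ns_node *target;
    const char     *relationship;      /* e.g., "contains", "derived-from" */
} ns_edge_t;

typedef struct ns_node {
    const char      *name;
    ns_edge_t       *edges;
    size_t           edge_count;
    storage_handle_t data;             /* resolved by the storage manager */
} ns_node_t;

/* The storage manager is the only component that touches actual data. */
typedef struct storage_manager {
    int (*read)(storage_handle_t h, void *buf, size_t len, uint64_t off);
    int (*write)(storage_handle_t h, const void *buf, size_t len, uint64_t off);
} storage_manager_t;
```

A conventional hierarchy falls out as the special case where every node has exactly one incoming "contains" edge.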
One important benefit to come out of the rejection was the identification of the QMDS paper; it helps establish why hierarchy isn’t good, even if the solution they put forward has limitations. For me this is a blessing in disguise: I have had to spend so much time justifying that there is even a research question here that being able to point to prior work, which wrestled with the same issues and made many similar arguments, lets me focus future work more on the solution.
The graph model makes sense to me because it generalizes both the hierarchical tree (which is a minimally connected graph) and existing relationships, including links. We are also much more familiar with graphs now than we were in the past: Facebook and LinkedIn are, at their heart, relationship graphs. Computer memories are much larger than in 1965, as are storage capacities. During the Eurosys Doctoral Workshop someone asked me about the overhead of such a system, and I made the bold statement that I would be willing to spend 10% of my storage space if it dramatically improved my ability to find things. Surprisingly, that seemed to mollify the person asking.
It is the capture of relationships that distinguishes this approach from the more classic tagging approach. A tag captures some property of what a file is, not how it relates to other files. We’ve actually had tagging systems for a very long time: when I worked on Episode we explicitly decided to add “property lists” as a form of extended attribute, not quite as general as alternate data streams in NTFS, but a similar idea. (As I understand it, they chose to do something similar in ReFS: it supports alternate data streams, but they are limited to 128KB. Episode had a 64KB limit for property lists.)
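To make the contrast concrete, here is roughly what a tag looks like at the API level today, using Linux extended attributes (the attribute name and file name are examples I made up). Notice that the tag is a property of one file; there is nothing here that can say this file is derived from, or otherwise related to, another file.

```c
/* Tagging as it exists today: a property attached to a single file via
 * Linux extended attributes. "user.tag.project" and "report.docx" are
 * examples of mine; the point is that the tag describes one file and
 * cannot express a relationship between two files. */
#include <string.h>
#include <stdio.h>
#include <sys/xattr.h>

int main(void)
{
    const char *path  = "report.docx";       /* hypothetical file */
    const char *value = "budget-2019";

    if (setxattr(path, "user.tag.project", value, strlen(value), 0) != 0) {
        perror("setxattr");
        return 1;
    }
    return 0;
}
```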
Why aren’t tags enough? Because they associate information with a specific file (or directory). What they fail to capture are the relationships across file system objects. And why do we want relationships?
Up to this point I’ve been arguing that we want relationships because they provide us with the ability to find things. One of the very intriguing take-aways from The Ubiquitous Digital File paper is the observation that people prefer navigation to search. That’s a pretty profound observation when viewed against 30 years of research into tagging systems. Apple’s Spotlight and Microsoft’s search focus on improving search ability.
I’m pretty old-school here. When I am looking for something, I often resort to searching for it by name from the command line, and once I find it I navigate to the containing directory. I had not really considered that navigation is my primary mechanism, with search as the secondary one.
One of the most common uses of graphs by “real people” is maps. I’ve known this, and I have long considered visualizations of data as maps between data elements. What I had not really considered is that we navigate maps all the time. If our data were organized as a graph, we could navigate it much like we navigate a map, or the way we walk through relationship graphs such as Facebook or LinkedIn.
The foundation of this research direction is the relationship graph. Thus, the next phase of my work is really to explore what a reasonable representation of the namespace in this system would look like. More to discuss and consider in a future post!