Let’s Build a Windows File System

It’s been a while since I upgraded my Windows kernel development tools, so I thought I’d write about the steps I’m taking to do so. How you build a Windows file system has changed over the years but the basic structure of the file system driver itself has not.

At the time I’m writing this, the standard tools to do this are the Microsoft Visual Studio Integrated Development Environment and the Windows Driver Kit (WDK). The WDK download page (which could move, I just search for “Microsoft WDK” when I need to find it) actually describes the basic steps needed to develop drivers for Windows. As I write this, Visual Studio 2019 is downloading. I do not need many of the parts of it, so I just installed the “Windows Desktop Apps” components. Don’t forget to install the Windows SDK as well (it’s an optional component as part of the Visual Studio installation).

Once Visual Studio is installed, I’ll install the WDK. It will add the WDK integration components into Visual Studio and those will permit me to build Windows drivers. That will also install one of the most important tools: the debugger. I have been working with WinDBG for several decades and it is now an indispensable tool for kernel debugging, including both a graphical user interface as well as a command line version. It has support for debugging over a variety of transports (it used to support debugging over modems, but I haven’t done that in many years, so it may no longer be supported) including serial ports, USB ports, networks, and synthetic debugging (over a virtual device) for virtual machines. Since I will be using a virtual machine for my development, I will set up synthetic debugging. Easing this, it seems Microsoft has automated the process. I have never done it automatically, so I look forward to seeing if that works.

I still need to write about what I am building, but I will defer that for the moment because it deserves a post of its own. I can explain why I want to build this as a file system, though. I hope this is useful for anyone reading it who is thinking of building a file system.

The file systems interface, whether it is UNIX, Linux, MacOS, or Windows, is a clearly delineated boundary at which I can implement functionality that becomes available to anyone using the file system itself. The specifics of that interface vary somewhat across operating systems, but there is a high degree of commonality. One reason for this is that mainstream operating systems have a common heritage. Perhaps that will change in the future (one recent paper suggests that our OS architecture assumes that I/O is slow relative to processors and memory, an assumption that is no longer true) but it seems unlikely to change in the near-term.

The decision to implement functionality at the file systems interface is because essentially all applications for our operating systems know how to use it. New functionality – provided it conforms to the file systems interface – can be exposed to all those applications by supporting the file systems interface, which means that it seamlessly integrates into the existing system. If the new functionality is substantially different than what can be provided by the file systems interface, this will not be a good solution. Since my goal is to provide both compatibility and new functionality, I am using the file systems interface but will also look at ways in which I can augment it further. Indeed, one interesting aspect of the GFS paper I described earlier is they tried to implement functionality within the constraints of the existing interface. This has made me think about how I would go beyond what they did, without changing the interface, though I do expect at some point I will need to augment the existing interface.

I decided to do this on Windows for several reasons:

  • I am familiar with Windows file systems development; I am comfortable in the kernel environment, and I know how to integrate user mode services and kernel mode drivers together to provide my desired functionality;
  • I am looking at how to extend the namespace management. By doing this as a file system, I know it will be visible in standard applications, which in turn will make it easier for me to gauge how effective the changes are for ordinary users.
  • More than a decade ago Microsoft incorporated quite a few interesting features into their user interface in anticipation of the new Windows File System (WinFS). While WinFS was never released, its goals of augmenting existing file management mechanisms were partially integrated into Windows.
  • The NTFS file system in Windows supports a change journal that will be a good place to start for prototyping capturing some relationship data. I expect that we will ultimately provide additional mechanisms for doing this, but this should provide a good head start.
  • User mode file systems, while convenient for prototyping, are a dead end if I need a kernel mode file system. User mode file systems are known to be much slower, and this is amplified for meta-data operations. Since I expect to be implementing primarily meta-data operations, I expect it is highly likely that I will not be able to demonstrate acceptable performance if I were to use FUSE for Windows, for example.
  • Windows already has a supported mechanism for accessing files via a file identifier. We know that applications do not need hierarchical name spaces – that is one of the lessons from the Google File System paper. What applications need is a key. This is hardly a new observation: both NFS and AFS employed file identifiers in their implementations. Indeed, one reason that CDFS and NTFS on Windows have long support “open by file ID” is because they needed it to support AFP for the Services for Macintosh support in Windows NT. While the NTFS source code is not generally available, the CDFS file system source code for Windows is in the WDK – and demonstrates how CDFS supports “open by file ID”. Note that NTFS is more permissive in its support, as it allows absolute path name opens, while CDFS does not.

Now that I’ve stalled a bit, my Visual Studio 2019 installation has finished and I went to install the WDK; that installation was unhappy that it could not find the right version of the Windows SDK to install and it pointed me off to another web page to download the correct version. Installation still is not a seamless experience, it would seem.

So now, as I install the SDK, I can return to discussing my project in a bit more detail, even though this is probably unfair since I haven’t really told you much about what I am trying to achieve.

I have opined about the challenges of managing hundreds of thousands of files across multiple devices. I pointed to projects that have explored alternative ways of managing the name space (e.g., QMDS and GFS) and they have done a good job of laying the groundwork. Prior discussions I have had with others have focused on search as a paradigm, but the survey paper on File Management Research I recently described helped me better understand that navigation seems to be the approach users prefer over search.

One possible reason for this preference for navigation is the nature of the task itself. When we search on the Internet, we are seeking an answer to our question. When we look for something within our personal files, we are seeking the answer to our question. We fall back to search when navigation fails us.

The SDK finally finished installing… I’ll try the WDK again.

Over the years, I have constructed a number of virtual or pseudo file systems. For example, I once constructed a file system that didn’t store any data, it just presented an ephemeral name space. I crafted it to support a broad range of meta-data features: object IDs, file IDs, extended attributes, alternate data streams, and access control lists (ACLs). But read operations were satisfied by zero-filling the memory buffer, and writes were discarded – thus, the only data that was visible is data which persisted in the virtual memory file data cache. This was a fun file system to build, not the last of which because it really was only the namespace and meta-data management parts of the file system. My motivation for doing this was performance testing for a file systems construction framework but I realized at some point that it could also be a good baseline from which to construct a new file system.

Thus, my project direction: building an abstract file systems interface that has rich support for meta-data. Actual file I/O can then be redirected to the underlying file, within its own native file system

The WDK installation is now finished! My next steps are:

  • Constructing a virtual machine image that I can use for debugging; the wonderful thing about most file systems work is we aren’t dependent on hardware, so it is easy to do development inside the virtual machine environment.
  • Sketch out my skeleton file system model.
  • Begin to implement it.

I don’t expect to build production quality code. Over the years I have learned that the bar for production quality file systems code is quite high. Thus, my goal is to construct a working prototype and then, from that, begin building my new file system namespace.

GFS: a Graph-based File System Enhanced with Semantic Features

GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.

In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.

I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected and one reviewer specifically cited to this paper (as well as the QDMS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing this from our work.

The abstract does a good job of explaining how this work is different than what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files rather than just characteristics of the files themselves.

The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”

Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.

“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”

In my own way, I have been looking at this question for quite some time. Over a year ago I was working on finding a way in which I could support both classic file system interfaces as well as augmenting them with new features without requiring invasive operating systems level changes. While I expect that ultimately a successful demonstration of new interfaces will lead to OS level changes, it makes more sense to explore what interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux as well as FUSE for Windows, and quite a few other OS platforms too) and the other is a message passing mechanism that directly routes requests from the application to the user mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!

So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.

One of the strengths of this paper is the emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Trying to argue that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user which means helping them in the way they use the system now. I suspect the ideal will be to enhance both the current way, as well as provide better search tools; in other words navigation and search are not mutually exclusive approaches to the problem.

The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system; they separate the semantic spaces from the hierarchical spaces, though. My concern is that this creates the semantic spaces as second class citizens (though, this system pushes them to the front of the bus). One thing that surprised me is their comment about how they returned semantic information before regular directory information. In my experience, application programs sort the results of directory enumerations and do not rely upon the order in which entries are returned.

The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of pushing multiple tags at once, which avoids repeated calls into the file systems interface. Copy needs to be optimized as well to deal with the inherent non-atomic nature of the beast. Rename and unlink also have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as being a real-world counter-example to the idea that you need path based security to work properly; while NTFS supports path-based security, the OS default is to grant traverse right to all users on the system. POSIX compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior between POSIX and the extension. That’s certainly a useful lesson.

In the end, this paper focuses on using tags for their files and creating namespace extensions that identify the files. It is a short (4 page) paper, and there is no evaluation of what they constructed or how effective it was. It presents one point in the design space and it is certainly a useful paper to consider as I design my own point in the design space.

Graph File Systems

We submitted a paper to HotOS 2019 in which we (unsuccessfully) made the argument that file systems as hierarchies is hobbling our ability to enhance the usability of file systems.

One of the blind reviews pointed to a pair of papers, one of which I’ve already reviewed (I’ll review the other, but I didn’t consider it to be the same thing, except vaguely in name as it ends up being a semantic tagging system). This paper actually pre-dates the work I submitted to Eurosys and profoundly influenced that work.

Five pages isn’t really much space to explore this area. Further, it was about a week before the deadline that I found out HotOS, while an ACM workshop now, uses an older template for its format, with wider margins and larger text, so the five page draft version we had became 6.5 pages! After surgery, it was back down to five pages but missing some useful discussions.

After submission of the HotOS paper, someone pointed me to a Stack Overflow article describing a 1958 paper (An Information Filing and Retrieval System for the Engineering and Management Records of a Large-Scale Computer Development Project) that may be the earliest record of hierarchical file structure (Figure 1).

Figure 1: ERMA Diagram mapping file folders to hierarcy

This is certainly not “new knowledge” as it has been extensively discussed in prior work – hierarchical structure fits the model in which physical filing was actually done.

This becomes clear by the time we get to Multics (Figure 2). We now have a model of directories and files organized in a strict hierarchical fashion.

Figure 2: 1965 Daley, et. al. Multics I/O Diagram (redrawn)

In my experience, when one presents a model and then finds it necessary to “hack” the model to be usable, it suggests that maybe the model is wrong – or at least not optimal. In the same paper the authors observe that they found it useful to augment hierarchy with links. But the introduction of links converts their hierarchy into a directed acyclic graph. Similar, yet not the same.

Figure 3: 1965 Daley, et. al. Multics I/O Diagram with links (redrawn)

In all fairness to the Multics folks, this was a reasonable option at that point. They had substantial limitations that would make graph processing impractical at that point (indeed, there are some who are likely to question whether or not graph processing at this level is practical now).

Simplified graph file system model
Figure 4: Simplified Graph Model

So what is it I envision? In Figure 4 I’ve started with a simplified graph model. In the model I’m envisioning (please keep in mind, this is a work in progress and quite likely to change) is that we have a clear separation between the name space (which is the graph) and the storage manager (which deals with figuring out how to deal with data).

One important benefit to come out of the rejection was identification of the QMDS paper – it helps establish why hierarchy isn’t good, even if the solution they put forward has limitations. For me, this is a blessing in disguise because I’ve had to spend so much time justifying why there is even a research question here that pointing to prior work (which wrestled with the same issues and made many similar arguments) allows me to focus future work more on the solution.

The graph model makes sense to me because it generalizes the hierarchical tree (a minimally connected graph) and existing relationships, including links. We are much more familiar with graphs now than we have been in the past: Facebook and LinkedIn are at their heart relationship graphs. Computer memories are much larger than in 1965, as are storage capacities. During the Eurosys Doctoral Workshop someone asked me about the overhead of such a system and I made the bold statement that I would be willing to spend 10% of my storage space if it dramatically improved my ability to find things. Surprisingly, that seemed to mollify the person asking.

It is the capture of relationships that distinguishes this approach from the more classic tagging approach. A tag represents an extension of some property of what a file is, not how it relates to other files. We’ve actually had tagging systems for a very long time – when I worked on Episode we explicitly decided to add “property lists” as a form of extended attribute; not quite as general as streams in NTFS, but a similar idea (as I understand it, they chose to do something similar in ReFS – they support alternate data streams, but they are limited to 128KB. Episode had a 64KB limitation for property lists.)

Why aren’t tags enough? Because they associate information with the specific file (or directory). What they fail to capture is relationships across file system objects. Why do we want relationships?

Up to this point I’ve been arguing that we want relationships because they provide us with the ability to find things. One of the very intriguing take-aways from The Ubiquitous Digital File paper is the observation that people prefer navigation to search. That’s a pretty profound observation when viewed against 30 years of research into tagging systems. Apple’s Spotlight and Microsoft’s search focus on improving search ability.

I’m pretty old-school here. When I am looking for something I often resort to searching for it by name from the command line and once I find it I navigate to the containing directory. I had not really considered that for me navigation is my primary mechanism and I use search as a secondary mechanism.

One of the most common uses of graphs by “real people” are maps. I’ve known this and I have considered visualizations of data as being a map between data elements. What I had not really considered is that we navigate maps all the time. If our data is organized in a graph fashion, we could consider navigating it much like we might navigate a map, or walk through relationship graphs such as Facebook or LinkedIn.

The foundation of this research direction is the relationship graph. Thus, the next phase of my work is really to explore what a reasonable representation of the namespace in this system would look like. More to discuss and consider in a future post!

Collaboration versus Cheating

Collaboration Versus Cheating: Reducing Code Plagiarism in an Online MS Computer Science Program Tony Mason, Ada Gavrilovska, and David A. Joyner, Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pp. 1004-1010, Minneapolis, MN, February 27 – March 2, 2019.

This is a rather different paper than my usual but I thought I would write about it since I was the one that drove this particular work. I first started as a teaching assistant for the Graduate Introduction to Operating Systems in Georgia Tech’s Online Master of Science in Computer Science program. One of my motivations at the time was to really learn more about how online education could be realized. The program’s goals were to achieve scale (which it has) and to maintain a quality level that was non-inferior to their online MSc program offering. I achieved many of my goals here, as I learned how to improve the pedagogical goals (by improving feedback to students, and helping them understand the topics) of the class while also finding ways to ensure academic honesty.

Figure 1: Detected plagiarism rates over time

I started working with the class in Fall 2015; at that point we had no proactive plagiarism detection in place, though we were collecting similarity data. It wasn’t until 2016 that I started looking at that data and began to realize we had a noticeable plagiarism rate. We tried a number of techniques to try and reduce this. We explained what plagiarism was, we offered “amnesties” if people that cheated would come forward and admit it. Neither worked. Amnesty actually lead to cases where students we did not think had plagiarized were admitting to things we considered to be fair use (e.g., using existing code to set up socket connections).

Figure 1 shows the progress over time, and clearly delineates when our intervention strategies were successful. What it does not capture is the qualitative difference. By Spring 2017, I was observing students wholesale submitting other students’ projects and claiming it as their own! It was that semester when I prioritized “early intervention” in which I quickly identified students suspected of cheating and confronted them. The plagiarism rate for subsequent projects dropped precipitously.

The downside to this approach is that it took a tremendous amount of work to do this, which makes it impractical because it does not scale. Effective is great, but scalability was also a requirement if it was going to be useful. In the summer of 2017 we introduced one new mechanism to the class: we added a mandatory quiz that went over the policy and asked students if they understood the course policies defining collaboration and cheating, as well as understanding why this was important and the potential penalties.

The results surprised me: the plagiarism rate plummeted. I was cautiously optimistic though because the Summer 2017 class was the smallest we’d ever had. Thus, I suspected it might simply be an anomaly. But this encouraged me to begin making the evaluation more rigorous. In the Fall of 2018 I spent time automating the process of doing code comparisons not only with other students in the same class, but also comparing submissions across all prior submissions. My motivation for doing this work was to validate what I thought I was seeing.

Thus I performed retrospective analysis on prior semesters in which the course was offered. I used that data to write a paper for Learning at Scale 2018 but that paper was rejected. I used the feedback from that to revise my draft and submit an update to it to ICER 2018 but that was also rejected. I once again dug deep and revised the paper and submitted it to SIGCSE 2019, where it was accepted (and astoundingly, all the reviews were accepts!) I revised the final paper to address the concerns of some reviewers (the weak accepts) and that is what was included at SIGCSE.

This doesn’t move my research forward per-se, but it helped me understand how incredibly important data visualization is – like Figure 1. I also learned how important it is to construct reproducible data analysis flows. The scripts I wrote helped me build a suite of tools for improving reproducibility. The research I did helped me better understand why the primary automated tool we use works (MOSS) and its limitations. When I was done I realized that I could easily continue exploring this research area, but since it isn’t core to my focus, I’ll leave it to others to push the frontiers.

Still it was an interesting project and helped confirm my interest in doing research.

I have a public repository with further information (beyond what is in my paper) on GitHub. It includes a copy of the slides from my presentation in March at SIGCSE (in which I think I had around 60 people, which was far more than I’d expected given that it was the last half-day of the conference!)

QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language

QMDS: A File System Metadata Management Service Supporting a Graph Data Model-based Query Language

Sasha Ames, Maya B. Gokhale, and Carlos Maltzahn, International Journal of Parallel, Emergent and Distributed Systems, Volume 28, Number 2, pp. 159-183, 2013.

This paper came to my attention via feedback from an anonymous reviewer, observing that our idea of constructing a graph file system had “already been done”. It never ceases to amaze me that, despite how much time I have spent combing the literature, there seem to be things I miss. In this particular case, I have to agree with the reviewer that the basic idea we proposed really had been done before, though it seems as if the design space has not been exhausted and this paper actually will save me considerable time because up to this point it’s been a challenge to even explain why this kind of file system is useful.

Indeed, my read of this paper really suggests that these authors also struggled with similar objections because they spend considerable time justifying the need for their work: the introduction is fairly long because it explains the underlying problem, and the prior work section also goes to great length to explain why prior work is inadequate to the job.

In this paper, we discuss an exploration of our approach to the problem: the use of a graph data model for representing file system user-de fined metadata and a query language for retrieval. The purpose of this approach is to provide management of user-defined file metadata along with data under a single file system interface, delivering a common service across applications. Applications would be able to offload their metadata management needs to the service, alleviating the need for their own solution. This arrangement would benefit applications by reducing their code complexity, by virtue of not having their own custom metadata management components. A second benefit is improved opportunities for interoperability among separate applications.

The model we have been discussing, and trying to present, is one in which we have a richer model for meta-data to capture not only attributes of files, but also relationships across files. Like these authors we reached the conclusion that a graph is likely a better representational model for data. This encompasses the hierarchical model that is a fundamental part of POSIX, while at the same time providing us with a robust platform on which to build additional functionality. Before I start explaining that, though, I should go through this paper because it has valid results that I can use moving forward.

In Figure 1 from the paper, the authors describe the type of graph they are using: it has vertices (files) and edges (parent/child), with labels on the edges (attributes). I am not convinced this is the right graph model, but I will save that conversation for a future blog post.

In Figure 2, the authors delve into the structure of their graph in greater detail, as they compare their model to that of Resource Description Framework (RDF) triples that are used in several graph processing systems. Here we see a better description of their format (which is actually closer to what I’ll describe for my own work): “Our data model for file system metadata is a directed graph with attributes on nodes and edges, shown in Figure 1. Nodes in the graph can represent files, and this allows the system to manage relationships among files. We call our directed edges links, connecting parent and child nodes.”

They note that applications do not explicitly define schemas (something else I need to discuss in a future blog post) nor does their system require classes be defined. The authors argue this provides greater flexibility and indeed, the fact it does not force an application to be locked into a specific model.

“A heterogeneous approach to managing metadata gives all applications the
same tools to manage relationships.” It seems to me that this is one of the most compelling reasons on why this is a systems problem and not an application problem. If we insist applications implement this, most will not. Those that do will have no mechanism for interaction across applications. If files were truly isolated from one another, that would be fine, but in the real world, files do have relationships with other objects, whether it is other files, or external references (e.g., the “get me the e-mail from whence this file originated” example I described recently.) This helps with motivation, which I mentioned before has been an area of resistance I’ve received as well.

Figure 3 shows how the authors optimized their file system’s meta-data efficiency. This provides some interesting insight into the cases they expect to be common. I found their emphasis on navigation a useful one as well, particularly given the discussion of it recently.

The choice of optimization models certainly seems to be an important one, given that we can’t optimize for everything, and if we optimize for the wrong things we end up with something that looks much like brute-force search, which I can’t imagine is going to perform well.

In Figure 4 the authors turn their attention to their implementation model. They use the FUSE file systems interface to aid their implementation. This is interesting because one of the areas I’ve been exploring (sigh, yet another area to discuss in greater detail) is ways to more easily enhance the FUSE interface to enable exploring enhanced interfaces.

It seems that one downside to this approach is that it focuses on existing mechanisms for finding and accessing files, without providing a useful mechanism for exploiting enhanced search. Admittedly, the extended attribute interface does provide some mechanism for achieving this, but this is a useful paradigm for exploring how such a file system will work with existing applications – certainly an important aspect of constructing any file system that one expects will be useful.

Figure 5 was quite interesting to me, because it addresses one of the concerns I’ve seen in prior work based on relational databases (e.g., they’re often slow). I suspect that these results, with good times within QMDS relative to their evaluation relate to their optimization model for the queries they are executing.

One thing this doesn’t evaluate is how QDMS performs relative to other FUSE based file systems. The queries they do execute on QMDS seem to be targeted search queries and not necessarily well-correlated to actual usage as a file system.

There is no follow-up to this work, unfortunately, which makes it difficult to understand the general usefulness of QMDS. The upside to this is that it leaves considerable room for future work. It does provide a strong case for exploring this approach more thoroughly and I have already suggested that better evaluation seems justified under the circumstances.

I’ll discuss more of these issues as I turn my attention to describing my own work in future posts.

The Ubiquitous Digital File: A Review of File Management Research

The Ubiquitous Digital File: A Review of File Management Research

Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.

I recently stumbled across this recent paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.

This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.

This paper is a survey paper, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which is roughly 7.5 pages of 32 page paper, or almost 25% of the full paper (32 pages). Since I have spent a considerable amount of time digesting much of the systems focused research as well as some of the Human Computer Interface (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature as well as identifying useful research questions, some of which I find particularly interesting.


One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well-understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or likely both!


Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.

They point out that their work is relevant to “computer science” (and particularly the branch with which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have
examined only files stored on servers and software development

Thus two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems really has focused on specific workloads that are not likely representative of what is useful for file management (and correspondingly, for users of file management).

One observation the users make is surprising to me: “A preference for navigating to files is much more common than a preference for searching , even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that using augmented search mechanisms still likely have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.

Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.

This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.

Eurosys 2019

I attended Eurosys 2019 last month and in fact just returned from that trip, as I added three weeks of vacation to the end of it, though the first week had me spending most of the time in a hotel room working on a paper for submission.

I attended the doctoral workshop at Eurosys to pitch my idea for an associative file system. I received useful feedback from both the paper I submitted as well as my presentation. My poster for the doctoral workshop attempted to capture a non-textual perspective on the problem and the approach I am seeking to achieve

I did achieve my goal of ensuring there was minimal text in the poster, though I’m not sure I quite hit the right balance with it.

I also presented a second poster based upon a paper we had submitted around the same idea. This was a very different realization of the same basic concepts.

As I promised in my earlier post, I will be discussing my forward moving file systems idea(s) in more detail as I move along through this project.

Reboot time

It’s been almost a year since I posted anything substantive. It is so easy to just focus on other things, which is what I’ve been doing. In the past year I’ve continued to explore some interesting areas related to file systems. For example, for the past year I’ve been looking at persistent memory, which acts somewhat like storage (because it is persistent) and somewhat like DRAM (because it is byte addressed). The findings have been interesting: surprising in some cases, close to predicted in some cases. We’re still doing more work in this area, and I’m hoping to submit two papers later this year.

But the other big project is a file systems project. I’ve decided to use Windows as my platform of choice, both because I’m quite comfortable with Windows and because I think it’s the best choice for this project. Since it has been a goal of mine to do more writing in this area, I thought it would be great to use my blog to capture how I go about writing this file system for Windows with the (probably unrealistic) hope that I can eventually turn it into an online guide. I also plan on using a public repository for my project so that other people can see what I end up doing. I think I’ll turn that into a separate blog post, so I can talk more about that project.

Hopefully, by using this as a mechanism for describing my forward progress I can continue to post new information and content. That’s the theory at least.

If you are interested in persistent memory, two good recent papers are:

Basic Performance Measurements of the Intel Optane DC Persistent Memory Module

Single Machine Graph Analytics on Massive Datasets Using Intel® OptaneTM DC Persistent Memory

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage

MICA: A Holistic Approach to Fast In-Memory Key-Value Storage
Hyeontaek Lim, Dongsu Han, David G. Andersen, and Michael Kaminsky in Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’14).  April 2-4, 2014, Seattle, WA.

It’s time to turn attention to key value stores.  There are quite a few papers on this subject.  Seemingly simple, they turn out to be an important storage mechanism.  I had a delightful discussion about them over the weekend with Keith Bostic, who knows a fair bit about key-value stores (being the author of one of the most widely used KV stores.)  Of course, I have discussed some KVS already: MassTree, NVMCached, and SILT.

I’m going to continue looking at key-value stores (KVS) by starting with MICA. It is neither the first, nor the latest, work in this area, but it seems like a good place to start really describing key value stores in this space.  I have quite a few to cover so let’s get started.

MICA is a scalable in-memory key-value store that handles 65.6 to 76.9 million key-value operations per second using a single general-purpose multi-core system. MICA is over 4–13.5x faster than current state-of-the-art systems, while providing consistently high throughput over a variety of mixed read and write workloads.

It is an interesting system in that the authors decided to look at making their KVS tunable – it can serve as a cache (which means items can be thrown out by the KVS as space becomes scarce) or it can serve as storage (which means items cannot be removed from the KVS by the system itself, though it permits the applications using it to do so.)

To achieve this, they propose a series of data structures tuned to the task: circular logs, lossy concurrent hash indexes, and bulk chaining.  Their goal is to provide high performance through low contention.  They utilize dpdk for fast network processing.  They split their data structures so they independently handle caching and storage separately.  I might argue that this makes it two systems fused together, but let’s give them the benefit of the doubt at this point by calling it “generality”.

One of the concerns they try to address is the hot key issue.  In many KVS systems they get terrific performance on synthetic workloads because they use a uniform distribution.  In the real world, particularly for web objects, the workloads tend to be mostly cold with a small subset of the keys that are hot.  Further, these keys change over time.

They do restrict their domain somewhat to make the problem more tractable.  For example, they do not handle large items – everything must fit within a single network packet.  For systems that require larger items, they suggest using a separate memory allocator.  Their argument is that the extra indirection cost is a marginal increase in total latency.  They are optimizing for the common case.  Another aspect they exclude is durability.  That makes sense – they are only implementing an in-memory KVS, so durability is not a logical part of their work.  This also plays well in terms of performance, since persistence makes things much more complicated.

So what do they want to achieve with MICA?

  • High single-node throughput
  • Low end-to-end latecy
  • Consistent performance across workloads
  • Support small (but variable length) key-value objects
  • Support basic KVS operations: get, put, delete
  • Work efficiently on commodity hardware

They propose achieving this using a combination of design choices:

  • Parallel data access – in fact, they mean read access, as they do not parallelize these, as the cost is too high.
  • Optimized network overhead – as noted previously, they use dpdk to achieve this
  • Efficient KV structures – this includes dual memory allocators (one optimized for caching the other for store) and efficient index implementations.

Figure 7

There are some interesting characteristics to this system, including the fact that it employs dangling pointer handling.  Rather than remove pointers from the index, they allow the pointers to become invalid and instead trap that condition.  The paper describes how they do that in more detail.

They also use an interesting “prefetcher” to encourage efficient memory loading.  This is done by receiving requests in bursts and scheduling those requests to be processed “soon” to this prefetch step.

Figure 9Their evaluation of the system tries to focus on a range of workloads, varying key and value size as well as the range of operations being performed, with a read intensive workload being 95% reads and 5% writes.  A write intensive workload is split evenly.

They also look at how the key space is handled.  In the “Exclusive Read, Exclusive Write” paradigm (EREW) each core of the system is responsible for handling both reads and write for a given key (or range of keys more likely).  In the “Concurrent Read, Exclusive Write” paradigm (CREW), any core may satisfy a read for the key, but only one core may satisfy a write operation for the core.  This is a hat tip towards the cost of “bouncing” control of the cache lines between cores and we will see this approach used by other systems as well.  One interesting question that this reminds me of each time I read about this model is what impact, if any, the Level 3 cache might have on this.  Level 1 and 2 caches are traditionally per core but Level 3 is typically per socket.  Thus, things stored in L3 cache are substantially faster to access than RAM.  Something to consider, perhaps.

The paper has an extensive evaluation section and, of course, they demonstrate how their solution is substantially faster.  One interesting aspect to their evaluation is that they claim that each component of their design decision is important in enabling them to achieve their target performance.   To do this, they evaluate each individual aspect of their design decision.  For example for their parallel data model, they evaluate the variations on their choice and demonstrate that CREW is only faster with read skewed workloads.  They even go so far as to model Concurrent Read Concurrent Write (CRCW) workloads and point out that the cache overhead blunts the benefits of the concurrency.

To make this point on their choice of network implementations, they take MassTree and convert it to use dpdk, which yields a substantial performance improvement.  While they argue that their key indexes also contribute, they do not have a distinct break-out of that implementation and fall back to pointing out how they work better, even in CREW than MassTree.

There are certainly grounds to criticize MICA.  For example, they do choose workloads that fit well with their available resources.  I suspect that once they become resource constrained (e.g., cache) that the performance will drop substantially.  However, it provides an interesting contribution in the space and does point out how picking your data structures carefully can make a tremendous difference in the overall performance of the system.


Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters

Boosting Timestamp-based Transactional Memory by Exploiting Hardware Cycle Counters
Wenjia Ruan, Yujie Liu, and Michael Spear, in ACM Transactions on Architecture and Code Optimization (TACO), Volume 10, Number 4, page 40, 2013, ACM.

This paper is interesting in its use of a system level global clock to define a strong ordering of operations across cores.  The authors point out that the idea of using timestamps for constructing a transactional system.  When the goal is to use the timestamp to establish ordering (e.g., a Lamport Clock) it isn’t really so difficult in a single system.  After all, you can just have a global counter that is incremented as each transaction proceeds.  That defines an ordering of events.

Here’s the problem with this approach: the clock becomes a bottleneck.  While we do not usually think of memory operations as being a bottleneck, they certainly can be.  This is because multi-processor computers look much like a distributed system.  They exchange messages to provide coherency guarantees.  Over time, in the drive to gain further performance, these systems have relaxed their consistency guarantees.  This works because in the common case all of the changes to memory occur on a single processor (in fact on a single core of a single processor).  So when a program starts changing a value in memory it acquires control over a small region of memory (a “cache line”) and makes the changes on the processor.   It writes those changes back at some point in the future.  One reason is because another processor tries to access something within that same memory region.  This is where the messages go flying back and forth, the modified cache line gets written back to main memory.  This turns out to be “expensive”.  Access to RAM on my laptop is around 65 nanoseconds.  Access to L1 cache is the same speed as access to a processor register, so it depends upon the clock speed of the CPU.  On my laptop (1.9GHz) the clock cycle is 0.5 nanoseconds.  So it is 130 times slower.  On my dual socket Xeon system, I see that memory access is slower: 95 ns for local RAM and 125 ns for remote RAM (this is part of the non-uniform memory architecture – NUMA – behavior model). So in a multi-socket system, the cost is even higher for our shared timestamp.

In this paper the authors explore using the CPU level tick counter as a form of Lamport clock.  They describe the facilities of two processors: the UltraSparc T2 and the Intel Xeon X5650.  The UltraSparc’s tick counter is not monotonically increasing, even on a single CPU.  From this they conclude they cannot use it as a source for a clock, since monotonic increase is the fundamental requirement for such a clock.  The Intel chip, on the other hand, has a clock that can be used to construct global atomicity.  There are certainly some restrictions on this, but the cited guarantee is that:

“On modern 64-bit x86 architectures, if one core writes the result of an rdtscp instruction to memory, and another core reads that value prior to issuing its own rdtscp instruction, then the second rdtscp will return a value that is not smaller than the first.

From this premise, the authors construct a mechanism to exploit this in ownership records of a software transactional memory system.   They then convert several existing systems to use their timestamp mechanism and show that on a series of micro-benchmarks that it substantially outperforms the global counter mechanism.

They do establish that this solution is not more efficient for all use cases.  They specifically point out that privatization safety considerations make this work more challenging. The micro-benchmarks demonstrate that in at least one case the use of the global processor timestamp is not faster; this is ultimately because the privatization serialization model forces them to create dependencies on global state, thus eliminating the very rationale for using the hardware clock semantics.

Their analysis and conclusions are why I found this paper useful: “[T]he strong performance of our non-privatization-safe algorithms leads to questions about the benefit fo implicit privatization safety.  Perhaps the absence of bottlenecks in our algorithm will make strong isolation viable for unmanaged languages, or at least provide an incentive for a new explorations [sic] programming models with explicit privatizations.”

This certainly seems to be a useful tool for speeding up software transactional memory schemes, which certainly arise on a regular basis.