GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.
In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.
I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected and one reviewer specifically cited to this paper (as well as the QDMS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing this from our work.
The abstract does a good job of explaining how this work is different than what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files rather than just characteristics of the files themselves.
The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”
Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.
“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”
In my own way, I have been looking at this question for quite some time. Over a year ago I was working on finding a way in which I could support both classic file system interfaces as well as augmenting them with new features without requiring invasive operating systems level changes. While I expect that ultimately a successful demonstration of new interfaces will lead to OS level changes, it makes more sense to explore what interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux as well as FUSE for Windows, and quite a few other OS platforms too) and the other is a message passing mechanism that directly routes requests from the application to the user mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!
So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.
One of the strengths of this paper is the emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Trying to argue that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user which means helping them in the way they use the system now. I suspect the ideal will be to enhance both the current way, as well as provide better search tools; in other words navigation and search are not mutually exclusive approaches to the problem.
The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system; they separate the semantic spaces from the hierarchical spaces, though. My concern is that this creates the semantic spaces as second class citizens (though, this system pushes them to the front of the bus). One thing that surprised me is their comment about how they returned semantic information before regular directory information. In my experience, application programs sort the results of directory enumerations and do not rely upon the order in which entries are returned.
The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of pushing multiple tags at once, which avoids repeated calls into the file systems interface. Copy needs to be optimized as well to deal with the inherent non-atomic nature of the beast. Rename and unlink also have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as being a real-world counter-example to the idea that you need path based security to work properly; while NTFS supports path-based security, the OS default is to grant traverse right to all users on the system. POSIX compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior between POSIX and the extension. That’s certainly a useful lesson.
In the end, this paper focuses on using tags for their files and creating namespace extensions that identify the files. It is a short (4 page) paper, and there is no evaluation of what they constructed or how effective it was. It presents one point in the design space and it is certainly a useful paper to consider as I design my own point in the design space.
We submitted a paper to HotOS 2019 in which we (unsuccessfully) made the argument that file systems as hierarchies is hobbling our ability to enhance the usability of file systems.
One of the blind reviews pointed to a pair of papers, one of which I’ve already reviewed (I’ll review the other, but I didn’t consider it to be the same thing, except vaguely in name as it ends up being a semantic tagging system). This paper actually pre-dates the work I submitted to Eurosys and profoundly influenced that work.
Five pages isn’t really much space to explore this area. Further, it was about a week before the deadline that I found out HotOS, while an ACM workshop now, uses an older template for its format, with wider margins and larger text, so the five page draft version we had became 6.5 pages! After surgery, it was back down to five pages but missing some useful discussions.
After submission of the HotOS paper, someone pointed me to a Stack Overflow article describing a 1958 paper (An Information Filing and Retrieval System for the Engineering and Management Records of a Large-Scale Computer Development Project) that may be the earliest record of hierarchical file structure (Figure 1).
This is certainly not “new knowledge” as it has been extensively discussed in prior work – hierarchical structure fits the model in which physical filing was actually done.
This becomes clear by the time we get to Multics (Figure 2). We now have a model of directories and files organized in a strict hierarchical fashion.
In my experience, when one presents a model and then finds it necessary to “hack” the model to be usable, it suggests that maybe the model is wrong – or at least not optimal. In the same paper the authors observe that they found it useful to augment hierarchy with links. But the introduction of links converts their hierarchy into a directed acyclic graph. Similar, yet not the same.
In all fairness to the Multics folks, this was a reasonable option at that point. They had substantial limitations that would make graph processing impractical at that point (indeed, there are some who are likely to question whether or not graph processing at this level is practical now).
So what is it I envision? In Figure 4 I’ve started with a simplified graph model. In the model I’m envisioning (please keep in mind, this is a work in progress and quite likely to change) is that we have a clear separation between the name space (which is the graph) and the storage manager (which deals with figuring out how to deal with data).
One important benefit to come out of the rejection was identification of the QMDS paper – it helps establish why hierarchy isn’t good, even if the solution they put forward has limitations. For me, this is a blessing in disguise because I’ve had to spend so much time justifying why there is even a research question here that pointing to prior work (which wrestled with the same issues and made many similar arguments) allows me to focus future work more on the solution.
The graph model makes sense to me because it generalizes the hierarchical tree (a minimally connected graph) and existing relationships, including links. We are much more familiar with graphs now than we have been in the past: Facebook and LinkedIn are at their heart relationship graphs. Computer memories are much larger than in 1965, as are storage capacities. During the Eurosys Doctoral Workshop someone asked me about the overhead of such a system and I made the bold statement that I would be willing to spend 10% of my storage space if it dramatically improved my ability to find things. Surprisingly, that seemed to mollify the person asking.
It is the capture of relationships that distinguishes this approach from the more classic tagging approach. A tag represents an extension of some property of what a file is, not how it relates to other files. We’ve actually had tagging systems for a very long time – when I worked on Episode we explicitly decided to add “property lists” as a form of extended attribute; not quite as general as streams in NTFS, but a similar idea (as I understand it, they chose to do something similar in ReFS – they support alternate data streams, but they are limited to 128KB. Episode had a 64KB limitation for property lists.)
Why aren’t tags enough? Because they associate information with the specific file (or directory). What they fail to capture is relationships across file system objects. Why do we want relationships?
Up to this point I’ve been arguing that we want relationships because they provide us with the ability to find things. One of the very intriguing take-aways from The Ubiquitous Digital File paper is the observation that people prefer navigation to search. That’s a pretty profound observation when viewed against 30 years of research into tagging systems. Apple’s Spotlight and Microsoft’s search focus on improving search ability.
I’m pretty old-school here. When I am looking for something I often resort to searching for it by name from the command line and once I find it I navigate to the containing directory. I had not really considered that for me navigation is my primary mechanism and I use search as a secondary mechanism.
One of the most common uses of graphs by “real people” are maps. I’ve known this and I have considered visualizations of data as being a map between data elements. What I had not really considered is that we navigate maps all the time. If our data is organized in a graph fashion, we could consider navigating it much like we might navigate a map, or walk through relationship graphs such as Facebook or LinkedIn.
The foundation of this research direction is the relationship graph. Thus, the next phase of my work is really to explore what a reasonable representation of the namespace in this system would look like. More to discuss and consider in a future post!