Home » File Systems » Graph File Systems

Category Archives: Graph File Systems

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
December 2024
S M T W T F S
1234567
891011121314
15161718192021
22232425262728
293031  

Brainiattic: Remember more with your own Metaverse enhanced brain attic

Connecting devices and human cognition

I recently described the idea of “activity context” and suggested that providing this new type of information about data (meta-data) to applications would permit improve important tasks such as finding. My examining committee challenged me to think about what I would do if my proposed service – Indaleko – already existed today.

This is the second idea that I decided to propose on my blog. My goal is to find how activity context can be used to provide enhanced functionality. My first idea was fairly mundane: how can we improve the “file browsing” experience in a fashion that focuses on content and similarity by combining prior work with the additional insight provided by activity context.

My initial motivation for this second idea was motivated by my mental image of a personal library but I note that there’s a more general model here: displaying digital objects as something familiar. When I recently described this library instantiation of my brain attic the person said “but I don’t think of digital objects as being big enough to be books.” To address this point: I agree, another person’s mental model for how they want to represent digital data in a virtual world need not match my model. That’s one of the benefits of virtual worlds – we can represent things in forms that are not constrained by what things must be in the real world.

In my recent post about file browsers I discussed Focus, an alternative “table top” browser for making data accessible. One reason I liked Focus is that the authors observed how hierarchical organization does not work in this interface. They also show how the interface is useful and thus it is a concrete argument as to at least one limitation of the hierarchical file/folder browser model. Another important aspect of the Focus work was their observation that a benefit of the table top interface is it permits different users to organize information in their own way. A benefit of a virtual “library” is that the same data can be presented to different users in ways that are comfortable to them.

Of course, the “Metaverse” is still an emerging set of ideas. In a recent article about Second Life Philip Rosedale points out that existing advertising driven models don’t work well. This begs the question – what does work well?

My idea is that by having a richer set of environmental information available, it will be easier to construct virtual models that we can use to find information. Vannevar Bush had Memex, his extended memory tool. This idea turns out to be surprisingly ancient in origin, from a time before printing when most information was remembered. I was discussing this with a fellow researcher and he suggested this is like Sherlock Holmes’ Mind Palace. This led me to the model of a “brain attic” and I realized that this is similar to my model of a “personal virtual library.”

The Sherlock Holmes article has a brilliant quotation from Maria Konnikova: “The key insight from the brain attic is that you’re only going to be able to remember something, and you can only really say you know it, if you can access it when you need it,”

This resonates with my goal of improving finding, because improving finding improves access when you need it.

Thus, I decided to call this mental model “Braniattic.” It is certainly more general than my original mental model of a “personal virtual library,” yet I am also permitted to have my mental model of my pertinent digital objects being projected as books. I could then ask my personal digital librarian to show me works related to specific musical bands, or particular weather. As our virtual worlds become more capable – more like the holodeck of Star Trek – I can envision having control of the ambient room temperature and even the production of familiar smells. While our smart thermostats are now capturing the ambient room temperature and humidity level and we can query online sources for external temperatures, we don’t actively use that information to inform our finding activities, despite the reality is that human brains do recall such things; “it was cold out,” “I was listening to Beethovan,” or “I was sick that day.”

Thus, having additional contextual information can be used at least to improve finding by enabling your “brain attic.” I suspect that, once activity context is available we will find additional ways to use it in constructing some of our personal metaverse environments.

Relationships

I recently described two file systems (QMDS and GFS) that attempted to capture additional context for files to improve their usability. At Eurosys, I argued (somewhat successfully) that a distinguishing characteristic of my proposed work is to capture relationships between files, something that goes beyond mere isolated analysis of such files.

Index servers, which are now ubiquitous on mainstream platforms, attempt to solve this problem by focusing on the specific attributes of the given file. This is useful, and indeed it seems to be consonant with the general approach of semantic file systems, which attempt to classify files based upon their semantic content. For example, this is how modern Internet search engines work – they classify a document based upon its content.

I pointed out in my recent discussion of personal information management (PIM) that there is a difference between navigation and search. Thinking about this further, I realize that this is more nuanced and reflects the way that we humans look for things: we go to where we expect something to be (navigation) and if it is not there we then go to other places where we think it might be. For example, when I’m looking for my keys, I have a list of places I look first. When I don’t find them there I begin to systematically search for them.

Thus, the natural progression for finding the item of interest is navigation and then search. Index servers are basically the search part of the equation and they do not tend to be where we start first. Instead, it is the fall-back.

As humans, we tend to create associations between events. When I can’t find my keys I start thinking about the last time I saw them (“I know I had them because I was able to get into the apartment. After I got back I went and checked my e-mail…”) These are temporal clues but they are associations between other unrelated events and the object for which we are searching.

Another observation that Margo Seltzer made in our discussions on this topic is that when we use a web search engine, we are looking for an answer, but when we are searching for something that we have we are looking for the answer. This is a subtle, but important, difference. I don’t want to find any set of keys, I want to find a specific set of keys. Internet search is notoriously unreliable in this regard; how many times have you gone back a few days later to find some interesting article, only to realize the list of results coming back from your search engine are different than they were before? This is how Internet search engines exploit the fact that you are (usually) just looking for an answer – they don’t have to give you definitive, reproducible results.

Yesterday I had an interesting conversation with Sasha Fedorova and towards the end of it she suggested that I might do better promoting this work by focusing on solving the needs of a particular community and she suggested the software engineering community, partially because she has worked with them before and also because she could see the kinds of relationships that might be useful to that community. Further, that community is used to testing out experimental approaches that promise to improve effectiveness and productivity. In that same conversation she pointed out the Stack Overflow community as being one of those places software engineers search for answers.

This morning, on the way to the gym, I realized that the Stack Overflow community is also an example of how we organically create relationships: people ask questions and get back specific answers of varying quality. The community rates the responses and preserves the answers. This has organically created a web of connections across topics and people.

Why is this important to understand? Because the work I’m doing proposes going beyond simple semantic analysis of individual files, or even clustering them based upon specific characteristics, but also by establishing a set of interrelationships across files and across applications. Focusing on solving this issue for a single application is much akin to focusing on just looking at the semantic content of a single file. Moving beyond this to realize that us humans create associations in our brains means we need to find ways of capturing those associations across applications that help us better navigate the vast trove of information we accumulate.

For example, Sasha suggested that it would be interesting for the software development community if there were an association between the web pages we accessed and the code we were editing. This makes sense to me: I tend to look up documentation or explanations of specific things as I write code. If we capture this relationship across applications, we can motivate why this is a systems problem and not an application problem – operating systems provide services that are common to applications (not necessarily all applications, but something that is of broader interest than a single or a few applications). One benefit of focusing on the software engineering community is that permits us to identify relationship of interest to the community and then mine those relationships.

Another fascinating conversation (late last week) was when I was discussing my research direction with Ghita Berrada (who is visiting us this month) and we ended up having a discussion of generalized concept analysis. I was familiar with Formal Concept Analysis (FCA) because it was used by Benjamin Martin in his work more than a decade ago (his dissertation “Formal Concept Analysis and Semantic File Systems” is something I have found delightfully insightful and I think it is time for me to read it again) but she pointed me to Temporal Concept Analysis and Relational Concept Analysis. Temporal Concept Analysis is fairly recent, having only been first formally described in 2000, and uses FCA as its basis, but it focuses on temporal events. Relational Concept Analysis is even more recent (2007) and is definitely germane to my research direction since it focuses on relationships and how to handle them within the context of FCA. Given my own focus on relationships across files, it definitely seems pertinent.

All of these are in turn based upon the mathematical model of lattices – partially ordered sets. Lattice theory has been around for quite some time and shows up in a broad range of areas, such as CRDTs which are used in distributed systems. For example, they are used in Anna, a key-value store I read about last year (surprisingly, I didn’t write it up – I should have). It in turn pointed to earlier work on Lattices in distributed systems, which I did describe previously. This has prompted me to go off and read about lattices; this definitely is challenging as my formal math skills are quite rusty, but I have been systematically working my way through the book on lattice theory I picked up. It has been interesting trying to construct an intuitive understanding of the concepts from more formal language describing them; hopefully that knowledge will prove useful.

This is turning into a long post and I still haven’t reached the point I wanted (yet).

A challenge with this work is identifying relationships that we want to be able to support. Of course, I don’t want to restrict these relationships, but rather use a sample of such relationships to motivate the work (or “Why would anyone care about my research?“) One of the challenges of doing good research is motivating that research: there are numerous questions to answer, but which questions deserve being answered?

File systems presently support two basic relationships:

  • Contains – this is the directory metaphor. A “directory” is a container of other directories or files. It is the basis of the hierarchical relationship to which we have become accustomed.
  • Points to – this is the function of a symbolic link, which is supported by most file systems.

It is relatively easy to focus on properties as well. In theory, the clustering of files based upon this characteristic is one (loose) form of relationship. For example, we routinely associate specific file names with a corresponding application (e.g., via the suffix of the file). Windows exhibits a strong relationship model here, in which applications register interest in particular types of files, based on suffix, and the operating system then uses that information to invoke the relevant application. Apple’s Mac used to use an embedded meta-data component (“resource fork“) for associating the file with the application; it still exists but is not commonly used in order to support compatibility.

Other types of file properties include:

  • Timestamp – these capture temporal properties of a file. The most common are the creation time, last update time, and last access time. Last access time is often omitted these days because constantly updating it turns out to be expensive. For example, Windows NTFS does not update last access time by default. My recollection is that they did this because they found updating this field accounted for something like a 6% performance cost in NTFS. Thus, there is a question as to how reliable these values are.
  • Size – we know what the size of files are.
  • Name – I’ve already mentioned using the suffix, which is an aspect of the name. Is it possible to exploit this in other ways?

Then there are application relationships: what application created this file? It occurs to me that it might be useful to distinguish registered names (suffixes) from all files created by an application. For example, I don’t usually want to see the build artifacts of my software development environment. Microsoft Visual Studio handles this by allowing artifacts to be separated from source files, for example. Could we achieve something comparable within the file system by understanding these relationships? Would it be useful?

Another suggestion from Sasha was that we might want to record what music was playing when we were doing something specific because our brains may create an association across these seemingly unrelated events. This suggests another potential relationship: concurrent application execution. This is a sort of temporal relationship, but one that becomes more interesting when we consider it in a Memex style model of associations. How can I capture these relationships across different applications. Perhaps we can think of some sort of “current context” or “current activity” be associated with a given application that can then be queried and added to the files as we create them. These types of dynamic relationships are certainly more intriguing or interesting.

What kinds of relationships can you envision being useful when you are searching for that elusive file you know that you have but you don’t know where it lives in the hierarchy?

GFS: a Graph-based File System Enhanced with Semantic Features

GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.

In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.

I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected and one reviewer specifically cited to this paper (as well as the QDMS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing this from our work.

The abstract does a good job of explaining how this work is different than what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files rather than just characteristics of the files themselves.

The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”

Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.

“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”

In my own way, I have been looking at this question for quite some time. Over a year ago I was working on finding a way in which I could support both classic file system interfaces as well as augmenting them with new features without requiring invasive operating systems level changes. While I expect that ultimately a successful demonstration of new interfaces will lead to OS level changes, it makes more sense to explore what interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux as well as FUSE for Windows, and quite a few other OS platforms too) and the other is a message passing mechanism that directly routes requests from the application to the user mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!

So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.

One of the strengths of this paper is the emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Trying to argue that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user which means helping them in the way they use the system now. I suspect the ideal will be to enhance both the current way, as well as provide better search tools; in other words navigation and search are not mutually exclusive approaches to the problem.

The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system; they separate the semantic spaces from the hierarchical spaces, though. My concern is that this creates the semantic spaces as second class citizens (though, this system pushes them to the front of the bus). One thing that surprised me is their comment about how they returned semantic information before regular directory information. In my experience, application programs sort the results of directory enumerations and do not rely upon the order in which entries are returned.

The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of pushing multiple tags at once, which avoids repeated calls into the file systems interface. Copy needs to be optimized as well to deal with the inherent non-atomic nature of the beast. Rename and unlink also have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as being a real-world counter-example to the idea that you need path based security to work properly; while NTFS supports path-based security, the OS default is to grant traverse right to all users on the system. POSIX compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior between POSIX and the extension. That’s certainly a useful lesson.

In the end, this paper focuses on using tags for their files and creating namespace extensions that identify the files. It is a short (4 page) paper, and there is no evaluation of what they constructed or how effective it was. It presents one point in the design space and it is certainly a useful paper to consider as I design my own point in the design space.


Graph File Systems

We submitted a paper to HotOS 2019 in which we (unsuccessfully) made the argument that file systems as hierarchies is hobbling our ability to enhance the usability of file systems.

One of the blind reviews pointed to a pair of papers, one of which I’ve already reviewed (I’ll review the other, but I didn’t consider it to be the same thing, except vaguely in name as it ends up being a semantic tagging system). This paper actually pre-dates the work I submitted to Eurosys and profoundly influenced that work.

Five pages isn’t really much space to explore this area. Further, it was about a week before the deadline that I found out HotOS, while an ACM workshop now, uses an older template for its format, with wider margins and larger text, so the five page draft version we had became 6.5 pages! After surgery, it was back down to five pages but missing some useful discussions.

After submission of the HotOS paper, someone pointed me to a Stack Overflow article describing a 1958 paper (An Information Filing and Retrieval System for the Engineering and Management Records of a Large-Scale Computer Development Project) that may be the earliest record of hierarchical file structure (Figure 1).

Figure 1: ERMA Diagram mapping file folders to hierarcy

This is certainly not “new knowledge” as it has been extensively discussed in prior work – hierarchical structure fits the model in which physical filing was actually done.

This becomes clear by the time we get to Multics (Figure 2). We now have a model of directories and files organized in a strict hierarchical fashion.

Figure 2: 1965 Daley, et. al. Multics I/O Diagram (redrawn)

In my experience, when one presents a model and then finds it necessary to “hack” the model to be usable, it suggests that maybe the model is wrong – or at least not optimal. In the same paper the authors observe that they found it useful to augment hierarchy with links. But the introduction of links converts their hierarchy into a directed acyclic graph. Similar, yet not the same.

Figure 3: 1965 Daley, et. al. Multics I/O Diagram with links (redrawn)

In all fairness to the Multics folks, this was a reasonable option at that point. They had substantial limitations that would make graph processing impractical at that point (indeed, there are some who are likely to question whether or not graph processing at this level is practical now).

Simplified graph file system model
Figure 4: Simplified Graph Model

So what is it I envision? In Figure 4 I’ve started with a simplified graph model. In the model I’m envisioning (please keep in mind, this is a work in progress and quite likely to change) is that we have a clear separation between the name space (which is the graph) and the storage manager (which deals with figuring out how to deal with data).

One important benefit to come out of the rejection was identification of the QMDS paper – it helps establish why hierarchy isn’t good, even if the solution they put forward has limitations. For me, this is a blessing in disguise because I’ve had to spend so much time justifying why there is even a research question here that pointing to prior work (which wrestled with the same issues and made many similar arguments) allows me to focus future work more on the solution.

The graph model makes sense to me because it generalizes the hierarchical tree (a minimally connected graph) and existing relationships, including links. We are much more familiar with graphs now than we have been in the past: Facebook and LinkedIn are at their heart relationship graphs. Computer memories are much larger than in 1965, as are storage capacities. During the Eurosys Doctoral Workshop someone asked me about the overhead of such a system and I made the bold statement that I would be willing to spend 10% of my storage space if it dramatically improved my ability to find things. Surprisingly, that seemed to mollify the person asking.

It is the capture of relationships that distinguishes this approach from the more classic tagging approach. A tag represents an extension of some property of what a file is, not how it relates to other files. We’ve actually had tagging systems for a very long time – when I worked on Episode we explicitly decided to add “property lists” as a form of extended attribute; not quite as general as streams in NTFS, but a similar idea (as I understand it, they chose to do something similar in ReFS – they support alternate data streams, but they are limited to 128KB. Episode had a 64KB limitation for property lists.)

Why aren’t tags enough? Because they associate information with the specific file (or directory). What they fail to capture is relationships across file system objects. Why do we want relationships?

Up to this point I’ve been arguing that we want relationships because they provide us with the ability to find things. One of the very intriguing take-aways from The Ubiquitous Digital File paper is the observation that people prefer navigation to search. That’s a pretty profound observation when viewed against 30 years of research into tagging systems. Apple’s Spotlight and Microsoft’s search focus on improving search ability.

I’m pretty old-school here. When I am looking for something I often resort to searching for it by name from the command line and once I find it I navigate to the containing directory. I had not really considered that for me navigation is my primary mechanism and I use search as a secondary mechanism.

One of the most common uses of graphs by “real people” are maps. I’ve known this and I have considered visualizations of data as being a map between data elements. What I had not really considered is that we navigate maps all the time. If our data is organized in a graph fashion, we could consider navigating it much like we might navigate a map, or walk through relationship graphs such as Facebook or LinkedIn.

The foundation of this research direction is the relationship graph. Thus, the next phase of my work is really to explore what a reasonable representation of the namespace in this system would look like. More to discuss and consider in a future post!