Category Archives: Semantic File Systems
Challenges of Capturing System Activity
A key aspect of the work I am doing for Indaleko is to “capture system activity” so that it can be used to form “activity contexts” that can then be used to inform the process of finding relevant information. As part of that, I have been working through the work of Daniela Vianna. While I have high-level descriptions of the information she collected and used, I need to reconstruct the details myself. She collects data from a variety of sources. The most common sources are web APIs to services such as Google and Facebook. She also uses file system activity information.
Since my background is file systems, I decided to start on the file system activity front first. Given that I’ve been working with Windows for three decades now, I decided to leverage my understanding of Windows file systems to collect such information. One nice feature of the NTFS file system on Windows is its support for a form of activity log known as the “USN Journal.” Of course, one of my handicaps is that I am used to using the native operating system API, not the libraries that are implemented on top of it. This is because when building file systems on Windows I have always been interested in testing the full kernel file systems interface. A few specific features cannot be exercised from applications at all, but a number of interfaces that cannot be tested through the typical Win32 API can be tested through the native API. In recent years the number of features hidden from the Win32 API has continued to decrease, which has diminished the need to use the native API. I just haven’t had any strong need to learn the Win32 API – why start now?
I decided the model I want to use is a service that pulls data from the USN journal and converts it into a format suitable for storing in a MongoDB database. I decided to go with Mongo because that is what Vianna used for her work. The choice at this point is somewhat arbitrary but MongoDB makes sense because it tends to work well with semi-structured data, which is what I will be handling.
Similarly, I decided to write my service for pulling USN Journal data from the NTFS file system(s) in C#: I have written some C# in the past, it makes some of the higher level tasks I have much easier, and it is well-supported on Windows. I have made my repository public, though I may restructure and/or rename it at some point (currently I call it CSharpToNativeTest because I was trying to invoke the native API as unmanaged code from C#). The most common approach to this is the “P/Invoke” mechanism, but after a bit of trial-and-error I decided I wanted something that would be easier for me to debug. So instead of pulling the native routine directly from ntdll.dll, I load it from my own DLL (written in C), which then invokes the real native call. This allows me to see how data is being marshaled and delivered to the C language wrapper. I also tried to make the native API “more C# friendly.” I am sure it could be more efficient, but I wanted to support a model that I could extend, and hopefully it will be easier to make it more efficient should that prove necessary.
One thing I did was to script the conversion of all the status values in ntstatus.h into a big C# enum type. The benefit of this is that when debugging I can automatically see the mnemonic name of the status code as well as its numeric value. I then decided to provide the layer needed to map between the various names Windows uses for a volume: device names, device IDs, and symbolic links (drive letters). While I have not yet added it, I wrote things so that it should be fairly straightforward to add a background thread which wakes up when devices arrive or disappear. As I have noted before, “naming is hard.” This is just one more example of the flexibility and challenges of aliasing and naming.
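To make the status-code conversion concrete, here is a minimal sketch of what such a script could look like. It is written in Python purely for illustration and is not the script used in the repository; the enum name `NtStatus` and the function name are assumptions, and the regular expression assumes the usual `#define STATUS_X ((NTSTATUS)0x...L)` layout found in ntstatus.h.

```python
import re

# Matches lines such as: #define STATUS_PENDING ((NTSTATUS)0x00000103L)
DEFINE_RE = re.compile(
    r"^\s*#define\s+(STATUS_\w+)\s+\(\(NTSTATUS\)\s*(0x[0-9A-Fa-f]+)L?\)"
)


def header_to_csharp_enum(header_text, enum_name="NtStatus"):
    """Emit C# enum source with one member per STATUS_* definition found.

    enum_name is a hypothetical name chosen for this sketch.
    """
    lines = [f"public enum {enum_name} : uint", "{"]
    seen = set()
    for line in header_text.splitlines():
        match = DEFINE_RE.match(line)
        if match is None:
            continue  # not a status definition; ignore comments and other macros
        name, value = match.groups()
        if name in seen:
            continue  # keep only the first definition of a repeated name
        seen.add(name)
        lines.append(f"    {name} = {value},")
    lines.append("}")
    return "\n".join(lines)
```

Feeding the text of the header through `header_to_csharp_enum` yields source that can be pasted into a C# file, so the debugger shows `STATUS_PENDING` rather than a bare `0x00000103`.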
Finally, I turned my attention to the USN journal. I found some packages for decoding USN journal entries; most were written to parse the data from the drive, while a few managed dynamic access. Since I want this to be a service that monitors the USN journal and keeps adding information into the database, I decided to write C# code to use the API for retrieving that information. At this point, what I have is the ability to scan all the volumes on the machine – even if they do not have drive letters – and query them to see if they support a USN journal. I do this properly – I query the file system attributes (using the NtQueryVolumeInformationFile native API) and check if the bit showing USN journal support is marked. I do not use the file system name, an approach I’ve always considered to be a hack, especially since I have been in the habit of writing file systems that support NTFS features, including named data streams, extended attributes, and object IDs. In fact, the ReFS file system on Windows also supports USN journals, so I’m not just being my usual pedantic developer self in this instance.
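The attribute check described above comes down to testing one bit in the file system attribute flags. The sketch below shows only that test, in Python for illustration; obtaining the flags (for example through NtQueryVolumeInformationFile) is out of scope here, and the function names are assumptions. The constants are the values documented in the Windows SDK headers.

```python
# File system attribute flags as documented for Windows volumes.
FILE_SUPPORTS_OBJECT_IDS = 0x00010000
FILE_NAMED_STREAMS = 0x00040000
FILE_SUPPORTS_EXTENDED_ATTRIBUTES = 0x00800000
FILE_SUPPORTS_USN_JOURNAL = 0x02000000

FLAG_NAMES = {
    FILE_SUPPORTS_OBJECT_IDS: "object IDs",
    FILE_NAMED_STREAMS: "named streams",
    FILE_SUPPORTS_EXTENDED_ATTRIBUTES: "extended attributes",
    FILE_SUPPORTS_USN_JOURNAL: "USN journal",
}


def supports_usn_journal(file_system_attributes):
    """True when the volume advertises USN journal support."""
    return bool(file_system_attributes & FILE_SUPPORTS_USN_JOURNAL)


def describe_features(file_system_attributes):
    """List the features (from the small table above) that a volume reports."""
    return [name for bit, name in FLAG_NAMES.items()
            if file_system_attributes & bit]
```

Because the decision depends only on the reported capability, a volume formatted with ReFS, or with a third-party file system that sets the same bit, is treated exactly like NTFS.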
At this point, I am able to identify volumes that support USN journals, open them and find out if USN is turned on (it is by default on the system volume, which is almost always the “C:” drive, though I enjoy watching things break when I configure a system to use some other drive letter.) I then extract the information and convert it to in-memory records. At the moment I just have it wait a few seconds and pull the newest records, but my plan is to evolve this into a service that I can run and it can keep pulling data and pushing it into my MongoDB instance.
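As an illustration of the "convert it to in-memory records" step, the following Python sketch decodes a single version 2 USN record from a raw buffer into a dictionary. The field order follows the documented USN_RECORD_V2 structure; the dictionary keys, the function name, and the choice to render file reference numbers as hex strings are assumptions made for this sketch, and only a subset of the reason flags is listed.

```python
import struct
from datetime import datetime, timedelta, timezone

# Fixed-size prefix of USN_RECORD_V2; the UTF-16 file name follows it.
HEADER = struct.Struct("<IHHQQqqIIIIHH")
# FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

# A subset of the documented USN_REASON_* flags.
REASONS = {
    0x00000001: "DATA_OVERWRITE",
    0x00000002: "DATA_EXTEND",
    0x00000004: "DATA_TRUNCATION",
    0x00000100: "FILE_CREATE",
    0x00000200: "FILE_DELETE",
    0x00001000: "RENAME_OLD_NAME",
    0x00002000: "RENAME_NEW_NAME",
    0x80000000: "CLOSE",
}


def parse_usn_record_v2(buffer, offset=0):
    """Decode one USN_RECORD_V2 starting at offset into a plain dict."""
    (length, major, minor, file_ref, parent_ref, usn, filetime, reason,
     source_info, security_id, attributes, name_length, name_offset
     ) = HEADER.unpack_from(buffer, offset)
    if major != 2:
        raise ValueError(f"unsupported USN record version {major}.{minor}")
    start = offset + name_offset
    name = bytes(buffer[start:start + name_length]).decode("utf-16-le")
    return {
        "usn": usn,
        # Hex strings avoid overflowing a signed 64-bit integer field.
        "file_reference": f"{file_ref:016x}",
        "parent_reference": f"{parent_ref:016x}",
        "timestamp": FILETIME_EPOCH + timedelta(microseconds=filetime // 10),
        "reasons": [label for bit, label in REASONS.items() if reason & bit],
        "file_attributes": attributes,
        "file_name": name,
        "record_length": length,  # advance by this much to reach the next record
    }
```

The result contains only strings, integers, a list, and a timezone-aware datetime, so it could be handed to a document store largely as-is (for instance via pymongo's `insert_one`), though the eventual schema is still an open question.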
At this point, I realized I do not really know that much about MongoDB so I have decided to start learning a bit more about it. Of course, I don’t want to be a MongoDB expert, so I also have been looking more carefully at Daniela Vianna’s work, trying to figure out what her data might have looked like and think about how I’m going to merge what she did into what I am doing. This is actually exciting because it means I’m starting to think of what we can do with this additional information.
This afternoon I had a great conversation with one of my PhD supervisors about this and she made a couple of suggestions about ways to consume this data. That she was suggesting things I’d also added to my list was encouraging. Here is what we are thinking:
- We can consider using “learned index structures” as we begin to build up data sets.
- We can use techniques such as Google BERT to facilitate dealing with the API data that Vianna’s work used. I pointed out that the challenges Vianna identified with APIs resemble those of languages: they have meaning and those meanings can be expressed in multiple ways.
- The need for being able to efficiently find things is growing rapidly. She was explaining some work that indicates our rate of data growth is outstripping our silicon capabilities. In other words, there is a point at which “brute force search” becomes impractical. I liked this because it suggests what we are seeing with our own personal data is a leading indicator of the larger problem. This idea of storing the meta-data independent of the data is a natural one in a world where the raw information is too abundant for us to just go looking for an item of interest.
So, my work continues, mostly mundane and boring, but there are some useful observations even at this early stage. Now to figure out what I want the data in my database to look like and start storing information there. Then I can go figure out what I did right, what I did wrong, and how to improve things.
Aside: one interesting aspect of the BERT work was their discussion of “transducers.” This reminded me of Gifford’s Semantic File System work, where he used transducers to suck out semantic information from existing files.
Visualization
Last post I discussed relationships. But relationships really are not enough. Another key to this puzzle is visualization. In other words, how do we present the information to users so that it is useful?
But first, let me step back and point to a larger problem: information overload. If we present users with a list of 100,000 options, they won’t necessarily be able to find what they’re actually seeking. For example, one of the challenges of using an Internet search engine is that it can return millions of “answers” to my basic query. In fact, I just typed in “What is the meaning of life?” to a search engine and it responded back with 1.1 billion possible answers, all in under a second. Search engines are a marvel at backing up the dump truck and offering to inundate me with answers. When is the last time you went through even a handful of these, let alone a large percentage of them?
I suspect that if I ask the HCI folks they will be able to tell me what works for ordinary mortals, but I assure you that computers are capable of returning more information than one can possibly ever process – many years ago I received a bug report about a file system directory that contained over 700,000 files and did not display with the file system kit I’d constructed and the company sold. I’d known I made decisions about resources when I did it; but we lifted that restriction and supported much larger directories than that. I’m quite sure that no mere human would look at such a directory in any meaningful sense. Maybe it was a collection of log files, in which case the names likely embed semantic information about the files themselves. Indeed, in Burrito the authors noted that scientists often embed the schema of their data within the file names. I know that I have done the same and when I’m looking for specific data I am often scripting code to sift through the pile of data to find the subset that is useful to me. The point remains the same: huge listings of files within a directory don’t work for humans.
One potential area for considering navigation is faceted search, a technique for making vast quantities of data searchable. Indeed, this fits well with the graph file system idea because what we expect to find in our graphs are clusters of related files. The graph is likely to be a sparse graph because most files are unlikely to share common features with one another. Thus, it suggests that at least one model for this data visualization problem is going to be rolling up data into these clusters, with an iterative approach to breaking it down and displaying it further. Of course, we might be able to do that within the confines of the existing hierarchical structure, which would be great for retrofitting this into the vast array of existing applications; still, my hope is that we can also provide novel variations (or rather someone more clever than I am at these things will do so). The challenge is that building something that enables this requires me to have some level of understanding of what kinds of information are needed to do so.
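The roll-up-then-drill-down idea behind faceted search can be sketched in a few lines. This is a toy illustration in Python, assuming each file is described by a flat dictionary of attributes; the attribute names and function names are hypothetical.

```python
from collections import Counter


def facet_counts(items, facet):
    """Roll up: how many items fall under each value of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def refine(items, **selected):
    """Drill down: keep only the items matching every selected facet value."""
    return [item for item in items
            if all(item.get(key) == value for key, value in selected.items())]
```

A user interface would alternate between the two calls: show the counts for a facet, let the user pick a value, refine, and then show the counts for the next facet over the smaller set.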
For example, I was wondering about first order approximations – those that mimic an existing hierarchical file system. Such a file system would present one or more views of the data. Maybe we have a Time view, and the time view then shows you all the files in time order. But if you have 1,000,000 files, that is going to be a mighty big list. One option might be to divide it up into ever smaller chunks of time. Eventually, we’d get to a point where you could see a few dozen files in some sort of time order. Of course, strict sharding of time might not make sense either: why should two files that are separated in time by small intervals end up in separate locations?
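The "ever smaller chunks of time" idea can be sketched as a recursive halving of the time range until each chunk is small enough to read. This Python sketch assumes files are given as `(name, timestamp)` pairs with numeric timestamps; the function names and the default limit of 36 are arbitrary choices for illustration.

```python
def split_by_time(files, limit=36):
    """Return time-ordered buckets, each holding at most `limit` files
    unless the files in it cannot be separated by time."""
    ordered = sorted(files, key=lambda entry: entry[1])
    return _split(ordered, limit)


def _split(ordered, limit):
    if len(ordered) <= limit:
        return [ordered]
    start, end = ordered[0][1], ordered[-1][1]
    midpoint = start + (end - start) / 2
    early = [entry for entry in ordered if entry[1] <= midpoint]
    late = [entry for entry in ordered if entry[1] > midpoint]
    if not early or not late:
        # Identical (or indistinguishable) timestamps: stop dividing.
        return [ordered]
    return _split(early, limit) + _split(late, limit)
```

This strict halving exhibits exactly the weakness noted above: two files a second apart can land on opposite sides of a midpoint. Splitting at the largest gaps between consecutive timestamps, rather than at midpoints, is one alternative worth trying.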
One possible option would be to consider a time slider that controls the view. This is something that we can find in other temporal data tools. Thus, creating a time slider might help make visualization easier to perform – in some ways this is similar to the Windows timeline feature that has recently been added into Windows 10. I suspect the file system doesn’t do much to facilitate this. From my own use of this feature it has some interesting limitations, not the least of which is that if you want full functionality they want to export data “to the cloud” for further analysis. That sounds like whatever they are doing is data intensive, which in turn suggests to me that the file system is not doing anything to facilitate this. If the answer really is “sorry, but you have to spend lots of computational cycles to mine this data” then my research is unlikely to be fruitful. I push forward because I don’t think this really is the case – the database community has done marvellous things that permit a vast treasure of relationships to be mined.
If we combine this with a faceted search area, I can envision a filtered timeline model – in essence, this seems to be what the new Windows feature is doing, albeit by filtering the things they have deemed to be of interest. I suspect they will extend this capability over time, but a time ordered view is just one of the possible relationships I consider to be important for consideration. I don’t know what the relationships will be, but I do think that having a pre-defined list of those relationships will be self-limiting. Perhaps it will be enough. I proceed on the basis that I expect it will not be enough.
One possible visualization I’ve been considering – and part of the motivation for me deciding I need to start building a file system – is that this sort of namespace might be achievable on our existing hierarchical model if I just add “an extra level of indirection”. Suppose we construct a file system that, in place of each existing file name, has a directory of the same name. In turn, that directory then contains relationships associated with other objects. One of those other objects could be the actual file (so we can still access it), but we could also have a “temporal” directory that would then display a list of files that were created, modified, or accessed around the same timeframe. We could store information about what web sites were visited around times that the file was in use. We could keep track of the music that was playing around the same time. We could point to files that were similar to that specific file. Such a visualization could be easily achieved and compatible with existing tools. Rather than being an endpoint, this is more an intermediate staging area – a way of mocking up the concepts and ideas, and to see what works for people and what does not work.
Thus the desire for a file system. I think we can construct a static namespace by using an existing file system and symbolic links (so you can eventually get back to the real files) by mining existing data sources. But eventually, we are going to want dynamic support here. We can stub that out with FUSE (for example) but in my experience (and based upon the literature) FUSE is slow for meta-data operations, which is really all I will be doing. Building file systems is hard work, but it is something I’ve been doing for years, so that aspect doesn’t really scare me. Visualization on the other hand is an area in which I don’t have a lot of experience.
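To mock up the static version of this namespace, one could first compute which symbolic links to create and only then materialize them. The Python sketch below does the planning step only. It assumes POSIX-style paths, a hypothetical mount point `/graphfs`, an entry named `data` for the real file, and a five-minute window for the “temporal” directory; all of those names and values are illustrative.

```python
from pathlib import PurePosixPath


def plan_namespace(files, root="/graphfs", window=300):
    """Plan the links for a directory-per-file namespace.

    files maps a real path to its modification time in seconds.
    Returns a dict mapping each link path to the real path it targets.
    Simplification: files are keyed by base name, so two files with the
    same name in different directories would collide.
    """
    links = {}
    for path, mtime in files.items():
        node = PurePosixPath(root) / PurePosixPath(path).name
        links[str(node / "data")] = path  # the file itself stays reachable
        for other, other_mtime in files.items():
            if other != path and abs(other_mtime - mtime) <= window:
                link = node / "temporal" / PurePosixPath(other).name
                links[str(link)] = other
    return links
```

Each entry in the returned plan could then be created with `os.symlink(target, link_path)` after making the parent directories. The pairwise comparison is quadratic in the number of files, which is fine for a mock-up but not for a million-file volume.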
I’m certainly open to ideas…
Relationships
I recently described two file systems (QMDS and GFS) that attempted to capture additional context for files to improve their usability. At Eurosys, I argued (somewhat successfully) that a distinguishing characteristic of my proposed work is to capture relationships between files, something that goes beyond mere isolated analysis of such files.
Index servers, which are now ubiquitous on mainstream platforms, attempt to solve this problem by focusing on the specific attributes of the given file. This is useful, and indeed it seems to be consonant with the general approach of semantic file systems, which attempt to classify files based upon their semantic content. For example, this is how modern Internet search engines work – they classify a document based upon its content.
I pointed out in my recent discussion of personal information management (PIM) that there is a difference between navigation and search. Thinking about this further, I realize that this is more nuanced and reflects the way that we humans look for things: we go to where we expect something to be (navigation) and if it is not there we then go to other places where we think it might be. For example, when I’m looking for my keys, I have a list of places I look first. When I don’t find them there I begin to systematically search for them.
Thus, the natural progression for finding the item of interest is navigation and then search. Index servers are basically the search part of the equation and they do not tend to be where we start first. Instead, it is the fall-back.
As humans, we tend to create associations between events. When I can’t find my keys I start thinking about the last time I saw them (“I know I had them because I was able to get into the apartment. After I got back I went and checked my e-mail…”) These are temporal clues but they are associations between other unrelated events and the object for which we are searching.
Another observation that Margo Seltzer made in our discussions on this topic is that when we use a web search engine, we are looking for an answer, but when we are searching for something that we have we are looking for the answer. This is a subtle, but important, difference. I don’t want to find any set of keys, I want to find a specific set of keys. Internet search is notoriously unreliable in this regard; how many times have you gone back a few days later to find some interesting article, only to realize the list of results coming back from your search engine are different than they were before? This is how Internet search engines exploit the fact that you are (usually) just looking for an answer – they don’t have to give you definitive, reproducible results.
Yesterday I had an interesting conversation with Sasha Fedorova. Towards the end of it she suggested that I might do better promoting this work by focusing on solving the needs of a particular community. She suggested the software engineering community, partially because she has worked with them before and also because she could see the kinds of relationships that might be useful to that community. Further, that community is used to testing out experimental approaches that promise to improve effectiveness and productivity. In that same conversation she pointed out the Stack Overflow community as being one of those places software engineers search for answers.
This morning, on the way to the gym, I realized that the Stack Overflow community is also an example of how we organically create relationships: people ask questions and get back specific answers of varying quality. The community rates the responses and preserves the answers. This has organically created a web of connections across topics and people.
Why is this important to understand? Because the work I’m doing proposes going beyond simple semantic analysis of individual files, or even clustering them based upon specific characteristics, to establishing a set of interrelationships across files and across applications. Focusing on solving this issue for a single application is much akin to focusing on just looking at the semantic content of a single file. Moving beyond this to realize that we humans create associations in our brains means we need to find ways of capturing those associations across applications that help us better navigate the vast trove of information we accumulate.
For example, Sasha suggested that it would be interesting for the software development community if there were an association between the web pages we accessed and the code we were editing. This makes sense to me: I tend to look up documentation or explanations of specific things as I write code. If we capture this relationship across applications, we can motivate why this is a systems problem and not an application problem – operating systems provide services that are common to applications (not necessarily all applications, but something that is of broader interest than a single or a few applications). One benefit of focusing on the software engineering community is that it permits us to identify relationships of interest to the community and then mine those relationships.
Another fascinating conversation (late last week) was when I was discussing my research direction with Ghita Berrada (who is visiting us this month) and we ended up having a discussion of generalized concept analysis. I was familiar with Formal Concept Analysis (FCA) because it was used by Benjamin Martin in his work more than a decade ago (his dissertation “Formal Concept Analysis and Semantic File Systems” is something I have found delightfully insightful and I think it is time for me to read it again) but she pointed me to Temporal Concept Analysis and Relational Concept Analysis. Temporal Concept Analysis is fairly recent, having only been first formally described in 2000, and uses FCA as its basis, but it focuses on temporal events. Relational Concept Analysis is even more recent (2007) and is definitely germane to my research direction since it focuses on relationships and how to handle them within the context of FCA. Given my own focus on relationships across files, it definitely seems pertinent.
All of these are in turn based upon the mathematical model of lattices – partially ordered sets in which every pair of elements has a least upper bound and a greatest lower bound. Lattice theory has been around for quite some time and shows up in a broad range of areas, such as CRDTs, which are used in distributed systems. For example, they are used in Anna, a key-value store I read about last year (surprisingly, I didn’t write it up – I should have). It in turn pointed to earlier work on lattices in distributed systems, which I did describe previously. This has prompted me to go off and read about lattices; this definitely is challenging as my formal math skills are quite rusty, but I have been systematically working my way through the book on lattice theory I picked up. It has been interesting trying to construct an intuitive understanding of the concepts from more formal language describing them; hopefully that knowledge will prove useful.
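A tiny worked example may help build that intuition for Formal Concept Analysis. In FCA a "context" relates objects to attributes, and a "concept" is a pair (extent, intent) where the extent is exactly the set of objects sharing every attribute in the intent, and the intent is exactly the set of attributes shared by every object in the extent. The Python sketch below enumerates the concepts of a small context by brute force; it is exponential in the number of objects and is meant only as an illustration, with the file names and attributes invented for the example.

```python
from itertools import combinations


def formal_concepts(context):
    """context maps each object to its set of attributes.

    Returns the set of (extent, intent) pairs as frozensets.
    """
    objects = list(context)
    all_attributes = set().union(*context.values())
    concepts = set()
    for size in range(len(objects) + 1):
        for group in combinations(objects, size):
            # Intent: the attributes shared by every object in the group.
            intent = set(all_attributes)
            for obj in group:
                intent &= context[obj]
            # Extent: every object that has all attributes in the intent.
            extent = {obj for obj in objects if intent <= context[obj]}
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


files = {
    "a.c": {"source", "edited today"},
    "b.c": {"source"},
    "notes.txt": {"edited today"},
}
concepts = formal_concepts(files)
```

Ordering the resulting concepts by inclusion of their extents gives the concept lattice: the concept containing all three files sits at the top, and the concept containing only `a.c` (which has both attributes) sits at the bottom.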
This is turning into a long post and I still haven’t reached the point I wanted (yet).
A challenge with this work is identifying relationships that we want to be able to support. Of course, I don’t want to restrict these relationships, but rather use a sample of such relationships to motivate the work (or “Why would anyone care about my research?”). One of the challenges of doing good research is motivating that research: there are numerous questions to answer, but which questions deserve being answered?
File systems presently support two basic relationships:
- Contains – this is the directory metaphor. A “directory” is a container of other directories or files. It is the basis of the hierarchical relationship to which we have become accustomed.
- Points to – this is the function of a symbolic link, which is supported by most file systems.
It is relatively easy to focus on properties as well. In theory, the clustering of files based upon their properties is one (loose) form of relationship. For example, we routinely associate specific file names with a corresponding application (e.g., via the suffix of the file). Windows exhibits a strong relationship model here, in which applications register interest in particular types of files, based on suffix, and the operating system then uses that information to invoke the relevant application. Apple’s Mac used to use an embedded meta-data component (“resource fork”) for associating the file with the application; it still exists for compatibility but is not commonly used.
Other types of file properties include:
- Timestamp – these capture temporal properties of a file. The most common are the creation time, last update time, and last access time. Last access time is often omitted these days because constantly updating it turns out to be expensive. For example, Windows NTFS does not update last access time by default. My recollection is that they did this because they found updating this field accounted for something like a 6% performance cost in NTFS. Thus, there is a question as to how reliable these values are.
- Size – we know the size of each file.
- Name – I’ve already mentioned using the suffix, which is an aspect of the name. Is it possible to exploit this in other ways?
Then there are application relationships: what application created this file? It occurs to me that it might be useful to distinguish registered names (suffixes) from all files created by an application. For example, I don’t usually want to see the build artifacts of my software development environment. Microsoft Visual Studio handles this by allowing artifacts to be separated from source files, for example. Could we achieve something comparable within the file system by understanding these relationships? Would it be useful?
Another suggestion from Sasha was that we might want to record what music was playing when we were doing something specific because our brains may create an association across these seemingly unrelated events. This suggests another potential relationship: concurrent application execution. This is a sort of temporal relationship, but one that becomes more interesting when we consider it in a Memex style model of associations. How can I capture these relationships across different applications? Perhaps we can think of some sort of “current context” or “current activity” associated with a given application that can then be queried and added to the files as we create them. These types of dynamic relationships are certainly more intriguing.
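One way to picture the “current context” idea is a small registry that applications update as their activity changes and that is snapshotted whenever a file is created. The Python sketch below is purely hypothetical: the class, the method names, and the shape of the resulting record are all assumptions made to illustrate the concept.

```python
class ActivityContext:
    """Tracks what each source (application) is currently doing."""

    def __init__(self):
        self._active = {}  # source name -> description of current activity

    def begin(self, source, description):
        self._active[source] = description

    def end(self, source):
        self._active.pop(source, None)

    def snapshot(self):
        """Copy of the current activities, safe to store alongside a file."""
        return dict(self._active)


def tag_new_file(path, context, created_at):
    """Build the record that would accompany a newly created file."""
    return {"path": path, "created": created_at,
            "context": context.snapshot()}
```

Because the snapshot is a copy, the record keeps the music and web page that were active at creation time even after those activities end.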
What kinds of relationships can you envision being useful when you are searching for that elusive file you know that you have but you don’t know where it lives in the hierarchy?
The Ubiquitous Digital File: A Review of File Management Research
Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.
I recently stumbled across this paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.
This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.
This paper is a survey paper, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which is roughly 7.5 pages of the 32-page paper, or almost 25% of it. Since I have spent a considerable amount of time digesting much of the systems focused research as well as some of the Human Computer Interface (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature as well as identifying useful research questions, some of which I find particularly interesting.
Frameworks
One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well-understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or likely both!).
Systems
Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.
They point out that their work is relevant to “computer science” (and particularly the branch with which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have examined only files stored on servers and software development machines.”
Thus two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems really has focused on specific workloads that are not likely representative of what is useful for file management (and correspondingly, for users of file management).
One observation the authors make is surprising to me: “A preference for navigating to files is much more common than a preference for searching, even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that augmented search mechanisms likely still have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.
Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.
This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.
Where does search functionality live?
In mulling over the depths of semantic knowledge and file systems, it occurs to me that one thing which differs between the world of Unix/Linux file systems and Windows file systems is that in Unix/Linux environments, searching a directory’s contents is done in the shell (or application) while in Windows it is a service of the file system.
I admit, when I first started working on Windows file systems, I thought this was an annoying decision, since it involved quite a bit of work inside the file system related to string handling and matching. Even as I write this, I still think that it is a lot of work that really doesn’t belong in the kernel. Having said that, this distinction is one reason why a Unix/Linux file systems developer might not think of adding semantic support to a file system as something logical – after all, the purpose of the file system is to manage the storage of files and their associated meta-data, not to find things. Having experience in the Windows file systems space, I can understand why it might not be a great idea to do this in kernel mode. After all, C is not a language well-known for its strength and safety in handling strings, and the kernel is not an environment well-known for its tolerance of C runtime errors.
But I digress. The point is this: when we begin to embed semantic knowledge inside the file system, we exploit a model in which the file system is involved in the search function and this would seem to be anathema to normal file systems behavior. This is a good challenge: does this need to be done in the file system? If not, perhaps there is instead an abstraction that the file system itself must be able to provide.
Each time I tackle this problem, my general sense is that the model I want is a case in which each file has a set of attributes. Ideally, what I want is some way to quickly and efficiently find things based upon those attributes. After all, how hard could this be?
One benefit of the current search paradigm, to which users have become accustomed, is that it does not provide reproducible search results. Thus, nobody will really be surprised if they repeat a search today and get back different results than they got back yesterday.
Hence, I keep coming back to this paradigm. It also gives me the sense that there are different characteristics of such a system – there are persistent attributes, like the timestamp, and ephemeral attributes, like semantic tags.
Plenty to think about, but this idea of where to draw the line of search is an important one. In either case, though, I need to determine efficient ways of rapidly finding files based upon these attributes.
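As a starting point for thinking about "finding files based upon these attributes", here is a minimal inverted index sketch in Python. It is an in-memory toy, not a proposal for where the index should live; the class name, method names, and attribute names are assumptions, and attribute values must be hashable.

```python
from collections import defaultdict


class AttributeIndex:
    """Maps each (attribute, value) pair to the files that carry it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, file_id, **attributes):
        for key, value in attributes.items():
            self._index[(key, value)].add(file_id)

    def remove(self, file_id, **attributes):
        """Drop attributes from a file, e.g. an ephemeral tag that expired."""
        for key, value in attributes.items():
            self._index[(key, value)].discard(file_id)

    def find(self, **attributes):
        """Files that carry every requested attribute value."""
        matches = [self._index.get(pair, set())
                   for pair in attributes.items()]
        return set.intersection(*matches) if matches else set()
```

Persistent attributes such as timestamps would be added once, while ephemeral attributes such as semantic tags would come and go through `add` and `remove`; lookups cost a set intersection rather than a scan of every file.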