Home » Research Ideas » Semantic File Systems

Category Archives: Semantic File Systems

June 2025
S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30

What is the Optimal Location for Storing Metadata

November 6, 2022

The past month has included both a very interesting talk from someone at a major storage vendor and an in-depth discussion about my work and how it might be applicable to an issue that confronts the Metaverse community. I haven’t been at the keyboard much (at least not for my research) but I have been mulling this over as I have worked to try and explain these insights. Each iteration helps me refine my mental model by considering what else I have learned. Fortunately, this latest round doesn’t impact the work that I have done, but it has provided me with a model that I think could be useful in explaining this work to others.

I have previously talked about a type of metadata that I call activity context. Of course, there is quite a lot of metadata that is involved in managing storage and I have been using a model in which the metadata I am collecting is not at the point of storage but rather at the point of analysis. In my case, the point of analysis is on (or near) my local device ecosystem. As I learned more about the needs of the emerging metaverse field (by speaking with my friend Royal O’Brien, who is the general manager for the Open 3D Foundation, which is part of the Linux Foundation) and combined some of what I learned there with insights I gained from a recent talk given to my research group I observed what I think are some useful insights:

Storage vendors have no mechanism for capturing all the kinds of activity data that I envision using as the basis for activity context.
Some high-performance data consumers need to maintain replicated data and use metadata about that data to make critical decisions.
Metadata needs to be close to where it will be consumed.
Metadata needs to be produced where the information is available and optimally where it is least expensive to do so.

That isn’t a long list, but it is one that requires a bit more unpacking. So I’m going to dive deeper, step by step. This probably isn’t the right order, but I will start here and worry about (re)-organizing it later.

Metadata Production

I had not really considered the depth of the question about where to produce the meta-data until I started mulling over the myriad of questions that have arisen recently. The cost of producing metadata can be a critical factor. Agents that extract semantic information about the data (e.g., its content) need to be close to the data. However, it is important to note that is not the same as “the final location of the data” but rather “a current location of the data.” Yet, even that isn’t quite right: metadata might be extracted from something other than the data, like something from the running system, or even an external source. For example, the activity data that I have been focused on collecting (see System Activity) largely arises on the system where the data itself is accessed. The metaverse model is one where the user has considerable insight (ah, but a bit more on this later) and since I’ve always envisioned an extensible metadata management system it makes sense to permit a specialized application to contribute to the overall body of metadata.

Thus, the insight here is that it makes sense to generate metadata at the “lowest cost” point to do so. For example, the activity data on my local machine can’t be collected by a cloud storage engine. It could be collected by an agent on the local machine and sent to the cloud storage engine, but that runs into a separate cost that I’ll touch on when I describe where we should be storing metadata. For example, extracting semantic content makes sense to do at the point of production and again at the point of storage. Activity data, which is related to “what else is happening” can’t be extracted at the point of storage. Even causal data (e.g., the kinds of activity information we convert into provenance data to represent causal relationships) can’t easily be replicated at the storage engine. There’s another subtle point here to consider: if I’m willing to pay the cost of producing metadata it seems intuitively obvious that it is probably worth storing the results of that metadata. For example, I find that I often end up doing repetitive searches – this past week, working on a project completely unrelated to my research, I found myself repeatedly doing searches over the same data set using the same or similar terms. For example, if I want to find files that have both the term “customer” and “order” in them and then repeat that with “customer” and “device_id” I have to do complex compound searches that can take 5-10 minutes to produce. I suspect this can be made more efficient (though I don’t know if this is really a useful test case – I just keep wondering how I could support this sort of functionality, which would enable us to figure out if it is useful.)

So, back to producing metadata. Another cost to consider is the cost to fetch the data. For example, if I want to compute the checksum of a file, it is probably most efficient to do so when it is in the memory of the original device creating it or possibly on the device where it is stored (e.g., a remote storage server.) Even if it is the same cost I need to keep in mind that I will be using devices that don’t compute the same checksum. That lack of service uniformity helps me better understand the actual cost: if the storage device does not support the generation of the metadata that I want then my cost rises dramatically because now I have to pull the data back from the storage server so I can compute the checksum I want to use. Thus, I think what drives this question is where we store that metadata, which is leading to my next rambling thought process in the next section.

In the case where the metadata is being provided externally, I probably don’t care where it is produced – that’s their problem. So, for the metaverse data storage challenge I really need to focus more on where I am storing the metadata rather than where it is generated (at least for now.)

Medata Storage

One question I’ve been handwaving is the “where do you store the metadata?” I started thinking about this because the real answer is ugly. Some of that metadata will be stored on the underlying storage, e.g., a file system is going to store timestamps and length information in some form regardless of specific issues like time epochs. However, as I was mulling over some of the issues involved in object management needs for metaverse platforms (ugh, a tongue-twister with the “metaverse” buzzword) I realized that one of the challenges described to me (namely the cost associated with fetching data) is really important to me as well:

To be useful, this metadata needs to be present everywhere it is analyzed – it is impractical for us to be fetching data across the network if we want this to have decent performance. I can certainly handwave some of this away (“oh, we’ll just use eventually consistent replication of the metadata”) but I don’t expect that’s terribly realistic to add to a prototype system. What probably does make sense is to think that this will be stored on a system that is “close to” the sources that generate the metadata. It might be possible to construct a cloud-based metadata service, but that case has additional considerations that I’m mulling over (and plan on capturing in a future blog post – this one is already too long!) Thus, I suspect that this is a restricted implementation of the replication problem.
Metadata does not need to be close to the data. In fact, one of the interesting advantages of having the metadata close to where it is needed is that it helps overcome a major challenge in using distributed storage: the farther away the data storage is from the data consumer, the higher the cost of fetching that data. In turn, the benefits of having more metadata is that it helps improve the efficiency of fetching data, since fetching data that we don’t need is wasteful. In other words, a cost benefit associated with having more metadata is that we can work to minimize unnecessary data fetching. Indeed, this could be a solid metric for determining the efficiency of metadata and search algorithms that use the metadata: the “false fetch rate.” The benefits of this are definitely related to the cost of retrieving data. Imagine (for example) that you are looking through data that is expensive to retrieve, such as Azure Cold Blob Storage or Amazon Glacial Storage. The reason that people use these slow storage services is that they are extremely cost efficient: this is data that is unlikely to be needed. While this is an extreme example, it also makes it easier to understand why additional metadata is broadly beneficial, since any fetch of data from a remote system is that is not useful is a complete waste of resources. Again, my inspiration here was the discussion with Royal about multiple different instantiations of the same object that appear in the metaverse. I will touch on this when I get into that metaverse conversation. For now, I note that these instantiations of a single digital object might be stored in different locations. The choice of a specific instance of this is typically bounded by several costs involved, including the fetch cost (latency + bandwidth) and any transformation costs (e.g., CPU cost.) This becomes quite interesting in mobile networks where the network could impose surge pricing as well and there are capacity limitations combined with the hard requirements that these objects need to be available for use quickly (another aspect of cost.)

My sense is there is probably more to say here, but I captured some key ideas and I will consider how to build on this in the future.

Metaverse Data Needs

That conversation with Royal was quite interesting. I’ve known him for more than a decade and some of what I learned from him about the specialized needs of the game industry led me to question things that I learned from decades of building storage systems. That background in game development has positioned him to point out that many of the challenges in metaverse construction have already been addressed in the game development area. One interesting aspect of this is in the world of “asset management.” An asset in a game is pretty much anything that the game uses to create the game world. Similarly, a metaverse must also combine assets to permit 3D scaling as it renders the world for each participant of that world. He explained to me by way of example, that one type of graphical object is often computed at different resolutions. While it is possible for our devices to scale these, the size of the objects and the computational cost of scaling is high. In addition, the cost of fetching these objects can be high as well; he was telling me that you might need 200 objects in order to render the current state of the world for an individual user. If their average size is 60MB it becomes easy to see how this is not terribly practical. In fact, what is usually required are a few of these very high-resolution graphical objects and lower resolution versions of the others. For example, objects that are “far away in the distance” need not have the same resolution. While he didn’t point it out, I know that I have seen games where sometimes objects have low resolution and are later repainted with higher resolution images. I am now wondering if I saw this exact type of behavior already being practiced.

Let’s combine this with the need to distribute these objects broadly and to realize there is a high degree of locality involved. Metaverse participants interacting with each other in a 5G or 6G network are likely to be accessing many of the same objects. Thus, we are likely to see a high degree of correlation across edge nodes within the mobile network. Similarly, it moves to a very distributed storage model, where data objects are not necessarily being retrieved from a central storage server but rather edge storage servers or even peer clients. One benefit of using strong checksums is that it allows easy to verify replication in untrusted networks – something like bittorrent or even IPFS do with their own checksums. As long as the checksum comes from a trusted source, the data retrieved can be verified.

In this case the metadata would correspond to something very different than I’d been considering:

An identifier of the object itself
A list of one or more specific instances of that objects with a set of properties
A list of where each of these instances might be stored (I’m choosing to use an optimistic list here because the reality is sources will appear and disappear.)

Independent of this would be information about the constraints involved: the deadline required for receiving the data to be timely, the cost for retrieving the various versions, etc. With this information both the edge and end devices can make decisions: which versions to fetch and from where as well as placement, caching, and pre-fetching decisions. All of these are challenging and none of them are new so I’m not going to dive in further. What is new is the idea that we could embed the necessary metadata within a more general-purpose metadata management system overlaying disparate storage systems. This is a fairly specialized need, but it is also one Royal observed needs to be solved.

Oh, one final number that sticks out in my mind: Royal told me that a single asset could consist of around 200 different versions, including different resolutions and different formats required by the various devices. I was quite surprised at this, but it also helped me understand the magnitude of the problem.

While I have considered versioning as a desirable feature, I had never considered parallel versions quite like this. Having these kinds of conversations helps me better understand new perspectives and broaden my own thinking.

I left that conversation knowing that I had just barely started to wrap my head around the specific needs of this area. I capture those thoughts here in hopes I can foster further thought about them, including more conversations with others.

Storage Vendors

A couple weeks ago we had a guest speaker from a storage vendor talking about his thoughts along the future for his company and their products. There were specific aspects of that talk that really stood out to me:

Much of what he talked about was inward focused. In other words, it was about the need for better semantic understanding. I realized that the ideas on which I’m working – of using extrinsic information to find relationships between files was not even on his horizon, yet could be very beneficial to him – or to any large storage vendor.
He acknowledged many of the challenges that are arising as the sheer volume of storage continues to grow. Indeed, each time I think about this I remember that for all the emphasis on fast access storage (e.g., NVRAM and SSDs) the slower storage tiers continue to expand as well: hard disks now play more of an archival role. Microsoft Research’s Holographic Storage Device, for example, offers a potential higher capacity device for data center use. Libraries of recordable optical storage or even high capacity linear tape also exist and are used to keep vast amounts of data.
During that time I’d been also thinking about how to protect sensitive information from being exploited or mined. In other words, as a user of these services, how can I store data and/or metadata with them that doesn’t divulge information. After the talk I realized that the approach I’d been considering (basically providing labels the meaning of which requires a separate decoder ring) could be quite useful to a storage vendor: such sanitized information could still be used to better understand the relationships – ML driven pattern recognition (e.g., clustering) without requiring that the storage vendor understand what those patterns mean. Even providing that information to the end user could minimize the amount of extra data being fetched which in turn would improve the use of their own storage products. Again, I don’t think this is fully fleshed out, but it does seem to provide some argument for storage vendors to consider supporting enhanced metadata services.

I admit, I like the idea of enabling storage vendors to provide optimization services that do not require they understand the innards of the data itself. This would allow customers with highly sensitive data to store it in a public cloud service (for example) in fully encrypted form and still provide indexing information for it. The “secret decoder rings” can be maintained by the data owner yet the storage vendor can provide useful value-added services at enterprise scale. Why? Because, as I noted earlier, the right place to store metadata is as close as possible to the place where it is consumed. At enterprise scale, that would logically be someplace that is accessible throughout the enterprise.

At this point I realized that our propensity to store the metadata with the data really does not make sense when we think of multiple storage silos – it’s the wrong location. Separating the metadata service, placing it close to where the metadata is being absorbed, and using strategically located agents for generating the various types of metadata, including activity context and semantic information, all make sense because the owner of that data is really “closest” to where that metadata is used. A “file system” that maintains no metadata is really little more than a key-value store, as the metadata server can be maintained separately. Of course, that potentially creates other issues (e.g., space reuse.) I don’t think I need to solve such issues because in the end that consideration is not important at this point in my own research.

Challenges of Capturing System Activity

February 16, 2022

A key aspect of the work I am doing for Indaleko is to “capture system activity” so that it can be used to form “activity contexts” that can then be used to inform the process of finding relevant information. As part of that, I have been working through the work of Daniela Vianna. While I have high-level descriptions of the information she collected and used, I need to reconstruct this. She collects data from a variety of sources. The most common source of information comes from web APIs to services such as Google and Facebook. In addition, she also uses file system activity information.

Since my background is file systems, I decided to start on the file system activity front first. Given that I’ve been working with Windows for three decades now, I decided to leverage my understanding of Windows file systems to collect such information. One nice feature of the NTFS file system on Windows is its support for a form of activity log known as the “USN Journal.” Of course, one of my handicaps is that I am used to using the native operating system API, not the libraries that are implemented on top of it. This is because when building file systems on Windows I have always been interested in testing the full kernel file systems interface. While there are a few specific features that cannot be exercised with just applications, there are still a number of interfaces that cannot be tested using the typical Win32 API that can be tested using the native API. In recent years the number of features that have been hidden from the Win32 API has continued to decrease, which has diminished the need to use the native API. I just haven’t had any strong need to learn the Win32 API – why start now?

I decided the model I want to use is a service that pulls data from the USN journal and converts it into a format suitable for storing in a MongoDB database. I decided to go with Mongo because that is what Vianna used for her work. The choice at this point is somewhat arbitrary but MongoDB makes sense because it tends to work well with semi-structured data, which is what I will be handling.

Similarly, I decided that I’d write my service for pulling USN Journal data from the NTFS file system(s) in C# since I have written some C# in the past, it makes doing some of the higher level tasks I have much easier, and is well-supported on Windows. I have made my repository public though I may restructure and/or rename it at some point (currently I call it CSharpToNativeTest because I was trying to invoke the native API as unmanaged code from C#). The most common approach to this is to utilize a specific mechanism (the “PInvoke” mechanism) but after a bit of trial-and-error I decided I wanted something that would be easier for me to debug, so instead of pulling the native routine directly from ntdll.dll I load it from my own DLL (written in C) and that then invokes the real native call. This allows me to see how data is being marshaled and delivered to the C language wrapper. I also tried to make the native API “more C# friendly.” I am sure it could be more efficient, but I wanted to support a model that I could extend and hopefully it will be easier to make it more efficient should that prove necessary.

One thing I did was to script the conversion of all the status values in ntstatus.h into a big C# enum type. The benefit of this is that when debugging I can automatically see the mnemonic name of the status code as well as its numeric value. I then decided to provide the layer needed to map the various volume names used on Windows around, with device names, device IDs, and symbolic links (drive letters) that can be mapped. While I have not yet added it, I wrote things so that it should be fairly straight-forward to add a background thread which wakes up when devices arrive or disappear. As I have noted before “naming is hard.” This is just one more example of the flexibility and challenges with aliasing and naming.

Finally, I turned my attention to the USN journal. I found some packages for decoding USN journal entries; most were written to parse the data from the drive, while a few managed dynamic access. Since I want this to be a service that monitors the USN journal and keeps adding information into the database, I decided to write C# code to use the API for retrieving that information. At this point, what I have is the ability to scan all the volumes on the machine – even if they do not have drive letters – and query them to see if they support a USN journal. I do this properly – I query the file system attributes (using the NtQueryVolumeInformationFile native API) and check if the bit showing USN journal support is marked. I do not use the file system name, an approach I’ve always considered to be a hack, especially since I have been in the habit of writing file systems that support NTFS features, including named data streams, extended attributes, and object IDs. In fact, the ReFS file system on Windows also supports USN journals, so I’m not just being my usual pedantic developer self in this instance.

At this point, I am able to identify volumes that support USN journals, open them and find out if USN is turned on (it is by default on the system volume, which is almost always the “C:” drive, though I enjoy watching things break when I configure a system to use some other drive letter.) I then extract the information and convert it to in-memory records. At the moment I just have it wait a few seconds and pull the newest records, but my plan is to evolve this into a service that I can run and it can keep pulling data and pushing it into my MongoDB instance.

At this point, I realized I do not really know that much about MongoDB so I have decided to start learning a bit more about it. Of course, I don’t want to be a MongoDB expert, so I also have been looking more carefully at Daniela Vianna’s work, trying to figure out what her data might have looked like and think about how I’m going to merge what she did into what I am doing. This is actually exciting because it means I’m starting to think of what we can do with this additional information.

This afternoon I had a great conversation with one of my PhD supervisors about this and she was making a couple of suggestions about ways to consume this data. That she was suggesting things I’d also added to my list was encouraging. What are we thinking:

We can consider using “learned index structures” as we begin to build up data sets.
We can use techniques such as Google BERT to facilitate dealing with the API data that Vianna’s work used. I pointed out that the challenges of APIs that Vianna pointed out are similar to languages: they have meaning and those meanings can be expressed in multiple ways.
The need for being able to efficiently find things is growing rapidly. She was explaining some work that indicates our rate of data growth is outstripping our silicon capabilities. In other words, there is a point at which “brute force search” becomes impractical. I liked this because it suggests what we are seeing with our own personal data is a leading indicator of the larger problem. This idea of storing the meta-data independent of the data is a natural one in a world where the raw information is too abundant for us to just go looking for an item of interest.

So, my work continues, mostly mundane and boring, but there are some useful observations even at this early stage. Now to figure out what I want the data in my database to look like and start storing information there. Then I can go figure out what I did right, what I did wrong, and how to improve things.

Aside: one interesting aspect of the BERT work was their discussion of “transducers.” This reminded me of Gifford’s Semantic File System work, where he used transducers to suck out semantic information from existing files.

Laundry Baskets: The New File System Namespace Model

September 28, 2021

A large pile of laundry in a laundry basket, with a cat sleeping on the top. — The “Laundry Basket” model for storage.

While I’ve been quiet about what I’ve been doing research-wise, I have been making forward progress. Only recently have ideas been converging towards a concrete thesis and the corresponding research questions that I need to explore as part of verifying my thesis.

I received an interesting article today showing that my research is far more relevant than I’d considered: “FILE NOT FOUND“. The article describes that the predominant organizational scheme for “Gen Z” students is the “Laundry Basket” in which all of their files are placed. This is coming as a surprise to people who have been trained in the ways of the hierarchical folder metaphor.

While going through older work, I have found it is intriguing that early researchers did not see the hierarchical design as being the pinnacle of design; rather they saw it as a stop-gap measure on the way to richer models. Researchers have explored richer models. Jeff Mogul, now at Google Research, did his PhD thesis around various ideas about improving file organization. Eno Thereska, now at Amazon, wrote an intriguing paper while at Microsoft Research entitled “Beyond file systems: understanding the nature of places where people store their data” in which he and his team pointed out that cloud storage was creating a tension between file systems and cloud storage. The article from the Verge that prompted me to write this post logically makes sense in the context of what Thereska was saying back in 2014.

The challenge is to figure out what comes instead. Two summers ago I was fortunate enough to have a very talented young intern working with me for a couple months and during that time one of the interesting things he built was a tool that viewed files as a graph rather than a tree. The focus was always at the center, but then it would be surrounded by related files. Pick one of those files and it became the central focus, with a breadcrumb trail showing how you got there but also showing other related files.

The relationships we used were fairly simple and extracted from existing file meta-data. What was actually quite fascinating about it though was that we constructed it to tie two disjoint storage locations (his local laptop and his Google Drive) together into a single namespace. It was really an electrifying demonstration and I have been working to figure out how to enable that more fully – what we had was a mock-up, with static information, but the visualization aspects of “navigating” through files was quite powerful.

I have been writing my thesis proposal, and as part of that I’ve been working through and identifying key work that has already been done. Of course my goal is to build on top of this prior work and while I have identified ways of doing this, I also see that to be truly effective it should use as much of the prior work as possible. The idea of not having directories is a surprisingly powerful one. What I hadn’t heard previously was the idea of considering it to be a “laundry basket” yet the metaphor is quite apt. Thus, the question is how to enable building tools to find the specific thing you want from the basket as quickly as possible.

For example, the author of the Verge article observed: “More broadly, directory structure connotes physical placement — the idea that a file stored on a computer is located somewhere on that computer, in a specific and discrete location.” Here is what I recently wrote in an early draft of my thesis proposal: “This work proposes to develop a model to separate naming from location, which enables the construction of dynamic cross-silo human usable name-spaces and show how that model extends the utility of computer storage to better meet the needs of human users.”

Naming tied to location is broken, at least for human users. Oh, sure, we need to keep track of where something is stored to actually retrieve the contents, but there is absolutely no reason that we need to embed that within the name we use to find that file. One reason for this is that we often choose the location due to external factors. For example, we might use cloud storage for sharing specific content with others. People that work with large data sets often use storage locations that are tuned to the needs of that particular data set. There is, however, no reason why you should store the Excel spreadsheet or Python notebook that you used to analyze that data in the same location. Right now, with hierarchical names, you need to do so in order to put them into the “right directory” with each other.

That’s just broken.

However, it’s also broken to expect human users to do the grunt work here. The reason Gen Z is using a “laundry basket” is because it doesn’t require any effort on their part to put something into that basket. The work then becomes when they need to find a particular item.

This isn’t a new idea. Vannevar Bush described this idea in 1945:

“Consider a future device for individual use, which is a sort of
mechanized private file and library. It needs a name, and, to coin
one at random, ”memex” will do. A memex is a device in which
an individual stores all his books, records, and communications,
and which is mechanized so that it may be consulted with exceeding
speed and flexibility. It is an enlarged intimate supplement to his
memory.”

He also did a good job of explaining why indexing (the basis of hierarchical file systems) was broken:

“Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

“The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental
pictures, is awe-inspiring beyond all else in nature.”

We knew it was broken in 1945. What we’ve been doing since then is using what we’ve been given and making it work as best we can. We knew it was broken. Seltzer wrote “Hierarchical File Systems Are Dead” back in 2009. Yet, that’s what computers still serve up as our primary interface.

The question then is what the right primary interface is. Of course, while I find that interesting I work with computer systems and I am equally concerned about how we can build in better support, using the vast amount of data that we have in modern computer systems, to construct better tools for navigating through the laundry basket to find the correct thing.

How I think that should be done will have to wait for another post, since that’s the point of my thesis proposal.

Where has the time gone?

August 5, 2020

It’s been more than a year since I last posted; it’s not that I haven’t been busy, but rather that I’ve been trying to do too many things and have been (more slowly than I’d like) cutting back on some of my activities.

Still, I miss using this as a (one way) discussion about my own work. In the past year I’ve managed to publish one new (short) paper, though the amount of work that I put into it was substantial (it was just published in Computer Architecture Letters). This short article (letter) journal normally provides at most one revise and resubmit opportunity, but they gave me two such opportunities, then accepted the paper, albeit begrudgingly over the objections of Reviewer # 2 (who agreed to accept it, but didn’t change their comments).

Despite the lack of clear publications to demonstrate forward progress, I’ve been working on a couple of projects to push them along. Both were presented, in some form, at Eurosys as posters.

Since I got back from a three month stint at Microsoft Research (in the UK) I’ve been working on one of those, evolving the idea of kernel bypasses and really analyzing why we keep doing these things; this time through the lens of building user mode file systems. I really should write more about it, since that’s on the drawing board for submission this fall.

The second idea is one that stemmed from my attendance at SOSP 2019. There were three papers that spoke directly to file systems:

Each of these had important insights into the crossover between file systems and persistent memory. One of the struggles I had with that short paper was explaining to people “why file systems are necessary for using persistent memory”. I was still able to capture some of what I’d learned, but a fair bit of it was sacrificed to adding background information.

One key observation was around the size of memory pages and their impact on performance; it convinced me that we’d benefit from using ever larger page sizes for PMEM. Some of this is because persistent memory is, well, persistent and thus we don’t need to “load the contents from storage”. Instead, it is storage. So, we’re off testing out some ideas in this area to see if we can contribute some additional insight.

The other area – the one that I have been ignoring too long – is the thesis of this PhD work in the first place. Part of the challenge is to reduce the problem down to something that is tractable and can be finished in a reasonable amount of time.

One of the questions (and the one I wanted to explore when I started writing this) is a rather famous article from 1945 entitled As We May Think. Vannevar Bush described something quite understandable, yet we have not achieve this, though we have been trying – one could argue that hypertext stems from these ideas, but I would argue that hypertext links are a pale imitation of the rich assistive model Bush lays out when he describes the Memex.

Thus, to the question, which I will reserve for another day: why have we not achieved this yet? What prevents us from having this, or something better, and how can I move us towards this goal?

I suspect, but am not certain, that one culprit may be the fact we decided to stick with an existing and well-understood model of organization:

Maybe the model is wrong when the data doesn’t fit?

360° Semantic File System: Augmented Directory Navigation for Nonhierarchical Retrieval of Files

July 18, 2019

360° Semantic File System: Augmented Directory Navigation for Nonhierarchical Retrieval of Files, Syed Rahman Mashwani and Shah Kusro, in IEEE Access, January 29, 2019.

This paper makes some interesting observations that resonate with my own research observations, though I will end up arguing (in a future blog post) that they don’t go far enough. But they do a good job of laying out the problem and why some solutions do not work. One of the common themes I have heard when discussing my own work is an insistence there really isn’t a problem, though usually a longer conversation ends up with us agreeing things could be done better.

The abstract is a bit long, but clearly describes the essence of the paper:

The organization of files in any desktop computer has been an issue since their inception. The file systems that are available today organize files in a strict hierarchy that facilitates their retrieval either through navigation, clicking directories and sub-directories, in a tree-like structure or by searching (which allows for finding of the desired files using a search tool). Research studies show that the users rarely (4% – 15%) use the latter approach, thus leaving navigation as the main mechanism for retrieving files.
However, navigation does not allow a user to retrieve files nonhierarchically, which makes it limited in terms of time, human effort, and cognitive overload. To mitigate this issue, several semantic file systems (SFSs) have been periodically proposed that have made the nonhierarchical navigation of files possible by exploiting some basic semantics but no more than that. None of these systems consider aspects such as time, location, file movement, content similarity, and territory together with learning from user file retrieval behaviors in identifying the desired file and accessing it in less time and with minimum human and cognitive efforts.
Moreover, most of the available SFSs replace the existing le system metaphor, which is normally not acceptable to users. To mitigate these issues, this research paper proposes 360 SFS that exploits the SFS ontology to capture all the possible relevant file metadata and learns from user browsing behaviors to semantically retrieve the desired files both easily and timely. Based on user studies, the evaluation results show that the proposed 360 SFS outperforms the existing traditional directory navigation and recently open files.
Paper Abstract

Of course, the problem existed before the appearance of the desktop computer: the original UNIX contained the permuted index server, which suggests to me that even in 1973 people were struggling to find things. What I do find interesting is the observation that people really do not like search – I recently described this insightful (to me at least) observation. Here it is once again, complete with references (in the paper) to prior work demonstrating this point. Indeed, it suggests to me one alternative explanation as to why the Google Desktop Search project was ultimately cancelled – not because it was “no longer necessary” but rather because it wasn’t useful to most computer users.

Another thing that I have observed, repeatedly, when working with students, is there seems to be a natural aversion to searching for answers. Thus, students will post on class forums (such as Piazza, with which I am most familiary) asking questions, even if the question has already been asked and answered. Searching for the answer does not seem to come naturally to such students. Indeed, I often find myself using search engines to find the answer and giving the results back to the original question. I have wondered about this “laziness” in the past and with this new insight I wonder if it is just because people prefer to navigate – and being told where to go is certainly one form of navigation.

Interestingly, like the Graph File System paper I described previously, these authors also argue that we cannot abandon the hierarchical view of files. I am not convinced of this, but I can understand the appeal of starting from it as a basic premise. We have been doing some work recently on a graph visualization model for the file system, more as a prototype, but it is surprisingly functional and encouraging us to look at alternative visualizations of file system data for navigation. In other words, thinking of the problem as a search problem ultimately seems to be the wrong path – yet that is the point of things like semantic file systems, to improve search.

The paper has an extensive review of prior work, much of which I’ve also previous described, though there were a few systems I had not previously seen. Table 1 of the paper has a comparison of features across the various file systems. Thus, the authors distinguish themselves from prior by focusing on providing enhanced functionality, using auxiliary directories, in which they display related content. They focus on:

Temporal characteristics – they focus on when files are being used, not merely frequency. This is an idea we’ve been exploring as well.
Geographical location – this is intriguing; identifying where the user was when they accessed a given file.
File movement – when files are reorganized and moved around.
File access patterns – they cluster files as related based upon the temporal proximity of their access; another idea we’ve been exploring. I found it insightful they describe this as a “relationship” though they do not explore a broader range of relationships.
Content similarity – files that are identical or substantially similar can be associated together; this is another technique that we’ve been actively investigating.
Manual tagging

They describe the file system they implement, which is essentially a layered file system in which they add two virtual directories: NOW and TAGS. They describe the interface for adding this information as well, which I found cumbersome, but it is in keeping with their goal of not deviating from the existing hierarchical interface. They do also permit the creation of custom virtual directories as well, though that is only briefly mentioned in the paper.

One of the problems they highlight, which resonated with me because we’ve been discussing the same problem, is how much information to display to users – in essence, when confronted with too many options, users quickly become overwhelmed.

Their evaluation focuses on the amount of time it took users to locate their files when using their enhanced file systems model and they lay out a case for the fact their system works well for their study group. One limitation of their study group is that it is based upon an experienced computer user group, but this is reflective of their environment.

One interesting comment was that while they used Linux for their evaluation, Windows would have been a better platform because of its broader usage within their organization. I have wondered how much the use of Linux tends to create a bias in an evaluation of this type, since most people are using Windows or Apple computers. Would the results be similar?

The authors do point to their open source implementation of their file system. I have not yet evaluated it, but it is definitely something on my (all too long) list of things to do.

Visualization

May 30, 2019

Last post I discussed relationships. But relationships really are not enough. Another key to this puzzle is visualization. In other words, how do we present the information to users so that it is useful.

But first, let me step back and point to a larger problem: information overload. If we present users with a list of 100,000 options, they won’t be able to necessarily find what they’re actually seeking. For example, one of the challenges of using an Internet search engine is that they can return millions of “answers” to my basic query. In fact, I just typed in “What is the meaning of life?” to a search engine and it responded back with 1.1 billion possible answers, all in under a second. They are a marvel at backing up the dump truck and offering to inundate me with answers. When is the last time you went through even a handful of these, let alone a large percentage of them?

I suspect that if I ask the HCI folks they will be able to tell me what works for ordinary mortals, but I assure you that computers are capable of returning more information than one can possibly ever process – many years ago I received a bug report about a file system directory that contained over 700,000 files and did not display with the file system kit I’d constructed and the company sold. I’d known I made decisions about resources when I did it; but we lifted that restriction and supported much larger directories than that. I’m quite sure that no mere human would look at such a directory in any meaningful sense. Maybe it was a collection of log files, in which case the names likely embed semantic information about the files themselves. Indeed, in Burrito the authors noted that scientists often embed the schema of their data within the file names. I know that I have done the same and when I’m looking for specific data I am often scripting code to sift through the pile of data to find the subset that is useful to me. The point remains the same: huge listings of files within a directory don’t work for humans.

One potential area for considering navigation is faceted search, a technique for making vast quantities of data searchable. Indeed, this fits well with the graph file system idea because what we expect to find in our graphs are clusters of related files. The graph is likely to be a sparse graph because most files are unlikely to share common features with one another. Thus, it suggests that at least one model for this data visualization problem is going to be rolling up data into these clusters, with an iterative approach to breaking it down and displaying it further. Of course, we might be able to do that within the confines of the existing hierarchical structure, which would be great for retrofitting this into the vast array of existing applications; still, my hope is that we can also provide novel new variations (or rather someone more clever than I am at these things will do so). The challenge is to build something that enables this means I need to have some level of understanding as to what kinds of information are needed to do so.

For example, I was wondering about first order approximations – those that mimic an existing hierarchical file system. Such a file system would present one or more views of the data. Maybe we have a Time view, and the time view then shows you all the files in time order. But if you have 1,000,000 files, that is going to be a mighty big list. One option might be to divide it up into ever smaller chunks of time. Eventually, we’d get to a point where you could see a few dozen files in some sort of time order. Of course, strict sharding of time might not make sense either: why should two files that are separated in time by small intervals end up in separate locations.

Time Slider

One possible option would be to consider a time slider that controls the view. This is something that we can find in other temporal data tools. Thus, creating a time slider might help make visualization easier to perform – in some ways this is similar to the Windows timeline feature that has recently been added into Windows 10. I suspect the file system doesn’t do much to facilitate this. From my own use of this feature it has some interesting limitations, not the least of which is that if you want full functionality they want to export data “to the cloud” for further analysis. That sounds like whatever they are doing is data intensive, which in turn suggests to me that the file system is not doing anything to facilitate this. If the answer really is “sorry, but you have to spend lots of computational cycles to mine this data” then my research is unlikely to be fruitful. I push forward because I don’t think this really is the case – the database community has done marvellous things that permit a vast treasure of relationships be mined.

If we combine this with a faceted search area, I can envision a filtered timeline model – in essence, this seems to be what the new Windows feature is doing, albeit by filtering the things they have deemed to be of interest. I suspect they will extend this capability over time, but a time ordered view is just one of the possible relationships I consider to be important for consideration. I don’t know what the relationships will be, but I do think that having a pre-defined list of those relationships will be self-limiting. Perhaps it will be enough. I proceed on the basis that I expect it will not be enough.

One possible visualization I’ve been considering – and part of the motivation for me deciding I need to start building a file system – is that this sort of namespace might be achievable on our existing hierarchical model if I just add “an extra level of indirection”. Suppose we construct a file system that, instead of having the existing file name has a directory of the same name. In turn, that directory then contains relationships associated with other objects. One of those other objects could be the actual file (so we can still access it), but we could also have a “temporal” directory that would then display a list of files that were created, modified, or accessed around the same timeframe. We could store information about what web sites were visited around times that the file was in use. We could keep track of the music that was playing around the same time. We could point to files that were similar to that specific file. Such a visualization could be easily achieved and compatible with existing tools. Rather than being an endpoint, this is more an intermediate staging area – a way of mocking up the concepts and ideas, and to see what works for people and what does not work.

Thus the desire for a file system. I think we can construct a static namespace by using an existing file system and symbolic links (so you can eventually get back to the real files) by mining existing data sources. But eventually, we are going to want dynamic support here. We can stub that out with FUSE (for example) but in my experience (and based upon the literature) FUSE is slow for meta-data operations, which is really all I will be doing. Building file systems is hard work, but it is something I’ve been doing for years, so that aspect doesn’t really scare me. Visualization on the other hand is an area in which I don’t have a lot of experience.

I’m certainly open to ideas…

Relationships

May 28, 2019

I recently described two file systems (QMDS and GFS) that attempted to capture additional context for files to improve their usability. At Eurosys, I argued (somewhat successfully) that a distinguishing characteristic of my proposed work is to capture relationships between files, something that goes beyond mere isolated analysis of such files.

Index servers, which are now ubiquitous on mainstream platforms, attempt to solve this problem by focusing on the specific attributes of the given file. This is useful, and indeed it seems to be consonant with the general approach of semantic file systems, which attempt to classify files based upon their semantic content. For example, this is how modern Internet search engines work – they classify a document based upon its content.

I pointed out in my recent discussion of personal information management (PIM) that there is a difference between navigation and search. Thinking about this further, I realize that this is more nuanced and reflects the way that we humans look for things: we go to where we expect something to be (navigation) and if it is not there we then go to other places where we think it might be. For example, when I’m looking for my keys, I have a list of places I look first. When I don’t find them there I begin to systematically search for them.

Thus, the natural progression for finding the item of interest is navigation and then search. Index servers are basically the search part of the equation and they do not tend to be where we start first. Instead, it is the fall-back.

As humans, we tend to create associations between events. When I can’t find my keys I start thinking about the last time I saw them (“I know I had them because I was able to get into the apartment. After I got back I went and checked my e-mail…”) These are temporal clues but they are associations between other unrelated events and the object for which we are searching.

Another observation that Margo Seltzer made in our discussions on this topic is that when we use a web search engine, we are looking for an answer, but when we are searching for something that we have we are looking for the answer. This is a subtle, but important, difference. I don’t want to find any set of keys, I want to find a specific set of keys. Internet search is notoriously unreliable in this regard; how many times have you gone back a few days later to find some interesting article, only to realize the list of results coming back from your search engine are different than they were before? This is how Internet search engines exploit the fact that you are (usually) just looking for an answer – they don’t have to give you definitive, reproducible results.

Yesterday I had an interesting conversation with Sasha Fedorova and towards the end of it she suggested that I might do better promoting this work by focusing on solving the needs of a particular community and she suggested the software engineering community, partially because she has worked with them before and also because she could see the kinds of relationships that might be useful to that community. Further, that community is used to testing out experimental approaches that promise to improve effectiveness and productivity. In that same conversation she pointed out the Stack Overflow community as being one of those places software engineers search for answers.

This morning, on the way to the gym, I realized that the Stack Overflow community is also an example of how we organically create relationships: people ask questions and get back specific answers of varying quality. The community rates the responses and preserves the answers. This has organically created a web of connections across topics and people.

Why is this important to understand? Because the work I’m doing proposes going beyond simple semantic analysis of individual files, or even clustering them based upon specific characteristics, but also by establishing a set of interrelationships across files and across applications. Focusing on solving this issue for a single application is much akin to focusing on just looking at the semantic content of a single file. Moving beyond this to realize that us humans create associations in our brains means we need to find ways of capturing those associations across applications that help us better navigate the vast trove of information we accumulate.

For example, Sasha suggested that it would be interesting for the software development community if there were an association between the web pages we accessed and the code we were editing. This makes sense to me: I tend to look up documentation or explanations of specific things as I write code. If we capture this relationship across applications, we can motivate why this is a systems problem and not an application problem – operating systems provide services that are common to applications (not necessarily all applications, but something that is of broader interest than a single or a few applications). One benefit of focusing on the software engineering community is that permits us to identify relationship of interest to the community and then mine those relationships.

Another fascinating conversation (late last week) was when I was discussing my research direction with Ghita Berrada (who is visiting us this month) and we ended up having a discussion of generalized concept analysis. I was familiar with Formal Concept Analysis (FCA) because it was used by Benjamin Martin in his work more than a decade ago (his dissertation “Formal Concept Analysis and Semantic File Systems” is something I have found delightfully insightful and I think it is time for me to read it again) but she pointed me to Temporal Concept Analysis and Relational Concept Analysis. Temporal Concept Analysis is fairly recent, having only been first formally described in 2000, and uses FCA as its basis, but it focuses on temporal events. Relational Concept Analysis is even more recent (2007) and is definitely germane to my research direction since it focuses on relationships and how to handle them within the context of FCA. Given my own focus on relationships across files, it definitely seems pertinent.

Introduction to Lattice Theory with Computer Science Applications

All of these are in turn based upon the mathematical model of lattices – partially ordered sets. Lattice theory has been around for quite some time and shows up in a broad range of areas, such as CRDTs which are used in distributed systems. For example, they are used in Anna, a key-value store I read about last year (surprisingly, I didn’t write it up – I should have). It in turn pointed to earlier work on Lattices in distributed systems, which I did describe previously. This has prompted me to go off and read about lattices; this definitely is challenging as my formal math skills are quite rusty, but I have been systematically working my way through the book on lattice theory I picked up. It has been interesting trying to construct an intuitive understanding of the concepts from more formal language describing them; hopefully that knowledge will prove useful.

This is turning into a long post and I still haven’t reached the point I wanted (yet).

A challenge with this work is identifying relationships that we want to be able to support. Of course, I don’t want to restrict these relationships, but rather use a sample of such relationships to motivate the work (or “Why would anyone care about my research?“) One of the challenges of doing good research is motivating that research: there are numerous questions to answer, but which questions deserve being answered?

File systems presently support two basic relationships:

Contains – this is the directory metaphor. A “directory” is a container of other directories or files. It is the basis of the hierarchical relationship to which we have become accustomed.
Points to – this is the function of a symbolic link, which is supported by most file systems.

It is relatively easy to focus on properties as well. In theory, the clustering of files based upon this characteristic is one (loose) form of relationship. For example, we routinely associate specific file names with a corresponding application (e.g., via the suffix of the file). Windows exhibits a strong relationship model here, in which applications register interest in particular types of files, based on suffix, and the operating system then uses that information to invoke the relevant application. Apple’s Mac used to use an embedded meta-data component (“resource fork“) for associating the file with the application; it still exists but is not commonly used in order to support compatibility.

Other types of file properties include:

Timestamp – these capture temporal properties of a file. The most common are the creation time, last update time, and last access time. Last access time is often omitted these days because constantly updating it turns out to be expensive. For example, Windows NTFS does not update last access time by default. My recollection is that they did this because they found updating this field accounted for something like a 6% performance cost in NTFS. Thus, there is a question as to how reliable these values are.
Size – we know what the size of files are.
Name – I’ve already mentioned using the suffix, which is an aspect of the name. Is it possible to exploit this in other ways?

Microsoft Visual Studio

Then there are application relationships: what application created this file? It occurs to me that it might be useful to distinguish registered names (suffixes) from all files created by an application. For example, I don’t usually want to see the build artifacts of my software development environment. Microsoft Visual Studio handles this by allowing artifacts to be separated from source files, for example. Could we achieve something comparable within the file system by understanding these relationships? Would it be useful?

Another suggestion from Sasha was that we might want to record what music was playing when we were doing something specific because our brains may create an association across these seemingly unrelated events. This suggests another potential relationship: concurrent application execution. This is a sort of temporal relationship, but one that becomes more interesting when we consider it in a Memex style model of associations. How can I capture these relationships across different applications. Perhaps we can think of some sort of “current context” or “current activity” be associated with a given application that can then be queried and added to the files as we create them. These types of dynamic relationships are certainly more intriguing or interesting.

What kinds of relationships can you envision being useful when you are searching for that elusive file you know that you have but you don’t know where it lives in the hierarchy?

The Ubiquitous Digital File: A Review of File Management Research

April 29, 2019

The Ubiquitous Digital File: A Review of File Management Research

Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.

I recently stumbled across this recent paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.

This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.

This paper is a survey paper, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which is roughly 7.5 pages of 32 page paper, or almost 25% of the full paper (32 pages). Since I have spent a considerable amount of time digesting much of the systems focused research as well as some of the Human Computer Interface (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature as well as identifying useful research questions, some of which I find particularly interesting.

Frameworks

One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well-understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or likely both!

Systems

Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.

They point out that their work is relevant to “computer science” (and particularly the branch with which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have
examined only files stored on servers and software development
machines.”

Thus two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems really has focused on specific workloads that are not likely representative of what is useful for file management (and correspondingly, for users of file management).

One observation the users make is surprising to me: “A preference for navigating to files is much more common than a preference for searching , even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that using augmented search mechanisms still likely have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.

Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.

This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.

Where does search functionality live?

May 27, 2017

In mulling over the depths of semantic knowledge and file systems, it occurs to me that one thing which differs between the world of Unix/Linux file systems and Windows file systems is that in Unix/Linux environments, search of a directory’s contents are done in the shell (or application) while in Windows they are a service of the file system.

I admit, when I first started working on Windows file systems, I thought this was an annoying decision, since it involved quite a bit of work inside the file system related to string handling and matching. Even as I write this, I still think that it is a lot of work that really doesn’t belong in the kernel, but, having said that, this distinction is one reason why a Unix/Linux file systems developer might not think of adding semantic support to a file system as something logical – after all, the purpose of the file system is to manage storage of file systems and associated meta-data, not to find things. Having experience in the Windows file systems space, I can understand why it might not be a great idea to do this in kernel mode. After all, C is not a language well-known for its strength and safety in handling strings, and the kernel is not an environment well-known for its tolerance of C runtime error tolerance.

But I digress. The point is this: when we begin to embed semantic knowledge inside the file system, we exploit a model in which the file system is involved in the search function and this would seem to be anathema to normal file systems behavior. This is a good challenge: does this need to be done in the file system? If not, perhaps there is instead an abstraction that the file system itself must be able to provide.

Each time I tackle this problem, my general sense is that the model I want is a case in which each file has a set of attributes. Ideally, what I want is some way to quickly and efficiently find things based upon those attributes. After all, how hard could this be?

One benefit to the current search paradigm with which users have been trained is that it does not provide reproducible search results. Thus, nobody will really be surprised if they repeat a search today and get back different results than they got back yesterday.

Hence, I keep coming back to this paradigm. It also gives me the sense that there are different characteristics of such a system – there are persistent attributes, like the timestamp, and ephemeral attributes, like semantic tags.

Plenty to think about, but this idea of where to draw the line of search is an important one. In either case, though, I need to determine efficient ways of rapidly finding files based upon these attributes.

Category Archives: Semantic File Systems

Recent Posts

Recent Comments

Archives

Categories

Subscribe to Blog via Email

Challenges of Capturing System Activity

Laundry Baskets: The New File System Namespace Model

Where has the time gone?

360° Semantic File System: Augmented Directory Navigation for Nonhierarchical Retrieval of Files

Visualization

Relationships

The Ubiquitous Digital File: A Review of File Management Research

Frameworks

Systems

Where does search functionality live?