Home » 2019 » April

Monthly Archives: April 2019

The Ubiquitous Digital File: A Review of File Management Research

The Ubiquitous Digital File: A Review of File Management Research

Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.

I recently stumbled across this recent paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.

This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.

This paper is a survey paper, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which is roughly 7.5 pages of 32 page paper, or almost 25% of the full paper (32 pages). Since I have spent a considerable amount of time digesting much of the systems focused research as well as some of the Human Computer Interface (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature as well as identifying useful research questions, some of which I find particularly interesting.

Frameworks

One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well-understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or likely both!

Systems

Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.

They point out that their work is relevant to “computer science” (and particularly the branch with which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have
examined only files stored on servers and software development
machines.”

Thus two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems really has focused on specific workloads that are not likely representative of what is useful for file management (and correspondingly, for users of file management).

One observation the users make is surprising to me: “A preference for navigating to files is much more common than a preference for searching , even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that using augmented search mechanisms still likely have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.

Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.

This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.


Eurosys 2019

I attended Eurosys 2019 last month and in fact just returned from that trip, as I added three weeks of vacation to the end of it, though the first week had me spending most of the time in a hotel room working on a paper for submission.

I attended the doctoral workshop at Eurosys to pitch my idea for an associative file system. I received useful feedback from both the paper I submitted as well as my presentation. My poster for the doctoral workshop attempted to capture a non-textual perspective on the problem and the approach I am seeking to achieve




I did achieve my goal of ensuring there was minimal text in the poster, though I’m not sure I quite hit the right balance with it.

I also presented a second poster based upon a paper we had submitted around the same idea. This was a very different realization of the same basic concepts.

As I promised in my earlier post, I will be discussing my forward moving file systems idea(s) in more detail as I move along through this project.


Reboot time

It’s been almost a year since I posted anything substantive. It is so easy to just focus on other things, which is what I’ve been doing. In the past year I’ve continued to explore some interesting areas related to file systems. For example, for the past year I’ve been looking at persistent memory, which acts somewhat like storage (because it is persistent) and somewhat like DRAM (because it is byte addressed). The findings have been interesting: surprising in some cases, close to predicted in some cases. We’re still doing more work in this area, and I’m hoping to submit two papers later this year.

But the other big project is a file systems project. I’ve decided to use Windows as my platform of choice, both because I’m quite comfortable with Windows and because I think it’s the best choice for this project. Since it has been a goal of mine to do more writing in this area, I thought it would be great to use my blog to capture how I go about writing this file system for Windows with the (probably unrealistic) hope that I can eventually turn it into an online guide. I also plan on using a public repository for my project so that other people can see what I end up doing. I think I’ll turn that into a separate blog post, so I can talk more about that project.

Hopefully, by using this as a mechanism for describing my forward progress I can continue to post new information and content. That’s the theory at least.

If you are interested in persistent memory, two good recent papers are:

Basic Performance Measurements of the Intel Optane DC Persistent Memory Module

Single Machine Graph Analytics on Massive Datasets Using IntelĀ® OptaneTM DC Persistent Memory