Home » 2019 » April » 29

Daily Archives: April 29, 2019

The Ubiquitous Digital File: A Review of File Management Research

The Ubiquitous Digital File: A Review of File Management Research

Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.

I recently stumbled across this recent paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.

This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.

This paper is a survey paper, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which is roughly 7.5 pages of 32 page paper, or almost 25% of the full paper (32 pages). Since I have spent a considerable amount of time digesting much of the systems focused research as well as some of the Human Computer Interface (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature as well as identifying useful research questions, some of which I find particularly interesting.

Frameworks

One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well-understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or likely both!

Systems

Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.

They point out that their work is relevant to “computer science” (and particularly the branch with which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have
examined only files stored on servers and software development
machines.”

Thus two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems really has focused on specific workloads that are not likely representative of what is useful for file management (and correspondingly, for users of file management).

One observation the users make is surprising to me: “A preference for navigating to files is much more common than a preference for searching , even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that using augmented search mechanisms still likely have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.

Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.

This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.