Category Archives: Research Ideas
Relationships
I recently described two file systems (QMDS and GFS) that attempted to capture additional context for files to improve their usability. At Eurosys, I argued (somewhat successfully) that a distinguishing characteristic of my proposed work is to capture relationships between files, something that goes beyond mere isolated analysis of such files.
Index servers, which are now ubiquitous on mainstream platforms, attempt to solve this problem by focusing on the specific attributes of the given file. This is useful, and indeed it seems to be consonant with the general approach of semantic file systems, which attempt to classify files based upon their semantic content. For example, this is how modern Internet search engines work – they classify a document based upon its content.
I pointed out in my recent discussion of personal information management (PIM) that there is a difference between navigation and search. Thinking about this further, I realize that this is more nuanced and reflects the way that we humans look for things: we go to where we expect something to be (navigation) and if it is not there we then go to other places where we think it might be. For example, when I’m looking for my keys, I have a list of places I look first. When I don’t find them there I begin to systematically search for them.
Thus, the natural progression for finding the item of interest is navigation and then search. Index servers are basically the search part of the equation, and they do not tend to be where we start; they are the fall-back.
As humans, we tend to create associations between events. When I can’t find my keys I start thinking about the last time I saw them (“I know I had them because I was able to get into the apartment. After I got back I went and checked my e-mail…”) These are temporal clues but they are associations between other unrelated events and the object for which we are searching.
Another observation that Margo Seltzer made in our discussions on this topic is that when we use a web search engine, we are looking for an answer, but when we are searching for something that we already have, we are looking for the answer. This is a subtle, but important, difference. I don’t want to find any set of keys, I want to find a specific set of keys. Internet search is notoriously unreliable in this regard; how many times have you gone back a few days later to find some interesting article, only to realize the list of results coming back from your search engine is different than it was before? Internet search engines exploit the fact that you are (usually) just looking for an answer – they don’t have to give you definitive, reproducible results.
Yesterday I had an interesting conversation with Sasha Fedorova and towards the end of it she suggested that I might do better promoting this work by focusing on solving the needs of a particular community and she suggested the software engineering community, partially because she has worked with them before and also because she could see the kinds of relationships that might be useful to that community. Further, that community is used to testing out experimental approaches that promise to improve effectiveness and productivity. In that same conversation she pointed out the Stack Overflow community as being one of those places software engineers search for answers.
This morning, on the way to the gym, I realized that the Stack Overflow community is also an example of how we organically create relationships: people ask questions and get back specific answers of varying quality. The community rates the responses and preserves the answers. This has organically created a web of connections across topics and people.
Why is this important to understand? Because the work I’m doing proposes going beyond simple semantic analysis of individual files, or even clustering them based upon specific characteristics, to establishing a set of interrelationships across files and across applications. Focusing on solving this issue for a single application is much akin to focusing on just looking at the semantic content of a single file. Moving beyond this to recognize that we humans create associations in our brains means we need to find ways of capturing those associations across applications, so that they can help us better navigate the vast trove of information we accumulate.
For example, Sasha suggested that it would be interesting for the software development community if there were an association between the web pages we accessed and the code we were editing. This makes sense to me: I tend to look up documentation or explanations of specific things as I write code. If we capture this relationship across applications, we can motivate why this is a systems problem and not an application problem – operating systems provide services that are common to applications (not necessarily all applications, but something that is of broader interest than a single or a few applications). One benefit of focusing on the software engineering community is that it permits us to identify relationships of interest to that community and then mine those relationships.
Another fascinating conversation (late last week) was when I was discussing my research direction with Ghita Berrada (who is visiting us this month) and we ended up having a discussion of generalized concept analysis. I was familiar with Formal Concept Analysis (FCA) because it was used by Benjamin Martin in his work more than a decade ago (his dissertation “Formal Concept Analysis and Semantic File Systems” is something I have found delightfully insightful and I think it is time for me to read it again) but she pointed me to Temporal Concept Analysis and Relational Concept Analysis. Temporal Concept Analysis is fairly recent, having only been first formally described in 2000, and uses FCA as its basis, but it focuses on temporal events. Relational Concept Analysis is even more recent (2007) and is definitely germane to my research direction since it focuses on relationships and how to handle them within the context of FCA. Given my own focus on relationships across files, it definitely seems pertinent.
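For readers who have not run into FCA before, here is a minimal sketch of the standard formalism (these are the textbook definitions, not anything specific to the papers mentioned above). A formal context is a triple \((G, M, I)\), where \(G\) is a set of objects (think: files), \(M\) is a set of attributes, and \(I \subseteq G \times M\) records which objects have which attributes. Two derivation operators connect them:

\[
A' = \{\, m \in M \mid \forall g \in A : (g, m) \in I \,\}, \qquad
B' = \{\, g \in G \mid \forall m \in B : (g, m) \in I \,\}
\]

A formal concept is a pair \((A, B)\) with \(A' = B\) and \(B' = A\); ordered by \((A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2\), the set of all concepts forms a complete lattice, which is what ties FCA to the lattice theory below.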
All of these are in turn based upon the mathematical model of lattices – partially ordered sets. Lattice theory has been around for quite some time and shows up in a broad range of areas, such as CRDTs which are used in distributed systems. For example, they are used in Anna, a key-value store I read about last year (surprisingly, I didn’t write it up – I should have). It in turn pointed to earlier work on Lattices in distributed systems, which I did describe previously. This has prompted me to go off and read about lattices; this definitely is challenging as my formal math skills are quite rusty, but I have been systematically working my way through the book on lattice theory I picked up. It has been interesting trying to construct an intuitive understanding of the concepts from more formal language describing them; hopefully that knowledge will prove useful.
This is turning into a long post and I still haven’t reached the point I wanted (yet).
A challenge with this work is identifying relationships that we want to be able to support. Of course, I don’t want to restrict these relationships, but rather use a sample of such relationships to motivate the work (or “Why would anyone care about my research?”). One of the challenges of doing good research is motivating that research: there are numerous questions to answer, but which questions deserve being answered?
File systems presently support two basic relationships (a minimal sketch of both follows the list):
- Contains – this is the directory metaphor. A “directory” is a container of other directories or files. It is the basis of the hierarchical relationship to which we have become accustomed.
- Points to – this is the function of a symbolic link, which is supported by most file systems.
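These two relationships map directly onto POSIX calls. Here is a minimal sketch (the paths are made up purely for illustration):

/* "Contains" and "points to" as they appear at the POSIX API level. */
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Contains: a directory is a container for files and other directories. */
    if (mkdir("project", 0755) != 0)
        perror("mkdir");

    /* Points to: a symbolic link names another object by path. */
    if (symlink("project/notes.txt", "latest-notes") != 0)
        perror("symlink");

    return 0;
}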
It is relatively easy to focus on properties as well. In theory, clustering files based upon a shared characteristic is one (loose) form of relationship. For example, we routinely associate specific file names with a corresponding application (e.g., via the suffix of the file). Windows exhibits a strong relationship model here, in which applications register interest in particular types of files, based on suffix, and the operating system then uses that information to invoke the relevant application. Apple’s Mac used to use an embedded meta-data component (the “resource fork“) for associating a file with its application; resource forks still exist but are no longer commonly used, largely for compatibility reasons.
Other types of file properties include (a stat(2)-based sketch follows the list):
- Timestamp – these capture temporal properties of a file. The most common are the creation time, last update time, and last access time. Last access time is often omitted these days because constantly updating it turns out to be expensive. For example, Windows NTFS does not update last access time by default. My recollection is that they did this because they found updating this field accounted for something like a 6% performance cost in NTFS. Thus, there is a question as to how reliable these values are.
- Size – we know what the size of a file is.
- Name – I’ve already mentioned using the suffix, which is an aspect of the name. Is it possible to exploit this in other ways?
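All of these properties are visible through the standard metadata interface. A minimal sketch using stat(2) (the file name is illustrative):

/* Reading the per-file properties discussed above: timestamps, size, and name. */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
    const char *path = "report.pdf";             /* hypothetical file */
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return 1;
    }

    printf("size:     %lld bytes\n", (long long)st.st_size);
    printf("modified: %s", ctime(&st.st_mtime));
    printf("accessed: %s", ctime(&st.st_atime));  /* may be stale if atime updates are disabled */

    const char *suffix = strrchr(path, '.');      /* the name-based association */
    printf("suffix:   %s\n", suffix ? suffix : "(none)");
    return 0;
}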
Then there are application relationships: what application created this file? It occurs to me that it might be useful to distinguish registered names (suffixes) from all files created by an application. For example, I don’t usually want to see the build artifacts of my software development environment. Microsoft Visual Studio handles this by allowing artifacts to be separated from source files, for example. Could we achieve something comparable within the file system by understanding these relationships? Would it be useful?
Another suggestion from Sasha was that we might want to record what music was playing when we were doing something specific, because our brains may create an association across these seemingly unrelated events. This suggests another potential relationship: concurrent application execution. This is a sort of temporal relationship, but one that becomes more interesting when we consider it in a Memex-style model of associations. How can I capture these relationships across different applications? Perhaps some sort of “current context” or “current activity” could be associated with a given application, which can then be queried and added to files as we create them. These types of dynamic relationships are certainly more intriguing.
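One way to prototype this without changing the file system at all might be to stash the “current context” in a user extended attribute when a file is created. This is just a sketch under my own assumptions – the attribute name user.context and the context string format are inventions for illustration, not an existing convention:

/* Sketch: tagging a file with the "current context" (e.g., the music playing,
 * the active editor) using Linux user extended attributes. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int tag_with_context(const char *path, const char *context)
{
    /* Most local Linux file systems (ext4, xfs, btrfs) support user.* xattrs. */
    if (setxattr(path, "user.context", context, strlen(context), 0) != 0) {
        perror("setxattr");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Both the file name and the context string are hypothetical. */
    return tag_with_context("design-notes.md",
                            "music=focus-playlist;app=code-editor");
}

Querying such tags efficiently is, of course, exactly the indexing problem discussed earlier.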
What kinds of relationships can you envision being useful when you are searching for that elusive file you know that you have but you don’t know where it lives in the hierarchy?
The Ubiquitous Digital File: A Review of File Management Research
The Ubiquitous Digital File: A Review of File Management Research
Jesse David Dinneen and Charles-Antoine Julien, Journal of the Association for Information Science and Technology, April 12, 2019.
I recently stumbled across this paper, which I found to be very useful and timely for my current project. As I mentioned in my recent post about Eurosys 2019, I am looking at how we can do a better job of creating associative relationships across our data.
This isn’t a new idea – I described the Memex previously, which posited the idea of an associative data storage model. The current hierarchical model does a poor job of capturing this idea, but observing this is definitely not new, as even a cursory review of the literature points out.
This paper is a survey, capturing decades of research in the area of “File Management”. This is reflected in the paper’s exhaustive bibliography, which runs roughly 7.5 pages of the 32-page paper, or almost 25% of the total. Since I have spent a considerable amount of time digesting much of the systems-focused research as well as some of the Human-Computer Interaction (HCI) focused research in this area, I found this paper to be particularly insightful, both for categorizing the literature and for identifying useful research questions, some of which I find particularly interesting.
Frameworks
One of the observations that I found interesting was the authors’ identification that “[t]here do not currently exist any explicit theories about FM [File Management] or theoretical frameworks specifically for understanding it.” As a result, trying to evaluate alternative models or approaches remains particularly challenging. They do draw upon personal information management (PIM) as being valid for consideration and identify three categories to consider: keeping, exploiting, and managing data (or keeping, finding/refinding, and organizing). They do explore various ways of evaluation, but my sense from reading the paper is that the field is complex and not well understood. This either creates complexity when it comes to evaluation or creates further research opportunities (or, likely, both!)
Systems
Of course, my interest really lies in how this impacts systems. Ultimately, the only way to make effective system level optimizations is to understand the usage patterns of the applications. Some of the cases they observed resonated with me. For example “from a user-remembered event to an email in which it is discussed and then to a document that was attached to the email”. I liked this because I have used the reverse process of following back from a document to the e-mail from which it originated as a good use case for considering the design of a new file system.
They point out that their work is relevant to “computer science” (and particularly to the branch in which I work): “… a considerable body of existing literature aims to understand the contents and access patterns of file systems, such as file size distribution, to optimize hardware, firmware, and software. FM studies focusing on real-world file systems that users have interacted with may provide valuable data sets for such design goals, especially given that most of such computer science studies have examined only files stored on servers and software development machines.”
Thus, two important observations: (1) there is a synergy between file management and storage management that should be realized; and (2) prior work in systems has really focused on specific workloads that are not likely representative of what is useful for file management (and, correspondingly, for the users doing that file management).
One observation the authors make was surprising to me: “A preference for navigating to files is much more common than a preference for searching, even among users who prefer to search rather than navigate folders when retrieving their emails”. What this suggests to me is that trying to shift people to a search-based paradigm may not, in fact, be useful. Thus, it may be more important to consider ways in which information can be presented for navigation in a more flexible way than the current hierarchical model would suggest. The authors do point out that augmented search mechanisms still likely have a place. Another potential model to consider is to provide mechanisms by which applications can convert navigation into search queries in a more dynamic fashion.
Perhaps something more radical is in order, some sort of automated mechanism for augmenting navigation and management functions: suggesting locations to create new files based upon similarity, for example, or allowing temporal navigation. Some of these are issues that I have been considering and discussing with others, but this paper really emphasizes their importance and I would be remiss to ignore the research literature they have summarized.
This is a text-dense paper, with no figures and only text tables. I’ve now read through it twice and expect I will do so several more times as I try to extract the salient points for my own work, which is what I will start describing in subsequent posts.
Of file handles and implicit offsets (Follow-up)
My post about file handles elicited some interesting feedback, so I wanted to capture it because I thought it provided some insight.
Shared libraries were not a standard part of UNIX systems in the 1980s (though they had certainly been described in prior work) and thus one interesting observation here is that putting code in the kernel was a way of minimizing the amplification of common runtime code. The use of shared libraries today and our increased certainty that kernels need to be as small as we can make them certainly would lead to a very different divide today.
Conversations with Malcolm are often insightful, so I wanted to capture this exchange here because it is definitely germane to the area that I’m exploring, particularly as I try to explain some of the underlying rationale for it – a combination of software archaeology and pragmatically looking at how to evolve forward. Malcolm’s comment:
I always thought this was just a programming convenience, because it allows a simple program to have “read next” semantics, and half of the core UNIX utilities are stream parsers that want that semantic. While you’re in the area though, I’d ask “what’s the purpose of a current directory?” which is implemented (on Windows) as a process wide value. I’m guessing it also started (on UNIX) as a programming convenience, but being process-wide has meant that it disappeared as a convenience and reemerged as a headache. (DOS arguably had a different history since it was trying to run non-directory aware applications in the presence of directories.)
My response:
If it were just a convenience we could easily bury it in a library. I’m trying to do some hybrid file systems implementation work and it creates complications that seem unnecessary. But who really looks at this old cruft anyway?
Current directory is another good one. And the directory enumeration offset a third.
And his reply:
As you mentioned though, POSIX was trying to codify existing implementations, and presumably at some point the choice between kernel and library could have been made, but once it’s been made it’s hard to change. UNIX libraries always seemed strange to me in that in many cases they’re super-simple syscall wrappers (similar to NTDLL) but in some cases (file pattern matching) all the heavy lifting is there. Remember that in the beginning shared libraries weren’t a thing, so requiring functionality to be in a library meant duplicated code while the kernel provided a natural place to share code/functionality. This probably influenced a lot of choices.
Of file handles and implicit offsets
My current research direction (which is wandering a bit, as is common with research) has forced me to look at some of the vagaries of the POSIX interface. One of these is the intriguing decision to incorporate a piece of file-descriptor-specific state for the “file pointer” (note that Windows has an exact equivalent in the CurrentByteOffset field of the file object).
One thing to note about POSIX is that it was not designed from scratch. Rather, it captured the state of UNIX systems in the 1980s and codified it. Thus, rather than inventing this behavior, POSIX (officially, IEEE Std 1003.1-1988) standardized a uniform interface acceptable to a variety of parties. Like any standards document, it is a compromise that attempts to mollify a variety of different players.
Here is a version of the Linux in-kernel file structure (from the main linux repository as of this morning):
struct file {
    union {
        struct llist_node    fu_llist;
        struct rcu_head      fu_rcuhead;
    } f_u;
    struct path              f_path;
    struct inode             *f_inode;    /* cached value */
    const struct file_operations *f_op;

    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t               f_lock;
    enum rw_hint             f_write_hint;
    atomic_long_t            f_count;
    unsigned int             f_flags;
    fmode_t                  f_mode;
    struct mutex             f_pos_lock;
    loff_t                   f_pos;
    struct fown_struct       f_owner;
    const struct cred        *f_cred;
    struct file_ra_state     f_ra;

    u64                      f_version;
#ifdef CONFIG_SECURITY
    void                     *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void                     *private_data;

#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head         f_ep_links;
    struct list_head         f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space     *f_mapping;
    errseq_t                 f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));    /* lest something weird decides that 2 is OK */
Note the f_pos field (which I’ve highlighted). This is the file pointer and it allows things like read and write to work without an explicit offset value.
Here’s the equivalent structure in Windows 10:
typedef struct _FILE_OBJECT {
    CSHORT Type;
    CSHORT Size;
    PDEVICE_OBJECT DeviceObject;
    PVPB Vpb;
    PVOID FsContext;
    PVOID FsContext2;
    PSECTION_OBJECT_POINTERS SectionObjectPointer;
    PVOID PrivateCacheMap;
    NTSTATUS FinalStatus;
    struct _FILE_OBJECT *RelatedFileObject;
    BOOLEAN LockOperation;
    BOOLEAN DeletePending;
    BOOLEAN ReadAccess;
    BOOLEAN WriteAccess;
    BOOLEAN DeleteAccess;
    BOOLEAN SharedRead;
    BOOLEAN SharedWrite;
    BOOLEAN SharedDelete;
    ULONG Flags;
    UNICODE_STRING FileName;
    LARGE_INTEGER CurrentByteOffset;
    __volatile ULONG Waiters;
    __volatile ULONG Busy;
    PVOID LastLock;
    KEVENT Lock;
    KEVENT Event;
    __volatile PIO_COMPLETION_CONTEXT CompletionContext;
    KSPIN_LOCK IrpListLock;
    LIST_ENTRY IrpList;
    __volatile PVOID FileObjectExtension;
} FILE_OBJECT;

typedef struct _FILE_OBJECT *PFILE_OBJECT;
I highlighted the equivalent field for this structure (from wdm.h in the Windows 10 WDK). I spent some time looking through the various fields and my observation is that this is the only piece of implicit user-visible shared mutable state.
This actually doesn’t work in multi-threaded environments (very common these days) if threads use the same file descriptor (file handle in Windows), since it doesn’t make any sense to arbitrarily interleave reads. In those environments, you use a different call – pread on POSIX systems – while in Windows the offset is an explicit parameter in the native system call (NtReadFile, where it is an optional parameter).
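To make the difference concrete, pread takes the offset as an argument and neither consults nor updates the shared file position. A minimal sketch (the path is made up):

/* Reading at an explicit offset: no implicit, shared file position involved. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);      /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    char buf[128];
    /* Two threads can each issue pread at their own offsets without racing
     * on f_pos (or CurrentByteOffset on Windows). */
    ssize_t n = pread(fd, buf, sizeof(buf), 4096);
    if (n < 0)
        perror("pread");

    close(fd);
    return 0;
}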
This led me to ask the question: why is this here? I haven’t found a definitive source since this predates the original POSIX specification, but my theory is that it is because it is the only way to properly implement sharing of the file descriptor. When UNIX added the fork call, one of the characteristics of it was “inheritance of file descriptors”.
The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)). (Source: http://man7.org/linux/man-pages/man2/fork.2.html)
The status flags describe how the file was opened, so they aren’t changing (immutable). The addition of F_SETOWN and F_SETSIG is more recent, but it does appear to be explicitly mutable state (it allows programmatic changes).
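A small experiment makes the shared offset visible: after fork, a read performed by the child advances the position the parent sees, because both descriptors refer to the same open file description. A minimal sketch (error handling elided):

/* Demonstrating that the file position lives in the shared open file
 * description, not in the per-process descriptor table. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hosts", O_RDONLY);    /* any readable file will do */
    char buf[16];

    if (fork() == 0) {                        /* child */
        read(fd, buf, sizeof(buf));           /* advances the shared offset */
        _exit(0);
    }
    wait(NULL);

    /* The parent's position reflects the child's read. */
    printf("offset after child's read: %lld\n",
           (long long)lseek(fd, 0, SEEK_CUR));
    close(fd);
    return 0;
}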
Fork is not the only way that a file descriptor (or file handle) can be shared. For example, it can be done using UNIX domain sockets on UNIX and Linux systems (by passing the descriptor as SCM_RIGHTS ancillary data), and Windows provides a system call for doing something similar as well (the documented version is ZwDuplicateObject).
I’ve spent time thinking about this, and it seems that the reason to maintain this shared state is to ensure that two processes sharing the same file descriptor/handle get the same position pointer value. This then led me to ask: why is this useful?
I have been able to construct a single scenario in which this is useful: appending to the end of a shared file. Interleaving reads doesn’t make much sense. Interleaving writes inside the existing boundaries of a file makes even less sense to me. I can construct peculiar scenarios in which I can write applications that explicitly use this feature, but they seem artificial.
Appending to the end of a log file seems like it would make sense. But if that’s my goal, it makes more sense to just use O_APPEND mode:
The file is opened in append mode. Before each write(2), the file offset is positioned at the end of the file, as if with lseek(2). The modification of the file offset and the write operation are performed as a single atomic step. (Source: http://man7.org/linux/man-pages/man2/open.2.html)
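So the logging scenario is already covered without leaning on the shared position. A minimal sketch (the file name and log line are made up):

/* Appending log records atomically with O_APPEND instead of relying on a
 * shared file position. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    const char *line = "worker 7: task complete\n";
    /* Each write is positioned at end-of-file as a single atomic step, so
     * multiple processes can safely share the log. */
    write(fd, line, strlen(line));
    close(fd);
    return 0;
}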
Thus, this makes me wonder: could we just eliminate this piece of shared state? I have a reason for asking this question, though I will save discussing that for another time.
Preserving the correct behavior for most applications will require fixing things up in the library – we could eliminate read as a system call and provide a library implementation that calls pread.
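Here is a minimal sketch of what such a library shim might look like, assuming the offset is tracked in user space (illustrative only – a real implementation would need locking and would have to cope with dup, lseek, pipes, and sockets):

/* Emulating read(2) in a library on top of pread(2), with the "file pointer"
 * kept in user space rather than in the kernel. */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_FD 1024
static off_t file_offsets[MAX_FD];      /* per-descriptor user-space offset */

ssize_t my_read(int fd, void *buf, size_t count)
{
    if (fd < 0 || fd >= MAX_FD) {
        errno = EBADF;
        return -1;
    }
    ssize_t n = pread(fd, buf, count, file_offsets[fd]);
    if (n > 0)
        file_offsets[fd] += n;          /* advance the user-space offset */
    return n;
}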
I’m considering doing that and seeing what breaks. It is more difficult to do in Windows than in Linux, so I’m considering starting with Linux.
For those problems that are nails, I have a sledgehammer
I’ve been involved in a project that requires I learn more about blockchain than I ever really wanted to know.
I had a basic appreciation of it from reading the Nakamoto paper many years ago. It provides an insightful solution to a challenging problem – the problem of distributed consensus in an untrusted network. There is an agreed mechanism (the protocol) and an incentive to participate (the fee).
Why is an untrusted network the problem? Because this came from the world of digital cash, and the concern in that environment was that someone could “double spend” money. The blockchain model makes that effectively impossible. It does so in an interesting way – by combining a consensus protocol (getting multiple systems to agree on a given outcome) along with an incentive to behave (participants get paid to behave). It’s an intriguing way of reaching common consensus.
But it has limitations. Like any consensus protocol, it takes time to send messages back and forth and reach an agreement. Further, blockchain requires that nodes attempt to solve a cryptographically hard problem. The node that is successful gets the reward. The algorithm is designed so that it takes about 10 minutes for someone to reach a valid solution. These solutions are easy to verify, so once someone has figured out a solution – which indicates real work – they present their answer and it is echoed throughout the network. Because each block incorporates the answer from the prior block, the result is a chain that is not realistically mutable (i.e., it cannot easily be changed).
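The asymmetry – expensive to find a solution, cheap to check one – is easy to see in code. A rough sketch of the verification side, assuming a simplified block format and a leading-zero-bytes difficulty rule (this is illustrative, not Bitcoin’s actual block layout; it uses OpenSSL’s SHA256 and links with -lcrypto):

/* Sketch of proof-of-work verification: hash (previous hash || payload || nonce)
 * and accept the solution if the digest starts with enough zero bytes. */
#include <openssl/sha.h>
#include <string.h>

int verify_solution(const unsigned char prev_hash[SHA256_DIGEST_LENGTH],
                    const char *payload, unsigned long nonce,
                    int difficulty_bytes)
{
    unsigned char input[SHA256_DIGEST_LENGTH + 256 + sizeof(nonce)];
    unsigned char digest[SHA256_DIGEST_LENGTH];
    size_t len = 0;
    size_t plen = strlen(payload);

    if (plen > 256)
        plen = 256;
    memcpy(input + len, prev_hash, SHA256_DIGEST_LENGTH); len += SHA256_DIGEST_LENGTH;
    memcpy(input + len, payload, plen);                   len += plen;
    memcpy(input + len, &nonce, sizeof(nonce));           len += sizeof(nonce);

    SHA256(input, len, digest);               /* one cheap hash to verify... */

    for (int i = 0; i < difficulty_bytes; i++)
        if (digest[i] != 0)                   /* ...versus vast numbers of attempts to solve */
            return 0;
    return 1;
}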
There are more technical details (Merkle Trees and cryptographic hashes and so forth) but the final solution is a “public ledger” that shows who owns what in the world.
Here’s the rub: this is interesting technology, but it is not the solution to every problem. As is so often the case in computing, though, we pick “the shiny new tool” and insist it must be the right tool to choose for our problem at hand. The challenge then, for those of us who have been through this before, is to actually analyze a problem and figure out a good solution. Of course, sometimes we’re called upon to utilize a particular tool, like blockchain, regardless of whether or not it is the right tool for the job. When that happens there’s an unenviable task of trying to explain why it is not the right solution. Doing so is tough, as it is often not what a customer wants to hear.
Where does search functionality live?
In mulling over the depths of semantic knowledge and file systems, it occurs to me that one thing which differs between the world of Unix/Linux file systems and Windows file systems is that in Unix/Linux environments, searching a directory’s contents is done in the shell (or application), while in Windows it is a service of the file system.
I admit, when I first started working on Windows file systems, I thought this was an annoying decision, since it involved quite a bit of work inside the file system related to string handling and matching. Even as I write this, I still think that it is a lot of work that really doesn’t belong in the kernel. Having said that, this distinction is one reason why a Unix/Linux file systems developer might not think of adding semantic support to a file system as something logical – after all, the purpose of the file system is to manage the storage of files and associated meta-data, not to find things. Having experience in the Windows file systems space, I can understand why it might not be a great idea to do this in kernel mode. After all, C is not a language well known for its strength and safety in handling strings, and the kernel is not an environment well known for its tolerance of runtime errors.
But I digress. The point is this: when we begin to embed semantic knowledge inside the file system, we adopt a model in which the file system is involved in the search function, and this would seem to be anathema to normal file system behavior. This is a good challenge: does this need to be done in the file system? If not, perhaps there is instead an abstraction that the file system itself must be able to provide.
Each time I tackle this problem, my general sense is that the model I want is a case in which each file has a set of attributes. Ideally, what I want is some way to quickly and efficiently find things based upon those attributes. After all, how hard could this be?
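The mental model, stripped to its essentials, looks something like this (a toy, in-memory sketch – the attribute names and files are invented, and a real system would want an index rather than a linear scan):

/* Toy sketch of attribute-based lookup: each file carries attribute strings,
 * and a query returns the files matching a given attribute. */
#include <stdio.h>
#include <string.h>

struct entry {
    const char *name;
    const char *attrs[4];                 /* tags, application, context, ... */
};

static const struct entry files[] = {
    { "notes.md",   { "project:fs-design", "app:emacs", "tag:draft", NULL } },
    { "paper.pdf",  { "project:fs-design", "tag:reference", NULL } },
    { "budget.xls", { "project:home", NULL } },
};

static void find_by_attr(const char *attr)
{
    for (size_t i = 0; i < sizeof(files) / sizeof(files[0]); i++)
        for (size_t j = 0; files[i].attrs[j] != NULL; j++)
            if (strcmp(files[i].attrs[j], attr) == 0)
                printf("%s\n", files[i].name);
}

int main(void)
{
    find_by_attr("project:fs-design");    /* prints notes.md and paper.pdf */
    return 0;
}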
One benefit of the current search paradigm, with which users have been trained, is that it does not provide reproducible search results. Thus, nobody will really be surprised if they repeat a search today and get back different results than they got yesterday.
Hence, I keep coming back to this paradigm. It also gives me the sense that there are different characteristics of such a system – there are persistent attributes, like the timestamp, and ephemeral attributes, like semantic tags.
Plenty to think about, but this idea of where to draw the line of search is an important one. In either case, though, I need to determine efficient ways of rapidly finding files based upon these attributes.