Home » Research Ideas
Category Archives: Research Ideas
Challenges of Capturing System Activity
A key aspect of the work I am doing for Indaleko is to “capture system activity” so that it can be used to form “activity contexts” that can then be used to inform the process of finding relevant information. As part of that, I have been working through the work of Daniela Vianna. While I have high-level descriptions of the information she collected and used, I need to reconstruct this. She collects data from a variety of sources. The most common source of information comes from web APIs to services such as Google and Facebook. In addition, she also uses file system activity information.
Since my background is file systems, I decided to start on the file system activity front first. Given that I’ve been working with Windows for three decades now, I decided to leverage my understanding of Windows file systems to collect such information. One nice feature of the NTFS file system on Windows is its support for a form of activity log known as the “USN Journal.” Of course, one of my handicaps is that I am used to using the native operating system API, not the libraries that are implemented on top of it. This is because when building file systems on Windows I have always been interested in testing the full kernel file systems interface. While there are a few specific features that cannot be exercised with just applications, there are still a number of interfaces that cannot be tested using the typical Win32 API that can be tested using the native API. In recent years the number of features that have been hidden from the Win32 API has continued to decrease, which has diminished the need to use the native API. I just haven’t had any strong need to learn the Win32 API – why start now?
I decided the model I want to use is a service that pulls data from the USN journal and converts it into a format suitable for storing in a MongoDB database. I decided to go with Mongo because that is what Vianna used for her work. The choice at this point is somewhat arbitrary but MongoDB makes sense because it tends to work well with semi-structured data, which is what I will be handling.
Similarly, I decided that I’d write my service for pulling USN Journal data from the NTFS file system(s) in C# since I have written some C# in the past, it makes doing some of the higher level tasks I have much easier, and is well-supported on Windows. I have made my repository public though I may restructure and/or rename it at some point (currently I call it CSharpToNativeTest because I was trying to invoke the native API as unmanaged code from C#). The most common approach to this is to utilize a specific mechanism (the “PInvoke” mechanism) but after a bit of trial-and-error I decided I wanted something that would be easier for me to debug, so instead of pulling the native routine directly from ntdll.dll I load it from my own DLL (written in C) and that then invokes the real native call. This allows me to see how data is being marshaled and delivered to the C language wrapper. I also tried to make the native API “more C# friendly.” I am sure it could be more efficient, but I wanted to support a model that I could extend and hopefully it will be easier to make it more efficient should that prove necessary.
One thing I did was to script the conversion of all the status values in ntstatus.h into a big C# enum type. The benefit of this is that when debugging I can automatically see the mnemonic name of the status code as well as its numeric value. I then decided to provide the layer needed to map the various volume names used on Windows around, with device names, device IDs, and symbolic links (drive letters) that can be mapped. While I have not yet added it, I wrote things so that it should be fairly straight-forward to add a background thread which wakes up when devices arrive or disappear. As I have noted before “naming is hard.” This is just one more example of the flexibility and challenges with aliasing and naming.
Finally, I turned my attention to the USN journal. I found some packages for decoding USN journal entries; most were written to parse the data from the drive, while a few managed dynamic access. Since I want this to be a service that monitors the USN journal and keeps adding information into the database, I decided to write C# code to use the API for retrieving that information. At this point, what I have is the ability to scan all the volumes on the machine – even if they do not have drive letters – and query them to see if they support a USN journal. I do this properly – I query the file system attributes (using the NtQueryVolumeInformationFile native API) and check if the bit showing USN journal support is marked. I do not use the file system name, an approach I’ve always considered to be a hack, especially since I have been in the habit of writing file systems that support NTFS features, including named data streams, extended attributes, and object IDs. In fact, the ReFS file system on Windows also supports USN journals, so I’m not just being my usual pedantic developer self in this instance.
At this point, I am able to identify volumes that support USN journals, open them and find out if USN is turned on (it is by default on the system volume, which is almost always the “C:” drive, though I enjoy watching things break when I configure a system to use some other drive letter.) I then extract the information and convert it to in-memory records. At the moment I just have it wait a few seconds and pull the newest records, but my plan is to evolve this into a service that I can run and it can keep pulling data and pushing it into my MongoDB instance.
At this point, I realized I do not really know that much about MongoDB so I have decided to start learning a bit more about it. Of course, I don’t want to be a MongoDB expert, so I also have been looking more carefully at Daniela Vianna’s work, trying to figure out what her data might have looked like and think about how I’m going to merge what she did into what I am doing. This is actually exciting because it means I’m starting to think of what we can do with this additional information.
This afternoon I had a great conversation with one of my PhD supervisors about this and she was making a couple of suggestions about ways to consume this data. That she was suggesting things I’d also added to my list was encouraging. What are we thinking:
- We can consider using “learned index structures” as we begin to build up data sets.
- We can use techniques such as Google BERT to facilitate dealing with the API data that Vianna’s work used. I pointed out that the challenges of APIs that Vianna pointed out are similar to languages: they have meaning and those meanings can be expressed in multiple ways.
- The need for being able to efficiently find things is growing rapidly. She was explaining some work that indicates our rate of data growth is outstripping our silicon capabilities. In other words, there is a point at which “brute force search” becomes impractical. I liked this because it suggests what we are seeing with our own personal data is a leading indicator of the larger problem. This idea of storing the meta-data independent of the data is a natural one in a world where the raw information is too abundant for us to just go looking for an item of interest.
So, my work continues, mostly mundane and boring, but there are some useful observations even at this early stage. Now to figure out what I want the data in my database to look like and start storing information there. Then I can go figure out what I did right, what I did wrong, and how to improve things.
Aside: one interesting aspect of the BERT work was their discussion of “transducers.” This reminded me of Gifford’s Semantic File System work, where he used transducers to suck out semantic information from existing files.
Visualization
Last post I discussed relationships. But relationships really are not enough. Another key to this puzzle is visualization. In other words, how do we present the information to users so that it is useful.
But first, let me step back and point to a larger problem: information overload. If we present users with a list of 100,000 options, they won’t be able to necessarily find what they’re actually seeking. For example, one of the challenges of using an Internet search engine is that they can return millions of “answers” to my basic query. In fact, I just typed in “What is the meaning of life?” to a search engine and it responded back with 1.1 billion possible answers, all in under a second. They are a marvel at backing up the dump truck and offering to inundate me with answers. When is the last time you went through even a handful of these, let alone a large percentage of them?
I suspect that if I ask the HCI folks they will be able to tell me what works for ordinary mortals, but I assure you that computers are capable of returning more information than one can possibly ever process – many years ago I received a bug report about a file system directory that contained over 700,000 files and did not display with the file system kit I’d constructed and the company sold. I’d known I made decisions about resources when I did it; but we lifted that restriction and supported much larger directories than that. I’m quite sure that no mere human would look at such a directory in any meaningful sense. Maybe it was a collection of log files, in which case the names likely embed semantic information about the files themselves. Indeed, in Burrito the authors noted that scientists often embed the schema of their data within the file names. I know that I have done the same and when I’m looking for specific data I am often scripting code to sift through the pile of data to find the subset that is useful to me. The point remains the same: huge listings of files within a directory don’t work for humans.
One potential area for considering navigation is faceted search, a technique for making vast quantities of data searchable. Indeed, this fits well with the graph file system idea because what we expect to find in our graphs are clusters of related files. The graph is likely to be a sparse graph because most files are unlikely to share common features with one another. Thus, it suggests that at least one model for this data visualization problem is going to be rolling up data into these clusters, with an iterative approach to breaking it down and displaying it further. Of course, we might be able to do that within the confines of the existing hierarchical structure, which would be great for retrofitting this into the vast array of existing applications; still, my hope is that we can also provide novel new variations (or rather someone more clever than I am at these things will do so). The challenge is to build something that enables this means I need to have some level of understanding as to what kinds of information are needed to do so.
For example, I was wondering about first order approximations – those that mimic an existing hierarchical file system. Such a file system would present one or more views of the data. Maybe we have a Time view, and the time view then shows you all the files in time order. But if you have 1,000,000 files, that is going to be a mighty big list. One option might be to divide it up into ever smaller chunks of time. Eventually, we’d get to a point where you could see a few dozen files in some sort of time order. Of course, strict sharding of time might not make sense either: why should two files that are separated in time by small intervals end up in separate locations.
One possible option would be to consider a time slider that controls the view. This is something that we can find in other temporal data tools. Thus, creating a time slider might help make visualization easier to perform – in some ways this is similar to the Windows timeline feature that has recently been added into Windows 10. I suspect the file system doesn’t do much to facilitate this. From my own use of this feature it has some interesting limitations, not the least of which is that if you want full functionality they want to export data “to the cloud” for further analysis. That sounds like whatever they are doing is data intensive, which in turn suggests to me that the file system is not doing anything to facilitate this. If the answer really is “sorry, but you have to spend lots of computational cycles to mine this data” then my research is unlikely to be fruitful. I push forward because I don’t think this really is the case – the database community has done marvellous things that permit a vast treasure of relationships be mined.
If we combine this with a faceted search area, I can envision a filtered timeline model – in essence, this seems to be what the new Windows feature is doing, albeit by filtering the things they have deemed to be of interest. I suspect they will extend this capability over time, but a time ordered view is just one of the possible relationships I consider to be important for consideration. I don’t know what the relationships will be, but I do think that having a pre-defined list of those relationships will be self-limiting. Perhaps it will be enough. I proceed on the basis that I expect it will not be enough.
One possible visualization I’ve been considering – and part of the motivation for me deciding I need to start building a file system – is that this sort of namespace might be achievable on our existing hierarchical model if I just add “an extra level of indirection”. Suppose we construct a file system that, instead of having the existing file name has a directory of the same name. In turn, that directory then contains relationships associated with other objects. One of those other objects could be the actual file (so we can still access it), but we could also have a “temporal” directory that would then display a list of files that were created, modified, or accessed around the same timeframe. We could store information about what web sites were visited around times that the file was in use. We could keep track of the music that was playing around the same time. We could point to files that were similar to that specific file. Such a visualization could be easily achieved and compatible with existing tools. Rather than being an endpoint, this is more an intermediate staging area – a way of mocking up the concepts and ideas, and to see what works for people and what does not work.
Thus the desire for a file system. I think we can construct a static namespace by using an existing file system and symbolic links (so you can eventually get back to the real files) by mining existing data sources. But eventually, we are going to want dynamic support here. We can stub that out with FUSE (for example) but in my experience (and based upon the literature) FUSE is slow for meta-data operations, which is really all I will be doing. Building file systems is hard work, but it is something I’ve been doing for years, so that aspect doesn’t really scare me. Visualization on the other hand is an area in which I don’t have a lot of experience.
I’m certainly open to ideas…
Recent Comments