Yearly Archives: 2022

What is the Optimal Location for Storing Metadata

November 6, 2022

The past month has included both a very interesting talk from someone at a major storage vendor and an in-depth discussion about my work and how it might be applicable to an issue that confronts the Metaverse community. I haven’t been at the keyboard much (at least not for my research) but I have been mulling this over as I have worked to try and explain these insights. Each iteration helps me refine my mental model by considering what else I have learned. Fortunately, this latest round doesn’t impact the work that I have done, but it has provided me with a model that I think could be useful in explaining this work to others.

I have previously talked about a type of metadata that I call activity context. Of course, there is quite a lot of metadata that is involved in managing storage, and I have been using a model in which the metadata I am collecting is gathered not at the point of storage but rather at the point of analysis. In my case, the point of analysis is on (or near) my local device ecosystem. I learned more about the needs of the emerging metaverse field by speaking with my friend Royal O’Brien, who is the general manager for the Open 3D Foundation, which is part of the Linux Foundation. Combining what I learned there with insights I gained from a recent talk given to my research group, I arrived at what I think are some useful observations:

Storage vendors have no mechanism for capturing all the kinds of activity data that I envision using as the basis for activity context.
Some high-performance data consumers need to maintain replicated data and use metadata about that data to make critical decisions.
Metadata needs to be close to where it will be consumed.
Metadata needs to be produced where the information is available and optimally where it is least expensive to do so.

That isn’t a long list, but it is one that requires a bit more unpacking. So I’m going to dive deeper, step by step. This probably isn’t the right order, but I will start here and worry about (re)-organizing it later.

Metadata Production

I had not really considered the depth of the question about where to produce the metadata until I started mulling over the myriad questions that have arisen recently. The cost of producing metadata can be a critical factor. Agents that extract semantic information about the data (e.g., its content) need to be close to the data. However, it is important to note that this is not the same as “the final location of the data” but rather “a current location of the data.” Yet, even that isn’t quite right: metadata might be extracted from something other than the data, like something from the running system, or even an external source. For example, the activity data that I have been focused on collecting (see System Activity) largely arises on the system where the data itself is accessed. The metaverse model is one where the user has considerable insight (ah, but a bit more on this later) and since I’ve always envisioned an extensible metadata management system it makes sense to permit a specialized application to contribute to the overall body of metadata.

Thus, the insight here is that it makes sense to generate metadata at the “lowest cost” point to do so. For example, the activity data on my local machine can’t be collected by a cloud storage engine. It could be collected by an agent on the local machine and sent to the cloud storage engine, but that runs into a separate cost that I’ll touch on when I describe where we should be storing metadata. For example, extracting semantic content makes sense to do at the point of production and again at the point of storage. Activity data, which is related to “what else is happening” can’t be extracted at the point of storage. Even causal data (e.g., the kinds of activity information we convert into provenance data to represent causal relationships) can’t easily be replicated at the storage engine. There’s another subtle point here to consider: if I’m willing to pay the cost of producing metadata it seems intuitively obvious that it is probably worth storing the results of that metadata. For example, I find that I often end up doing repetitive searches – this past week, working on a project completely unrelated to my research, I found myself repeatedly doing searches over the same data set using the same or similar terms. For example, if I want to find files that have both the term “customer” and “order” in them and then repeat that with “customer” and “device_id” I have to do complex compound searches that can take 5-10 minutes to produce. I suspect this can be made more efficient (though I don’t know if this is really a useful test case – I just keep wondering how I could support this sort of functionality, which would enable us to figure out if it is useful.)
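
One way to picture "storing the results" for the repeated compound searches described above is a small inverted index: pay the cost of scanning each file once, then answer each "term A and term B" query with a set intersection. This is only a minimal sketch in Python. The file names and contents are hypothetical, and the tokenizer is a deliberately crude assumption, not a claim about how any real search tool works.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase term to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        # Crude tokenizer (an assumption): runs of letters, digits, underscores.
        for term in set(re.findall(r"[a-z0-9_]+", text.lower())):
            index[term].add(name)
    return index


def search_all(index, *terms):
    """Return the names of documents that contain every one of the terms."""
    sets = [index.get(term.lower(), set()) for term in terms]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical data set standing in for the files being searched.
docs = {
    "orders.sql": "SELECT * FROM customer JOIN order ON ...",
    "devices.py": "customer = lookup(device_id)",
    "notes.txt": "call the customer about the order and device_id",
}
index = build_index(docs)
print(sorted(search_all(index, "customer", "order")))
print(sorted(search_all(index, "customer", "device_id")))
```

The index is built once and reused for both queries, which is the point: the second search costs a dictionary lookup and an intersection rather than another pass over the data.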

So, back to producing metadata. Another cost to consider is the cost to fetch the data. For example, if I want to compute the checksum of a file, it is probably most efficient to do so when it is in the memory of the original device creating it or possibly on the device where it is stored (e.g., a remote storage server.) Even if it is the same cost I need to keep in mind that I will be using devices that don’t compute the same checksum. That lack of service uniformity helps me better understand the actual cost: if the storage device does not support the generation of the metadata that I want then my cost rises dramatically because now I have to pull the data back from the storage server so I can compute the checksum I want to use. Thus, I think what drives this question is where we store that metadata, which is leading to my next rambling thought process in the next section.

In the case where the metadata is being provided externally, I probably don’t care where it is produced – that’s their problem. So, for the metaverse data storage challenge I really need to focus more on where I am storing the metadata rather than where it is generated (at least for now.)

Metadata Storage

One question I’ve been handwaving is the “where do you store the metadata?” I started thinking about this because the real answer is ugly. Some of that metadata will be stored on the underlying storage, e.g., a file system is going to store timestamps and length information in some form regardless of specific issues like time epochs. However, as I was mulling over some of the issues involved in object management needs for metaverse platforms (ugh, a tongue-twister with the “metaverse” buzzword) I realized that one of the challenges described to me (namely the cost associated with fetching data) is really important to me as well:

To be useful, this metadata needs to be present everywhere it is analyzed – it is impractical for us to be fetching data across the network if we want this to have decent performance. I can certainly handwave some of this away (“oh, we’ll just use eventually consistent replication of the metadata”) but I don’t expect that’s terribly realistic to add to a prototype system. What probably does make sense is to think that this will be stored on a system that is “close to” the sources that generate the metadata. It might be possible to construct a cloud-based metadata service, but that case has additional considerations that I’m mulling over (and plan on capturing in a future blog post – this one is already too long!) Thus, I suspect that this is a restricted implementation of the replication problem.
Metadata does not need to be close to the data. In fact, one of the interesting advantages of having the metadata close to where it is needed is that it helps overcome a major challenge in using distributed storage: the farther away the data storage is from the data consumer, the higher the cost of fetching that data. In turn, the benefit of having more metadata is that it helps improve the efficiency of fetching data, since fetching data that we don’t need is wasteful. In other words, a cost benefit associated with having more metadata is that we can work to minimize unnecessary data fetching. Indeed, this could be a solid metric for determining the efficiency of metadata and search algorithms that use the metadata: the “false fetch rate,” meaning the share of fetched data that turns out not to be needed. The benefits of this are definitely related to the cost of retrieving data. Imagine (for example) that you are looking through data that is expensive to retrieve, such as Azure Cold Blob Storage or Amazon S3 Glacier. The reason that people use these slow storage services is that they are extremely cost efficient for data that is unlikely to be needed. While this is an extreme example, it also makes it easier to understand why additional metadata is broadly beneficial, since any fetch of data from a remote system that is not useful is a complete waste of resources. Again, my inspiration here was the discussion with Royal about multiple different instantiations of the same object that appear in the metaverse. I will touch on this when I get into that metaverse conversation. For now, I note that these instantiations of a single digital object might be stored in different locations. The choice of a specific instance is typically bounded by several costs, including the fetch cost (latency + bandwidth) and any transformation costs (e.g., CPU cost.)
This becomes quite interesting in mobile networks where the network could impose surge pricing as well and there are capacity limitations combined with the hard requirements that these objects need to be available for use quickly (another aspect of cost.)
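
To make the “false fetch rate” idea concrete, here is a minimal sketch in Python. It assumes the rate is the share of fetched objects that turned out not to be useful, optionally weighted by a per-object retrieval cost. The function name, the weighting, and the sample values are all assumptions for illustration.

```python
def false_fetch_rate(fetched, useful, cost=None):
    """Share of fetch effort spent on objects that were not useful.

    fetched: identifiers of the objects that were retrieved
    useful:  identifiers of the objects that were actually needed
    cost:    optional mapping of identifier -> retrieval cost; when omitted,
             every fetch counts equally
    """
    fetched = set(fetched)
    useful = set(useful)

    def weight(key):
        return cost[key] if cost else 1.0

    total = sum(weight(key) for key in fetched)
    if total == 0:
        return 0.0
    wasted = sum(weight(key) for key in fetched - useful)
    return wasted / total


# Hypothetical example: four objects fetched, two of them needed.
print(false_fetch_rate(["a", "b", "c", "d"], ["a", "c"]))
# Same fetches, but the two useless objects came from an expensive tier.
print(false_fetch_rate(["a", "b", "c", "d"], ["a", "c"],
                       cost={"a": 1, "b": 9, "c": 1, "d": 9}))
```

Weighting by cost reflects the archival-storage example: a wasted fetch from a slow, expensive tier hurts far more than a wasted fetch from a local cache.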

My sense is there is probably more to say here, but I captured some key ideas and I will consider how to build on this in the future.

Metaverse Data Needs

That conversation with Royal was quite interesting. I’ve known him for more than a decade and some of what I learned from him about the specialized needs of the game industry led me to question things that I learned from decades of building storage systems. That background in game development has positioned him to point out that many of the challenges in metaverse construction have already been addressed in the game development area. One interesting aspect of this is in the world of “asset management.” An asset in a game is pretty much anything that the game uses to create the game world. Similarly, a metaverse must also combine assets to permit 3D scaling as it renders the world for each participant of that world. He explained to me by way of example, that one type of graphical object is often computed at different resolutions. While it is possible for our devices to scale these, the size of the objects and the computational cost of scaling is high. In addition, the cost of fetching these objects can be high as well; he was telling me that you might need 200 objects in order to render the current state of the world for an individual user. If their average size is 60MB it becomes easy to see how this is not terribly practical. In fact, what is usually required are a few of these very high-resolution graphical objects and lower resolution versions of the others. For example, objects that are “far away in the distance” need not have the same resolution. While he didn’t point it out, I know that I have seen games where sometimes objects have low resolution and are later repainted with higher resolution images. I am now wondering if I saw this exact type of behavior already being practiced.

Let’s combine this with the need to distribute these objects broadly and recognize that there is a high degree of locality involved. Metaverse participants interacting with each other in a 5G or 6G network are likely to be accessing many of the same objects. Thus, we are likely to see a high degree of correlation across edge nodes within the mobile network. Similarly, this moves us to a very distributed storage model, where data objects are not necessarily being retrieved from a central storage server but rather from edge storage servers or even peer clients. One benefit of using strong checksums is that they make it easy to verify replicated data in untrusted networks – something BitTorrent and IPFS both do with their own checksums. As long as the checksum comes from a trusted source, the data retrieved can be verified.
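
The verification step can be sketched in a few lines of Python. This assumes SHA-256 as the strong checksum and assumes the expected digest arrives over a trusted channel. It illustrates the general idea only, not how BitTorrent or IPFS actually encode or check their hashes.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, trusted_digest: str) -> bool:
    """Check data fetched from an untrusted peer against a trusted digest."""
    return hmac.compare_digest(sha256_hex(data), trusted_digest.lower())


# Hypothetical asset: the digest comes from a trusted metadata service,
# the bytes come from whichever edge node or peer answered first.
original = b"low-resolution tree texture"
trusted = sha256_hex(original)

print(verify(original, trusted))              # intact copy
print(verify(b"tampered texture", trusted))   # corrupted or malicious copy
```

The metadata service only has to distribute the small digest; the bulk data can come from anywhere.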

In this case the metadata would correspond to something very different than I’d been considering:

An identifier of the object itself
A list of one or more specific instances of that object, each with a set of properties
A list of where each of these instances might be stored (I’m choosing to use an optimistic list here because the reality is sources will appear and disappear.)

Independent of this would be information about the constraints involved: the deadline required for receiving the data to be timely, the cost for retrieving the various versions, etc. With this information both the edge and end devices can make decisions: which versions to fetch and from where as well as placement, caching, and pre-fetching decisions. All of these are challenging and none of them are new so I’m not going to dive in further. What is new is the idea that we could embed the necessary metadata within a more general-purpose metadata management system overlaying disparate storage systems. This is a fairly specialized need, but it is also one Royal observed needs to be solved.
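
The three-part metadata described above, plus a deadline constraint, can be sketched as a small data structure with a selection function. This is a Python sketch under stated assumptions: every class name, field, and number is hypothetical, and the selection rule (cheapest source that meets the deadline, higher resolution breaking ties) is one simple policy among many.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Source:
    location: str       # edge node, peer, or origin; sources may disappear
    latency_ms: float   # estimated delivery time from this location
    cost: float         # abstract fetch cost (bandwidth, pricing, ...)


@dataclass
class Instance:
    resolution: int     # one example property of a specific instance
    size_mb: float
    sources: List[Source] = field(default_factory=list)


@dataclass
class ObjectMetadata:
    object_id: str
    instances: List[Instance] = field(default_factory=list)


def choose(meta: ObjectMetadata, deadline_ms: float,
           min_resolution: int = 0) -> Optional[Tuple[Instance, Source]]:
    """Pick the cheapest (instance, source) pair that can meet the deadline."""
    candidates = [
        (inst, src)
        for inst in meta.instances if inst.resolution >= min_resolution
        for src in inst.sources if src.latency_ms <= deadline_ms
    ]
    if not candidates:
        return None
    # Lowest cost wins; among equal costs prefer the higher resolution.
    return min(candidates, key=lambda pair: (pair[1].cost, -pair[0].resolution))


tree = ObjectMetadata("tree-01", [
    Instance(4096, 60.0, [Source("origin", 400.0, 5.0)]),
    Instance(512, 2.0, [Source("edge-7", 20.0, 1.0), Source("peer-3", 15.0, 0.5)]),
])

picked = choose(tree, deadline_ms=50)
print(picked[0].resolution, picked[1].location)
```

With a 50 ms deadline only the low-resolution instance qualifies, and the peer copy is cheaper than the edge copy. Relaxing the deadline or demanding a minimum resolution changes the answer, which is the kind of decision the metadata is meant to support.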

Oh, one final number that sticks out in my mind: Royal told me that a single asset could consist of around 200 different versions, including different resolutions and different formats required by the various devices. I was quite surprised at this, but it also helped me understand the magnitude of the problem.

While I have considered versioning as a desirable feature, I had never considered parallel versions quite like this. Having these kinds of conversations helps me better understand new perspectives and broaden my own thinking.

I left that conversation knowing that I had just barely started to wrap my head around the specific needs of this area. I capture those thoughts here in hopes I can foster further thought about them, including more conversations with others.

Storage Vendors

A couple of weeks ago we had a guest speaker from a storage vendor talking about his thoughts on the future of his company and their products. There were specific aspects of that talk that really stood out to me:

Much of what he talked about was inward focused. In other words, it was about the need for better semantic understanding. I realized that the idea on which I’m working – using extrinsic information to find relationships between files – was not even on his horizon, yet could be very beneficial to him, or to any large storage vendor.
He acknowledged many of the challenges that are arising as the sheer volume of storage continues to grow. Indeed, each time I think about this I remember that for all the emphasis on fast access storage (e.g., NVRAM and SSDs) the slower storage tiers continue to expand as well: hard disks now play more of an archival role. Microsoft Research’s Holographic Storage Device, for example, offers a potential higher capacity device for data center use. Libraries of recordable optical storage or even high capacity linear tape also exist and are used to keep vast amounts of data.
During that time I’d been also thinking about how to protect sensitive information from being exploited or mined. In other words, as a user of these services, how can I store data and/or metadata with them that doesn’t divulge information. After the talk I realized that the approach I’d been considering (basically providing labels the meaning of which requires a separate decoder ring) could be quite useful to a storage vendor: such sanitized information could still be used to better understand the relationships – ML driven pattern recognition (e.g., clustering) without requiring that the storage vendor understand what those patterns mean. Even providing that information to the end user could minimize the amount of extra data being fetched which in turn would improve the use of their own storage products. Again, I don’t think this is fully fleshed out, but it does seem to provide some argument for storage vendors to consider supporting enhanced metadata services.

I admit, I like the idea of enabling storage vendors to provide optimization services that do not require they understand the innards of the data itself. This would allow customers with highly sensitive data to store it in a public cloud service (for example) in fully encrypted form and still provide indexing information for it. The “secret decoder rings” can be maintained by the data owner yet the storage vendor can provide useful value-added services at enterprise scale. Why? Because, as I noted earlier, the right place to store metadata is as close as possible to the place where it is consumed. At enterprise scale, that would logically be someplace that is accessible throughout the enterprise.
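
One possible reading of the “secret decoder ring” is a keyed hash: the data owner turns each meaningful label into an opaque token with a secret key, and the storage vendor only ever sees the tokens. The Python sketch below uses HMAC-SHA256 for this. It is an assumption about how such a scheme could look rather than a description of an existing design, and all names and labels are hypothetical. Note that equal labels produce equal tokens, which is what allows grouping but also reveals how often each label occurs.

```python
import hashlib
import hmac
from collections import defaultdict


def opaque_label(secret_key: bytes, label: str) -> str:
    """Owner side: derive an opaque token from a meaningful label."""
    return hmac.new(secret_key, label.encode("utf-8"), hashlib.sha256).hexdigest()


def group_by_label(records):
    """Vendor side: group object IDs by token without knowing any meaning."""
    groups = defaultdict(set)
    for object_id, tokens in records:
        for token in tokens:
            groups[token].add(object_id)
    return groups


key = b"owner-held secret"   # never leaves the data owner
decoder_ring = {opaque_label(key, name): name
                for name in ("project-x", "tax-2022")}

# What the vendor receives: object IDs with opaque tokens only.
records = [
    ("blob-1", [opaque_label(key, "project-x")]),
    ("blob-2", [opaque_label(key, "project-x"), opaque_label(key, "tax-2022")]),
    ("blob-3", [opaque_label(key, "tax-2022")]),
]
groups = group_by_label(records)

# Only the owner can translate a group back into something meaningful.
for token, members in groups.items():
    print(decoder_ring[token], sorted(members))
```

The vendor can still cluster, index, and prefetch by token, while the mapping back to meaning stays with the owner.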

At this point I realized that our propensity to store the metadata with the data really does not make sense when we think of multiple storage silos – it’s the wrong location. Separating the metadata service, placing it close to where the metadata is being absorbed, and using strategically located agents for generating the various types of metadata, including activity context and semantic information, all make sense because the owner of that data is really “closest” to where that metadata is used. A “file system” that maintains no metadata is really little more than a key-value store, as the metadata server can be maintained separately. Of course, that potentially creates other issues (e.g., space reuse.) I don’t think I need to solve such issues because in the end that consideration is not important at this point in my own research.

So Much Metadata, So Little Agreement

October 6, 2022

Earlier this year I was focused on collecting activity data. I made reasonable progress here, finding ways to capture local file system activity as well as activity against two different cloud service providers. I keep looking at other examples, as well, but rather than try for too much breadth, I decided to focus on the three sources I was able to get working and then push deeper into each source.

First, there is little agreement as to what metadata should be present. There are a few common fields, but then there are numerous fields that only show up in some subset of data sources – and this is just for file systems where presumably they’re storing the same basic stuff. What’s most common:

A name
A timestamp for when it was created
A timestamp for when it was modified
A timestamp for when it was accessed
Some attributes (read-only, file, directory, special/device)
A size

Of course, even here there isn’t necessarily agreement. Some file systems have limited size names or limited character sets they support. Timestamps are stored relative to some well-known value. UNIX traditionally chose January 1, 1970 00:00:00 UTC and that number comes up quite often. IBM DOS (and thus MS-DOS) for x86 PCs used January 1, 1980. Windows NT chose January 1, 1601. I do understand why this happens: we store timestamps in finite size fields. When the timestamp “rolls over” we have to deal with it. That was the basis of the Y2K crisis. Of course, I’ve been pretty anal about this. In the late 1970s when I was writing software, I made sure that my code would work at least to 2100 (2100 is not a leap year while 2000 was a leap year because of the rules for leap years.) I doubt that code survived to Y2K.

But file system designers worry about these sorts of things because we know that file systems have surprisingly long lifetimes. When the Windows NT designers first settled on a 64 bit timestamp in the late 1980s they gleefully used high precision timestamps: 100 nanoseconds. But 64 bits is a lot of space and it allows storing dates for many millennia to come.
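
The epoch differences are easy to see in code. The following Python sketch converts between the Windows NT representation (100 nanosecond ticks since January 1, 1601) and UNIX seconds since January 1, 1970, and estimates how many years a 64 bit tick count can cover. The function names are illustrative, and the range estimate assumes a signed 64 bit value.

```python
from datetime import datetime, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
NT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds

# Seconds between the two epochs, computed rather than hard-coded.
EPOCH_DELTA_SECONDS = int((UNIX_EPOCH - NT_EPOCH).total_seconds())


def nt_ticks_to_unix(ticks: int) -> float:
    """Convert 100 ns ticks since 1601 into seconds since 1970."""
    return ticks / TICKS_PER_SECOND - EPOCH_DELTA_SECONDS


def unix_to_nt_ticks(seconds: int) -> int:
    """Convert whole seconds since 1970 into 100 ns ticks since 1601."""
    return (seconds + EPOCH_DELTA_SECONDS) * TICKS_PER_SECOND


def years_covered(bits: int = 63) -> float:
    """Approximate span in years of a tick counter with the given bits."""
    seconds = 2 ** bits / TICKS_PER_SECOND
    return seconds / (365.2425 * 86400)


print(EPOCH_DELTA_SECONDS)
print(unix_to_nt_ticks(0))
print(round(years_covered()))
```

Moving a file between systems means running a conversion like this one, and any precision the destination cannot represent is simply dropped.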

Today, we store data all over the place. When we move it, those timestamps will be adjusted to fit whatever the recipient storage repository wants to use. In addition, any other “extra” metadata will silently disappear.

How much extra metadata exists? I’ve spent the past few weeks wading through Windows and even though I knew there were many different types of metadata that could be stored, I chuckled at the fact there is no simple way to retrieve all that metadata:

There are APIs for getting timestamps and sizes
There are APIs for getting file attributes
There are APIs for getting file names
There are APIs for getting a list of “alternate data streams” that are associated with a given file.
There are APIs for retrieving the file identifier of the file – that’s a magic number that can be combined with data from other APIs to associate activity information (and that is the reason I went spelunking for this information in the first place.)
There are APIs for retrieving “extended attributes” of files (EAs). EAs are older than Windows NT (1993) but have been difficult to use from the Win32 API that most applications use.
There are now APIs for retrieving Linux-related attribute information (see FILE_STAT_LX_INFORMATION) on top of the existing attributes.
There are 128 bit GUIDs and 128 bit File IDs

I’m sure I didn’t hit them all, but the point is that these various metadata types are not supported by all file systems. On Windows at least, when you try to copy a file from NTFS to FAT32 (or exFAT) it will warn you about potential data loss if certain attribute data is present (specifically alternate data streams.) The reason I think they first added this (it was added a long time ago) was because in the early days of downloading files from the internet it became useful to tag them as being potentially suspect. This is done by adding an alternate data stream to the file (named Zone.Identifier) that records information about the remote location from which the file was downloaded.

Thus, this metadata isn’t added just because, it is added because it enables potentially useful functionality.

Here’s something I’ve never seen anyone do thus far – that doesn’t mean nobody does it, just that I haven’t seen it: nobody indexes based upon these attributes. The named stream Zone.Identifier could be used to find all the files that you’ve downloaded from the internet, regardless of where they are on your computer. I laugh at this because I know a number of times I’ve downloaded content and then had no idea where it was downloaded to. With an index of downloaded content, I could just look at the last five things I downloaded – problem solved.
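
A minimal Python sketch of the first step toward such an index follows. It relies on two things that hold on Windows with NTFS: the stream can be opened by appending ":Zone.Identifier" to the file path, and its content is a small INI-style text with a [ZoneTransfer] section in which ZoneId=3 denotes the Internet zone. The helper names and the sample URL are hypothetical, and on other platforms the read simply finds nothing.

```python
import configparser


def parse_zone_identifier(text: str) -> dict:
    """Parse the INI-style content of a Zone.Identifier stream."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case such as ZoneId and HostUrl
    try:
        parser.read_string(text)
    except configparser.Error:
        return {}
    if not parser.has_section("ZoneTransfer"):
        return {}
    return dict(parser["ZoneTransfer"])


def read_zone_identifier(path: str) -> dict:
    """Read the stream for a file; returns {} if it is absent (NTFS only)."""
    try:
        with open(path + ":Zone.Identifier", "r",
                  encoding="utf-8", errors="replace") as stream:
            return parse_zone_identifier(stream.read())
    except OSError:
        return {}


def is_from_internet(info: dict) -> bool:
    """ZoneId 3 is the Internet zone."""
    return info.get("ZoneId") == "3"


# Hypothetical stream content for a downloaded file.
sample = "[ZoneTransfer]\nZoneId=3\nHostUrl=https://example.com/file.zip\n"
info = parse_zone_identifier(sample)
print(info, is_from_internet(info))
```

An indexer could walk a volume, call read_zone_identifier on each file, and keep the paths where is_from_internet is true, sorted by creation time, to answer "what were the last five things I downloaded?"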

While I have spent a fair bit of time talking about Windows, I have seen similar issues on Linux. It is only in the past couple of years that the extended stat structure (statx) has become mainstream supported. Several file systems that run on Linux support extended attributes. The idea behind streams isn’t particularly novel (we implemented something we called property lists in Episode at the same time the NTFS team was deciding to add full-blown named alternate data streams to their file system. Ours were just limited in size – an approach that I think the ReFS team took because they found nobody was really using large alternate data streams.)

Bottom line: one of the interesting challenges in using activity data is that as similar as file systems seem on the surface they often implement different/special semantics using metadata. How to make sense of this is a significant problem and one that I do not expect to fully address. Despite this, I can see there is tremendous benefit to using even some of this metadata to build relationships between different storage locations. That, however, is a topic for another day.

Challenges of Capturing System Activity

February 16, 2022

A key aspect of the work I am doing for Indaleko is to “capture system activity” so that it can be used to form “activity contexts” that can then be used to inform the process of finding relevant information. As part of that, I have been working through the work of Daniela Vianna. While I have high-level descriptions of the information she collected and used, I need to reconstruct this. She collects data from a variety of sources. The most common source of information comes from web APIs to services such as Google and Facebook. In addition, she also uses file system activity information.

Since my background is file systems, I decided to start on the file system activity front first. Given that I’ve been working with Windows for three decades now, I decided to leverage my understanding of Windows file systems to collect such information. One nice feature of the NTFS file system on Windows is its support for a form of activity log known as the “USN Journal.” Of course, one of my handicaps is that I am used to using the native operating system API, not the libraries that are implemented on top of it. This is because when building file systems on Windows I have always been interested in testing the full kernel file systems interface. While there are a few specific features that cannot be exercised with just applications, there are still a number of interfaces that cannot be tested using the typical Win32 API that can be tested using the native API. In recent years the number of features that have been hidden from the Win32 API has continued to decrease, which has diminished the need to use the native API. I just haven’t had any strong need to learn the Win32 API – why start now?

I decided the model I want to use is a service that pulls data from the USN journal and converts it into a format suitable for storing in a MongoDB database. I decided to go with Mongo because that is what Vianna used for her work. The choice at this point is somewhat arbitrary but MongoDB makes sense because it tends to work well with semi-structured data, which is what I will be handling.

Similarly, I decided that I’d write my service for pulling USN Journal data from the NTFS file system(s) in C# since I have written some C# in the past, it makes doing some of the higher level tasks I have much easier, and is well-supported on Windows. I have made my repository public though I may restructure and/or rename it at some point (currently I call it CSharpToNativeTest because I was trying to invoke the native API as unmanaged code from C#). The most common approach to this is to utilize a specific mechanism (the “PInvoke” mechanism) but after a bit of trial-and-error I decided I wanted something that would be easier for me to debug, so instead of pulling the native routine directly from ntdll.dll I load it from my own DLL (written in C) and that then invokes the real native call. This allows me to see how data is being marshaled and delivered to the C language wrapper. I also tried to make the native API “more C# friendly.” I am sure it could be more efficient, but I wanted to support a model that I could extend and hopefully it will be easier to make it more efficient should that prove necessary.

One thing I did was to script the conversion of all the status values in ntstatus.h into a big C# enum type. The benefit of this is that when debugging I can automatically see the mnemonic name of the status code as well as its numeric value. I then decided to provide the layer needed to map between the various names used for volumes on Windows: device names, device IDs, and symbolic links (drive letters). While I have not yet added it, I wrote things so that it should be fairly straight-forward to add a background thread which wakes up when devices arrive or disappear. As I have noted before “naming is hard.” This is just one more example of the flexibility and challenges with aliasing and naming.

Finally, I turned my attention to the USN journal. I found some packages for decoding USN journal entries; most were written to parse the data from the drive, while a few managed dynamic access. Since I want this to be a service that monitors the USN journal and keeps adding information into the database, I decided to write C# code to use the API for retrieving that information. At this point, what I have is the ability to scan all the volumes on the machine – even if they do not have drive letters – and query them to see if they support a USN journal. I do this properly – I query the file system attributes (using the NtQueryVolumeInformationFile native API) and check if the bit showing USN journal support is marked. I do not use the file system name, an approach I’ve always considered to be a hack, especially since I have been in the habit of writing file systems that support NTFS features, including named data streams, extended attributes, and object IDs. In fact, the ReFS file system on Windows also supports USN journals, so I’m not just being my usual pedantic developer self in this instance.
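
The capability check described above comes down to testing one bit in the file system attribute mask. The Python sketch below decodes a few of the flag values defined in the Windows headers. It is an illustration rather than the C# service itself: the mask is passed in as a plain integer (on Windows it would come from NtQueryVolumeInformationFile or the Win32 GetVolumeInformation call), and the two sample masks are made up for the example.

```python
# Flag values as defined in the Windows SDK headers.
FILE_SUPPORTS_OBJECT_IDS = 0x00010000
FILE_NAMED_STREAMS = 0x00040000
FILE_SUPPORTS_EXTENDED_ATTRIBUTES = 0x00800000
FILE_SUPPORTS_USN_JOURNAL = 0x02000000

FLAG_NAMES = {
    FILE_SUPPORTS_OBJECT_IDS: "FILE_SUPPORTS_OBJECT_IDS",
    FILE_NAMED_STREAMS: "FILE_NAMED_STREAMS",
    FILE_SUPPORTS_EXTENDED_ATTRIBUTES: "FILE_SUPPORTS_EXTENDED_ATTRIBUTES",
    FILE_SUPPORTS_USN_JOURNAL: "FILE_SUPPORTS_USN_JOURNAL",
}


def decode_flags(attributes: int) -> list:
    """List the names of the known capability bits set in the mask."""
    return [name for bit, name in FLAG_NAMES.items() if attributes & bit]


def supports_usn_journal(attributes: int) -> bool:
    """Decide by capability bit, not by file system name."""
    return bool(attributes & FILE_SUPPORTS_USN_JOURNAL)


# Hypothetical masks: one volume with a journal, one without.
journaled = (FILE_SUPPORTS_USN_JOURNAL | FILE_NAMED_STREAMS
             | FILE_SUPPORTS_OBJECT_IDS)
plain = 0x00000006

print(supports_usn_journal(journaled), decode_flags(journaled))
print(supports_usn_journal(plain), decode_flags(plain))
```

Checking the bit rather than comparing the file system name against "NTFS" is what lets the same code work for ReFS or for a third-party file system that implements the feature.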

At this point, I am able to identify volumes that support USN journals, open them and find out if USN is turned on (it is by default on the system volume, which is almost always the “C:” drive, though I enjoy watching things break when I configure a system to use some other drive letter.) I then extract the information and convert it to in-memory records. At the moment I just have it wait a few seconds and pull the newest records, but my plan is to evolve this into a service that I can run and it can keep pulling data and pushing it into my MongoDB instance.
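
The step from an in-memory record to something a document database can hold might look like the Python sketch below. This is not the C# service described here. The UsnRecord class and the document layout are hypothetical, while the reason bit values match the USN_REASON_* constants in the Windows headers. The result is a plain dict, which is the shape a MongoDB driver such as pymongo accepts for an insert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# A subset of the USN_REASON_* bits from the Windows headers.
USN_REASONS = {
    0x00000001: "DATA_OVERWRITE",
    0x00000002: "DATA_EXTEND",
    0x00000004: "DATA_TRUNCATION",
    0x00000100: "FILE_CREATE",
    0x00000200: "FILE_DELETE",
    0x00001000: "RENAME_OLD_NAME",
    0x00002000: "RENAME_NEW_NAME",
    0x80000000: "CLOSE",
}

NT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


@dataclass
class UsnRecord:
    """Hypothetical in-memory form of one journal entry."""
    usn: int
    file_reference: int
    parent_reference: int
    timestamp: int      # 100 ns ticks since January 1, 1601
    reason: int
    file_name: str


def to_document(record: UsnRecord, volume: str) -> dict:
    """Convert a record into a semi-structured document."""
    when = NT_EPOCH + timedelta(microseconds=record.timestamp // 10)
    return {
        "volume": volume,
        "usn": record.usn,
        # Hex strings avoid overflowing the database's signed 64 bit integers.
        "file_id": format(record.file_reference, "016x"),
        "parent_id": format(record.parent_reference, "016x"),
        "timestamp": when,
        "reasons": [name for bit, name in USN_REASONS.items()
                    if record.reason & bit],
        "name": record.file_name,
    }


rec = UsnRecord(usn=1024, file_reference=0x0001000000000005,
                parent_reference=5, timestamp=116444736000000000,
                reason=0x80000102, file_name="report.docx")
doc = to_document(rec, "C:")
print(doc["reasons"], doc["timestamp"].isoformat())
```

Decoding the reason mask into names up front makes the stored activity easy to query later, for example "all files created and closed within the last hour."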

At this point, I realized I do not really know that much about MongoDB so I have decided to start learning a bit more about it. Of course, I don’t want to be a MongoDB expert, so I also have been looking more carefully at Daniela Vianna’s work, trying to figure out what her data might have looked like and think about how I’m going to merge what she did into what I am doing. This is actually exciting because it means I’m starting to think of what we can do with this additional information.

This afternoon I had a great conversation with one of my PhD supervisors about this and she was making a couple of suggestions about ways to consume this data. That she was suggesting things I’d also added to my list was encouraging. What are we thinking:

We can consider using “learned index structures” as we begin to build up data sets.
We can use techniques such as Google BERT to facilitate dealing with the API data that Vianna’s work used. I pointed out that the challenges of APIs that Vianna pointed out are similar to languages: they have meaning and those meanings can be expressed in multiple ways.
The need to find things efficiently is growing rapidly. She was explaining some work that indicates our rate of data growth is outstripping our silicon capabilities. In other words, there is a point at which “brute force search” becomes impractical. I liked this because it suggests what we are seeing with our own personal data is a leading indicator of the larger problem. This idea of storing the metadata independent of the data is a natural one in a world where the raw information is too abundant for us to just go looking for an item of interest.

So, my work continues, mostly mundane and boring, but there are some useful observations even at this early stage. Now to figure out what I want the data in my database to look like and start storing information there. Then I can go figure out what I did right, what I did wrong, and how to improve things.

Aside: one interesting aspect of the BERT work was their discussion of “transducers.” This reminded me of Gifford’s Semantic File System work, where he used transducers to suck out semantic information from existing files.

Brainiattic: Remember more with your own Metaverse enhanced brain attic

January 13, 2022

I recently described the idea of “activity context” and suggested that providing this new type of information about data (metadata) to applications would improve important tasks such as finding. My examining committee challenged me to think about what I would do if my proposed service – Indaleko – already existed today.

This is the second idea that I decided to propose on my blog. My goal is to find how activity context can be used to provide enhanced functionality. My first idea was fairly mundane: how can we improve the “file browsing” experience in a fashion that focuses on content and similarity by combining prior work with the additional insight provided by activity context.

This second idea was initially motivated by my mental image of a personal library, but I note that there’s a more general model here: displaying digital objects as something familiar. When I recently described this library instantiation of my brain attic, the person I was speaking with said “but I don’t think of digital objects as being big enough to be books.” To address this point: I agree, another person’s mental model for how they want to represent digital data in a virtual world need not match my model. That’s one of the benefits of virtual worlds – we can represent things in forms that are not constrained by what things must be in the real world.

In my recent post about file browsers I discussed Focus, an alternative “table top” browser for making data accessible. One reason I liked Focus is that the authors observed how hierarchical organization does not work in this interface. They also show how the interface is useful and thus it is a concrete argument as to at least one limitation of the hierarchical file/folder browser model. Another important aspect of the Focus work was their observation that a benefit of the table top interface is it permits different users to organize information in their own way. A benefit of a virtual “library” is that the same data can be presented to different users in ways that are comfortable to them.

Of course, the “Metaverse” is still an emerging set of ideas. In a recent article about Second Life, Philip Rosedale points out that existing advertising-driven models don’t work well. This raises the question – what does work well?

My idea is that by having a richer set of environmental information available, it will be easier to construct virtual models that we can use to find information. Vannevar Bush had Memex, his extended memory tool. This idea turns out to be surprisingly ancient in origin, from a time before printing when most information was remembered. I was discussing this with a fellow researcher and he suggested this is like Sherlock Holmes’ Mind Palace. This led me to the model of a “brain attic” and I realized that this is similar to my model of a “personal virtual library.”

The Sherlock Holmes article has a brilliant quotation from Maria Konnikova: “The key insight from the brain attic is that you’re only going to be able to remember something, and you can only really say you know it, if you can access it when you need it.”

This resonates with my goal of improving finding, because improving finding improves access when you need it.

Thus, I decided to call this mental model “Brainiattic.” It is certainly more general than my original mental model of a “personal virtual library,” yet I am also permitted to have my mental model of my pertinent digital objects being projected as books. I could then ask my personal digital librarian to show me works related to specific musical bands, or particular weather. As our virtual worlds become more capable – more like the holodeck of Star Trek – I can envision having control of the ambient room temperature and even the production of familiar smells. While our smart thermostats are now capturing the ambient room temperature and humidity level and we can query online sources for external temperatures, we don’t actively use that information to inform our finding activities, despite the reality that human brains do recall such things: “it was cold out,” “I was listening to Beethoven,” or “I was sick that day.”

Thus, additional contextual information can be used at least to improve finding by enabling your “brain attic.” I suspect that, once activity context is available, we will find additional ways to use it in constructing some of our personal metaverse environments.

Using Focus, Relationship, Breadcrumbs, and Trails for Success in Finding

January 12, 2022

As I mentioned in my last post, I am considering how to add activity context as a system service that can be useful in improving finding. Last month (December 2021) my examination committee asked me to consider a useful question: “If this service already existed what would you build using it?”

The challenge in answering this question was not finding examples, but rather finding examples that fit into the “this is a systems problem” box that I had been thinking about while framing my research proposal. It has now been a month and I realized at some point that I do not need to constrain myself to systems. From that, I was able to pull a number of examples that I had considered while writing my thesis proposal.

The first of these is likely what I would consider the closest to being “systems related.” This hearkens back to the original motivation for my research direction: I was taking Dr. David Joyner’s “Human-Computer Interaction” course at Georgia Tech and at one point he used the “file/folder” metaphor as an example of HCI. I had been wrestling with the problem of scope and finding and this simple presentation made it clear why we were not escaping the file/folder metaphor – it has been “good enough” for decades.

More recently, I have been working on figuring out better ways to encourage finding, and that is the original motivation for my thesis proposal. The key idea of “activity context” has potentially broader usage beyond building better search tools.

In my research I have learned that humans do not like to search unless they have no other option. Instead, they prefer to navigate. The research literature says that this is because searching creates more cognitive load for the human user than navigation does. I think of this as meaning that people prefer to be told where to go rather than being given a list of possible options.

Several years ago (pre-pandemic) Ashish Nair came and worked with us for nine weeks one summer. I worked with him to look at building tools to take existing file data across multiple distinct storage domains and present them based upon commonality. By clustering files according to both their meta-data and simply extracted semantic context, he was able to modify an existing graph data visualizer to permit browsing files based on those relationships, regardless of where they were actually stored. While simple, this demonstration has stuck with me.

Ashish Nair (Systopia Intern) worked with us to build an interesting file browser using a graph data visualizer.

Thus, pushed to think of ways in which I would use Indaleko, my proposed activity context system, it occurred to me that using activity context to cluster related objects would be a natural way to exploit this information. This is also something easy to achieve. Unlike some of my other ideas, this is a tool that can demonstrate an associative model because “walking a graph” is an easy to understand way to walk related information.

There is a small body of research that has looked at similar interfaces. One that stuck in my mind was called Focus. While the authors were thinking of tabletop interfaces, the basic paradigm they describe, where one starts with a “primary file” (the focus) and then shows similar files (driven by content and meta-data) along the edges, is remarkably like Ashish’s demo.

The exciting thing about having activity context is that it provides interesting new ways of associating files together: independent of location and clustered together by commonality. Both the demo and Focus use existing file meta-data and content similarity, which is useful. With activity context added as well, there is further information that can be used both to refine similar associations and to cluster along a greater number of axes.
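As a rough illustration of how activity context could refine a Focus-style view, here is a small sketch. Everything in it is an assumption made for the example: the file names, the feature tags, the choice of Jaccard similarity, and the functions `features` and `neighbors` are hypothetical, and neither Focus nor Ashish’s demo necessarily worked this way.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical objects. Storage location deliberately plays no role here.
OBJECTS = {
    "report.docx": {"meta": {"docx", "2022"},
                    "activity": {"music:beethoven", "weather:rain"}},
    "slides.pptx": {"meta": {"pptx", "2022"},
                    "activity": {"music:beethoven", "weather:rain"}},
    "budget.xlsx": {"meta": {"xlsx", "2022"},
                    "activity": {"weather:sun"}},
}


def features(obj, use_activity):
    """Combine existing meta-data with (optionally) activity context."""
    return obj["meta"] | (obj["activity"] if use_activity else set())


def neighbors(focus, objects, use_activity=True):
    """Rank the other objects by similarity to the focus object."""
    focus_feats = features(objects[focus], use_activity)
    scored = [
        (jaccard(focus_feats, features(obj, use_activity)), name)
        for name, obj in objects.items()
        if name != focus
    ]
    # Highest similarity first; ties are broken alphabetically by name.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


print(neighbors("report.docx", OBJECTS, use_activity=False))
print(neighbors("report.docx", OBJECTS, use_activity=True))
```

With meta-data alone the two candidates tie, because each shares only the year with the focus file. Once the activity tags are included, the file that was used under the same conditions moves to the front, which is the kind of extra axis I have in mind.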

Thus, I can show off the benefits of Indaleko’s activity context support by using a Focus-style file browser.

Better Finding: Combine Semantic and Associative Context with Indaleko

January 11, 2022

Last month I presented my thesis proposal to my PhD committee. My proposal doesn’t mean that I am done; rather, it means that I have more clearly identified what I intend to make the focus of my final research.

It has certainly taken longer to get to this point than I had anticipated. Part of the challenge is that there is quite a lot of work that has been done previously around search and semantic context. Very recent work by Daniela Vianna relates to the use of “personal digital traces” to augment search. It was Dr. Vianna’s work that provided a solid theoretical basis for my own proposed work.

Our computer systems collect quite an array of information, not only about us but also about the environment in which we work.

In 1945 Vannevar Bush described the challenges to humans of finding things in a codified system of records. His observations continue to be insightful more than 75 years later:

Our ineptitude in getting at the record is largely caused by the artificiality of systems of indexing. When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path.

The human mind does not work that way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate web of trails carried by the cells of the brain. It has other characteristics, of course; trails that are not frequently followed are prone to fade, items are not fully permanent, memory is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature.

I find myself returning to Bush’s observations. Those observations have led me to ask whether it is possible for us to build systems that get us closer to this ideal.

My thesis is that collecting, storing, and disseminating information about the environment in which digital objects are being used provides us with new context that enables better finding.

So, my proposal is about how to collect, store, and disseminate this type of external contextual information. I envision combining this with existing data sources and indexing mechanisms to allow capturing the activity context in which digital objects are used by humans. A systems-level service that can do this will then enable a broad range of applications to exploit this information to reconstruct context that is helpful to human users. Over my next several blog posts I will describe some ideas about what I envision being possible with this new service.

The title of my proposal is: Indaleko: Using System Activity Context to Improve Finding. One of the key ideas from this is that we can collect information the computer might not find particularly relevant but the human user will. This could be something as simple as the ambient noise in the user’s background (“what music are you listening to?” or “Is your dog barking in the background?”) or environmental events (“it is raining”) or even personal events (“my heart rate was elevated” or “I just bought a new yoga mat”). Humans associate things together – not in the same way, nor with the same specific elements – using a variety of contextual mechanisms. My objective is to enable capturing data that we can then use to replicate this “associative thinking” that helps humans.
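To give a feel for what one captured piece of activity context might look like, here is a minimal sketch. It is not the Indaleko schema, which does not exist yet; the `ActivityContext` class, its field names, the signal keys, and the `find_by_signal` helper are all hypothetical and chosen only to mirror the examples above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActivityContext:
    """One observation of the environment in which an object was used."""
    object_id: str                 # hypothetical identifier for the digital object
    timestamp: datetime            # when the object was being used
    signals: dict = field(default_factory=dict)  # free-form environmental signals


def find_by_signal(records, key, value):
    """Return the ids of objects used while a given signal had a given value."""
    return [r.object_id for r in records if r.signals.get(key) == value]


RECORDS = [
    ActivityContext("proposal.pdf",
                    datetime(2021, 12, 1, 9, 0, tzinfo=timezone.utc),
                    {"music": "Beethoven", "weather": "raining"}),
    ActivityContext("receipt.png",
                    datetime(2021, 12, 2, 18, 30, tzinfo=timezone.utc),
                    {"purchase": "yoga mat"}),
    ActivityContext("notes.txt",
                    datetime(2021, 12, 3, 14, 0, tzinfo=timezone.utc),
                    {"weather": "raining", "heart_rate": "elevated"}),
]

print(find_by_signal(RECORDS, "weather", "raining"))
```

The point of keeping the signals free-form in this sketch is that different people associate things using different cues, so the set of signals worth recording cannot be fixed in advance.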

Ultimately, such a system will help human users find connections between objects. My focus is on storage because that is my background: in essence, I am interested in how the computer can extend human memory without losing the amazing flexibility of that memory to connect seemingly unrelated “things” together.

In my next several posts I will explore potential uses for Indaleko.
