The past month has included both a very interesting talk from someone at a major storage vendor and an in-depth discussion about my work and how it might be applicable to an issue that confronts the Metaverse community. I haven’t been at the keyboard much (at least not for my research), but I have been mulling this over as I have worked to explain these insights. Each iteration helps me refine my mental model by considering what else I have learned. Fortunately, this latest round doesn’t impact the work that I have done, but it has provided me with a model that I think could be useful in explaining this work to others.
I have previously talked about a type of metadata that I call activity context. Of course, there is quite a lot of metadata involved in managing storage, and I have been using a model in which the metadata I am collecting is captured not at the point of storage but rather at the point of analysis. In my case, the point of analysis is on (or near) my local device ecosystem. As I learned more about the needs of the emerging metaverse field (by speaking with my friend Royal O’Brien, the general manager of the Open 3D Foundation, which is part of the Linux Foundation) and combined some of what I learned there with insights from a recent talk given to my research group, I arrived at what I think are some useful observations:
- Storage vendors have no mechanism for capturing all the kinds of activity data that I envision using as the basis for activity context.
- Some high-performance data consumers need to maintain replicated data and use metadata about that data to make critical decisions.
- Metadata needs to be close to where it will be consumed.
- Metadata needs to be produced where the information is available and optimally where it is least expensive to do so.
That isn’t a long list, but it is one that requires a bit more unpacking. So I’m going to dive deeper, step by step. This probably isn’t the right order, but I will start here and worry about (re)-organizing it later.
Metadata Production
I had not really considered the depth of the question about where to produce the metadata until I started mulling over the myriad questions that have arisen recently. The cost of producing metadata can be a critical factor. Agents that extract semantic information about the data (e.g., its content) need to be close to the data. However, it is important to note that this is not the same as “the final location of the data” but rather “a current location of the data.” Yet even that isn’t quite right: metadata might be extracted from something other than the data itself, such as the running system or even an external source. For example, the activity data that I have been focused on collecting (see System Activity) largely arises on the system where the data itself is accessed. The metaverse model is one where the user has considerable insight (ah, but a bit more on this later), and since I’ve always envisioned an extensible metadata management system, it makes sense to permit a specialized application to contribute to the overall body of metadata.
Thus, the insight here is that it makes sense to generate metadata at the “lowest cost” point. For example, the activity data on my local machine can’t be collected by a cloud storage engine. It could be collected by an agent on the local machine and sent to the cloud storage engine, but that runs into a separate cost that I’ll touch on when I describe where we should be storing metadata. Extracting semantic content, by contrast, makes sense to do at the point of production and again at the point of storage. Activity data, which is related to “what else is happening,” can’t be extracted at the point of storage. Even causal data (e.g., the kinds of activity information we convert into provenance data to represent causal relationships) can’t easily be reproduced at the storage engine. There’s another subtle point here to consider: if I’m willing to pay the cost of producing metadata, it seems intuitively obvious that it is probably worth storing the result. For example, I find that I often end up doing repetitive searches – this past week, working on a project completely unrelated to my research, I found myself repeatedly searching the same data set using the same or similar terms. If I want to find files that contain both the term “customer” and “order” and then repeat that with “customer” and “device_id,” I have to run complex compound searches that can take 5-10 minutes to complete. I suspect this can be made more efficient (though I don’t know if this is really a useful test case – I just keep wondering how I could support this sort of functionality, which would enable us to figure out if it is useful.)
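To make that “if you paid to produce it, keep it” point concrete, here is a minimal sketch of what caching those repetitive compound searches might look like. Everything here is hypothetical – the cache location and the run_search callable are stand-ins I made up for this post – the point is simply that an expensive search result becomes stored metadata keyed by the query:

```python
import hashlib
import json
import os

CACHE_DIR = os.path.expanduser("~/.search_cache")  # hypothetical location

def _cache_key(terms, dataset):
    # Key on the dataset plus the sorted terms so "customer"+"order" and
    # "order"+"customer" hit the same cached result.
    blob = json.dumps({"dataset": dataset, "terms": sorted(terms)}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def compound_search(terms, dataset, run_search):
    """Return files containing all terms, reusing a cached result if present."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, _cache_key(terms, dataset))
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)          # the 5-10 minute search becomes a file read
    result = run_search(terms, dataset)  # the expensive compound search
    with open(path, "w") as f:
        json.dump(result, f)             # keep the derived metadata for next time
    return result
```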
So, back to producing metadata. Another cost to consider is the cost to fetch the data. For example, if I want to compute the checksum of a file, it is probably most efficient to do so when it is in the memory of the original device creating it, or possibly on the device where it is stored (e.g., a remote storage server.) Even if the cost were the same, I need to keep in mind that I will be using devices that don’t all compute the same checksum. That lack of service uniformity helps me better understand the actual cost: if the storage device does not support generating the metadata that I want, then my cost rises dramatically, because now I have to pull the data back from the storage server so I can compute the checksum myself. Thus, I think what drives this question is where we store that metadata, which leads to my next rambling thought process in the next section.
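Here is a rough sketch of how I think about that choice in code. The store object and its methods (remote_sha256, read_chunks) are stand-ins of my own invention, not a real storage API; the point is that the cheap path only exists if the storage service happens to support the checksum I want:

```python
import hashlib

def object_sha256(store, key):
    """Prefer a checksum computed where the data already lives; fall back to
    fetching the bytes and hashing locally (the expensive path)."""
    # Hypothetical capability: some stores can compute/report a SHA-256 themselves.
    remote = getattr(store, "remote_sha256", None)
    if callable(remote):
        digest = remote(key)
        if digest is not None:
            return digest                      # no data movement at all
    # Otherwise we must pull the data back across the network to hash it.
    h = hashlib.sha256()
    for chunk in store.read_chunks(key):       # hypothetical streaming read
        h.update(chunk)
    return h.hexdigest()
```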
In the case where the metadata is being provided externally, I probably don’t care where it is produced – that’s their problem. So, for the metaverse data storage challenge I really need to focus more on where I am storing the metadata rather than where it is generated (at least for now.)
Metadata Storage
One question I’ve been handwaving is “where do you store the metadata?” I started thinking about this because the real answer is ugly. Some of that metadata will be stored on the underlying storage; e.g., a file system is going to store timestamps and length information in some form, regardless of specific issues like time epochs. However, as I was mulling over some of the issues involved in object management needs for metaverse platforms (ugh, a tongue-twister with the “metaverse” buzzword), I realized that one of the challenges described to me (namely the cost associated with fetching data) is really important to me as well:
- To be useful, this metadata needs to be present everywhere it is analyzed – it is impractical for us to be fetching it across the network if we want this to have decent performance. I can certainly handwave some of this away (“oh, we’ll just use eventually consistent replication of the metadata”), but I don’t expect that’s terribly realistic to add to a prototype system. What probably does make sense is to think that this will be stored on a system that is “close to” the sources that generate the metadata. It might be possible to construct a cloud-based metadata service, but that case has additional considerations that I’m mulling over (and plan on capturing in a future blog post – this one is already too long!) Thus, I suspect that this is a restricted implementation of the replication problem.
- Metadata does not need to be close to the data. In fact, one of the interesting advantages of having the metadata close to where it is needed is that it helps overcome a major challenge in using distributed storage: the farther away the data storage is from the data consumer, the higher the cost of fetching that data. In turn, the benefit of having more metadata is that it helps improve the efficiency of fetching data, since fetching data that we don’t need is wasteful. In other words, a cost benefit of having more metadata is that we can work to minimize unnecessary data fetching. Indeed, this could be a solid metric for determining the efficiency of metadata and the search algorithms that use it: the “false fetch rate” (see the sketch after this list.) The benefits here are definitely related to the cost of retrieving data. Imagine (for example) that you are looking through data that is expensive to retrieve, such as the Azure Blob Storage archive tier or Amazon S3 Glacier. The reason people use these slow storage services is that they are extremely cost efficient: this is data that is unlikely to be needed. While this is an extreme example, it also makes it easier to understand why additional metadata is broadly beneficial, since any fetch of data from a remote system that is not useful is a complete waste of resources. Again, my inspiration here was the discussion with Royal about multiple different instantiations of the same object appearing in the metaverse. I will touch on this when I get into that metaverse conversation. For now, I note that these instantiations of a single digital object might be stored in different locations. The choice of a specific instance is typically bounded by the costs involved, including the fetch cost (latency + bandwidth) and any transformation costs (e.g., CPU cost.) This becomes quite interesting in mobile networks, where the network could impose surge pricing, capacity is limited, and there is a hard requirement that these objects be available for use quickly (another aspect of cost.)
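Since I floated the “false fetch rate” as a possible metric, here is roughly what I have in mind. It is nothing more than bookkeeping – the names are my own, not an established implementation:

```python
class FetchStats:
    """Track how often fetched objects turned out to be useful."""
    def __init__(self):
        self.fetched = 0
        self.useful = 0

    def record(self, was_useful):
        self.fetched += 1
        if was_useful:
            self.useful += 1

    @property
    def false_fetch_rate(self):
        # Fraction of fetches whose bytes were never actually needed.
        if self.fetched == 0:
            return 0.0
        return (self.fetched - self.useful) / self.fetched
```

A search algorithm with richer metadata should drive this number down, which is exactly the behavior I would want to measure.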
My sense is there is probably more to say here, but I captured some key ideas and I will consider how to build on this in the future.
Metaverse Data Needs
That conversation with Royal was quite interesting. I’ve known him for more than a decade, and some of what I learned from him about the specialized needs of the game industry led me to question things that I learned from decades of building storage systems. That background in game development has positioned him to point out that many of the challenges in metaverse construction have already been addressed in game development. One interesting aspect of this is in the world of “asset management.” An asset in a game is pretty much anything that the game uses to create the game world. Similarly, a metaverse must combine assets to permit 3D scaling as it renders the world for each participant. He explained to me, by way of example, that one type of graphical object is often computed at different resolutions. While it is possible for our devices to scale these, the size of the objects and the computational cost of scaling are high. In addition, the cost of fetching these objects can be high as well; he was telling me that you might need 200 objects in order to render the current state of the world for an individual user. If their average size is 60MB, it becomes easy to see why this is not terribly practical. In fact, what is usually required are a few of these very high-resolution graphical objects and lower resolution versions of the others. For example, objects that are “far away in the distance” need not have the same resolution. While he didn’t point it out, I know that I have seen games where objects sometimes appear at low resolution and are later repainted with higher resolution images. I now wonder if I was already seeing exactly this type of behavior in practice.
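To put numbers on it: 200 objects at roughly 60MB each is about 12GB, which is clearly impractical to fetch for a single rendering of the world. Here is a rough sketch of the kind of selection I imagine – nearest objects get the best resolution the remaining budget allows, distant ones fall back to cheaper tiers. The object layout, tier names, and sizes are all illustrative guesses of mine, not anything Royal described:

```python
def pick_resolutions(objects, budget_bytes):
    """Choose a resolution tier per object, nearest objects first.

    `objects` is a list of dicts like:
        {"id": "rock_01", "distance": 12.0,
         "tiers": [("high", 60_000_000), ("med", 8_000_000), ("low", 500_000)]}
    with tiers listed best-first (names and sizes are illustrative).
    """
    choices = {}
    remaining = budget_bytes
    for obj in sorted(objects, key=lambda o: o["distance"]):   # nearest first
        for name, size in obj["tiers"]:                        # best tier first
            if size <= remaining:
                choices[obj["id"]] = name
                remaining -= size
                break
        else:
            name, size = obj["tiers"][-1]
            choices[obj["id"]] = name      # take the cheapest even if over budget
            remaining -= size
    return choices
```

This is a crude stand-in for real level-of-detail policies, which weigh more than distance, but it captures the budget pressure that makes the full-resolution fetch impractical.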
Let’s combine this with the need to distribute these objects broadly and the realization that there is a high degree of locality involved. Metaverse participants interacting with each other in a 5G or 6G network are likely to be accessing many of the same objects. Thus, we are likely to see a high degree of correlation across edge nodes within the mobile network. This also pushes us toward a very distributed storage model, where data objects are not necessarily retrieved from a central storage server but rather from edge storage servers or even peer clients. One benefit of using strong checksums is that they make replication easy to verify in untrusted networks – much as BitTorrent and IPFS do with their own content hashes. As long as the checksum comes from a trusted source, the data retrieved can be verified.
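The verification step itself is simple. A minimal sketch, assuming the checksum was obtained from a trusted metadata source while the bytes came from whichever edge node or peer happened to be closest:

```python
import hashlib

def verify_object(data, trusted_sha256):
    """Accept bytes fetched from an untrusted edge node or peer only if they
    match the checksum obtained from a trusted metadata source."""
    if hashlib.sha256(data).hexdigest() != trusted_sha256:
        raise ValueError("object failed verification; try another source")
    return data
```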
In this case the metadata would correspond to something very different than I’d been considering:
- An identifier of the object itself
- A list of one or more specific instances of that object, each with a set of properties
- A list of where each of these instances might be stored (I’m choosing to use an optimistic list here because the reality is sources will appear and disappear.)
Independent of this would be information about the constraints involved: the deadline by which the data must be received to be timely, the cost of retrieving the various versions, etc. With this information, both the edge and end devices can make decisions: which versions to fetch and from where, as well as placement, caching, and pre-fetching decisions. All of these are challenging, and none of them are new, so I’m not going to dive in further. What is new is the idea that we could embed the necessary metadata within a more general-purpose metadata management system overlaying disparate storage systems. This is a fairly specialized need, but it is also one that Royal observed needs to be solved.
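Pulling the three items above together with the constraint information, here is a rough sketch of what such a metadata record and a naive selection policy might look like. All of the field names and the cost model are my own guesses, not anything from Royal or the Open 3D Foundation:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One concrete version of a digital object (resolution, format, ...)."""
    properties: dict                 # e.g. {"resolution": "2k", "format": "gltf"}
    size_bytes: int
    locations: list = field(default_factory=list)   # optimistic: may be stale

@dataclass
class ObjectRecord:
    object_id: str
    instances: list

def choose_instance(record, deadline_s, bandwidth_bps, cost_per_byte):
    """Pick the cheapest instance that can plausibly arrive before the deadline.
    A real system would also fold in latency, transformation (CPU) cost, and
    surge pricing on the mobile network."""
    best, best_cost = None, None
    for inst in record.instances:
        transfer_s = inst.size_bytes * 8 / bandwidth_bps
        if transfer_s > deadline_s or not inst.locations:
            continue                                   # cannot arrive in time
        cost = inst.size_bytes * cost_per_byte
        if best_cost is None or cost < best_cost:
            best, best_cost = inst, cost
    return best
```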
Oh, one final number that sticks out in my mind: Royal told me that a single asset could consist of around 200 different versions, including different resolutions and different formats required by the various devices. I was quite surprised at this, but it also helped me understand the magnitude of the problem.
While I have considered versioning as a desirable feature, I had never considered parallel versions quite like this. Having these kinds of conversations helps me better understand new perspectives and broaden my own thinking.
I left that conversation knowing that I had just barely started to wrap my head around the specific needs of this area. I capture those thoughts here in hopes I can foster further thought about them, including more conversations with others.
Storage Vendors
A couple of weeks ago we had a guest speaker from a storage vendor talking about his thoughts on the future of his company and its products. There were specific aspects of that talk that really stood out to me:
- Much of what he talked about was inward focused; in other words, it was about the need for better semantic understanding. I realized that the idea on which I’m working – using extrinsic information to find relationships between files – was not even on his horizon, yet it could be very beneficial to him, or to any large storage vendor.
- He acknowledged many of the challenges that are arising as the sheer volume of storage continues to grow. Indeed, each time I think about this, I remember that for all the emphasis on fast-access storage (e.g., NVRAM and SSDs), the slower storage tiers continue to expand as well: hard disks now play more of an archival role. Microsoft Research’s Holographic Storage Device, for example, offers a potentially higher-capacity device for data center use. Libraries of recordable optical storage or even high-capacity linear tape also exist and are used to keep vast amounts of data.
- During that time I had also been thinking about how to protect sensitive information from being exploited or mined. In other words, as a user of these services, how can I store data and/or metadata with them that doesn’t divulge information? After the talk I realized that the approach I’d been considering (basically providing labels whose meaning requires a separate decoder ring) could be quite useful to a storage vendor: such sanitized information could still be used to find relationships – via ML-driven pattern recognition (e.g., clustering) – without requiring that the storage vendor understand what those patterns mean (a small sketch follows this list.) Even providing that information to the end user could minimize the amount of extra data being fetched, which in turn would improve the use of their own storage products. Again, I don’t think this is fully fleshed out, but it does seem to provide some argument for storage vendors to consider supporting enhanced metadata services.
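To make the “decoder ring” idea slightly more concrete, here is a minimal sketch using a keyed hash: the labels are stable, so the vendor can cluster on them, but they are meaningless without the key, and the reverse mapping never leaves the data owner. This is just one way it could work, not a worked-out design:

```python
import hashlib
import hmac
from typing import Optional

def opaque_label(secret_key: bytes, value: str) -> str:
    """Produce a stable but meaningless label for a metadata value.
    The storage vendor can group/cluster on these labels; only the holder of
    secret_key (the data owner's 'decoder ring') can map them back."""
    return hmac.new(secret_key, value.encode(), hashlib.sha256).hexdigest()[:16]

class DecoderRing:
    """The data owner keeps the reverse mapping privately; the vendor never sees it."""
    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._reverse = {}

    def label(self, value: str) -> str:
        lab = opaque_label(self._key, value)
        self._reverse[lab] = value
        return lab

    def resolve(self, label: str) -> Optional[str]:
        return self._reverse.get(label)
```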
I admit, I like the idea of enabling storage vendors to provide optimization services that do not require them to understand the innards of the data itself. This would allow customers with highly sensitive data to store it in a public cloud service (for example) in fully encrypted form and still provide indexing information for it. The “secret decoder rings” can be maintained by the data owner, yet the storage vendor can provide useful value-added services at enterprise scale. Why? Because, as I noted earlier, the right place to store metadata is as close as possible to the place where it is consumed. At enterprise scale, that would logically be someplace accessible throughout the enterprise.
At this point I realized that our propensity to store the metadata with the data really does not make sense when we think of multiple storage silos – it’s the wrong location. Separating the metadata service, placing it close to where the metadata is consumed, and using strategically located agents to generate the various types of metadata, including activity context and semantic information, all make sense because the owner of that data is really “closest” to where that metadata is used. A “file system” that maintains no metadata is really little more than a key-value store, as the metadata server can be maintained separately. Of course, that potentially creates other issues (e.g., space reuse.) I don’t think I need to solve such issues, because that consideration is not important at this point in my own research.