Category Archives: Questions

Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery

June 25, 2019

Storage Systems and I/O: Organizing, Storing, and Accessing Data for Scientific Discovery, Robert Ross, Lee Ward, Philip Carns, Gary Grider, Scott Klasky, Quincey Koziol, Glenn K. Lockwood, Kathryn Mohror, Bradley Settlemyer, and Matthew Wolf, United States: N. p., 2019. Web. doi:10.2172/1491994.

Google brought this to my attention recently because I’ve set triggers for people citing prior works that I’ve found interesting or useful. There are a number of things that caught my eye with this technical report that apply to my own current research focus. In addition, I received a suggestion that I pick a specific field or discipline and look at how to solve problems in that specific community as an aid to providing focus and motivation for my own work.

This report is a wealth of interesting information and suggestions for areas in which additional work will be useful. The report itself is quite long – 134 pages – so I am really only going to discuss the sections I found useful for framing my own research interests.

The report essentially captures the discussion at a US DOE workshop. Based upon my reading of the report, it appears that the workshop was structured around presentations of information and seeded with topics expected to elicit discussion. Much of the interesting information for me was in Section 4.2 “Metadata, Name Spaces, and Provenance”.

The report starts by defining what they mean by metadata:

Metadata, in this context, refers generally to the information about data. It may include traditional user-visible file system metadata (e.g., file names, permissions, and access times), internal storage system constructs (e.g., data layout information), and extended metadata in support of features such as provenance or user-defined attributes. Metadata access is often characterized by small, latency-bound operations that present a significant challenge for SSIO systems that are optimized for large, bandwidth-intensive transfers. Other challenging aspects of metadata management are the interdependencies among metadata items, consistency requirements of the information about the data, and volume and diversity of metadata workloads.
4.2.1 Metadata (p. 39)

I was particularly interested in seeing their observation that relationships were one of the things that they found missing from existing systems. While somewhat self-serving, this is the observation that has led me into my current research direction. The observation about how meta-data I/O behavior is a mismatch for the bandwidth-intensive nature of accessing the data was a useful insight: there is a real mismatch between the needs of accessing meta-data versus accessing the data itself, particularly for HPC style workloads.

Not explicitly mentioned here is the challenge of constructing a system in which meta-data is an inherent part of the file itself, despite these very different characteristics. The authors do point out other challenges, such as the difficulty in constructing efficient indexing, which suffers from scalability issues:

While most HPC file systems support some notion of extended attributes for files [Braam2002, Welch2008, Weil2006], this type of support is insufficient to capture the desired requirements to establish relationships between distributed datasets, files, and databases; attribute additional complex metadata based on provenance information; and support the mining and analysis of data. Some research systems provide explicit support for searching the file system name space based on attributes [Aviles-Gonzalez2014, Leung2009], but most of these systems rely on effective indexing, which has its own scalability and data-consistency challenges [Chou2011].
4.2.1 Metadata – “State of the Art” (p. 39)

In other words: the performance requirements of accessing meta-data versus data are quite different; the obvious solution is to provide separate storage tiers or services for satisfying these needs. The disadvantage is that once we separate meta-data from data, we create consistency problems, and the classic solutions to those consistency problems in turn create scalability challenges. We are faced with classic distributed systems problems, in which we must trade off consistency against performance. This is the same flavor of trade-off that the CAP Theorem formalizes for consistency and availability in the presence of network partitions.

Another important point in looking at this technical report is that it emphasizes the vast size of datasets in the scientific and HPC communities. These issues of scale are exacerbated in these environments because their needs are extraordinary, with vast data sets, large compute clusters, geographical diversity, and performance demands.

The need for solutions is an important one in this space, as these are definitely pain points. Further, the needs of reproducibility in science are important – the report expressly mentions that DOE policy now requires a data management plan. The emphasis is clearly on reproducibility and data sharing. It seems fairly clear (to me, at least) that having better data sharing can only benefit scientific work.

The seeding discussion for the Metadata section of the report raises some excellent points that, again, help buttress my own arguments and hopefully will be useful in shaping the forward direction:

A number of nontraditional use cases for the metadata management system have emerged as key to DOE missions. These include multiple views of the metadata to support, for example, different views at different levels of the name space hierarchy and different views for different users’ purposes; user-defined metadata; provenance of the metadata; and the ability to define relationships between metadata from different experiments (e.g., to support the provenance use case).
As the collection of metadata expands, it is important to ensure that all metadata associated with a dataset remains with the data. Metadata storage at different storage tiers, storage and recovery of metadata from archive, and the transfer of datasets to different storage systems are all important use cases to consider.
4.2.1 Metadata – “Seeding Workshop Discussion” (p. 40)

The idea of multiple views is important. It is something we’ve been exploring recently as we consider how to look at data just using current information available to us, something I should describe in another post.

So what do I pick out here as being important considerations? Different views, user-defined metadata, provenance, and defining relationships, along with their requirement that metadata remain associated with the underlying dataset. As I noted previously, this becomes more challenging when you consider that metadata operations have very different interface and performance characteristics than data access.

I have not really been looking at the issues related to metadata management in tiered storage systems, but clearly I have to do so if I want to address the concerns of the HPC community.

The attendees agreed that the added complexity in storage hierarchies presents challenges for locating users’ data. A primary reason is that the community does not yet have efficient mechanisms for representing and querying the metadata of users’ data in a storage hierarchy. Given the bottlenecks that already exist in metadata operations for simple parallel file systems, there is a strong research need to explore how to efficiently support metadata in hierarchical storage systems. A promising direction of research could be to allow users to tag and name their data to facilitate locating the data in the future. The appropriate tagging and naming schemes need investigation and could include information about the data contents to facilitate locating particular datasets, as well as to communicate I/O requirements for the data (e.g., data lifetime or resilience).
Section 4.4.5 Hierarchy and Data Management

How do we efficiently support metadata in hierarchical storage systems? Implicit in this question is the assumption that there are peculiar challenges for doing this. The report does delve into some of the challenges:

Hashing can be useful in sharding data for distribution and load balancing, but this does not capture locality – the fact that various files are actually related to one another. We have been considering file clustering; I suspect that cluster sharding might be a useful mechanism for providing load balancing.
Metadata generation consists of both automatically collected information (e.g., timestamps, sizes, and name) as well as manually generated information (e.g., tags). The report argues that manual generation is not a particularly effective approach and suggests automatically capturing workflow and provenance information is important. As I was reading this, I was wondering if we might be able to apply inheritance to metadata, in a way that is similar to taint tracking systems.

The report has a short, but useful discussion of namespaces. This includes the traditional POSIX hierarchical name space as well as object oriented name spaces. They point to views as a well-understood approach to the problem from the database community. I would point out the hierarchical approach is one possible view. The report is arguing that their needs would be best met by having multiple views.

The existing work generally is hierarchical and focused on file systems. A number of researchers, however, have argued that such hierarchical namespaces impose inherent limitations on concurrency and usability. Eliminating these limitations with object storage systems or higher-level systems could be the fundamental breakthrough needed to scale namespaces to million-way concurrency and to enable new and more productive interaction modalities.
4.2.2 Namespaces “Seeding Workshop Discussion”

There is a dynamic tension here, between search and navigation. I find myself returning to this issue repeatedly lately and this section reminds me that this is, in fact, an important challenge. Navigation becomes less useful when the namespace becomes large and poorly organized; humans then turn to search. Views become alternative representations of the namespace that humans can use to navigate. They can filter out data that is not useful, which simplifies the task of finding relevant data. We apply views already: we hide “hidden files” or directories beginning with a special character (e.g., “.” in UNIX derived systems). The source code control system git will ignore files (filter them from its view) via a .gitignore file. Thus, we are already applying a primitive, limited form of filtering to create the actual view we show.

This report goes on further. It considers some really interesting issues within this area:

Storage aware systems for maintaining provenance data.
The scaling issues inherent in collecting more provenance data; or what do we do when managing the metadata becomes a huge issue itself?
Cross-system considerations. This doesn’t require HPC data – I have commented more than once that when humans are looking for something, they don’t want to restrict it to the current storage device. Data flows across devices and storage systems; we need to be able to capture these relationships. “[T]here is no formal way to construct, capture, and manage this type of data in an interoperable manner.”
External meta-data. We need to remember that the context in which the data is collected or consumed is an important aspect of the data itself. Thus, the tools used, the systems, etc. might be factors. I would argue that a storage system can’t reasonably be expected to capture these, but it certainly should be able to store this metadata.

The discussion for this section is equally interesting, because it reflects the thoughts of practitioners and thus their own struggles with the current system:

Attendees mentioned that tracking provenance is a well-explored aspect of many other fields (art history, digital library science, etc.) and that effort should be made to apply the best practices from those fields to our challenges, rather than reinventing them. Attendees extensively discussed the high value of provenance in science reproducibility, error detection and correction in stored data, software fault detection, and I/O performance improvement of current and future systems.
Attendees also discussed the need for research into how much provenance information to store, for how long, in what level of detail, and how to ensure that the provenance information was immutable and trustworthy. The value of using provenance beyond strictly validating science data itself was brought up; attendees pointed out that provenance information can be used to train new staff as well as help to retain and propagate institutional knowledge of data gathering processes and procedures.
4.2.4 Discussion Themes “Provenance”

A generally useful observation: look to how other fields have approached common problems, to see if there are insights from those fields that we can use to address them here. I found the vast reach of the discussion here interesting – the idea that such a system can be used to “… retain and propagate institutional knowledge…”

Finally, I’m going to capture the areas in which the report indicates participants at the workshop reached consensus. I’ll paraphrase, rather than quote:

Scalable metadata storage – a key point for me here was decoupling meta-data from the data itself, despite the seeding suggestion that we keep meta-data associated with the file.
Improve namespace query and display capabilities – make them dynamic, programmable, and extensible.
Better provenance information – the emphasis was on reproducibility of results, but they wanted to ensure that this could embed domain specific features, so that such systems can be useful beyond just reproducibility.

I’ve really only touched on a small part of this report’s total content; there is quite a bit of other useful insight within the report. I will be mulling over these issues.

Of file handles and implicit offsets (Follow-up)

December 2, 2017

My post about file handles elicited some interesting feedback, so I wanted to capture it because I thought it provided some insight.

Shared libraries were not a standard part of UNIX systems in the 1980s (though they had certainly been described in prior work) and thus one interesting observation here is that putting code in the kernel was a way of minimizing the amplification of common runtime code. The use of shared libraries today and our increased certainty that kernels need to be as small as we can make them certainly would lead to a very different divide today.

Conversations with Malcolm are often insightful, so I wanted to capture this one here because it is definitely germane to the area that I’m exploring, particularly as I try to explain some of the underlying rationale for it – a combination of software archaeology and pragmatically looking at how to evolve forward.

I always thought this was just a programming convenience, because it allows a simple program to have “read next” semantics, and half of the core UNIX utilities are stream parsers that want that semantic. While you’re in the area though, I’d ask “what’s the purpose of a current directory?” which is implemented (on Windows) as a process wide value. I’m guessing it also started (on UNIX) as a programming convenience, but being process-wide has meant that it disappeared as a convenience and reemerged as a headache. (DOS arguably had a different history since it was trying to run non-directory aware applications in the presence of directories.)

My response:

If it were just a convenience we could easily bury it in a library. I’m trying to do some hybrid file systems implementation work and it creates complications that seem unnecessary. But who really looks at this old cruft anyway?

Current directory is another good one. And the directory enumeration offset a third.

And his reply:

As you mentioned though, POSIX was trying to codify existing implementations, and presumably at some point the choice between kernel and library could have been made, but once it’s been made it’s hard to change. UNIX libraries always seemed strange to me in that in many cases they’re super-simple syscall wrappers (similar to NTDLL) but in some cases (file pattern matching) all the heavy lifting is there. Remember that in the beginning shared libraries weren’t a thing, so requiring functionality to be in a library meant duplicated code while the kernel provided a natural place to share code/functionality. This probably influenced a lot of choices.

Of file handles and implicit offsets

November 25, 2017

My current research direction (which is wandering a bit, as is common with research) has forced me to look at some of the vagaries of the POSIX interface. One of these is the intriguing decision to incorporate a piece of file descriptor specific state for the “file pointer” (note that in Windows there is an exact equivalent in the CurrentByteOffset of the file handle).

One thing to note about POSIX is that it was not designed from scratch. Rather, it captured the state of UNIX systems in the 1980s and codified it. Thus, rather than inventing this behavior, POSIX (or officially, IEEE Std. 1003.1-1988) codified a uniform interface acceptable to a variety of parties. Like any standards document, it is a compromise that attempts to mollify a variety of different players.

Here is a version of the Linux in-kernel file structure (from the main linux repository as of this morning):

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	enum rw_hint		f_write_hint;
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

Note the f_pos field. This is the file pointer and it allows things like read and write to work without an explicit offset value.

Here’s the equivalent structure in Windows 10:

typedef struct _FILE_OBJECT {
    CSHORT Type;
    CSHORT Size;
    PDEVICE_OBJECT DeviceObject;
    PVPB Vpb;
    PVOID FsContext;
    PVOID FsContext2;
    PSECTION_OBJECT_POINTERS SectionObjectPointer;
    PVOID PrivateCacheMap;
    NTSTATUS FinalStatus;
    struct _FILE_OBJECT *RelatedFileObject;
    BOOLEAN LockOperation;
    BOOLEAN DeletePending;
    BOOLEAN ReadAccess;
    BOOLEAN WriteAccess;
    BOOLEAN DeleteAccess;
    BOOLEAN SharedRead;
    BOOLEAN SharedWrite;
    BOOLEAN SharedDelete;
    ULONG Flags;
    UNICODE_STRING FileName;
    LARGE_INTEGER CurrentByteOffset;
    __volatile ULONG Waiters;
    __volatile ULONG Busy;
    PVOID LastLock;
    KEVENT Lock;
    KEVENT Event;
    __volatile PIO_COMPLETION_CONTEXT CompletionContext;
    KSPIN_LOCK IrpListLock;
    LIST_ENTRY IrpList;
    __volatile PVOID FileObjectExtension;
} FILE_OBJECT;
typedef struct _FILE_OBJECT *PFILE_OBJECT;

The equivalent field in this structure (from wdm.h in the Windows 10 WDK) is CurrentByteOffset. I spent some time looking through the various fields and my observation is that this is the only piece of implicit user-visible shared mutable state.

This actually doesn’t work in multi-threaded environments (very common these days) if threads use the same file descriptor (file handle in Windows), since it doesn’t make any sense to arbitrarily interleave reads. In those environments, you use a different call: pread on POSIX systems, while in Windows the offset is an explicit (optional) parameter of the native system call, NtReadFile.

This led me to ask the question: why is this here? I haven’t found a definitive source since this predates the original POSIX specification, but my theory is that it is because it is the only way to properly implement sharing of the file descriptor. When UNIX added the fork call, one of the characteristics of it was “inheritance of file descriptors”.

          The child inherits copies of the parent's set of open file
          descriptors.  Each file descriptor in the child refers to the same
          open file description (see open(2)) as the corresponding file
          descriptor in the parent.  This means that the two file
          descriptors share open file status flags, file offset, and signal-
          driven I/O attributes (see the description of F_SETOWN and
          F_SETSIG in fcntl(2)).

(Source: http://man7.org/linux/man-pages/man2/fork.2.html)

The status flags mostly describe how the file was opened, so they rarely change (a few, such as O_APPEND and O_NONBLOCK, can be changed later with fcntl(2) F_SETFL). The addition of F_SETOWN and F_SETSIG is more recent, but it does appear to be explicitly mutable state (it allows programmatic changes).

Fork is not the only way that a file descriptor (or file handle) can be shared. For example, it can be done using UNIX domain sockets on UNIX and Linux systems. Windows provides a system call for doing something similar as well (the documented version is ZwDuplicateObject).

I’ve spent time thinking about this and it seems that the reason to maintain this shared state is to ensure that two processes sharing the same file descriptor/handle get the same position pointer value. This then led me to ask: why is this useful?

I have been able to construct a single scenario in which this is useful: appending to the end of a shared file. Interleaving reads doesn’t make much sense. Interleaving writes inside the existing boundaries of a file makes even less sense to me. I can construct peculiar scenarios in which I can write applications that explicitly use this feature but they seem artificial.

Writing to a log file at the end seems like it would make sense. But if that’s my goal, it makes more sense to just use O_APPEND mode:

              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  The modification of the file offset and the write
              operation are performed as a single atomic step.

(Source: http://man7.org/linux/man-pages/man2/open.2.html)

Thus, this makes me wonder: could we just eliminate this piece of shared state? I have a reason for asking this question though I will save discussing that for another time.

Preserving the correct behavior for most applications will require fixing things up in the library – we could eliminate read as a system call and provide a library implementation that calls pread.

I’m considering doing that and seeing what breaks. It is more difficult to do that in Windows than in Linux, so I’m considering starting with Linux.
