Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 210 other subscribers
January 2025
S M T W T F S
 1234
567891011
12131415161718
19202122232425
262728293031  

On the history of File Systems

I have made it a goal for 2018 to answer a question several people have asked me: what papers should I read to learn more about file systems. So I’ve decided to attempt to copy a format that I’ve found useful – The Morning Paper. I admit, I am not sure I’ll be able to keep up the frenetic pace of a paper each day that he maintains, but I do know there are plenty of papers to read.

My motivation for doing this is simple: there are quite a few papers on this topic, I certainly haven’t read them all (or even the majority of them) and reading through them, along with my own interpretation of why the paper matters to File Systems will be useful for me.  It also gives me a set of information that I can point out to people that ask me “so what should I read to learn more about this topic.”

Why File Systems?  For me it’s been largely a quirk of fate.  I’ve been working on operating systems for many years now, and file systems happens to be the area in which I’ve spent more time than any other.  For something that is conceptually so simple – after all, it just maps files to blocks on your disk, it has considerable complexity.  File systems are also the gateway to one of the more challenging parts of the operating system: the bit that stores persistent state.  Errors in file systems are often not transient.  That makes them challenging.  The bar is set high because we don’t want to lose anyone’s data.

File systems are part of the plumbing of the operating system.  They are essential to proper operation but when everything is working properly, nobody really notices they are there.  Only when things go wrong does anyone notice.  So it is within this world that I will delve.

The place to start is to even attempt to define file systems.  While you might think this is simple, it turns out to be surprisingly challenging.  So I’ll approach it by providing some examples and then delving into what means to be a file system.

Media File Systems

The easiest place to start is in the concept of the media file system.  “Media” in this context means any tangible medium on which we can systematically record and/or read persistent information.  Examples would include disks, tapes, non-volatile memory, or optical media. Whether or not RAM file systems fall into this category is an interesting question – and it helps demonstrate my point that pinning down this concept is more challenging than one might otherwise think.

Disk drives are typically what most people think of first for this category, though we now often view solid-state disk drives in the same category.  File systems reside on top of media devices and keep track of the organization of the data itself.  Some file systems can span multiple disk, or use transactional journals, or provide resilience against errors in the media itself.  The continuum here is surprisingly broad, but in the end the purpose of the file system is to map from the vagaries of the media to a (mostly) uniform model of behavior so application programs “just work”.

Network File Systems

As we added computers to networks, we found it useful to be able to transfer information stored on one computer to another.  This permitted applications to access data, regardless of where it was actually stored.  By constructing a file system that uses a remote access protocol, we can present the remote storage device to the local system as if it were just a local file system – or rather, almost the same.

Good examples of this include the Network File System (NFS) originally developed by SUN Microsystems in the 1980s and the Andrew File System (AFS) originally developed by Carnegie-Mellon University in the same time period.  These days many people also use the Server Message Block (SMB) based network file systems; the roots of this are also back in the 1980s.  All three of these remain in use today, in fact, though they have evolved from the earliest versions.

Key-Value Stores

There is considerable overlap between the world of file systems and databases.  Several early file systems were really just single level lookup mechanisms, rather than the hierarchical name space that is so common these days.  It is actually common to implement file systems so that hierarchy is added to a flat indexed name space (a “flat” file system).  These are in essence one form of key-value store.  Some file systems permit addressing by using keys, whether explicitly assigned or implicitly generated.  A subsection of the literature focus more on this type of file system and I will make sure to cover several of these.

Name Spaces

Sometimes a file system is more about the namespace than anything else.  For example the proc file system does not actually reside on top of any sort of storage.  The name space provides a convenient way to find information that is generated on demand.  In these cases, it is the name space that really matters, not how the information is stored.

This conflation of storage, presentation, and name-space add to the richness and complexity of work being done in file systems.  My goal is to explore these issues, by reviewing the literature.

Of file handles and implicit offsets (Follow-up)

My post about file handles elicited some interesting feedback, so I wanted to capture it because I thought it provided some insight.

Shared libraries were not a standard part of UNIX systems in the 1980s (though they had certainly been described in prior work) and thus one interesting observation here is that putting code in the kernel was a way of minimizing the amplification of common runtime code. The use of shared libraries today and our increased certainty that kernels need to be as small as we can make them certainly would lead to a very different divide today.

Conversations with Malcolm are often insightful, so I wanted to capture it here because it is definitely germane to the area that I’m exploring, particularly as I try to explain some of the underlying rationale for it – a combination of software archaeology and pragmatically looking at how to evolve forward.

I always thought this was just a programming convenience, because it allows a simple program to have “read next” semantics, and half of the core UNIX utilities are stream parsers that want that semantic. While you’re in the area though, I’d ask “what’s the purpose of a current directory?” which is implemented (on Windows) as a process wide value. I’m guessing it also started (on UNIX) as a programming convenience, but being process-wide has meant that it disappeared as a convenience and reemerged as a headache. (DOS arguably had a different history since it was trying to run non-directory aware applications in the presence of directories.)

My response:

If it were just a convenience we could easily bury it in a library. I’m trying to do some hybrid file systems implementation work and it creates complications that seem unnecessary. But who really looks at this old cruft anyway?

Current directory is another good one. And the directory enumeration offset a third.

And his reply:

As you mentioned though, POSIX was trying to codify existing implementations, and presumably at some point the choice between kernel and library could have been made, but once it’s been made it’s hard to change. UNIX libraries always seemed strange to me in that in many cases they’re super-simple syscall wrappers (similar to NTDLL) but in some cases (file pattern matching) all the heavy lifting is there. Remember that in the beginning shared libraries weren’t a thing, so requiring functionality to be in a library meant duplicated code while the kernel provided a natural place to share code/functionality. This probably influenced a lot of choices.

US Patent 9,830,329 Issued

US Patent 9,830,329 issued November 28, 2017.

It is about a technique for utilizing a pre-existing tunneling mechanism to invoke remote functionality. The original inspiration for this was the need to communicate from a component on a client system to utilize functionality available on the server.

The first case for this was when I wanted to obtain meta-data that was maintained on the server, but not available to me on the client. By exploiting a pass-through mechanism in the Windows SMB client, I could send a query to the server, have a component on the server get the answer, and respond back to my software on the client. This turns out to be useful in several cases – and doing it through the existing SMB infrastructure allowed a clean, functional solution.

Of file handles and implicit offsets

My current research direction (which is wandering a bit, as is common with research) has forced me to look as some of the vagaries of the POSIX interface.  One of these is this intriguing decision to incorporate a piece of file descriptor specific state for the “file pointer” (note that in Windows there is an exact equivalent in the CurrentByteOffset of the file handle).

One thing to note about POSIX is that it was not designed initially.  Rather, it captured the state of UNIX systems in the 1980s and codify it.  Thus, rather than inventing this behavior, POSIX (or officially, IEEE Std. 1003.1-1988)  codified a uniform interface  acceptable to a variety of parties.  Like any standards document, it is a compromise that attempts to mollify a variety of different players.

Here is a version of the Linux in-kernel file structure (from the main linux repository as of this morning):

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;

	/*
	 * Protects f_ep_links, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	enum rw_hint		f_write_hint;
	atomic_long_t		f_count;
	unsigned int 		f_flags;
	fmode_t			f_mode;
	struct mutex		f_pos_lock;
	loff_t			f_pos;
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;

	u64			f_version;
#ifdef CONFIG_SECURITY
	void			*f_security;
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;
	errseq_t		f_wb_err;
} __randomize_layout
  __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

Note the f_pos field (which I’ve highlighted).  This is the file pointer and it allows things like read and write to work without an explicit offset value.

Here’s the equivalent structure in Windows 10:

typedef struct _FILE_OBJECT {
    CSHORT Type;
    CSHORT Size;
    PDEVICE_OBJECT DeviceObject;
    PVPB Vpb;
    PVOID FsContext;
    PVOID FsContext2;
    PSECTION_OBJECT_POINTERS SectionObjectPointer;
    PVOID PrivateCacheMap;
    NTSTATUS FinalStatus;
    struct _FILE_OBJECT *RelatedFileObject;
    BOOLEAN LockOperation;
    BOOLEAN DeletePending;
    BOOLEAN ReadAccess;
    BOOLEAN WriteAccess;
    BOOLEAN DeleteAccess;
    BOOLEAN SharedRead;
    BOOLEAN SharedWrite;
    BOOLEAN SharedDelete;
    ULONG Flags;
    UNICODE_STRING FileName;
    LARGE_INTEGER CurrentByteOffset;
    __volatile ULONG Waiters;
    __volatile ULONG Busy;
    PVOID LastLock;
    KEVENT Lock;
    KEVENT Event;
    __volatile PIO_COMPLETION_CONTEXT CompletionContext;
    KSPIN_LOCK IrpListLock;
    LIST_ENTRY IrpList;
    __volatile PVOID FileObjectExtension;
} FILE_OBJECT;
typedef struct _FILE_OBJECT *PFILE_OBJECT; 

I highlighted the equivalent field for this structure (from wdm.h in the Windows 10 WDK).  I spent some time looking through the various fields and my observation is that this is the only piece of implicit user-visible shared mutable state.

This actually doesn’t work in multi-threaded environments (very common these days) if threads use the same file descriptor (file handle in Windows) since it doesn’t make any sense to arbitrarily interleave reads.  In those environments, you use a different call – pread for POSIX systems, and in Windows it is explicit parameter in the native system call (NtReadFile where it is an optional parameter).

This led me to ask the question: why is this here?  I haven’t found a definitive source since this predates the original POSIX specification, but my theory is that it is because it is the only way to properly implement sharing of the file descriptor.  When UNIX added the fork call, one of the characteristics of it was “inheritance of file descriptors”.

          The child inherits copies of the parent's set of open file
          descriptors.  Each file descriptor in the child refers to the same
          open file description (see open(2)) as the corresponding file
          descriptor in the parent.  This means that the two file
          descriptors share open file status flags, file offset, and signal-
          driven I/O attributes (see the description of F_SETOWN and
          F_SETSIG in fcntl(2)).

(Source: http://man7.org/linux/man-pages/man2/fork.2.html)

The status flags describe how the file was opened so they aren’t changing (immutable). The addition of F_SETOWN and F_SETSIG
is more recent but it does appear to be explicitly mutable state (it allows programmatic changes).

Fork is not the only way that a file descriptor (or file handle). For example, it can be done
using UNIX domain sockets on UNIX and Linux systems. Windows provides a system call for doing something similar
as well (the documented version is ZwDuplicateObject).

I’ve spent time thinking about this and it seems that the reason to maintain this shared state is to ensure
that two processes sharing the same file descriptor/handle get the same position pointer value. This
then let me to ask why is this useful?

I have been able to construct a single scenario in which this is useful: appending to the end of a shared file.
Interleaving reads doesn’t make much sense. Interleaving writes inside the existing boundaries of a file
makes even less sense to me. I can construct peculiar scenarios in which I can write applications that
explicitly use this feature but they seem artificial.

Writing to a log file at the end seems like it would make sense. But if that’s my goal, it makes more
sense to just use O_APPEND mode:

              The file is opened in append mode.  Before each write(2), the
              file offset is positioned at the end of the file, as if with
              lseek(2).  The modification of the file offset and the write
              operation are performed as a single atomic step.

(Source: http://man7.org/linux/man-pages/man2/open.2.html)

Thus, this makes me wonder: could we just eliminate this piece of shared state?  I have a reason for asking this question though I will save discussing that for another time.

Preserving the correct behavior for most applications will require fixing things up in the library – we could eliminate read as a system call and provide a library implementation that calls pread.

I’m considering doing that and seeing what breaks. it is more difficult to do that in Windows than in Linux, so I’m considering starting there.

Shanghai Bound!

I’m on my way to the the Symposium on Operating Systems Principles, which is a well-known operating systems conference.  This time it is in Shanghai, China so I’m making my way there.  I do not have the privilege of presenting anything there this trip, but hopefully I will be able to do so in the future – it really is just a small matter of coming up with some interesting research that is worthy.

This will be my first time in China.  Definitely looking forward to seeing some of the interesting talks that are scheduled throughout the conference!

For those problems that are nails, I have a sledgehammer

I’ve been involved in a project that requires I learn more about blockchain than I ever really wanted to know.

I had a basic appreciation of it from reading the Nakamoto paper many years ago.  It provides an insightful solution to a challenging problem – the problem of distributed consensus in an untrusted network.  There is an agreed mechanism (the protocol) and an incentive to participate (the fee).

Why is an untrusted network the problem?  Because this came from the world of digital cash, and the concern in that environment was that someone could “double spend” money.  The blockchain model makes that effectively impossible.  It does so in an interesting way – by combining a consensus protocol (getting multiple systems to agree on a given outcome) along with an incentive to behave (participants get paid to behave).  It’s an intriguing way of reaching common consensus.

But it has limitations.  Like any consensus protocol, it takes time to send messages back and forth and reach an agreement.  Further, blockchain requires that nodes attempt to solve a cryptographically hard problem.  The node that is successful gets the reward.  The algorithm is designed so that it takes about 10 minutes for someone to reach a valid solution.  These solutions are easy to verify, so once someone has figured out a solution – which indicates real work – they present their answer and it is echoed throughout the network.   By chaining the answer for the current block into the prior block it essentially creates chain that is not realistically mutable (e.g., it cannot easily change).

There are more technical details (Merkle Trees and cryptographic hashes and so forth) but the final solution is a “public ledger” that shows who owns what in the world.

Here’s the rub: this is interesting technology, but it is not the solution to every problem.  As is so often the case in computing, though, we pick “the shiny new tool” and insist it must be the right tool to chose for our problem at hand.  The challenge then for those of us who have been through this before is to actually analyze a problem and figure out a good solution.  Of course, sometimes we’re called upon to utilize a particular tool, like blockchain, regardless of whether or not it is the right tool for the job.  When that happens there’s an unenviable task of trying to explain why it is not the right solution.  Doing so is tough, as it is often not what a customer wants to hear.

The advisory shuffle

Friday, October 13, was my adviser’s last day.  He remains as an adjunct professor, but as such he cannot directly advise or supervise graduate students.  Thus, just a few weeks into my official life as a PhD student I find myself adrift.  While I started adrift, over the past 10 months I stopped drifting and began focusing in a specific direction.  It is a peculiar thing – to be adrift, then to find direction and focus, and then be adrift again.  Humbling, certainly.  When adrift the first time around I didn’t know what to expect, or how good it is to work with someone that encourages crazy ideas.  I touched on this in my last post. But life doesn’t always go the way we anticipate.

I spent a bit of time going through an existential crisis of sorts because being adrift meant that I had to once again consider all the various directions I might go. So that’s what I’ve been doing.  Considering externalities (they’re trying to recruit someone that would be an awesome match from a research perspective).  Considering crazy options (applying to go somewhere else).  Of course, there’s also the option of considering other faculty to serve as adviser, but that is surprisingly challenging.  One of the best fits isn’t in the same department, which creates other potential complications.  Sorting it all out takes time.

Then I realized that this is part of the challenge of completing a PhD – the willingness to persevere in the face of factors beyond my control.  Hence why I found myself asking why I wanted to do this.  I’m not a brilliant researcher though I like to think I’m at least a competent one.  I can write reasonably well.  My hope is that the insights I do have can be used to contribute something.  I’m also pragmatic enough to know that it is rather unlikely I will change the world.  Still, I’m idealistic enough to want to try to change the world.

The good news is that none of this is an emergency.  I have time to look at options, to consider what I wish to achieve.  In the end, it will force me to look for clarity.  My hope is that I will actually find it.  Only time will tell.

“So why are you doing this?”

One of the questions that I’m often asked is “why would you want to get a PhD?” or “why do you want to do research?”

Everyone that even considers pursuing a PhD will no doubt be asked this question at some point.  I’ve been asked it a number of times recently, no doubt due to the various events swirling around my life and my decision to pursue this unusual direction.

Working in industry can afford numerous opportunities to make incremental improvements.  These improvements are certainly valuable, but they are also quite focused on work within the current paradigm.  Changes to the paradigm can arise in this model, but they are definitely heavily shaped by commercial considerations.

Working in academia tends to focus more on exploring in a broader range of directions.  Sometimes it is funded by industry, but more as a long-term investment, with an expectation that most ideas just won’t pan out.  Sometimes research is done within industry as well – a long-term bet on developing the next great technology.

So, what does this have to do with pursuing a PhD?  One difference between me and the typical PhD student is that I have done quite a lot of work in industry and have gained insights into things I see that look like problems ingrained in the system.  Thus, for me, this PhD offers an opportunity to explore ideas that reflect those insights – a willingness to question what is just assumed, and then see if I can find a way to try something new.

In essence, I want to change the world.  But the other thing I’ve learned is pragmatism, so I’ll consider it to be a success if I can at least rock the boat.  Maybe in that way I can build a foundation upon which someone smarter and with greater insight than me can actually build something that does change the world.  I don’t see that happening for me in industry.  It’s certainly not guaranteed going down the academic research path either, but I will do my best to do good work and enjoy the experience.

If I knew what the outcome was going to be, it wouldn’t be research, it would be development.

 

Windows and “reserved names”

I’ve fallen into the habit of answering some questions on Quora and those which involve answering questions about Windows file systems seem to garner the most interest and the most disagreement.

Yesterday I answered a question entitled Why can’t I save a folder name “con” in Windows?  I decided to respond to this because the other answers that I read said that Windows reserved this name.   My response demonstrated how you could easily do this with bash running in WSL.  Some comments seemed to think that using bash was cheating, so I sat down and wrote a simple program to create folders with reserved names.  It isn’t bound to WSL and executes as a standard console program (so it is executed from the command line).  I’ve published the source code on github.

It’s been a few months since I’ve posted anything about Windows file systems and I realized that I should start doing so again.  So, that little application program inspired me.

The difficulty that I have with assertions that “Windows doesn’t do that” or “Windows doesn’t allow that” is that they often conflate the limitations of applications, libraries, or subsystems as being limitations of Windows.  Windows NT was developed to be a general purpose host operating system.  The expectation was that it would have subsystems which implemented specific execution environments. While this includes Win32 (the subsystem most often used by applications) it has never been the only subsystem available.  Windows NT 3.1 shipped with Win32, NTVDM, OS/2, POSIX, and native subsystems (though native is more “not bound to a subsystem”).  I had also heard rumors of a UNIX System V Release 4 subsystem as well, but to the best of my knowledge it was never released and I never saw it.

Window dropped support for OS/2 in Windows XP.  The POSIX implementation changed as it became more like a library that called Win32 and native as needed.  What remained of POSIX was dropped in Windows 8.1.  The Windows Subsystem for Linux (WSL) has been added in Windows 10. It is not based upon the old POSIX subsystem, but it certainly exploits functionality that Windows includes because of the POSIX subsystem’s requirement.

Win32 does not have a concept of fork, but Windows includes all the support necessary to create a duplicate address space, either via NtCreateProcess or NtCreateUserProcess (you can see an example of the latter, including protected processes here).  This is used to implement this functionality.

NTFS has always supported the full range of POSIX compliant names.  The downside to using this is that it means Win32 application (like Explorer) can have problems accessing them.  That’s not a limitation of Windows it is a limitation of the application.

I’ve heard several people tell me that “Windows” must include everything installed, not just the operating system.  I disagree with this characterization, but I’ll address the why of that in a future post.

Bottom line: Windows itself does not have restricted names based upon MS-DOS.  Win32 applications usually do, however.  I would also note that Explorer cannot handle long path names – somewhere within the code is a hard coded buffer that is smaller than the OS level maximum path length (which happens to be 32767 16-bit UNICODE characters).  NTFS limits each name component (directory or file) to 256 bytes, but ReFS limits the component name size to 32767 (and this has inspired me in another area I’ll talk about at some point as well).  Other file systems may have their own limits, too.

That’s enough for now.  I’ll find more interesting things to discuss when it comes to Windows file systems, though.

Where does search functionality live?

In mulling over the depths of semantic knowledge and file systems, it occurs to me that one thing which differs between the world of Unix/Linux file systems and Windows file systems is that in Unix/Linux environments, search of a directory’s contents are done in the shell (or application) while in Windows they are a service of the file system.

I admit, when I first started working on Windows file systems, I thought this was an annoying decision, since it involved quite a bit of work inside the file system related to string handling and matching.  Even as I write this, I still think that it is a lot of work that really doesn’t belong in the kernel, but, having said that, this distinction is one reason why a Unix/Linux file systems developer might not think of adding semantic support to a file system as something logical – after all, the purpose of the file system is to manage storage of file systems and associated meta-data, not to find things.  Having experience in the Windows file systems space, I can understand why it might not be a great idea to do this in kernel mode.  After all, C is not a language well-known for its strength and safety in handling strings, and the kernel is not an environment well-known for its tolerance of C runtime error tolerance.

But I digress.  The point is this: when we begin to embed semantic knowledge inside the file system, we exploit a model in which the file system is involved in the search function and this would seem to be anathema to normal file systems behavior.   This is a good challenge: does this need to be done in the file system?  If not, perhaps there is instead an abstraction that the file system itself must be able to provide.

Each time I tackle this problem, my general sense is that the model I want is a case in which each file has a set of attributes.  Ideally, what I want is some way to quickly and efficiently find things based upon those attributes.  After all, how hard could this be?

One benefit to the current search paradigm with which users have been trained is that it does not provide reproducible search results.  Thus, nobody will really be surprised if they repeat a search today and get back different results than they got back yesterday.

Hence, I keep coming back to this paradigm.  It also gives me the sense that there are different characteristics of such a system – there are persistent attributes, like the timestamp, and ephemeral attributes, like semantic tags.

Plenty to think about, but this idea of where to draw the line of search is an important one.  In either case, though, I need to determine efficient ways of rapidly finding files based upon these attributes.