Home » Uncategorized

Category Archives: Uncategorized

April 2024
S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Reflections on Teaching Distributed Systems

June 10, 2023

During January through April 2023 I taught CPSC 416 at UBC. It was the first time I had taught this course and it would not have been possible without the assistance I received from others, notably Ada Gavrilovska and Ivan Bestchastnikh both of whom allowed me to use their materials in creating my own course. Of course, I reordered things, and adapted them to fit the class.

Teaching a course for the first time is always an illuminating experience. I’ve designed and taught classes on systems topics, including elements of DCE/DFS, the distributed file system on which I worked back when I was a twenty-something software developer, classes on Windows driver development (device drivers, file systems, and file system filter drivers,) and Windows kernel debugging. I even explored utilizing online mechanisms for providing a non-linear educational approach to core OS concepts (processes, threads, scheduling, and synchronization) as part of my MSCS work at Georgia Tech. Thus, I have enjoyed engaging in education and looking for ways to do better for much of my life. I have always found insights each time I teach a new course. CPSC 416 was no exception to this.

I have never taken a distributed systems class. I learned about distributed systems organically. In my senior year at the University of Chicago I worked for the nascent Computer Science department as part of the facilities team and part of that work included networking. I remember soldering connectors together for Ethernet connections of the time – vastly different than the RJ-45 connections we use now, but the same technology we use today. After graduation I took a job at Stanford working with David Cheriton, who ran the “Distributed Systems Group.” The V operating system, which his group developed, is what I used on my desktop, and I built a number of network components as part of my work for him, including a network protocol (VMTP) which I implemented on a BSD 4.2 UNIX based system, diskless bootstrap drivers, and even an IP-multicast version of a multi-player Mazewar variant.

From Stanford I went to Transarc, a CMU-research inspired start-up company that had two very different product directions: one was an online transaction processing system, and the other was a commercialization of the AFS distributed file system. I also worked on the successor (the DCE/DFS project I mentioned earlier.)

Thus, my background in distributed systems was building distributed systems, often from the perspective of not knowing what I was doing but being surrounded by smart people that helped me figure it out. In some ways, that was my objective in teaching CPSC 416: paying that hard work forward.

One observation now: things that you learn in your twenties become “assumed knowledge” quite easily in your fifties. Thus, I just assumed that everyone knew about how databases maintain consistency in the face of failures. This turns out not to be true. So, one of my first lessons here was that I need to explain this up front in order for many of the things I say afterwards make sense. What is the point about replicating a log (what database people usually refer to as a journal) if you don’t understand that a log is the basic mechanism we use to restore consistency in a database. Lesson 1: teach people about databases and recoverability.

The second observation stems from my first oversight: building transactionally safe recoverable systems is hard. You’d think I’d know that, since I built a transactionally safe recoverable system back in the late 1980s and early 1990s as part of my work on Episode, the local physical file system that we used to support some of the nifty features of DCE/DFS. Episode, in some form, continues to be used in production today (file systems have unnaturally long lives if they get any serious adoption.) I would be quite surprised if the underlying transactional system were significantly different than it was thirty years ago when I worked on it. Lesson 2: teach people about building transactionally safe databases. Related to this is explaining key-value stores explicitly. They are a key part of the programming assignments and understanding why we use them helps. They typically form the basis of databases and file systems. Indeed, file systems are typically key-value stores with a name space built on top of them. The keys are limited (integers representing an entry in a potentially sparse table of objects) and the values are mutable (which complicates implementation and correctness).

The third observation is not an original one, as I have learned in conversations with other educators. There is a fundamental mis-alignment between the objectives of students (which is essentially grade maximization) and my objectives (which is “learn useful stuff that will help you throughout your career.”) This isn’t a big surprise – after all, I have published work in plagiarism reduction – but trying to find ways to fix it is challenging. I had not expected the insane amount of pressure students seem to feel to maximize their grades. I did try to mitigate this somewhat by offering extra credit opportunities, though in the end that seemed to create stress for many to exploit those. Why people who are getting grades in the 90%+ range are “afraid of failing” is beyond me. Lesson 3: extra credit creates more stress than it alleviates. I don’t think this is entirely the case but I’ll be more cautious about using it in the future. Still, I don’t want people to worry that they are going to fail so my thought is to provide an incentive to participate that mitigates the likelihood of them failing.

My fourth observation is that I had never read the various papers about distributed consensus side by side before. Doing so was an eye-opening experience. What I learned is that in many cases the complications in those papers relate to: (1) optimizations; and (2) recovery. Thus, next time I teach this class I want to spend more time walking through the baseline protocol and then pointing out optimizations and handling recovery. One example of this was when I had a student point out that the Paxos Made Moderately Complex paper (PMMC) states that during the leader election phase the voting party sends along a list of their accepted but not committed proposals (from the previous leadership). This is not part of the protocol. It is an optimization that makes recovery faster and more efficient, but you can’t rely upon it to maintain correctness. Now that I understand this point of confusions better, I think I can walk people through it and distinguish this. Doing so will help people understand the underlying protocol better and then the optimizations we use to ensure it works correctly. Lesson 4: walk through the papers more carefully, explaining the base protocol and then pointing out that the primary difference is in optimizations and recovery mechanisms.

My fifth observation is that students focus too much on code and not enough on understanding. Distributed systems is an area in which one must think through failure cases, identify how you will handle them, and what you assume is going to be true (your “invariants”) throughout your code base. I did introduce some tools for doing this (modeling and TLA+ specifically) but I did not incorporate them into the actual assignments. I did have them write reports, but those were post-hoc reports. I would like to try making the design cycle a more prominent portion of this work, encouraging people to think about what they are building rather than trying to hack their way through it. One piece of feedback from several students was that my advice to “walk away and think through the project” was quite helpful. I’d like to make the structure of the course make that happen more naturally. I also think that by having explicit design milestones it would reduce stress by encouraging students to work on the projects before the deadline. Lesson 5: design is more important than code, but code helps students verify their design reflects good understanding. The challenge will be in finding the right balance between the two.

I have other, smaller observations as well that I won’t break out but I’ll capture here:

Extensions don’t really help.
Providing a flexible late policy can be helpful, but it often creates quite a lot of stress.
Some students abhor teams, some like them. I need to find a way to accommodate both learning styles in a way that is equitable.
Do as much as possible to simplify grading exams.
Make a conscious effort after each lesson to create questions for the exam. I provided individualized exams (drawn from a pool of questions) and I wish I’d had more questions from which to draw. What was particularly nice was being able to provide people with their own personalized exam’s answers right at the end of the exam. It also helped me identify some issues.

I do think a number of steps that I took in this course worked well. People (generally) liked the failure examples, they liked the responsiveness, they liked the material. It is easy to focus on just the negatives, but I want to make sure and acknowledge the positives because it is important to preserve those elements.

Finally, I have agreed to teach this course again in the fall (which for UBC means “Winter Term 1”) so I will have an opportunity to incorporate what I have learned into the next course offering. I’m sure I’ll have more insights after that class.

So Much Metadata, So Little Agreement

October 6, 2022

Earlier this year I was focused on collecting activity data. I made reasonable progress here, finding ways to capture local file system activity as well as activity against two different cloud service providers. I keep looking at other examples, as well, but rather than try for too much breadth, I decided to focus on the three sources I was able to get working and then push deeper into each source.

First, there is little agreement as to what metadata should be present. There are a few common fields, but then there are numerous fields that only show up in some subset of data sources – and this is just for file systems where presumably they’re storing the same basic stuff. What’s most common:

A name
A timestamp for when it was created
A timestamp for when it was modified
A timestamp for when it was accessed
Some attributes (read-only, file, directory, special/device)
A size

Of course, even here there isn’t necessarily agreement. Some file systems have limited size names or limited character sets they support. Timestamps are stored relative to some well-known value. UNIX traditionally chose January 1, 1970 00:00:00 UTC and that number comes up quite often. IBM DOS (and thus MS-DOS) for x86 PCs used January 1, 1980. Windows NT chose January 1, 1601. I do understand why this happens: we store timestamps in finite size fields. When the timestamp “rolls over” we have to deal with it. That was the basis of the Y2K crisis. Of course, I’ve been pretty anal about this. In the late 1970s when I was writing software, I made sure that my code would work at least to 2100 (2100 is not a leap year while 2000 was a leap year because of the rules for leap years.) I doubt that code survived to Y2K.

But file systems designers worry about these sorts of things because we know that file systems life surprisingly long lifetimes. When the Windows NT designers first settled on a 64 bit timestamp in the late 1980s they gleefully used high precision timestamps: 100 nanoseconds. But 64 bits is a lot of space and it allows storing date for many millennia to come.

Today, we store data all over the place. When we move it, those timestamps will be adjusted to fit whatever the recipient storage repository wants to use. In addition, any other “extra” metadata will silently disappear.

How much extra metadata exists? I’ve spent the past few weeks wading through Windows and even though I knew there were many different types of metadata that could be stored, I chuckled at the fact there is no simple way to retrieve all that metadata:

There are APIs for getting timestamps and sizes
There are APIs for getting file attributes
There are APIs for getting file names
There are APIs for getting a list of “alternate data streams” that are associated with a given file.
There are APIs for retrieving the file identifier of the file – that’s a magic number that can be combined with data from other APIs to associate activity information (and that is the reason I went spelunking for this information in the first place.)
There are APIs for retrieving “extended attributes” of files (EAs). EAs are older than Windows NT (1993) but have been difficult to use from the Win32 API that most applications use.
There are now APIs for retrieving linux related attribute information (see FILE_STAT_LX_INFORMATION) on top of the existing attributes.
There are 128 bit GUIDs and 128 bit File IDs

I’m sure I didn’t hit them all, but the point is that these various metadata types are not supported by all file systems. On Windows at least, when you try to copy a file from NTFS to FAT32 (or ExFAT) it will warn you about potential data loss if certain attribute data is present (specifically alternate data streams.) The reason I think they first added this (it was added a long time ago) was because in the early days of downloading files from the internet it became useful to tag them as being potentially suspect. This is done by adding an alternate data stream to the file (::Zone_Identifier) and then information about the remote location from which the file was downloaded.

Thus, this metadata isn’t added just because, it is added because it enables potentially useful functionality.

Here’s something I’ve never seen anyone do thus far – that doesn’t mean nobody does it, just that I haven’t seen it: nobody indexes based upon these attributes. The named stream Zone_Identifier could be used to find all the files that you’ve downloaded from the internet, regardless of where on your computer. I laugh at this because I know a number of times I’ve downloaded content and then had no idea where it was downloaded. With an index of downloaded content, I could just look at the last five things I downloaded – problem solved.

While I have spent a fair bit of time talking about Windows, I have seen similar issues on Linux. It is only in the past couple of years that the extended stat structure (statx) has become mainstream supported. Several file systems that run on Linux support extended attributes. The idea behind streams isn’t particularly novel (we implemented something we called property lists in Episode at the same time the NTFS team was deciding to all full-blown named alternate data streams to their file system. Ours were just limited in size – an approach that I think the ReFS team took because they found nobody was really using large alternate data streams.)

Bottom line: one of the interesting challenges in using activity data is that as similar as file systems seem on the surface they often implement different/special semantics using metadata. How to make sense of this is a significant problem and one that I do not expect to fully address. Despite this, I can see there is tremendous benefit to using even some of this metadata to build relationships between different storage locations. That, however, is a topic for another day.

GFS: a Graph-based File System Enhanced with Semantic Features

May 14, 2019

GFS: a Graph-based File System Enhanced with Semantic Features
Daniele Di Sarli and Filippo Geraci, Proceedings of the 2017 International Conference on Information System and Data Mining, pp. 51-55, Charleston, SC, US.

In this paper we describe GFS (graph-based file system) a new hybrid file system that extends the standard hierarchical organization of files with semantic features. GFS allows the user to nest semantic spaces inside the directory hierarchy leaving unaltered system folders. Semantic spaces allow customized file tagging and leverage on browsing to guide file searching.

I found this paper shortly after it was published and was intrigued by its name. I described our HotOS 2019 paper previously, which was rejected and one reviewer specifically cited to this paper (as well as the QDMS paper). I thought I had cited this paper and explained why it really wasn’t the same thing we were proposing, but apparently I did not do a good enough job of distinguishing this from our work.

The abstract does a good job of explaining how this work is different than what we proposed and what I’m trying to construct: a relationship graph file system that captures a richer set of relationships between files rather than just characteristics of the files themselves.

The authors do a good job of establishing the status quo: “Handmade directory hierarchies still remain the only method to classify documents for most computer users. Surprisingly, even public administrations as well as small and medium enterprises rely on manual classification.”

Indeed, one of the challenges in this space is that what we have has been “good enough” for a surprisingly long time, despite the fact that we know that it is rudimentary and shifts much of the cognitive burden to users.

“In this paper we try to address the question whether it is possible to extend standard file systems adding extra semantic features without altering the API or not.”

In my own way, I have been looking at this question for quite some time. Over a year ago I was working on finding a way in which I could support both classic file system interfaces as well as augmenting them with new features without requiring invasive operating systems level changes. While I expect that ultimately a successful demonstration of new interfaces will lead to OS level changes, it makes more sense to explore what interface changes are useful before actually making those changes. In that work (which I haven’t written about yet) I looked at constructing a hybrid FUSE file systems model where FUSE requests could be delivered via multiple paths: one is the classic kernel reflector model (e.g., FUSE for Linux as well as FUSE for Windows, and quite a few other OS platforms too) and the other is a message passing mechanism that directly routes requests from the application to the user mode FUSE library implementation. I am still working on that, so I expect to write more about it in the coming months!

So this paper explores the question of “what can we do without changing the existing APIs?” I had someone in my lab question why I cared about backwards compatibility with existing file systems APIs at one point; my position on this then (and now) is that insisting all applications change to support a new API is unrealistic if I want to make an impact.

One of the strengths of this paper is the emphasis on navigation versus search. This is the important distinction that I extracted from my recent review of the personal information manager survey paper. Trying to argue that search is the solution doesn’t fit with the way that users look for data; perhaps there are better search solutions, but ultimately the goal is to provide better services to the user which means helping them in the way they use the system now. I suspect the ideal will be to enhance both the current way, as well as provide better search tools; in other words navigation and search are not mutually exclusive approaches to the problem.

The authors are focused on navigation, not search. They use tags as an additional way to navigate the file system; they separate the semantic spaces from the hierarchical spaces, though. My concern is that this creates the semantic spaces as second class citizens (though, this system pushes them to the front of the bus). One thing that surprised me is their comment about how they returned semantic information before regular directory information. In my experience, application programs sort the results of directory enumerations and do not rely upon the order in which entries are returned.

The authors do identify complications for ordinary operations, notably copy, which in a graph can be complex because of the potential for cycles. They also identify the desirability of pushing multiple tags at once, which avoids repeated calls into the file systems interface. Copy needs to be optimized as well to deal with the inherent non-atomic nature of the beast. Rename and unlink also have complications given traditional POSIX semantics. The authors identify potential concerns about security that I have been considering as well, though I can point to Windows as being a real-world counter-example to the idea that you need path based security to work properly; while NTFS supports path-based security, the OS default is to grant traverse right to all users on the system. POSIX compatible applications disable that and force traverse checking, which has a noticeable impact on performance. Indeed, it seems one of the complications of extending the file system interface is defining the behavior between POSIX and the extension. That’s certainly a useful lesson.

In the end, this paper focuses on using tags for their files and creating namespace extensions that identify the files. It is a short (4 page) paper, and there is no evaluation of what they constructed or how effective it was. It presents one point in the design space and it is certainly a useful paper to consider as I design my own point in the design space.

Reboot time

April 22, 2019 / 1 Comment on Reboot time

It’s been almost a year since I posted anything substantive. It is so easy to just focus on other things, which is what I’ve been doing. In the past year I’ve continued to explore some interesting areas related to file systems. For example, for the past year I’ve been looking at persistent memory, which acts somewhat like storage (because it is persistent) and somewhat like DRAM (because it is byte addressed). The findings have been interesting: surprising in some cases, close to predicted in some cases. We’re still doing more work in this area, and I’m hoping to submit two papers later this year.

But the other big project is a file systems project. I’ve decided to use Windows as my platform of choice, both because I’m quite comfortable with Windows and because I think it’s the best choice for this project. Since it has been a goal of mine to do more writing in this area, I thought it would be great to use my blog to capture how I go about writing this file system for Windows with the (probably unrealistic) hope that I can eventually turn it into an online guide. I also plan on using a public repository for my project so that other people can see what I end up doing. I think I’ll turn that into a separate blog post, so I can talk more about that project.

Hopefully, by using this as a mechanism for describing my forward progress I can continue to post new information and content. That’s the theory at least.

If you are interested in persistent memory, two good recent papers are:

Basic Performance Measurements of the Intel Optane DC Persistent Memory Module

Single Machine Graph Analytics on Massive Datasets Using Intel® OptaneTM DC Persistent Memory

Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory

April 19, 2018

Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell in Proceedings of File Systems and Storage Technology 2011, Volume 11, pp 61-75, USENIX.

In this paper the authors turn their attention to data structure considerations for Non-Volatile Memory (NVM). Unlike the previous papers I have covered (Mnemosyne and NV-Heaps) they look at data structures specifically optimized to exploit the capabilities of NVM. From the abstract:

For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allows programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging.

Some other aspects of this paper that stand out:

They are looking at NVM replacing both DRAM and storage – thus, they view this as a single level store.
They use versioning to protect their data structures, versus logging.
They describe how to achieve this without hardware changes.

The paper has a good review of NVM memory technologies that may emerge. Table 1 (from the paper) underscores the dramatic decrease in latency. This is why we’ve been waiting for this for more than 10 years now. It really does change the platform in ways that we have not yet realized.

But it is not just the speed aspect that matters, it is also the persistence aspect. The density is much larger for these memories as well. Anyone that has looked at an NVMe M.2 drive can notice how few and small the components on it.

Do we treat it as storage? If so, perhaps we should look to the file systems world for some insight. The authors turn to the WAFL shadow page mechanism. They point to BTRFS and their use of this technique with B-trees. They dismiss this approach, concluding that they have “fewer data-copies” in CDDS. They distinguish this work because it is byte addressable versus prior work that was block oriented (page addressable). Again, the lessons learned from working with it aren’t directly applicable. They do point out that using NVM makes sense in a world of large, persistent storage backed data farms. So there is a need, and one they see fulfilled by NVM. It just needs efficient use of that NVM.

Thus, the authors walk their own path.

The speed of NVM is such that direct access is the only mechanism that makes sense. System calls impose too much overhead, doubling the cost of accessing the NVM itself. Thus they posit that it will be direct access (and indeed that is what seems to come to pass).

They observe that one of the challenges for persistent data is that CPUs do not provide mechanisms for ordering persistent writes (though they do, but at a fairly coarse granularity of a fence.) So they describe the issues that must be handled:

Reordering of writes from the caching behavior of the CPU itself as well as a multi-level cache hierarchy.
Failure semantics of atomic operations across power failures.

They summarize the various approaches available to them for ensuring data has been stored properly in NVM. This includes memory fences, cache writeback and invalidate operations, marking memory as non-cacheable (which forces write-back), cache line flushes, and atomic processor operations. They point out that this is not sufficient for more complex updates, such as tree rebalancing operation. This leads them to versioning.

I found it interesting that their goals were similar to those I have seen previously: durability, consistency, scalability, and ease-of-use for the programmer. They also note that they focus on physical consistency of the data contents in memory. Logical consistency of higher level meta-data structures is not addressed in the context of this work.

Thus, the authors point out that a CDDS is an abstract idea; they demonstrate how they envision using it by implementing a b-tree structure (Figure 1 is from the paper as they describe their B-tree).

In versioning, changes are not made in place to the current version; instead, a new version is written. The current version is immutable (at least as long as it is the current version). Atomic operation and copy-on-write techniques are used to make changes persistent. Once done, a new version number is assigned and the new version becomes the current version. The old version can be recycled once its reference count drops to zero.

Failure recovery then becomes a function of “cleaning up” any in-progress operations that had not been written to disk.

The authors then walk through their B-tree example. They explain their use of versioning, they provide pseudo-code for the various B-tree operations (lookup, insert, delete) as well as the internal operations needed to support them.

They evaluate their solution by simulating NVM on top of DRAM (a common solution as we have seen). They compare against BerkeleyDB, STX B-Tree, Tembo, Cassandra, and Redis. They were slower than the entirely in-memory STX B-Tree, presumably due to the cost overhead. They are much faster than BerkeleyDB (even BerkeleyDB runing on a RAM disk.) They also tested using YCSB as their “end to end” benchmark.

In the end, they do demonstrate that it is possible to rebuild existing data structures – preserving the interface – so they work efficiently with NVM. They even point out it does not require processor changes to do so. Given that better processor cache control mechanisms have been introduced since then, I would expect that exploiting them will lead to even better performance.

HotOS 17

May 11, 2017

This week I had the opportunity to attend ACM SIGOPS HotOS ’17. This is a workshop that focuses on discussing work in some stage of progress. The papers are not as polished as one would find at a full conference and in some cases they are more about being provocative – encouraging the reader to consider some existing problem in a new light.

It is a small function, with a limited number of people invited. Because I am working with one of the people involved in organizing it, when an opportunity to help out arose I was invited to attend. My role was to help out with various functions, including assisting in the live blogging of the workshop (hotos17.tumblr.com) as well as leading tours, handling registration, and generally being part of making the participants feel welcome.

It also provided an opportunity to begin learning more about members of the research community. Several things surprised me somewhat: (1) these are highly specialized researchers. The field is broad enough that there were only ever a handful of experts in any one aspect of it as the discussions and papers flowed around. (2) I was able to actually really understand a couple of the talks. One of them made me angry, partially because it was an attack on people not present that I know. Still, many of the talks were in areas where I only had general understanding, not a specific understanding of the subject matter.

The one talk that annoyed me? Transactional file systems!

Artificial Intelligence in the virtual classroom

December 26, 2016

There have been a number of articles about the AI TA for a class at Georgia Tech. I find these fascinating because I took this class in the summer 2016 session.

I’ve considered ways to use this approach to facilitate online learning in other environments as well and I’d like to look into it more in the future. Nothing definite now, but I’ve had a few preliminary discussions. Let’s hope they bear fruit.

Stay tuned!

Welcome!

November 4, 2016

File Systems

This is my new blog, dedicated to the world of file systems.

I’ve been working with file systems for some time now (longer than I’d care to admit). If you are not familiar with a file system think of it as the software inside your computer that stores your data away and gets it back for you when you want it. It provides the organization to your storage device – whether it is a disk drive on your local computer or storage in some far-away location (“the cloud”).

It plugs into the critical software that runs your computer system and provides services to that applications can find your data as well. When it works right, you hardly notice it – much like plumbing. When it doesn’t work right, you suffer.

I’m not sure where this blog will take me, but I’m going to use it as a mechanism for tracking my own journey, organizing my own research. If you find it interesting as well, that’s awesome! If you don’t, it won’t diminish my own purpose for doing this.

Category Archives: Uncategorized

Recent Posts

Recent Comments

Archives

Categories

Subscribe to Blog via Email