Home » Uncategorized
Category Archives: Uncategorized
Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory
Shivaram Venkataraman, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell in Proceedings of File Systems and Storage Technology 2011, Volume 11, pp 61-75, USENIX.
In this paper the authors turn their attention to data structure considerations for Non-Volatile Memory (NVM). Unlike the previous papers I have covered (Mnemosyne and NV-Heaps) they look at data structures specifically optimized to exploit the capabilities of NVM. From the abstract:
For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allows programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging.
Some other aspects of this paper that stand out:
- They are looking at NVM replacing both DRAM and storage – thus, they view this as a single level store.
- They use versioning to protect their data structures, versus logging.
- They describe how to achieve this without hardware changes.
The paper has a good review of NVM memory technologies that may emerge. Table 1 (from the paper) underscores the dramatic decrease in latency. This is why we’ve been waiting for this for more than 10 years now. It really does change the platform in ways that we have not yet realized.
But it is not just the speed aspect that matters, it is also the persistence aspect. The density is much larger for these memories as well. Anyone that has looked at an NVMe M.2 drive can notice how few and small the components on it.
Do we treat it as storage? If so, perhaps we should look to the file systems world for some insight. The authors turn to the WAFL shadow page mechanism. They point to BTRFS and their use of this technique with B-trees. They dismiss this approach, concluding that they have “fewer data-copies” in CDDS. They distinguish this work because it is byte addressable versus prior work that was block oriented (page addressable). Again, the lessons learned from working with it aren’t directly applicable. They do point out that using NVM makes sense in a world of large, persistent storage backed data farms. So there is a need, and one they see fulfilled by NVM. It just needs efficient use of that NVM.
Thus, the authors walk their own path.
The speed of NVM is such that direct access is the only mechanism that makes sense. System calls impose too much overhead, doubling the cost of accessing the NVM itself. Thus they posit that it will be direct access (and indeed that is what seems to come to pass).
They observe that one of the challenges for persistent data is that CPUs do not provide mechanisms for ordering persistent writes (though they do, but at a fairly coarse granularity of a fence.) So they describe the issues that must be handled:
- Reordering of writes from the caching behavior of the CPU itself as well as a multi-level cache hierarchy.
- Failure semantics of atomic operations across power failures.
They summarize the various approaches available to them for ensuring data has been stored properly in NVM. This includes memory fences, cache writeback and invalidate operations, marking memory as non-cacheable (which forces write-back), cache line flushes, and atomic processor operations. They point out that this is not sufficient for more complex updates, such as tree rebalancing operation. This leads them to versioning.
I found it interesting that their goals were similar to those I have seen previously: durability, consistency, scalability, and ease-of-use for the programmer. They also note that they focus on physical consistency of the data contents in memory. Logical consistency of higher level meta-data structures is not addressed in the context of this work.
Thus, the authors point out that a CDDS is an abstract idea; they demonstrate how they envision using it by implementing a b-tree structure (Figure 1 is from the paper as they describe their B-tree).
In versioning, changes are not made in place to the current version; instead, a new version is written. The current version is immutable (at least as long as it is the current version). Atomic operation and copy-on-write techniques are used to make changes persistent. Once done, a new version number is assigned and the new version becomes the current version. The old version can be recycled once its reference count drops to zero.
Failure recovery then becomes a function of “cleaning up” any in-progress operations that had not been written to disk.
The authors then walk through their B-tree example. They explain their use of versioning, they provide pseudo-code for the various B-tree operations (lookup, insert, delete) as well as the internal operations needed to support them.
They evaluate their solution by simulating NVM on top of DRAM (a common solution as we have seen). They compare against BerkeleyDB, STX B-Tree, Tembo, Cassandra, and Redis. They were slower than the entirely in-memory STX B-Tree, presumably due to the cost overhead. They are much faster than BerkeleyDB (even BerkeleyDB runing on a RAM disk.) They also tested using YCSB as their “end to end” benchmark.
In the end, they do demonstrate that it is possible to rebuild existing data structures – preserving the interface – so they work efficiently with NVM. They even point out it does not require processor changes to do so. Given that better processor cache control mechanisms have been introduced since then, I would expect that exploiting them will lead to even better performance.
This week I had the opportunity to attend ACM SIGOPS HotOS ’17. This is a workshop that focuses on discussing work in some stage of progress. The papers are not as polished as one would find at a full conference and in some cases they are more about being provocative – encouraging the reader to consider some existing problem in a new light.
It is a small function, with a limited number of people invited. Because I am working with one of the people involved in organizing it, when an opportunity to help out arose I was invited to attend. My role was to help out with various functions, including assisting in the live blogging of the workshop (hotos17.tumblr.com) as well as leading tours, handling registration, and generally being part of making the participants feel welcome.
It also provided an opportunity to begin learning more about members of the research community. Several things surprised me somewhat: (1) these are highly specialized researchers. The field is broad enough that there were only ever a handful of experts in any one aspect of it as the discussions and papers flowed around. (2) I was able to actually really understand a couple of the talks. One of them made me angry, partially because it was an attack on people not present that I know. Still, many of the talks were in areas where I only had general understanding, not a specific understanding of the subject matter.
The one talk that annoyed me? Transactional file systems!
There have been a number of articles about the AI TA for a class at Georgia Tech. I find these fascinating because I took this class in the summer 2016 session.
I’ve considered ways to use this approach to facilitate online learning in other environments as well and I’d like to look into it more in the future. Nothing definite now, but I’ve had a few preliminary discussions. Let’s hope they bear fruit.
This is my new blog, dedicated to the world of file systems.
I’ve been working with file systems for some time now (longer than I’d care to admit). If you are not familiar with a file system think of it as the software inside your computer that stores your data away and gets it back for you when you want it. It provides the organization to your storage device – whether it is a disk drive on your local computer or storage in some far-away location (“the cloud”).
It plugs into the critical software that runs your computer system and provides services to that applications can find your data as well. When it works right, you hardly notice it – much like plumbing. When it doesn’t work right, you suffer.
I’m not sure where this blog will take me, but I’m going to use it as a mechanism for tracking my own journey, organizing my own research. If you find it interesting as well, that’s awesome! If you don’t, it won’t diminish my own purpose for doing this.