Home » 2018 » April (Page 2)

Monthly Archives: April 2018

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 204 other subscribers
April 2018
S M T W T F S
1234567
891011121314
15161718192021
22232425262728
2930  

The Log-Structured Merge Tree (LSM-Tree)

The Log-Structured Merge Tree
Patrick O’Neil, Edward Cheng, Dieter Gawlick, Elizabeth O’Neil in Acta Informatica, Volume 33, Issue 4pp 351–385.

This paper does not relate to non-volatile memory, but we will see Log-Structured Merge Trees (LSMTs) used in quite a few projects.  From the abstract:

The log-structured mergetree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components.

So LSMTs originate from concerns about the latency issues around disk drives.

In a nutshell, the challenge with disk drives are they have mechanical parts that must be moved in order to read the data.  Data is written in concentric bands around the center.  The angular velocity of the disk platter is the same, but of course the surface velocity is lowest towards the center and fastest towards the outer edge.  The disk drive “head” is moved in and out to read from each of those concentric circles.  Since it can only read what is beneath the head, it also must wait for the desired data to rotate under the head.  This is one reason why faster disk drives (usually measured by the rotations-per-minute number) provide faster response times.  On the other hand, faster disk drives generate more heat and are more expensive to build.

Thus, an important consideration for file systems working on rotating media is the latency to perform random access.  Tape drives have the highest latency, since we have to reposition the tape to get to another location and read its data (there are other complexities as well, such as the fact that tape benefits most from streaming write).  Hard disk drives (HDDs) are not so bad as tape drives in general (though SMR drives act much like tape drives, which is one reason I mention that here.)  Solid State Disks are even better than disk drives, though even for an SSD random access is slower than sequential access – but both are much faster than HDDs.

Some of those papers that I have yet to cover describe the concept of a log-structured file system.  One of the things that I learned when working on the Episode File System was that converting random I/O to sequential I/O was definitely a win (so was asynchronous logging).  It is this observation: converting random I/O to synchronous I/O that provides the benefit of using journaling techniques (the “log”).

So LSMTs capitalize upon this advantage.  Note that an LSMT is not a single data structure; rather it is a general technique for working with systems where insert and delete are the common operations, such as meta-data within a file system, or (key,value) tuples in a key-value store.  It also points out that reading large sequential block is generally more efficient; they cite IBM when noting that a single page read from the DB2 Database takes approximately 10 milliseconds. A read of 64 continuous pages costs about 2 milliseconds per page (125ms total). So batching I/O operations is also an important mechanism for improving performance.

So what is an LSMT?  “An LSM-tree is composed of two or more tree-like component data structures.”  From there the authors describe their initial case: where one tree is memory resident and the other is disk resident.  Note, however, this is the smallest set for a valid LSMT.  Systems with more than two have been built – and one way to use non-volatile memory (NVM) is to add it as another layer to an LSMT.

Figure 2.1 (from the paper) shows the high level structure of the LSMT – a tree-like structure at each level of the storage hierarchy; these levels can have different characteristics, such as being ephemeral (in memory) or persistent (on disk, tape, SSD, etc.)  Figure 2.2 provides greater detail, showing how data is merged from one level to the next (“rolling merge”).  Figure 3.1 shows a generalization of the LSMT, in which data is migrated from one level of the storage hierarchy to the next.

Figure 6.1 then helps motivate this work: data that is seldom accessed (which is most of the data) is “cold” and can be stored in lower cost storage.  Data that is frequently accessed (which is a small amount of the data) is “hot” and benefits from being stored in faster but more expensive storage.  Indeed, there is a substantial body of work at this point that demonstrates how data tends to cycle from being hot to being cold.  Thus, there is a period of migration for “warm” data.  This also helps explain why having a multi-stage model makes sense. This behavior is quite general, in fact.  Disk drives are constructed with caches on them.  The cache is for the hot data, the disk storage for the cold data. SSDs are often structured with multiple classes of NVM; a small amount of expensive but fast NVM and then a larger amount of less expensive (often block oriented) NVM.  Even CPUs work this way (as I will be discussing ad nauseum), where there are multiple levels of caching: L1 cache is small but very fast, L2 cache is larger and slower (and cheaper), L3 cache is again larger and slower.  Then we get to memory (DRAM) which is even slower.  That’s all before we get to storage!

This is quite a long paper: they describe how data is merged from one level to the next as well as do an in-depth analysis of cost versus performance.  But I will leave ferreting out those details to the interested reader.  I got what I came for: a basic description of the tiered nature of LSMTs and how we can use them to make storage more efficient without driving up costs.

Moving to the now

I’ve been gone for a bit; real life keeping me busy.  Part of that has been exploring the current edge of my technology space.  So I’m going to take a break from the historical paper review (but it will come back) and instead start talking about new storage technologies and the interesting things that we can do with them.

Specifically, I have been looking at the brave new world of “non-volatile dual inline memory modules”.  What this means is that we now have persistent memory that is directly accessed via processor primitives.  That means a load or a store instruction can be used to access this kind of memory.  We have actually been talking about persistent memory since the dawn of time.  Many of the early operating systems papers think of disk storage and memory as being different forms of memory.

The big change here was when computers moved away from magnetic core memory into solid state memory.  Intel’s first product was dynamic RAM (though it was invented by IBM).  DRAM was cheaper and faster than core – and much faster than magnetic storage.  Thus the abstraction of memory and disk being part of the same continuum we began to think of them as being distinct.  DRAM was transient; disks and tapes were persistent (note that magnetic core was persistent until someone read it, at which point it lost its state and had to be rewritten – even across power cycles).

Disk storage was often used to contain data from memory, such as for paging or segmentation.  Over the years the operating systems community learned many ways to make disks seem fast: caching, read-ahead, asynchronous I/O, reordering operations, etc.

Computers – and memory – were able to experience vast increases in speed.  Disk drives did improve, but only modestly in comparison.

Memory technologies have continued to evolve.  Persistent memory options increased as well; flash memory is one of the most common today and is used inside solid state drives and NVMe disks, both of which are vastly faster than rotating media disks.

Another memory technology that has been discussed for more than 20 years has been persistent RAM.  One way this has been done in “real products” has been to simply add a backup power source: some sort of battery.  Then if the power is lost, the data contents of volatile memory (DRAM) can be written to persistent storage (e.g., Flash).  This approach has been a stop-gap on the way to the actual persistent memory solution that has been promised for at least a decade: persistent memory that acts like DRAM.  Intel is now starting to ship their new Optane memory products. Samsung has announced their new Z-NAND products.  Other technologies remain “in development” as well.

Why does this matter?  Storage is suddenly getting faster.  We went from 10 millisecond access times to 0.2 millisecond access times (HDDs to SSDs).  Now we are looking at going from 0.2 milliseconds to 200 nanoseconds – six orders of magnitude faster.  This sort of change is profound.  We’ve been talking about it for many years, trying to reason about it.  It is now materializing.  Over the next several years the other promise of NVM is going to materialize: it supports higher density than DRAM.  It is slower than DRAM; where DRAM access times are on the order of 50-100 nanoseconds, NVM access times are on the order of 125-800 nanoseconds (writes being slower than reads).

File systems that have been optimized for working on hard disks or SSDs don’t necessarily make sense on NVM.

Thus, the area I’ve been looking at: expanding my own understanding of NVM.  Since this is still related to my own exploration of file systems, I’ll use this as my soapbox for exploring the space.

Let’s see where this journey goes.  And I promise, I’ll come back to the old file systems papers after this diversion.