
Storage and Retrieval
Storage · Log-Structured Storage · Page-Oriented Storage · B-Tree
This is a skimmed version of the chapter ‘Storage and Retrieval’ from the book design_data_intensive_applications. I would recommend first reading the whole chapter and using these notes for revision purposes only.
INTRODUCTION
- We’ll start by talking about two families of storage engines:
- log-structured storage engines
- page-oriented storage engines, such as B-trees
LOG-STRUCTURED STORAGE ENGINES
- A log-structured storage engine is the simplest form of a database.
- It breaks the database down into segments, typically several megabytes or more in size, and writes are always sequential.
- The log-structured storage engine’s idea is to treat the database as an append-only log file.
How do you write data in a log-structured storage engine?
- New data is always appended to the end of the log. Once written to disk, a record is immutable.
- The data is packaged in segments, and the most recent write is always in the most recent segment.
How do you read data in a log-structured storage engine?
- You start from the most recent segment and traverse backward until you find the key.
- A read is therefore linear in the size of the data, O(n).
To speed up the read, you will need to have some sort of index.
HASH INDEX
- Key-value stores are quite similar to the dictionary type that you can find in most programming languages, which is usually implemented as a hash map (hash table).
- Let’s say our data storage consists only of appending to a file.
- Then the simplest indexing strategy is to keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found
- Whenever you insert a new key-value pair or update an existing key, you also update the hash map to reflect the offset of the data you just wrote.
- When you want to look up a value, use the hash map to find the offset, seek to that location, and read the value.
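To make the mechanics concrete, here is a minimal Python sketch of a hash-indexed append-only log, assuming a length-prefixed binary record format; the names (LogKV, db.log) are illustrative and not taken from Bitcask.

```python
import os
import struct

class LogKV:
    """Append-only log with an in-memory hash index (a simplified sketch)."""

    def __init__(self, path="db.log"):
        self.index = {}                 # key -> byte offset of the latest record
        self.f = open(path, "ab+")      # append-only data file

    def put(self, key: bytes, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        # Record layout: [key_len][val_len][key][value], lengths as 4-byte ints.
        self.f.write(struct.pack(">II", len(key), len(value)) + key + value)
        self.f.flush()
        self.index[key] = offset        # point the hash map at the new record

    def get(self, key: bytes):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        key_len, val_len = struct.unpack(">II", self.f.read(8))
        self.f.seek(key_len, os.SEEK_CUR)   # skip over the key bytes
        return self.f.read(val_len)

db = LogKV()
db.put(b"cat-video", b"42")
db.put(b"cat-video", b"43")    # an update is just another append
print(db.get(b"cat-video"))    # b'43' -- the index points at the latest record
```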
- This is essentially what Bitcask (the default storage engine in Riak) does.
- A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently.
- For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory.
How do we avoid eventually running out of disk space?
- A good solution is to break the log into segments of a certain size by closing a segment file when it reaches that size and making subsequent writes to a new segment file. We can then perform compaction on these segments: throwing away duplicate keys and keeping only the most recent update for each key.
More optimizations
- Since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction
- Segments are never modified after they have been written, so the merged segment is written to a new file.
- The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files.
- After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted.
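As a rough illustration, here is a sketch of compaction and merging over in-memory segments, assuming each segment is simply a list of (key, value) pairs in write order; on disk these would be real segment files.

```python
def compact_and_merge(segments):
    """Merge segments (listed oldest first), keeping only the latest value per key."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value          # later writes overwrite earlier ones
    return list(latest.items())          # the new, smaller merged segment

old_segments = [
    [("a", 1), ("b", 2), ("a", 3)],      # "a" was overwritten within this segment
    [("b", 4), ("c", 5)],
]
print(compact_and_merge(old_segments))   # [('a', 3), ('b', 4), ('c', 5)]
```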
ISSUES IN IMPLEMENTATION
File format
- CSV is not the best format for a log.
- It’s faster and simpler to use a binary format that first encodes the length of a string in bytes, followed by the raw string (without the need for escaping).
Deleting records
- If you want to delete a key and its associated value, you have to append a special deletion record to the data file (sometimes called a tombstone).
- When log segments are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
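Extending the compaction sketch above, the merge can honor tombstones; the TOMBSTONE sentinel here is illustrative (Bitcask uses a special record on disk).

```python
TOMBSTONE = object()    # illustrative sentinel marking a deletion

def merge_with_tombstones(segments):
    latest = {}
    for segment in segments:                 # oldest first
        for key, value in segment:
            latest[key] = value
    # Drop deleted keys entirely from the merged segment.
    return [(k, v) for k, v in latest.items() if v is not TOMBSTONE]

segments = [[("a", 1), ("b", 2)], [("a", TOMBSTONE)]]
print(merge_with_tombstones(segments))       # [('b', 2)]
```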
Crash recovery
- If the database is restarted, the in-memory hash maps are lost.
- In principle, you can restore each segment’s hash map by reading the entire segment file from beginning to end and noting the offset of the most recent value for every key as you go along.
- However, that might take a long time if the segment files are large, which would make server restarts painful.
- Bitcask speeds up recovery by storing a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
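In code, the full-scan recovery could look like this sketch, assuming the length-prefixed record layout from the earlier LogKV example:

```python
import struct

def rebuild_index(path):
    """Rebuild the in-memory hash index by scanning a segment start to end."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.read(8)
            if len(header) < 8:
                break                        # end of file (or a torn record)
            key_len, val_len = struct.unpack(">II", header)
            key = f.read(key_len)
            f.seek(val_len, 1)               # skip the value bytes
            index[key] = offset              # later records win
    return index
```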
Partially written records
- The database may crash at any time, including halfway through appending a record to the log. Bitcask files include checksums, allowing such corrupted parts of the log to be detected and ignored.
Concurrency control
- As writes are appended to the log in a strictly sequential order, a common implementation choice is to have only one writer thread. Data file segments are append-only and otherwise immutable, so they can be read concurrently by multiple threads.
An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value?
BENEFITS OF A SEQUENTIAL, LOG-STRUCTURED DESIGN
- Appending and segment merging are sequential write operations, which are generally much faster than random writes, especially on magnetic spinning-disk hard drives.
- To some extent sequential writes are also preferable on flash-based solid state drives (SSDs).
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable. For example, you don’t have to worry about the case where a crash happened while a value was being overwritten, leaving you with a file containing part of the old and part of the new value spliced together.
- Merging old segments avoids the problem of data files getting fragmented over time.
LIMITATION OF HASH TABLE INDEX
- The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck.
- You could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well.
- It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic.
- Range queries are not efficient.
- For example, you cannot easily scan over all keys between kitty00000 and kitty99999—you’d have to look up each key individually in the hash maps.
SSTABLES AND LSM-TREES [SORTED STRING TABLES AND LOG-STRUCTURED MERGE TREES]
Previously, on simple log-structured storage: each log-structured storage segment is a sequence of key-value pairs. These pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log. Apart from that, the order of key-value pairs in the file does not matter.
- We will make a simple change to the format of our segment files: we require that the sequence of key-value pairs is sorted by key. We call this format Sorted String Table, or SSTable for short.
CREATING AND MAINTAINING SSTABLES
There are plenty of well-known tree data structures that you can use, such as red-black trees or AVL trees. With these data structures, you can insert keys in any order and read them back in sorted order.
- When a write comes in, add it to an in-memory balanced tree data structure (sometimes called a memtable).
- When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file.
- This can be done efficiently because the tree already maintains the key-value pairs sorted by key.
- The new SSTable file becomes the most recent segment of the database.
- While the SSTable is being written out to disk, writes can continue to a new memtable instance.
- In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
- From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
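The whole write/read path might be sketched as follows; a plain dict sorted at flush time stands in for the balanced tree (a real engine would use a red-black or AVL tree), and the SSTables are kept in memory rather than on disk for brevity.

```python
import bisect

class LSMStore:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # stand-in for a red-black/AVL tree
        self.memtable_limit = memtable_limit
        self.sstables = []            # key-sorted segments, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        self.sstables.append(sorted(self.memtable.items()))  # new newest segment
        self.memtable = {}            # writes continue to a fresh memtable

    def get(self, key):
        if key in self.memtable:                    # check the memtable first
            return self.memtable[key]
        for sstable in reversed(self.sstables):     # then newest segment backward
            i = bisect.bisect_left(sstable, (key,))   # binary search: keys sorted
            if i < len(sstable) and sstable[i][0] == key:
                return sstable[i][1]
        return None

store = LSMStore()
for i in range(10):
    store.put(f"key{i}", i)
print(store.get("key3"))   # 3, served from a flushed SSTable
```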
Performance optimizations
- The LSM-tree algorithm can be slow when looking up keys that do not exist in the database:
- you have to check the memtable, then the segments all the way back to the oldest (possibly having to read from disk for each one) before you can be sure that the key does not exist.
- In order to optimize this kind of access, storage engines often use Bloom filters. A Bloom filter can tell you when a key definitely does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys (see the sketch after this list).
- There are also different strategies to determine the order and timing of how SSTables are compacted and merged.
- The most common options are size-tiered and leveled compaction. LevelDB and RocksDB use leveled compaction (hence the name of LevelDB), HBase uses size-tiered, and Cassandra supports both.
- In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables.
- In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space.
- Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective.
- You can efficiently perform range queries (scanning all keys above some minimum and up to some maximum), and because the disk writes are sequential the LSM-tree can support remarkably high write throughput.
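Here is a toy Bloom filter to make the earlier point concrete (this is the sketch referenced in the list above); the bit-array size and the reuse of SHA-256 as a family of hash functions are illustrative, and real engines tune these parameters carefully.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions per key by salting SHA-256.
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means "maybe present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"handbag")
print(bf.might_contain(b"handbag"))   # True
print(bf.might_contain(b"kitty"))     # almost certainly False: skip the disk reads
```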
ADVANTAGES OF SSTABLES OVER LOG SEGMENTS WITH HASH INDEXES
Merging segments is simple and efficient
- This works even if the files are bigger than the available memory: the approach is like the one used in the mergesort algorithm, and it produces a new merged segment file, also sorted by key.
- When multiple segments contain the same key, we keep the value from the most recent segment and discard the values in older segments.
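A sketch of that mergesort-style merge, assuming segments are key-sorted lists given oldest first; heapq.merge stands in for the streaming merge a real engine performs over files.

```python
import heapq

def merge_sstables(segments):
    # Tag entries so that, for equal keys, the newest segment sorts first.
    tagged = [
        [(key, -idx, value) for key, value in seg]
        for idx, seg in enumerate(segments)      # oldest segment has idx 0
    ]
    merged, last_key = [], None
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:                      # first occurrence is the newest
            merged.append((key, value))
            last_key = key
    return merged

old = [("apple", 1), ("kiwi", 2)]
new = [("apple", 9), ("pear", 3)]
print(merge_sstables([old, new]))   # [('apple', 9), ('kiwi', 2), ('pear', 3)]
```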
In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory
- Say you’re looking for the key handiwork, but you don’t know the exact offset of that key in the segment file. However, you do know the offsets for the keys handbag and handsome, and because of the sorting you know that handiwork must appear between those two. This means you can jump to the offset for handbag and scan from there until you find handiwork (or not, if the key is not present in the file).
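A sketch of that lookup, with list positions standing in for byte offsets and the segment held in memory for brevity:

```python
import bisect

# A key-sorted segment; list positions stand in for byte offsets on disk.
segment = [("handbag", "v1"), ("handcuffs", "v2"),
           ("handiwork", "v3"), ("handsome", "v4")]
sparse_index = [("handbag", 0), ("handsome", 3)]   # only some keys are indexed

def lookup(key):
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1         # closest indexed key <= target
    start = sparse_index[i][1] if i >= 0 else 0
    for k, v in segment[start:]:                   # scan forward from there
        if k == key:
            return v
        if k > key:                                # sorted order: key not present
            return None
    return None

print(lookup("handiwork"))   # 'v3', found by scanning forward from 'handbag'
```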
PAGE-ORIENTED STORAGE ENGINE (B-Tree)
A page-oriented storage engine is more common than a log-structured storage engine. One of the indexing structures in a page-oriented storage engine is the B-tree.
B-TREES
- The log-structured indexes we saw earlier break the database down into variable-size segments, typically several megabytes or more in size, and always write a segment sequentially.
- By contrast, B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger), and read or write one page at a time. This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.
- Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer, but on disk instead of in memory.
- One page is designated as the root of the B-tree; whenever you want to look up a key in the index, you start here.
- The page contains several keys and references to child pages.
- Each child is responsible for a continuous range of keys, and the keys between the references indicate where the boundaries between those ranges lie.
- Eventually we get down to a page containing individual keys (a leaf page), which either contains the value for each key inline or contains references to the pages where the values can be found.
The number of references to child pages in one page of the B-tree is called the branching factor. In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.
The algorithm ensures that the tree remains balanced as keys are read, written, and updated: a B-tree with n keys always has a depth of O(log n). (A four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.)
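A minimal sketch of a B-tree lookup, with in-memory dicts standing in for fixed-size on-disk pages:

```python
import bisect

def btree_lookup(page, key):
    # Walk from the root down to a leaf; each interior page holds sorted keys
    # and child references, with keys bounding the child key ranges.
    while not page["leaf"]:
        i = bisect.bisect_right(page["keys"], key)   # child whose range holds key
        page = page["children"][i]
    i = bisect.bisect_left(page["keys"], key)
    if i < len(page["keys"]) and page["keys"][i] == key:
        return page["values"][i]
    return None

leaf1 = {"leaf": True, "keys": ["ant", "bee"], "values": [1, 2]}
leaf2 = {"leaf": True, "keys": ["cat", "dog"], "values": [3, 4]}
root = {"leaf": False, "keys": ["cat"], "children": [leaf1, leaf2]}
print(btree_lookup(root, "dog"))   # 4, reached in O(log n) page reads
```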
ADVANTAGES OF PAGE-ORIENTED STORAGE ENGINE
- Because a B-tree is organized as a balanced tree structure, a reader only ever needs to walk down from the root to a leaf.
- Reads are fast because a lookup takes only O(log n) time; most databases fit in a B-tree that is four to five levels deep.
- B-trees have accumulated many optimizations over the years, since they are a more mature technology than log-structured storage engines.
- Lastly, since each key exists in exactly one place in a B-tree, B-trees are attractive for databases that want to offer strong transactional semantics. This is why the internal storage engines of many relational databases use B-trees.
DISADVANTAGES OF PAGE-ORIENTED STORAGE ENGINE
- If the server crashes in the middle of updating a value in a B-tree, the index can be left corrupted, for example with an orphaned child page. Therefore, most B-tree designs write every modification to a write-ahead log (WAL) before performing the operation itself, to make the storage system reliable (see the sketch after this list). For this reason, a write to a B-tree often goes to multiple places on disk, slowing writes down compared to an LSM-tree.
- There is always some space overhead in a B-tree, because pages deliberately leave some space free to deal with fragmentation.
- Updating values in a concurrent environment is quite complex in a B-tree: you need careful concurrency control to avoid inconsistent states in the tree. Usually, the tree data structure is protected with latches (lightweight locks) when it is accessed or updated.
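A sketch of the write-ahead-log idea from the first point above (the sketch referenced there); the file name btree.wal and the dict of pages are illustrative.

```python
import os

def apply_update(wal, pages, page_id, key, value):
    # 1. Append the intended change to the WAL and force it to disk.
    wal.write(f"SET {page_id} {key} {value}\n".encode())
    wal.flush()
    os.fsync(wal.fileno())          # durable before the page is touched
    # 2. Only then modify the page (a dict stands in for an on-disk page).
    pages[page_id][key] = value     # after a crash, replay the WAL on restart

with open("btree.wal", "ab") as wal:
    pages = {0: {}}
    apply_update(wal, pages, 0, "cat", "42")
```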
COMPARING B-TREES AND LSM TREES
- As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction. However, benchmarks are often inconclusive and sensitive to details of the workload. You need to test systems with your particular workload in order to make a valid comparison. In this section we will briefly discuss a few things that are worth considering when measuring the performance of a storage engine.
ADVANTAGES OF LSM TREES
- A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself (and perhaps again as pages are split). There is also overhead from having to write an entire page at a time, even if only a few bytes in that page changed. Some storage engines even overwrite the same page twice in order to avoid ending up with a partially updated page in the event of a power failure.
- Log-structured indexes also rewrite data multiple times due to repeated compaction and merging of SSTables. This effect—one write to the database resulting in multiple writes to the disk over the course of the database’s lifetime—is known as write amplification.
It is of particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.
- In write-heavy applications, the performance bottleneck might be the rate at which the database can write to disk. In this case, write amplification has a direct performance cost: the more that a storage engine writes to disk, the fewer writes per second it can handle within the available disk bandwidth.
- Moreover, LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree [26]. This difference is particularly important on magnetic hard drives, where sequential writes are much faster than random writes.
- LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees.
- B-tree storage engines leave some disk space unused due to fragmentation: when a page is split or when a row cannot fit into an existing page, some space in a page remains unused.
- Since LSM-trees are not page-oriented and periodically rewrite SSTables to remove fragmentation, they have lower storage overheads, especially when using leveled compaction.
DOWNSIDES OF LSM TREES
- A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes.
- Even though storage engines try to perform compaction incrementally and without affecting concurrent access, disks have limited resources, so it can easily happen that a request needs to wait while the disk finishes an expensive compaction operation.
- The impact on throughput and average response time is usually small, but at higher percentiles the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable.
- Another issue with compaction arises at high write throughput: the disk’s finite write bandwidth needs to be shared between the initial write (logging and flushing a memtable to disk) and the compaction threads running in the background.
- When writing to an empty database, the full disk bandwidth can be used for the initial write, but the bigger the database gets, the more disk bandwidth is required for compaction.
- If write throughput is high and compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes.
- In this case, the number of unmerged segments on disk keeps growing until you run out of disk space, and reads also slow down because they need to check more segment files. Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation.
- An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments.
- This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree.
B-trees are very ingrained in the architecture of databases and provide consistently good performance for many workloads, so it’s unlikely that they will go away anytime soon. In new datastores, log-structured indexes are becoming increasingly popular. There is no quick and easy rule for determining which type of storage engine is better for your use case, so it is worth testing empirically.