Overview

In this tutorial, you will learn how to build a simple LSM-Tree storage engine in the Rust programming language.

What is LSM, and Why LSM?

The log-structured merge tree (LSM tree) is a data structure for maintaining key-value pairs. It is widely used as the underlying storage engine of distributed database systems like TiDB and CockroachDB. RocksDB, based on LevelDB, is an implementation of an LSM-Tree storage engine. It provides a wide range of key-value access functionalities and is used in a lot of production systems.

Generally speaking, an LSM Tree is an append-friendly data structure. It is intuitive to compare an LSM tree with other key-value data structures like the RB-Tree and the B-Tree. In an RB-Tree or B-Tree, all data operations are in-place: when you update the value corresponding to a key, the value is overwritten at its original memory or disk location. In an LSM Tree, by contrast, all write operations (insertions, updates, and deletions) are performed elsewhere. These operations are batched into SST (sorted string table) files and written to disk. Once written to the disk, an SST file is never changed. The operations are applied lazily on disk by a special background task called compaction, which merges multiple SST files and removes unused data.

This architectural design makes the LSM tree easy to work with:

  1. Data are immutable on persistent storage, which means that it is easier to offload the background tasks (compaction) to remote servers. It is also feasible to directly store and serve data from cloud-native storage systems like S3.
  2. An LSM tree can balance between read, write and space amplification by changing the compaction algorithm. The data structure itself is super versatile and can be optimized for different workloads.

Prerequisites of this Tutorial

  • You should know the basics of the Rust programming language. Reading the Rust book is enough.
  • You should know the basic concepts of key-value storage engines, i.e., why we need a somewhat complex design to achieve persistence. If you have no prior experience with database systems and storage systems, you can implement Bitcask in PingCAP Talent Plan.
  • Knowing the basics of an LSM tree is not a requirement, but we recommend you read about it, e.g., the overall idea of LevelDB. Doing so will familiarize you with concepts like mutable and immutable mem-tables, SST, compaction, WAL, etc.

Overview of LSM

An LSM storage engine generally contains 3 parts:

  1. Write-ahead log to persist temporary data for recovery.
  2. SSTs on the disk for maintaining a tree structure.
  3. Mem-tables in memory for batching small writes.

The storage engine generally provides the following interfaces:

  • Put(key, value): store a key-value pair in the LSM tree.
  • Delete(key): remove a key and its corresponding value.
  • Get(key): get the value corresponding to a key.
  • Scan(range): get a range of key-value pairs.

To ensure persistence,

  • Sync(): ensure all the operations before sync are persisted to the disk.

Some engines choose to combine Put and Delete into a single operation called WriteBatch, which accepts a batch of key-value pairs.
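
As a sketch, this interface could look like the trait below. The names and exact signatures are illustrative, not the ones used in the starter code:

use std::ops::Bound;

use anyhow::Result;

/// An illustrative engine interface; the starter code may differ in names,
/// error types, and iterator plumbing.
pub trait KvEngine {
    /// The iterator type returned by `scan`.
    type ScanIter;

    /// Store a key-value pair in the LSM tree.
    fn put(&self, key: &[u8], value: &[u8]) -> Result<()>;
    /// Remove a key and its corresponding value.
    fn delete(&self, key: &[u8]) -> Result<()>;
    /// Get the value corresponding to a key, if present.
    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>>;
    /// Get a range of key-value pairs.
    fn scan(&self, lower: Bound<&[u8]>, upper: Bound<&[u8]>) -> Result<Self::ScanIter>;
    /// Ensure all operations before this call are persisted to the disk.
    fn sync(&self) -> Result<()>;
}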

In this tutorial, we assume the LSM tree uses a leveled compaction algorithm, which is commonly used in real-world systems.

Write Flow

The write flow of LSM contains 4 steps:

  1. Write the key-value pair to the write-ahead log, so that it can be recovered after the storage engine crashes.
  2. Write the key-value pair to the memtable. After (1) and (2) complete, we can notify the user that the write operation is completed (see the sketch after this list).
  3. When a memtable is full, we flush it to the disk as an SST file in the background.
  4. We compact some files from one level into lower levels to maintain a good shape for the LSM tree, so that read amplification stays low.
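
Here is a minimal sketch of steps (1) and (2), assuming hypothetical Wal and MemTable types. The point is the ordering: the WAL append must happen before the memtable insert, so any write visible to readers is also recoverable:

use anyhow::Result;

// Hypothetical stand-ins for the real components.
struct Wal;
struct MemTable;

impl Wal {
    fn append(&self, _key: &[u8], _value: &[u8]) -> Result<()> { Ok(()) }
}
impl MemTable {
    fn put(&self, _key: &[u8], _value: &[u8]) -> Result<()> { Ok(()) }
}

struct LsmStorage {
    wal: Wal,
    memtable: MemTable,
}

impl LsmStorage {
    fn put(&self, key: &[u8], value: &[u8]) -> Result<()> {
        self.wal.append(key, value)?;   // (1) persist for crash recovery
        self.memtable.put(key, value)?; // (2) the write becomes visible
        Ok(())                          // (3) flush and (4) compaction run in the background
    }
}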

Read Flow

When we want to read a key,

  1. We will first probe all the memtables from latest to oldest.
  2. If the key is not found, we will then search the SSTs on disk to find the data.

Tutorial Overview

In this tutorial, we will build the LSM tree structure in 7 days:

  • Day 1: Block encoding. SSTs are composed of multiple data blocks. We will implement the block encoding.
  • Day 2: SST encoding.
  • Day 3: MemTable and Merge Iterators.
  • Day 4: Block cache and engine. To reduce disk I/O and maximize performance, we will use moka-rs to build a block cache for the LSM tree. On this day we will get a functional (but not persistent) key-value engine with get, put, scan, and delete APIs.
  • Day 5: Compaction. Now it's time to maintain a leveled structure for SSTs.
  • Day 6: Recovery. We will implement WAL and manifest so that the engine can recover after restart.
  • Day 7: Bloom filter and key compression. They are widely-used optimizations in LSM tree structures.

Development Guide

We provide you with starter code (see the mini-lsm-starter crate), where we simply replace all function bodies with unimplemented!(). You can start your project based on this starter code. We provide test cases, but they are very simple. We recommend you think carefully about your implementation and write test cases yourself.

  • You can use cargo x scheck to run all test cases and do style check in your codebase.
  • You can use cargo x copy-test dayX to copy test cases to the starter code.

About the Author

As of writing (at the end of 2022), Chi is a first-year master's student at Carnegie Mellon University. He has been using the Rust programming language since 2018. He has worked on a variety of database systems, including TiKV, AgateDB, TerarkDB, RisingLight, and RisingWave. In his first semester at CMU, he worked as a teaching assistant for CMU's 15-445/645 Intro to Database Systems course, where he built a new SQL processing layer for the BusTub educational database system, added more query optimization content to the course, and made the course more challenging than ever before. Chi is interested in exploring how the Rust programming language can fit into the database world. Check out his previous tutorial on building a vectorized expression framework if you are also interested in that topic.

Get Started

The starter code and reference solution are available at https://github.com/skyzh/mini-lsm.

Install Rust

See https://rustup.rs for more information.

Clone the repo

git clone https://github.com/skyzh/mini-lsm

Starter code

cd mini-lsm/mini-lsm-starter
code .

Install Tools

cargo x install-tools

Run tests

cargo x copy-test day1
cargo x scheck

Block Builder and Block Iterator

In this part, you will need to modify:

  • src/block/builder.rs
  • src/block/iterator.rs
  • src/block.rs

You can use cargo x copy-test day1 to copy our provided test cases to the starter code directory. After you have finished this part, use cargo x scheck to check the style and run all test cases. If you want to write your own test cases, write a new module #[cfg(test)] mod user_tests { /* your test cases */ } in block.rs. Remember to remove #![allow(...)] at the top of the modules you modified so that cargo clippy can actually check the styles.

Task 1 - Block Builder

Block is the minimum read unit in LSM. It is generally 4KB in size, similar to database pages. In each block, we will store a sequence of sorted key-value pairs.

You will need to modify BlockBuilder to build the encoded data and the offset array. The block contains two parts: data and offsets.

|          data         |           offsets         |
|entry|entry|entry|entry|offset|offset|offset|offset|num_of_elements|

When a user adds a key-value pair to a block (which is called an entry), we will need to serialize it into the following format:

|                             entry1                            |
| key_len (2B) | key (keylen) | value_len (2B) | value (varlen) | ... |

Key length and value length are both 2 bytes, which means their maximum length is 65535.

We assume that keys will never be empty, while values can be empty. An empty value means that the corresponding key has been deleted, in the view of other parts of the system. For the block builder and iterator, we just treat empty values as-is.

At the end of the block, we will store the offsets of each entry and the total number of entries. For example, if the first entry is at the 0th position of the block, and the second entry is at the 12th position,

|offset|offset|num_of_elements|
|   0  |  12  |       2       |

The footer of the block will be as above. Each of the numbers is stored as a u16.

The block has a size limit, which is target_size. Unless the first key-value pair exceeds the target block size, you should ensure that the encoded block size is always less than or equal to target_size.

The BlockBuilder will produce the data part and the unencoded entry offsets when build is called. This information will be stored in the Block struct. Because key-value entries are stored in the raw format and offsets are kept in a separate vector, decoding avoids unnecessary memory allocations and processing overhead -- all you need to do is copy the raw block data into the data vector and decode the entry offsets every 2 bytes, instead of creating something like Vec<(Vec<u8>, Vec<u8>)> to hold all the key-value pairs of a block in memory. This compact memory layout is very efficient. Block::encode and Block::decode will encode to / decode from the data layout illustrated in the figures above.
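
For example, the encode / decode pair over this layout could look like the following sketch, assuming the bytes crate and the two-field Block struct described above:

use bytes::{Buf, BufMut, Bytes};

pub struct Block {
    data: Vec<u8>,     // raw key-value entries
    offsets: Vec<u16>, // start offset of each entry within `data`
}

impl Block {
    /// Encode to | data | offsets | num_of_elements |, footer numbers as u16.
    pub fn encode(&self) -> Bytes {
        let mut buf = self.data.clone();
        for offset in &self.offsets {
            buf.put_u16(*offset);
        }
        buf.put_u16(self.offsets.len() as u16);
        buf.into()
    }

    pub fn decode(data: &[u8]) -> Self {
        // the last u16 is the number of entries
        let num_of_elements = (&data[data.len() - 2..]).get_u16() as usize;
        let data_end = data.len() - 2 - num_of_elements * 2;
        // decode the entry offsets, 2 bytes each
        let offsets = data[data_end..data.len() - 2]
            .chunks(2)
            .map(|mut chunk| chunk.get_u16())
            .collect();
        Self {
            data: data[..data_end].to_vec(),
            offsets,
        }
    }
}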

Task 2 - Block Iterator

Given a block object, we will need to extract the key-value pairs. To do this, we create an iterator over a block and find the information we want.

BlockIterator can be created with an Arc<Block>. If create_and_seek_to_first is called, it will be positioned at the first key in the block. If create_and_seek_to_key is called, the iterator will be positioned at the first key which is >= the provided key. For example, if keys 1, 3, 5 are in a block,

let mut iter = BlockIterator::create_and_seek_to_key(block, b"2");
assert_eq!(iter.key(), b"3");

Seeking 2 will position the iterator at the next available key after 2, which is 3.

The iterator should copy the key and the value from the block and store them inside the iterator, so that users can access them without any extra copy through fn key(&self) -> &[u8] (and the corresponding value function), which directly return references to the locally-stored key and value.

When next is called, the iterator will move to the next position. If we reach the end of the block, we can set key to empty and return false from is_valid, so that the caller can switch to another block if possible.
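
Here is one possible shape for seek_to_key, assuming a hypothetical seek_to_index helper that decodes the entry at a given offset index and marks the iterator invalid when the index is past the end:

impl BlockIterator {
    /// Position the iterator at the first key >= `key` using binary search
    /// over the offset array.
    pub fn seek_to_key(&mut self, key: &[u8]) {
        let mut low = 0;
        let mut high = self.block.offsets.len();
        while low < high {
            let mid = low + (high - low) / 2;
            self.seek_to_index(mid);
            if self.key() < key {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        // `low` is the first index whose key >= `key`; it may be past the end,
        // in which case the iterator becomes invalid.
        self.seek_to_index(low);
    }
}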

After implementing this part, you should be able to pass all tests in block/tests.rs.

Extra Tasks

Here is a list of extra tasks you can do to make the block encoding more robust and efficient.

Note: Some test cases might not pass after implementing this part. You might need to write your own test cases.

  • Implement block checksum. Verify checksum when decoding the block.
  • Compress / uncompress block. Compress on build and uncompress on decoding.

SST Builder and SST Iterator

In this part, you will need to modify:

  • src/table/builder.rs
  • src/table/iterator.rs
  • src/table.rs

You can use cargo x copy-test day2 to copy our provided test cases to the starter code directory. After you have finished this part, use cargo x scheck to check the style and run all test cases. If you want to write your own test cases, write a new module #[cfg(test)] mod user_tests { /* your test cases */ } in table.rs. Remember to remove #![allow(...)] at the top of the modules you modified so that cargo clippy can actually check the styles.

Task 1 - SST Builder

SST is composed of data blocks and index blocks stored on the disk. Usually, data blocks are lazily loaded -- they will not be loaded into memory until a user requests them. Index blocks can also be loaded on demand, but in this tutorial, we make the simplifying assumption that all SST index blocks (meta blocks) fit in memory. Generally, an SST file is 256MB in size.

The SST builder is similar to the block builder -- users will call add on the builder. You should maintain a BlockBuilder inside the SST builder and split blocks when necessary. Also, you will need to maintain block metadata BlockMeta, which includes the first key and the offset of each block. The build function will encode the SST, write everything to disk using FileObject::create, and return an SsTable object. Note that in part 2, you don't need to actually write the data to the disk. Just store everything in memory as a vector until we implement a block cache.

The encoding of SST is like:

| data block | data block | data block | data block | meta block | meta block offset (u32) |
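
A sketch of the build-time encoding under this layout; the BlockMeta::encode_to helper is hypothetical, standing in for however you choose to serialize the first key and offset of each block:

use bytes::BufMut;

/// Append the meta block and its offset after the already-encoded data blocks.
fn encode_sst(data_blocks: Vec<u8>, metas: &[BlockMeta]) -> Vec<u8> {
    let mut buf = data_blocks;
    let meta_offset = buf.len() as u32; // where the meta block starts
    for meta in metas {
        meta.encode_to(&mut buf); // hypothetical: first key + block offset
    }
    buf.put_u32(meta_offset); // footer: meta block offset (u32)
    buf
}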

You also need to implement the estimated_size function of SsTableBuilder, so that the caller can know when it can start a new SST to write data. The function doesn't need to be very accurate. Given the assumption that data blocks contain much more data than the meta block, we can simply return the size of the data blocks as estimated_size.

You can also align blocks to a 4KB boundary so as to make it possible to do direct I/O in the future. This is an optional optimization.

Task 2 - SST Iterator

Like BlockIterator, you will need to implement an iterator over an SST. Note that you should load data on demand. For example, if your iterator is at block 1, it should not hold any other block content in memory until it reaches the next block.

SsTableIterator should implement the StorageIterator trait, so that it can be composed with other iterators in the future.

One thing to note is the seek_to_key function. Basically, you will need to do a binary search on the block metadata to find which block might contain the key. It is possible that the key doesn't exist in the LSM tree, in which case the block iterator will be invalid immediately after the seek. For example,

| block 1 | block 2 | block meta |  
| a, b, c | e, f, g | 1: a, 2: e |

If we do seek(b) in this SST, it is quite simple -- using binary search, we know that block 1 contains keys in the range a <= key < e. Therefore, we load block 1 and seek the block iterator to the corresponding position.

But if we do seek(d), we will be positioned at block 1, and seeking d in block 1 will reach the end of the block. Therefore, we should check if the iterator is invalid after the seek, and switch to the next block if necessary.
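
A sketch of this logic; the field names (table, block_metas, first_key) and the read_block_iter helper are illustrative, standing in for however you load a block and create a block iterator over it:

impl SsTableIterator {
    /// Seek to the first entry with key >= `key`, falling through to the next
    /// block if the seek ran off the end of the candidate block.
    pub fn seek_to_key(&mut self, key: &[u8]) -> anyhow::Result<()> {
        // partition_point counts the blocks whose first key <= key;
        // the candidate block is the last of those (or block 0 if none).
        let mut idx = self
            .table
            .block_metas
            .partition_point(|meta| meta.first_key.as_ref() <= key)
            .saturating_sub(1);
        self.block_iter = self.table.read_block_iter(idx)?;
        self.block_iter.seek_to_key(key);
        if !self.block_iter.is_valid() && idx + 1 < self.table.block_metas.len() {
            idx += 1;
            self.block_iter = self.table.read_block_iter(idx)?;
            self.block_iter.seek_to_first();
        }
        Ok(())
    }
}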

Extra Tasks

Here is a list of extra tasks you can do to make the SST encoding more robust and efficient.

Note: Some test cases might not pass after implementing this part. You might need to write your own test cases.

  • Implement index checksum. Verify checksum when decoding.
  • Explore different SST encoding and layout. For example, in the Lethe paper, the author adds secondary key support to SST.

Mem Table and Merge Iterators

In this part, you will need to modify:

  • src/iterators/merge_iterator.rs
  • src/iterators/two_merge_iterator.rs
  • src/mem_table.rs

You can use cargo x copy-test day3 to copy our provided test cases to the starter code directory. After you have finished this part, use cargo x scheck to check the style and run all test cases. If you want to write your own test cases, write a new module #[cfg(test)] mod user_tests { /* your test cases */ } in mem_table.rs. Remember to remove #![allow(...)] at the top of the modules you modified so that cargo clippy can actually check the styles.

This is the last part covering the basic building blocks of an LSM tree. After implementing the merge iterators, we can easily merge data from the different parts of the data structure (mem-tables + SSTs) and get an iterator over all the data. In part 4, we will compose all of these pieces together to make a real storage engine.

Task 1 - Mem Table

In this tutorial, we use crossbeam-skiplist as the implementation of the memtable. A skiplist is like a linked list, where data is stored in list nodes and will not be moved in memory. Instead of using a single pointer to the next element, a node in a skiplist contains multiple pointers, allowing the user to "skip some elements", so that we can achieve O(log n) search, insertion, and deletion.

In a storage engine, users will create iterators over the data structure. Generally, once a user modifies the data structure, any existing iterator becomes invalid (which is the case for C++ STL and Rust containers). However, skiplists allow us to access and modify the data structure at the same time, potentially improving performance under concurrent access. Some papers argue that skiplists are a poor choice, but the property that data stays in place in memory makes the implementation easier for us.

In mem_table.rs, you will need to implement a mem-table based on crossbeam-skiplist. Note that the memtable only supports get, scan, and put -- there is no delete. A deletion is represented as a tombstone (key -> empty value), and the actual data will be removed during the compaction process (day 5). Note that all of the get, scan, and put functions only take &self, which means that we can call these operations concurrently.
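
A minimal sketch of such a mem-table (the exact struct in the starter code may differ):

use std::sync::Arc;

use bytes::Bytes;
use crossbeam_skiplist::SkipMap;

/// A mem-table on top of crossbeam-skiplist. All operations take &self,
/// so they can be called concurrently.
pub struct MemTable {
    map: Arc<SkipMap<Bytes, Bytes>>,
}

impl MemTable {
    pub fn create() -> Self {
        Self {
            map: Arc::new(SkipMap::new()),
        }
    }

    pub fn get(&self, key: &[u8]) -> Option<Bytes> {
        self.map.get(key).map(|entry| entry.value().clone())
    }

    /// An empty `value` acts as a tombstone marking the key as deleted.
    pub fn put(&self, key: &[u8], value: &[u8]) {
        self.map
            .insert(Bytes::copy_from_slice(key), Bytes::copy_from_slice(value));
    }
}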

Task 2 - Mem Table Iterator

You can now implement an iterator MemTableIterator for MemTable. memtable.iter(start, end) will create an iterator that returns all elements within the range between start and end. Here, start and end are std::ops::Bound, which has 3 variants: Unbounded, Included(key), and Excluded(key). The expressiveness of std::ops::Bound eliminates the need to memorize whether an API uses a closed range or an open range.

Note that crossbeam-skiplist's iterator has the same lifetime as the skiplist itself, which means that we will always need to provide a lifetime when using the iterator. This is very hard to use. You can use the ouroboros crate to create a self-referential struct that erases the lifetime. You will find the ouroboros examples helpful.

pub struct MemTableIterator {
    /// hold a reference to the skiplist so that the iterator remains valid
    map: Arc<SkipList>,
    /// the lifetime of the iterator should be the same as the `MemTableIterator` struct itself
    iter: SkipList::Iter<'this>,
}

You will also need to convert the Rust-style iterator API to our storage iterator. In Rust, we use next() -> Option<Data>. But in this tutorial, next doesn't have a return value, and the data should be fetched by key() and value(). You will need to think of a way to implement this.

Spoiler: the MemTableIterator struct
#[self_referencing]
pub struct MemTableIterator {
    map: Arc<SkipMap<Bytes, Bytes>>,
    #[borrows(map)]
    #[not_covariant]
    iter: SkipMapRangeIter<'this>,
    item: (Bytes, Bytes),
}

We have map serving as a reference to the skipmap, iter as a self-referential item of the struct, and item as the last item fetched from the iterator. You might have thought of using something like iter::Peekable, but it requires &mut self when retrieving the key and value. Therefore, one approach is to (1) fetch the first element from the iterator when initializing the MemTableIterator and store it in item, and (2) when next is called, fetch the next element from the inner iter and store it in item, moving the inner iter to the next position.

In this design, you might have noticed that as long as we have the iterator object, the mem-table cannot be freed from the memory. In this tutorial, we assume user operations are short, so that this will not cause big problems. See extra task for possible improvements.

You can also consider using AgateDB's skiplist implementation, which avoids the problem of creating a self-referential struct.

Task 3 - Merge Iterator

Now that you have a lot of mem-tables and SSTs, you might want to merge them to get the latest occurrence of each key. In merge_iterator.rs, we have MergeIterator, which is an iterator that merges all iterators of the same type. An iterator at a lower index position in the vector passed to the new function has a higher priority; that is to say, if we have:

iter1: 1->a, 2->b, 3->c
iter2: 1->d
iter: MergeIterator::create(vec![iter1, iter2])

The final iterator will produce 1->a, 2->b, 3->c. Because iter1 has a higher priority, its data overwrites the data in the other iterators.

You can use a BinaryHeap to implement this merge iterator. Note that you should never put any invalid iterator inside the binary heap. One common pitfall is on error handling. For example,

if let Some(mut inner_iter) = self.iters.peek_mut() {
    inner_iter.next()?; // <- will cause problems
}

If next returns an error (e.g., due to disk failure, network failure, checksum error, etc.), the iterator is no longer valid. However, when we leave the if condition and return the error to the caller, PeekMut's drop will try to move the element within the heap, which causes an access to an invalid iterator. Therefore, you will need to do all error handling yourself instead of using ? within the scope of PeekMut.

You will also need to define a wrapper for the storage iterator so that BinaryHeap can compare across all iterators.
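
One possible wrapper is sketched below, assuming the StorageIterator trait exposes key(). It compares by the current key first and by iterator index second (so earlier iterators win ties), and reverses the ordering because std's BinaryHeap is a max-heap while we want the smallest key on top:

use std::cmp::Ordering;

struct HeapWrapper<I: StorageIterator>(usize, Box<I>);

impl<I: StorageIterator> PartialEq for HeapWrapper<I> {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl<I: StorageIterator> Eq for HeapWrapper<I> {}

impl<I: StorageIterator> PartialOrd for HeapWrapper<I> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<I: StorageIterator> Ord for HeapWrapper<I> {
    fn cmp(&self, other: &Self) -> Ordering {
        // smallest key first; on equal keys, the lower index wins
        self.1
            .key()
            .cmp(other.1.key())
            .then(self.0.cmp(&other.0))
            .reverse()
    }
}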

Task 4 - Two Merge Iterator

The LSM has two structures for storing data: the mem-tables in memory and the SSTs on disk. After we construct the iterators over all the SSTs and all the mem-tables respectively, we will need a new iterator to merge iterators of two different types. That is TwoMergeIterator.

You can implement TwoMergeIterator in two_merge_iterator.rs. Similar to MergeIterator, if the same key is found in both of the iterators, the first iterator takes precedence.
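
A sketch of the core logic, again assuming the StorageIterator trait with is_valid, key, and next:

struct TwoMergeIterator<A, B> {
    a: A, // e.g., a MergeIterator over mem-tables
    b: B, // e.g., a MergeIterator over SSTs
}

impl<A: StorageIterator, B: StorageIterator> TwoMergeIterator<A, B> {
    /// Whether the current entry should come from `a`; `a` wins ties.
    fn use_a(&self) -> bool {
        if !self.a.is_valid() {
            return false;
        }
        if !self.b.is_valid() {
            return true;
        }
        self.a.key() <= self.b.key()
    }

    fn key(&self) -> &[u8] {
        if self.use_a() { self.a.key() } else { self.b.key() }
    }

    fn next(&mut self) -> anyhow::Result<()> {
        if self.use_a() {
            // skip the shadowed entry in `b` when both sit on the same key
            if self.b.is_valid() && self.b.key() == self.a.key() {
                self.b.next()?;
            }
            self.a.next()?;
        } else {
            self.b.next()?;
        }
        Ok(())
    }
}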

In this tutorial, we explicitly avoid using something like Box<dyn StorageIterator>, so as to avoid the cost of dynamic dispatch. This is a common optimization in LSM storage engines.

Extra Tasks

  • Implement a different mem-table and see how it differs from the skiplist, e.g., a B-tree mem-table. You will notice that it is hard to get an iterator over a B+ tree without holding a lock for the same timespan as the iterator. You might need to think of smart ways of solving this.
  • Async iterator. One interesting thing to explore is whether it is possible to asynchronize everything in the storage engine. You might find some lifetime-related problems and need to work around them.
  • Foreground iterator. In this tutorial we assume that all operations are short, so that we can hold a reference to the mem-table in the iterator. If an iterator is held by a user for a long time, the whole mem-table (which might be 256MB) will stay in memory even after it has been flushed to disk. To solve this, we can provide a ForegroundIterator / LongIterator to our users. The iterator will periodically create a new underlying storage iterator so as to allow garbage collection of the resources.

Storage Engine and Block Cache

In this part, you will need to modify:

  • src/lsm_iterator.rs
  • src/lsm_storage.rs
  • src/table.rs
  • Other parts that use SsTable::read_block

You can use cargo x copy-test day4 to copy our provided test cases to the starter code directory. After you have finished this part, use cargo x scheck to check the style and run all test cases. If you want to write your own test cases, write a new module #[cfg(test)] mod user_tests { /* your test cases */ } in table.rs. Remember to remove #![allow(...)] at the top of the modules you modified so that cargo clippy can actually check the styles.

Task 1 - Put and Delete

Before implementing put and delete, let's revisit how the LSM tree works. The structure of the LSM includes:

  • Mem-table: one active mutable mem-table and multiple immutable mem-tables.
  • Write-ahead log: each mem-table corresponds to a WAL.
  • SSTs: mem-table can be flushed to the disk in SST format. SSTs are organized in multiple levels.

In this part, we only need to take the lock and write the entry (or tombstone) into the active mem-table. You can modify lsm_storage.rs.

Task 2 - Get

To get a value from the LSM, we can simply probe the active memtable, the immutable memtables (from latest to earliest), and all the SSTs, in that order. To reduce the critical section, we can hold the read lock just long enough to copy all the pointers to mem-tables and SSTs out of the LsmStorageInner structure, and create the iterators outside of the critical section. Be careful about the order when creating iterators and probing.
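
A sketch of this pattern; the field names (inner, memtable, imm_memtables) are illustrative, and we assume the newest immutable mem-table sits at the back of the list:

pub fn get(&self, key: &[u8]) -> anyhow::Result<Option<Bytes>> {
    // Critical section: clone the Arc pointers, not the data itself.
    let snapshot = {
        let guard = self.inner.read();
        guard.clone()
    };
    // Probe the active mem-table first, then the immutable ones, latest first.
    if let Some(value) = snapshot.memtable.get(key) {
        // an empty value is a tombstone: the key was deleted
        return Ok(if value.is_empty() { None } else { Some(value) });
    }
    for memtable in snapshot.imm_memtables.iter().rev() {
        if let Some(value) = memtable.get(key) {
            return Ok(if value.is_empty() { None } else { Some(value) });
        }
    }
    // Finally, search the SSTs by creating iterators and seeking to the key.
    todo!("probe L0 SSTs and the levels below")
}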

Task 3 - Scan

To create a scan iterator LsmIterator, you will need to use TwoMergeIterator to merge MergeIterator on mem-table and MergeIterator on SST. You can implement this in lsm_iterator.rs. Optionally, you can implement FusedIterator so that if a user accidentally calls next after the iterator becomes invalid, the underlying iterator won't panic.

The sequence of key-value pairs produced by TwoMergeIterator may contain empty values, which mean that the value has been deleted. LsmIterator should filter out these empty values. It also needs to correctly handle the start and end bounds.
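
A sketch of the tombstone filtering, assuming inner is the TwoMergeIterator built above:

impl LsmIterator {
    /// Skip entries whose value is empty (tombstones) so that deleted keys
    /// never become visible to the user.
    fn move_to_non_delete(&mut self) -> anyhow::Result<()> {
        while self.inner.is_valid() && self.inner.value().is_empty() {
            self.inner.next()?;
        }
        Ok(())
    }
}

Call this helper both when constructing the LsmIterator and after every next call, so the iterator never rests on a deleted key.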

Task 4 - Sync

In this part, we will implement mem-table flush to L0 SSTs in lsm_storage.rs. As in task 1, write operations go directly into the active mutable mem-table. Once sync is called, we flush mem-tables to the disk as SSTs in three steps:

  • First, move the current mutable mem-table to the immutable mem-table list, so that no future requests go into the current mem-table, and create a new mem-table. All of this should happen in a single critical section and stall all reads.
  • Then, we can flush the mem-table to disk as an SST file without holding any lock.
  • Finally, in one critical section, remove the mem-table and put the SST into l0_tables.

Only one thread can sync at a time; you should use a mutex to enforce this requirement.

Task 5 - Block Cache

Now that we have implemented the LSM structure, we can start writing something to the disk! Previously in table.rs, we implemented a FileObject struct, without writing anything to disk. In this task, we will change the implementation so that:

  • read will read from the disk without any caching using read_exact_at in std::os::unix::fs::FileExt.
  • The size of the file should be stored inside the struct, and size function directly returns it.
  • create should write the file to the disk. Generally you should call fsync on that file. But this would slow down unit tests a lot. Therefore, we don't do fsync until day 6 recovery.
  • open remains unimplemented until day 6 recovery.

After that, we can implement a new read_block_cached function on SsTable so that we can leverage the block cache to serve read requests. Upon initializing the LsmStorage struct, you should create a block cache of 4GB using moka-rs. Blocks are cached by SST id + block id. Use try_get_with to get the block from the cache, populating the cache on a miss. If multiple requests read the same block and it is missing from the cache, try_get_with will only issue a single read request to the disk and broadcast the result to all requests.
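
A sketch of how this could look with moka's sync cache; the read_block function is a hypothetical stand-in for the code that actually hits the disk, and note that moka's max_capacity counts entries unless a weigher is configured:

use std::sync::Arc;

use anyhow::anyhow;
use moka::sync::Cache;

/// Blocks are cached by (sst_id, block_idx).
pub type BlockCache = Cache<(usize, usize), Arc<Block>>;

pub fn read_block_cached(
    cache: &BlockCache,
    sst_id: usize,
    block_idx: usize,
) -> anyhow::Result<Arc<Block>> {
    cache
        .try_get_with((sst_id, block_idx), || read_block(sst_id, block_idx))
        // try_get_with deduplicates concurrent loads of the same block and
        // wraps the loader's error in an Arc; convert it back to anyhow::Error
        .map_err(|e| anyhow!("failed to read block: {}", e))
}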

Remember to change SsTableIterator to use the block cache.

Extra Tasks

  • As you might have seen, each time we do a get, put, or delete, we need to take a read lock protecting the LSM structure; and if we want to flush, we need to take a write lock. This can cause a lot of problems. Some lock implementations are fair, which means that as long as a writer is waiting on the lock, no new reader can take the lock. The writer therefore waits until the slowest reader finishes its operation before it can do any work. One possible optimization is to implement WriteBatch: we don't need to immediately write a user's requests into mem-table + WAL; we can allow users to submit a batch of writes.
  • Align blocks to 4K and use direct I/O.

Leveled Compaction

Write-Ahead Log for Recovery

Bloom Filters

Key Compression

What's Next