Friday, June 5, 2015

Perspective on RocksDB on SSD


RocksDB is a key-value embedded storage engine. RocksDB improves on LevelDB by adding many extra features on top of it and by optimizing performance for flash drives. RocksDB aims to exploit the full potential of the high read/write rates of flash drives. The RocksDB storage engine is based on an LSM (Log-Structured Merge) tree. The RocksDB wiki is an excellent place to understand the full details of RocksDB.

Key Prefix

The key prefix is very important for data-retrieval performance, since RocksDB correctly points out that most applications do not do pure-random scans of key ranges in the database; instead, applications typically scan within a key prefix. This is true in real-world applications, but the scan is mostly on secondary indexes rather than on primary keys or record IDs. The scan is also often part of a join, where you need an index range scan on a secondary index combined with a join on a specific attribute value, which can also be supported very well in RocksDB.
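The idea of a prefix scan can be sketched without RocksDB itself. This is a minimal model, assuming secondary-index keys laid out as `"<attribute>:<value>:<record id>"` (a hypothetical encoding; RocksDB's real prefix iterators work over SST files with a configured prefix extractor, while a sorted list stands in here):

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Yield every key that starts with `prefix` from an ordered key space.

    Mimics how an LSM iterator seeks to the first key >= prefix and
    stops as soon as keys no longer share the prefix."""
    i = bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        yield sorted_keys[i]
        i += 1

# Secondary-index-style keys: "<attribute>:<value>:<record id>"
keys = sorted(["city:nyc:42", "city:nyc:7", "city:sf:13", "name:bob:7"])
print(list(prefix_scan(keys, "city:nyc:")))  # only the keys in the nyc range
```

Because the keys are ordered, the scan touches only the matching range, which is why prefix scans are so much cheaper than full scans.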


Bloom Filter

Bloom filters speed up queries a lot. A bloom filter is built per key prefix, so if your search is on a different attribute than the primary-key prefix, the bloom filter will not be of much use. Range queries also cannot make much use of the bloom filter. For most point queries, however, the bloom filter will speed up the query. A bloom filter takes a considerable amount of space to achieve a good false-positive rate, without storing any index information usable for complex queries.
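A toy bloom filter makes the point-query behavior concrete. This is a sketch of the general technique, not RocksDB's implementation (RocksDB uses its own optimized filter format; the sizes and hash scheme below are illustrative assumptions):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k hash probes into an array of m bits."""

    def __init__(self, m_bits=1024, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)

    def _probes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1

    def may_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[p] for p in self._probes(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.may_contain("user:42"))  # True -- the point lookup proceeds
print(bf.may_contain("user:99"))  # almost certainly False: skip this SST file
```

A "False" answer lets a read skip an SST file entirely, which is exactly the saving point queries get; range queries cannot use this, since a range spans many keys that were never inserted individually.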


Index

The index is the most important part of data retrieval in any storage engine. If you do not use LSM-based storage, the index needs to be updated on every update operation; if you can do in-place updates, that is fine. One obvious problem on SSDs is that the garbage-collection process usually erases an entire super block consisting of several erase blocks. This creates a lot of fragmentation for a non-LSM index, e.g., a B-Tree index. LSM-based storage batches index modifications, which is good for SSDs. But if we have too many in-place updates, space utilization drops because of duplication of both data and index, and compaction takes away the nice index benefit we had, since it also invokes garbage collection behind the scenes in the SSD to clean up the old SST files.

The advantage of LSM is partly lost through the level approach. In real-world scenarios, compaction drops on average about 20% of duplicate values from the higher level, while the other 80% are simply rewritten, causing high write amplification. The level approach also increases read latency, since a query often has to go through multiple levels. Worst of all, levels cause both space amplification and write amplification, which leads to faster SSD wear-out.
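The write-amplification claim can be put into a back-of-the-envelope model. This is a rough sketch, not RocksDB's actual accounting; the fanout of 10 is a common leveled-compaction default, and the 80% survival rate is the figure from the paragraph above:

```python
def write_amp(levels, fanout=10, survive=0.80):
    """Rough leveled-compaction write amplification.

    Each byte that survives into a deeper level is merged with roughly
    `fanout` bytes of data already sitting there, so it costs about
    `fanout` rewrites per level it descends; `survive` models the ~80%
    of records that are rewritten rather than dropped as duplicates."""
    wa, alive = 1.0, 1.0          # 1.0 = the initial flush to L0
    for _ in range(1, levels):
        alive *= survive          # ~20% dropped as duplicates per merge
        wa += alive * fanout      # merge cost of pushing survivors down
    return wa

print(write_amp(2))   # two levels: 1 + 0.8 * 10 = 9.0
print(write_amp(5))   # five levels: substantially worse
```

Even this crude model shows why adding levels multiplies the bytes the SSD actually writes per byte the application writes.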

Since the index for each SST file is written inside the SST file itself, it is a clustered index. As the data grows, a lookup may involve checking the memtable and the bloom filter or index of several levels. Reads get more and more expensive as the number of levels increases. Rather than using a lot of memory for bloom filters, we could create a sparse index that lives in DRAM alongside the clustered index in the SST file. With the right data structure, a sparse index can point directly to the data; this minimizes the DRAM footprint of the index and speeds up all kinds of queries.
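A sparse index is simple to sketch: keep only every N-th key in memory, binary-search it, then scan at most N entries of the full (on-disk) key order. The stride and key layout below are illustrative assumptions, and a Python list stands in for the SST file:

```python
from bisect import bisect_right

def build_sparse_index(sorted_keys, stride=4):
    """Keep only every `stride`-th key (with its position) in DRAM --
    a much smaller footprint than a per-key structure."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), stride)]

def lookup(sorted_keys, sparse, key):
    """Binary-search the sparse index, then scan at most `stride` keys."""
    j = bisect_right(sparse, (key, len(sorted_keys))) - 1
    if j < 0:
        return None                       # key sorts before everything
    for i in range(sparse[j][1], len(sorted_keys)):
        if sorted_keys[i] == key:
            return i
        if sorted_keys[i] > key:
            break
    return None

keys = [f"k{i:03d}" for i in range(10)]
sparse = build_sparse_index(keys)
print(lookup(keys, sparse, "k007"))  # position 7, via one short scan
```

The trade-off is one short sequential scan per lookup in exchange for an index whose size is 1/stride of the key count.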


Memtable 

A skiplist is the default implementation of the memtable. The B-Tree is by far the most commonly used indexing data structure in databases, since it supports range queries, and complex queries can be performed with it as well. The skiplist is very similar to a B-Tree and has two advantages over it: concurrency issues are reduced, and tree rebalancing is not required. A skiplist is good, but not the best, for the memory footprint of the index, since the multiple levels increase memory use and the keys (perhaps without prefix compression) need to be stored at every level.
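A minimal skiplist shows both points: an insert only adjusts a handful of forward pointers (no rebalancing), and every node carries one pointer per level of its tower, which is the memory-footprint cost mentioned above. This is a textbook sketch, not RocksDB's concurrent memtable implementation:

```python
import random

class _Node:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.next = [None] * height      # one forward pointer per level

class SkipList:
    """Minimal ordered map. Inserts touch only a few pointers and never
    rebalance, which is why skiplists are friendlier to concurrency
    than B-Trees."""
    MAX_H = 8

    def __init__(self):
        self.head = _Node(None, None, self.MAX_H)

    def _height(self):
        h = 1
        while h < self.MAX_H and random.random() < 0.5:
            h += 1                       # geometric tower height
        return h

    def insert(self, key, value):
        update, node = [self.head] * self.MAX_H, self.head
        for lvl in range(self.MAX_H - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node           # last node before `key` per level
        new = _Node(key, value, self._height())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def get(self, key):
        node = self.head
        for lvl in range(self.MAX_H - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
        node = node.next[0]
        return node.value if node and node.key == key else None
```

Since the bottom level is a plain sorted linked list, range scans fall out for free, just as with a B-Tree's leaf chain.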

Memtable pipelining helps increase the write throughput of RocksDB. A compaction that spills over multiple levels can drastically reduce write throughput and even stall writes completely. This is quite undesirable for highly available storage.

Inline compaction helps reduce space amplification by removing duplicate records, but it cannot help after the memtable is flushed to storage. Compaction greatly increases write amplification, since an SSD erases one or more erase blocks (a super block) at a time, so deleting even a small file, e.g., 2 MB, still causes a serious fragmentation issue.
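The deduplication step itself is easy to sketch: before flushing, keep only the newest value per key. This is a model of the idea, with an assumed `(sequence, key, value)` record shape rather than RocksDB's actual internal format:

```python
def inline_compact(entries):
    """Keep only the newest value per key before flushing the memtable.

    `entries` is a list of (seq, key, value); a larger seq is newer.
    Returns records in key order, as an SST file would store them."""
    newest = {}
    for seq, key, value in entries:
        if key not in newest or seq > newest[key][0]:
            newest[key] = (seq, value)
    return [(k, v) for k, (_, v) in sorted(newest.items())]

memtable = [(1, "a", "old"), (3, "a", "new"), (2, "b", "x")]
print(inline_compact(memtable))  # "a" keeps only its newest value
```

Once the flush has happened, though, stale copies of "a" in lower levels can only be reclaimed by the much more expensive level compactions.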


Compression

One of the most important benefits of the LSM approach is that it allows you to compress your data without much complexity. This is currently the most important factor in why LSM is chosen so widely: compression saves space, and hence money, and most workloads do not have heavy updates on records.
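Why LSM makes compression easy: SST files are written once, sequentially, so whole blocks of sorted records can be compressed together. A sketch with stdlib `zlib` standing in for the block compressor (RocksDB supports several codecs; the record layout here is a made-up illustration):

```python
import zlib

def compress_block(records):
    """Concatenate sorted key-value records into one block and compress it,
    which an LSM engine can do because SST blocks are immutable."""
    block = b"".join(k + b"\x00" + v + b"\n" for k, v in records)
    return zlib.compress(block, level=6)

records = [(b"user:%04d" % i, b"payload-%d" % i) for i in range(1000)]
raw = sum(len(k) + len(v) + 2 for k, v in records)
packed = len(compress_block(records))
print(f"{raw} bytes -> {packed} bytes")  # shared key prefixes compress well
```

Sorted keys share long prefixes, so block compression gets a much better ratio than compressing records one at a time, and immutability means no costly in-place recompression on update.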


Space Amplification 

The biggest factor to consider for level-based storage engines spawned from LevelDB is space amplification, since data is stored in multiple levels. When we write to level 0, we do not know how many old copies of the key are already stored in higher levels. Consider an update-only workload: with 5 levels, you will likely have every key at least twice and on average 3 times, meaning the same data takes 3 times more space. If you need high-speed queries and have to support a lot of data updates, LSM may not be the perfect choice. Considering garbage collection inside the SSD, this affects performance badly, and it gets worse as the number of levels increases; so LSM is good for embedded storage with a moderate amount of data up to level 5, but not for large data sets needing more than 5 levels.
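The 3x figure is just a ratio of stored bytes to live bytes. The per-level sizes below are hypothetical numbers chosen to illustrate the update-only scenario from the paragraph above, not measurements:

```python
def space_amplification(level_bytes, live_bytes):
    """Space amplification = total bytes across all levels / live data."""
    return sum(level_bytes) / live_bytes

# Update-only workload, 5 levels: every key stored at least twice,
# on average three times -> the same data takes ~3x the space.
live = 100                        # GB of live (latest-version) data
levels = [10, 30, 60, 90, 110]    # hypothetical per-level sizes, 300 GB total
print(space_amplification(levels, live))
```

On an SSD the stale copies are not free even before compaction reclaims them: they occupy flash that the drive's own garbage collector must move around, compounding the wear.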

Another problem for SSDs is that their lifespan is affected by space amplification and write amplification. Often very few records get dropped during a compaction, and most records simply make their way to the higher level. Space amplification of 3x can be quite common for LSM. This write amplification is application-specific, and it gets multiplied by the write amplification from garbage collection inside the SSD (because we stripe the data). The resulting total write amplification will reduce the lifespan of the SSD considerably.


Cascading compactions - I may be wrong on this. Please correct me if so.

Cascading compaction is a problem where a compaction spills over to multiple levels; in this case writes will be halted if all the write buffers are full while the multilevel compaction is still in progress. Multilevel means that when level 0 fills up, its files are merged with level 1 files, which may fill up level 1. This causes level 1 files to merge (compact) with level 2 files, which may fill up level 2, and so on. This won't be frequent, but it can spike write latency, and the spikes can be huge, up to a few seconds, even on high-capacity, high-bandwidth SSD/NVMe drives.
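The chain reaction can be modeled with a toy simulation. This deliberately ignores RocksDB's real scheduler and size ratios; it only shows how one flush into a nearly-full hierarchy triggers a run of level-to-level merges, which is the window in which writes can stall:

```python
def cascade(levels, capacity, incoming=1):
    """Toy model: push `incoming` units into level 0 and count how many
    level-to-level merges one write triggers when levels overflow.

    `levels` is mutated in place; an overflowing level empties itself
    into the next one, which may overflow in turn."""
    merges = 0
    for i in range(len(levels)):
        levels[i] += incoming
        if levels[i] <= capacity[i]:
            break                          # this level absorbed the data
        incoming, levels[i] = levels[i], 0  # whole level merges downward
        merges += 1
    return merges

lvls = [4, 9, 0]                 # L0 and L1 are both one unit from full
print(cascade(lvls, [4, 9, 100]))  # one write -> two chained merges
print(lvls)                        # everything ends up in L2
```

Each chained merge is real I/O that must finish before the write buffers drain, which is why the latency spike scales with how deep the cascade runs.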

We certainly do not want write latency to spike to seconds. This happens when you insert 1 billion or more key-value pairs into the DB; increasing the trigger point for L0 only delays the occurrence of the latency spike. This is quite undesirable, since it gets worse with more inserts or even plain updates.


