Friday, June 5, 2015

Perspective on RocksDB on SSD


RocksDB is a key-value based embedded storage engine. It improves on LevelDB by adding a lot of extra features on top of it and by tuning performance for flash drives; RocksDB aims to exploit the full potential of the high read/write rates of flash. The storage engine is based on an LSM (Log-Structured Merge) tree. The RocksDB wiki is an excellent place to understand the full details of RocksDB.

Key Prefix

The key prefix is very important for data retrieval performance, since RocksDB correctly points out that most applications do not do pure-random scans of key ranges in the database; instead, applications typically scan within a key prefix. In real-world applications this is true, but the scan is mostly on secondary indexes rather than on primary keys or record IDs. The scan is also generally part of a join, where you need an index range scan on the secondary index combined with a join on a specific attribute value, which RocksDB can also support very well.
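For reference, prefix support is exposed through a prefix extractor on the options plus a prefix-aware read option. The following is a minimal sketch against the RocksDB C++ API; the 8-byte prefix length, the key format and the database path are placeholders chosen purely for illustration.

    #include <memory>
    #include <rocksdb/db.h>
    #include <rocksdb/options.h>
    #include <rocksdb/slice_transform.h>

    // Configure a fixed-length key prefix so RocksDB can build prefix
    // bloom filters / hash indexes, then scan only keys sharing a prefix.
    void PrefixScanExample() {
      rocksdb::Options options;
      options.create_if_missing = true;
      options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
      if (!s.ok()) return;

      rocksdb::ReadOptions read_opts;
      read_opts.prefix_same_as_start = true;  // stop when the prefix changes
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_opts));
      for (it->Seek("user0042"); it->Valid(); it->Next()) {
        // it->key() / it->value() belong to the "user0042"-prefixed range
      }
      it.reset();  // iterator must be released before the DB
      delete db;
    }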


Bloom Filter

Bloom filters speed up queries a lot. A bloom filter is built per key prefix, so if your search is on a different attribute than the primary-key prefix, the bloom filter will not be of much use. Range queries also cannot make much use of the bloom filter. For most point queries, though, the bloom filter will speed up the query. A bloom filter takes a considerable amount of space to achieve a good false positive rate, without storing any information that a data index could use for complex queries.
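For context, this is roughly how a bloom filter is enabled in RocksDB. The 10 bits per key shown (which corresponds to roughly a 1% false positive rate) is just a common choice used here for illustration.

    #include <rocksdb/filter_policy.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    // Enable a block-based-table bloom filter at ~10 bits per key.
    // Memory cost scales with the number of keys: 10 bits/key over
    // 1 billion keys is already ~1.25 GB of filter data.
    rocksdb::Options OptionsWithBloomFilter() {
      rocksdb::Options options;
      rocksdb::BlockBasedTableOptions table_options;
      table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      return options;
    }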


Index

The index is the most important part of data retrieval in any storage engine. If you do not use LSM-based storage, the index needs to be updated on every update operation; if you can do in-place updates, that is fine. One obvious problem on SSD is that the garbage collection process usually erases an entire super block consisting of several erase blocks. This creates a lot of fragmentation for a non-LSM index, e.g., a B-Tree index. LSM-based storage batches index modifications, and hence it is good for SSD. But if there are too many in-place updates, space utilization drops because of duplication of both data and index, and compaction takes away the nice benefit we were getting from the index, since it also triggers garbage collection behind the scenes in the SSD to clean up the old SST files.

Part of the advantage of LSM gets lost through the level approach. In a real-world scenario, compaction drops on average about 20% of duplicate values from the higher level while the remaining 80% are simply rewritten, causing high write amplification. The levels approach also increases read query time, since in many cases a query has to go through multiple levels. Worst of all, levels cause space amplification and write amplification, which leads to faster SSD wear-out.

Since the index for each SST file is written inside the SST file itself, it is a clustered index. As the data grows, a lookup may have to check the memtable and then the bloom filter or index of a few levels; the read gets more and more expensive as the number of levels increases. Rather than using a lot of memory for bloom filters, we could create a sparse index that lives in DRAM alongside the clustered index of the SST file. With the right data structure, a sparse index that points directly to the data would minimize the DRAM index memory footprint and speed up all kinds of queries.
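To make the sparse-index idea concrete, here is a minimal, hypothetical sketch (not RocksDB code): keep only the first key of every data block of a sorted file in a small in-DRAM map that points at the block's file offset, so DRAM holds one entry per block instead of one per key.

    #include <cstdint>
    #include <map>
    #include <string>

    // Hypothetical sparse index over a sorted file. A lookup lands on
    // the single block that can contain the key, which is then read
    // and searched on flash.
    class SparseIndex {
     public:
      void Add(const std::string& first_key_in_block, uint64_t file_offset) {
        offsets_[first_key_in_block] = file_offset;
      }

      // Returns the offset of the last block whose first key is <= key.
      bool Lookup(const std::string& key, uint64_t* file_offset) const {
        auto it = offsets_.upper_bound(key);
        if (it == offsets_.begin()) return false;  // key sorts before all blocks
        --it;
        *file_offset = it->second;
        return true;
      }

     private:
      std::map<std::string, uint64_t> offsets_;  // first key of block -> offset
    };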


Memtable 

A skiplist is the default memtable implementation. The B-Tree is by far the most commonly used data structure for indexing in databases, since it supports range queries and complex queries can be performed using it. The skiplist data structure serves a similar purpose and has two advantages over the B-Tree: concurrency issues are reduced, and tree rebalancing is not required. A skiplist is good but not the best for the memory footprint of the index, since the multiple levels increase the footprint and the keys (possibly without the prefix) need to be stored at every level.
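To illustrate where the extra footprint comes from, a bare-bones skiplist node (not RocksDB's actual implementation) looks roughly like this:

    #include <string>
    #include <vector>

    // Illustrative skiplist node. A node of height h carries h forward
    // pointers; with the usual branching factor of 4 the expected extra
    // cost is about 1.33 pointers per key on top of the key/value bytes.
    struct SkipListNode {
      std::string key;
      std::string value;
      std::vector<SkipListNode*> forward;  // one pointer per level
    };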

Memtable pipelining helps increase the write throughput of RocksDB. Some compactions that spill over to multiple levels drastically affect write throughput and can completely stall writes. This is quite undesirable for highly available storage.
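Pipelining is driven by how many write buffers RocksDB is allowed to keep in memory. A rough sketch of the relevant options follows; the sizes and counts are illustrative, not recommendations.

    #include <rocksdb/options.h>

    // Allow several immutable memtables to queue up so a slow flush or
    // compaction does not immediately stall incoming writes.
    rocksdb::Options PipelinedMemtableOptions() {
      rocksdb::Options options;
      options.write_buffer_size = 64 << 20;          // 64 MB per memtable
      options.max_write_buffer_number = 4;           // up to 4 memtables in memory
      options.min_write_buffer_number_to_merge = 2;  // merge before flushing
      return options;
    }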

Inline compaction helps improve space amplification and removes duplicate records, but it cannot help after the memtable is flushed to storage. Compaction increases write amplification greatly, since the SSD erases one or more erase blocks (a super block) at a time, so deleting even a small file, e.g., a 2 MB file, still causes a huge fragmentation issue.


Compression

Another important benefit of the LSM approach is that it allows you to compress your data without much complexity. This is currently the most important factor in why LSM is chosen so widely: compression saves space and hence money, and most workloads do not have heavy updates on records.
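In RocksDB, compression can be set globally or per level. The sketch below shows a common pattern of leaving the hottest levels uncompressed and compressing the colder, larger levels; the exact mix shown is only an example.

    #include <rocksdb/options.h>

    // Per-level compression: uncompressed at the top, Snappy in the
    // middle, a heavier codec at the bottom where most data lives.
    rocksdb::Options CompressionOptions() {
      rocksdb::Options options;
      options.compression = rocksdb::kSnappyCompression;
      options.compression_per_level = {
          rocksdb::kNoCompression,      // L0
          rocksdb::kNoCompression,      // L1
          rocksdb::kSnappyCompression,  // L2
          rocksdb::kSnappyCompression,  // L3
          rocksdb::kZlibCompression,    // L4 and below
      };
      return options;
    }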


Space Amplification 

The biggest factor to consider regarding level-based storage engines spawned from LevelDB is space amplification, since data is stored in multiple levels. When we are writing to level 0, we are not aware of how many copies of old values for the key are already stored in higher levels. Consider an update-only load: with 5 levels it is possible that every key exists at least twice and on average 3 times, which means the same data takes 3 times more space. If you need high-speed queries and need to support a lot of data updates, LSM may not be the perfect choice. Considering garbage collection inside the SSD, this hurts performance badly, and it gets worse as the number of levels increases; so LSM is good for embedded data storage with a moderate amount of data up to level 5, but not for large data sets with more than 5 levels.

Another problem for the SSD is that its life span is affected by space amplification and write amplification. Many times in compaction very few records get dropped and most of the records simply make their way to the higher level; space amplification of 3x could be quite common for LSM. This write amplification is application specific. The write amplification we get from garbage collection inside the SSD (because the data is striped) gets multiplied by the application-specific write amplification, and the total write amplification reduces the life span of the SSD considerably.
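To put rough numbers on that multiplication (illustrative values only, not measurements):

    #include <cstdio>

    // If compaction rewrites each user byte ~10 times across the levels,
    // and the SSD's internal garbage collection rewrites each byte on
    // flash another ~2 times, the flash cells absorb ~20 bytes of writes
    // per byte of user data.
    int main() {
      const double lsm_write_amp = 10.0;    // application-level (compaction)
      const double device_write_amp = 2.0;  // SSD garbage collection
      std::printf("total write amplification: ~%.0fx\n",
                  lsm_write_amp * device_write_amp);
      return 0;
    }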


Cascading compactions - I may be wrong on this. Please correct me if so.

Cascading compaction is a problem where a compaction spills over to multiple levels; in this case writes are halted if all the write buffers are full while the multilevel compaction is still in progress. Multilevel means that when level 0 fills up, its files are merged with level 1 files, which may in turn fill up level 1. This causes level 1 files to merge (compact) with level 2 files, which may cause level 2 to fill up, and so on. This won't be frequent, but it can potentially spike write latency, and the spikes can be huge, up to a few seconds even on high-capacity, high-bandwidth SSD/NVMe drives.

We certainly do not want write latency to spike to seconds. It happens when you insert 1 billion or more key-values into the DB; increasing the trigger point for L0 only delays the occurrence of the latency spike. This is quite undesirable, since it gets worse with more inserts or simply with updates.
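For context, these are the main knobs that control when RocksDB throttles or stops writes as L0 backs up, plus background compaction parallelism. The values shown are illustrative; as noted above, raising the triggers only postpones the spike.

    #include <rocksdb/options.h>

    // Knobs governing write slowdowns/stalls as L0 files accumulate.
    rocksdb::Options StallTuningOptions() {
      rocksdb::Options options;
      options.level0_file_num_compaction_trigger = 4;  // start compacting L0
      options.level0_slowdown_writes_trigger = 20;     // throttle writes
      options.level0_stop_writes_trigger = 36;         // hard stall
      options.max_background_compactions = 4;          // compaction threads
      return options;
    }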



SSD controller firmware inefficiency and possible optimizations

1. Over-provisioning - To hide the inefficiency of the controller firmware (which usually runs on more than one, typically 4 or 8, embedded processors), SSD vendors use over-provisioning to speed up garbage collection and the writing of data. Look at the advertised drive sizes: Intel, for example, sells a 400 GB drive, keeping 512 - 400 = 112 GB, a whopping 112 GB, for garbage collection.


2. Writing is interleaved over many flash chips to benefit from parallelism, creating fragmentation for what is logically a sequential write. The SSD will write 1 MB of data to several chips, mostly writing one page (8 KB) on one chip and the next page on the next chip. So if you are using an LSM storage engine, you basically do not get much benefit from writing the sorted data sequentially, since the SSD internally is never going to write it sequentially. Technically, in a conventional SSD, concurrency is achieved by striping data across the channels so that one request can be served by multiple channels. However, using this approach to achieve high bandwidth conflicts with the goal of eliminating write amplification.

3. SSDs consist of multiple NAND flash chips, and conventional SSDs provide data protection by using RAID 5 parity coding across the flash chips, in addition to powerful BCH ECC protection within individual flash chips for fault detection and correction, with attendant cost, complexity, and reduced flash space available for user data. However, in a large-scale Internet service infrastructure, data reliability is already provided by data replication across multiple racks.

4. The asymmetry between write and erase causes a lot of write amplification: no matter how you write data, there is going to be write amplification, and covering it requires quite expensive garbage collection. For garbage collection to work in a multi-terabyte SSD, we need hybrid (block-page) mapping, which requires garbage collection to work at the superblock level (a few aligned erase blocks), creating even more write amplification. To cover this you need extra over-provisioning: after the 112 GB already reserved, the software administrator adds another 100 GB of over-provisioning to make the software perform better, reducing the usable capacity to 300 GB out of 512 GB (about 60%).

5. Layers of the I/O stack, such as the block layer, have become a bottleneck for today's high-performance SSDs. A duplicate mapping exists inside the SSD to map the externally exposed logical block addresses to physical blocks. This mapping takes a lot of DRAM space in the SSD controller and complicates the controller logic, increasing cost and energy requirements, since DRAM consumes a lot of energy.

This extra mapping from LBA (logical block address) to PBA (physical block address) is stored in the SSD's DRAM. The mapping itself is a simple array, but the LBAs are interleaved over multiple flash chips with a simple power-of-two bit layout - e.g., the last 2 bits select one of 4 planes inside a chip, the next 4 bits one of 16 erase blocks inside a chip, the next 4 bits one of 16 ways per channel, the next 5 bits one of 32 channels in the SSD, and so on, with the remaining bits growing sequentially; this forms the physical address while the data is written to multiple flash chips in parallel. This could completely change your mental model of a flash drive if you were unaware of it. The mapping is also completely redundant for NoSQL storage engines, where we want to store data as key-values: a key needs to be mapped to one or more LBAs, or multiple keys may map to one LBA, where an LBA is sector addressable and a sector is 512 bytes of storage. Bottom line - we simply do not need this logical block layer.
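A toy decode of the bit layout described above might look like the following. The field widths match the example in the text and are purely illustrative; real controllers use their own proprietary layouts.

    #include <cstdint>
    #include <cstdio>

    // Toy physical-address decode for the illustrative layout above:
    // 2 bits plane, 4 bits erase block, 4 bits way, 5 bits channel,
    // remaining bits grow sequentially.
    struct FlashLocation {
      uint32_t plane, block, way, channel;
      uint64_t remainder;
    };

    FlashLocation Decode(uint64_t lba) {
      FlashLocation loc;
      loc.plane     = static_cast<uint32_t>(lba & 0x3);          // 4 planes
      loc.block     = static_cast<uint32_t>((lba >> 2) & 0xF);   // 16 erase blocks
      loc.way       = static_cast<uint32_t>((lba >> 6) & 0xF);   // 16 ways/channel
      loc.channel   = static_cast<uint32_t>((lba >> 10) & 0x1F); // 32 channels
      loc.remainder = lba >> 15;
      return loc;
    }

    int main() {
      // Consecutive logical addresses fan out across planes and erase
      // blocks first, then ways and channels, instead of staying
      // physically contiguous.
      for (uint64_t lba = 0; lba < 8; ++lba) {
        FlashLocation l = Decode(lba);
        std::printf("lba %2llu -> channel %u way %u block %u plane %u\n",
                    static_cast<unsigned long long>(lba),
                    l.channel, l.way, l.block, l.plane);
      }
      return 0;
    }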


Database Storage Engines with non-volatile memory (flash SSD/NVMe, RRAM, PCM)


Having worked with RDBMSs, distributed systems (Hadoop, S4, Storm), NoSQL data stores (HBase in particular) and, recently, SSD (Solid State Drive) internals, I want to give my 2 cents about storage engines for SSDs. I want to elaborate conceptually on how a database storage engine can take fuller advantage of a flash drive's parallelism.

I want to tour through what you really need to care about in an SSD, since the FTL (Flash Translation Layer) inside the SSD is quite involved and vendor specific (each company has its own proprietary FTL algorithms and implementation). There is a layer above the FTL mapping that interfaces to the host, and a layer below the FTL that manages flash blocks; the lower layer performs bad block management, error correction coding, garbage collection and wear leveling. You can find several blogs and research papers on FTL components, but there are far too many components to cover them all here. So I am going to take you on a basic tour of only the components that matter most, and provide links to research papers you can read for enough detail to understand and optimize data storage engine software.




Basics
The first and foremost thing you need to visualize is that, since there are no mechanical parts, there is no head that must seek to the correct device location to read or write. There is a DRAM buffer, usually about 0.1% of the device capacity, used by the SSD controller firmware. NAND flash chips have small capacity, holding up to 16/32 GB per chip, and an SSD contains 64-1024 chips. A single chip can read at up to 400-500 MB/sec but can write at only 20-25 MB/sec because of the complexity of programming. Because of the internal parallelism of writes inside the SSD, multiple chips will be active at the same time, so larger-capacity SSDs tend to yield better performance.
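Rough arithmetic with those per-chip figures shows why capacity and parallelism translate directly into bandwidth (the numbers are illustrative upper bounds, not measurements):

    #include <cstdio>

    // At ~20 MB/s of program bandwidth per chip, an SSD only reaches
    // GB/s-class write throughput by keeping many chips busy in parallel.
    int main() {
      const double mb_per_chip = 20.0;
      const int chip_counts[] = {64, 256, 1024};
      for (int chips : chip_counts) {
        std::printf("%4d chips -> up to ~%.1f GB/s aggregate write bandwidth\n",
                    chips, chips * mb_per_chip / 1024.0);
      }
      return 0;
    }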

Flash Controller
The flash controller is most commonly implemented as an SoC (system-on-a-chip) design. It consists of multiple hardware-accelerated functional blocks coupled to one or more embedded processor cores; nowadays it is common to see 4/8/16 embedded processor cores inside a flash controller.

Host Interface
The host interface is the physical interface from the host system to the SSD. The majority of SSDs leverage existing storage standards and interfaces, such as SATA (Serial ATA) and SAS (Serial Attached SCSI), and most of them use traditional block-access protocols. There is a project to expose the underlying parallelism to the host so software can take advantage of it - https://github.com/OpenChannelSSD/linux. The low-level primitives and high-speed serializer/deserializer are accelerated in hardware, with the high-level block-access protocols implemented by firmware running on one or more embedded processors.

A newer storage interface for SSDs is PCIe (Peripheral Component Interconnect Express), which allows the host to transfer up to 16 GB/s; SSDs that speak the NVMe protocol over PCIe are called NVMe drives.


Channel
A flash channel is the communication path between a subset of the flash chips on the SSD and the SSD controller. The interface is shared by a bunch of NAND chips, with only one chip able to communicate with the controller at any point in time.


FTL Operations
On the codecapsule.com website, author Emmanuel Goossaert has explained the internal operations of an SSD very well in simple language. Please read parts 3 to 5. The link to the content is here - http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/


Writes on Flash
By now you must have realized that writes on flash are interleaved across chips inside the SSD. So even if you write 64 KB of data sequentially, it will be interleaved inside the SSD onto different chips across channels, ways, etc., based on the page allocation strategy employed by the proprietary firmware currently running.

If you want a deeper look at page allocation strategies, refer to this paper; it will take you through complexities and possibilities of parallelism inside the SSD you might never have thought of before. The link to the paper is - https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final55.pdf


Back to Storage Engine
I have been fond of the LSM storage engine ever since I read the BigTable paper, because on an HDD any kind of parallelism slows down read/write throughput, since the disk seek is very expensive. In my view, the whole purpose of LSM was mainly to avoid the disk seek.

The original paper on the Log-Structured Merge-Tree by Patrick O'Neil, the father of LSM, mentions the following:

Standard disk-based index structures such as the B-tree will effectively double the I/O cost of the transaction to maintain an index such as this in real time, increasing the total system cost up to fifty percent. Clearly a method for maintaining a real-time index at low cost is desirable. The Log-Structured Merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components. The algorithm has greatly reduced disk arm movements compared to a traditional access methods such as B-trees, and will improve cost performance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs. The LSM-tree approach also generalizes to operations other than insert and delete. However, indexed finds requiring immediate response will lose I/O efficiency in some cases, so the LSM-tree is most useful in applications where index inserts are more common than finds that retrieve the entries. This seems to be a common property for History tables and log files.

It is quite evident from the above that LSM was originally designed only for hard disk drives. It still has benefits on SSD, but the fact is it cannot utilize the full potential of the SSD and its parallelism.

Compared to HDDs, SSDs provide much higher throughput and lower latency, especially for random operations. However, SSDs have their own limitations: random write performance substantially lags behind that of sequential writes and of reads, mainly due to the expensive garbage collection random writes incur. As mentioned before, since LSM-tree-based KV stores can effectively eliminate random writes, it seems promising to integrate LSM-tree-based KV stores with NAND flash-based SSDs so that high throughput can be achieved for both read and write I/O operations.

This seems like an obvious benefit of using LSM on SSD: better write amplification due to less fragmentation of data on the device. But given the way the SSD writes to the flash chips, this advantage is largely lost. The SSD stripes each write onto multiple flash chips, with the stripe unit equal to one page. So we employ LSM to avoid random writes, while the SSD fragments the LSM Sorted String Table file by writing an 8 KB page on each flash chip internally.

Although researchers have been aware of the potential advantage of combining LSM-tree-based KV stores with SSDs, this topic has not been well studied.

As we have seen, the process of data movement in LSM-tree-based KV stores was originally designed for HDDs rather than SSDs. Since an HDD has only one disk head, the KV store serially issues I/O requests to it. We need to increase the concurrency level to take advantage of the SSD's strength in random reads. That is precisely what RocksDB is doing, but without knowing how the SSD's internal parallelism works, it cannot take full advantage of it. Although a modern SSD usually consists of multiple channels, the SSD controller hardware provides only one block device interface to the operating system, and the scheduling and dispatching of I/O requests between internal channels are hidden from the software level. This leaves LSM-tree-based KV stores unaware of the SSD's multi-channel architecture, the scheduling and dispatching decisions made by the SSD controller are not optimized for the access patterns of LSM-tree-based KV stores, and it also causes a lot of overhead and duplicate mappings inside the SSD.

Since LSM writes data contiguously (keys are sorted inside SST files), one might think that when an SST file gets deleted in compaction, more data gets invalidated in a few SSD blocks, so when garbage collection runs it sees that the invalid page count of those few erase blocks is high and can target them for garbage collection with less valid data to copy. The catch is that data is never written sequentially inside the SSD. Even if the SST file is exactly the size of 2 erase blocks, its data will be written to many erase blocks based on the page allocation strategy inside the controller firmware. This could still be partially addressed by carefully choosing the number of erase blocks in a superblock, which is the basic unit of garbage collection in advanced controller firmware. One advantage of the SST file in LSM is that it still avoids random writes to the device, so there is less fragmentation of data; when an SST file gets deleted, there won't be one or two key entries from that SST file scattered over hundreds of erase blocks. So write amplification gets some benefit from LSM, but not a lot.

Erasing flash memory happens at the block level: the entire block, usually consisting of 128 pages, is erased as a single operation (a high voltage is applied that wipes the whole block). In general a page is 8 KB (4 KB - 16 KB) in flash memory nowadays. A more granular page improves latency, but read/write throughput suffers and more controller DRAM is required for the block-page mapping.
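A quick back-of-the-envelope calculation shows why page granularity matters for controller DRAM. It assumes a pure page-level map with 4-byte entries, which is a simplification for illustration; hybrid block-page mappings exist precisely to shrink this table.

    #include <cstdint>
    #include <cstdio>

    // Pure page-level mapping for a 512 GB drive with 8 KB pages:
    // one entry per page, 4 bytes per entry, so the controller already
    // needs ~256 MB of DRAM just for the map. Halving the page size
    // doubles this requirement.
    int main() {
      const uint64_t capacity = 512ULL << 30;  // 512 GB drive
      const uint64_t page_size = 8ULL << 10;   // 8 KB flash page
      const uint64_t entry_size = 4;           // bytes per mapping entry
      const uint64_t entries = capacity / page_size;
      std::printf("%llu entries -> ~%llu MB of mapping DRAM\n",
                  static_cast<unsigned long long>(entries),
                  static_cast<unsigned long long>(entries * entry_size >> 20));
      return 0;
    }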

References