Having worked with RDBMS, distributed systems (Hadoop, S4, Storm), NoSQL data stores (HBase in particular), and recently SSD internals (solid-state drives), I want to share my two cents on storage engines for SSDs. I want to explain conceptually how a database storage engine can take better advantage of a flash drive's parallelism.
I want to walk you through what you really need to care about in an SSD, since the FTL (Flash Translation Layer) inside an SSD is quite involved and vendor specific (each company has its own proprietary FTL algorithms and implementation). Above the FTL mapping sits a layer that interfaces with the host, and below it sits a layer that manages the flash blocks. The lower layer performs bad block management, error correction, garbage collection, and wear leveling. You can find several blogs and research papers on FTL components, but there are far too many components to cover if you try to read all of those papers. So I am going to give you a basic tour of only the components that matter the most, and link to research papers with enough detail to understand and optimize storage engine software.
Basics
First and foremost, visualize that since there are no mechanical parts, there is no head that must seek to the correct location on the device to read or write. The SSD controller firmware uses a DRAM buffer, usually about 0.1% of the device capacity. NAND flash chips have small capacities, holding up to 16-32 GB per chip; an SSD contains 64-1024 of them. A single chip can read at up to 400-500 MB/s but can write at only 20-25 MB/s because of the complexity of programming. Because writes are parallelized internally, multiple chips are active at the same time, so larger-capacity SSDs tend to yield better performance.
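To make the effect of this parallelism concrete, here is a small back-of-envelope sketch using the rough figures above. The chip count, per-chip speeds, and buffer ratio are illustrative assumptions rather than the specs of any particular drive.

```python
# Back-of-envelope SSD parallelism math using the rough figures above.
# All numbers are illustrative assumptions, not specs of a real drive.

CHIP_CAPACITY_GB = 16        # capacity of one NAND chip
NUM_CHIPS = 128              # chips inside the SSD (64-1024 is typical)
CHIP_READ_MBPS = 450         # per-chip read throughput
CHIP_WRITE_MBPS = 25         # per-chip write (program) throughput
DRAM_RATIO = 0.001           # DRAM buffer ~0.1% of device capacity

device_capacity_gb = CHIP_CAPACITY_GB * NUM_CHIPS
dram_buffer_mb = device_capacity_gb * 1024 * DRAM_RATIO

# If the firmware keeps N chips busy at once, raw throughput scales with N
# (in practice the host interface and channel count cap these numbers).
for active_chips in (1, 8, 32, 128):
    print(f"{active_chips:4d} active chips -> "
          f"read ~{active_chips * CHIP_READ_MBPS / 1024:.1f} GB/s, "
          f"write ~{active_chips * CHIP_WRITE_MBPS / 1024:.1f} GB/s")

print(f"device capacity ~{device_capacity_gb} GB, "
      f"DRAM buffer ~{dram_buffer_mb:.0f} MB")
```

The per-chip write rate is the key number here: a single chip programming at 20-25 MB/s only turns into respectable drive-level write throughput if the firmware keeps many chips writing in parallel.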
Flash Controller
The flash controller is most commonly implemented as an SoC (system-on-a-chip) design. The controller consists of multiple hardware-accelerated functional blocks coupled to one or more embedded processor cores. Nowadays it is common to see 4/8/16 embedded processor cores inside a flash controller.
Host Interface
The host interface is the physical interface from the host system to the SSD. The majority of SSDs leverage existing storage standards and interfaces, such as SATA (serial ATA) and SAS (serial attached SCSI), and most of them use traditional block-access protocols. There is a project that exposes the underlying parallelism to the host so it can be exploited directly: https://github.com/OpenChannelSSD/linux. The low-level primitives and high-speed serializer/deserializer are accelerated in hardware, while the high-level block-access protocols are implemented by firmware running on one or more embedded processors. A newer storage interface for SSDs is PCIe (Peripheral Component Interconnect Express), which allows the host to transfer up to 16 GB/s; these SSDs are called NVMe drives.
Channel
A flash channel is the bus over which a subset of the flash chips on the SSD communicates with the SSD controller. The channel is shared by a group of NAND chips, with only one chip able to communicate with the controller at any point in time.
FTL Operations
On the codecapsule.com website, author Emmanuel Goossaert has explained the internal operations of an SSD very well and in simple language. Please read parts 3 to 5. The link to the content is here: http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/
Writes on Flash
By now you must have realized that writes on flash are interleaved across the chips inside the SSD. So even if you write 64 KB of data sequentially, it will be interleaved inside the SSD onto different chips, across channels, ways, etc., based on the page allocation strategy employed by the proprietary firmware currently running.
If you want a deeper look at page allocation strategies, refer to this paper; it will take you through complexities and possibilities of parallelism inside an SSD that you would never have thought of before. The link to the paper is https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final55.pdf
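To build some intuition for what a page allocation strategy does, the snippet below is a minimal sketch of one simple scheme, channel-first striping, where consecutive pages go to consecutive channels and then to the next chip (way) on each channel. The geometry and the policy are illustrative assumptions; real firmware uses proprietary and far more elaborate policies, as the paper above explains.

```python
# Minimal sketch of a channel-first page allocation strategy.
# A logically sequential write is striped page by page across channels,
# then across chips (ways) on each channel. Geometry is illustrative.

NUM_CHANNELS = 8
CHIPS_PER_CHANNEL = 4        # "ways" per channel
PAGE_SIZE_KB = 8

def allocate(logical_page):
    """Map the i-th page of a sequential write to (channel, chip)."""
    channel = logical_page % NUM_CHANNELS
    chip = (logical_page // NUM_CHANNELS) % CHIPS_PER_CHANNEL
    return channel, chip

# Where does a 64 KB "sequential" write actually land?
write_kb = 64
for i in range(write_kb // PAGE_SIZE_KB):
    channel, chip = allocate(i)
    print(f"page {i}: channel {channel}, chip {chip}")
```

Even this trivial policy sends the eight pages of a 64 KB sequential write to eight different chips on eight different channels, which is exactly the interleaving described above.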
Back to Storage Engine
I have been fond of the LSM storage engine ever since I read the BigTable paper, because on an HDD any kind of parallelism slows down read/write throughput: the disk seek is very expensive. From my perspective, the whole purpose of LSM was mainly to avoid the disk seek.
Patrick O'Neil, the father of LSM, makes the following point in the original paper on the Log-Structured Merge-Tree:
"Standard disk-based index structures such as the B-tree will effectively double the I/O cost of the transaction to maintain an index such as this in real time, increasing the total system cost up to fifty percent. Clearly a method for maintaining a real-time index at low cost is desirable. The Log-Structured Merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components. The algorithm has greatly reduced disk arm movements compared to a traditional access methods such as B-trees, and will improve cost performance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs. The LSM-tree approach also generalizes to operations other than insert and delete. However, indexed finds requiring immediate response will lose I/O efficiency in some cases, so the LSM-tree is most useful in applications where index inserts are more common than finds that retrieve the entries. This seems to be a common property for History tables and log files."
It is quite evident from the above that LSM was originally designed only for hard disk drives. It still has benefits when used on an SSD, but the fact is that it cannot utilize the full potential of the SSD and its parallelism.
Compared to HDDs, SSDs provide much higher throughput and lower latency, especially for random operations. However, SSDs also have their own limitations: random write performance substantially lags behind that of sequential writes and of reads, mainly because of the expensive garbage collection that random writes incur. As mentioned before, since LSM-tree-based KV stores can effectively eliminate random writes, it is promising to pair LSM-tree-based KV stores with NAND flash-based SSDs so that high throughput can be achieved for both read and write I/O operations.
This seems like an obvious benefit of using LSM on an SSD: better write amplification thanks to less fragmentation of data on the device. But because of the way the SSD writes to its flash chips, this advantage is largely lost. The SSD stripes a write across multiple flash chips, with each stripe unit equal to one page. So we employ LSM to avoid random writes, while the SSD internally fragments the LSM sorted string table (SST) file by writing an 8 KB page to each flash chip.
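As a concrete illustration of this fragmentation, take an SST file that is exactly two erase blocks in size and stripe it across 32 chips. The page size, block size, and chip count below are assumed, typical-looking numbers, not measurements of any real drive.

```python
# An SST file exactly two erase blocks in size still touches many erase
# blocks under striping, each only partly filled with this file's pages.
# Geometry is an illustrative assumption.

PAGE_KB = 8
PAGES_PER_BLOCK = 128
NUM_CHIPS = 32                      # chips the firmware stripes across

sst_pages = 2 * PAGES_PER_BLOCK     # file size == exactly 2 erase blocks

pages_per_chip = sst_pages // NUM_CHIPS           # 8 pages land on each chip
blocks_touched = NUM_CHIPS                        # one open block per chip
fill_fraction = pages_per_chip / PAGES_PER_BLOCK  # how full each block is

print(f"SST file = {sst_pages} pages ({sst_pages * PAGE_KB // 1024} MB)")
print("ideal layout  : 2 erase blocks, 100% filled by this file")
print(f"striped layout: {blocks_touched} erase blocks, each only "
      f"{fill_fraction:.0%} filled by this file")
```

Instead of invalidating two whole blocks when compaction deletes the file, we invalidate a thin 6% slice of 32 different blocks, each mostly filled with other data.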
Although researchers have been aware of the potential advantage of combining LSM-tree-based KV stores with SSDs, this topic has not been well studied.
As we have seen, the data movement process in LSM-tree-based KV stores was originally designed for HDDs rather than SSDs. Since an HDD has only one disk head, the KV store serially issues I/O requests to the HDD. We need to increase the concurrency level in order to take advantage of the SSD's strength on random reads. That is precisely what RocksDB does, but without knowing how the SSD's internal parallelism works it cannot take full advantage of it. Although a modern SSD usually consists of multiple channels, the SSD controller hardware provides only one block device interface to the operating system, and the scheduling and dispatching of I/O requests between internal channels are hidden from the software level. This makes LSM-tree-based KV stores unaware of the SSD's multi-channel architecture; the scheduling and dispatching decisions made by the SSD controller are not optimized for the data access patterns of LSM-tree-based KV stores, and it also causes extra overhead and duplicate mappings inside the SSD.
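To illustrate what "increasing the concurrency level" means on the host side, here is a rough sketch that issues the same set of random reads first serially and then from a thread pool. The file path, sizes, and thread count are arbitrary assumptions for illustration and this is not RocksDB code; on an SSD the threaded version can keep several channels busy at once, while on an HDD it would mostly cause seek thrashing.

```python
# Illustrative only: serial vs. threaded random reads against one file.
# Path, sizes, and thread count are arbitrary assumptions.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

PATH = "testfile.bin"         # hypothetical scratch file
FILE_SIZE = 64 * 1024 * 1024  # 64 MB
READ_SIZE = 8 * 1024          # 8 KB, roughly one flash page
NUM_READS = 2000

# Create the scratch file to read from.
with open(PATH, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

offsets = [random.randrange(0, FILE_SIZE - READ_SIZE) for _ in range(NUM_READS)]

def read_at(offset):
    # Each call opens its own descriptor so reads don't share a file position.
    with open(PATH, "rb", buffering=0) as f:
        f.seek(offset)
        return f.read(READ_SIZE)

start = time.time()
for off in offsets:                               # serial: one outstanding I/O
    read_at(off)
print(f"serial  : {time.time() - start:.2f}s")

start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:  # 16 outstanding I/Os
    list(pool.map(read_at, offsets))
print(f"threaded: {time.time() - start:.2f}s")

os.remove(PATH)
```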
Since LSM writes data contiguously (keys are sorted inside SST files), one might think that when an SST file is deleted during compaction, a large amount of data gets invalidated within a few SSD blocks, so when garbage collection runs it can see that the invalid page count of those few erase blocks is high and target them, with little valid data left to copy. The catch is that data is never written sequentially inside the SSD. Even if the SST file is exactly the size of two erase blocks, its data will be written to many erase blocks based on the page allocation strategy inside the SSD controller firmware. This problem can still be partially addressed by carefully choosing the number of erase blocks in a superblock*, which is the basic unit of garbage collection in advanced controller firmware. One advantage remains: the SST file in LSM still avoids random writes to the device, so there is less fragmentation of data. This helps when an SST file is deleted, because there will not be one or two key entries of that SST file scattered over hundreds of erase blocks. So write amplification gets some benefit from LSM, but not a lot.
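To make the garbage-collection argument concrete, here is a minimal sketch of a greedy victim-selection policy over superblocks: pick the superblock with the highest invalid-page count, relocate whatever is still valid, and erase it. The data structures and the policy are simplified assumptions; a real FTL (see the superblock paper in the references) is far more involved.

```python
# Minimal sketch of greedy garbage collection over superblocks.
# A superblock groups several erase blocks and is collected as a unit.
# Structures and policy are simplified assumptions.
import random

BLOCKS_PER_SUPERBLOCK = 4
PAGES_PER_BLOCK = 128
PAGES_PER_SUPERBLOCK = BLOCKS_PER_SUPERBLOCK * PAGES_PER_BLOCK

# Simulate superblocks with a random number of still-valid pages each.
superblocks = [
    {"id": i, "valid_pages": random.randint(0, PAGES_PER_SUPERBLOCK)}
    for i in range(16)
]

def pick_victim(sbs):
    """Greedy policy: collect the superblock with the fewest valid pages,
    i.e. the one with the highest invalid-page count."""
    return min(sbs, key=lambda sb: sb["valid_pages"])

victim = pick_victim(superblocks)
copied = victim["valid_pages"]                 # valid pages to relocate
reclaimed = PAGES_PER_SUPERBLOCK - copied      # pages freed by the erase

print(f"victim superblock {victim['id']}: "
      f"copy {copied} valid pages, reclaim {reclaimed} pages")
print(f"GC cost for this pass ~ {copied / max(reclaimed, 1):.2f} "
      f"extra page writes per freed page")
```

If an SST file's pages were concentrated in a few superblocks, deleting it in compaction would leave those superblocks almost entirely invalid and nearly free to collect; striping spreads the invalidations thinly across many superblocks, which is exactly the catch described above.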
Erasing flash memory happens at the block level: an entire block, usually consisting of 128 pages, must be erased as a single operation (a high voltage is applied to erase the whole block at once). A page is typically 8 KB (4 KB to 16 KB) in today's flash memory. A more granular page improves latency, but read/write throughput suffers and the controller needs more DRAM for the block-page mapping.
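The DRAM cost of that mapping is easy to estimate. The sketch below compares a page-level map with a block-level map for an assumed 1 TB drive and a 4-byte mapping entry; the entry size and geometry are illustrative assumptions.

```python
# Rough size of the FTL mapping table for an assumed 1 TB drive.
# Entry size and geometry are illustrative assumptions.

DEVICE_TB = 1
PAGE_KB = 8
PAGES_PER_BLOCK = 128
ENTRY_BYTES = 4                          # one mapping entry

device_kb = DEVICE_TB * 1024**3          # capacity in KB
num_pages = device_kb // PAGE_KB
num_blocks = num_pages // PAGES_PER_BLOCK

page_map_mb = num_pages * ENTRY_BYTES / 1024**2
block_map_mb = num_blocks * ENTRY_BYTES / 1024**2

print(f"page-level map : {page_map_mb:.0f} MB of DRAM")   # ~512 MB
print(f"block-level map: {block_map_mb:.0f} MB of DRAM")  # ~4 MB
```

A page-level map of roughly 512 MB for a 1 TB drive is roughly where the "DRAM buffer is about 0.1% of capacity" figure mentioned earlier comes from.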
Further Reading -
http://surajwaghulde.blogspot.com/2015/06/ssd-controller-firmware-inefficiency.html
http://surajwaghulde.blogspot.com/2015/06/perspective-on-rocksdb-on-ssd.html
References
1. Coding for SSD - http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/
2. Log-Structured Merge-Tree - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf
3. Superblock* based FTL - http://dl.acm.org/citation.cfm?id=1176911