Having worked with RDBMS, distributed systems (Hadoop, S4, Storm), NoSQL data stores (HBase in particular), and recently SSD internals (solid-state drives), I want to share my two cents on storage engines for SSDs. I want to explain conceptually how a database storage engine can take better advantage of a flash drive's parallelism.
I want to walk you through what you really need to care about in an SSD, since the FTL (Flash Translation Layer) inside an SSD is quite involved and vendor specific (each company has its own proprietary FTL algorithms and implementation). Above the FTL mapping sits a layer that interfaces with the host, and below it sits a layer that manages the flash blocks. The lower layer performs bad block management, error correction, garbage collection, and wear leveling. You can find several blogs and research papers on FTL components, but there are far too many components to cover if you try to read all of those papers. So I am going to give you a basic tour of only the components that matter the most, and link to research papers with enough detail to understand and optimize storage engine software.
Basics
First and foremost, visualize that since there are no mechanical parts, there is no head that must seek to the correct location on the device to read or write. The SSD controller firmware uses a DRAM buffer, usually about 0.1% of the device capacity. NAND flash chips have small capacities, holding up to 16-32 GB per chip; an SSD contains 64-1024 of them. A single chip can read at up to 400-500 MB/s but can write at only 20-25 MB/s because of the complexity of programming. Because writes are parallelized internally, multiple chips are active at the same time, so larger-capacity SSDs tend to yield better performance.
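To make the effect of this parallelism concrete, here is a small back-of-envelope sketch using the rough figures above. The chip count, per-chip speeds, and buffer ratio are illustrative assumptions rather than the specs of any particular drive.

```python
# Back-of-envelope SSD parallelism math using the rough figures above.
# All numbers are illustrative assumptions, not specs of a real drive.

CHIP_CAPACITY_GB = 16        # capacity of one NAND chip
NUM_CHIPS = 128              # chips inside the SSD (64-1024 is typical)
CHIP_READ_MBPS = 450         # per-chip read throughput
CHIP_WRITE_MBPS = 25         # per-chip write (program) throughput
DRAM_RATIO = 0.001           # DRAM buffer ~0.1% of device capacity

device_capacity_gb = CHIP_CAPACITY_GB * NUM_CHIPS
dram_buffer_mb = device_capacity_gb * 1024 * DRAM_RATIO

# If the firmware keeps N chips busy at once, raw throughput scales with N
# (in practice the host interface and channel count cap these numbers).
for active_chips in (1, 8, 32, 128):
    print(f"{active_chips:4d} active chips -> "
          f"read ~{active_chips * CHIP_READ_MBPS / 1024:.1f} GB/s, "
          f"write ~{active_chips * CHIP_WRITE_MBPS / 1024:.1f} GB/s")

print(f"device capacity ~{device_capacity_gb} GB, "
      f"DRAM buffer ~{dram_buffer_mb:.0f} MB")
```

The per-chip write rate is the key number here: a single chip programming at 20-25 MB/s only turns into respectable drive-level write throughput if the firmware keeps many chips writing in parallel.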
Flash Controller
The flash controller is most commonly implemented as an SoC (system-on-a-chip) design. The controller consists of multiple hardware-accelerated functional blocks coupled to one or more embedded processor cores. Nowadays it is common to see 4/8/16 embedded processor cores inside a flash controller.
Host Interface
The host interface is the physical interface from the host system to the SSD. The majority of SSDs leverage existing storage standards and interfaces, such as SATA (serial ATA) and SAS (serial attached SCSI), and most of them use traditional block-access protocols. There is a project that exposes the underlying parallelism to the host so it can be exploited directly: https://github.com/OpenChannelSSD/linux. The low-level primitives and high-speed serializer/deserializer are accelerated in hardware, while the high-level block-access protocols are implemented by firmware running on one or more embedded processors. A newer storage interface for SSDs is PCIe (Peripheral Component Interconnect Express), which allows the host to transfer up to 16 GB/s; these SSDs are called NVMe drives.
Channel
A flash channel is the bus over which a subset of the flash chips on the SSD communicates with the SSD controller. The channel is shared by a group of NAND chips, with only one chip able to communicate with the controller at any point in time.
FTL Operations
On the codecapsule.com website, author Emmanuel Goossaert has explained the internal operations of an SSD very well and in simple language. Please read parts 3 to 5. The link to the content is here: http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/
Writes on Flash
By now you must have realized that writes on flash are interleaved across the chips inside the SSD. So even if you write 64 KB of data sequentially, it will be interleaved inside the SSD onto different chips, across channels, ways, etc., based on the page allocation strategy employed by the proprietary firmware currently running.
If you want a deeper look at page allocation strategies, refer to this paper; it will take you through complexities and possibilities of parallelism inside an SSD that you would never have thought of before. The link to the paper is https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final55.pdf
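To build some intuition for what a page allocation strategy does, the snippet below is a minimal sketch of one simple scheme, channel-first striping, where consecutive pages go to consecutive channels and then to the next chip (way) on each channel. The geometry and the policy are illustrative assumptions; real firmware uses proprietary and far more elaborate policies, as the paper above explains.

```python
# Minimal sketch of a channel-first page allocation strategy.
# A logically sequential write is striped page by page across channels,
# then across chips (ways) on each channel. Geometry is illustrative.

NUM_CHANNELS = 8
CHIPS_PER_CHANNEL = 4        # "ways" per channel
PAGE_SIZE_KB = 8

def allocate(logical_page):
    """Map the i-th page of a sequential write to (channel, chip)."""
    channel = logical_page % NUM_CHANNELS
    chip = (logical_page // NUM_CHANNELS) % CHIPS_PER_CHANNEL
    return channel, chip

# Where does a 64 KB "sequential" write actually land?
write_kb = 64
for i in range(write_kb // PAGE_SIZE_KB):
    channel, chip = allocate(i)
    print(f"page {i}: channel {channel}, chip {chip}")
```

Even this trivial policy sends the eight pages of a 64 KB sequential write to eight different chips on eight different channels, which is exactly the interleaving described above.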
Back to Storage Engine
I have been fond of the LSM storage engine ever since I read the BigTable paper, because on an HDD any kind of parallelism slows down read/write throughput: the disk seek is very expensive. From my perspective, the whole purpose of LSM was mainly to avoid the disk seek.
Patrick O'Neil, the father of LSM, makes the following point in the original paper on the Log-Structured Merge-Tree:
"Standard disk-based index structures such as the B-tree will effectively double the I/O cost of the transaction to maintain an index such as this in real time, increasing the total system cost up to fifty percent. Clearly a method for maintaining a real-time index at low cost is desirable. The Log-Structured Merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components. The algorithm has greatly reduced disk arm movements compared to a traditional access methods such as B-trees, and will improve cost performance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs. The LSM-tree approach also generalizes to operations other than insert and delete. However, indexed finds requiring immediate response will lose I/O efficiency in some cases, so the LSM-tree is most useful in applications where index inserts are more common than finds that retrieve the entries. This seems to be a common property for History tables and log files."
It is quite evident from the above that LSM was originally designed only for hard disk drives. It still has benefits when used on an SSD, but the fact is that it cannot utilize the full potential of the SSD and its parallelism.
Compared to HDDs, SSDs provide much higher throughput and lower latency, especially for random operations. However, SSDs also have their own limitations: random write performance substantially lags behind that of sequential writes and of reads, mainly because of the expensive garbage collection that random writes incur. As mentioned before, since LSM-tree-based KV stores can effectively eliminate random writes, it is promising to pair LSM-tree-based KV stores with NAND flash-based SSDs so that high throughput can be achieved for both read and write I/O operations.
This seems like an obvious benefit of using LSM on an SSD: better write amplification thanks to less fragmentation of data on the device. But because of the way the SSD writes to its flash chips, this advantage is largely lost. The SSD stripes a write across multiple flash chips, with each stripe unit equal to one page. So we employ LSM to avoid random writes, while the SSD internally fragments the LSM sorted string table (SST) file by writing an 8 KB page to each flash chip.
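As a concrete illustration of this fragmentation, take an SST file that is exactly two erase blocks in size and stripe it across 32 chips. The page size, block size, and chip count below are assumed, typical-looking numbers, not measurements of any real drive.

```python
# An SST file exactly two erase blocks in size still touches many erase
# blocks under striping, each only partly filled with this file's pages.
# Geometry is an illustrative assumption.

PAGE_KB = 8
PAGES_PER_BLOCK = 128
NUM_CHIPS = 32                      # chips the firmware stripes across

sst_pages = 2 * PAGES_PER_BLOCK     # file size == exactly 2 erase blocks

pages_per_chip = sst_pages // NUM_CHIPS           # 8 pages land on each chip
blocks_touched = NUM_CHIPS                        # one open block per chip
fill_fraction = pages_per_chip / PAGES_PER_BLOCK  # how full each block is

print(f"SST file = {sst_pages} pages ({sst_pages * PAGE_KB // 1024} MB)")
print("ideal layout  : 2 erase blocks, 100% filled by this file")
print(f"striped layout: {blocks_touched} erase blocks, each only "
      f"{fill_fraction:.0%} filled by this file")
```

Instead of invalidating two whole blocks when compaction deletes the file, we invalidate a thin 6% slice of 32 different blocks, each mostly filled with other data.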
Although researchers have been aware of the potential advantage of combining LSM-tree-based KV stores with SSDs, this topic has not been well studied.
As we have seen, the data movement process in LSM-tree-based KV stores was originally designed for HDDs rather than SSDs. Since an HDD has only one disk head, the KV store serially issues I/O requests to the HDD. We need to increase the concurrency level in order to take advantage of the SSD's strength on random reads. That is precisely what RocksDB does, but without knowing how the SSD's internal parallelism works it cannot take full advantage of it. Although a modern SSD usually consists of multiple channels, the SSD controller hardware provides only one block device interface to the operating system, and the scheduling and dispatching of I/O requests between internal channels are hidden from the software level. This makes LSM-tree-based KV stores unaware of the SSD's multi-channel architecture; the scheduling and dispatching decisions made by the SSD controller are not optimized for the data access patterns of LSM-tree-based KV stores, and it also causes extra overhead and duplicate mappings inside the SSD.
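To illustrate what "increasing the concurrency level" means on the host side, here is a rough sketch that issues the same set of random reads first serially and then from a thread pool. The file path, sizes, and thread count are arbitrary assumptions for illustration and this is not RocksDB code; on an SSD the threaded version can keep several channels busy at once, while on an HDD it would mostly cause seek thrashing.

```python
# Illustrative only: serial vs. threaded random reads against one file.
# Path, sizes, and thread count are arbitrary assumptions.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

PATH = "testfile.bin"         # hypothetical scratch file
FILE_SIZE = 64 * 1024 * 1024  # 64 MB
READ_SIZE = 8 * 1024          # 8 KB, roughly one flash page
NUM_READS = 2000

# Create the scratch file to read from.
with open(PATH, "wb") as f:
    f.write(os.urandom(FILE_SIZE))

offsets = [random.randrange(0, FILE_SIZE - READ_SIZE) for _ in range(NUM_READS)]

def read_at(offset):
    # Each call opens its own descriptor so reads don't share a file position.
    with open(PATH, "rb", buffering=0) as f:
        f.seek(offset)
        return f.read(READ_SIZE)

start = time.time()
for off in offsets:                               # serial: one outstanding I/O
    read_at(off)
print(f"serial  : {time.time() - start:.2f}s")

start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:  # 16 outstanding I/Os
    list(pool.map(read_at, offsets))
print(f"threaded: {time.time() - start:.2f}s")

os.remove(PATH)
```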
Since LSM writes data contiguously (keys are sorted inside SST files), one might think that when an SST file is deleted during compaction, a large amount of data gets invalidated within a few SSD blocks, so when garbage collection runs it can see that the invalid page count of those few erase blocks is high and target them, with little valid data left to copy. The catch is that data is never written sequentially inside the SSD. Even if the SST file is exactly the size of two erase blocks, its data will be written to many erase blocks based on the page allocation strategy inside the SSD controller firmware. This problem can still be partially addressed by carefully choosing the number of erase blocks in a superblock*, which is the basic unit of garbage collection in advanced controller firmware. One advantage remains: the SST file in LSM still avoids random writes to the device, so there is less fragmentation of data. This helps when an SST file is deleted, because there will not be one or two key entries of that SST file scattered over hundreds of erase blocks. So write amplification gets some benefit from LSM, but not a lot.
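To make the garbage-collection argument concrete, here is a minimal sketch of a greedy victim-selection policy over superblocks: pick the superblock with the highest invalid-page count, relocate whatever is still valid, and erase it. The data structures and the policy are simplified assumptions; a real FTL (see the superblock paper in the references) is far more involved.

```python
# Minimal sketch of greedy garbage collection over superblocks.
# A superblock groups several erase blocks and is collected as a unit.
# Structures and policy are simplified assumptions.
import random

BLOCKS_PER_SUPERBLOCK = 4
PAGES_PER_BLOCK = 128
PAGES_PER_SUPERBLOCK = BLOCKS_PER_SUPERBLOCK * PAGES_PER_BLOCK

# Simulate superblocks with a random number of still-valid pages each.
superblocks = [
    {"id": i, "valid_pages": random.randint(0, PAGES_PER_SUPERBLOCK)}
    for i in range(16)
]

def pick_victim(sbs):
    """Greedy policy: collect the superblock with the fewest valid pages,
    i.e. the one with the highest invalid-page count."""
    return min(sbs, key=lambda sb: sb["valid_pages"])

victim = pick_victim(superblocks)
copied = victim["valid_pages"]                 # valid pages to relocate
reclaimed = PAGES_PER_SUPERBLOCK - copied      # pages freed by the erase

print(f"victim superblock {victim['id']}: "
      f"copy {copied} valid pages, reclaim {reclaimed} pages")
print(f"GC cost for this pass ~ {copied / max(reclaimed, 1):.2f} "
      f"extra page writes per freed page")
```

If an SST file's pages were concentrated in a few superblocks, deleting it in compaction would leave those superblocks almost entirely invalid and nearly free to collect; striping spreads the invalidations thinly across many superblocks, which is exactly the catch described above.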
Erasing flash memory happens at the block level: an entire block, usually consisting of 128 pages, must be erased as a single operation (a high voltage is applied to erase the whole block at once). A page is typically 8 KB (4 KB to 16 KB) in today's flash memory. A more granular page improves latency, but read/write throughput suffers and the controller needs more DRAM for the block-page mapping.
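The DRAM cost of that mapping is easy to estimate. The sketch below compares a page-level map with a block-level map for an assumed 1 TB drive and a 4-byte mapping entry; the entry size and geometry are illustrative assumptions.

```python
# Rough size of the FTL mapping table for an assumed 1 TB drive.
# Entry size and geometry are illustrative assumptions.

DEVICE_TB = 1
PAGE_KB = 8
PAGES_PER_BLOCK = 128
ENTRY_BYTES = 4                          # one mapping entry

device_kb = DEVICE_TB * 1024**3          # capacity in KB
num_pages = device_kb // PAGE_KB
num_blocks = num_pages // PAGES_PER_BLOCK

page_map_mb = num_pages * ENTRY_BYTES / 1024**2
block_map_mb = num_blocks * ENTRY_BYTES / 1024**2

print(f"page-level map : {page_map_mb:.0f} MB of DRAM")   # ~512 MB
print(f"block-level map: {block_map_mb:.0f} MB of DRAM")  # ~4 MB
```

A page-level map of roughly 512 MB for a 1 TB drive is roughly where the "DRAM buffer is about 0.1% of capacity" figure mentioned earlier comes from.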
Further Reading -
http://surajwaghulde.blogspot.com/2015/06/ssd-controller-firmware-inefficiency.html
http://surajwaghulde.blogspot.com/2015/06/perspective-on-rocksdb-on-ssd.html
References
1. Coding for SSD - http://codecapsule.com/2014/02/12/coding-for-ssds-part-3-pages-blocks-and-the-flash-translation-layer/
2. Log-Structured Merge-Tree - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf
3. Superblock* based FTL - http://dl.acm.org/citation.cfm?id=1176911