Friday, June 5, 2015

SSD controller firmware inefficiency and possible optimizations

1. Over-provisioning - To hide the inefficiency of the controller firmware (which usually runs on more than one embedded processor, typically 4 or 8), SSD vendors rely on over-provisioning to speed up garbage collection and data writes. You can see this in the advertised drive sizes: Intel, for example, sells a 400 GB drive built on 512 GB of raw flash, so they reserve 512 - 400 = 112, a whopping 112 GB, for garbage collection.
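A back-of-the-envelope sketch of that arithmetic (the 512 GB raw and 400 GB usable figures are from the example above; the percentage is the usual reserved-over-usable over-provisioning metric):

```c
#include <stdio.h>

int main(void) {
    double raw_gb    = 512.0;   /* raw NAND capacity in the example above */
    double usable_gb = 400.0;   /* capacity actually exposed to the host  */

    double reserved_gb = raw_gb - usable_gb;               /* 112 GB      */
    double op_percent  = 100.0 * reserved_gb / usable_gb;  /* ~28% OP     */

    printf("reserved for GC: %.0f GB (%.0f%% over-provisioning)\n",
           reserved_gb, op_percent);
    return 0;
}
```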


2. Writes are interleaved across many flash chips to exploit parallelism, which fragments what is logically a sequential write. The SSD will spread 1 MB of data over several chips, typically writing one page (8 KB) to one chip and the next page to the next chip. So if you are using an LSM storage engine, you get little benefit from writing sorted data sequentially, because the SSD internally never writes it sequentially. Technically, in a conventional SSD, concurrency is achieved by striping data across the channels so that one request can be served by multiple channels. However, using this approach to achieve high bandwidth conflicts with the goal of eliminating write amplification.
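Here is a minimal sketch of that page-level striping; the 8 KB page size is from the paragraph above, while the chip count and round-robin policy are illustrative assumptions rather than any particular controller's behavior:

```c
#include <stdio.h>

#define PAGE_SIZE (8 * 1024)  /* 8 KB flash page, as in the example above */
#define NUM_CHIPS 8           /* illustrative number of chips/channels    */

int main(void) {
    size_t write_bytes = 1024 * 1024;              /* logically sequential 1 MB write */
    size_t pages       = write_bytes / PAGE_SIZE;  /* 128 pages                       */

    /* Round-robin striping: consecutive logical pages land on different
       chips, so the "sequential" write is physically scattered. */
    for (size_t p = 0; p < pages; p++) {
        unsigned chip         = p % NUM_CHIPS;  /* which chip receives this page */
        unsigned page_on_chip = p / NUM_CHIPS;  /* offset within that chip       */
        if (p < 10)  /* print only the first few assignments */
            printf("logical page %zu -> chip %u, page %u\n", p, chip, page_on_chip);
    }
    return 0;
}
```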

3. SSDs consist of multiple NAND flash chips, and conventional SSDs provide data protection by using RAID 5 parity coding across the flash chips, in addition to powerful BCH ECC protection within individual flash chips for fault detection and correction, with attendant cost, complexity, and reduced flash space available for user data. However, in our large-scale Internet service infrastructure, data reliability is already provided by data replication across multiple racks.
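For reference, the RAID 5 parity mentioned here is just a bytewise XOR across the data pages of a stripe; a toy sketch, with page size and chip count chosen for illustration only:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  8192  /* illustrative flash page size                   */
#define DATA_CHIPS 7     /* e.g. 7 data pages + 1 parity page per stripe   */

/* XOR all data pages in a stripe into one parity page (RAID 5 style).
   Any single lost page can be rebuilt by XOR-ing the surviving pages. */
void compute_parity(const uint8_t data[DATA_CHIPS][PAGE_SIZE],
                    uint8_t parity[PAGE_SIZE]) {
    memset(parity, 0, PAGE_SIZE);
    for (int chip = 0; chip < DATA_CHIPS; chip++)
        for (int i = 0; i < PAGE_SIZE; i++)
            parity[i] ^= data[chip][i];
}
```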

4. The asymmetry between write and erase causes a lot of write amplification: no matter how you write data, there will be some amplification, and hiding it requires quite expensive garbage collection. For garbage collection to work on a multi-terabyte SSD, we need hybrid (block-page) mapping, which forces garbage collection to operate at the superblock level (a few aligned erase blocks), creating even more write amplification. To cover this, you need extra over-provisioning: on top of the 112 GB the vendor already reserves, an administrator trying to make the software perform better adds another ~100 GB of over-provisioning, reducing the usable capacity to about 300 GB out of 512 GB (roughly 60%).
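Write amplification is conventionally measured as NAND bytes programmed divided by host bytes written; a minimal bookkeeping sketch (the struct and counter names are made up for illustration):

```c
#include <stdio.h>

/* Hypothetical counters an FTL might keep. */
struct ftl_stats {
    unsigned long long host_bytes_written;  /* data the host asked to write          */
    unsigned long long nand_bytes_written;  /* data actually programmed, including
                                               valid pages relocated by GC           */
};

double write_amplification(const struct ftl_stats *s) {
    return (double)s->nand_bytes_written / (double)s->host_bytes_written;
}

int main(void) {
    /* Illustrative numbers: GC relocating valid pages inside superblocks
       makes the drive program 2.5x what the host wrote. */
    struct ftl_stats s = { 100ULL << 30, 250ULL << 30 };
    printf("write amplification: %.2f\n", write_amplification(&s));
    return 0;
}
```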

5. Layers of the I/O stack, such as the block layer, have become a bottleneck for today's high-performance SSDs. A duplicate mapping exists inside the SSD to map the externally exposed logical block addresses to physical blocks. This mapping takes up a lot of DRAM in the SSD controller and complicates controller logic, increasing cost and energy requirements, since DRAM consumes a lot of energy.
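To see why that mapping eats DRAM, here is a back-of-the-envelope sizing sketch; the 4 KB mapping granularity and 4-byte entries are common assumptions, not figures from this post:

```c
#include <stdio.h>

int main(void) {
    unsigned long long capacity_bytes = 512ULL << 30;  /* 512 GB of raw flash       */
    unsigned long long map_unit       = 4096;          /* assume 4 KB mapping pages */
    unsigned long long entry_bytes    = 4;             /* assume 4-byte PBA entries */

    unsigned long long entries     = capacity_bytes / map_unit;
    unsigned long long table_bytes = entries * entry_bytes;

    /* Roughly 512 MB of controller DRAM just to hold the L2P table. */
    printf("L2P table: %llu entries, %llu MB of DRAM\n",
           entries, table_bytes >> 20);
    return 0;
}
```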

This mapping of logical block addresses (LBAs) to physical block addresses (PBAs) is stored in the SSD's DRAM. The mapping itself is a simple array, but the LBAs are interleaved over multiple flash chips using simple power-of-two arithmetic: for example, the last 2 bits select one of 4 planes inside a chip, the next 4 bits one of 16 erase blocks inside the chip, the next 4 bits one of 16 ways per channel, the next 5 bits one of 32 channels in the SSD, and the remaining bits grow sequentially. Together these fields form the physical address as data is written to multiple flash chips in parallel. This could completely change your mental picture of a flash drive if you were unaware of it. The mapping is also completely redundant, since in NoSQL storage engines we want to store data as key-value pairs: a key needs to be mapped to one or more LBAs, or multiple keys may map to one LBA, where an LBA addresses a 512-byte sector. Bottom line - we simply do not need this logical block layer.
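Below is a sketch of that power-of-two interleaving using exactly the field widths quoted above (2 bits of plane, 4 of erase block, 4 of way, 5 of channel, the rest sequential); the struct and function names are mine, and real controllers will of course differ:

```c
#include <stdint.h>
#include <stdio.h>

/* Field widths quoted above: 4 planes, 16 blocks, 16 ways, 32 channels. */
#define PLANE_BITS   2
#define BLOCK_BITS   4
#define WAY_BITS     4
#define CHANNEL_BITS 5

struct phys_addr {
    unsigned plane;    /* plane inside a chip                  */
    unsigned block;    /* erase block inside the chip          */
    unsigned way;      /* chip (way) on a channel              */
    unsigned channel;  /* channel in the SSD                   */
    uint64_t seq;      /* remaining, sequentially growing part */
};

/* Decompose a logical page number into the interleaved physical fields. */
struct phys_addr decode(uint64_t lpn) {
    struct phys_addr p;
    p.plane   = lpn & ((1u << PLANE_BITS) - 1);    lpn >>= PLANE_BITS;
    p.block   = lpn & ((1u << BLOCK_BITS) - 1);    lpn >>= BLOCK_BITS;
    p.way     = lpn & ((1u << WAY_BITS) - 1);      lpn >>= WAY_BITS;
    p.channel = lpn & ((1u << CHANNEL_BITS) - 1);  lpn >>= CHANNEL_BITS;
    p.seq     = lpn;
    return p;
}

int main(void) {
    /* Consecutive logical pages first cycle through planes, then blocks,
       ways and channels, before the sequential part ever advances. */
    for (uint64_t lpn = 0; lpn < 4; lpn++) {
        struct phys_addr p = decode(lpn);
        printf("lpn %llu -> chan %u way %u block %u plane %u seq %llu\n",
               (unsigned long long)lpn, p.channel, p.way, p.block, p.plane,
               (unsigned long long)p.seq);
    }
    return 0;
}
```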

