



#### Flashing in the Memory Hierarchy An Overview on Flash Memory Internals

Jalil Boukhobza, Stéphane Rubini Lab-STICC, Université de Bretagne Occidentale

{boukhobza, rubini}@univ-brest.fr 15/11/2012



2

#### NAND Flash in the hierarchy



{boukhobza, rubini}@univ-brest.fr 15/11/2012

- -

Lab-STICC



## Where is the NAND flash memory ?



#### Number of Units (Millions) by Application



Lab<sub>'</sub>STICC







{boukhobza, rubini}@univ-brest.fr 15/11/2012 3



5

Fla





#### Millions of \$ by Application

Flash Memory Summit 2011 Santa Clara, CA

> {boukhobza, rubini}@univ-brest.fr 15/11/2012



## Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



## Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



# Université de bretagne occidentale

## Flash memory cells

- Invented by F. Masuoka Toshiba 1980
- Introduced by Intel in 1988
- Type of EEPROM (Electrically Erasable & Programmable Read Only Memory)
- Use of Floating gate transistors
- Electrons pushed in the floating gate are trapped



3 operations: program (write), erase, and read

#### Flash memory operations



#### **Erase operation**

- FN (Fowler-Nordheim) tunneling: Apply high voltage to substrate (compared to the operating voltage of the chip usually between 7– 20V)
- → electrons off the floating gate
- Logic « I » in SLC

# Program / write operation

- Apply high voltage to the control gate
- → electrons get trapped into the floating gate
- Logic « 0 »

#### **Read operation**

- Apply reference voltage to the control gate:
  - If floating gate charged: no current flow
  - If not charged; current flow

U

B

0

de bretagne occidentale



## NOR Vs NAND

NOR

U

B

0

de bretagne occidentale

- Byte random access
- Low density
- Higher cost (/bit)
- Fast read
- Slow write
- Slow erase
- Code storage (XIP eXecute In Place)
- NAND
  - Page access
  - High density
  - Slower read
  - Faster write
  - Faster erase (block granularity)
  - Data storage
- Other types: DiNOR, AND, ...



Source EETimes: http://www.eetimes.com/design/memorydesign/4009410/Flash-memory-101-An-Introduction-to-NAND-flash



## NAND flash memory architecture



- ▶ Read/Write → page
- ► Erasures → blocks
- Page: 2-8KB
- Block: 128-1024 KB

 Page
 Cell

 64 Pages per block
 Cell

 Source: http://www.electroig.com/articles/sst/2011/05/solid-state-drives.html

Control gate

Lab<mark>-STICC</mark>

1/0

1/0

1/0

1/0

Memory

Floating gate

{boukhobza, rubini}@univ-brest.fr 15/11/2012

bits per page-

Université de bretagne occidentale

## Different densities: SLC, MLC, TLC

|                          | SLC<br>(Single Level Cell)                                         | MLC<br>(Multi Level Cell)                                           | TLC<br>(Tri Level Cell)                                                           |
|--------------------------|--------------------------------------------------------------------|---------------------------------------------------------------------|-----------------------------------------------------------------------------------|
|                          | 1<br>0<br>V <sub>Total</sub><br>Single Level<br>Cell (SLC)<br>NAND | 11 01 00 10 Multi Level<br>Celi (MLC)<br>NAND<br>V <sub>Total</sub> | Tri Level<br>Cell (TLC)<br>NAND<br>V <sub>Total</sub>                             |
| Storage                  | l bit / cell                                                       | 2 bits / cell                                                       | 3 bits /cell                                                                      |
| Performance              | +++                                                                | ++                                                                  | +                                                                                 |
| Density                  | +                                                                  | ++                                                                  | +++                                                                               |
| Lifetime (P/E<br>cycles) | ~ 100 000                                                          | ~ 10 000                                                            | ~5 000                                                                            |
| ECC complexity           | +                                                                  | ++                                                                  | +++                                                                               |
| Applications             | Embedded and<br>industrial applications<br>(high end SSDs)         | Most consumer<br>applications (e.g.<br>memory cards)                | Low-end consumer<br>applications not<br>needing data updates<br>(e.g. mobile GPS) |





{boukhobza, rubini}@univ-brest.fr 15/11/2012

U

0

université

de bretagne occidentale



## Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



## Flash memory constraints



E E E



## Wear leveling

You already do that with your tyres ...







4 tyre rotation all are same size Tyre rotation when size is different front & rear

Pierelli Courtesy

## Keeping a balanced erasures' distrubution over flash

all are same size

memory blocks.

http://www.presence-pc.com/tests/ssd-flash-disques-22675/5/



{boukhobza, rubini}@univ-brest.fr 15/11/2012



#### Garbage Collection



Moving valid pages from blocks containing invlid data and then erase/recycle the blocks



#### Flash memory structure



{boukhobza, rubini}@univ-brest.fr 15/11/2012

**U U** 



## Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



## Basic mapping schemes a- Page mapping ideal scheme

- Each page mapped independently
- High flexibility
- Ideal performance
- ► High RAM usage → unfeasible
  - 32GB flash memory, 2KB per page and 8 bytes/table entry → 128MB table !!!
- Optimal performance reference



Lab<sub>'</sub>STICC



## Basic mapping schemes b- Block mapping scheme

- Only blocks numbers in the mapping table
- Page offsets remain unchanged
- Small mapping table (memory footprint)
- Very bad performance for write updates
- Same config. with 64 pages/block: 2MB page table



\_ab<mark>-</mark>STICC

(b) Block mapping table

# 

## Basic mapping schemes c- Hybrid mapping scheme

- Use of both block and page mappings
- Example:
  - Block mapping (BM)
  - Some data blocks are page mapped (PM)
- Current designs use whether:
  - One global BM and use log blocks
  - Partition flash memory in one BM region and a PM region
- Performance of PM with a memory footprint approaching BM



(c) Hybrid mapping tables



{boukhobza, rubini}@univ-brest.fr 15/11/2012



- Use of OOB to save the @ of the data page written to the spare area
- [Ban99] use of log blocks: blocks dedicated to absorb data update
  - ANAND : respecting the page offset in log blocks (cons: no more than one same page update before merging data and log blocks)
  - FMAX: Allowing associativity in log blocks (use of OOB area)
    - I to I data/log block correspondance



## Log block based FTL -2-

Merge operations:



Merge data and log block into a free block

- [RNFTL10] Reuse-Aware NAND FTL: only 46% of the data blocks are full before merge operation happen
  - erase only log block and use the rest of data block as log pages for other blocks
- [BAST02] Block Associative Sector Translation FTL
  - FMAX with  $\searrow$  log blocks
  - Log blocks managed by page mapping table
  - One log block for a given data block



## Log block based FTL -3-

- FAST07] Fully Associative Sector Translation FTL
  - Log blocks poorly filled because of associativity
  - Divide log blocks in 2 regions (spatial locality):
    - One sequentially written log block
    - Randomly accessed log blocks → fully associative → page mapped

#### [LAST08] Locality Aware Sector Translation FTL

- FAST: less merges operations but very costly because pages coming from many different blocks
- As for FAST 2 regions
  - Big writes → sequential log blocks
  - Small write  $\rightarrow$  random log blocks
    - In random region: hot and cold region to avoid costly merge operations



# Log block based FTL -4-

- [EAST08] Efficient and Advanced Space management Technique
  - FAST: bad usage of log blocks ... yet not full
  - No sequential and random regions but:
    - In-place data updates in log blocks in first pace
    - Out-of-place data updates if more updates are achieved
  - Fix the number of log blocks that can be dedicated to one data blocks according to flash characteristics

#### [KAST09] K-Associative Sector Translation FTL

- FAST: costly merge operation for the random region
- Limit the associativity of log blocks (K)
- Many sequential log blocks
- Migration between random and sequential region

0

Université de bretagne occidentale

# Hybrid FTLs, region partitioning

- [STAFF07] State Transition Applied Fast FTL
  - Block: (F)ree state, (M)odified in-place, complete in-place (S)tate, modified out-of-place (N), or in (O)bsolete state (no valid data)
  - Page mapping table for blocks in the N state.
  - RAM usage unpredictability because of N
- [HFTL09] Hybrid FTL
  - Use of hot data identifier:
    - Hot data are page mapped
    - Cold data use FAST



Lab-STICC



Université de bretagne occidentale

## Hybrid FTLs, region partitioning -2-

- [WAFTLII] Workload adaptive FTL
  - Page mapped region
    - Random data and partial updates
  - Block mapped region
    - Sequential data and mapping tables Layout and In-memory Data Structure



Source: http://storageconference.org/2011/Presentations/Research/6.Wei.pdf

\_ab<mark>-</mark>STICC



- [DFTL09] Demand based FTL
  - Idea : use page mapping and keep only part of the mapping table in RAM
    - Rest of the mapping table stored in the flash



Source: [DFTL09]

- [SFTLII] Spatial locality FTL
  - Reduces the size of the mapping table by keeping track of sequential accesses and use only one table entry.

U

B

0

de bretagne occident<u>ale</u>



## Tablet Apple IPad



Virtual Flash Layer (VFL): remaps bad blocks and presents an error-free NAND to the FTL layer

FTL  $\rightarrow$  YaFTL (Yet another FTL)



http://esec-lab.sogeti.com/post/Low-level-iOS-forensics



## YaFTL

E E E

0

- Page mapping (implemented as a walk table)
  DFTL principles: a part of the map is stored on the Flash
- Splits the virtual address space into superblocks
  - 3 types of superblocks
    - User data page
    - Index page (page map)
    - Context (index page+ erase count)





{boukhobza, rubini}@univ-brest.fr 15/11/2012



## Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



## Wear leveling

- **<u>Objective</u>**: keep all the flash memory space usable as long as possible
- Based on the number of erasures or writes performed on a block
  - > < mean value : cold block</pre>
  - > mean value: hot block
  - Maintain the gap between hot and cold block as small as possible
  - Swap data from hot blocks to cold blocks (costly)
  - Which blocks are concerned: only free ? All of them ?
- [DualPool95] adds 2 additional block mapping tables
  - Hot and cold table: free block taken from cold blocks
  - Periodically the mean erase number is recalculated and hot and cold tables updated

B

0



- Write count based wear leveler
- [Achiwa99]: erase count maintained in RAM in addition to write count
  - Put the more written data into the less erased blocks
- [Chang07]: like the dual pool but according to the number of writes (pages level)
- [Kwon11] considering groups of blocks reducing the RAM usage of the wear leveler

B

0



### Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives

### Garbage Collection / cleaning policy

- Process that recycles free space from previously invalidated pages in different blocks.
- Answers the questions:
  - When should it be launched ?
  - Which blocks to choose and how many ? 2.
  - How should valid data be written? 3
  - (where to write the new data ?)  $\rightarrow$  wear leveler 4.



- Free blocks < 3
- 2. Containing the most invalid pages to free 3 blocks (1,3,5)

\_ab-STICC

3. Respecting pages' placement

<sup>{</sup>boukhobza, rubini}@univ-brest.fr 15/11/2012



### Garbage Collection -2-

- Minimizing cleaning cost wile maximizing cleaned space
- Cleaning cost:
  - Number of erase operations
  - Number of valid pages copy
- Main considered metric: ratio of dirty pages in blocks
  - Maintaining counters in RAM or in metadata area
- Separate cold and hot data when performing valid pages copy (Q. 3 In previous slide)
  - Otherwise GC frequently launched (due to hot data updates)
  - [DAC08] Dynamic dAta Clustering: partitioning flash memory into many regions depending on update frequency



### Garbage Collection -3-

- Which blocks to choose (Q. 2)
  - Greedy policy: blocks with the most dirty pages
    - Efficient if flash memory accessed uniformly
- [Kawaguchi95]
  - Space recycled: (I-u), cost of read and write valid data: (2u)
  - Elapsed time since the last modification: age
  - Age \* (I-u)/2u block with the highest score is chosen

#### [CAT99]: Cost AgeTimes

- Hot blocks are given more time to accumulate more invalid data
   → direct erase without GC
- CleaningCost \* (1/age) \* NumberOfCleaning
  - CleaningCost: u/(I-u) u: percentage of valid data in the block
  - NumberOfCleaning: number of generated erases
  - The less the score, the more chances to be cleaned



### Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics

#### 3. Flash memory support

- I. Mapping schemes
- 2. Wear leveling
- 3. Garbage collection
- 4. Flash specific cache systems
- 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



### Flash specific cache systems

- They are mostly write specific and try to:
  - absorb most page/block write operations at the cache level
  - reveal sequentiality by buffering write operations and reorganizing them

\_ab<mark>-</mark>STICC

- [CFLRU06] Clean First LRU (caches reads & writes)
  - LRU list divided into two regions:
    - A working region: recently accessed pages
    - A clean first region: candidate for eviction
  - It evicts first clean pages that do not generate any write (e.g. P7, P5, P8, then P6.)

    Working region
    Clean-first region



### Flash specific cache systems -2-

- [FAB06] Flash Aware Buffer
  - Flushes the largest groups of pages belonging to the same block (minimize merges)

\_ab<mark>·</mark>STICC

- If the same number of pages: uses LRU
- Fightharpoonup [BPLRU08] Block Padding LRU
  - Block level LRU scheme
  - Page padding: read lacking pages and flush full blocks
  - LRU compensation: sequentially written data are moved to the end of the LRU queue
- LRU like algorithms, page/block granularity, & double objective : caching, reducing erasures



### Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives

U D D D D U u i v e s c i d e b r e tagne o c c i d e tagne té

### C-lash: a cache for flash [C-lash11]

- Hierarchical cache
  - Cache with no WL or GC
  - 2 regions:
    - Page region (P-space)
    - Block region (B-space)
- Read operations
  - Hit: Read from cache
  - Miss: no copy to the cache
- Write operations
  - Hit: update in the cache
  - Miss: write in the P-spaces



Lab-STICC



### C-lash: a cache for flash -2-

#### 2 eviction policies

- P-space → B-space: largest set of pages from the same block (spatial locality)
- B-space → flash : LRU (temporal locality)
- Early and late cache merge in B-space.
- Switch (p-space/b-space)
- Proved good performance for mostly sequential workload
  - Bad for very random ones



U

B

0

CACH-FTL Cache-Aware Configurablab STICC Hybrid FTL [CACH-FTL13]

- Most flash systems have cache mechanisms on top of FTL
- Most flash specific cache systems flush groups of pages



{boukhobza, rubini}@univ-brest.fr 15/11/2012



## CACH-FTL -2-

- PMR (Page Mapped Region) garbage collection:
  - Launched when the number of free blocks in the PMR goes under a predefined threshold
  - Greedy reclamation algorithm (least number of valid pages)
- BMR (Block Mapped Region) garbage collections
  - PMR-GC cannot find any physical block containing enough invalid pages to recycle a block
  - Greedy reclamation algorithm selecting the largest group of PMR pages belonging to the same data block

|                      | PMR-GC victim block Block n Block n+1 Block n+2···· |          |                 |                   |        |  |  |  |  |  |  |
|----------------------|-----------------------------------------------------|----------|-----------------|-------------------|--------|--|--|--|--|--|--|
|                      | Block n                                             |          |                 |                   |        |  |  |  |  |  |  |
|                      | of PMR                                              |          | $\sim$          |                   |        |  |  |  |  |  |  |
|                      | P2, B4                                              | XXXXX    | P1, B9          | P1, B7            |        |  |  |  |  |  |  |
| (1)                  | P2, B7                                              | P1, B4   | P2, B9          | XXXXX             |        |  |  |  |  |  |  |
| (')                  | P0, B1                                              | P3, B4   | P3, B2          | XXXXX             |        |  |  |  |  |  |  |
|                      | P1, B0                                              | XXXXX    | P0, B2          | XXXXX             |        |  |  |  |  |  |  |
|                      |                                                     |          |                 |                   |        |  |  |  |  |  |  |
|                      | P2, B4                                              | XXXXX    | P1, B9          |                   | P1, B7 |  |  |  |  |  |  |
| (2)                  | P2, B7                                              | P1, B4   | P2, B9          |                   |        |  |  |  |  |  |  |
| (2)                  | P0, B1                                              | P3, B4   | P3, B2          |                   |        |  |  |  |  |  |  |
|                      | P1, B0                                              | XXXXX    | P0, B2          |                   |        |  |  |  |  |  |  |
| BMR-GC victims pages |                                                     |          |                 |                   |        |  |  |  |  |  |  |
| ſ                    | P2, B4                                              | P2, B8   | _P1, B9         | P2, B10           |        |  |  |  |  |  |  |
| (3)                  | P2, B7                                              | P1, B4   | P2, B9          | -P1, B11          |        |  |  |  |  |  |  |
| (3)                  | P0, B1                                              | P3, B4 🚽 | <b>CP3</b> , B2 | P3_B14            |        |  |  |  |  |  |  |
|                      | P1, B0                                              | P3, B3   | P0, B4 -        | <b>\$</b> P1, B10 |        |  |  |  |  |  |  |
| C victim block       |                                                     |          |                 |                   |        |  |  |  |  |  |  |
| (4)                  | XXXXX                                               | P2, B8   | P1, B9          | P2, B10           |        |  |  |  |  |  |  |
|                      | P2, B7                                              | XXXXX    | P1, B9          | P1, B11           |        |  |  |  |  |  |  |
|                      | P0, B1                                              | XXXXX    | P3, B2          | P3, B14           |        |  |  |  |  |  |  |
|                      | P1, B0                                              | P3, B3   | XXXXX           | P1, B10           |        |  |  |  |  |  |  |
| l                    | , 50                                                | 13, 53   | 70000           | , 810             |        |  |  |  |  |  |  |
| ſ                    | XXXXX                                               |          | P1, B9          | P2, B10           | P2, B8 |  |  |  |  |  |  |
| (5)                  | P2, B7                                              |          | P2, B9          | P1, B11           | P3, B3 |  |  |  |  |  |  |
| (3)-                 | P0 B1                                               |          | P3 B2           | P3 B14            |        |  |  |  |  |  |  |

P3, B2 | P3, B14

P1, B10

XXXXX

P1. B0



#### CACH-FTL -3-





- Flashsim+Disksim simulator
- CACH-FTL configuration:
  - threashold =8 pages
  - Over-prov. of 10%
- OLTP real traces + Synth. traces





#### CACH-FTL -4-



Lab-STICC



### Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



### Performance and energy considerations

 2006 → 2012: nearly exponential growth of published work on flash memory

\_ab<mark>·STICC</mark>

"Tape is dead, disk is tape, <u>flash is disk</u>, RAM locality is King" Jim Gray 2006

#### Flash disks outperform hard disk drives (HDD)

- Sequential reads and writes
- Random reads (no mechanical elements)
- ▶ Random writes → Achilles' heel
  - Depends on flash intricacies
- Flash disks are generally more energy efficient
  - More than 5x less energy in some cases [Park I]



#### Flash disk performance is heterogeneous

- Depend on internal structure and workload
- Performance disparities between SSDs from the same constructor and between different technologies are significant [Park II]



143/93

120/90 JUOUKIIOULA, I UUIIII (WUIIIV-UI COLII IJ/II/ZVIZ

220/200

NA/NA

Rd./Wr. Perf. (MB/s)

60/45

E E

0



### Performance and energy -3-

- A wide performance and energy asymmetry between reads and writes [Park11]
  - The more free space the better the write performance



|                      | OLD          | MLC     | SLC     | TGN       | HDD             |
|----------------------|--------------|---------|---------|-----------|-----------------|
| Model                | FSD32GB25M   | 1C32G   | MSP7000 | MMCRE28G5 | WD1600BEKT      |
| Vendor               | Super talent | OCZ     | MTron   | Samsung   | Western digital |
| Form factor          | 2.5 in.      | 2.5 in. | 2.5 in. | 2.5 in.   | 2.5 in.         |
| Flash type/RPM       | SLC          | MLC     | SLC     | MLC       | 7200            |
| Capacity             | 32 GB        | 32 GB   | 16 GB   | 128 GB    | 160 GB          |
| Rd./Wr. Perf. (MB/s) | 60/45        | 143/93  | 120/90  | 220/200   | NA/NA           |

{boukhobza, rubini}@univ-brest.fr 15/11/2012

ั ย เ

0



# Flash performance needs time to reach steady state.



Figure 1-1 – NAND-based SSS Performance States (RND 4KiB Writes)

{boukhobza, rubini}@univ-brest.fr 15/11/2012

E E E

0



### Performance and energy -5-

#### Flash memory design space is large !

- Different FTLs (mapping,WL, GC)
- Degree of concurrency [Agrawal08]:
  - Parallel requests: parallel requests to each element of the flash array, a queue per element.
  - **Ganging**: using a gang of flash elements in synchrony to optimize multi-page request
  - Interleaving: within a die
  - Background cleaning
- Capacity > 2TB, cost 0.7€/GB





{boukhobza, rubini}@univ-brest.fr-15/11/2012-

U

B

O)



### Performance and energy -6-

- Intra SSD performance:
  - SSD System level: among channels
  - Chip-level: among chips in a channel
  - Dies level: among dies in a chip
  - Among planes in a die



E B

0



### Performance and energy -6-

**Off-the-shelf SSDs** 

15k RPM SAS HDD: ~250-300 IOPS 7.2k RPM SATA HDD: ~80 IOPS

#### Α В С D Е PATA Drive SATA Drive SATA Drive SAS Drive Form Factor PCI-e card Consumer Consumer Consumer Enterprise Enterprise MLC Flash Chips MLC MLC SLC SLC Capacity 32 GB 100 GB 160GB 140GB 450 GB Read 53 MB/s 250 MB/s 220 MB/s 285 MB/s 700 MB/s Bandwidth Write 28 MB/s 250 MB/s 100 MB/s 115 MB/s 500 MB/s Bandwidth ~ 1 order of magnitude Random 4kB 3.5k 30k 45k 140k 35k Read IOPS > 2 orders of magnitude Random 4kB 0.01k 0.6k 16k 10k 70k Write IOPS ~ 15 \$/GB ~ 4 \$/GB ~ 2.5 \$/GB ~18 \$/GB ~ 38 \$/GB Street Price: (2007)(2011)(2010)(2010)(2009)

66

Bonnet, Bouganim, Koltsidas, Viglas, VLDB 2011

{boukhobza, rubini}@univ-brest.fr 15/11/2012

E E E

0

université de bretagne occidentale



### Contributions



Power consumption & performance modeling of embedded systems with an embedded OS (Pierre Olivier PhD)

- Performance and energy consumption at different layers.
- Microbenchmarking different
   FFS / different initial states
- Simple, atomic access. Legacy NAND commands :
  - Read and write (page)
  - Erase (block)



### Linux : MTD Device versus Block Device

| Block device                                                     | MTD device                                                                                    |
|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| Consists of sectors                                              | Consists of eraseblocks                                                                       |
| Sectors are small (512, 1024 bytes)                              | Eraseblocks are larger (typically 128KiB)                                                     |
| Maintains 2 main operations: <b>read</b> sector and write sector | Maintains 3 main operations: read<br>from block, write to eraseblock, and<br>erase eraseblock |
| Bad sectors are re-mapped and hidden by hardware                 | Bad eraseblocks are not hidden and should be dealt with in software                           |
| Sectors are devoid of the wear-out property                      | Eraseblocks wear-out and become bad<br>and unusable                                           |

Source: <a href="http://www.linux-mtd.infradead.org/">http://www.linux-mtd.infradead.org/</a>



### Performance

#### Tests programs:

- Kernel level : modules (MTD) low level calls)
- MTD-userspace : shell scripts
- Ex : writes (MTD kernel level)





### Power consumption



Same test programs, max access on test partition

#### Ex : erases(MTD kernel level)

Kernel erase test module power consumtion



{boukhobza, rubini}@univ-brest.fr 15/11/2012

### Used tool: Flashmon [Flashmon11]

ab-STICC

- Flash access profiler (kernel module) for raw flashbased Linux embedded systems
  - Monitors and log events (read, write, erase)
  - Stores access counter for each block of the monitored flash memory
- Linux programs profiled with Flashmon
- Flash access numbers injected into the models to estimate Flash I/O time and power consumption





### Flashmon -2-

| [pierre@as3: ~] [trunk - Navigateur de root@as3: /home/pierr |
|--------------------------------------------------------------|
|--------------------------------------------------------------|



### Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



### Flash based subsystems

- Flash based subsystems
  - Flash memory +
  - Controller (FTL, wear
     leveler) + (buffer) +
  - Hardware interface, software API
    - Solid State Drive: ATA, SATA, PCIe
    - USB mass storage

. . .



U

B

0

### Memory Card & Drive

| Card Type       | Interface<br>(command)        |
|-----------------|-------------------------------|
| CompactFlash    | N + PCMCIA,<br>PC_Card (PATA) |
| MMC/SD          | N + SPI (MMC)                 |
| XQD             | N + PClexpress                |
| USB flash drive | USB (SCSI)                    |
| UFS             | UniPro (SCSI)                 |
| CFAST           | N+ SATA (SCSI)                |
| N=Native        |                               |

SSD Drives: SATA, SAS, NVMexpress,



Lab-STICC



{boukhobza, rubini}@univ-brest.fr 15/11/2012

### Universal Flash Storage (UFS)

- JEDEC standard VI.1, 2012
  - Features:
    - Multiple command queues, multiple partitions parallel functions, boot partition
    - Max interface speed: 5.8 Gbps
    - Command set: SCSI+UFS specific
    - Compatible eMMC
- Serial interconnection (unipro), chain topology



Lab<mark>-STICC</mark>

http://www.toshiba-components.com/ufs/index.html





### **Real-Time Constraints**



- Full-duplex host interface
- High Priority Interrupt





### Multiple NAND channels & partitions

- Full utilization of interleaving across NAND channels
- Read-while-write (full-duplex) and dual write across multiple channels



Multiple partitions

Lab-STICC

Technology mix: Part I: SLC Part 2:TLC



### Interface / product



{boukhobza, rubini}@univ-brest.fr 15/11/2012

**B B** 

0

versité



### Presentation outline

- I. Flash memory basics
- 2. Flash memory characteristics
- 3. Flash memory support
  - I. Mapping schemes
  - 2. Wear leveling
  - 3. Garbage collection
  - 4. Flash specific cache systems
  - 5. Some contributions
- 4. Performance & energy considerations
- 5. Flash memory interfacing
- 6. Conclusions & perspectives



### Conclusions & perspectives

| NV Memory Technology Comparison Ed Grochowski |                                                |                  |                  |                  |                  |                   |                  |                 |                   |                       |                      |    |
|-----------------------------------------------|------------------------------------------------|------------------|------------------|------------------|------------------|-------------------|------------------|-----------------|-------------------|-----------------------|----------------------|----|
|                                               | Memory                                         | HDD              | Flash<br>(SLC)   | Flash<br>(MLC)   | FeRAM            | MRAM              | РСМ              | ReRAM           | STT-<br>RAM       | Race-<br>Track        | Mole-<br>cular       |    |
|                                               | Cell<br>Structure                              | None             | 1T/0C            | 1T/0C            | 1T/1R            | 1T/1R             | 1T/1R            | 1T/1R           | 1T/1R             | None                  | 1T/1R                |    |
|                                               | Cell size<br>(F <sup>2</sup> )                 | 0.5              | 3.5              | 3.5              | 15               | 16=30             | 5-7              | 6-10            | 6–20              | 0.5e                  | 10-5                 |    |
|                                               | Read time<br>(ns)                              | 2000             | 50               | 50               | 20–80            | 3–20              | 20–50            | 20–50           | 2–20              | 500                   | 100e                 |    |
|                                               | Write /<br>Erase time<br>(ns)                  | 1000             | 1 ms /<br>0.1 ms | 1 ms /<br>0.1 ms | 50/50            | 3-20              | 100              | 20–50           | 2–20              | 200                   | 300e                 |    |
|                                               | Endurance                                      | 10 <sup>16</sup> | 10 <sup>5</sup>  | 10 <sup>3</sup>  | 10 <sup>12</sup> | >10 <sup>15</sup> | 10 <sup>12</sup> | 10 <sup>8</sup> | >10 <sup>15</sup> | 10 <sup>16</sup><br>e | 10 <sup>8</sup><br>e |    |
|                                               | Write<br>power                                 | Low              | Very<br>high     | Very<br>high     | Low              | High              | High             | Low             | Low               | Low                   | Low                  | Ju |
|                                               | Max. Areal<br>Density<br>Gbits/in <sup>2</sup> | 750-<br>1000     | 150              | 550              | 0.1              | 10                | 200e             | 200e            | 300e              | >1000e                | 400e                 |    |
|                                               | Voltage<br>required                            | 3-5V             | 12 V             | 12 V             | 2–3 V            | 3 V               | 1.5–3 V          | 1.5–3 V         | <1.5 V            | 3-5V                  | 3-5V                 |    |
| Flash Memory Summit 201<br>Santa Clara, CA    |                                                | 2                | Exis             | ting proc        | ducts            |                   | ŀ                | Prototype       |                   |                       | 22                   |    |



Lab-STICC

Just one of a team !

Source. E. Grochowski, "Future Technology Challenges For NAND Flash And HDD Products", FlashSummit 2012

{boukhobza, rubini}@univ-brest.fr 15/11/2012

### Conclusions & perspectives -2-

Integration of the non-volatile memory in operating system stack (not in the storage stack).

Lab-STICC

- Non volatile memories' coexistence
  - Toward the predictibility of access times in flash memory:
    - Real time systems
    - Data base cost models

#### Leveraging data center's energy/performance bottleneck:

- Storage represents 20% to 40% of the total energy consumption [Carter I 0]
- EMC forecasted that the amount of digital information created annually will grow by a factor of 44 from 2009 to 2020 [Farmer10]
- Microsoft, Google, and Yahoo are showing the way ...
- → <u>Energy proportioinality</u>
- Architectural design space exploration tools



#### References



[Shinohara99] Shinohara, T. (1999), Flash Memory Card with Block Memory Address Arrangement, United States Patent, No 5,905,993.

[Ban99] Ban, A. (1999), Flash File System Optimized for Page-mode Flash Technologies, United States Patent, No 5,937,425. [RNFTL10] Wang, Y., Liu, D., Wang, M., Qin, Z., Shao, Z., & Guan, Y. (2010), RNFTL: a Reuse-aware NAND Flash Translation Layer for Flash Memory, *In Proceedings of the ACM SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools for embedded systems (LCTES).* 

[BAST02] Kim, J. Kim, J. M., Noh, S. H., Min, S. L., & Cho, Y. (2002), A Space-Efficient Flash Translation Layer for Compact Flash Systems, IEEE Transactions on Consumer Electronics, 48(2), 366-375.

[FAST07] Lee, S., Park, D., Chung, T., Lee, D., Park, S., & Song, H. (2007), A Log Buffer Based Flash Translation Layer Using Fully Associative Sector Translation, ACM Transactions on Embedded Computing Systems, 6(3), 1-27.

[LAST08] Lee, S., Shin, D., Kim, Y., & Kim, J. (2008), LAST: Locality Aware Sector Translation for NAND Flash Memory Based Storage Systems, ACM SIGOPS Operating Systems Review, 42(6), 36-42.

[EAST08] Kwon, S. J., & Chung, T. (2008), An Efficient and Advanced Space-management Technique for Flash Memory Using Reallocation Blocks, *IEEE Transactions on Consumer Electronics*, 54(2), 631-638.

[KAST09] Cho, H., Shin, D., & Eom, Y. I. (2009), KAST: K-associative sector translation for NAND flash memory in real-time systems. *In Proceedings of Design, Automation, and Test in Europe (DATE)*, 507-512.

[STAFF07] Chung, T. S., & Park, H. S. (2007), STAFF: a Flash Driver Algorithm Minimizing Block Erasures, *Journal of Systems* Architectures, 53(12), 889-901.

[HFTL09] Lee, H., Yun, H., & Lee, D. (2009), HFTL: Hybrid Flash Translation Layer Based on Hot Data Identification for Flash Memory, *IEEE Transactions on Consumer Electronics*, 55(4), 2005-2011.

[WAFTL11] Wei, Q., Gong, B., Pathak, S., Veeravalli, B., Zeng, L., & Okada, K. (2011), WAFTL: A Workload Adaptive Flash Translation Layer with Data Partition, *In Proceedings of 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST)*. [DFTL09] Gupta, A., Kim, Y., & Urgaonkar, B. (2009), DFTL: a Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address mappings, *In Proceedings of the 14th international conference on Architectural support for programming languages and operating systems (ASPLOS)*.

[SFTLII] Jiang, S., Zhang, L., Yuan, X., Hu, H., & Chen, Y. (2011), SFTL: An Efficient Address Translation for Flash Memory by Exploiting Spatial Locality, *In Proceedings of 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST)*. [Boukhobza13] Boukhobza, J. (2013), Flashing in the Cloud: Shedding some Light on NAND Flash Memory Storage System, chapter to appear in Data Intensive Storage Services for Cloud Environment, IGI Global.



### References -2-

[DualPool95] Assar, M., Namazie, S., & Estakhri, P. (1995), Flash Memory Mass Storage Architecture Incorporation Wear Leveling Technique, United States Patent, No 5,479,638.

[Achiwa99] Achiwa, K., Yamamoto, A., & Yamagata, O. (1999), Memory Systems Using a Flash Memory and Method for Controlling the Memory System, United States Patterns, No 5,930,193

[Chang07] Chang, L. (2007), On Efficient Wear Leveling for Large-scale Flash-Memory Storage Systems, In Proceedings of the 2007 ACM Symposium on Applied Computing (SAC).

[Kwon11] Kwon S. J., Ranjitkar, A., Ko, Y., & Chung, T. (2011), FTL Algorithms for NAND-type Flash Memories, Design Automation for Embedded Systems, 15(3-4), 191-224.

[DAC08] Chiang, M., & Chang, R. C. (1999), Cleaning Policies in Mobile Computers Using Flash Memories, Journal of Systems and Software, 48(3), 213-231.

[Kawaguchi95] Kawaguchi, A., Nishioka, S., & Motoda, H. (1995), A Flash Memory Based File System, In Proceedings of the USENIX 1995 Annual Technical Conference (ATC).

[CAT99] Chiang, M., & Chang, R. C. (1999), Cleaning Policies in Mobile Computers Using Flash Memories, *Journal of Systems and Software, 48(3), 213-231.* 

[CFLRU06] Park, S., Jung, D., Kang, J., Kim, J., & Lee, J. (2006), CFLRU: a Replacement Algorithm for Flash Memory, In Proceedings of the 2006 International conference on Compilers, architecture and synthesis for embedded systems (CASES).

[FAB06] Jo, H., Kang, J., Park, S., Kim, J., & Lee. J. (2006), FAB: a Flash-aware Buffer Management Policy for Portable Media Players, IEEE Transactions on Consumer Electronics, 52(2), 485-493.

[BPLRU08] Kim, H., & Ahn, S. (2008), BPLRU: A Buffer Management Scheme for Improving Random Writes in Flash Storage, In Proceedings of the 6<sup>th</sup> USENIX Conference on File and Storage Technologies (FAST).

[C-lash11] Boukhobza, J., Olivier, P., & Rubini, S. (2011), A Cache Management Strategy To Replace Wear Leveling Techniques for Embedded Flash Memory, 2011 International Symposium on Performance Evaluation of Computer & Telecommunication Systems (SPECTS). [CACH-FTL13] Boukhobza, J., Olivier, P., & Rubini, S., A Cache-Aware Configurable Hybrid Flash Translation Layer, To appear in the 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP).

[Park11] Park, S., Kim, Y., Urgaonkar, B., Lee, J., & Seo, E. (2011), A Comprehensive Study on Energy Efficiency of Flash Memory Storages, *Journal of Systems Architecture (JSA)*, 57(4), 354-365.

[Agrawal08] Agrawal, N., Prabhakaran, V., Wobber, T., Davis, J. D., Manasse, M., & Panigrahy, R. (2008), Design Tradeoffs for SSD Performance, *In Proceedings of the USENIX Annual Technical Conference (ATC)*.

[Flashmon11] B., Khetib, I., and Olivier, P., (2011). Flashmon : un outil de trace pour les accès à la mémoire flash NAND, in Proceedings of the Embed With linux Workshop, France, 2011,



#### Price of storage



100000 Average Retail Prices of Storage 0 DRAM/Flash 10000 DRAM Desktop Enterprise Flash Mobile 1" 0 Enterprise HDD 1000 1 " HDD 100 Price \$/GB Aobile HDD 10 32 GB Flash 1 Desktop HDD **GB/15K S** Ed Grochowski 50 GB Desktop ve 0.1 2TB Desktop 3 TB Desktop 0.01 oemprc2011B.prz 1990 1995 2000 2005 2010 2015 Year

{boukhobza, rubini}@univ-brest.fr 15/11/2012



### Price of NAND storage







{boukhobza, rubini}@univ-brest.fr 15/11/2012

### SoC TI OMAP4 Mobil Application Platform

Development of planned features for the Smartphones and Mobile Internet Devices



boukhobza@univ-brest.fr 15/11/2012





### Flash in the boot process

- Flash NOR
  - boot code, executable, configuration data
  - XIP : eXecute In Place capability

#### Flash NAND:

- eMMC, USB mass storage
- Copy the boot code into RAM before being executed
- Pre-Flashing: the code provided by a peripheral is automatically stored into a Flash for the next boots.

B

0



### NAND Flash with NOR interface

#### Samsung OneNAND

Simplest interface (NOR NAND), XIP, prefetch



Lab<mark>-STICC</mark>