Transcript
Near-Data Computation: It's Not (Just) About Performance
Steven Swanson, Non-Volatile Systems Laboratory, Computer Science and Engineering, University of California, San Diego
Solid State Memories
• NAND flash
  – Ubiquitous, cheap
  – Sort of slow, idiosyncratic
• Phase change, spin-torque MRAMs, etc.
  – On the horizon
  – DRAM-like speed
  – DRAM- or flash-like density
[Figure: log-log plot of bandwidth relative to disk vs. 1/latency relative to disk. Points: Hard Drives (2006), PCIe-Flash (2007), PCIe-PCM (2010), PCIe-Flash (2012), PCIe-PCM (2015?), DDR Fast NVM (2016?). Bandwidth has improved 5917x (about 2.4x/yr) and latency 7200x (about 2.4x/yr) relative to disk.]
Programmability in High-Speed Peripherals
• As hardware gets more sophisticated, programmability emerges
  – GPUs: fixed function → programmable shaders → full-blown programmability
  – NICs: fixed function → MAC offloading → TCP offload
• Storage has been left behind
  – It hasn't gotten any smarter in 40 years
Why Near-Data Processing?
[Slide table: motivations for NDP, checked against the benefits they enable (data-dependent accesses, energy-efficient computation, trusted computation, semantic flexibility, and improved real-time capabilities).]

Well-studied:
• Internal bandwidth ≫ interconnect bandwidth
• Moving data takes energy
• NDP compute can be energy efficient

Worth a closer look:
• Storage latencies ≤ interconnect latency
• Storage latencies ≈ CPU latency
• NDP compute can be trusted
Diverse SSD Semantics
• File system offload / OS bypass
  – [Caulfield, ASPLOS 2012]
  – [Zhang, FAST 2012]
• Caching support
  – [Bhaskaran, INFLOW 2013]
  – [Saxena, EuroSys 2012]
• Database transaction support
  – [Coburn, SOSP 2013]
  – [Ouyang, HPCA 2011]
  – [Prabhakaran, OSDI 2008]
• NoSQL offload
  – Samsung's SmartSSD
  – [De, FCCM 2013]
• Sparse storage address space
  – FusionIO DFS

Lessons learned: implementation costs are high, and fit with applications is uncertain.
Our goal: make programming an SSD easy and flexible.
The Software-Defined SSD
The Software-Defined SSD
[Figure: SDSSD architecture. An application on the host issues requests through a generic SSD-App interface and PCIe controller. Inside the device, a PCIe bridge (4 GB/s link) connects to several SSD CPUs, each with its own memory controller and memory banks on 3 GB/s links.]
SDSSD Programming Model
• The host communicates with SDSSD processors via a general RPC interface (sketched below).
• Host CPU
  – Send and receive RPC requests
• SSD CPU
  – Send and receive RPC requests
  – Manage access to NV storage banks
[Figure: host CPU and SDSSD CPU exchange RPCs over a PCIe ring; a data streamer moves data between the SDSSD and host memory.]
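To make the model concrete, here is a minimal sketch of what the host side of such an RPC interface might look like in C. All names (struct sdssd_rpc, sdssd_rpc_call, the field layout) are hypothetical, not the actual SDSSD API:

```c
#include <stdint.h>

/* Hypothetical request descriptor pushed onto the PCIe RPC ring. */
struct sdssd_rpc {
    uint16_t opcode;    /* READ, WRITE, or an SSD-App-defined operation */
    uint16_t app_id;    /* which SSD-App should handle the request */
    uint64_t args[4];   /* opcode-specific arguments */
    uint64_t host_buf;  /* host DMA address for the data streamer */
    uint32_t len;       /* transfer length in bytes */
};

/* Post a request on the ring and block until the SDSSD CPU replies.
 * The SSD-side App registers a handler per opcode (see later sketches). */
int sdssd_rpc_call(struct sdssd_rpc *req, uint64_t *result);
```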
The SDSSD Prototype
• FPGA-based implementation
• DDR2 DRAM emulates PCM
  – Configurable memory latency
  – 48 ns reads, 150 ns writes
  – 64 GB across 8 controllers
• PCIe: 2 GB/s, full duplex
SDSSD Case Studies
• Basic IO
• Caching
• Transaction processing
• NoSQL databases
Basic IO
Normal IO Operations: Base-IO
• The Base-IO App provides read/write functions.
• Just like a normal SSD.
• Applications make system calls to the kernel.
• The kernel block driver issues RPCs (sketched below).
[Figure: a user access enters the kernel module via a syscall (1); the kernel issues read/write RPCs (2) to the SDSSD CPU, which accesses NVM.]
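As a sketch of the Base-IO path under the same hypothetical interface, the kernel block driver might translate one block-layer segment into an RPC roughly like this (the BASEIO_* constants and virt_to_dma() are assumptions; a real Linux block driver involves considerably more boilerplate):

```c
#include <stdint.h>
#include "sdssd_rpc.h"   /* hypothetical header for the sketch above */

/* Translate one block-layer segment into a Base-IO read/write RPC. */
static int baseio_xfer(int is_write, uint64_t lba, void *buf, uint32_t len)
{
    struct sdssd_rpc req = {
        .opcode   = is_write ? BASEIO_WRITE : BASEIO_READ,
        .app_id   = BASEIO_APP_ID,
        .args     = { lba * 512ULL },   /* byte offset into NVM */
        .host_buf = virt_to_dma(buf),   /* buffer for the data streamer */
        .len      = len,
    };
    uint64_t done;
    return sdssd_rpc_call(&req, &done); /* step 2: Read/Write RPC */
}
```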
Faster IO: Direct-Access IO
• The OS installs access permissions on behalf of a process.
• Applications issue RPCs directly to hardware (see the sketch below).
• The App checks permissions.
[Figure: the process requests a userspace channel (1); the kernel installs permissions on the device (2); subsequent direct IO accesses flow from userspace straight to the SDSSD CPU and NVM (3).]
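A sketch of the userspace side, under the same assumptions; sdssd_chan_open() and sdssd_chan_call() are hypothetical channel primitives:

```c
#include <stdint.h>
#include "sdssd_rpc.h"   /* hypothetical, as above */

/* Issue a 4 KB read directly from userspace over a private channel.
 * sdssd_chan_open() asks the kernel to create the channel and install
 * this process's permissions on the device (steps 1-2); after that,
 * no syscalls are needed. */
int direct_read(struct sdssd_chan *chan, uint64_t off, void *buf)
{
    struct sdssd_rpc req = {
        .opcode   = BASEIO_READ,
        .app_id   = BASEIO_APP_ID,
        .args     = { off },
        .host_buf = (uint64_t)(uintptr_t)buf,  /* pre-registered buffer */
        .len      = 4096,
    };
    uint64_t status;
    /* Step 3: the request goes straight to hardware; the App rejects
     * it if this channel lacks permission for the target extent. */
    return sdssd_chan_call(chan, &req, &status);
}
```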
Preliminary Performance Comparison
[Figure: two panels, read and write, plotting bandwidth (MB/s, 0-1800) against request size (0.5 KB to 128 KB) for Base-IO, Direct-IO, and Direct-IO-HW.]
Caching
Caching
• NVMs will be caches before they are storage.
• Cache hits should be as fast as Direct-IO (see the sketch below).
• The Caching App
  – Tracks dirty data
  – Tracks usage statistics
[Figure: the application's file reads and writes go through a userspace cache library; cache hits go directly to the SDSSD, while misses fall through to the kernel's cache manager, file system, and device driver to disk.]
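A sketch of what the cache library's fast path might look like; cache_lookup(), mark_used(), sdssd_direct_read(), and fallback_fs_read() are all hypothetical helpers. The point is that a hit never enters the kernel:

```c
#include <stdint.h>
#include <sys/types.h>

struct cache_ent { uint64_t nvm_off; int dirty; };  /* assumed entry */

/* Cache-library read path: hit goes straight to the SDSSD at
 * Direct-IO speed; miss falls back to the kernel. */
ssize_t cached_read(int fd, uint64_t off, void *buf, size_t len)
{
    struct cache_ent *e = cache_lookup(fd, off);
    if (e) {                    /* cache hit: Direct-IO speed */
        mark_used(e);           /* usage statistics for eviction */
        return sdssd_direct_read(e->nvm_off, buf, len);
    }
    /* Cache miss: the kernel cache manager reads the block from
     * disk and installs it in the SDSSD. */
    return fallback_fs_read(fd, off, buf, len);
}
```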
4KB Cache Hit Latency
Transaction Processing
Atomic Writes
• Write-ahead logging guarantees atomicity.
• New interface for atomic operations (usage sketched below)
  – LogWrite(TID, file, file_off, log, log_off)
• Transaction commit occurs at the memory interface → full memory bandwidth.
[Figure: the logging module inside the SDSSD maintains a transaction table (e.g., TID 15 PENDING, TID 24 FREE, TID 37 COMMITTED) alongside a metadata file, a log file holding the new versions (New A-D), and a data file holding the old versions (Old A-D) until commit.]
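A usage sketch for a two-part atomic update. Only LogWrite appears on the slide; Commit() is an assumed companion call, and the payload pointer/length arguments that the slide's signature elides are elided here too:

```c
/* Stage two updates to data_file in the log, then commit. Each
 * LogWrite records new bytes destined for data_file at the given
 * offset; the data file is untouched until commit, which the SDSSD
 * applies at full memory bandwidth. */
int tid = 37;                                       /* transaction ID */

LogWrite(tid, data_file, off_a, log_file, log_a);   /* stage "New A" */
LogWrite(tid, data_file, off_b, log_file, log_b);   /* stage "New B" */

Commit(tid);  /* transaction table: PENDING -> COMMITTED, then apply;
               * on a crash before this point, old A and B survive */
```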
Atomic Writes
• SDSSD atomic writes outperform a pure software approach by nearly 2.0x.
• SDSSD atomic writes achieve performance comparable to a pure hardware implementation.
[Figure: bandwidth (GB/s, 0-2.0) vs. request size (0.5-128 KB) for AtomicWrite, SDSSD-Tx-Opt, SDSSD-Tx, Moneta-Tx, and Soft-atomic.]
Key-Value Store
Key-Value Store
• A key-value store is a fundamental building block for many enterprise applications.
• Supports simple operations:
  – Get retrieves the value corresponding to a key
  – Put inserts or updates a key-value pair
  – Delete removes a key-value pair
• Open-chaining hash table implementation.
Direct-IO-Based Key-Value Store
[Figure: for Get(K3), the host computes Hash(K3) into an index structure of M buckets and walks the chain of key-value pairs (K1/V1, K2/V2, K3/V3, ...) stored on the SSD, comparing keys on the host; every mismatch costs another I/O across the interconnect before the match is found. The sketch below shows this host-side chain walk.]
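A sketch of that host-side chain walk; the node layout, buckets[] index, hash(), and sdssd_direct_read() are hypothetical:

```c
#include <stdint.h>
#include <string.h>

enum { M = 1024 };               /* number of hash buckets (assumed) */
extern uint64_t buckets[M];      /* bucket heads: NVM offsets */
uint64_t hash(const char *key);  /* any string hash */

struct kv_node { char key[32]; char val[64]; uint64_t next; };

/* Host-side Get over Direct-IO: one device read per chain hop. */
int kv_get_direct(const char *key, char *val_out)
{
    uint64_t off = buckets[hash(key) % M];
    struct kv_node node;
    while (off != 0) {
        sdssd_direct_read(off, &node, sizeof(node)); /* one I/O per hop */
        if (strcmp(node.key, key) == 0) {            /* compare on host */
            memcpy(val_out, node.val, sizeof(node.val));
            return 0;                                /* match */
        }
        off = node.next;              /* mismatch: yet another I/O */
    }
    return -1;                        /* key not found */
}
```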
SDSSD Key-Value Store App
[Figure: for Get(K3), the host sends a single RPC(Get) carrying the hash index and key; the SPU inside the SDSSD walks the chain and compares keys locally, then DMAs only the matching pair (K3, V3) back to the host. A device-side handler sketch follows.]
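And a sketch of the corresponding device-side handler, reusing the kv_node layout from the previous sketch; nvm_read(), dma_to_host(), and rpc_complete() are assumed device-side primitives:

```c
#include <stdint.h>
#include <string.h>
#include "sdssd_rpc.h"   /* hypothetical, from the programming-model sketch */

struct kv_node { char key[32]; char val[64]; uint64_t next; };

/* Key-Value App Get handler on the SDSSD CPU: the chain walk and key
 * comparisons run at internal memory bandwidth; only the matching
 * value crosses PCIe. */
void kv_get_handler(struct sdssd_rpc *req)
{
    uint64_t off = req->args[0];    /* bucket head (hash index) */
    const char *key = (const char *)&req->args[1]; /* key packed into
                                       remaining args (sketch shortcut) */
    struct kv_node node;
    while (off != 0) {
        nvm_read(off, &node, sizeof(node));   /* internal access */
        if (strcmp(node.key, key) == 0) {
            dma_to_host(req->host_buf, node.val, sizeof(node.val));
            rpc_complete(req, 0);             /* match: one DMA back */
            return;
        }
        off = node.next;            /* mismatch: stays on-device */
    }
    rpc_complete(req, -1);          /* key not found */
}
```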
MemcacheDB Performance
[Figure: throughput (million ops/s, up to roughly 0.6) for Direct-IO, Key-Value, and Key-Value-HW: Get and Put on Workload-A and Workload-B, and throughput as a function of average chain length (1-32).]
Ease-of-Use
Conclusion
• New memories argue for near-data processing
  – But not just for bandwidth!
  – Trust, low latency, and flexibility are critical too!
• Semantic flexibility is valuable and viable
  – 15% reduction in performance compared to a fixed-function implementation
  – 6-30x reduction in implementation time
  – Every application could have a custom SSD
Thanks!
Questions?