Transcript
Near-Data Computation: It's Not (Just) About Performance
Steven Swanson, Non-Volatile Systems Laboratory, Computer Science and Engineering, University of California, San Diego
Solid State Memories
• NAND flash
  – Ubiquitous, cheap
  – Sort of slow, idiosyncratic
• Phase change, spin-torque MRAMs, etc.
  – On the horizon
  – DRAM-like speed
  – DRAM- or flash-like density
[Figure: log-log plot of bandwidth relative to disk vs. 1/latency relative to disk. Points: Hard Drives (2006), PCIe-Flash (2007), PCIe-PCM (2010), PCIe-Flash (2012), PCIe-PCM (2015?), DDR Fast NVM (2016?). Bandwidth has improved 5917x (about 2.4x/yr) and latency 7200x (about 2.4x/yr) relative to disk.]
Programmability in High-Speed Peripherals
• As hardware gets more sophisticated, programmability emerges
  – GPUs: fixed function → programmable shaders → full-blown programmability
  – NICs: fixed function → MAC offloading → TCP offload
• Storage has been left behind
  – It hasn't gotten any smarter in 40 years
Why Near-Data Processing?
[Slide table: motivations for NDP, checked against the benefits they enable (data-dependent accesses, energy-efficient computation, trusted computation, semantic flexibility, and improved real-time capabilities).]

Well-studied:
• Internal bandwidth ≫ interconnect bandwidth
• Moving data takes energy
• NDP compute can be energy efficient

Worth a closer look:
• Storage latencies ≤ interconnect latency
• Storage latencies ≈ CPU latency
• NDP compute can be trusted
Diverse SSD Semantics
• File system offload / OS bypass
  – [Caulfield, ASPLOS 2012]
  – [Zhang, FAST 2012]
• Caching support
  – [Bhaskaran, INFLOW 2013]
  – [Saxena, EuroSys 2012]
• Database transaction support
  – [Coburn, SOSP 2013]
  – [Ouyang, HPCA 2011]
  – [Prabhakaran, OSDI 2008]
• NoSQL offload
  – Samsung's SmartSSD
  – [De, FCCM 2013]
• Sparse storage address space
  – FusionIO DFS

Lessons learned: implementation costs are high, and fit with applications is uncertain.
Our goal: make programming an SSD easy and flexible.
The Software-Defined SSD
The Software-Defined SSD
[Figure: SDSSD architecture. An application on the host issues requests through a generic SSD-App interface and PCIe controller. Inside the device, a PCIe bridge (4 GB/s link) connects to several SSD CPUs, each with its own memory controller and memory banks on 3 GB/s links.]
SDSSD Programming Model
• The host communicates with SDSSD processors via a general RPC interface (sketched below).
• Host CPU
  – Send and receive RPC requests
• SSD CPU
  – Send and receive RPC requests
  – Manage access to NV storage banks
[Figure: host CPU and SDSSD CPU exchange RPCs over a PCIe ring; a data streamer moves data between the SDSSD and host memory.]
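To make the model concrete, here is a minimal sketch of what the host side of such an RPC interface might look like in C. All names (struct sdssd_rpc, sdssd_rpc_call, the field layout) are hypothetical, not the actual SDSSD API:

```c
#include <stdint.h>

/* Hypothetical request descriptor pushed onto the PCIe RPC ring. */
struct sdssd_rpc {
    uint16_t opcode;    /* READ, WRITE, or an SSD-App-defined operation */
    uint16_t app_id;    /* which SSD-App should handle the request */
    uint64_t args[4];   /* opcode-specific arguments */
    uint64_t host_buf;  /* host DMA address for the data streamer */
    uint32_t len;       /* transfer length in bytes */
};

/* Post a request on the ring and block until the SDSSD CPU replies.
 * The SSD-side App registers a handler per opcode (see later sketches). */
int sdssd_rpc_call(struct sdssd_rpc *req, uint64_t *result);
```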
The SDSSD Prototype
• FPGA-based implementation
• DDR2 DRAM emulates PCM
  – Configurable memory latency
  – 48 ns reads, 150 ns writes
  – 64 GB across 8 controllers
• PCIe: 2 GB/s, full duplex
SDSSD Case Studies
• Basic IO
• Caching
• Transaction processing
• NoSQL databases
Basic IO
Normal IO Operations: Base-IO
• The Base-IO App provides read/write functions.
• Just like a normal SSD.
• Applications make system calls to the kernel.
• The kernel block driver issues RPCs (sketched below).
[Figure: a user access enters the kernel module via a syscall (1); the kernel issues read/write RPCs (2) to the SDSSD CPU, which accesses NVM.]
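As a sketch of the Base-IO path under the same hypothetical interface, the kernel block driver might translate one block-layer segment into an RPC roughly like this (the BASEIO_* constants and virt_to_dma() are assumptions; a real Linux block driver involves considerably more boilerplate):

```c
#include <stdint.h>
#include "sdssd_rpc.h"   /* hypothetical header for the sketch above */

/* Translate one block-layer segment into a Base-IO read/write RPC. */
static int baseio_xfer(int is_write, uint64_t lba, void *buf, uint32_t len)
{
    struct sdssd_rpc req = {
        .opcode   = is_write ? BASEIO_WRITE : BASEIO_READ,
        .app_id   = BASEIO_APP_ID,
        .args     = { lba * 512ULL },   /* byte offset into NVM */
        .host_buf = virt_to_dma(buf),   /* buffer for the data streamer */
        .len      = len,
    };
    uint64_t done;
    return sdssd_rpc_call(&req, &done); /* step 2: Read/Write RPC */
}
```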
Faster IO: Direct-Access IO
• The OS installs access permissions on behalf of a process.
• Applications issue RPCs directly to hardware (see the sketch below).
• The App checks permissions.
[Figure: the process requests a userspace channel (1); the kernel installs permissions on the device (2); subsequent direct IO accesses flow from userspace straight to the SDSSD CPU and NVM (3).]
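A sketch of the userspace side, under the same assumptions; sdssd_chan_open() and sdssd_chan_call() are hypothetical channel primitives:

```c
#include <stdint.h>
#include "sdssd_rpc.h"   /* hypothetical, as above */

/* Issue a 4 KB read directly from userspace over a private channel.
 * sdssd_chan_open() asks the kernel to create the channel and install
 * this process's permissions on the device (steps 1-2); after that,
 * no syscalls are needed. */
int direct_read(struct sdssd_chan *chan, uint64_t off, void *buf)
{
    struct sdssd_rpc req = {
        .opcode   = BASEIO_READ,
        .app_id   = BASEIO_APP_ID,
        .args     = { off },
        .host_buf = (uint64_t)(uintptr_t)buf,  /* pre-registered buffer */
        .len      = 4096,
    };
    uint64_t status;
    /* Step 3: the request goes straight to hardware; the App rejects
     * it if this channel lacks permission for the target extent. */
    return sdssd_chan_call(chan, &req, &status);
}
```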
Preliminary Performance Comparison
[Figure: two panels, read and write, plotting bandwidth (MB/s, 0-1800) against request size (0.5 KB to 128 KB) for Base-IO, Direct-IO, and Direct-IO-HW.]
Caching
Caching
• NVMs will be caches before they are storage.
• Cache hits should be as fast as Direct-IO (see the sketch below).
• The Caching App
  – Tracks dirty data
  – Tracks usage statistics
[Figure: the application's file reads and writes go through a userspace cache library; cache hits go directly to the SDSSD, while misses fall through to the kernel's cache manager, file system, and device driver to disk.]
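A sketch of what the cache library's fast path might look like; cache_lookup(), mark_used(), sdssd_direct_read(), and fallback_fs_read() are all hypothetical helpers. The point is that a hit never enters the kernel:

```c
#include <stdint.h>
#include <sys/types.h>

struct cache_ent { uint64_t nvm_off; int dirty; };  /* assumed entry */

/* Cache-library read path: hit goes straight to the SDSSD at
 * Direct-IO speed; miss falls back to the kernel. */
ssize_t cached_read(int fd, uint64_t off, void *buf, size_t len)
{
    struct cache_ent *e = cache_lookup(fd, off);
    if (e) {                    /* cache hit: Direct-IO speed */
        mark_used(e);           /* usage statistics for eviction */
        return sdssd_direct_read(e->nvm_off, buf, len);
    }
    /* Cache miss: the kernel cache manager reads the block from
     * disk and installs it in the SDSSD. */
    return fallback_fs_read(fd, off, buf, len);
}
```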
4KB Cache Hit Latency
Transaction Processing
Atomic Writes
• Write-ahead logging guarantees atomicity.
• New interface for atomic operations (usage sketched below)
  – LogWrite(TID, file, file_off, log, log_off)
• Transaction commit occurs at the memory interface → full memory bandwidth.
[Figure: the logging module inside the SDSSD maintains a transaction table (e.g., TID 15 PENDING, TID 24 FREE, TID 37 COMMITTED) alongside a metadata file, a log file holding the new versions (New A-D), and a data file holding the old versions (Old A-D) until commit.]
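A usage sketch for a two-part atomic update. Only LogWrite appears on the slide; Commit() is an assumed companion call, and the payload pointer/length arguments that the slide's signature elides are elided here too:

```c
/* Stage two updates to data_file in the log, then commit. Each
 * LogWrite records new bytes destined for data_file at the given
 * offset; the data file is untouched until commit, which the SDSSD
 * applies at full memory bandwidth. */
int tid = 37;                                       /* transaction ID */

LogWrite(tid, data_file, off_a, log_file, log_a);   /* stage "New A" */
LogWrite(tid, data_file, off_b, log_file, log_b);   /* stage "New B" */

Commit(tid);  /* transaction table: PENDING -> COMMITTED, then apply;
               * on a crash before this point, old A and B survive */
```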
Atomic Writes
• SDSSD atomic writes outperform a pure software approach by nearly 2.0x.
• SDSSD atomic writes achieve performance comparable to a pure hardware implementation.
[Figure: bandwidth (GB/s, 0-2.0) vs. request size (0.5-128 KB) for AtomicWrite, SDSSD-Tx-Opt, SDSSD-Tx, Moneta-Tx, and Soft-atomic.]
Key-Value Store
Key-Value Store
• A key-value store is a fundamental building block for many enterprise applications.
• Supports simple operations:
  – Get retrieves the value corresponding to a key
  – Put inserts or updates a key-value pair
  – Delete removes a key-value pair
• Open-chaining hash table implementation.
Direct-IO-Based Key-Value Store
[Figure: for Get(K3), the host computes Hash(K3) into an index structure of M buckets and walks the chain of key-value pairs (K1/V1, K2/V2, K3/V3, ...) stored on the SSD, comparing keys on the host; every mismatch costs another I/O across the interconnect before the match is found. The sketch below shows this host-side chain walk.]
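A sketch of that host-side chain walk; the node layout, buckets[] index, hash(), and sdssd_direct_read() are hypothetical:

```c
#include <stdint.h>
#include <string.h>

enum { M = 1024 };               /* number of hash buckets (assumed) */
extern uint64_t buckets[M];      /* bucket heads: NVM offsets */
uint64_t hash(const char *key);  /* any string hash */

struct kv_node { char key[32]; char val[64]; uint64_t next; };

/* Host-side Get over Direct-IO: one device read per chain hop. */
int kv_get_direct(const char *key, char *val_out)
{
    uint64_t off = buckets[hash(key) % M];
    struct kv_node node;
    while (off != 0) {
        sdssd_direct_read(off, &node, sizeof(node)); /* one I/O per hop */
        if (strcmp(node.key, key) == 0) {            /* compare on host */
            memcpy(val_out, node.val, sizeof(node.val));
            return 0;                                /* match */
        }
        off = node.next;              /* mismatch: yet another I/O */
    }
    return -1;                        /* key not found */
}
```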
SDSSD Key-Value Store App
[Figure: for Get(K3), the host sends a single RPC(Get) carrying the hash index and key; the SPU inside the SDSSD walks the chain and compares keys locally, then DMAs only the matching pair (K3, V3) back to the host. A device-side handler sketch follows.]
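And a sketch of the corresponding device-side handler, reusing the kv_node layout from the previous sketch; nvm_read(), dma_to_host(), and rpc_complete() are assumed device-side primitives:

```c
#include <stdint.h>
#include <string.h>
#include "sdssd_rpc.h"   /* hypothetical, from the programming-model sketch */

struct kv_node { char key[32]; char val[64]; uint64_t next; };

/* Key-Value App Get handler on the SDSSD CPU: the chain walk and key
 * comparisons run at internal memory bandwidth; only the matching
 * value crosses PCIe. */
void kv_get_handler(struct sdssd_rpc *req)
{
    uint64_t off = req->args[0];    /* bucket head (hash index) */
    const char *key = (const char *)&req->args[1]; /* key packed into
                                       remaining args (sketch shortcut) */
    struct kv_node node;
    while (off != 0) {
        nvm_read(off, &node, sizeof(node));   /* internal access */
        if (strcmp(node.key, key) == 0) {
            dma_to_host(req->host_buf, node.val, sizeof(node.val));
            rpc_complete(req, 0);             /* match: one DMA back */
            return;
        }
        off = node.next;            /* mismatch: stays on-device */
    }
    rpc_complete(req, -1);          /* key not found */
}
```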
MemcacheDB Performance
[Figure: throughput (million ops/s, up to roughly 0.6) for Direct-IO, Key-Value, and Key-Value-HW: Get and Put on Workload-A and Workload-B, and throughput as a function of average chain length (1-32).]
Ease-of-Use
Conclusion
• New memories argue for near-data processing
  – But not just for bandwidth!
  – Trust, low latency, and flexibility are critical too!
• Semantic flexibility is valuable and viable
  – 15% reduction in performance compared to a fixed-function implementation
  – 6-30x reduction in implementation time
  – Every application could have a custom SSD
Thanks!
Questions?