Transcript

Near-Data Computation: It's Not (Just) About Performance
Steven Swanson
Non-Volatile Systems Laboratory
Computer Science and Engineering
University of California, San Diego

Solid State Memories
•  NAND flash
   –  Ubiquitous, cheap
   –  Sort of slow, idiosyncratic
•  Phase change, spin-torque MRAMs, etc.
   –  On the horizon
   –  DRAM-like speed
   –  DRAM- or flash-like density

[Chart: bandwidth relative to disk vs. 1/latency relative to disk, on log-log axes, for Hard Drives (2006), PCIe-Flash (2007), PCIe-Flash (2012), PCIe-PCM (2010), PCIe-PCM (2015?), and DDR Fast NVM (2016?). Bandwidth has improved by roughly 5917x and latency by roughly 7200x, both about 2.4x per year.]

Programmability in High-Speed Peripherals
•  As hardware gets more sophisticated, programmability emerges
   –  GPUs: fixed function → programmable shaders → full-blown programmability
   –  NICs: fixed function → MAC offloading → TCP offload
•  Storage has been left behind
   –  It hasn't gotten any faster in 40 years

Why Near-Data Processing?
•  Well-studied (data-dependent accesses):
   –  Internal bandwidth ≫ interconnect bandwidth
   –  Moving data takes energy; NDP compute can be energy efficient
•  Worth a closer look (trusted computation, semantic flexibility):
   –  Storage latencies ≤ interconnect latency
   –  Storage latencies ≈ CPU latency (this one also helps data-dependent accesses)
   –  NDP compute can be trusted
   –  Improved real-time capabilities

Diverse SSD Semantics
•  File system offload / OS bypass
   –  [Caulfield, ASPLOS 2012]
   –  [Zhang, FAST 2012]
•  Caching support
   –  [Bhaskaran, INFLOW 2013]
   –  [Saxena, EuroSys 2012]
•  Database transactions support
   –  [Coburn, SOSP 2013]
   –  [Ouyang, HPCA 2011]
   –  [Prabhakaran, OSDI 2008]
•  NoSQL offload
   –  Samsung's SmartSSD
   –  [De, FCCM 2013]
•  Sparse storage address space
   –  FusionIO DFS
Lessons learned: implementation costs are high, and fit with applications is uncertain.
Our goal: make programming an SSD easy and flexible.

The Software-Defined SSD
[Architecture diagram: an application on the host machine uses a generic SSDApp interface; a PCIe controller and PCIe bridge connect the host to the SDSSD, which contains several SSD CPUs, each paired with a memory controller and its own memory banks (links labeled 3 GB/s and 4 GB/s).]

SDSSD Programming Model
•  The host communicates with SDSSD processors via a general RPC interface.
•  Host CPU
   –  Send and receive RPC requests
•  SSD CPU
   –  Send and receive RPC requests
   –  Manage access to NV storage banks
[Diagram: the host CPU and host memory connect over PCIe to an RPC ring, the SDSSD CPU, and a data streamer.]

The SDSSD Prototype
•  FPGA-based implementation
•  DDR2 DRAM emulates PCM
   –  Configurable memory latency
   –  48 ns reads, 150 ns writes
   –  64 GB across 8 controllers
•  PCIe: 2 GB/s, full duplex

SDSSD Case Studies
•  Basic IO
•  Caching
•  Transaction processing
•  NoSQL databases

Basic IO

Normal IO Operations: Base-IO
•  The Base-IO App provides read/write functions.
•  Just like a normal SSD.
•  Applications make system calls to the kernel.
•  The kernel block driver issues RPCs.
[Diagram: (1) a system call carries user accesses into the kernel module; (2) the kernel module issues read/write RPCs to the SDSSD CPU, which accesses the NVM.]
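The slides describe the Base-IO path only at the block level, so the following is a minimal host-side sketch of what a read issued as an RPC might look like. The request layout, opcode values, and the sdssd_rpc_call() transport are assumptions for illustration; the slides only state that the kernel block driver turns each request into an RPC to the SDSSD CPU.

```c
/*
 * Hypothetical host-side view of the Base-IO read/write RPCs.
 * The opcode values, request layout, and sdssd_rpc_call() are
 * assumptions; the real SDSSD ring/DMA protocol is not shown
 * in the slides.
 */
#include <stdint.h>
#include <stdio.h>

enum sdssd_op { SDSSD_READ = 1, SDSSD_WRITE = 2 };

struct sdssd_rpc {
    uint16_t opcode;       /* SDSSD_READ or SDSSD_WRITE             */
    uint16_t app_id;       /* which SSD App handles this request    */
    uint64_t storage_off;  /* byte offset into the NV storage banks */
    uint64_t host_addr;    /* DMA address of the host buffer        */
    uint32_t len;          /* transfer length in bytes              */
};

/* Stand-in for the real transport: enqueue on the PCIe RPC ring and
 * wait for the SDSSD CPU's completion.  Here it only logs the call. */
static int sdssd_rpc_call(const struct sdssd_rpc *req)
{
    printf("RPC op=%u app=%u off=%llu len=%u\n", req->opcode, req->app_id,
           (unsigned long long)req->storage_off, req->len);
    return 0;
}

/* What the kernel block driver would do for each read request. */
int baseio_read(uint16_t app, uint64_t off, uint64_t dma_addr, uint32_t len)
{
    struct sdssd_rpc req = {
        .opcode = SDSSD_READ, .app_id = app,
        .storage_off = off, .host_addr = dma_addr, .len = len,
    };
    return sdssd_rpc_call(&req);
}

int main(void) { return baseio_read(0, 4096, 0, 4096); }
```

Under the Direct-IO path described on the next slide, an application holding a userspace channel with installed permissions would issue essentially the same RPC directly to the hardware, skipping the system call.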
Faster IO: Direct Access IO
•  The OS installs access permissions on behalf of a process.
•  Applications issue RPCs directly to the hardware.
•  The App checks permissions.
[Diagram: (1) the application requests a userspace channel; (2) the kernel installs permissions; (3) user accesses become direct IO over the userspace channel to the SDSSD CPU and NVM.]

Preliminary Performance Comparison
[Charts: read and write bandwidth (MB/s) vs. request size (0.5 KB to 128 KB) for Base-IO, Direct-IO, and Direct-IO-HW.]

Caching

Caching
•  NVMs will be caches before they are storage.
•  Cache hits should be as fast as Direct-IO.
•  The Caching App
   –  Tracks dirty data
   –  Tracks usage statistics
[Diagram: the application and file system access files through a cache library in userspace; system reads/writes go through the kernel, where cache hits are served by the SDSSD Caching App and cache misses go through the cache manager and device driver to disk.]

4KB Cache Hit Latency
[Chart.]

Transaction Processing

Atomic Writes
•  Write-ahead logging guarantees atomicity.
•  New interface for atomic operations:
   –  LogWrite(TID, file, file_off, log, log_off) (a usage sketch follows the key-value store slides below)
•  Transaction commit occurs at the memory interface → full memory bandwidth.
[Diagram: a logging module inside the SDSSD keeps a transaction table (e.g., TID 15 PENDING, TID 24 FREE, TID 37 COMMITTED) plus a metadata file, a log file, and a data file holding new and old versions of blocks A-D.]

Atomic Writes
•  SDSSD atomic writes outperform a pure software approach by nearly 2.0x.
•  SDSSD atomic writes achieve performance comparable to a pure hardware implementation.
[Chart: bandwidth (GB/s) vs. request size (0.5 KB to 128 KB) for AtomicWrite, SDSSD-Tx-Opt, SDSSD-Tx, Moneta-Tx, and Soft-atomic.]

Key-Value Store
•  A key-value store is a fundamental building block for many enterprise applications.
•  Supports simple operations:
   –  Get retrieves the value corresponding to a key
   –  Put inserts or updates a key-value pair
   –  Delete removes a key-value pair
•  Open-chaining hash table implementation.

Direct-IO Based Key-Value Store
[Diagram: with Direct-IO, the host hashes the key (e.g., Get K3) into its index structure, then issues I/O requests to read key-value pairs from the SSD and compares keys on the host, following the chain on each mismatch until it finds a match.]

SDSSD Key-Value Store App
[Diagram: the host hashes the key and sends RPC(Get) with the hash index and key K3 to the SDSSD SPU; the SPU walks the chain in the key-value data, compares keys on the device, and DMAs the matching value (V3) back to the host.]
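Returning to the atomic-write interface from the transaction-processing slides above: the deck gives the call as LogWrite(TID, file, file_off, log, log_off) with no further detail, so the sketch below fills in a plausible usage. The buffer and length arguments, the Commit()/Abort() calls, and the stub bodies are assumptions; the slide only shows a transaction table whose entries move between FREE, PENDING, and COMMITTED.

```c
/*
 * Usage sketch for the SDSSD atomic-write interface.  LogWrite's first
 * five arguments come from the slide; the buffer/length arguments and
 * Commit()/Abort() are assumptions.  The bodies are placeholders so the
 * file runs; in the real system they would be RPCs into the SDSSD
 * logging module.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint32_t tid_t;

int LogWrite(tid_t tid, int file, uint64_t file_off,
             int log, uint64_t log_off, const void *buf, uint32_t len)
{
    (void)buf;
    printf("LogWrite tid=%u file=%d@%llu log=%d@%llu len=%u\n", tid, file,
           (unsigned long long)file_off, log, (unsigned long long)log_off, len);
    return 0;
}
int Commit(tid_t tid) { printf("Commit tid=%u\n", tid); return 0; }
int Abort(tid_t tid)  { printf("Abort tid=%u\n", tid);  return 0; }

/* Example: atomically update two blocks of a data file. */
int update_pair(tid_t tid, int data_file, int log_file,
                const void *blk_a, const void *blk_b)
{
    if (LogWrite(tid, data_file, 0 * 4096, log_file, 0 * 4096, blk_a, 4096) ||
        LogWrite(tid, data_file, 7 * 4096, log_file, 1 * 4096, blk_b, 4096))
        return Abort(tid);
    return Commit(tid);   /* both blocks become durable, or neither does */
}

int main(void)
{
    char a[4096] = {0}, b[4096] = {0};
    return update_pair(15, /*data_file=*/3, /*log_file=*/4, a, b);
}
```

Because commit happens at the SDSSD's memory interface, the log is applied at internal memory bandwidth rather than being replayed across PCIe, which is the point the slide's "full memory bandwidth" note is making.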
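As a rough illustration of the SDSSD key-value App just described, here is a device-side sketch of the Get path: the host ships the hash-bucket index and key in an RPC, the SPU walks the bucket's chain in NVM, compares keys locally, and DMAs only the matching value back. The chain-node layout and the nvm_read()/dma_to_host() helpers are assumptions; here NVM is simulated with a plain buffer so the example runs.

```c
/*
 * Sketch of the SDSSD key-value App's Get handler.  The node layout and
 * the device helpers are assumptions for illustration; NVM is simulated
 * with an in-memory buffer.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define KEY_MAX 24

struct kv_node {                  /* one chain entry as stored in NVM     */
    uint64_t next;                /* NVM offset of the next node, 0 = end */
    uint32_t key_len, val_len;
    char     key[KEY_MAX];
    /* value bytes follow the header at off + sizeof(struct kv_node)      */
};

/* Stand-ins for the device-side helpers (here NVM is a plain buffer). */
static uint8_t nvm[4096];
static void nvm_read(uint64_t off, void *dst, uint32_t len)
{ memcpy(dst, nvm + off, len); }
static void dma_to_host(void *host_buf, uint64_t off, uint32_t len)
{ memcpy(host_buf, nvm + off, len); }

/* Walk one bucket's chain; on a hit, DMA only the value to the host. */
int kv_get(uint64_t bucket_head, const char *key, uint32_t key_len,
           void *host_buf)
{
    struct kv_node node;
    for (uint64_t off = bucket_head; off != 0; off = node.next) {
        nvm_read(off, &node, sizeof(node));
        if (node.key_len == key_len && memcmp(node.key, key, key_len) == 0) {
            dma_to_host(host_buf, off + sizeof(node), node.val_len);
            return (int)node.val_len;
        }
    }
    return -1;    /* miss: the host receives a single "not found" reply */
}

int main(void)
{
    /* Build a one-node chain at NVM offset 64 holding K3 -> V3. */
    struct kv_node n = { .next = 0, .key_len = 2, .val_len = 2 };
    memcpy(n.key, "K3", 2);
    memcpy(nvm + 64, &n, sizeof(n));
    memcpy(nvm + 64 + sizeof(n), "V3", 2);

    char val[16] = {0};
    int len = kv_get(64, "K3", 2, val);
    printf("Get K3 -> %d bytes: %.2s\n", len, val);
    return 0;
}
```

The win over the Direct-IO version is that chain traversal and key comparison stay inside the SSD, so only the final value crosses PCIe regardless of chain length, which is what the MemcacheDB results on the next slide vary.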
MemcacheDB Performance
[Charts: throughput (million ops/s) for Direct-IO, Key-Value, and Key-Value-HW on Workload-A and Workload-B Gets and Puts, and as a function of average chain length (1 to 32).]

Ease-of-Use

Conclusion
•  New memories argue for near-data processing
   –  But not just for bandwidth!
   –  Trust, low latency, and flexibility are critical too!
•  Semantic flexibility is valuable and viable
   –  15% reduction in performance compared to a fixed-function implementation
   –  6-30x reduction in implementation time
   –  Every application could have a custom SSD

Thanks!

Questions?