Distributed Parity V Raid6 For Lustre

Date

October 2018
Transcript

Providing Australian researchers with world-class computing services

Lustre Community Conference 2015
Distributed Parity v RAID6 for Lustre
Daniel Rodwell
Manager, Data Storage Services
August 2015
W nci.org.au  @NCInews

Agenda
•  Overview
   –  DDP vs RAID6 quick overview
   –  Why would you use DDP?
   –  RAID6 Rebuild and Applications
•  DDP Test Platform at NCI
   –  Gdata3 built as DDP
   –  Testing and Results
•  RAID 6
   –  Gdata3 built as RAID6
   –  Testing and Results
•  Observations and Recommendations
   –  Lustre and DDP

Agenda
•  Disclaimer
   –  These are our observations to date
   –  Not authoritative, nor an IEEE paper
   –  Other factors in design decisions not listed here
   –  Choose your own adventure
      •  Your mileage may vary
      •  Need to select an architecture that works for your site

DDP & RAID6 Overview
Redundant Array of Inoperative Drives?

Traditional RAID 6 summary
–  Typically deployed as “8+2” RAID 6
–  De facto standard in most deployments
–  8 “data drives” equivalent of usable data
–  2 “parity drives” equivalent of P+Q parity data
–  Typically constructed as 8x 128K chunks, for a 1024K (1MiB) stripe
–  Tolerates failure of 2 drives with no loss of user data
–  Typically no performance penalty for read; write requires parity calculation of P and Q parity blocks. Modern RAID controllers can offer RAID6 performance similar to or better than RAID10 with a like number of drives.
(image: Wikipedia)
–  Loss of 1 drive = ‘degraded state’
   •  medium-high risk
–  Loss of 2 drives = ‘no redundancy state’
   •  very high risk
–  Loss of 3 drives = ‘pool offline’
   •  typically fatal
(image: NCI Whiteboard, unofficial)

The problem with RAID today
–  As disk sizes increase, RAID rebuild times become problematic. 20+ hours for a single disk rebuild in a RAID6 set under normal workload conditions for a 4TB drive.
–  Overall Lustre volume performance is degraded while this occurs. 1 RAID 6 pool = 1 OST.
–  1 slow OST can degrade the Lustre volume; the amount varies depending on scenario (particularly if there is a small number of OSTs and file striping across OSTs is in use).
–  RAID controller overhead can impact the performance of other pools on the same controller which are not in a rebuild state, adding to the degradation.
–  Risk of loss of a second disk in the RAID 6 pool during rebuild (typically use 8+2 R6).
–  If a pool enters a no-redundancy state (i.e. loss of 2 drives in a RAID6 pool), HPC operations are suspended while the rebuild occurs, due to increased risk.

Rebuild impact on application
–  Example application: IOR (simulates HPC I/O workload)
–  Plot of 198 RAID6 OSTs. Poorly performing OSTs identified
–  Slow drive replaced. RAID6 rebuild time = 17h 14m
(plot annotations: application run duration; application run duration extended)

DDP
–  DDP: Dynamic Disk Pools with highly distributed parity. Many drives involved in rebuild. Sparing capacity built in to pool.
–  DDP rebuilds in 1-3 hour range (can be faster).
–  Instantaneous performance (Stream) is typically rated lower for DDP. Can be higher for Random workloads than RAID5 or RAID6.
–  DDP average performance when measured over months or years is higher than RAID 5 or 6.
–  RAID rebuild time will become more problematic with 6 & 8TB drives. Drive performance (130-150MB/sec) generally not increasing.
–  Investigate DDP for Lustre now to assist with future architectures and planning.
(image: NetApp)

DDP Summary
–  DDP typically has more drives in a pool; minimum is DDP-11 (11 drives).
–  D-Piece: contiguous 512MB section within disk, constructed from 4096 128K segments
–  D-Stripe: 10x D-Pieces. D-Pieces are selected from some (not all) drives in the pool using pseudo-random selection (ensuring pattern is not repeated)
–  A D-Stripe is effectively a RAID6 group sitting within a larger pool of drives (8 data and 2 parity chunks, aka Pieces).
   A D-Stripe has a usable capacity of 8x D-Pieces (8x 512MB, 4096MB)
–  A Volume within a Dynamic Disk Pool is created by constructing N x D-Stripes.
(image: NetApp)
Note: 10x D-Pieces represented by a colour, with 12 drives in Dynamic Disk Pool. For each D-Stripe, only 10 drives are used, 2 are not used. Sparing (preservation) capacity = 2

DDP Summary
–  The loss of a single drive affects a number of D-Stripes that have D-Pieces on the lost drive.
–  DDP has a notion of sparing capacity. This spare capacity is used during the reconstruction to reconstruct and rebalance the D-Pieces. DDP-11 has a sparing capacity of 1.
–  Entire pool participates in the reconstruction, with a larger number of drives both reading and writing during the reconstruct
–  cf. RAID 5 or 6, where only 1 drive is effectively writing as the new member of the pool. Rebuild performance is limited to the write performance of an individual drive (7.2K NL-SAS = ~160MB/sec)
(image: NetApp)
–  Only the required D-Pieces are read for the affected D-Stripes during reconstruction, not entire drives.

DDP at scale for Lustre
Dynamic Disk Pools

Lustre Design
•  Preferable OST Characteristics
   –  OST performance is typically determined at initial build by choice of disk array technology (choose carefully if adding incrementally over multiple years).
   –  Performance of all OSTs (and OSSes) in the filesystem should be very similar.
   –  Mixed OST sizes and/or performance will result in hotspotting and inconsistent read/write performance as files are striped across OSTs or allocated in a round-robin / stride.
   –  Design building block for your workload: controller to disk to IOPS ratios need to be considered.
   –  Mixed 1MB Stream and Random 4K IO workload. Lustre uses 1MB transfers as default (optimise RAID config for 1MB stripe size).
   (diagram labels: OSS HA Pair; Object Storage Servers (OSS); Object Storage Targets (OST))
   –  More small OSTs preferable to few very large OSTs.
      •  Loss of a single 230TB OST = a lot of data gone
      •  A very large OST (230TB) will take a long time to e2fsck.
      •  Many smaller OSTs can be e2fsck’d in parallel
      •  Each OST mapping on client requires some memory, so fewer are better
      •  Smaller OSTs can fill quickly with few large files if striping not set by user or default
      •  Balancing act.

Design – gdata3 Object Storage
Gdata 3 Object Storage Building Blocks
1x Building Block
•  2x Fujitsu RX2530-M1
•  1x E5660 60 Disk controller shelf
•  2x DE6600 60 Disk expansion shelf

Design – gdata3 Object Storage
Gdata 3 Object Storage Building Blocks
•  OST storage for Gdata3 is built using NetApp E5660, with 8x OSS-OST 12Gbit SAS interconnects
•  Array (OST)
   –  180 x 4TB NL-SAS, 7.2K
   –  Dual 12G SAS Controllers
   –  2x 90 Disk DDPs
   –  8 volume slices per 90 Disk DDP
•  Hosts (MDS)
   –  2x Servers as High Availability pair
   –  1RU Fujitsu RX2530-M1’s each with
      •  2x Intel Xeon E5-2640v3 ‘Haswell’
      •  8 Core, 16 Hyperthread, 2.6GHz Base, 3.4GHz Turbo Boost max
      •  256GB DDR4 RDIMM
      •  Single Port FDR connection to Fabric
      •  Quad Port 6G SAS connection to E5660
(diagram labels: 6Gbit SAS; CtrlA x2; CtrlB x2; OSS 1; OSS 2; OSTs)

Design – gdata3 Object Storage
Gdata 3 Object Storage Building Blocks
(diagram: OSS A and OSS B as a High Availability Pair on the FDR IB Fabric, with SAS Array-Host Connections; 90 Disk DDP – Controller A and 90 Disk DDP – Controller B; 8x 30TB Volume Slices per 90D DDP; 30TB Volume Slice = 1x OST; 180x 4TB NL-SAS)

Design – gdata3 Object Storage
Gdata 3 DDP configuration
(diagram: OSS A and OSS B as a High Availability Pair on the FDR IB Fabric; 8x OSTs to OSS A from 90 Disk DDP – Controller A; 8x OSTs to OSS B from 90 Disk DDP – Controller B; 8x 30TB Volume Slices per 90D DDP; 30TB Volume Slice = 1x OST; 180x 4TB NL-SAS)
Building block capacity = 16x 30TB OSTs = 480TB, + 6 DDP spares per 90 disk pool

DDP Performance – Initial Testing
–  DDP can traditionally perform slightly lower under peak stream workload conditions.
–  Need to evaluate impact of rebuild time versus slightly lower stream performance.
–  Interim benchmark for E5600 controllers looks very promising
–  1x Building block = 180 disks, 2x 90D DDP, 8x slice per DDP with fully balanced SAS/Multipath/Controller config
–  But… is there really a free lunch?
(benchmark results: 6.26 GB/sec Write test; 9.19 GB/sec Read test; 8x 1MB block size streams per Pool, driven by OSS HA pair)

DDP Performance – Extended Testing
–  Lots of testing in front of screens like this

DDP Performance – Extended Testing
–  And this

DDP Performance – Extended Testing
–  And lots of Excel worksheets
–  A/B type testing, comparing performance as load is scaled on various components such as array controllers, disk pool, OSS.
–  Something appears to be not quite right with read performance.
–  Read performance should be higher than write on E-series platform.
–  Reads are slower than writes, from light to heavy workloads.
–  Read performance suffering as load increased, but load well within max rated MB/sec of controller

DDP Performance – Extended Testing
–  Disk latency has widely varying numbers; instantaneous read performance varies across run
   •  Er, not so good there.
(screenshot values: 4.9 ms current latency; 243 ms max latency)

Scaling Read and Write workload, DDP
•  Series of tests, progressively loading up contention on a DDP-90 pool or an E5600 controller to determine what is going on

Scaling Read and Write workload, DDP
•  DDP-90 Writers and Readers contending on Controller A
•  1 Reader = Single reader on entire array
•  2 readers = 1 reader on Pool 21, 1 on Pool 22.
(chart: Controller Contention test)
Note: At 4 Readers or Writers (all same controller, 2 in each pool), performance falls away rapidly

Scaling Read and Write workload, DDP
•  DDP-90 Writers and Readers contending on same pool
•  1 Reader = Single reader on entire array
•  2 readers = 1 reader on Controller A, 1 on Controller B.
(chart: Pool Contention test)
Note: Read performance is poor where multiple readers contend on same 90 disk pool. Is not controller contention.

Revert to RAID
•  A lot of tuning later, no appreciable difference in performance achieved
•  DDP constructed filesystem can hit peak performance, but cannot sustain consistent, well averaged / tight performance profile across workload simulation
•  Things not looking good for DDP-90 with volume slices at this point
•  Decision made:
   –  Lustre filesystem reinitialised as RAID 6 (8+2), 18x OSTs per array.
   –  Details on why later…

RAID6 at scale for Lustre
RAID 6

Design – gdata3 Object Storage
Gdata 3 RAID 6 configuration
(diagram: OSS A and OSS B as a High Availability Pair on the FDR IB Fabric; 9x OSTs to OSS A via Ctrl A; 9x OSTs to OSS B via Ctrl B; 9x 29TB 8+2 RAID 6 pools on preferred ctrlr A; 9x 29TB 8+2 RAID 6 pools on preferred ctrlr B; 29TB 8+2 RAID 6 pool = 1x OST; 90 Disk DDP – Controller A; 180x 4TB NL-SAS)
Building block capacity = 18x 29TB OSTs = 520TB

RAID6 Performance – Extended Testing
–  More testing again

RAID6 Performance – Extended Testing
–  More Excel again

Scaling Read and Write workload, RAID 6
•  Another series of tests, progressively loading up contention on an OSS or an E5600 controller to determine what is going on. (No DDP pools on R6)

Scaling Read and Write workload, RAID 6
•  RAID 6 Writers and Readers contending on Controller A
•  1 Reader = Single reader on entire array
•  2 readers = 1 reader on OSS 21, 1 on OSS 22.
(chart: Controller Contention test)
Note: Progressive performance decay as controller (A) reaches fully loaded configuration

Scaling Read and Write workload, RAID 6
•  RAID 6 Writers and Readers contending on same OSS
•  1 Reader = Single reader on entire array
•  2 readers = 1 reader on Controller A, 1 on Controller B.
(chart: OSS Contention test)
Note: Progressive performance decay as OSS21 reaches fully loaded configuration (OSS has 1 FDR IB link at 56G, 8x 640MB/sec = 5120MB/sec)

Recommended configuration for Lustre
Observations

DDP vs RAID6 on Controller Contention test
–  DDP-90, 8 OSTs: Write 438MB/sec each (avg)
–  RAID 6 (8+2), 8 OSTs: Write 535MB/sec each (avg)
–  DDP-90, 8 OSTs: Write 368MB/sec each (avg)
–  RAID 6 (8+2), 8 OSTs: Read 710MB/sec each

What is going on?
•  Reads are most affected by contention of Volume slices in Pool
   –  Why?
   –  Reads cannot be smoothed out by cache as easily as writes
   –  Writes can be held in controller cache until a slightly more optimal time to commit is available
•  Why does performance fall away so quickly when pool contention occurs?
   –  In RAID 6 configuration, a 10 disk set is formed into a RAID 6 (8+2) with only one volume (LUN) accessing the 10 drives.
   –  In DDP-90 configuration, 90 drives are formed into the one 90 disk pool, with 8 volume slices (LUNs). The access patterns to volume slices will randomly overlap, as will access to the underlying 90 disks.
   –  This is observed through the scattered disk latency patterns

What is going on?
•  Is DDP no good?
   –  Not at all, just choose your design, layout and use cases carefully
   –  Excellent for thin provisioned, many enterprise applications workloads
   –  Where workload is properly random, with few contending stream operations
•  Lustre workload
   –  Is not overly random.
   –  HPC workloads will be trying to use as many OSTs simultaneously as possible if properly optimised for Lustre
   –  You want to be hitting as many OSTs as hard as possible for maximum performance, with large transfer sizes and streaming loads.

Recommendation
•  Can I use DDP for Lustre?
   –  Yes (with caveats)
      •  Either use only 1 volume per DDP pool (equivalent of 235TB OST on DDP-90)
      •  Use smaller DDP pools like DDP-11, with only 1 volume per pool
      •  Where there is tight alignment between Pools and Volumes (1:1) performance impact is minimised
   –  Swiss National Supercomputing Centre evaluation of DDP-11 on E5500
      •  http://www.cscs.ch/fileadmin/publications/Tech_Reports/evaluation-NetApp-e5560__1_.pdf
•  But…
   –  Carving up a 180 4TB disk E5660 system using DDP-11 will result in significantly lower overall useable capacity. More individual disks are held in the array as sparing/preservation capacity.
   –  Same as always, and unchanged: you need to choose protection vs capacity.
   –  May not get a choice in this with 8TB+ drives. The rebuild times and risk factors may outweigh the loss of capacity.
   –  Alternatives to look at? JBODs + ZFS? (ZFS supported as alternate back-end filesystem in IEEL 2+)

Questions?

NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: [email protected]
Address: NCI, Building 143, Ward Road, The Australian National University, Canberra ACT 0200
W nci.org.au  @NCInews

Design – gdata3 Object Storage
Gdata 3 Object Storage Building Blocks
Front View, bezel removed
•  5x 12 Disk Drawers
Front View, Tray1, Drawer 5 open
•  12x 4TB NL SAS

Design – gdata3 Object Storage
Gdata 3 Object Storage Building Blocks
Front of Rack
•  3x Building blocks
•  42 RU Hosts and storage
•  42 RU APC Rack
•  1RU in-house custom built UTP Patch panel attached at RU0 position

Design – gdata3 Object Storage
Gdata 3 Object Storage Building Blocks
Rear of Rack
•  2x Building blocks

Scale Out
16x Building blocks
8PB, 144GB/sec+
(diagram: 16x OSS HA Pair building blocks, each labelled 9GB/sec+ and 0.5PB, plus 1x MDS HA Pair)
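
The capacity and throughput figures quoted across the slides can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, the inputs are copied from the slide text, and the rebuild floor assumes decimal units and an otherwise idle array, which the slides do not state.

```python
# Cross-check of arithmetic quoted on the slides.
# Names are illustrative; inputs are copied from the slide text.

KIB_PER_MIB = 1024

# RAID 6 (8+2): 8 data chunks of 128K form one full stripe.
raid6_stripe_kib = 8 * 128

# DDP: one D-Piece is 4096 segments of 128K; a D-Stripe holds 8 data D-Pieces.
d_piece_mib = 4096 * 128 // KIB_PER_MIB
d_stripe_usable_mib = 8 * d_piece_mib

# Building-block capacity for each layout, in TB.
ddp_block_tb = 16 * 30     # 16x 30TB volume slices
raid6_block_tb = 18 * 29   # 18x 29TB 8+2 pools

# OSS contention note: 8 streams at 640MB/sec each.
oss_load_mb_per_sec = 8 * 640

# Scale out: 16 building blocks at 0.5PB and 9GB/sec each.
total_pb = 16 * 0.5
total_gb_per_sec = 16 * 9

# Assumption: decimal units, idle array, rebuild limited only by one
# drive writing at ~160MB/sec. This is a floor, not a prediction.
rebuild_floor_hours = (4 * 1_000_000) / 160 / 3600

print(f"RAID 6 stripe: {raid6_stripe_kib}K")
print(f"D-Piece: {d_piece_mib}MB, D-Stripe usable: {d_stripe_usable_mib}MB")
print(f"Building block: DDP {ddp_block_tb}TB, RAID 6 {raid6_block_tb}TB")
print(f"OSS load: {oss_load_mb_per_sec}MB/sec")
print(f"Scale out: {total_pb:g}PB, {total_gb_per_sec}GB/sec")
print(f"4TB rebuild floor: {rebuild_floor_hours:.1f} hours")
```

The RAID 6 product comes to 522TB, which the slide rounds to 520TB. The single-drive rebuild floor of roughly 6.9 hours sits well below the 17h 14m rebuild reported under workload, consistent with the point that rebuilds compete with application I/O.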