Transcript
Providing Australian researchers with world-class computing services
Lustre Community Conference 2015
Distributed Parity v RAID6 for Lustre
Daniel Rodwell, Manager, Data Storage Services
August 2015
nci.org.au | @NCInews
Agenda
• Overview
  – DDP vs RAID6 quick overview
  – Why would you use DDP?
  – RAID6 rebuild and applications
• DDP test platform at NCI
  – Gdata3 built as DDP
  – Testing and results
• RAID 6
  – Gdata3 built as RAID6
  – Testing and results
• Observations and Recommendations
  – Lustre and DDP
2
Agenda
• Disclaimer
  – These are our observations to date
  – Not authoritative, nor an IEEE paper
  – Other factors in design decisions are not listed here
  – Choose your own adventure
    • Your mileage may vary
    • Need to select an architecture that works for your site
3
Redundant Array of Inoperative Drives?
DDP & RAID6 Overview 4
Traditional RAID 6 summary
– Typically deployed as "8+2" RAID 6
– De facto standard in most deployments
– 8 "data drives" equivalent of usable data
– 2 "parity drives" equivalent of P+Q parity data
– Typically constructed as 8x 128K chunks, for a 1024K (1MiB) stripe (a back-of-envelope sketch of this geometry follows this slide)
– Tolerates the failure of 2 drives with no loss of user data
– Typically no performance penalty for reads; writes require parity calculation of the P and Q parity blocks. Modern RAID controllers can offer RAID6 performance similar to or better than RAID10 with a like number of drives.
(image: Wikipedia)
– Loss of 1 drive = ‘degraded state’: medium-high risk
– Loss of 2 drives = ‘no redundancy state’: very high risk
– Loss of 3 drives = ‘pool offline’: typically fatal
(image: NCI Whiteboard, unofficial)
5
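To make the 8+2 arithmetic concrete, here is a minimal Python sketch using the chunk size and drive capacity quoted on these slides; the capacity figure is a decimal-TB approximation that ignores formatting overhead.

```python
# Back-of-envelope geometry for an 8+2 RAID6 pool, using values from the
# slides above (128K chunks, 4TB NL-SAS drives). Approximate only.
DATA_DRIVES = 8
PARITY_DRIVES = 2                 # P + Q parity
CHUNK_KIB = 128                   # per-drive chunk size
DRIVE_TB = 4                      # decimal TB per drive

stripe_kib = DATA_DRIVES * CHUNK_KIB                          # 1024 KiB = 1 MiB full stripe
usable_fraction = DATA_DRIVES / (DATA_DRIVES + PARITY_DRIVES)
pool_usable_tb = DATA_DRIVES * DRIVE_TB                       # parity capacity excluded

print(f"full stripe     : {stripe_kib} KiB")
print(f"usable fraction : {usable_fraction:.0%}")
print(f"usable per pool : ~{pool_usable_tb} TB (decimal, before formatting overhead)")
```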
The problem with RAID today
– As disk sizes increase, RAID rebuild times become problematic: 20+ hours for a single-disk rebuild in a RAID6 set under normal workload conditions for a 4TB drive (a rough rebuild-time estimate follows this slide).
– Overall Lustre volume performance is degraded while this occurs. 1 RAID6 pool = 1 OST.
– 1 slow OST can degrade the Lustre volume; the amount varies depending on the scenario (particularly if a small number of OSTs, and file striping across OSTs, is in use).
– RAID controller overhead can impact the performance of other pools on the same controller which are not in a rebuild state, adding to the degradation.
– Risk of losing a second disk in the RAID6 pool during the rebuild (typically 8+2 R6 is used).
– If a pool enters a no-redundancy state (i.e. loss of 2 drives in a RAID6 pool), HPC operations are suspended while the rebuild occurs due to the increased risk.
6
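As a rough check on the 20+ hour figure, the sketch below estimates rebuild time under the assumption that a classic RAID6 rebuild is bottlenecked on the single replacement drive; the 35% derate for production load is an assumed value, not a measurement.

```python
# Rough rebuild-time estimate for a classic RAID6 rebuild, where the single
# replacement drive is the write bottleneck. Assumed numbers only.
drive_tb = 4                      # 4TB NL-SAS drive
rebuild_mb_per_sec = 160          # ~sequential write rate of a 7.2K NL-SAS drive
busy_derate = 0.35                # assumed fraction of drive bandwidth left under production load

best_case_h = drive_tb * 1e6 / rebuild_mb_per_sec / 3600
loaded_h = best_case_h / busy_derate

print(f"idle-array rebuild : ~{best_case_h:.1f} h")
print(f"under load (assumed {busy_derate:.0%} of bandwidth) : ~{loaded_h:.0f} h")
```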
Rebuild impact on application
– Example application: IOR (simulates an HPC I/O workload)
– Plot of 198 RAID6 OSTs; poorly performing OSTs identified.
– Slow drive replaced. RAID6 rebuild time = 17h 14m.
[Plot annotations: "Application run duration" and "Application run duration extended"]
7
DDP
– DDP: Dynamic Disk Pools with highly distributed parity. Many drives are involved in a rebuild. Sparing capacity is built in to the pool.
– DDP rebuilds are in the 1-3 hour range (can be faster).
– Instantaneous performance (stream) is typically rated lower for DDP. Can be higher for random workloads than RAID5 or RAID6.
– DDP average performance, when measured over months or years, is higher than RAID 5 or 6.
– RAID rebuild time will become more problematic with 6 & 8TB drives. Drive performance (130-150MB/sec) is generally not increasing.
– Investigate DDP for Lustre now to assist with future architectures and planning.
(image: Netapp)
8
DDP Summary
– DDP typically has more drives in a pool; the minimum is DDP-11 (11 drives).
– D-Piece: a contiguous 512MB section within a disk, constructed from 4096 128K segments.
– D-Stripe: 10x D-Pieces. D-Pieces are selected from some (not all) drives in the pool using pseudo-random selection (ensuring the pattern is not repeated).
– A D-Stripe is effectively a RAID6 group sitting within a larger pool of drives (8 data and 2 parity chunks, aka Pieces). A D-Stripe has a usable capacity of 8x D-Pieces (8x 512MB = 4096MB).
– A Volume within a Dynamic Disk Pool is created by constructing N x D-Stripes (a sketch of this arithmetic follows this slide).
(image: Netapp) Note: 10x D-Pieces are represented by a colour, with 12 drives in the Dynamic Disk Pool. For each D-Stripe, only 10 drives are used; 2 are not used. Sparing (preservation) capacity = 2.
9
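A small sketch of the D-Piece/D-Stripe arithmetic described above. The 30 TiB volume size is an assumption taken from the gdata3 volume slices discussed later in the deck.

```python
# D-Piece / D-Stripe geometry from the slide above (sizes as stated there).
SEGMENT_KIB = 128
D_PIECE_MIB = 512                          # 4096 x 128 KiB segments
D_STRIPE_PIECES = 10                       # 8 data + 2 parity, like RAID6
D_STRIPE_DATA_PIECES = 8

assert D_PIECE_MIB * 1024 // SEGMENT_KIB == 4096   # segments per D-Piece

d_stripe_usable_mib = D_STRIPE_DATA_PIECES * D_PIECE_MIB     # 4096 MiB usable

# How many D-Stripes back an (assumed) 30 TiB volume slice?
volume_tib = 30
d_stripes = volume_tib * 1024 * 1024 // d_stripe_usable_mib

print(f"usable per D-Stripe : {d_stripe_usable_mib} MiB")
print(f"D-Stripes in a {volume_tib} TiB volume : ~{d_stripes:,}")
```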
DDP Summary
– The loss of a single drive affects a number of D-Stripes that have D-Pieces on the lost drive.
– DDP has a notion of sparing capacity. This spare capacity is used during reconstruction to reconstruct and rebalance the D-Pieces. DDP-11 has a sparing capacity of 1.
– The entire pool participates in the reconstruction, with a larger number of drives both reading and writing during the reconstruct; cf. RAID 5 or 6, where only 1 drive is effectively writing as the new member of the pool, so rebuild performance is limited to the write performance of an individual drive (7.2K NL-SAS = ~160MB/sec).
(image: Netapp)
– Only the required D-Pieces are read for the affected D-Stripes during reconstruction, not entire drives (a toy placement sketch follows this slide).
10
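The spread of rebuild work can be illustrated with a toy model: D-Pieces placed on pseudo-randomly chosen drives, as described above. This is not NetApp's actual placement algorithm, just a sketch of why most of a 90-disk pool participates when one drive fails.

```python
import random

# Toy model only: each D-Stripe's 10 D-Pieces land on 10 pseudo-randomly
# chosen drives. Count how many stripes touch a failed drive and how many
# surviving drives would take part in reconstructing them.
random.seed(1)
POOL_DRIVES = 90
PIECES_PER_STRIPE = 10
N_STRIPES = 5000          # arbitrary sample size

stripes = [random.sample(range(POOL_DRIVES), PIECES_PER_STRIPE)
           for _ in range(N_STRIPES)]

failed = 0
affected = [s for s in stripes if failed in s]
helpers = {d for s in affected for d in s if d != failed}

print(f"stripes touching the failed drive : {len(affected)} of {N_STRIPES}")
print(f"drives participating in rebuild   : {len(helpers)} of {POOL_DRIVES - 1}")
```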
DDP at scale for Lustre
Dynamic Disk Pools 11
Lustre Design
• Preferable OST Characteristics
– OST performance is typically determined at initial build by the choice of disk array technology (choose carefully if adding incrementally over multiple years).
– Performance of all OSTs (and OSSes) in the filesystem should be very similar.
– Mixed OST sizes and/or performance will result in hotspotting and inconsistent read/write performance as files are striped across OSTs or allocated in a round-robin / stride fashion.
– Design the building block for your workload – controller-to-disk-to-IOPS ratios need to be considered.
– Mixed 1MB stream and random 4K IO workload. Lustre uses 1MB transfers as the default (optimise the RAID config for a 1MB stripe size). A striping sketch follows this slide.
[Diagram: OSS HA pair with Object Storage Servers (OSS) and Object Storage Targets (OST)]
– More small OSTs are preferable to a few very large OSTs:
  • Loss of a single 230TB OST = a lot of data gone.
  • A very large OST (230TB) will take a long time to e2fsck. Many smaller OSTs can be e2fsck'd in parallel.
  • Each OST mapping on the client requires some memory – fewer are better.
  • Smaller OSTs can fill quickly with a few large files if striping is not set by the user or by default.
  • Balancing act.
12
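For reference, a minimal sketch of how Lustre's RAID-0 style file striping maps byte offsets to OST objects, assuming the default 1 MiB stripe size; the stripe count of 8 is an illustrative assumption (in a real filesystem both are set per file or per directory).

```python
# Illustrative only: round-robin striping of a file across OST objects in
# 1 MiB units (Lustre's default stripe size); stripe count of 8 is assumed.
STRIPE_SIZE = 1 << 20
STRIPE_COUNT = 8

def ost_index(offset: int) -> int:
    """Return which of the file's OST objects holds this byte offset."""
    return (offset // STRIPE_SIZE) % STRIPE_COUNT

# A 32 MiB sequential write touches every OST object 4 times:
touches = [ost_index(o) for o in range(0, 32 << 20, STRIPE_SIZE)]
print([touches.count(i) for i in range(STRIPE_COUNT)])   # [4, 4, 4, 4, 4, 4, 4, 4]
```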
Design – gdata3 Object Storage: Gdata 3 Object Storage Building Blocks
1x Building Block
• 2x Fujitsu RX2530-M1
• 1x E5660 60-disk controller shelf
• 2x DE6600 60-disk expansion shelves
13
Design – gdata3 Object Storage: Gdata 3 Object Storage Building Blocks
• OST storage for Gdata3 is built using NetApp E5660, with 8x OSS-OST 12Gbit SAS interconnects.
• Array (OST):
  – 180x 4TB NL-SAS, 7.2K
  – Dual 12G SAS controllers
  – 2x 90-disk DDPs
  – 8 volume slices per 90-disk DDP
[Diagram labels: 6Gbit SAS, Ctrl A x2, Ctrl B x2, OSS 1, OSS 2, OSTs]
• Hosts (MDS):
  – 2x servers as a High Availability pair
  – 1RU Fujitsu RX2530-M1s, each with:
    • 2x Intel Xeon E5-2640v3 'Haswell' (8 cores, 16 hyperthreads, 2.6GHz base, 3.4GHz Turbo Boost max)
    • 256GB DDR4 RDIMM
    • Single-port FDR connection to the fabric
    • Quad-port 6G SAS connection to the E5660
14
Design – gdata3 Object Storage: Gdata 3 Object Storage Building Blocks
[Diagram: OSS A and OSS B form a High Availability pair on the FDR IB fabric, with SAS array-host connections to two 90-disk DDPs (one per controller). Each 90-disk DDP presents 8x 30TB volume slices; each 30TB volume slice = 1x OST. 180x 4TB NL-SAS drives in total.]
15
Design – gdata3 Object Storage: Gdata 3 DDP configuration
[Diagram: OSS A and OSS B form a High Availability pair on the FDR IB fabric. The 90-disk DDP on Controller A presents 8x 30TB volume slices (8x OSTs) to OSS A; the 90-disk DDP on Controller B presents 8x 30TB volume slices (8x OSTs) to OSS B. Each 30TB volume slice = 1x OST. 180x 4TB NL-SAS drives in total.]
Building block capacity = 16x 30TB OSTs = 480TB, plus 6 DDP spares per 90-disk pool.
16
DDP Performance – Initial Testing
– DDP can traditionally perform slightly lower under peak stream workload conditions.
– Need to evaluate the impact of rebuild time versus slightly lower stream performance.
– Interim benchmark for E5600 controllers looks very promising:
  • 1x building block = 180 disks, 2x 90D DDP, 8x slices per DDP, with a fully balanced SAS/multipath/controller config
  • 6.26 GB/sec write test
  • 9.19 GB/sec read test
  • 8x 1MB block size streams per pool, driven by the OSS HA pair
– But... is there really a free lunch?
17
DDP Performance – Extended Testing – Lots of testing in front of screens like this
18
DDP Performance – Extended Testing – And this
19
DDP Performance – Extended Testing
– And lots of Excel worksheets.
– A/B-type testing, comparing performance as load is scaled on various components such as array controllers, disk pool, OSS.
– Something appears to be not quite right with read performance.
– Read performance should be higher than write on the E-series platform.
– Reads are slower than writes, from light to heavy workloads.
– Read performance suffers as load is increased, but the load is well within the max rated MB/sec of the controller.
20
DDP Performance – Extended Testing
– Disk latency numbers vary widely; instantaneous read performance varies across the run.
• Er, not so good there.
4.9 ms current latency, 243 ms max latency
21
Scaling Read and Write workload, DDP
• Series of tests, progressively loading up contention on a DDP-90 pool or an E5600 controller to determine what is going on
22
Scaling Read and Write workload, DDP
• DDP-90 writers and readers contending on Controller A
• 1 reader = a single reader on the entire array
• 2 readers = 1 reader on Pool 21, 1 on Pool 22
Controller Contention test
Note: At 4 readers or writers (all on the same controller, 2 in each pool), performance falls away rapidly
23
Scaling Read and Write workload, DDP
• DDP-90 writers and readers contending on the same pool
• 1 reader = a single reader on the entire array
• 2 readers = 1 reader on Controller A, 1 on Controller B
Pool Contention test
Note: Read performance is poor where multiple readers contend on the same 90-disk pool. This is not controller contention.
24
Revert to RAID
• A lot of tuning later, no appreciable difference in performance was achieved.
• A DDP-constructed filesystem can hit peak performance, but cannot sustain a consistent, well-averaged / tight performance profile across the workload simulation.
• Things are not looking good for DDP-90 with volume slices at this point.
• Decision made:
  – Lustre filesystem reinitialised as RAID 6 (8+2), 18x OSTs per array.
  – Details on why later...
25
RAID6 at scale for Lustre
RAID 6 26
Design – gdata3 Object Storage: Gdata 3 RAID 6 configuration
[Diagram: OSS A and OSS B form a High Availability pair on the FDR IB fabric. 9x 29TB 8+2 RAID 6 pools on preferred Controller A present 9x OSTs to OSS A; 9x 29TB 8+2 RAID 6 pools on preferred Controller B present 9x OSTs to OSS B. Each 29TB 8+2 RAID 6 pool = 1x OST. 180x 4TB NL-SAS drives in total.]
Building block capacity = 18x 29TB OSTs = 520TB (a capacity comparison sketch follows this slide).
27
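A rough usable-capacity comparison of the two layouts of the same 180-drive building block, treating drives as decimal TB and reporting TiB; formatting and metadata overhead is ignored, so the results only approximate the 520TB and 480TB figures on the slides.

```python
# Rough usable capacity of the same 180x 4TB shelf under the two layouts.
# Decimal-TB drives reported in TiB; formatting/metadata overhead ignored.
DRIVE_TB, TIB = 4, 2**40

raid6_osts = 18                                                # 18x 8+2 RAID6 pools
raid6_tib = raid6_osts * 8 * DRIVE_TB * 1e12 / TIB             # ~524, the ~520TB on the slide

ddp_data_drives = 180 - 2 * 6                                  # 6 preservation drives per 90-disk pool
ddp_tib = ddp_data_drives * DRIVE_TB * 1e12 * (8 / 10) / TIB   # ~489, the ~480TB on the slide

print(f"RAID6  (18x 8+2 OSTs)    : ~{raid6_tib:.0f} TiB usable")
print(f"DDP-90 (2 pools, 16 OSTs): ~{ddp_tib:.0f} TiB usable")
```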
RAID6 Performance – Extended Testing – More testing again
28
RAID6 Performance – Extended Testing – More Excel again
29
Scaling Read and Write workload, RAID 6
• Another series of tests, progressively loading up contention on an OSS or an E5600 controller to determine what is going on (no DDP pools now; RAID 6 only).
30
Scaling Read and Write workload, RAID 6
• RAID 6 writers and readers contending on Controller A
• 1 reader = a single reader on the entire array
• 2 readers = 1 reader on OSS 21, 1 on OSS 22
Controller Contention test
Note: Progressive performance decay as controller A reaches a fully loaded configuration
31
Scaling Read and Write workload, RAID 6
• RAID 6 writers and readers contending on the same OSS
• 1 reader = a single reader on the entire array
• 2 readers = 1 reader on Controller A, 1 on Controller B
OSS Contention test
Note: Progressive performance decay as OSS21 reaches a fully loaded configuration (the OSS has 1 FDR IB link at 56G; 8x 640MB/sec = 5120MB/sec). A link-rate sanity check follows this slide.
32
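A quick sanity check of the network ceiling noted above, using the standard FDR InfiniBand lane rate and 64/66b encoding.

```python
# Sanity check of the OSS network ceiling mentioned on the slide.
# FDR InfiniBand: 4x 14.0625 Gb/s lanes, 64/66b encoding.
lanes, lane_gbps, encoding = 4, 14.0625, 64 / 66
fdr_data_gbytes = lanes * lane_gbps * encoding / 8          # ~6.8 GB/s data rate

osts, per_ost_mb = 8, 640
demand_gbytes = osts * per_ost_mb / 1000                    # 5.12 GB/s, as on the slide

print(f"FDR link data rate : ~{fdr_data_gbytes:.1f} GB/s")
print(f"8 OSTs x 640 MB/s  : {demand_gbytes:.2f} GB/s")
```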
Recommended configuration for Lustre
Observations
33
DDP vs RAID6 on Controller Contention test
DDP-90, 8 OSTs: Write 438 MB/sec each (avg)
RAID 6 (8+2), 8 OSTs: Write 535 MB/sec each (avg)
DDP-90, 8 OSTs: Read 368 MB/sec each (avg)
RAID 6 (8+2), 8 OSTs: Read 710 MB/sec each
34
What is going on?
• Reads are most affected by contention of volume slices in the pool. Why?
  – Reads cannot be smoothed out as easily as writes by using cache.
  – Writes can be held in controller cache until a slightly more optimal time to commit is available.
• Why does performance fall away so quickly when pool contention occurs?
  – In the RAID 6 configuration, a 10-disk set is formed into a RAID 6 (8+2) with only one volume (LUN) accessing the 10 drives.
  – In the DDP-90 configuration, 90 drives are formed into one 90-disk pool with 8 volume slices (LUNs). The access patterns of the volume slices will randomly overlap, as will access to the underlying 90 disks (see the toy contention sketch after this slide).
  – This is observed in the scattered disk latency patterns.
35
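The volume-slice contention argument can be illustrated with another toy model: two streams of pseudo-randomly placed D-Stripes in one 90-disk pool end up sharing essentially every drive, whereas two dedicated 8+2 RAID6 LUNs share none. Again, this is an assumed placement model, not the array's real algorithm.

```python
import random

# Toy illustration of why slices on one big pool contend while separate
# RAID6 LUNs do not: count drives shared by two concurrent streams.
random.seed(2)

def ddp_stream_drives(pool_drives=90, stripes=1000, pieces=10):
    """Drives touched by one stream of D-Stripes in a shared 90-disk pool."""
    used = set()
    for _ in range(stripes):
        used.update(random.sample(range(pool_drives), pieces))
    return used

ddp_a, ddp_b = ddp_stream_drives(), ddp_stream_drives()
raid6_a, raid6_b = set(range(0, 10)), set(range(10, 20))   # two dedicated 8+2 LUNs

print(f"DDP slices : {len(ddp_a & ddp_b)} shared drives out of 90")
print(f"RAID6 LUNs : {len(raid6_a & raid6_b)} shared drives")
```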
What is going on?
• Is DDP no good?
  – Not at all; just choose your design, layout and use cases carefully.
  – Excellent for thin-provisioned and many enterprise application workloads.
  – Where the workload is properly random, with few contending stream operations.
• Lustre workload
  – Is not overly random.
  – HPC workloads will be trying to use as many OSTs simultaneously as possible if properly optimised for Lustre.
  – You want to be hitting as many OSTs as hard as possible for maximum performance, with large transfer sizes and streaming loads.
36
Recommendation
• Can I use DDP for Lustre?
  – Yes (with caveats):
    • Either use only 1 volume per DDP pool (equivalent of a 235TB OST on DDP-90)
    • Or use smaller DDP pools like DDP-11, with only 1 volume per pool
    • Where there is tight alignment between pools and volumes (1:1), the performance impact is minimised
  – Swiss National Supercomputing Centre evaluation of DDP-11 on E5500:
    http://www.cscs.ch/fileadmin/publications/Tech_Reports/evaluation-NetApp-e5560__1_.pdf
• But...
  – Carving up a 180x 4TB disk E5660 system using DDP-11 will result in significantly lower overall usable capacity. More individual disks are held in the array as sparing/preservation capacity (a rough capacity sketch follows this slide).
  – Same as always, and unchanged: you need to choose protection vs capacity.
  – You may not get a choice in this with 8TB+ drives. The rebuild times and risk factors may outweigh the loss of capacity.
  – Alternatives to look at? JBODs + ZFS? (ZFS is supported as an alternate back-end filesystem in IEEL 2+)
37
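A rough sketch of the DDP-11 capacity trade-off mentioned above, using the pool model from the earlier slides (8+2 D-Stripes, a preservation capacity of 1 drive per DDP-11). Real arrays reserve additional capacity, so treat the numbers as indicative only.

```python
# Rough capacity for carving the same 180x 4TB shelf into DDP-11 pools
# (one volume per pool), using the assumed model: 8+2 D-Stripes and a
# preservation capacity of 1 drive per DDP-11. Approximate only.
TOTAL_DRIVES, DRIVE_TB, TIB = 180, 4, 2**40

pools = TOTAL_DRIVES // 11                    # 16 pools
stranded = TOTAL_DRIVES - pools * 11          # 4 drives left over
per_pool_tib = (11 - 1) * DRIVE_TB * 1e12 * (8 / 10) / TIB   # ~29 TiB per DDP-11

print(f"{pools} DDP-11 pools, {stranded} drives stranded")
print(f"~{pools * per_pool_tib:.0f} TiB usable, vs ~524 TiB for 18x RAID6 (8+2)")
```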
Questions? 38
Providing Australian researchers with world-class computing services
NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: [email protected]
Address: NCI, Building 143, Ward Road, The Australian National University, Canberra ACT 0200
nci.org.au | @NCInews
Design – gdata3 Object Storage: Gdata 3 Object Storage Building Blocks
Front view – bezel removed: 5x 12-disk drawers
Front view – Tray 1, Drawer 5 open: 12x 4TB NL-SAS
40
Design – gdata3 Object Storage: Gdata 3 Object Storage Building Blocks
Front of rack:
• 3x building blocks
• 42 RU of hosts and storage
• 42 RU APC rack
• 1RU in-house custom-built UTP patch panel attached at the RU0 position
41
Design – gdata3 Object Storage: Gdata 3 Object Storage Building Blocks
Rear of rack: 2x building blocks
42
Scale Out
16x Building blocks: 8PB, 144GB/sec+
[Diagram: 16x OSS HA pairs plus 1x MDS HA pair; each building block provides 9GB/sec+ and 0.5PB]
43