
Seagate ExaScale HPC Storage, Miro Lehocky, System Engineer

Transcript

Seagate ExaScale HPC Storage
Miro Lehocky, System Engineer, Seagate Systems Group, HPC
© 2015 Seagate, Inc. All Rights Reserved.

› 100+ PB Lustre File System
› 130+ GB/s Lustre File System
› 140+ GB/s Lustre File System
› 55 PB Lustre File System
› 1.6 TB/sec Lustre File System
› 500+ GB/s Lustre File System
› 1 TB/sec Lustre File System

Market leadership ...

Rank | Name | Computer | Site | Total Cores | Rmax (TFLOPS) | Rpeak (TFLOPS) | Power (kW) | File system | Size | Perf
1 | Tianhe-2 | TH-IVB-FEP Cluster, Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi | National Super Computer Center in Guangzhou | 3,120,000 | 33,862.7 | 54,902.4 | 17,808 | Lustre/H2FS | 12.4 PB | ~750 GB/s
2 | Titan | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x | DOE/SC/Oak Ridge National Laboratory | 560,640 | 17,590.0 | 27,112.6 | 8,209 | Lustre | 10.5 PB | 240 GB/s
3 | Sequoia | BlueGene/Q, Power BQC 16C 1.60GHz, Custom Interconnect | DOE/NNSA/LLNL | 1,572,864 | 17,173.2 | 20,132.7 | 7,890 | Lustre | 55 PB | 850 GB/s
4 | K computer | Fujitsu, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | RIKEN AICS | 705,024 | 10,510.0 | 11,280.4 | 12,659 | Lustre | 40 PB | 965 GB/s
5 | Mira | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | DOE/SC/Argonne National Lab. | 786,432 | 8,586.6 | 10,066.3 | 3,945 | GPFS | 28.8 PB | 240 GB/s
6 | Trinity | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | DOE/NNSA/LANL/SNL | 301,056 | 8,100.9 | 11,078.9 | | Lustre | 76 PB | 1,600 GB/s
7 | Piz Daint | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x | Swiss National Supercomputing Centre (CSCS) | 115,984 | 6,271.0 | 7,788.9 | 2,325 | Lustre | 2.5 PB | 138 GB/s
8 | Shaheen II | Cray XC40, Xeon E5-2698v3 16C 2.3GHz, Aries interconnect | KAUST, Saudi Arabia | 196,608 | 5,537 | 7,235 | 2,834 | Lustre | 17 PB | 500 GB/s
9 | Hazel Hen | Cray XC40, Xeon E5-2680v3 12C 2.5GHz, Aries interconnect | HLRS - Stuttgart | 185,088 | 5,640.2 | 7,403.5 | | Lustre | 7 PB | ~100 GB/s
10 | Stampede | PowerEdge C8220, Xeon E5-2680 8C 2.7GHz, IB FDR, Intel Xeon Phi | TACC/Univ. of Texas | 462,462 | 5,168.1 | 8,520.1 | 4,510 | Lustre | 14 PB | 150 GB/s

n.b. NCSA Blue Waters: 24 PB, 1,100 GB/s (Lustre 2.1.3)

Still the same concept: fully integrated, fully balanced, no bottlenecks ...

ClusterStor Scalable Storage Unit
› Intel Ivy Bridge or Haswell CPUs
› F/EDR, 100 GbE & 2x 40GbE, all-SAS infrastructure
› SBB v3 form factor, PCIe Gen-3
› Embedded RAID & Lustre support
Software stack: ClusterStor Manager, Lustre File System (2.x), Data Protection Layer (PD-RAID/GridRAID), Linux OS, embedded server modules, Unified System Management (GEM-USM)
(A bandwidth-sizing sketch for this building-block approach follows this slide.)

So what's new?
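A note on the building-block slide above: because each Scalable Storage Unit contributes both capacity and bandwidth, a Lustre file system of this kind is typically sized by the number of SSUs. Below is a minimal sizing sketch; the per-SSU bandwidth value and the ssus_for_target helper are illustrative assumptions (no per-SSU throughput figure is quoted in this deck), not a Seagate sizing rule.

# Rough sizing sketch for the SSU building-block approach: aggregate bandwidth
# scales roughly linearly with the number of SSUs until the fabric or client
# count becomes the limit. GB_S_PER_SSU is an illustrative assumption.
import math

GB_S_PER_SSU = 6.0  # assumed sustained GB/s per 5U84 SSU (illustrative, not a quoted spec)

def ssus_for_target(target_gb_s, per_ssu=GB_S_PER_SSU):
    # Number of SSU building blocks needed to reach a target aggregate bandwidth.
    return math.ceil(target_gb_s / per_ssu)

for target in (138, 500, 1000, 1600):  # GB/s, in the range of the file systems tabled above
    print(f"{target:>5} GB/s -> ~{ssus_for_target(target)} SSUs")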
Seagate Next Gen Lustre appliance

CS9000
ClusterStor Lustre Software
› Lustre v2.5
› Linux v6.5
Management Switches
› 1GbE switches (component communication)
Top of Rack Switches (ToR)
› InfiniBand (IB) Mellanox FDR or 40GbE (high-availability connectivity in the rack)
ClusterStor Management Unit Hardware (CMU)
› 4 servers in a 2U chassis, FDR / 40GbE
› Servers 1 & 2: file system management, boot
› Servers 3 & 4: metadata (user data location map)
› 2U24 JBOD (HDD storage for management and metadata)
ClusterStor Scalable Storage Unit (SSU)
› 5U84 enclosure
› 6Gbit SAS
› Object Storage Server (OSS) network I/O: Mellanox FDR/40GbE
› 7.2K RPM HDDs

CS L300
ClusterStor Lustre Software
› Lustre v2.5 Phase 1, Lustre 2.7 Phase 2
› Linux v6.5 Phase 1, Linux 7.2 Phase 2
Management Switches
› 1GbE switches (component communication)
Top of Rack Switches (ToR)
› InfiniBand (IB) Mellanox EDR or 100/40GbE
ClusterStor System Management Unit (SMU)
› 2U24 w/ embedded servers in HA configuration
› File system management, boot, storage
ClusterStor Metadata Management Unit (MMU)
› 2U24 w/ embedded servers in HA configuration
› Metadata (user data location map)
ClusterStor Scalable Storage Unit (SSU)
› 5U84 enclosure
› Object Storage Server (OSS) network I/O: Mellanox EDR, 100/40GbE
› 10K and 7.2K RPM HDDs
› 2U24 enclosure with Seagate Koho Flash SSDs

ClusterStor GridRAID

Feature: De-clustered RAID 6, up to 400% faster to repair; rebuild of a 6TB drive takes ~33.3 hours with MD RAID vs. ~9.5 hours with GridRAID (see the rebuild-time sketch below). Benefit: recover from a disk failure and return to full data protection faster.
Feature: Repeal Amdahl's Law (the speed of a parallel system is gated by the performance of its slowest component). Benefit: minimizes application impact on widely striped file performance.
Feature: Minimized file system fragmentation. Benefit: improved allocation and layout maximizes sequential data placement.
Feature: 4-to-1 reduction in OSTs. Benefit: simplifies scalability challenges.
Feature: ClusterStor Integrated Management. Benefit: CLI and GUI configuration, monitoring and management reduce OpEx.

[Diagram: traditional RAID with separate parity rebuild disk pools per OSS server vs. GridRAID with de-clustered parity rebuild pools spanning the drives behind each OSS]

ClusterStor L300 HPC Disk Drive: HPC-optimized performance in a 3.5in HDD

ClusterStor L300 HPC 4TB SAS HDD: an HPC industry first, best mixed-application workload value
› Performance leader: world-beating performance over other 3.5in HDDs, speeding data ingest, extraction and access
› Strong capacity: 4TB of storage for big data applications
› Reliable workhorse: 2M-hour MTBF and 750 TB/year workload ratings for reliability under the toughest workloads your users throw at it
› Power efficient: Seagate's PowerBalance feature provides significant power benefits for minimal performance tradeoffs

[Chart: CS HPC HDD vs. nearline 7.2K RPM HDD on random writes (4K IOPS, WCD), random reads (4K Q16 IOPS) and sequential data rate (MB/s)]
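Why de-clustering speeds up the repair: in a conventional MD RAID rebuild the reconstructed data funnels into a single spare drive, so the effective rate is roughly one drive's sustained write rate under load, whereas GridRAID spreads parity groups and distributed spare space across all drives behind the OSS, so many drives read and write rebuild data in parallel. The toy model below is only a sketch: the effective rates (50 and 175 MB/s) and the rebuild_hours helper are assumptions chosen to reproduce the ~33.3 h and ~9.5 h figures quoted on the GridRAID slide above.

# Toy rebuild-time model for a 6 TB drive (illustrative assumptions, not measured values).
DRIVE_TB = 6
BYTES_PER_TB = 1e12

def rebuild_hours(drive_tb, effective_mb_per_s):
    # Time to reconstruct one drive's worth of data at a given effective rate.
    return drive_tb * BYTES_PER_TB / (effective_mb_per_s * 1e6) / 3600

# MD RAID: rebuild is bottlenecked by the single spare drive (~50 MB/s assumed under load).
print(f"MD RAID  rebuild: {rebuild_hours(DRIVE_TB, 50):.1f} h")    # ~33.3 h

# GridRAID: rebuild reads/writes are spread over all drives behind the OSS,
# modelled here as ~3.5x the single-drive effective rate.
print(f"GridRAID rebuild: {rebuild_hours(DRIVE_TB, 175):.1f} h")   # ~9.5 h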
IBM Spectrum Scale based solutions

Best-in-class features, designed for HPC, Big Data and Cloud

Connectivity
› IB: FDR, QDR and 40 GbE (EDR, 100GbE and Omni-Path on the roadmap)
› Exportable via cNFS, CIFS, object storage and HDFS connectors
› Linux and Windows clients

Robust feature set
› Global shared access with a single namespace across clusters/file systems
› Building-block approach with scalable performance and capacity
› Distributed file system data caching and coherency management
› RAID 6 (8+2) with de-clustered RAID
› Snapshot and rollback
› Integrated lifecycle management
› Backup to tape
› Non-disruptive scaling, restriping and rebalancing
› Synchronously replicated data and metadata

Management and support
› ClusterStor CLI-based single point of management
› RAS / phone home
› SNMP integration with business operation systems
› Low-level hardware monitoring & diagnostics
› Embedded monitoring
› Proactive alerts

Hardware platform
› Industry's fastest converged scale-out storage platform
› Latest Intel processors
› Embedded high-availability NSD servers integrated into the data storage backplane
› Fastest available I/O
› Extremely dense storage enclosures with 84 drives in 5U

ClusterStor Spectrum Scale performance-density rack configuration
Key components:
› ClusterStor Manager Node (2U enclosure): 2 HA management servers, 10 drives
› 2 management switches
› 5U84 enclosures configured as NSDs + disk: 2 HA embedded NSD servers, 76 to 80 7.2K RPM HDDs, 4 to 8 SSDs
› 42U reinforced rack: custom cable harness, up to 7 enclosures in each rack (base + expansion)
Performance:
› Up to 56 GB/sec per rack (see the throughput sketch after the standard-configuration slide below)
[Diagram: base rack and expansion rack, each with redundant ETN and high-speed networks, a CSM node and NSD enclosures]

ClusterStor Spectrum Scale capacity-optimized rack configuration
Key components:
› ClusterStor Manager Node (2U enclosure): 2 HA management servers, 10 drives
› 2 management switches
› 5U84 enclosures configured as NSDs + disk: 2 HA embedded NSD servers, 76 to 80 7.2K RPM HDDs, 4 to 8 SSDs
› 5U84 enclosures configured as JBODs: 84x 7.2K RPM HDDs, SAS-connected to the NSD servers in a 1-to-1 ratio
› 42U reinforced rack
Performance:
› Up to 32 GB/sec per rack
[Diagram: base rack and expansion rack alternating NSD and JBOD enclosures behind redundant ETN and high-speed networks]

ClusterStor Spectrum Scale – standard configuration
› 2U24 management server
› 5U84 disk enclosure with 2x NSD (MD) servers: > 8 GB/sec per 5U84 (clustered), ~20K file creates per second, ~2 billion files
› Per NSD (MD) server: metadata SSD pool (~10K file creates/sec, ~1 billion files, 2x 800 GB SSD) and user data pool (~4 GB/sec, 40 HDDs)
› ClusterStor ToR & management switch, rack, cables, PDU
› Factory integration & test
› Up to (7) 5U84s in the base rack
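The per-rack throughput figures follow directly from the "> 8 GB/sec per 5U84" building block quoted on the standard-configuration slide. A minimal sketch is below; the rack_throughput_gb_s helper and the 4-NSD-plus-3-JBOD mix assumed for the capacity-optimized rack are illustrative assumptions, not stated configurations.

# Per-rack Spectrum Scale throughput from the 5U84 NSD building block.
GB_S_PER_NSD_ENCLOSURE = 8  # "> 8 GB/sec per 5U84" with 2 embedded NSD servers (slide above)

def rack_throughput_gb_s(nsd_enclosures, jbod_enclosures=0):
    # JBOD enclosures hang off the NSD servers and add capacity, not bandwidth.
    return nsd_enclosures * GB_S_PER_NSD_ENCLOSURE

# Performance-density rack: up to 7 NSD enclosures per rack.
print(rack_throughput_gb_s(nsd_enclosures=7))                     # 56, matching "up to 56 GB/sec per rack"

# Capacity-optimized rack (assumed mix): 4 NSD enclosures paired with 3 JBODs.
print(rack_throughput_gb_s(nsd_enclosures=4, jbod_enclosures=3))  # 32, matching "up to 32 GB/sec per rack"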
Object Storage based archiving solutions

ClusterStor A200 Object Store features
› Can achieve four-nines (99.99%) system availability
› High-density storage: up to 3.6 PB usable per rack
› "Infinite" number of objects (2^128)
› 8+2 network erasure coding for cost-effective protection (see the capacity sketch at the end of this transcript)
› Rapid drive rebuild (< 1 hour for 8TB in a large system)
› Global object namespace
› ClusterStor A200 Object API & a portfolio of network-based interfaces
› Integrated management & consensus-based HA
› Performance scales with capacity (up to 10 GB/s per rack)

ClusterStor A200: resiliency built in
Redundant ToR switches
› Combine data and management network traffic
› VLANs used to segregate network traffic
› 10GbE with 40GbE ToR uplinks
Management unit
› 1x 2U24 enclosure
› 2x embedded controllers
Storage units (SSUs)
› Titan v2 5U84 enclosures; 6x is the minimum configuration
› 82x 8TB SMR SATA HDDs
› Single embedded storage controller
› Dual 10GbE network connections
› Resilient to 2 SSU failures (12 SSUs minimum)
42U rack with wiring loom & power cables
› Dual PDUs
› 2U spare space reserved for future configuration options
› Blanking plates as required

Economic benefits of SMR drives
SMR drives: shingled technology increases platter capacity by 30-40%
› Write tracks are overlapped by up to 50% of the write width
› The read head is much smaller and can reliably read the narrower tracks
SMR drives are optimal for object stores, as most data is static/WORM
› Updates require special intelligence and may be expensive in terms of performance
› Wide tracks in each band are often reserved for updates
CS A200 manages SMR drives directly to optimize workflow & caching
[Diagram: read head vs. write head track widths; an update destroys a portion of the next, overlapped track]

Current ClusterStor Lustre solutions line-up
› CS-1500: base rack, expansion rack
› CS-9000: base rack, expansion rack
› L300: base rack, expansion rack

Current ClusterStor Spectrum Scale / Active Archive line-up
› G200: base rack, expansion rack
› A200: base rack, expansion rack
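Capacity sketch for the A200's 8+2 network erasure coding, referenced from the feature list above. The 8+2 scheme, the 82x 8 TB drives per 5U84 and the 3.6 PB per-rack figure come from the slides; the seven-SSU-per-rack count is an assumption (only a six-SSU minimum is stated), so treat this as a rough consistency check rather than a specification.

# Usable-capacity sketch for 8+2 erasure coding on an A200 rack.
DATA_SHARDS, PARITY_SHARDS = 8, 2                           # 8+2 network erasure coding
efficiency = DATA_SHARDS / (DATA_SHARDS + PARITY_SHARDS)    # 0.8 of raw capacity is usable

ssus_per_rack = 7      # assumption: only the 6-SSU minimum is stated above
drives_per_ssu = 82    # 82x SMR SATA HDDs per Titan v2 5U84
drive_tb = 8           # 8 TB per drive

raw_tb = ssus_per_rack * drives_per_ssu * drive_tb          # 4592 TB raw
usable_pb = raw_tb * efficiency / 1000                      # ~3.67 PB before spare space/overhead

print(f"raw: {raw_tb} TB, usable after 8+2 coding: {usable_pb:.2f} PB")
# Roughly in line with the "up to 3.6 PB usable per rack" figure once spare
# space and file-system overhead are taken out.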