WHITE PAPER
ACHIEVING FIVE-9s AVAILABILITY Quantum QXS Hybrid Storage Commitment to Reliability
CONTENTS
Introduction
Designing for High Availability
Proof of Success With Rigorous Analysis and Field Data
Conclusion
2 WHITE PAPER | Achieving Five-9s Availability: Quantum QXS Hybrid Storage Commitment to Reliability
INTRODUCTION
Businesses make great efforts to maximize the availability of their online data and to minimize, or even eliminate, the loss of valuable data. RAID (Redundant Array of Independent Disks) hardware configurations, supplemented with intelligent software features, help to achieve these goals. High reliability is more effective and less expensive to maintain when it is built into a reliable Storage Area Network (SAN) or virtualized storage array foundation.

RAID availability is measured by percentage of uptime. When SAN data is unavailable, so are the applications that must access it, and when system outages occur, most online work stops. This is why any downtime of mission-critical applications is extremely expensive. While figures vary by the type and size of the organization, some industries, such as energy and telecommunications, report revenue losses of $33,000 to $47,000 (USD) for every minute of downtime, according to a study conducted by META Group (subsequently acquired by Gartner, a leading information technology research and advisory company).

Measures of RAID Availability:
• 99.99% or Four-9s of availability translates to about 53 minutes of downtime annually.
• 99.999% or Five-9s of availability translates to just over 5 minutes of downtime annually.
• 99.9999% or Six-9s of availability translates to just over 31 seconds of downtime annually.

These are statistical averages, but moving from Four-9s to Five-9s availability adds roughly 47 minutes of average annual uptime. At $33,000 to $47,000 per minute, that is a projected $1.6 million to $2.2 million in annual productivity savings for businesses that rely on a SAN infrastructure for mission-critical applications.

Quantum understands the importance of high availability for its customers. Throughout its history, Quantum has built a reputation for exceptional quality in both hardware and software.
Quantum’s flagship product, Xcellis™ (powered by StorNext®), which uses QXS storage, is used in the most demanding environments. This paper’s remaining content is divided into two sections followed by a brief conclusion. The section on Designing for High Availability describes the measures taken during the engineering design phase to ensure high reliability and availability. The measures taken to validate system reliability are then covered in the section on Proof of Success With Rigorous Analysis and Field Data.
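The downtime figures quoted in the introduction follow directly from the availability percentage. As a quick check, the minimal sketch below converts an availability level into average annual downtime and estimates the value of moving between levels at a given cost per minute. The function names are chosen here for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def annual_downtime_minutes(availability_percent):
    """Average minutes of downtime per year at the given availability."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


def annual_savings(from_percent, to_percent, cost_per_minute):
    """Downtime cost avoided per year by moving to a higher availability."""
    saved_minutes = (annual_downtime_minutes(from_percent)
                     - annual_downtime_minutes(to_percent))
    return saved_minutes * cost_per_minute


if __name__ == "__main__":
    for label, pct in [("Four-9s", 99.99), ("Five-9s", 99.999), ("Six-9s", 99.9999)]:
        minutes = annual_downtime_minutes(pct)
        print(f"{label}: {minutes:.2f} minutes/year ({minutes * 60:.1f} seconds)")
    # Per-minute cost range taken from the study cited in the introduction.
    for cost in (33_000, 47_000):
        saving = annual_savings(99.99, 99.999, cost)
        print(f"Four-9s to Five-9s at ${cost:,}/min: ${saving:,.0f} per year")
```

These are long-run averages: a single outage in a given year can be far longer than the annual figure suggests, while most years may see none at all.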
DESIGNING FOR HIGH AVAILABILITY
High availability is achieved through a combination of three design elements:
1. Fault Avoidance – High reliability (measured by the Mean Time Between Failures, or MTBF) of the system and its several subsystems.
2. Fault Tolerance – Redundant subsystems to eliminate as many single points of failure as possible.
3. Serviceability – Rapid diagnosis and repair of any failure (measured by Mean Time to Repair, or MTTR) by using internal diagnostics code and adequately positioned Field Replaceable Units (FRUs) for all critical subsystems.

The following equation for availability demonstrates the vital role of serviceability in the system’s design. Maximum availability can be achieved only by minimizing the time it takes to diagnose and perform a repair, which is reduced significantly by using FRUs.

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
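To make the role of MTTR concrete, the sketch below evaluates the availability equation for several repair times and also solves it for the largest MTTR that still meets a target. The MTBF value and the function names are hypothetical, chosen only for illustration; they are not QXS figures.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours, target_availability):
    """Largest MTTR that still meets the target (the equation solved for MTTR)."""
    if not 0.0 < target_availability <= 1.0:
        raise ValueError("target availability must be in (0, 1]")
    return mtbf_hours * (1.0 - target_availability) / target_availability


if __name__ == "__main__":
    mtbf = 500_000.0  # hypothetical MTBF in hours, not a QXS figure
    for mttr in (4.0, 12.0, 26.0):
        print(f"MTTR {mttr:>4.0f} h -> availability {availability(mtbf, mttr):.6%}")
    print(f"MTTR budget for five-9s: {max_mttr_hours(mtbf, 0.99999):.2f} h")
```

For a fixed MTBF, availability falls as MTTR grows, which is why rapid diagnosis and FRU replacement matter as much as component reliability.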
To achieve maximum availability, designs must consider fault avoidance, fault tolerance, and serviceability. Doing so also improves their manufacturability. Each of these areas is described in turn here.

DESIGN FOR RELIABILITY, AVAILABILITY, & SERVICEABILITY (RAS)
Designing hardware for high Reliability, Availability, and Serviceability involves both the system and its several subsystems. To achieve high availability at the system level, QXS hybrid storage integrates reliability into the design process in several ways. The first and most obvious is the use of storage device (disk drive) redundancy within RAID configurations (RAID 1, 3, 5, 6, 10, and 60). Next, the standard 2U12 and 2U24 chassis RAIDs and JBODs use dual power supplies, each containing fans, for the entire system. (JBOD stands for Just a Bunch Of Disks: a group of disks not connected through an intelligent controller.) In the high-density RAID and JBOD systems (2U48 and 4U56 chassis), separate dual Fan Control Modules (FCMs) are provided as field replaceable units (FRUs) to prevent overheating, which would otherwise accelerate component wear-out and failure, and to allow rapid field replacement if ever needed.

Even higher availability is achieved by using redundant RAID and JBOD controllers. By eliminating as many single points of failure in these critical subsystems as possible, the system itself continues to operate normally during a failure of any single redundant FRU. While such FRU failures do factor into the subsystem’s MTBF (its FRU rated reliability), they do not diminish the availability of the overall system, since an outage occurs only when a single point of failure is exposed. The QXS hybrid storage architecture features full redundancy by design for every subsystem requiring a significant number of active components.
The mechanical chassis/midplane itself cannot be redundant, of course, since there is a single midplane that performs the simple function of connecting the redundant controllers to the redundant disk drives. The midplane has minimal active components, however, and Quantum selects these components for the highest possible reliability. The result is an extraordinarily high MTBF for the chassis and its midplane and, therefore, virtually no impact is projected on overall system availability.
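The effect of redundancy described above can be illustrated with the textbook series/parallel availability model. This is a minimal sketch, assuming independent failures and instantaneous, perfect failover; all per-FRU availabilities are hypothetical placeholders rather than QXS data, and the names are invented for the example.

```python
from math import prod


def series(*availabilities):
    """Every block is required, so the availabilities multiply."""
    return prod(availabilities)


def redundant_pair(a):
    """Two identical, independent blocks where either one suffices."""
    return 1.0 - (1.0 - a) ** 2


# Hypothetical per-FRU availabilities, for illustration only (not QXS data).
FRU_AVAILABILITY = {"psu": 0.9995, "fan": 0.9998, "controller": 0.9990}
MIDPLANE_AVAILABILITY = 0.9999999  # single, non-redundant block


def chassis_availability(redundant):
    """Chassis availability with or without duplicated FRUs."""
    blocks = [redundant_pair(a) if redundant else a
              for a in FRU_AVAILABILITY.values()]
    return series(*blocks, MIDPLANE_AVAILABILITY)


if __name__ == "__main__":
    print(f"Single FRUs:    {chassis_availability(False):.7f}")
    print(f"Redundant FRUs: {chassis_availability(True):.7f}")
```

In this model the non-redundant midplane caps what the chassis can achieve, which is why a very high midplane MTBF leaves overall availability almost unaffected.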
To enhance system serviceability for the shortest possible Mean Time To Repair (MTTR), QXS was developed with two complementary design techniques. The first is the use of a modular chassis with Field Replaceable Units (FRUs). The ability to swap out a confirmed failed FRU subsystem quickly and easily minimizes the time it takes to repair an installed system and restore it to full operating performance. By utilizing such a modular FRU design, which provides convenient access to all subsystems, QXS hybrid storage can be maintained seamlessly with minimal to no disruption in service during most repairs. Standard RAID chassis FRUs are illustrated in Figure 1 below. Two other views (Figures 2 and 3) of high-density RAID chassis with their drawer assemblies are shown next.

Figure 1: Standard 2U12 RAID/Chassis Assembly
Figure 2: High-Density 2U48 RAID Chassis/Drawer Assembly
Figure 3: High-Density 4U56 Chassis/Drawer Assembly
The mechanical design of the QXS hybrid storage standard chassis allows the power supply with internal fan, the JBOD I/O module and RAID controller, and the disk drives all to be serviced quickly as hot-swappable Field Replaceable Units (FRUs). Being able to replace redundant FRUs while the system is fully operational further enhances its availability. Note that the power supply and I/O module are accessible from the rear of the chassis, while the disk drives are accessed from the front. Also note that redundant FRUs are omitted from the figures so that the individual subsystems can be identified more clearly. The high-density (HD) chassis design, despite its higher part count, maintains high availability by separating the larger-wattage power supplies from the fans and providing redundant FRUs for both the Power Supply Units (PSUs) and the Fan Control Modules (FCMs).
The second serviceability technique is immediate notification of any failure by messaging to the user. Typically, the longer it takes to detect a failure, the longer it takes to repair it. Time is of the essence for another reason as well: the failure of a redundant subsystem creates, in effect, a temporary single point of failure that increases the risk of a system-level outage. For this reason, the firmware in all QXS hybrid storage arrays is designed to quickly detect, isolate, and confirm any failure, initiate a failover to a redundant subsystem, and provide immediate notification. The notification messaging can also be configured to match operational procedures, so that on-duty (or on-call) staff are properly and quickly notified via “phone home” features.

DESIGN FOR RELIABILITY (DFR)
At the FRU or subsystem level, QXS hybrid storage utilizes five separate design for reliability (DFR) techniques to increase the reliability of the overall system by addressing each subsystem and FRU, while also maximizing the inclusion of leading-edge SAN features.

The first DFR technique is to use only high-quality parts. Higher-quality parts cost more, of course, but their superior performance and longer service lives normally contribute to a lower total cost of ownership in the long run. Despite the higher per-part cost, minimizing the part count while concurrently enhancing feature functionality helps to improve the overall price/performance of a highly reliable design. For these reasons, QXS hybrid storage utilizes only the highest-quality parts available from reputable suppliers.
The second DFR technique is reducing the electronic component part count. Because any individual part can fail, the fewer there are, the higher the inherent reliability of the subsystem.

The third DFR technique involves the de-rating of selected electronic parts. Operating any part or component at or near its rated capacity inevitably shortens its useful life. For critical parts, QXS hybrid storage uses only those able to operate well below their maximum allowable specifications for voltage, power, and/or current. This can substantially increase their useful life and, therefore, the MTBF of the subsystem.

The fourth DFR technique has been touched upon earlier but deserves more detail. RAID reliability is also enhanced by using redundant FRU design techniques. Redundant FRU failover options are highly valued for mission-critical subsystems such as RAID Controllers, JBOD Input/Output Modules (IOMs), Fan Control Modules, Power Supply Units (PSUs), and Drives.

The fifth DFR technique is unique to QXS hybrid storage: designing for software reliability. In modern designs, software reliability is just as important as hardware reliability, and in some ways even more important. This is because software bugs (including those in firmware) that cause downtime normally take significantly longer to resolve than the more obvious hardware failures. Bugs often depend on system state (the set of circumstances leading up to the failure), making them difficult to reproduce and isolate quickly, and any patch or update must be fully tested before it can be released. Both add considerably to the MTTR for software failures, thereby adversely affecting system serviceability and availability.

To maximize software reliability, QXS hybrid storage is designed using Software Reliability Growth and Maturity Test techniques. Two primary metrics are utilized to assess software readiness.
The first is the Mean Time to Discovery (MTTD) of bugs, which assesses the maturity of the software and firmware of all new RAID system products during development. The second is the Reliability Growth Factor (RGF), which tracks the growth of the MTTD metric over time to indicate when adequate growth has been attained, as an input to product release decisions. After extensive testing, all RAID software designs must have a sufficiently high and stable MTTD and RGF before being released.

In addition, a new product system design is released to manufacturing and enters the production phase only after passing three comprehensive tests. The Engineering Verification Test (EVT) and the Design Verification Test (DVT) ensure that the system and/or its subsystems fully satisfy all design specifications, including those for high reliability of both the hardware and software. These tests also confirm that marginal variations in parts from component suppliers will not compromise system reliability over the product’s design life of at least ten years. The Reliability Demonstration Test (RDT) is a separate and rigorous evaluation of the final production hardware that verifies its calculated reliability, availability, and serviceability (covered below). Where some vendors use only a few samples in a fairly short demonstration test, the QXS hybrid storage reliability test runs for 8 to 12 weeks on 24 fully configured HD chassis to 30 fully configured Standard chassis RAID + JBOD systems. It is performed at the contract manufacturer’s (CM’s) location with final hardware and feature-complete code, and to pass it must end with a full MTBF demonstration at 80% confidence.
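The white paper does not state which statistical method underlies the "MTBF demonstration at 80% confidence". One common approach, shown here purely as an assumption, treats failures as a constant-rate (exponential) process in a time-terminated test and computes a one-sided lower confidence bound on MTBF from the accumulated unit-hours and the number of failures observed. This is equivalent to the usual chi-squared bound; the function names and the zero-failure scenario are illustrative.

```python
import math


def poisson_cdf(r, mean):
    """P(N <= r) for a Poisson-distributed count N with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, r + 1):
        term *= mean / k
        total += term
    return total


def demonstrated_mtbf(unit_hours, failures, confidence=0.80):
    """One-sided lower confidence bound on MTBF for a time-terminated test.

    Finds the largest expected failure count still consistent with seeing
    `failures` or fewer, then divides the accumulated unit-hours by it.
    """
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(failures, hi) > target:  # widen until the root is bracketed
        hi *= 2.0
    for _ in range(100):  # bisection on the monotonically decreasing CDF
        mid = (lo + hi) / 2.0
        if poisson_cdf(failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return unit_hours / hi


if __name__ == "__main__":
    # 30 systems for 12 weeks, the upper end of the test sizes quoted above.
    unit_hours = 30 * 12 * 7 * 24
    for failures in (0, 1, 2):
        bound = demonstrated_mtbf(unit_hours, failures)
        print(f"{failures} failure(s): MTBF >= {bound:,.0f} h at 80% confidence")
```

Under these assumptions, more systems or a longer test raise the MTBF that can be demonstrated, and every observed failure lowers it.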
DESIGN FOR MANUFACTURABILITY (DFM)
To ensure high Reliability, Availability, and Serviceability, the Quality Assurance (QA) Group establishes a comprehensive set of assurance controls in parallel with the engineering team’s design efforts. These controls ensure that the design facilitates manufacturing best practices for maintaining high quality and reliability, with high yields, minimal defects, and support for rapid fault isolation and repair. Quantum’s DFM process also evaluates all components and suppliers for quality assurance, using a strict set of qualifications to ensure compliance with minimal failure rates. All hardware is then manufactured under similar controls for the production process itself, which must comply with all Quantum component and assembly specifications.

Additional best practices during manufacturing include burn-in of components, subsystems, and fully configured RAID systems to minimize the escape of early-life (infant mortality) failures. Ongoing Reliability Testing (ORT) is performed on regular manufacturing samples: random samples are regularly selected for a 4-week extended test to ensure that the manufacturing process continues to yield the desired levels of quality and reliability and does not compromise the RAID design’s inherent capabilities.
PROOF OF SUCCESS WITH RIGOROUS ANALYSIS AND FIELD DATA
The reliability of any system can be assessed using a “bottom-up” analysis of its numerous electronic component parts. While these analyses can be remarkably accurate, especially when conducted in accordance with proven methodologies, it is also prudent to validate these calculated assessments with actual data from production units in use at customer sites. Quantum does both.

RELIABILITY, AVAILABILITY & SERVICEABILITY (RAS) ANALYSIS
The RAS methodology models the design’s Reliability, Availability, and Serviceability at both the system and subsystem levels, and also provides estimates of RAID data protection levels. Component-level Telcordia MTBF predictions performed at 25°C and 40°C ambient environments are included for all FRUs as a foundation for the analysis, and all analyses are performed in accordance with Telcordia’s SR-NWT-000332 Reliability Prediction Procedure for Electronic Equipment. This Bellcore/Telcordia prediction methodology assumes a series model for the hardware MTBF: it counts any random hardware failure occurrence (including those in redundant subsystems), regardless of whether or not it causes a system-level outage. The predicted FRU-level failure rate data is based on a combination of supplier life test data, Quantum field returns data, and Telcordia’s component industry data.

Per the RAS analysis methodology, all FRUs are assembled via Reliability Block Diagram (RBD) methods (shown in the figure below), and Monte Carlo simulations are applied to account for random FRU failures, different data protection levels (Vdisks/LUNs for RAID5 N+1 and RAID6 N+2), and different MTTRs; for example, service restore times of four hours, twelve hours, and twenty-six hours for any FRU or software failure. Recovery Time Objectives of 26 vs. 27 hours for MTTRs were determined to be the breakpoint for this configuration when comparing RAID5 vs. RAID6.
The analysis also assumes an average 24-second failover time to the redundant RAID or JBOD IOM per QXS hybrid storage internal test data under expected IO conditions.
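The Monte Carlo approach described above can be illustrated on the smallest possible block diagram: one redundant pair of FRUs. The sketch below is a toy model, not Quantum's simulation. It assumes exponentially distributed times to failure, a fixed repair time, independent units, and instantaneous failover, and it uses a deliberately low hypothetical MTBF so that simultaneous failures occur often enough to measure.

```python
import random


def down_intervals(mtbf_h, mttr_h, horizon_h, rng):
    """Simulate one unit and return its (start, end) downtime intervals."""
    t = 0.0
    intervals = []
    while True:
        t += rng.expovariate(1.0 / mtbf_h)  # random time to the next failure
        if t >= horizon_h:
            return intervals
        end = min(t + mttr_h, horizon_h)  # fixed repair time
        intervals.append((t, end))
        t = end


def overlap_hours(a, b):
    """Total time during which both units are down at the same moment."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def simulate_pair_unavailability(mtbf_h, mttr_h, horizon_h, seed=1):
    """Fraction of the horizon during which a redundant pair is fully down."""
    rng = random.Random(seed)
    unit_a = down_intervals(mtbf_h, mttr_h, horizon_h, rng)
    unit_b = down_intervals(mtbf_h, mttr_h, horizon_h, rng)
    return overlap_hours(unit_a, unit_b) / horizon_h


if __name__ == "__main__":
    mtbf, mttr = 1_000.0, 26.0  # hypothetical values, far worse than real FRUs
    estimate = simulate_pair_unavailability(mtbf, mttr, horizon_h=2.0e7)
    analytic = (mttr / (mtbf + mttr)) ** 2  # independent-units approximation
    print(f"simulated: {estimate:.3e}   analytic: {analytic:.3e}")
```

A full analysis would chain many such blocks in series, add drive groups and failover time, and repeat the run for each candidate MTTR, which is where simulation becomes more practical than closed-form formulas.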
The reliability block diagram (RBD) above depicts the QXS-6 Series hybrid arrays in a 4U56 chassis RAID + JBOD configuration. Here, the RAID chassis contains dual power supplies, dual fan control modules, and dual SAS RAID controllers, while the JBOD contains dual power supplies, dual fan control modules, and dual JBOD IOMs. Each RAID and JBOD chassis contains 56 x 4TB SAS disk drives, analyzed under both RAID5 and RAID6, resulting in two different RAS analysis configurations for comparison. These statistically analyzed availability rates are based on breakpoint/worst-case Recovery Time Objectives (RTO) for RAID servicing times.
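Why the repair window matters more for RAID5 than for RAID6 can be seen in a simplified exposure calculation: after one drive fails, a RAID5 group is lost if any further drive fails before the repair completes, whereas a RAID6 group survives one further failure. The sketch below assumes independent drives with exponential lifetimes and ignores rebuild load and unrecoverable read errors. The group size and drive MTBF are hypothetical; the white paper does not state them.

```python
import math


def p_drive_fails(window_hours, drive_mtbf_hours):
    """Probability that one drive fails within the window (exponential lifetime)."""
    return 1.0 - math.exp(-window_hours / drive_mtbf_hours)


def p_at_least(k, n, p):
    """Probability that at least k of n independent drives fail."""
    return 1.0 - sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
                     for i in range(k))


def p_exceeds_parity(group_size, parity_drives, window_hours, drive_mtbf_hours):
    """One drive has already failed; the group is lost if `parity_drives`
    more of the surviving drives fail before the repair completes."""
    p = p_drive_fails(window_hours, drive_mtbf_hours)
    return p_at_least(parity_drives, group_size - 1, p)


if __name__ == "__main__":
    group_size = 14           # hypothetical drives per RAID group
    drive_mtbf = 1_000_000.0  # hypothetical drive MTBF in hours
    for window in (4.0, 12.0, 26.0, 27.0):
        raid5 = p_exceeds_parity(group_size, 1, window, drive_mtbf)
        raid6 = p_exceeds_parity(group_size, 2, window, drive_mtbf)
        print(f"{window:>4.0f} h window: RAID5 {raid5:.2e}   RAID6 {raid6:.2e}")
```

In this simplified model the RAID5 exposure grows roughly linearly with the repair window while the RAID6 exposure grows roughly with its square, so RAID6 tolerates a longer service restore time for the same risk.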
A summary of the RAS analysis results is shown in the following table:

| Availability | RAID5 Dual Controllers | RAID6 Dual Controllers |
| --- | --- | --- |
| 27-Hour Recovery Time Objective (RTO) | 99.9989% | 99.999% |
| 26-Hour Recovery Time Objective (RTO) | 99.999% | 99.9991% |

Based on the two selected Recovery Time Objectives (RTO) for RAID servicing, this translates into the following average expected downtime per year, Mean Time Between Outage (MTBO), and Annual Outage Event Rate (AOER):

| Down Time Per Year (Minutes:Seconds) | RAID5 Dual Controllers | RAID6 Dual Controllers |
| --- | --- | --- |
| 27 Hours RTO | 5:47 | 5:15 |
| 26 Hours RTO | 5:15 | 4:44 |

| Mean Time Between Outage (Hours) | RAID5 Dual Controllers | RAID6 Dual Controllers |
| --- | --- | --- |
| 27 Hours RTO | 2,422,215 | 2,609,305 |
| 26 Hours RTO | 2,457,736 | 2,916,754 |

| Annual Outage Event Rate | RAID5 Dual Controllers | RAID6 Dual Controllers |
| --- | --- | --- |
| 27 Hours RTO | 0.3616% | 0.3357% |
| 26 Hours RTO | 0.3564% | 0.3003% |
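The tabulated figures can be cross-checked against one another. Reading the downtime entries as minutes and seconds, they match the availability percentages for a 365-day year, and each Annual Outage Event Rate matches the number of hours in a year divided by the corresponding Mean Time Between Outage, to within rounding. The sketch below shows both conversions; the function names are illustrative, and the relationships are inferred from the numbers rather than stated in the analysis.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600


def downtime_min_sec(availability_percent):
    """Average annual downtime as (minutes, seconds), rounded to the second."""
    seconds = round((1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR)
    return divmod(seconds, 60)


def annual_outage_event_rate_percent(mtbo_hours):
    """Expected outage events per year, as a percentage."""
    return 100.0 * HOURS_PER_YEAR / mtbo_hours


if __name__ == "__main__":
    for pct in (99.9989, 99.999, 99.9991):
        minutes, seconds = downtime_min_sec(pct)
        print(f"{pct}% -> {minutes}:{seconds:02d} per year")
    for mtbo in (2_422_215, 2_609_305, 2_457_736, 2_916_754):
        rate = annual_outage_event_rate_percent(mtbo)
        print(f"MTBO {mtbo:,} h -> AOER {rate:.4f}%")
```

An outage event rate of roughly 0.3% per year corresponds to one expected outage in about 300 system-years of operation.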
CONCLUSION
As demonstrated by the many measures outlined in this white paper, Quantum is committed to delivering the highest possible reliability and availability across our entire product line. Quantum also takes great pride in the fact that this commitment to quality and reliability has enabled us to deliver “carrier-class” five-9s of availability in products with “enterprise-class” price/performance. No other company delivers such high availability at such a competitive price.

But don’t just take our word for it. Compare the high availability of QXS hybrid storage for yourself with that of any other product in its class. Evaluate the vendor’s commitment to quality and reliability by its design, manufacturing, and testing processes. Look at the bottom-line results. Talk to colleagues about their experiences. We are confident that, if you do, you too will come to realize what many others already know: Quantum storage solutions deliver industry-leading high reliability and availability.
ABOUT QUANTUM
Quantum is a leading expert in scale-out storage, archive and data protection, providing solutions for sharing, preserving and accessing digital assets over the entire data lifecycle. From small businesses to major enterprises, more than 100,000 customers have trusted Quantum to address their most demanding data workflow challenges. With Quantum, customers can Be Certain™ they have the end-to-end storage foundation to maximize the value of their data by making it accessible whenever and wherever needed, retaining it indefinitely and reducing total cost and complexity. See how at www.quantum.com/customerstories.
www.quantum.com • 800-677-6268 ©2016 Quantum Corporation. All rights reserved. Quantum, the Quantum logo, StorNext and Xcellis are either registered trademarks or trademarks of Quantum Corporation and its affiliates in the United States and/or other countries. All other trademarks are the property of their respective owners.
WP00209A-v01 Jan 2016