
Enterprise Memory


Enterprise Memory
Reducing DPPM rates by 90%
White Paper M-WP001, December 2013

Corporate Headquarters: 39870 Eureka Dr., Newark, CA 94560, USA • Tel: (510) 623-1231 • Fax: (510) 623-1434 • E-mail: [email protected]
Customer Service: Tel: (+1) 978-303-8500 • Email: [email protected]
Latin America: Tel: (+55) 11 4417-7200 • Email: [email protected]
Europe: Tel: (+44) 7825-084427 • Email: [email protected]
Asia/Pacific: Tel: (+65) 6232-2858 • Email: [email protected]

Table of Contents

Executive Summary
Introduction: The Need for High Reliability Memory
Memory Failure
How Reliable is Memory?
Latent Defects and the Weak Bit Error
Module Assembly
Lack of Solutions from DRAM OEMs
Extended Burn-in Should be Done at the Module Level
The SMART Solution: Enterprise Memory
SMART's Range of Enterprise Memory Products
Proven Reliability
Enterprise Memory Reduces TCO

Executive Summary

According to industry research and OEM feedback, DIMM failures in high memory content applications are pervasive. Typical DPPM (defective parts per million) rates range from 5,000 to 15,000. A study conducted by Google in 2010 indicated that 8% of DIMMs shipped are affected by errors. With some of the largest data centers in the world housing 1 million or more servers, this translates into a DRAM component content of between 288 million and 576 million DRAMs. A single DRAM component failure can lead to a DIMM failure, setting off a chain reaction of cost and field support issues.

The data in many data centers is critical. End users such as financial services companies and scientific research institutions rely on the data integrity of their servers. The cost of maintaining redundant systems and/or replacing failed DIMMs can be quite high.

SMART Modular Technologies has recognized this issue and is addressing it by providing the highest quality DIMMs available, "Enterprise Memory." Developed by SMART, Enterprise Memory can help OEMs reduce costs by dramatically lowering the DPPM rate of their DIMMs. SMART has helped OEMs attain an unprecedented reliability level of below 200 DPPM with its DDR3 RDIMM Enterprise Memory.

This exceptional DPPM rate is accomplished using a SMART proprietary Ultra-Reliability Test (URT) process. Early life failures and weak modules are eliminated in SMART's URT lab, where every enterprise grade module undergoes a rigorous triple-stress process entailing hours of burn-in at high speed operation, running custom software at 99% DRAM utilization.
A significant and growing module sample size, careful customer data collection, and ongoing reliability analysis have confirmed the low DPPM over a four-year period.

Introduction: The Need for High Reliability Memory

When a server, storage or networking appliance fails, a significant expense is incurred. On-site repair requires skilled manpower, time and an inventory of spare parts on hand. Shipping a system to a repair facility incurs a shipping cost in addition to the troubleshooting and repair expense. Maintaining a supply of spare components is a considerable expense, complicated by the fact that different appliances use different components. After an appliance is repaired it must be thoroughly tested for reliability before it can be put back into service.

There is also the cost of down-time. Some services could be disrupted, and a replacement appliance may be needed until the down unit can be put back online. Consider also the possibility of lost data, lost transactions or lost business. Yet perhaps the biggest unrecoverable loss caused by a system failure is the hit that the original equipment manufacturer (appliance OEM) takes to its reputation. An appliance failure for any reason reflects negatively on an OEM's quality standards.

Enterprise customers require higher levels of reliability because a single appliance may be supporting thousands of users. Moreover, in virtual environments a single physical server may be supporting many virtual servers, so a single hardware failure could disrupt numerous functions. One of the most common reasons for system failure is memory errors. This paper will discuss enterprise appliance failures caused by memory errors and how enterprise memory can dramatically reduce such failures.

Costs of Failure
• Repair costs
• Spare parts inventory
• Shipping costs
• Loss of data
• Down time
• Disruption of services
• Replacement systems
• Blow to reputation

Memory Failure

DRAM vendors produce memory devices (chips), and module makers assemble memory devices and other components into a printed circuit board assembly called a memory module that is used in appliances. An appliance may be a server, a storage appliance or a networking appliance that is used by enterprises. Memory modules are typically built with nine, 18, 36 or 72 discrete memory devices. Enterprise appliances typically use eight, 12 or 24 memory modules. An appliance designed for memory intensive applications may use up to 24 memory modules, each with up to 72 devices, or up to 1,728 total discrete memory devices in a single appliance. This is important because the probability of memory failure grows roughly in proportion to the total number of memory devices in the appliance. Larger memory arrays naturally have a higher likelihood of memory failure.

[Figure 1: diagram of memory devices mounted on memory modules, which are installed in an appliance.]
Figure 1. Appliances use many memory modules, which in turn use many memory devices. Manufacturing defects can occur in device fabrication or in module assembly.

How Reliable is Memory?

All DRAM vendors perform burn-in testing on every memory device to screen out manufacturing defects. Their burn-in testing is done at elevated temperatures to accelerate failure of any marginal devices, and is generally very effective. However, device burn-in is normally done at very slow clock speeds, and uses read/write test patterns rather than running a real-world application. Suppose a good DRAM device burn-in test yields 99.99% good devices, but 0.01% of the "good" devices are test escapes (devices that passed test but were actually only marginally good and would fail in actual usage). If only one device on one memory module fails, then the whole appliance fails. Figure 2 shows how a 99.99% effective device screen can propagate to a four percent appliance failure rate when loaded with a dozen 36-chip modules.
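
The compounding described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the white paper: it assumes that test escapes are statistically independent and that a single bad device fails the whole appliance. The function name all_devices_good is illustrative only.

```python
def all_devices_good(device_yield: float, devices_per_module: int, modules: int) -> float:
    """Probability that no device in the appliance is a test escape.

    Assumption (not from the paper): escapes are independent, and one bad
    device is enough to fail the whole appliance.
    """
    total_devices = devices_per_module * modules
    return device_yield ** total_devices


if __name__ == "__main__":
    DEVICE_YIELD = 0.9999  # 99.99% effective device screen, as in the text
    for modules in (1, 8, 12):
        for chips in (9, 18, 36):
            good = all_devices_good(DEVICE_YIELD, chips, modules)
            print(f"{modules:2d} DIMMs x {chips:2d} chips: "
                  f"{100 * (1 - good):.2f}% appliance failure rate")
```

Under these assumptions, twelve 36-chip modules (432 devices) give an appliance failure probability of a little over 4%, which matches the four percent figure quoted in the text.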

[Figure 2: reliability in percent (93 to 100) for 1 device and for 9-chip, 18-chip and 36-chip modules, with one curve each for 1, 8 and 12 DIMMs.]
Figure 2. Approximate Reliability Rates for Various Module Types and System Configurations.

Even a very effective device burn-in screen by the DRAM vendor can result in an unacceptable failure rate in the appliance due to the high number of DRAM devices in a single appliance. Such failures may not show up immediately, and memory errors may not occur until after the system is deployed.

Latent Defects and the Weak Bit Error

There are three types of memory errors: hard errors, soft errors and weak bits. Hard errors are caused by loose connections or other problems in the system, and are therefore relatively easy to identify and reproduce. Soft errors result from fluctuations in environmental variables such as temperature and voltage. Because soft errors are random, they are very difficult to identify. Weak bit errors are often the hardest to detect and prevent. Figure 3 summarizes these three types of memory errors.

Error Type     Error Frequency   Occurrence     Causes
Hard Errors    10 FITs           reproducible   loose connections, aging components, critical timing, resistive or capacitive variations, noise in the system
Soft Errors    100 FITs          random         temperature, voltage, humidity, pressure, vibrations, EMI, ground loops, cosmic rays, and alpha particles
Weak Bits      100 FITs          reproducible   temperature, voltage, humidity, pressure, vibrations, EMI, ground loops, cosmic rays, and alpha particles

Figure 3. Three common types of memory errors.

Latent defects eventually show up as "weak bits", detected by system Error-Correction-Code (ECC) methods. This third class of failure, the weak bit, behaves like a hard error in the sense that it is repeatable, yet it displays characteristics that are representative of a soft error: it fails only under specific conditions (VDD, timing, data pattern, temperature, I/O loading, etc.)
while functioning normally under most conditions. This behavior makes the weak bit very hard to detect and eliminate, both from the DRAM devices shipped by DRAM vendors and from the memory modules shipped by module makers.

Module Assembly

In addition to possible marginal quality issues in DRAM devices, memory failure can also result from the memory module manufacturing process. There are countless possible quality problems in printed circuit board (PCB) assembly, which can be mitigated through constant and precise attention to detail by the module maker. Most assembly problems can be screened out by the module maker's test process. However, issues like cold solder joints are sometimes hard to detect in normal module testing. Some marginal quality modules can still slip through test, only to fail later, after an appliance has been deployed.

Enterprise appliance OEMs need extremely reliable memory modules that can support heavy workloads over years of use at elevated operating temperatures with virtually zero field failures. A more rigorous burn-in test is needed to screen out devices with latent defects and modules with marginal quality problems.

Marginal devices have inherent defects resulting from manufacturing aberrations, which cause time and stress dependent failures. Without burn-in these failures would show up as infant mortality or early lifetime failures. Since all DRAM device vendors perform 100% dynamic burn-in, these infant mortality and early lifetime failures should all be successfully screened out. However, analysis of modules that actually failed in use after being deployed reveals that another kind of defect must be present that escapes detection in standard electrical test and burn-in by DRAM suppliers. These undetected defects are called latent defects.

Lack of Solutions from DRAM OEMs

Latent defects not caught at burn-in continue to degrade over time due to thermal stresses, such as the solder reflow cycle during SMT assembly, and/or due to load stress during system testing and use. The failure rate attributable to latent defects should be low, but can be similar to hard and soft failure rates. Typical failure rates for hard failures are reported as 10 FITs (Failures-In-Time, with time being 10^9 device-hours), and soft-error failure rates are at least an order of magnitude higher, at more than 100 FITs. The failure rate on a module can therefore be significant, because the failure rate of the module is proportional to the number of devices on the module.

Of course, significantly longer device burn-in cycles are more effective at screening out latent defects. With no limit on the length of burn-in, most (but still not all) of the latent defects can be discovered in test, so the affected devices never get deployed in an appliance. However, a longer burn-in means a slower production flow and higher cost, which DRAM suppliers are not willing to incur. Furthermore, merely extending the device burn-in time does not sufficiently stress the memory devices to expose all latent defects. Some such defects will only appear under a heavy workload while running a real-world application.

There is a dearth of high reliability memory options from the major DRAM suppliers because their burn-in and test processes are tailored to high volume PC OEMs, who need highly competitive costs above all else. Most major-brand memory is tested rigorously enough to ensure relatively low failure rates in desktop and laptop computers. This screens out most manufacturing defects, but does not catch marginal quality problems like latent defects that could only be detected under intense stress or an extended burn-in.
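
To make the FIT figures above concrete, here is a minimal sketch of how a per-device FIT rate scales to a module. It uses the 10 FIT and 100 FIT values quoted in this section purely as illustrative inputs. The helper names and the one-year horizon are assumptions, and the result counts error events, not necessarily uncorrectable failures, since ECC may correct some of them.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def module_fit(device_fit: float, devices: int) -> float:
    """Module failure rate in FITs, assuming device rates simply add up."""
    return device_fit * devices


def expected_events(fit: float, hours: float) -> float:
    """Expected number of events in the given hours (1 FIT = 1 per 10^9 hours)."""
    return fit * hours / 1e9


if __name__ == "__main__":
    per_device = 10 + 100  # hard + soft error rates from the text, in FITs
    for devices in (9, 18, 36, 72):
        fit = module_fit(per_device, devices)
        per_year = expected_events(fit, HOURS_PER_YEAR)
        print(f"{devices:2d}-device module: {fit:6.0f} FITs, "
              f"{per_year:.4f} expected events per module-year")
```

Because the module rate is the device rate multiplied by the device count, a 72-device module sees eight times the event rate of a 9-device module under this simple additive model.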

Using off-the-shelf modules from a major DRAM supplier in an enterprise class appliance results in an unacceptable field failure rate because too many latent defects escape detection in their standard mass market burn-in process.

Extended Burn-in Should be Done at the Module Level

Another problem with an extended device burn-in is that it does nothing to identify possible defects resulting from module assembly. Burn-in screening at the module level is far superior to a more intense screening of individual devices because the module can be tested as a finished product, so quality issues inside memory devices as well as in module assembly (SMT) can be detected. Not only are the memory devices stressed; the PCB, passive components and solder joints are also stressed. Quality issues with solder reflow can only be identified by testing the finished memory module. If a more rigorous burn-in test is to be applied, it makes sense to apply it after module assembly so that latent quality problems in devices and module assembly can be detected and screened out in a single test.

The SMART Solution: Enterprise Memory

SMART has established that a longer and more rigorous burn-in process is necessary to screen out marginal defects, and that this extended burn-in should take place after module assembly. However, the test should be not only longer but also more rigorous in order to accelerate the discovery of latent defects, and the length of the test must be balanced with the need for cost minimization. The objective, then, is to identify all latent defects within a relatively short but stressful module burn-in.

A standard burn-in test process that effectively screens out all latent defects in the shortest possible time can be developed by varying carefully controlled test conditions to determine the most efficient way to make a marginal DIMM fail. After extensive testing and analysis, followed by more testing and more analysis, SMART engineers developed a premium system level burn-in test process called SMARTscreen™ that is designed to expose any latent defects and prevent weak bit errors in the field.

SMART's precision controlled process triple stresses every module by subjecting it to a marathon dynamic burn-in in an 85°C environmental chamber, while running an intensive real-world application, the High Performance Computing Linpack benchmark (HPL), at the module's rated operating speed. This combination of temperature, load and speed over time exposes any latent quality issues that otherwise may have escaped.

[Figure 4: memory modules entering the SMARTscreen test, which applies temperature, load, speed and time to expose latent defects.]
Figure 4. SMARTscreen combines temperature, load, speed and time to stress test memory and expose latent defects.

It is vastly advantageous to test the module running a real application at rated operating speeds rather than a mere simulation at low clock speeds as in a typical device burn-in. Testing the module at full clock speed allows far more test cycles to be completed in a much shorter time compared to typical low-speed device burn-in. Higher speeds also put greater stress on the module by making it work harder.
The module is stressed both physically and functionally. Any marginal devices or modules are screened out through the SMARTscreen test.

SMART's triple stress burn-in results in a high reliability DIMM called enterprise memory that is suitable for use in enterprise appliances. SMART uses specialized proprietary test fixtures to facilitate SMARTscreen, and has sufficient test capacity for volume production of enterprise memory modules.

SMART's Range of Enterprise Memory Products

The SMARTscreen process is technology agnostic when it comes to the memory module. Although the process was developed specifically for DDR3 RDIMMs, SMART can apply the same burn-in process to any memory module in its lineup, including:
- DDR, DDR2, DDR3 (and DDR4)
- RDIMM, UDIMM, SO-DIMM, Mini-DIMM, XR-DIMM
Some of these products may require new or modified test fixtures, but the SMARTscreen process is the same.

Proven Reliability

The reliability of SMART's enterprise memory modules is unrivaled. SMART has shipped more than a hundred thousand Enterprise Memory modules with a proven failure rate of less than two in ten thousand, which allows a comparison of failure rates between OEM memory and SMART Enterprise Memory. In a 4GB DDR3 module analysis, the company shipped over 31,000 Enterprise Memory modules to an enterprise storage OEM customer, which yielded a DPPM rate of only 161. In an 8GB DDR3 analysis, SMART shipped over 88,000 Enterprise Memory modules with a DPPM rate of 151. These DPPMs translate into data integrity, enhanced reputation and total cost of ownership savings for OEM customers (Figure 5).

[Figure 5: bar chart of DPPM, on an axis from 0 to 6000, for Enterprise Memory and OEM memory.]
Figure 5. Reliability Rates: SMART Enterprise Memory vs. OEM memory.

Enterprise Memory Reduces TCO

The greatest advantage of enterprise memory is that it substantially improves the reliability of enterprise appliances by screening out marginal memory modules that would become field failures. This results in more robust enterprise appliances in which memory errors are nearly eliminated from the failure rate calculation.

An additional benefit is that enterprise memory can reduce TCO compared to off-the-shelf RDIMMs. It is intuitive that detecting quality problems prior to deployment is cheaper than a field failure after deployment. However, there is always a cost tradeoff point where incremental burn-in testing loses its value. The SMARTscreen process was developed with cost in mind, so the maximum benefit is attained at the minimal cost.

Figure 6 shows a cost comparison of Enterprise Memory versus off-the-shelf standard memory for a medium sized data center, assuming the cost of a single appliance failure is approximately $500. In this case, Enterprise Memory demonstrates a net savings of $1.22M. Intangible and unknown costs such as loss of data, loss of business, replacement system costs, and loss of reputation are not factored into this model. An actual cost comparison will vary depending on the type and density of the DIMMs used, the number of DIMMs installed per appliance, actual fallout rates, and actual service and repair costs for failed appliances.

                       Standard Memory   Enterprise Memory
DIMMs per Year         500,000           500,000
Failed DIMMs           2,500             60
Cost per Failure       $500              $500
Total Failure Costs    $1,250,000        $30,000

Figure 6. Comparison of Costs With and Without Enterprise Memory in a Medium Sized Data Center.

The actual cost of a failed appliance also has a direct bearing on the decision whether to use enterprise memory. If the cost of a failure is only $100, then the cost of SMARTscreen may not be justified. However, if the total cost of a failed appliance is $1,000, then enterprise memory offers a much lower TCO than off-the-shelf OEM memory.

The advantage of enterprise memory increases with the density of the memory module because, as explained earlier, the probability of a memory failure scales directly with the number of memory devices.
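
The Figure 6 cost comparison discussed above reduces to simple arithmetic, sketched below. This is an illustration, not SMART's model. The DPPM inputs of 5,000 and 120 are back-calculated from the failed-DIMM counts in Figure 6 (2,500 and 60 out of 500,000), the function names are hypothetical, and any price premium for Enterprise Memory is left out because the paper does not state one.

```python
def failed_dimms(dimms_per_year: int, dppm: float) -> float:
    """Expected number of failed DIMMs for a given defective-parts-per-million rate."""
    return dimms_per_year * dppm / 1_000_000


def failure_cost(dimms_per_year: int, dppm: float, cost_per_failure: float) -> float:
    """Total yearly cost of failures, counting only the per-failure cost."""
    return failed_dimms(dimms_per_year, dppm) * cost_per_failure


if __name__ == "__main__":
    DIMMS = 500_000          # DIMMs per year, from Figure 6
    COST_PER_FAILURE = 500   # dollars, from Figure 6
    standard = failure_cost(DIMMS, 5_000, COST_PER_FAILURE)  # 5,000 DPPM (derived)
    enterprise = failure_cost(DIMMS, 120, COST_PER_FAILURE)  # 120 DPPM (derived)
    print(f"Standard memory:   ${standard:,.0f}")
    print(f"Enterprise memory: ${enterprise:,.0f}")
    print(f"Difference:        ${standard - enterprise:,.0f}")
```

Changing COST_PER_FAILURE to $100 or $1,000 shows how strongly the difference depends on the cost of a single appliance failure.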
There may be no benefit to using enterprise memory in an appliance that only uses low density DIMMs. However, appliances with high memory density may suffer from high failure rates, which could be curtailed by using enterprise memory. Figure 2 above illustrates this point.

[Figure 7: net cost (vertical axis, 185 to 240) of Enterprise Memory and OEM memory plotted against cost of appliance failure (horizontal axis, 100 to 800).]
Figure 7. Net Cost Comparison: SMART Enterprise Memory vs. OEM memory with varying cost of appliance failure.

©2014. All rights reserved. The stylized "S" and "SMART" as well as "SMART Modular Technologies" are trademarks of SMART Modular Technologies. All other trademarks and registered trademarks are the property of their respective companies. 01.17.14/MWP001/Rev.1