
AMD's Industry-Leading High-Performance Computing and Low-Temperature Control Technology

AMD's Industry-Leading High-Performance Computing and Low-Temperature Control Technology
Kevin Lai, Senior Product Marketing Manager, AMD
Nov 2011

AMD CUSTOMERS
Segments: Cloud, Virtualization, HPC
Customers shown: Rackspace, 1&1, Valero Energy, Univ. of Edinburgh, University of Stuttgart (Universitaet Stuttgart), Gamigo / Aixit, TACC, Oak Ridge, NCHC, University of São Paulo, Kyoto Univ., Poznan Supercomputing Center (Institute of Bioorganic Chemistry, Poland), Univ. of Illinois (“Blue Waters”), NERSC, Ferrari, CPTEC

2 | AMD's Industry-Leading High-Performance Computing and Low-Temperature Control Technology | November 2011 | Public

AMD OPTERON™ PLATFORMS
A WIDE RANGE OF PLATFORM CHOICES TO MEET BOTH STANDARDIZED AND CUSTOMIZED ENVIRONMENTS
AMD Opteron™ 6000 Series Platform: performance-per-watt and expandability for 2P/4P
AMD Opteron™ 4000 Series Platform: highly energy efficient and cost-optimized for 1P/2P
• Standard platforms: traditional Rack/Tower/Blade
• Custom, purpose-driven: Twins/Container/”Skinless”, Scale Out, low cost SMB servers
AMD Opteron™ 3000 Series Platform: price-optimized, cost-effective infrastructure for 1P servers
• Custom, purpose-driven low power systems
• Low cost, dedicated hosting and small business servers

INTRODUCING THE AMD OPTERON™ 6200 SERIES PROCESSOR
New architecture designed to deliver business agility for the cloud era. World’s first truly modular x86 processor core design.
Greater Performance: for up to 71% more throughput*
Greater Efficiency: as low as 5.3 W/core**, reduced processor power at idle by up to 46%**
• World’s first 16-core x86 processor¹
• First processor with up to 1GHz boost over base frequency²
• First processor with multi-threaded floating point unit³
• First processor to support FMA and XOP instructions⁴
• C6 power state enables ultra low power by gating power to idle cores
• First processor with 1.25V ULV-DDR3 Support⁶
• First processor with TDP Power Capping⁷
* TPC-C and Price/TpmC are trademarks of the Transaction Processing Performance Council. The results stated above reflect results published on http://www.tpc.org as of November 28, 2011.
The comparison presented above is based on the best performing two-socket servers using AMD Opteron™ processor Models 6282 SE and 6176 SE, operating at each processor’s default frequency. For the latest TPC-C results, visit http://www.tpc.org.
Performance (tpmc) = 1,207,982, 2 x AMD Opteron™ processors Model 6282 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111111501.
Performance (tpmc) = 705,652, 2 x AMD Opteron™ processors Model 6176 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=110040801.
** See processor power savings slide in substantiation section.
Numbered claims are listed on the substantiation slide in the substantiation section.

DELIVERING BETTER SCALABILITY FOR MULTITHREADING

DIRECT CONNECT ARCHITECTURE 2.0 (INTRODUCED IN 2010)
Balanced and scalable design to support up to 16 cores per CPU
• 4 memory channels and 12 DIMMs per CPU
• 1-hop between processors
• Four memory channels
33% greater memory throughput¹ and 71% more processing throughput² than AMD Opteron™ 6100 Series processors.
¹ Based on measurements by AMD labs as of 8/9/11. Comparison is AMD Opteron 6200 Series with DDR3-1600 vs. AMD Opteron 6100 Series with DDR3-1333. See backup slide #39 for config info.
² TPC-C and Price/TpmC are trademarks of the Transaction Processing Performance Council. The results stated above reflect results published on http://www.tpc.org as of November 28, 2011. The comparison presented above is based on the best performing two-socket servers using AMD Opteron™ processor Models 6282 SE and 6176 SE, operating at each processor’s default frequency. For the latest TPC-C results, visit http://www.tpc.org.
Performance (tpmc) = 1,207,982, 2 x AMD Opteron™ processors Model 6282 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111111501.
Performance (tpmc) = 705,652, 2 x AMD Opteron™ processors Model 6176 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=110040801.

COMPUTING WITHOUT COMPROMISES
Same Features Across Power Bands
• No artificially limited features
• Full memory speed on all models
• Full I/O speed on all models
• Same chipset on all platforms
Consistent Images and Software
• Same API
• Same BIOS Code
• Same Drivers
Same Die, Chipset and Memory enable:
• Easier To Buy
• Easier To Qualify
• Easier To Manage
No tradeoffs of performance & core functionality
Full consistency across the entire processor stack
Seamlessly move virtual machines, easily migrate software between systems

AMD OPTERON PROCESSORS – NO COMPROMISE
Only AMD offers consistency across processors: same features, cache, memory, and bus speed. Consistency helps capacity planning, software image deployment, and validation efforts.
[Chart: Percentage of Capability (25% to 100%) by power band: High Performance¹, Standard Power², Low Power³]
8-, 12-, 16-core AMD Opteron 6200 Series processors, in every power band: AMD Turbo CORE technology; L3 cache size 16MB; bus speed 6.4 GT/s; max memory speed DDR3-1600.
4- and 6-core Intel Xeon 5600 Series processors, depending on power band: Intel® Turbo Boost Technology; L3 cache size 4-12MB or 12MB; bus speed 4.8-5.86 GT/s or 5.86-6.4 GT/s; max memory speed DDR-1066-1333.
* Colored columns equal the lowest max value among the SKUs in a given power band divided by the highest max value across all three power bands. Transparent columns equal the highest max value among the SKUs in a given power band divided by the highest max value across all three power bands.
Specs as of 9/8/11 for the Intel Xeon 5600 Series can be found at http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor5000-sequence/Xeon5000Specifications.html and http://ark.intel.com/products/series/47915

AMD TURBO CORE TECHNOLOGY
Operating points shown: base frequency with TDP headroom; all core boost activated (up to 500MHz); max turbo activated (up to 1GHz+, half cores).
All Core Boost: When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds by 300-500 MHz* across all cores.
Max Turbo Boost: When a lightly threaded workload sends half the “Bulldozer” modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds by up to 1 GHz+* across half the cores.
* Based on AMD Opteron 6200 Series processors with up to 500 MHz in P1 frequency increase and 1.3 GHz P0 frequency increase as well as AMD Opteron 4200 Series processors with up to 300 MHz in P1 frequency increase and 1.2 GHz P0 frequency increase.

HOW DO CLOUD WORKLOADS SCALE?
• Cloud workloads are very different from traditional data center loads
• Cloud work is “spiky” in nature: you must be able to account for both the peaks (with more cores) and the valleys (with more power efficiency)
Heavy computation: cores matter
Low computation: power efficiency matters

TDP POWER CAP
Enables more control for IT
• Set the thermal design power (TDP) to meet power and workload demands¹
  - One watt increments provide granular control over power settings
• Utilize for more flexible, denser deployments
  - Get more servers in a rack
How it works
Set the maximum processor power ceiling via BIOS² or APML³.
¹ TDP setting may impact frequency, depending on workload.
² For platforms where TDP power capping feature is enabled in the system BIOS
³ For platforms that have designed in APML platform support

REDUCING POWER LEAKAGE
ENHANCED NEAR ZERO POWER CORE STATE WITH “C6”
AMD Opteron 6100 & 4100 Series Processors
• Single power plane, all cores powered at all times
• Voltage is reduced but still applied to cores, resulting in leakage / static power
AMD Opteron 6200 & 4200 Series Processors
• Single power plane, but each module can be turned on and off
• Voltage is gated off to virtually remove all core static power / leakage

C6 POWER STATE
AMD Opteron™ 6200 Series Processors (compared with AMD Opteron 6100 Series)
• Active: All cores running workloads; core/module frequency can run independently to save power
• Idle: No cores running workloads; core/module frequency reduced to 800MHz to save more power
• L2 Smart Fetch: After a set idle time L2 cache is flushed to L3, allowing cores to ‘sleep’ to save power while maintaining MP coherency
• C6 (NEW!): Any idle module can independently enter ‘C6’, helping to reduce processor power at idle by up to 46%*; module state is saved to DRAM
• C1e: System state to reduce memory and I/O power (every core must be idle/C6 state)
C6 further reduces idle power where there is almost no leakage.
* See processor power savings slide in substantiation section

MANY OPTIONS TO REDUCE POWER WITH AMD-P
OPEX Management Features
• TDP Power Cap
• AMD PowerCap Manager
• Advanced Platform Management Link
Fail-Safe Operating Mode
• AMD CoolSpeed Technology
Compute Power Management
• Dual Dynamic Power Management
• AMD PowerNow!™ Technology with Independent Dynamic Core Technology
• AMD CoolCore™ Technology
• AMD Smart Fetch Technology
• C6 power state
• C1E
Low Power Memory Support
• Support DDR3 ULV 1.25 DIMMs
• Support DDR3 LV 1.35 DIMMs

THE NEW “BULLDOZER” INSTRUCTIONS | A CLOSER LOOK
Instructions | Applications/Use Cases
SSSE3, SSE4.1, SSE4.2 (AMD and Intel) | Video encoding and transcoding; biometrics algorithms; text-intensive applications
AESNI, PCLMULQDQ (AMD and Intel) | Applications using AES encryption; secure network transactions; disk encryption (MSFT BitLocker); database encryption; cloud security
AVX (AMD and Intel) | Floating point intensive applications: signal processing / seismic; multimedia; scientific simulations; financial analytics; 3D modeling
FMA4 (AMD Unique)* | Vector and matrix multiplications; polynomial evaluations; chemistry, physics, quantum mechanics and digital signal processing
XOP (AMD Unique)* | Numeric applications; multimedia applications; algorithms used for audio/radio
* http://blogs.amd.com/developer/2009/05/06/striking-a-balance/
XOP and FMA4 instruction set extensions are AMD unique 128-bit and 256-bit instructions designed to:
• Improve performance by increasing the work per instruction
• Reduce the need to copy and move around register operands
• Allow for some new cases of automatic vectorization by compilers

AMD OPTERON™ 4200 AND 6200 SERIES PROCESSORS
OS AND HYPERVISOR SUPPORT SUMMARY
ASSUMES latest updates/patches are installed*

Enabled: optimized to support some or all of “Bulldozer’s” new features. Includes new instruction support. Versions in this category also include latest software advances.
• Hyper-V Next Gen (in development)
• Linux kernel 2.6.37 +
• Novell SLES 11 SP2 Beta (includes Xen)
• RHEL 6.2 with KVM (in development)
• Windows Server 2008 R2 SP1
• Windows 8 Server (in development)
• Xen 4.1
• Ubuntu 11.04 (w/ KVM)
• VMware vSphere 5.0

Compatible: will boot and run but not take advantage of “Bulldozer’s” new features outside of new instructions.
Includes new instruction support:
• Linux kernel 2.6.32 – 2.6.36
• Novell SLES 11 SP1
• RHEL 6.1
• Ubuntu 10.10
Does not support new instructions for either Bulldozer or Sandy Bridge (will run but not necessarily provide performance uplift):
• Hyper-V R1
• Hyper-V R2, Hyper-V R2 SP1
• Novell SLES 10 SP4 and higher
• RHEL 5.7 (included KVM)
• Solaris 10u9, 11
• VMware vSphere 4.1u2 (in development)
• Windows Server 2003 R2 SP2
• Windows Server 2008 R2
• Windows Server 2008 SP2
• Xen 3.4.2

Not Supported: will not run on “Bulldozer” platforms and/or will not be supported by OSV.
• Linux kernel 2.6.31 or earlier
• Novell SLES 10 thru SP3
• Novell SLES 11
• RHEL 4.x
• RHEL 5.0 – 5.5
• RHEL 5.6 (can run with patches but is not supported by Red Hat)
• RHEL 6.0
• Solaris 10 – 10u8
• VMware ESX 3.5
• VMware ESX 4.0 – 4.1u1
• Windows Server 2003 versions prior to R2 SP2

* Please note: For proper support of available features/processors, the latest updates/patches always need to be installed

AMD OPTERON™ 4200 AND 6200 SERIES PROCESSORS
COMPILER SUPPORT
Table columns: Compiler | Status | SSSE3 | SSE4.1-.2 | AVX | FMA4 | XOP | Auto Generates Code | Comments
[The per-instruction check marks did not survive extraction; only the recoverable cells are listed.]
GCC 4.6.1 | Available | Comments: GCC 4.4 is included in RHEL 6.0 distribution and should be updated to GCC 4.6.1 for optimized support
Microsoft Visual Studio 2010 SP1 | Available | Auto Generates Code: No | Comments: Supports new instructions but does not auto generate code
Open64 4.5.1 | Available | Comments: http://developer.amd.com/open64 provide incremental performance and functionality improvements
PGI 11.9 | Available | Comments: PGI Unified Binary™ technology combines into a single executable or object file code optimized for multiple AMD and Intel processors
ICC 12 | Available (-mAVX flag) | Auto Generates Code: No | Comments: -mAVX is designed to run on any x86 processor, however the ICC runtime makes assumptions about cache line sizes and other parameters that cause code to fail on AMD processors
Compiler Optimization Quick Guide: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf

WHY DOES AMD SUPPORT A RANGE OF COMPILERS?
No one compiler services all of our target markets
Compiler | Languages Supported | Processors Supported | Operating Systems Supported | Comments
GCC | C, C++, Fortran, Objective-C, Java, Ada, Go | Wide variety including: x86, AIX, SPARC, ARM | Wide variety including: Linux, Windows, Mac OS, Android, Solaris | Default compiler for Linux
Intel | C, C++, Fortran | Intel x86, Itanium | Linux, Windows, Mac OS | Performance compiler for Intel
Open64 | C, C++, Fortran | AMD and Intel x86 | Linux | Performance compiler for AMD
PGI | C, C++, Fortran | AMD, Intel x86, NVIDIA CUDA | Linux, Mac OS, Windows | Performance compiler for HPC
MSFT Visual Studio | C, C++, C#, Basic | AMD and Intel x86 | Windows | Default compiler for Windows
• Default compilers are used to compile the kernel, some of the system software, and libraries for the OS
• Customers are often reluctant to change compilers
• Compilers used to generate high performance code are not necessarily the ones used for mainstream server applications

OPEN64 COMPILER | A CLOSER LOOK
Setting the “-march” (microarchitecture) flag will automatically optimize code for the target processor’s instruction set
Open64 Settings | Processor Type
-march=bdver1 | AMD Opteron™ 4200 and 6200 Series
-march=barcelona | AMD Opteron™ 13xx, 14xx, 23xx, 24xx, 83xx, 84xx, 4100, and 6100 Series
-march=any86 | Any x86 processor
“Bulldozer” compiler optimizations enabled by -march=bdver1*
• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA4, and XOP)
• Automatically selects instructions to improve performance (intrinsics and inline)
• Automatic calls to libM (math library) functions that use these new instructions
• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes
• Adjusted to take advantage of the improved hardware prefetcher
• Improvements in code layout and alignment to take advantage of shared compute unit, e.g. “dispatch scheduling”
* Additional information: http://developer.amd.com/tools/open64/Documents/open64.html

GCC COMPILER | A CLOSER LOOK
Setting the “-march” (microarchitecture) flag will automatically optimize code for the target processor’s instruction set
GCC Settings | Processor Type
-march=bdver1 | AMD Opteron™ 4200 and 6200 Series
-march=amdfam10 | AMD Opteron™ 13xx, 14xx, 23xx, 24xx, 83xx, 84xx, 4100, and 6100 Series
-march=generic | Any x86 processor
“Bulldozer” compiler optimizations enabled by -march=bdver1
• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA4, and XOP)
• Automatically selects instructions to improve performance (intrinsics and inline)
• Scalar and vector libm calls available with AMD Libm
• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes
• Memset/Memcpy inliner heuristics
• Defaults to 128-bit vectorization
• Improvements in code layout and alignment
Additional information: http://developer.amd.com/tools/gnu/pages/default.aspx

AMD OPTERON™ 4200 AND 6200 SERIES PROCESSORS
LIBRARY SUPPORT
A library is a collection of pre-written code and subroutines
Library | Description | Comments
ACML (AMD Core Math Library) | Set of optimized and threaded math routines for HPC, scientific, engineering and related compute-intensive applications | ACML 4.x is compatible with “Bulldozer”; ACML 5.x is optimized for “Bulldozer”
AMD LibM | C library containing a collection of basic math functions optimized for x86-64 processors | AMD LibM 3.0 is optimized for “Bulldozer”
IPP (Intel Performance Primitives) | Library of multicore-ready, optimized software functions for multimedia, data processing, and communications applications | “For AMD 64-bit processors that support SSE3 the "m7" version of the IPP library will be dispatched automatically. Otherwise "mx" library will be used”*
For more information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
For more information on AMD LibM, go to: http://developer.amd.com/libraries/libm/pages/default.asp
* http://software.intel.com/en-us/articles/use-ipp-on-amd-processor/

ACML SUPPORT | A CLOSER LOOK
Linear Algebra
• ACML 5.0 (Aug 2011): SGEMM (single precision); DGEMM (double precision); L1 BLAS
• ACML 5.1 (Dec 2011): CGEMM (complex single precision); ZGEMM (complex double precision)
Fast Fourier Transforms (FFT)
• ACML 5.0 (Aug 2011): Complex-to-Complex (C-C) single precision FFTs
• ACML 5.1 (Dec 2011): Real-to-complex (R-C) single precision FFTs; double precision C-C and R-C FFTs
Others
• Random Number Generators
• AVX compiler switch for Fortran
Compiler Support
• ACML 5.0: Absoft; GCC 4.6; Open64 4.2.5; PGI 11.8, 11.9; ICC 12; Cray to begin deployment of ACML with their compiler with ACML 5.0
• ACML 5.1: All compilers listed for ACML 5.0 will be supported
For additional information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx

STARTING POINTS FOR APPLICATION OPTIMIZATION
Use | Operating System | Compiler | Library
Recommended for SPECCPU, LINPACK, HPC Challenge | Novell SLES 11 SP1 or RHEL 6.1 | Open64 4.5.1 | ACML 5.1
Recommended for application development and benchmarks with gcc | Novell SLES 11 SP1 or RHEL 6.1 | GCC 4.6 | ACML 5.0 and/or libM 3.0
Recommended for HPC application code development | Novell SLES 11 SP1 or RHEL 6.1 | Open64 4.2.5 or PGI 11.9 | ACML 5.0
Recommended for integer code development for Windows | Windows Server 2008 R2 SP1 | Microsoft Visual Studio 2010 SP1 | AMD libM 3.0
Recommendations are based on AMD evaluations; please evaluate for your specific workload

REFERENCES
• x86 Compiler Quick Reference Guide for “Bulldozer” processors: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf
• Using the x86 Open64 Compiler Suite: http://developer.amd.com/tools/open64/Documents/open64.html
• x86 Open64 4.2.5.2 Release Notes: http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt
• ACML 5.0 Information: http://developer.amd.com/libraries/acml/features/pages/default.aspx
• Software Optimization Guide for “Bulldozer” processors: http://support.amd.com/us/Processor_TechDocs/47414.pdf
• AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions: http://support.amd.com/us/Embedded_TechDocs/43479.pdf

PRICE POINT PERFORMANCE COMPARISONS
HIGHEST PERFORMING CPUs AT EACH PRICE POINT
[Bar chart: SPECint_rate2006 (axis 0 to 600) by CPU price point]
>$1100: Intel Xeon X5690 $1663; Intel Xeon X5675 $1440; Intel Xeon X5660 $1219. No AMD Opteron™ 6200 Series processors above $1100.
$900-$1100: AMD Opteron 6282 SE $1019; Intel Xeon X5650 $996. 48% better performance.
$700-$900: AMD Opteron 6276 $788; Intel Xeon E5649 $774; Intel Xeon X5647 $774; Intel Xeon E5640 $774. 47% better performance.
$500-$700: Intel Xeon E5645 $551; Intel Xeon E5630 $551; AMD Opteron 6262 HE $523; AMD Opteron 6220 $523. 16% better performance.
$300-$500: AMD Opteron 6238 $455; Intel Xeon E5620 $387. 74% better performance.
SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. The results stated above reflect results published on http://www.spec.org/cpu2006/results/ or submitted by AMD to SPEC as of 10/26/11. The comparison presented above is based on the best performing two-socket servers using the specified AMD Opteron™ processor Models and Intel Xeon processor Models operating at each processor’s default frequency. For the latest SPECint®_rate2006 results, visit http://www.spec.org/cpu2006/results/. For additional configuration information, see Two Socket Server SPECint®_rate2006 on backup slide 41. Intel pricing is reflective of published pricing on www.intel.com as of 10/26/11.
AMD pricing available at http://www.amd.com/us/products/pricing/Pages/serveropteron.aspx.

SPECint_rate2006, 2P
Name | Core Count | Core Freq. | Power Rating | Integer Performance | 1kU Price | Performance %
X5690 | 6 | 3.46 GHz | 130 W TDP | 435 | $1,663 | 159%
X5680 | 6 | 3.33 GHz | 130 W TDP | 426 | $1,663 | 155%
X5687 | 4 | 3.60 GHz | 130 W TDP | 354 | $1,663 | 129%
X5677 | 4 | 3.46 GHz | 130 W TDP | 345 | $1,663 | 126%
X5675 | 6 | 3.06 GHz | 95 W TDP | 407 | $1,440 | 149%
X5670 | 6 | 2.93 GHz | 95 W TDP | 396 | $1,440 | 145%
X5672 | 4 | 3.20 GHz | 95 W TDP | 335 | $1,440 | 122%
X5667 | 4 | 3.06 GHz | 95 W TDP | 324 | $1,440 | 118%
X5660 | 6 | 2.80 GHz | 95 W TDP | 387 | $1,219 | 141%
6282 SE | 16 | 2.6 GHz | 140 W TDP | 543 | $1,019 | 198%
X5650 | 6 | 2.66 GHz | 95 W TDP | 381 | $996 | 139%
L5640 | 6 | 2.26 GHz | 60 W TDP | 313 | $996 | 114%
6276 | 16 | 2.3 GHz | 115 W TDP | 488 | $788 | 178%
E5649 | 6 | 2.53 GHz | 80 W TDP | 335 | $774 | 122%
X5647 | 4 | 2.93 GHz | 130 W TDP | 292 | $774 | 107%
E5640 | 4 | 2.66 GHz | 80 W TDP | 274 | $774 | 100%
6274 | 16 | 2.2 GHz | 115 W TDP | 472 | $639 | 172%
E5645 | 6 | 2.40 GHz | 80 W TDP | 332 | $551 | 121%
E5630 | 4 | 2.53 GHz | 80 W TDP | 265 | $551 | 97%
6272 | 16 | 2.1 GHz | 115 W TDP | 455 | $523 | 166%
6220 | 8 | 3.0 GHz | 115 W TDP | 319 | $523 | 116%
6238 | 12 | 2.6 GHz | 115 W TDP | 414 | $455 | 151%
E5620 | 4 | 2.40 GHz | 80 W TDP | 236 | $387 | 86%
6234 | 12 | 2.4 GHz | 115 W TDP | 388 | $377 | 142%
E5607 | 4 | 2.26 GHz | 80 W TDP | 183 | $276 | 67%
6212 | 8 | 2.6 GHz | 115 W TDP | 279 | $266 | 102%
E5606 | 4 | 2.13 GHz | 80 W TDP | 169 | $219 | 62%
E5603 | 4 | 1.60 GHz | 80 W TDP | 134 | $188 | 49%
Price-band annotations, as extracted: > $900, 59%, 56%, $700-$900, 69%, $500-$700, 56%, $300-$500, 35%, $300 <
Data From: www.spec.org

SPECint_rate2006, 4P
Model Name | Core Count | Core Freq. | Power Rating | Integer Performance | 1kU Price | Performance %
Xeon E7-4870 | 10 | 2.40 GHz | 130W TDP | 1130 | $4,394 | 136%
Xeon E7-8867L | 10 | 2.13 GHz | 105W TDP | 974 | $4,172 | 117%
Xeon E7-4860 | 10 | 2.26 GHz | 130W TDP | 1090 | $3,838 | 131%
Xeon E7-4850 | 10 | 2.00 GHz | 130W TDP | 919 | $2,837 | 111%
Xeon E7-8837 | 8 | 2.67 GHz | 130W TDP | 795 | $2,280 | 96%
Xeon E7-4830 | 8 | 2.13 GHz | 105W TDP | 831 | $2,059 | 100%
Xeon E7-4820 | 8 | 2.00 GHz | 105W TDP | 759 | $1,446 | 91%
Opteron 6282 SE | 16 | 2.6 GHz | 140W TDP | 1060 | $1,019 | 128%
Xeon E7-4807 | 6 | 1.86 GHz | 95W TDP | 516 | $890 | 62%
Opteron 6276 | 16 | 2.3 GHz | 115W TDP | 973 | $788 | 117%
Opteron 6274 | 16 | 2.2 GHz | 115W TDP | 940 | $639 | 113%
Opteron 6272 | 16 | 2.1 GHz | 115W TDP | 909 | $523 | 109%
Opteron 6262 HE | 16 | 1.6 GHz | 85W TDP | 740 | $523 | 89%
Opteron 6220 | 8 | 3.0 GHz | 115W TDP | 630 | $523 | 76%
Opteron 6238 | 12 | 2.6 GHz | 115W TDP | 827 | $455 | 100%
Opteron 6234 | 12 | 2.4 GHz | 115W TDP | 763 | $377 | 92%
Opteron 6204 | 4 | 3.3 GHz | 115W TDP | 351 | $377 | 42%
Opteron 6212 | 8 | 2.6 GHz | 115W TDP | 554 | $266 | 67%
Difference: ($2,059-$523)*4=$6,144 ~ NTD 184,320
Difference: ($3,838-$1,019)*4=$11,276 ~ NTD 338,280
Data From: www.spec.org

WHY USE AMD FOR MEMCACHED APPLICATIONS? 2P SERVER COMPARISON
Model Name | Core Count | Core Freq. | Power Rating | Memory Bandwidth | 1kU Price
Opteron 6212 | 8 | 2.6GHz | 115W TDP | 102GB/s | $266
Opteron 6234 | 12 | 2.4GHz | 115W TDP | 102GB/s | $377
Opteron 6204 | 4 | 3.3GHz | 115W TDP | 102GB/s | $377
Opteron 6238 | 12 | 2.6GHz | 115W TDP | 102GB/s | $455
Opteron 6262 HE | 16 | 1.6GHz | 85W TDP | 102GB/s | $523
Opteron 6272 | 16 | 2.1GHz | 115W TDP | 102GB/s | $523
Opteron 6220 | 8 | 3.0GHz | 115W TDP | 102GB/s | $523
Opteron 6274 | 16 | 2.2GHz | 115W TDP | 102GB/s | $639
Opteron 6276 | 16 | 2.3GHz | 115W TDP | 102GB/s | $788
Opteron 6282 SE | 16 | 2.6GHz | 140W TDP | 102GB/s | $1,019
Xeon E5645 | 6 | 2.40GHz | 80W TDP | 64GB/s | $551
Xeon E5649 | 6 | 2.53GHz | 80W TDP | 64GB/s | $774
Xeon X5650 | 6 | 2.66GHz | 95W TDP | 64GB/s | $996
Xeon X5660 | 6 | 2.80GHz | 95W TDP | 64GB/s | $1,219
Xeon E5603 | 4 | 1.60GHz | 80W TDP | 51GB/s | $188
Xeon E5606 | 4 | 2.13GHz | 80W TDP | 51GB/s | $219
Xeon E5620 | 4 | 2.40GHz | 80W TDP | 51GB/s | $387
Xeon E5630 | 4 | 2.53GHz | 80W TDP | 51GB/s | $551
Performance %: 102GB/s = 159%; 64GB/s = 100%; 51GB/s = 79%
Price: $266x2=$532 vs. $551x2=$1,102. Difference NTD 17,100
p.s. 2P memory capacity is up to 512GB
Data From: www.spec.org

WHY USE AMD FOR MEMCACHED APPLICATIONS? 4P SERVER COMPARISON
Model Name | Core Count | Core Freq. | Power Rating | Memory Bandwidth | 1kU Price
Opteron 6212 | 8 | 2.6GHz | 115W TDP | 205GB/s | $266
Opteron 6234 | 12 | 2.4GHz | 115W TDP | 205GB/s | $377
Opteron 6204 | 4 | 3.3GHz | 115W TDP | 205GB/s | $377
Opteron 6238 | 12 | 2.6GHz | 115W TDP | 205GB/s | $455
Opteron 6262 HE | 16 | 1.6GHz | 85W TDP | 205GB/s | $523
Opteron 6272 | 16 | 2.1GHz | 115W TDP | 205GB/s | $523
Opteron 6220 | 8 | 3.0GHz | 115W TDP | 205GB/s | $523
Opteron 6274 | 16 | 2.2GHz | 115W TDP | 205GB/s | $639
Opteron 6276 | 16 | 2.3GHz | 115W TDP | 205GB/s | $788
Opteron 6282 SE | 16 | 2.6GHz | 140W TDP | 205GB/s | $1,019
Xeon E7-4830 | 8 | 2.13GHz | 105W TDP | 115GB/s | $2,059
Xeon E7-8837 | 8 | 2.67GHz | 130W TDP | 115GB/s | $2,280
Xeon E7-4850 | 10 | 2.00GHz | 130W TDP | 115GB/s | $2,837
Xeon E7-4860 | 10 | 2.3GHz | 130W TDP | 115GB/s | $3,838
Xeon E7-8867L | 10 | 2.13GHz | 105W TDP | 115GB/s | $4,172
Xeon E7-4870 | 10 | 2.40GHz | 130W TDP | 115GB/s | $4,394
Xeon E7-4820 | 8 | 2.00GHz | 105W TDP | 105GB/s | $1,446
Xeon E7-4807 | 6 | 1.9GHz | 95W TDP | 86GB/s | $890
Performance %: 205GB/s = 178%; 115GB/s = 100%; 105GB/s = 91%; 86GB/s = 75%
Price: $266x4=$1,064 vs. $2,059x4=$8,236. Difference NTD 215,160
p.s. 4P memory capacity is up to 1TB
Data From: www.spec.org

BETTER PROCESSOR VALUE
Comparison | Top Bin AMD processor | Top Bin Intel processor
Processor name | AMD Opteron™ 6282 SE | Intel Xeon E5-2690
Cores | 16 dedicated | 8 shared
1kU list price/processor | $1,019 | $2,057
Price/core | $64 | $257
Price/performance | $3.75 | $5.95
Price/GFLOP | $6.14 | $11.06
Why spend 102% more per processor?
Why settle for 50% fewer cores that have shared resources?
Get 37% better price/performance with AMD.
Save 44% per GFLOP with AMD.
• Spec and pricing as of 3/8/12 at www.intc.com/pricelist.cfm
• See substantiation slide #53 for SPECint®_rate2006 scores
• Price/performance equals the cost of two processors divided by the estimated 2P SPECint®_rate2006 score.
• Max theoretical GFLOPS equals number of FLOPS per cycle x frequency of processor x number of cores per processor x number of processors per server
• Number of FLOPS per cycle is 4 for AMD Opteron 6200 Series-based servers and 8 for Intel Xeon E5-2600 Series-based servers
• AMD Opteron 6200 Series can do up to 166 GFLOPS per processor, Intel Xeon E5-2600 Series can do up to 186 GFLOPS per processor

THANK YOU!

BACKUP SLIDE #31
• 84% higher performance: LINPACK (2P)
  AMD Opteron processor Model 6276 generates 84% more FLOPS than Intel Xeon processor Model X5670
  – 239.1 GFLOPS, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, gfortran compiler v4.6, OMPI 1.5.3, AMD Core Math Library 5.0.0.0
  – Compiler Flags: -fomit-frame-pointer -O3 -funroll-loops -W -Wall -mavx -mfma4 -fopenmp
  – 130.1 GFLOPS, 2 x Intel Xeon processors Model X5670 in Supermicro 6026TT-BIBQF server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Professional Compiler v11.1, OMPI 1.5.1, Intel Math Kernel Library 10.3, Hyper-Threading disabled, Turbo Boost Technology enabled
  – Compiler Flags: -O3 -w -ansi-alias -i-static -openmp -nocompchk
• 73% more memory bandwidth: STREAM (2P)
  AMD Opteron processor Model 6276 has 73% higher memory bandwidth than Intel Xeon processor Model X5670
  – 73 GB/s, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, x86 Open64 4.2.5-1 Compiler Suite
  – 42 GB/s, 2 x Intel Xeon processors Model X5670 in Supermicro X8DTT server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Compiler v11.1.064

BACKUP SLIDE #32
• 1/2 the power-per-core*
  – As of Nov 1, 2011, AMD Opteron™ processor Models 4200 EE have the lowest known power per core of any x86 server processor, at 35W TDP (35W/8 = 4.375W/core). Intel's lowest power per core server processor, L5630, is 40W TDP (40W/4 = 10W/core). See http://www.intel.com/Assets/PDF/prodbrief/323501.pdf. Previous record held by AMD Opteron processor Models 4100 EE at 35W TDP / 6 cores = 5.83 W/core.
• Requires 2/3 less floor space*
  – VMs/rack (2P and 4P): One rack of AMD Opteron 6200 Series-based servers can support 672 VMs (1 VM per core, 2U servers)
  – This would take three racks of floor space and 56 2U Intel Xeon 5600 Series-based servers to do the same.
  – Assumes 1 VM/core. AMD Opteron 6200 Series-based 2P 2U server has up to 32 cores, supports up to 32 VMs/server x 21 servers per rack, which equals 672 VMs per rack. Intel Xeon 5600 Series-based 2P 2U server has up to 12 cores, supports up to 12 VMs/server x 21 servers per rack, which equals 252 VMs per rack. Intel specs as of 11/4/11 at www.intc.com/pricelist.cfm.
• 1/3 to 2/3 lower platform price*
  – Top bin comparisons. Dell R710 with two top bin Intel Xeon processor Model X5690s is $7,103. Since pricing for a Dell R715 with two top bin AMD Opteron processor Model 6282 SE (1ku $1019) is not yet available, the similarly priced AMD Opteron processor Model 6140 (1ku $989) was used and the server yielded a price of $4564. That is a 36% price savings. Both servers were configured with 32GB RAM, 146GB 10K hdd, and 3yr base warranty and large enterprise pricing is from www.dell.com as of 10/22/11.
  – HP DL580 with four top bin Intel Xeon processor Model E7-4870 is $29,336 at www.hp.com. Since pricing for an HP DL585 with four top bin AMD Opteron processor Model 6282 SE (1ku $1019) is not yet available, the similarly priced AMD Opteron processor Model 6140 (1ku $989) was used and the server yielded a price of $11,094. That is a 62% price savings.
Both servers were configured with 64GB RAM, 72GB 15K hdd, and 3yr base warranty and large enterprise pricing is from www.hp.com as of 10/22/11. VMware vSphere pricing not included, assuming both servers configured with versions 5.0 or 4.1u2, which are the same price for AMD- and Intel-based servers.

BACKUP SLIDE #35
¹ 1ku pricing for AMD Opteron processor Model 6276 is $788 and $1,440 for Intel Xeon processor Model X5670

SPECfp®_rate
• SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. The results for AMD Opteron™ processor Model 6276 are based upon data submitted to Standard Performance Evaluation Corporation as of November 7, 2011. The other results stated above reflect results published on http://www.spec.org/cpu2006/results as of November 7, 2011. The comparison presented above is based on the best performing two-socket servers using AMD Opteron™ processor Model 6276 and Intel Xeon processor Model X5670 operating at each processor’s default frequency. For the latest SPECfp_rate2006 results, visit http://www.spec.org/cpu2006/results.
• SPECfp®_rate score = 360, 2 x AMD Opteron™ processors Model 6276 in Supermicro A+ 1022-URF server, 64GB (8 x 8GB DDR3-1600) memory, Red Hat Enterprise Linux 6.1 64-bit, x86 Open64 4.2.5.2 Compiler Suite.
• SPECfp®_rate score = 263, 2 x Intel Xeon processors Model X5670 in Cisco UCS B200 M2 server, 48GB (12 x 4GB DDR3-1333) memory, SUSE Linux® Enterprise Server 11 SP1 64-bit, Intel C++ Compiler XE v12.0.1.116
STREAM
• 73 GB/s, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, x86 Open64 4.2.5-1 Compiler Suite
• 42 GB/s, 2 x Intel Xeon processors Model X5670 in Supermicro X8DTT server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Compiler v11.1.064
LINPACK
• 239.1 GFLOPS, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, gfortran compiler v4.6, OMPI 1.5.3, AMD Core Math Library 5.0.0.0
– Compiler Flags: -fomit-frame-pointer -O3 -funroll-loops -W -Wall -mavx -mfma4 -fopenmp
• 130.1 GFLOPS, 2 x Intel Xeon processors Model X5670 in Supermicro 6026TT-BIBQF server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Professional Compiler v11.1, OMPI 1.5.1, Intel Math Kernel Library 10.3, Hyper-Threading disabled, Turbo Boost Technology enabled
– Compiler Flags: -O3 -w -ansi-alias -i-static -openmp -nocompchk

BACKUP SLIDE #36
LAMMPS (29% better)
• 471s, Intel Xeon X5670, Trial 24, Nodes 1, CXX Intel 11.1.064, CXXFLAGS -O2 -funroll-loops -fstrict-aliasin, MPI OMPI 1.5.1 + knem 0.9.5, DIMM Speed (MHz) 1333, DIMM Capacity (GB) 4, DIMM Count 6, Total Capacity (GB) 24, System SM X8DTT, OS SLES11 SP1, Kernel 2.6.32.12-0.7-default, Notes Turbo ON, HT OFF, -mca btl_sm_use_knem 1
• 333s, AMD Opteron 6276, Trial 9, Nodes 1, CXX openCC 4.2.5.2-1, CXXFLAGS -O3 -OPT:Ofast -OPT:rsqrt=2 -march=bdver1 -mavx -mfma4, MPI OpenMPI 1.5.3, DIMM Speed (MHz) 1600, DIMM Capacity (GB) 8, DIMM Count 8, Total Capacity (GB) 64, System SM H8DGT, OS SLES11 SP1, Kernel 2.6.32.28-fam15h-default, Notes hpc
NAMD (41% better)
• 0.636 days/ns, Intel Xeon X5670, Trial 40, NAMD Version 2.7, Nodes 1, MB memory 274.328, CC Intel 11.1, CCFLAGS -ip -fno-rtti -O3 -xSSE4.2 -no-prec-div, MPI OMPI 1.5.1, DIMM Speed (MHz) 1333, DIMM Capacity (GB) 4, DIMM Count 6, Total Capacity (GB) 24, System SM X8DTT, OS SLES11 SP1, Kernel 2.6.32.12-0.7-default, Notes default
• 0.375 days/ns, AMD Opteron 6276, Trial 77, NAMD Version 2.8, Nodes 1, MB memory 260.277, CC opencc 4.2.5-2.1, CCFLAGS -O3 -m64 -march=bdver1 -mfma4 -mavx CG:compute_to=ON -OPT:Olimit=40000, MPI OpenMPI 1.5.3+knem 0.9.6, DIMM Speed (MHz) 1600, DIMM Capacity (GB) 8, DIMM Count 8, Total Capacity (GB) 64, System SM H8DGT, OS SLES11 SP1, Kernel 2.6.32.28-fam15h-default, Notes default
WRF (20% better)
• 224s, Intel Xeon X5670, Trial 124, STEP NA, NODE 1, FC Intel 11.1.064, NETCDF 4.1.1, FCFLAGS -w -O3 -ip -xSSE4.2 -fp-model fast=2 -no-prec-div -no-prec-sqr, NUMA NA, MPI OMPI 1.5.1 + knem 0.9.5, DIMM # 6, DIMM GB 4, DIMM MHz 2 1333, Total Capacity 24, System SM X8DTT, OS SLES11 SP1, Kernel 2.6.32.12-0.7-default, Notes -mca btl_sm_use_knem 1
• 180s, AMD Opteron 6276, Trial 436, STEP B2g, NODE 1, FC open64 4.2.5-1, NETCDF 4.1.2, FCFLAGS -O3 -HP -march=bdver1 -mavx -mfma4 -DpgiFortran -OPT:unroll_size=256 LNO:blocking=off -LANG:copyinout=o, NUMA APP FILE, MPI OMPI 1.5.3, DIMM # 8, DIMM GB 8, DIMM MHz 2 1600, Total Capacity 64, System SM H8DGT, OS SLES11 SP1, Kernel 2.6.32.28-fam15h-default, Notes NA
² Greater FLOPS per sq ft
• 2P 1U AMD Opteron™ processor Model 6276-based server generates up to 239.1 LINPACK GFLOPS. Forty-two 1U servers can fit in a 42U rack, which equals 10,042 GFLOPS per rack.
• 2P 1U Intel Xeon processor Model X5670-based server generates up to 130.1 LINPACK GFLOPS. Forty-two 1U servers can fit in a 42U rack, which equals 5,464 GFLOPS per rack.
• 239.1 GFLOPS, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, gfortran compiler v4.6, OMPI 1.5.3, AMD Core Math Library 5.0.0.0
– Compiler Flags: -fomit-frame-pointer -O3 -funroll-loops -W -Wall -mavx -mfma4 -fopenmp
• 130.1 GFLOPS, 2 x Intel Xeon processors Model X5670 in Supermicro 6026TT-BIBQF server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Professional Compiler v11.1, OMPI 1.5.1, Intel Math Kernel Library 10.3, Hyper-Threading disabled, Turbo Boost Technology enabled
– Compiler Flags: -O3 -w -ansi-alias -i-static -openmp -nocompchk

BACKUP SLIDE #37
² Maximum cores per rack
– 2P 1U AMD Opteron™ processor Model 6276-based server has up to 32 cores. Forty-two 1U servers can fit in a 42U rack, which equals 1344 cores per rack.
– 2P 1U Intel Xeon processor Model X5670-based server has up to 12 cores. Forty-two 1U servers can fit in a 42U rack, which equals 504 cores per rack.
³ STREAM (2P): AMD Opteron processor Model 6276 has 73% higher memory bandwidth than Intel Xeon processor Model X5670
– 73 GB/s, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, x86 Open64 4.2.5-1 Compiler Suite
– 42 GB/s, 2 x Intel Xeon processors Model X5670 in Supermicro X8DTT server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Compiler v11.1.064
⁴ Comparison of 12-core AMD Opteron™ processor Model 6234 expected price of $377 at launch with 4-core Intel Xeon E5603 price of $188 according to www.intel.com as of 11/4/11.
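
The rack-density figures quoted in these backup slides (cores per rack, LINPACK score per rack, VMs per rack) all come from multiplying a per-server value by the number of servers that fit in a rack. The following is a minimal Python sketch of that arithmetic, not part of the original deck; the function names `per_rack` and `racks_needed` are hypothetical, and the inputs are the per-server numbers stated in the footnotes.

```python
import math


def per_rack(per_server, servers_per_rack):
    """Scale a per-server quantity (cores, VMs, LINPACK score) to a full rack."""
    return per_server * servers_per_rack


def racks_needed(total_vms, vms_per_server, servers_per_rack):
    """Return (servers, racks) needed to host total_vms, rounding up to whole units."""
    servers = math.ceil(total_vms / vms_per_server)
    racks = math.ceil(servers / servers_per_rack)
    return servers, racks


# 2P 1U servers, forty-two per 42U rack (figures from the footnotes).
amd_cores = per_rack(32, 42)        # Opteron 6276: up to 32 cores per server
intel_cores = per_rack(12, 42)      # Xeon X5670: up to 12 cores per server
amd_linpack = per_rack(239.1, 42)   # same unit as the per-server LINPACK score
intel_linpack = per_rack(130.1, 42)

# 2P 2U servers, twenty-one per rack, 1 VM per core (floor-space footnote).
amd_vms = per_rack(32, 21)
intel_servers, intel_racks = racks_needed(amd_vms, 12, 21)

print("cores per rack:", amd_cores, "vs", intel_cores)
print("LINPACK per rack:", round(amd_linpack), "vs", round(intel_linpack))
print("VMs per AMD rack:", amd_vms)
print("Intel servers / racks for the same VMs:", intel_servers, "/", intel_racks)
```

Rounding up with `math.ceil` reflects that partial servers and partial racks still occupy whole units of floor space, which is how 56 servers at 21 per rack becomes three racks.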