Transcript
AMD's Industry-Leading High-Performance Computing and Low-Temperature Control Technology
Kevin Lai, Senior Product Marketing Manager, AMD
Nov 2011
AMD CUSTOMERS Cloud
Rackspace
1&1 Virtualization
Valero Energy Univ. of Edinburgh HPC University of Stuttgart Universitaet Stuttgart Gamigo / Aixit TACC Oak Ridge NCHC University of São Paulo Kyoto Univ. Poznan Supercomputing Center Univ. of Illinois (“Blue Waters”) (Institute of Biorganic Chemistry) NERSC (Poland) Ferrari
CPTEC
2 | AMD's Industry-Leading High-Performance Computing and Low-Temperature Control Technology | November 2011 | Public
AMD OPTERON™ PLATFORMS A WIDE RANGE OF PLATFORM CHOICES TO MEET BOTH STANDARDIZED AND CUSTOMIZED ENVIRONMENTS
Performance-per-watt and Expandability for 2P/4P
Highly Energy Efficient and Cost-optimized for 1P/2P
AMD Opteron™ 6000 Series Platform Standard Platforms Traditional Rack/Tower/Blade
AMD Opteron™ 4000 Series Platform Custom, purpose-driven Twins/ Container/”Skinless” Scale Out Low cost SMB servers
Price-optimized cost-effective infrastructure for 1P servers
AMD Opteron™ 3000 Series Platform Custom, purpose-driven low power systems Low cost, dedicated hosting and small business servers
INTRODUCING THE AMD OPTERON™ 6200 SERIES PROCESSOR
New architecture designed to deliver business agility for the cloud era.
World’s first truly modular x86 processor core design.
Greater Performance
For up to 71% more throughput*
• World’s first 16-core x86 processor¹
• First processor with up to 1GHz boost over base frequency²
• First processor with multi-threaded floating point unit³
• First processor to support FMA and XOP instructions⁴
Greater Efficiency
As low as 5.3 W/core**, reduced processor power at idle by up to 46%**
• C6 power state enables ultra low power by gating power to idle cores
• First processor with 1.25V ULV-DDR3 Support⁶
• First processor with TDP Power Capping⁷
TPC-C and Price/TpmC are trademarks of the Transaction Processing Performance Council. The results stated above reflect results published on http://www.tpc.org as of November 28, 2011. The comparison presented above is based on the best performing two-socket servers using AMD Opteron™ processor Models 6282 SE and 6176 SE, operating at each processor’s default frequency. For the latest TPC-C results, visit http://www.tpc.org. Performance (tpmc) = 1,207,982, 2 x AMD Opteron™ processors Model 6282 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111111501. Performance (tpmc) = 705,652, 2 x AMD Opteron™ processors Model 6176 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=110040801. **See processor power savings slide in substantiation section Numbered claims listed on substantiation slide in substantiation section
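The "up to 71% more throughput" figure can be checked against the two TPC-C results quoted in this footnote. The following is a minimal sketch of that arithmetic; the function name `percent_uplift` is only illustrative and does not come from the slides.

```python
def percent_uplift(new: float, old: float) -> float:
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0

# tpmC results quoted in the footnote (2P servers)
tpmc_6282se = 1_207_982  # 2 x AMD Opteron 6282 SE
tpmc_6176se = 705_652    # 2 x AMD Opteron 6176 SE

uplift = percent_uplift(tpmc_6282se, tpmc_6176se)
print(f"{uplift:.1f}% more throughput")
```

Rounded to a whole percent, the result agrees with the 71% quoted on the slide.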
DELIVERING BETTER SCALABILITY FOR MULTITHREADING
DIRECT CONNECT ARCHITECTURE 2.0 (INTRODUCED IN 2010)
[Figure: processor interconnect diagram; each CPU is labeled with 4 memory channels and 12 DIMMs per CPU]
Balanced and scalable design to support up to 16 cores per CPU
• 1 hop between processors
• Four memory channels
33% greater memory throughput¹ and 71% more processing throughput² than AMD Opteron™ 6100 Series processors.
¹ Based on measurements by AMD labs as of 8/9/11. Comparison is AMD Opteron 6200 Series with DDR3-1600 vs. AMD Opteron 6100 Series with DDR3-1333. See backup slide #39 for config info.
² TPC-C and Price/TpmC are trademarks of the Transaction Processing Performance Council. The results stated above reflect results published on http://www.tpc.org as of November 28, 2011. The comparison presented above is based on the best performing two-socket servers using AMD Opteron™ processor Models 6282 SE and 6176 SE, operating at each processor’s default frequency. For the latest TPC-C results, visit http://www.tpc.org. Performance (tpmc) = 1,207,982, 2 x AMD Opteron™ processors Model 6282 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111111501. Performance (tpmc) = 705,652, 2 x AMD Opteron™ processors Model 6176 SE: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=110040801.
COMPUTING WITHOUT COMPROMISES Same Features Across Power Bands
Consistent Images and Software
No artificially limited features
Same Die, Chipset and Memory enable:
Full memory speed on all models
Full I/O speed on all models
Same API
Same chipset on all platforms
Same BIOS Code
Same Drivers
Easier To Buy
Easier To Qualify
Easier To Manage
No tradeoffs of performance & core functionality
Full consistency across the entire processor stack
Seamlessly move virtual machines, easily migrate software between systems
AMD OPTERON PROCESSORS – NO COMPROMISE Only AMD offers consistency across processors – same features, cache, memory, and bus speed. Consistency helps capacity planning, software image deployment, and validation efforts.
[Chart: Percentage of Capability (25% to 100%) by power band (Low Power³, Standard Power², High Performance¹)]
4- and 6-core Intel Xeon 5600 Series processors:
• Low Power and Standard Power: L3 cache size (4-12MB), bus speed (4.8-5.86 GT/s), max memory speed (DDR3-1066-1333), Intel® Turbo Boost Technology
• High Performance: L3 cache size (12MB), bus speed (5.86-6.4 GT/s), max memory speed (DDR3-1066-1333), Intel® Turbo Boost Technology
8-, 12-, 16-core AMD Opteron 6200 Series processors:
• All three power bands: L3 cache size (16MB), bus speed (6.4 GT/s), max memory speed (DDR3-1600), AMD Turbo CORE technology
* Colored columns equal the lowest max value among the SKUs in a given power band divided by the highest max value across all three power bands. Transparent columns equal the highest max value among the SKUs in a given power band divided by the highest max value across all three power bands. Specs as of 9/8/11 for the Intel Xeon 5600 Series can be found at http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor5000-sequence/Xeon5000Specifications.html and http://ark.intel.com/products/series/47915
AMD TURBO CORE TECHNOLOGY
[Figure: three frequency levels: base frequency with TDP headroom; all core boost activated (up to 500MHz); max turbo activated (up to 1GHz+, half cores)]
All Core Boost: When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds by 300-500 MHz* across all cores.
Max Turbo Boost: When a lightly threaded workload sends half the “Bulldozer” modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds by up to 1 GHz+* across half the cores.
*Based on AMD Opteron 6200 Series processors with up to 500 MHz in P1 frequency increase and 1.3 GHz P0 frequency increase as well as AMD Opteron 4200 Series processors with up to 300 MHz in P1 frequency increase and 1.2 GHz P0 frequency increase.
HOW DO CLOUD WORKLOADS SCALE?
Cloud workloads are very different from traditional data center loads. Cloud work is “spiky” in nature: you must be able to account for both the peaks (with more cores) and the valleys (with more power efficiency).
Heavy computation: cores matter
Low computation: power efficiency matters
TDP POWER CAP
Enables more control for IT
• Set the thermal design power (TDP) to meet power and workload demands¹
  - One watt increments provide granular control over power settings
• Utilize for more flexible, denser deployments
  - Get more servers in a rack
How it works
Set the maximum processor power ceiling via BIOS² or APML³.
¹ TDP setting may impact frequency, depending on workload.
² For platforms where TDP power capping feature is enabled in the system BIOS
³ For platforms that have designed in APML platform support
REDUCING POWER LEAKAGE
ENHANCED NEAR ZERO POWER CORE STATE WITH “C6”
AMD Opteron 6100 & 4100 Series Processors: Single power plane, all cores powered at all times. Voltage is reduced but still applied to cores, resulting in leakage / static power.
AMD Opteron 6200 & 4200 Series Processors: Single power plane, but each module can be turned on and off. Voltage is gated off to virtually remove all core static power / leakage.
[Diagram: idle cores and cache on the VDD Core power plane for each processor generation]
C6 POWER STATE
AMD Opteron™ 6200 Series Processors
[Figure: power states of AMD Opteron 6100 Series vs. AMD Opteron 6200 Series (Active, Idle, C6)]
Active: All cores running workloads; core/module frequency can run independently to save power
Idle: No cores running workloads; core/module frequency reduced to 800MHz to save more power
L2 Smart Fetch: After a set idle time L2 cache is flushed to L3, allowing cores to ‘sleep’ to save power while maintaining MP coherency
C6 (NEW!): Any idle module can independently enter ‘C6’, helping to reduce processor power at idle by up to 46%*; module state is saved to DRAM
C1e: System state to reduce memory and I/O power (every core must be idle/C6 state). C6 further reduces idle power where there is almost no leakage.
* See processor power savings slide in substantiation section
MANY OPTIONS TO REDUCE POWER WITH AMD-P
OPEX Management Features
• TDP Power Cap • AMD PowerCap Manager • Advanced Platform Management Link
Fail-Safe Operating Mode
• AMD CoolSpeed Technology
Compute Power Management
• Dual Dynamic Power Management • AMD PowerNow!™ Technology with Independent Dynamic Core Technology • AMD CoolCore™ Technology • AMD Smart Fetch Technology • C6 power state • C1E
Low Power Memory Support
• Support DDR3 ULV 1.25V DIMMs • Support DDR3 LV 1.35V DIMMs
THE NEW “BULLDOZER” INSTRUCTIONS | A CLOSER LOOK
Instructions and their Applications/Use Cases
SSSE3, SSE4.1, SSE4.2 (AMD and Intel)
• Video encoding and transcoding
• Biometrics algorithms
• Text-intensive applications
AESNI, PCLMULQDQ (AMD and Intel)
• Applications using AES encryption
• Secure network transactions
• Disk encryption (MSFT BitLocker)
• Database encryption
• Cloud security
AVX (AMD and Intel)
Floating point intensive applications:
• Signal processing / Seismic
• Multimedia
• Scientific simulations
• Financial analytics
• 3D modeling
FMA4 (AMD Unique)*
• Vector and matrix multiplications
• Polynomial evaluations
• Chemistry, physics, quantum mechanics and digital signal processing
XOP (AMD Unique)*
• Numeric applications
• Multimedia applications
• Algorithms used for audio/radio
XOP and FMA4 instruction set extensions are AMD unique 128-bit and 256-bit instructions designed to:
• Improve performance by increasing the work per instruction
• Reduce the need to copy and move around register operands
• Allow for some new cases of automatic vectorization by compilers
* http://blogs.amd.com/developer/2009/05/06/striking-a-balance/
AMD OPTERON™ 4200 AND 6200 SERIES PROCESSORS OS AND HYPERVISOR SUPPORT SUMMARY
ASSUMES latest updates/patches are installed*

Enabled
Optimized to support some or all of “Bulldozer’s” new features
Includes new instruction support:
• Hyper-V Next Gen (in development)
• Linux kernel 2.6.37 +
• Novell SLES 11 SP2 Beta (includes Xen)
• RHEL 6.2 with KVM (in development)
• Windows Server 2008 R2 SP1
• Windows 8 Server (in development)
• Xen 4.1
• Ubuntu 11.04 (w/ KVM)
• VMware vSphere 5.0
Versions in this category also include latest software advances

Compatible
Will boot and run but not take advantage of “Bulldozer’s” new features outside of new instructions
Includes new instruction support:
• Linux kernel 2.6.32 – 2.6.36
• Novell SLES 11 SP1
• RHEL 6.1
• Ubuntu 10.10
Does not support new instructions for either Bulldozer or Sandy Bridge:
• Hyper-V R1
• Hyper-V R2, Hyper-V R2 SP1
• Novell SLES 10 SP4 and higher
• RHEL 5.7 (included KVM)
• Solaris 10u9, 11
• VMware vSphere 4.1u2 (in development)
• Windows Server 2003 R2 SP2
• Windows Server 2008 R2
• Windows Server 2008 SP2
• Xen 3.4.2
Will run but not necessarily provide performance uplift

Not Supported
Will not run on “Bulldozer” platforms and/or will not be supported by OSV
• Linux kernel 2.6.31 or earlier
• Novell SLES 10 thru SP3
• Novell SLES 11
• RHEL 4.x
• RHEL 5.0 – 5.5
• RHEL 5.6 (can run with patches but is not supported by Red Hat)
• RHEL 6.0
• Solaris 10 – 10u8
• VMware ESX 3.5
• VMware ESX 4.0 – 4.1u1
• Windows Server 2003 versions prior to R2 SP2

* Please note: For proper support of available features/processors, the latest updates/patches always need to be installed
AMD OPTERON™ 4200 AND 6200 SERIES PROCESSORS COMPILER SUPPORT
Columns: Compiler | Status | SSSE3, SSE4.1-.2, AVX | FMA4, XOP | Auto Generates Code | Comments

GCC 4.6.1
Status: Available
Comments: GCC 4.4 is included in RHEL 6.0 distribution and should be updated to GCC 4.6.1 for optimized support

Microsoft Visual Studio 2010 SP1
Status: Available
Auto Generates Code: No
Comments: Supports new instructions but does not auto generate code

Open64 4.5.1
Status: Available
Comments: http://developer.amd.com/open64 provide incremental performance and functionality improvements

PGI 11.9
Status: Available
Comments: PGI Unified Binary™ technology combines into a single executable or object file code optimized for multiple AMD and Intel processors

ICC 12
Status: Available (-mAVX flag)
FMA4, XOP: No
Comments: -mAVX is designed to run on any x86 processor; however, the ICC runtime makes assumptions about cache line sizes and other parameters that cause code to fail on AMD processors

Compiler Optimization Quick Guide: http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf
WHY DOES AMD SUPPORT A RANGE OF COMPILERS?
No one compiler services all of our target markets
Compilers | Languages Supported | Processors Supported | Operating Systems Supported | Comments
GCC | C, C++, Fortran, Objective-C, Java, Ada, Go | Wide variety including: x86, AIX, SPARC, ARM | Wide variety including: Linux, Windows, Mac OS, Android, Solaris | Default compiler for Linux
Intel | C, C++, Fortran | Intel x86, Itanium | Linux, Windows, Mac OS | Performance compiler for Intel
Open64 | C, C++, Fortran | AMD and Intel x86 | Linux | Performance compiler for AMD
PGI | C, C++, Fortran | AMD, Intel x86, NVIDIA CUDA | Linux, Mac OS, Windows | Performance compiler for HPC
MSFT Visual Studio | C, C++, C#, Basic | AMD and Intel x86 | Windows | Default compiler for Windows
• Default compilers are used to compile the kernel, some of the system software, and libraries for the OS
• Customers are often reluctant to change compilers
• Compilers used to generate high performance code are not necessarily the ones used for mainstream server applications
OPEN64 COMPILER | A CLOSER LOOK
Setting the “-march” (microarchitecture) flag will automatically optimize code for the target processor’s instruction set
Open64 Settings | Processor Type
-march=bdver1 | AMD Opteron™ 4200 and 6200 Series
-march=barcelona | AMD Opteron™ 13xx, 14xx, 23xx, 24xx, 83xx, 84xx, 4100, and 6100 Series
-march=any86 | Any x86 processor
“Bulldozer” compiler optimizations enabled by -march=bdver1*
• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA, and XOP)
• Automatically selects instructions to improve performance (intrinsics and inline)
• Automatic calls to libM (math library) functions that use these new instructions
• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes
• Adjusted to take advantage of the improved hardware prefetcher
• Improvements in code layout and alignment to take advantage of shared compute unit, e.g. “dispatch scheduling”
* Additional information: http://developer.amd.com/tools/open64/Documents/open64.html
GCC COMPILER | A CLOSER LOOK
Setting the “-march” (microarchitecture) flag will automatically optimize code for the target processor’s instruction set
GCC Settings | Processor Type
-march=bdver1 | AMD Opteron™ 4200 and 6200 Series
-march=amdfam10 | AMD Opteron™ 13xx, 14xx, 23xx, 24xx, 83xx, 84xx, 4100, and 6100 Series
-march=generic | Any x86 processor
“Bulldozer” compiler optimizations enabled by -march=bdver1
• Support for all new instructions (SSSE3, SSE4.1, SSE4.2, AVX, FMA, and XOP)
• Automatically selects instructions to improve performance (intrinsics and inline)
• Scalar and vector libm calls available with AMD Libm
• Code generation tuned for microarchitecture, e.g. instruction latencies, cache sizes
• Memset/Memcpy inliner heuristics
• Defaults to 128-bit vectorization
• Improvements in code layout and alignment
Additional information: http://developer.amd.com/tools/gnu/pages/default.aspx
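The flag tables on the Open64 and GCC slides amount to a lookup from compiler and processor generation to a "-march" value. The sketch below is one way a build script might encode that lookup. The dictionary keys ("bulldozer", "family10h"), the function name `march_flag`, and the source file name `app.c` are hypothetical and used only for illustration; only the flag values themselves come from the slides.

```python
# -march values taken from the Open64 and GCC slides.
MARCH_FLAGS = {
    ("gcc", "bulldozer"): "-march=bdver1",
    ("gcc", "family10h"): "-march=amdfam10",
    ("open64", "bulldozer"): "-march=bdver1",
    ("open64", "family10h"): "-march=barcelona",
}

# Opteron series -> generation label (labels are illustrative, not AMD terms).
SERIES_TO_GENERATION = {
    "4200": "bulldozer",
    "6200": "bulldozer",
    "4100": "family10h",
    "6100": "family10h",
}

def march_flag(compiler: str, series: str):
    """Return the -march flag for a compiler/series pair, or None if unknown."""
    generation = SERIES_TO_GENERATION.get(series)
    return MARCH_FLAGS.get((compiler.lower(), generation))

# Example: assemble a GCC command line for an Opteron 6200 Series target.
command = ["gcc", "-O3", march_flag("gcc", "6200"), "-o", "app", "app.c"]
print(" ".join(command))
```

Returning None for unknown pairs leaves the choice of a fallback flag to the caller instead of guessing one.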
AMD OPTERON™ 4200 AND 6200 SERIES PROCESSORS LIBRARY SUPPORT
A library is a collection of pre-written code and subroutines
Library | Description | Comments
ACML (AMD Core Math Library) | Set of optimized and threaded math routines for HPC, scientific, engineering and related compute-intensive applications | ACML 4.x is compatible with “Bulldozer”; ACML 5.x is optimized for “Bulldozer”
AMD LibM | C library containing a collection of basic math functions optimized for x86-64 processors | AMD LibM 3.0 is optimized for “Bulldozer”
IPP (Intel Performance Primitives) | Library of multicore-ready, optimized software functions for multimedia, data processing, and communications applications | “For AMD 64-bit processors that support SSE3 the "m7" version of the IPP library will be dispatched automatically. Otherwise "mx" library will be used”*
For more information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
For more information on AMD LibM, go to: http://developer.amd.com/libraries/libm/pages/default.asp
* http://software.intel.com/en-us/articles/use-ipp-on-amd-processor/
ACML SUPPORT | A CLOSER LOOK
ACML 5.0 (Aug 2011)
Linear Algebra: • SGEMM (single precision) • DGEMM (double precision) • L1 BLAS
Fast Fourier Transforms (FFT): • Complex-to-Complex (C-C) single precision FFTs
Compiler Support: • Absoft • GCC 4.6 • Open64 4.2.5 • PGI 11.8, 11.9 • ICC 12 • Cray to begin deployment of ACML with their compiler with ACML 5.0
ACML 5.1 (Dec 2011)
Linear Algebra: • CGEMM (complex single precision) • ZGEMM (complex double precision)
Fast Fourier Transforms (FFT): • Real-to-complex (R-C) single precision FFTs • Double precision C-C and R-C FFTs
Compiler Support: All compilers listed for ACML 5.0 will be supported
Others
• Random Number Generators • AVX compiler switch for Fortran
For additional information on ACML, go to: http://developer.amd.com/libraries/acml/pages/default.aspx
STARTING POINTS FOR APPLICATION OPTIMIZATION
Recommended for | Operating System | Compiler | Library
SPECCPU, LINPACK, HPC Challenge | Novell SLES 11 SP1 or RHEL 6.1 | Open64 4.5.1 | ACML 5.1
Application development and benchmarks with gcc | Novell SLES 11 SP1 or RHEL 6.1 | GCC 4.6 | ACML 5.0 and/or libM 3.0
HPC application code development | Novell SLES 11 SP1 or RHEL 6.1 | Open64 4.2.5 or PGI 11.9 | ACML 5.0
Integer code development for Windows | Windows Server 2008 R2 SP1 | Microsoft Visual Studio 2010 SP1 | AMD libM 3.0
Recommendations are based on AMD evaluations; please evaluate for your specific workload
REFERENCES
x86 Compiler Quick Reference Guide for “Bulldozer” processors http://developer.amd.com/Assets/CompilerOptQuickRef-62004200.pdf
Using the x86 Open64 Compiler Suite http://developer.amd.com/tools/open64/Documents/open64.html
x86 Open64 4.2.5.2 Release Notes http://developer.amd.com/tools/open64/assets/ReleaseNotes.txt
ACML 5.0 Information http://developer.amd.com/libraries/acml/features/pages/default.aspx
Software Optimization Guide for “Bulldozer” processors http://support.amd.com/us/Processor_TechDocs/47414.pdf
AMD64 Architecture Programmer’s Manual Volume 6: 128-Bit and 256-Bit XOP and FMA4 Instructions http://support.amd.com/us/Embedded_TechDocs/43479.pdf
PRICE POINT PERFORMANCE COMPARISONS
HIGHEST PERFORMING CPUs AT EACH PRICE POINT
[Bar chart: SPECint_rate2006 (axis 0 to 600) by CPU price point; callouts: 48%, 47%, 16%, and 74% better performance]
>$1100: Intel Xeon X5690 $1663; Intel Xeon X5675 $1440; Intel Xeon X5660 $1219; no AMD Opteron™ 6200 Series processors above $1100
$900-$1100: AMD Opteron 6282 SE $1019; Intel Xeon X5650 $996
$700-$900: AMD Opteron 6276 $788; Intel Xeon E5649 $774; Intel Xeon X5647 $774; Intel Xeon E5640 $774
$500-$700: Intel Xeon E5645 $551; Intel Xeon E5630 $551; AMD Opteron 6262 HE $523; AMD Opteron 6220 $523
$300-$500: AMD Opteron 6238 $455; Intel Xeon E5620 $387
SPEC and SPECint are registered trademarks of the Standard Performance Evaluation Corporation. The results stated above reflect results published on http://www.spec.org/cpu2006/results/ or submitted by AMD to SPEC as of 10/26/11. The comparison presented above is based on the best performing two-socket servers using the specified AMD Opteron™ processor Models and Intel Xeon processor Models operating at each processor’s default frequency. For the latest SPECint®_rate2006 results, visit http://www.spec.org/cpu2006/results/. For additional configuration information, see Two Socket Server SPECint®_rate2006 on backup slide 41. Intel pricing is reflective of published pricing on www.intel.com as of 10/26/11. AMD pricing available at http://www.amd.com/us/products/pricing/Pages/serveropteron.aspx.
SPECint_rate2006, 2P
Name | Core Count | Core Freq. | Power Rating | Integer Score | 1kU Price | Performance %
X5690 | 6 | 3.46 GHz | 130 W TDP | 435 | $1,663 | 159%
X5680 | 6 | 3.33 GHz | 130 W TDP | 426 | $1,663 | 155%
X5687 | 4 | 3.60 GHz | 130 W TDP | 354 | $1,663 | 129%
X5677 | 4 | 3.46 GHz | 130 W TDP | 345 | $1,663 | 126%
X5675 | 6 | 3.06 GHz | 95 W TDP | 407 | $1,440 | 149%
X5670 | 6 | 2.93 GHz | 95 W TDP | 396 | $1,440 | 145%
X5672 | 4 | 3.20 GHz | 95 W TDP | 335 | $1,440 | 122%
X5667 | 4 | 3.06 GHz | 95 W TDP | 324 | $1,440 | 118%
X5660 | 6 | 2.80 GHz | 95 W TDP | 387 | $1,219 | 141%
6282 SE | 16 | 2.6 GHz | 140 W TDP | 543 | $1,019 | 198%
X5650 | 6 | 2.66 GHz | 95 W TDP | 381 | $996 | 139%
L5640 | 6 | 2.26 GHz | 60 W TDP | 313 | $996 | 114%
6276 | 16 | 2.3 GHz | 115 W TDP | 488 | $788 | 178%
E5649 | 6 | 2.53 GHz | 80 W TDP | 335 | $774 | 122%
X5647 | 4 | 2.93 GHz | 130 W TDP | 292 | $774 | 107%
E5640 | 4 | 2.66 GHz | 80 W TDP | 274 | $774 | 100%
6274 | 16 | 2.2 GHz | 115 W TDP | 472 | $639 | 172%
E5645 | 6 | 2.40 GHz | 80 W TDP | 332 | $551 | 121%
E5630 | 4 | 2.53 GHz | 80 W TDP | 265 | $551 | 97%
6272 | 16 | 2.1 GHz | 115 W TDP | 455 | $523 | 166%
6220 | 8 | 3.0 GHz | 115 W TDP | 319 | $523 | 116%
6238 | 12 | 2.6 GHz | 115 W TDP | 414 | $455 | 151%
E5620 | 4 | 2.40 GHz | 80 W TDP | 236 | $387 | 86%
6234 | 12 | 2.4 GHz | 115 W TDP | 388 | $377 | 142%
E5607 | 4 | 2.26 GHz | 80 W TDP | 183 | $276 | 67%
6212 | 8 | 2.6 GHz | 115 W TDP | 279 | $266 | 102%
E5606 | 4 | 2.13 GHz | 80 W TDP | 169 | $219 | 62%
E5603 | 4 | 1.60 GHz | 80 W TDP | 134 | $188 | 49%
Chart callouts by price band: > $900: 59%; $700-$900: 56%; $500-$700: 69%; $300-$500: 56%; < $300: 35%
Data From: www.spec.org
SPECint_rate2006, 4P
Model Name | Core Count | Core Freq. | Power Rating | Integer Score | 1kU Price | Performance %
Xeon E7-4870 | 10 | 2.40 GHz | 130W TDP | 1130 | $4,394 | 136%
Xeon E7-8867L | 10 | 2.13 GHz | 105W TDP | 974 | $4,172 | 117%
Xeon E7-4860 | 10 | 2.26 GHz | 130W TDP | 1090 | $3,838 | 131%
Xeon E7-4850 | 10 | 2.00 GHz | 130W TDP | 919 | $2,837 | 111%
Xeon E7-8837 | 8 | 2.67 GHz | 130W TDP | 795 | $2,280 | 96%
Xeon E7-4830 | 8 | 2.13 GHz | 105W TDP | 831 | $2,059 | 100%
Xeon E7-4820 | 8 | 2.00 GHz | 105W TDP | 759 | $1,446 | 91%
Opteron 6282 SE | 16 | 2.6 GHz | 140W TDP | 1060 | $1,019 | 128%
Xeon E7-4807 | 6 | 1.86 GHz | 95W TDP | 516 | $890 | 62%
Opteron 6276 | 16 | 2.3 GHz | 115W TDP | 973 | $788 | 117%
Opteron 6274 | 16 | 2.2 GHz | 115W TDP | 940 | $639 | 113%
Opteron 6272 | 16 | 2.1 GHz | 115W TDP | 909 | $523 | 109%
Opteron 6262 HE | 16 | 1.6 GHz | 85W TDP | 740 | $523 | 89%
Opteron 6220 | 8 | 3.0 GHz | 115W TDP | 630 | $523 | 76%
Opteron 6238 | 12 | 2.6 GHz | 115W TDP | 827 | $455 | 100%
Opteron 6234 | 12 | 2.4 GHz | 115W TDP | 763 | $377 | 92%
Opteron 6204 | 4 | 3.3 GHz | 115W TDP | 351 | $377 | 42%
Opteron 6212 | 8 | 2.6 GHz | 115W TDP | 554 | $266 | 67%
Difference: ($2,059-$523)*4=$6,144 (~ NTD 184,320)
Difference: ($3,838-$1,019)*4=$11,276 (~ NTD 338,280)
Data From: www.spec.org
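The two "Difference" lines on this slide multiply the per-processor price gap by four sockets and then convert to NTD. Here is a minimal sketch of that calculation. The exchange rate of 30 NTD per USD is an assumption inferred from the slide's own figures (184,320 / 6,144 = 30), not a stated value, and the function name is illustrative.

```python
NTD_PER_USD = 30  # assumption: rate implied by the slide's USD and NTD figures

def four_socket_difference(intel_price: int, amd_price: int, sockets: int = 4) -> int:
    """Processor cost difference in USD for a server with `sockets` CPUs."""
    return (intel_price - amd_price) * sockets

# (Intel 1kU price, AMD 1kU price) pairs used on the slide
for intel_price, amd_price in [(2059, 523), (3838, 1019)]:
    usd = four_socket_difference(intel_price, amd_price)
    print(f"${usd:,} (~ NTD {usd * NTD_PER_USD:,})")
```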
WHY DO MEMCACHED APPLICATIONS USE AMD? 2P SERVER COMPARISON
Model Name | Core Count | Core Freq. | Power Rating | Memory Bandwidth | 1kU Price
Opteron 6212 | 8 | 2.6GHz | 115W TDP | 102GB/s | $266
Opteron 6234 | 12 | 2.4GHz | 115W TDP | 102GB/s | $377
Opteron 6204 | 4 | 3.3GHz | 115W TDP | 102GB/s | $377
Opteron 6238 | 12 | 2.6GHz | 115W TDP | 102GB/s | $455
Opteron 6262 HE | 16 | 1.6GHz | 85W TDP | 102GB/s | $523
Opteron 6272 | 16 | 2.1GHz | 115W TDP | 102GB/s | $523
Opteron 6220 | 8 | 3.0GHz | 115W TDP | 102GB/s | $523
Opteron 6274 | 16 | 2.2GHz | 115W TDP | 102GB/s | $639
Opteron 6276 | 16 | 2.3GHz | 115W TDP | 102GB/s | $788
Opteron 6282 SE | 16 | 2.6GHz | 140W TDP | 102GB/s | $1,019
Xeon E5645 | 6 | 2.40GHz | 80W TDP | 64GB/s | $551
Xeon E5649 | 6 | 2.53GHz | 80W TDP | 64GB/s | $774
Xeon X5650 | 6 | 2.66GHz | 95W TDP | 64GB/s | $996
Xeon X5660 | 6 | 2.80GHz | 95W TDP | 64GB/s | $1,219
Xeon E5603 | 4 | 1.60GHz | 80W TDP | 51GB/s | $188
Xeon E5606 | 4 | 2.13GHz | 80W TDP | 51GB/s | $219
Xeon E5620 | 4 | 2.40GHz | 80W TDP | 51GB/s | $387
Xeon E5630 | 4 | 2.53GHz | 80W TDP | 51GB/s | $551
Performance % (relative memory bandwidth): 102GB/s = 159%; 64GB/s = 100%; 51GB/s = 79%
Processor cost: $266x2=$532 vs. $551x2=$1,102; difference ~ NTD 17,100
p.s. 2P memory capacity is up to 512GB
Data From: www.spec.org
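The memory bandwidth column appears to match theoretical peak DDR3 bandwidth, which can be reproduced with a short calculation. This is a sketch under stated assumptions: a 64-bit (8-byte) channel width, four channels per Opteron 6200 socket as stated on the Direct Connect Architecture slide, and three channels per Xeon 5600 socket, which is not stated in the slides. The function name is illustrative.

```python
def peak_bandwidth_gbs(channels_per_socket: int, transfers_mt_s: int,
                       sockets: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal units)."""
    return channels_per_socket * transfers_mt_s * bytes_per_transfer * sockets / 1000

# Opteron 6200: 4 channels of DDR3-1600 per socket (from the slides)
opteron_2p = peak_bandwidth_gbs(4, 1600, sockets=2)
opteron_4p = peak_bandwidth_gbs(4, 1600, sockets=4)

# Xeon 5600: 3 channels per socket is an assumption, not stated in the slides
xeon_1333_2p = peak_bandwidth_gbs(3, 1333, sockets=2)
xeon_1066_2p = peak_bandwidth_gbs(3, 1066, sockets=2)

for label, value in [("Opteron 2P", opteron_2p), ("Opteron 4P", opteron_4p),
                     ("Xeon 2P DDR3-1333", xeon_1333_2p),
                     ("Xeon 2P DDR3-1066", xeon_1066_2p)]:
    print(f"{label}: {value:.1f} GB/s")
```

Rounded to whole numbers, these reproduce the 102, 64, and 51 GB/s values in the 2P table and the 205 GB/s Opteron value in the 4P comparison.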
WHY DO MEMCACHED APPLICATIONS USE AMD? 4P SERVER COMPARISON
Model Name | Core Count | Core Freq. | Power Rating | Memory Bandwidth | 1kU Price
Opteron 6212 | 8 | 2.6GHz | 115W TDP | 205GB/s | $266
Opteron 6234 | 12 | 2.4GHz | 115W TDP | 205GB/s | $377
Opteron 6204 | 4 | 3.3GHz | 115W TDP | 205GB/s | $377
Opteron 6238 | 12 | 2.6GHz | 115W TDP | 205GB/s | $455
Opteron 6262 HE | 16 | 1.6GHz | 85W TDP | 205GB/s | $523
Opteron 6272 | 16 | 2.1GHz | 115W TDP | 205GB/s | $523
Opteron 6220 | 8 | 3.0GHz | 115W TDP | 205GB/s | $523
Opteron 6274 | 16 | 2.2GHz | 115W TDP | 205GB/s | $639
Opteron 6276 | 16 | 2.3GHz | 115W TDP | 205GB/s | $788
Opteron 6282 SE | 16 | 2.6GHz | 140W TDP | 205GB/s | $1,019
Xeon E7-4830 | 8 | 2.13GHz | 105W TDP | 115GB/s | $2,059
Xeon E7-8837 | 8 | 2.67GHz | 130W TDP | 115GB/s | $2,280
Xeon E7-4850 | 10 | 2.00GHz | 130W TDP | 115GB/s | $2,837
Xeon E7-4860 | 10 | 2.3GHz | 130W TDP | 115GB/s | $3,838
Xeon E7-8867L | 10 | 2.13GHz | 105W TDP | 115GB/s | $4,172
Xeon E7-4870 | 10 | 2.40GHz | 130W TDP | 115GB/s | $4,394
Xeon E7-4820 | 8 | 2.00GHz | 105W TDP | 105GB/s | $1,446
Xeon E7-4807 | 6 | 1.9GHz | 95W TDP | 86GB/s | $890
Performance % (relative memory bandwidth): 205GB/s = 178%; 115GB/s = 100%; 105GB/s = 91%; 86GB/s = 75%
Processor cost: $266x4=$1,064 vs. $2,059x4=$8,236; difference ~ NTD 215,160
p.s. 4P memory capacity is up to 1TB
Data From: www.spec.org
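The percentages on this slide are consistent with each server's memory bandwidth divided by the 115GB/s Xeon figure, and the NTD difference follows from the four-processor cost gap. The sketch below reproduces both. The 30 NTD per USD rate is an assumption inferred from the slide's figures, and all names in the code are illustrative.

```python
NTD_PER_USD = 30  # assumption: rate implied by the slide's figures
SOCKETS = 4

def relative_percent(value: float, baseline: float) -> int:
    """Value as a whole-number percentage of the baseline."""
    return round(value / baseline * 100)

# Memory bandwidth in GB/s, as listed in the 4P comparison
bandwidth_gbs = {
    "Opteron 6200 Series": 205,
    "Xeon E7-4830": 115,
    "Xeon E7-4820": 105,
    "Xeon E7-4807": 86,
}
baseline = bandwidth_gbs["Xeon E7-4830"]
relative = {name: relative_percent(bw, baseline) for name, bw in bandwidth_gbs.items()}

# Processor cost for four sockets: Opteron 6212 vs. Xeon E7-4830
opteron_cost = 266 * SOCKETS
xeon_cost = 2059 * SOCKETS
difference_ntd = (xeon_cost - opteron_cost) * NTD_PER_USD

print(relative)
print(f"${opteron_cost:,} vs ${xeon_cost:,}; difference ~ NTD {difference_ntd:,}")
```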
BETTER PROCESSOR VALUE
Comparison | Top Bin AMD processor | Top Bin Intel processor
Processor name | AMD Opteron™ 6282 SE | Intel Xeon E5-2690
Cores | 16 dedicated | 8 shared
1kU list price/processor | $1,019 | $2,057
Price/core | $64 | $257
Price/performance | $3.75 | $5.95
Price/GFLOP | $6.14 | $11.06
Why spend 102% more per processor?
Why settle for 50% fewer cores that have shared resources?
Save 44% per GFLOP with AMD.
Get 37% better price/performance with AMD.
• Spec and pricing as of 3/8/12 at www.intc.com/pricelist.cfm
• See substantiation slide #53 for SPECint®_rate2006 scores
• Price/performance equals the cost of two processors divided by the estimated 2P SPECint®_rate2006 score.
• Max theoretical GFLOPS equals number of FLOPS per cycle x frequency of processor x number of cores per processor x number of processors per server
• Number of FLOPS per cycle is 4 for AMD Opteron 6200 Series-based servers and 8 for Intel Xeon E5-2600 Series-based servers
• AMD Opteron 6200 Series can do up to 166 GFLOPS per processor, Intel Xeon E5-2600 Series can do up to 186 GFLOPS per processor
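The price/core and GFLOPS figures in this comparison follow from the formula given in the notes (FLOPS per cycle x frequency x cores). The following is a minimal sketch of that arithmetic. The 2.9 GHz base frequency for the Xeon E5-2690 is an assumption not stated on the slide; the 2.6 GHz Opteron 6282 SE frequency comes from the SPECint tables. The class and method names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    price_usd: float
    cores: int
    ghz: float
    flops_per_cycle: int

    def peak_gflops(self) -> float:
        """Max theoretical GFLOPS for one processor."""
        return self.flops_per_cycle * self.ghz * self.cores

    def price_per_core(self) -> float:
        return self.price_usd / self.cores

    def price_per_gflop(self) -> float:
        return self.price_usd / self.peak_gflops()

amd = Processor("AMD Opteron 6282 SE", 1019, 16, 2.6, 4)
# 2.9 GHz is an assumption (not given on the slide)
intel = Processor("Intel Xeon E5-2690", 2057, 8, 2.9, 8)

for cpu in (amd, intel):
    print(f"{cpu.name}: {cpu.peak_gflops():.1f} GFLOPS, "
          f"${cpu.price_per_core():.2f}/core, ${cpu.price_per_gflop():.2f}/GFLOP")
```

The price/GFLOP values computed this way differ by a few cents from the slide's $6.14 and $11.06, which appear to divide by the rounded 166 and 186 GFLOPS figures.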
THANK YOU!
BACKUP SLIDE #31
84% higher performance: LINPACK (2P)
AMD Opteron processor Model 6276 generates 84% more FLOPS than Intel Xeon processor Model X5670
– 239.1 GFLOPS, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, gfortran compiler v4.6, OMPI 1.5.3, AMD Core Math Library 5.0.0.0
– Compiler Flags: -fomit-frame-pointer -O3 -funroll-loops -W -Wall -mavx -mfma4 -fopenmp
– 130.1 GFLOPS, 2 x Intel Xeon processors Model X5670 in Supermicro 6026TT-BIBQF server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Professional Compiler v11.1, OMPI 1.5.1, Intel Math Kernel Library 10.3, Hyper-Threading disabled, Turbo Boost Technology enabled
– Compiler Flags: -O3 -w -ansi-alias -i-static -openmp -nocompchk
73% more memory bandwidth: STREAM (2P)
AMD Opteron processor Model 6276 has 73% higher memory bandwidth than Intel Xeon processor Model X5670
– 73 GB/s, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, x86 Open64 4.2.5-1 Compiler Suite
– 42 GB/s, 2 x Intel Xeon processors Model X5670 in Supermicro X8DTT server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Compiler v11.1.064
BACKUP SLIDE #32
1/2 the power-per-core*
– As of Nov 1, 2011, AMD Opteron™ processor Models 4200 EE have the lowest known power per core of any x86 server processor, at 35W TDP (35W/8 = 4.375W/core). Intel's lowest power per core server processor, L5630, is 40W TDP (40W/4 = 10W/core). See http://www.intel.com/Assets/PDF/prodbrief/323501.pdf. Previous record held by AMD Opteron processor Models 4100 EE at 35W TDP / 6 cores = 5.83 W/core.
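The power-per-core claim is a division of TDP by core count. A minimal sketch of the three figures quoted above follows; the function name `watts_per_core` is illustrative.

```python
def watts_per_core(tdp_watts: float, cores: int) -> float:
    """TDP divided evenly across cores."""
    return tdp_watts / cores

opteron_4200_ee = watts_per_core(35, 8)  # AMD Opteron 4200 EE
xeon_l5630 = watts_per_core(40, 4)       # Intel Xeon L5630
opteron_4100_ee = watts_per_core(35, 6)  # AMD Opteron 4100 EE (previous record)

print(f"4200 EE: {opteron_4200_ee} W/core")
print(f"L5630:   {xeon_l5630} W/core")
print(f"4100 EE: {opteron_4100_ee:.2f} W/core")
print(f"4200 EE vs L5630 ratio: {opteron_4200_ee / xeon_l5630:.4f}")
```

The ratio of 0.4375 is what the slide rounds to "1/2 the power-per-core".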
Requires 2/3 less floor space*: VMs/rack (2P and 4P)
– One rack of AMD Opteron 6200 Series-based servers can support 672 VMs (1 VM per core, 2U servers)
– This would take three racks of floor space and 56 2U Intel Xeon 5600 Series-based servers to do the same.
– Assumes 1 VM/core, AMD Opteron 6200 Series-based 2P 2U server has up to 32 cores, supports up to 32 VMs/server x 21 servers per rack, which equals 672 VMs per rack. Intel Xeon 5600 Series-based 2P 2U server has up to 12 cores, supports up to 12 VMs/server x 21 servers per rack, which equals 252 VMs per rack. Intel specs as of 11/4/11 at www.intc.com/pricelist.cfm.
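The floor-space claim can be reproduced from the stated assumptions (1 VM per core, 21 2U servers per rack). The sketch below works through it; the function and variable names are illustrative.

```python
import math

def vms_per_rack(cores_per_server: int, servers_per_rack: int = 21,
                 vms_per_core: int = 1) -> int:
    """VM capacity of one rack under the slide's assumptions."""
    return cores_per_server * vms_per_core * servers_per_rack

amd_rack = vms_per_rack(32)    # 2P Opteron 6200 server: up to 32 cores
intel_rack = vms_per_rack(12)  # 2P Xeon 5600 server: up to 12 cores

# Intel servers and racks needed to host the same number of VMs as one AMD rack
intel_servers_needed = math.ceil(amd_rack / 12)
intel_racks_needed = math.ceil(amd_rack / intel_rack)

print(f"AMD: {amd_rack} VMs/rack, Intel: {intel_rack} VMs/rack")
print(f"Intel needs {intel_servers_needed} servers in {intel_racks_needed} racks")
```

One rack versus three racks is the basis for the "2/3 less floor space" headline.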
1/3 to 2/3 lower platform price*
– Top bin comparisons. Dell R710 with two top bin Intel Xeon processor Model X5690s is $7,103. Since pricing for a Dell R715 with two top bin AMD Opteron processor Model 6282 SE (1ku $1019) is not yet available, the similarly priced AMD Opteron processor Model 6140 (1ku $989) was used and the server yielded a price of $4564. That is a 36% price savings. Both servers were configured with 32GB RAM, 146GB 10K hdd, and 3yr base warranty and large enterprise pricing is from www.dell.com as of 10/22/11. HP DL580 with four top bin Intel Xeon processor Model E7-4870 is $29,336 at www.hp.com. Since pricing for an HP DL585 with four top bin AMD Opteron processor Model 6282 SE (1ku $1019) is not yet available, the similarly priced AMD Opteron processor Model 6140 (1ku $989) was used and the server yielded a price of $11,094. That is a 62% price savings. Both servers were configured with 64GB RAM, 72GB 15K hdd, and 3yr base warranty and large enterprise pricing is from www.hp.com as of 10/22/11. VMware vSphere pricing not included, assuming both servers configured with versions 5.0 or 4.1u2, which are the same price for AMD- and Intel-based servers.
BACKUP SLIDE #35 ¹ 1ku pricing for AMD Opteron processor Model 6276 is $788 and $1,440 for Intel Xeon processor Model X5670 SPECfp®_rate
SPEC and SPECfp are registered trademarks of the Standard Performance Evaluation Corporation. The results for AMD Opteron™ processor Model 6276 is based upon data submitted to Standard Performance Evaluation Corporation as of November 7, 2011. 1. The other result stated above reflect results published on http://www.spec.org/cpu2006/results as of November 7, 2011. The comparison presented above is based on the best performing two-socket servers using AMD Opteron™ processor Model 6276 and Intel Xeon processor Model X5670 operating at each processor’s default frequency. For the latest SPECfp_rate2006 results, visit http://www.spec.org/cpu2006/results. SPECfp®_rate score = 360, 2 x AMD Opteron™ processors Model 6276 in Supermicro A+ 1022-URFserver, 64GB (8 x 8GB DDR3-1600) memory, Red Hat Enterprise Linux 6.1 64-bit, x86 Open64 4.2.5.2 Compiler Suite. SPECfp®_rate score = 263, 2 x Intel Xeon processors Model X5670 in Cisco UCS B200 M2 server, 48GB (12 x 4GB DDR3-1333) memory, SUSE Linux® Enterprise Server 11 SP1 64-bit, Intel C++ Compiler XE v12.0.1.116
STREAM
73 GB/s, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, x86 Open64 4.2.5-1 Compiler Suite
42 GB/s, 2 x Intel Xeon processors Model X5670 in Supermicro X8DTT server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Compiler v11.1.064
LINPACK
239.1 GFLOPS, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, gfortran compiler v4.6, OMPI 1.5.3, AMD Core Math Library 5.0.0.0
Compiler Flags: -fomit-frame-pointer -O3 -funroll-loops -W -Wall -mavx -mfma4 -fopenmp
130.1 GFLOPS, 2 x Intel Xeon processors Model X5670 in Supermicro 6026TT-BIBQF server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Professional Compiler v11.1, OMPI 1.5.1, Intel Math Kernel Library 10.3, Hyper-Threading disabled, Turbo Boost Technology enabled
Compiler Flags: -O3 -w -ansi-alias -i-static -openmp -nocompchk
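For the higher-is-better results on this slide (SPECfp_rate, STREAM, LINPACK), the relative advantage is the ratio of the two scores minus one. The sketch below assumes only the rounded figures quoted above; the helper name `percent_higher` and the dictionary layout are illustrative.

```python
def percent_higher(amd_score: float, intel_score: float) -> float:
    """Percentage by which the first score exceeds the second (higher is better)."""
    return (amd_score / intel_score - 1.0) * 100.0


# (AMD Opteron 6276 result, Intel Xeon X5670 result), as quoted on the slide.
results = {
    "SPECfp_rate2006": (360.0, 263.0),
    "STREAM (GB/s)": (73.0, 42.0),
    "LINPACK": (239.1, 130.1),
}

gains = {name: percent_higher(amd, intel) for name, (amd, intel) in results.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% higher")
```

Because the inputs are already rounded, a percentage computed this way can differ by a point or so from a figure derived from the unrounded measurements.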
BACKUP SLIDE #36
LAMMPS (29% better)
471s, Intel Xeon X5670, Trial 24, Nodes 1, CXX Intel 11.1.064, CXXFLAGS -O2 -funroll-loops -fstrict-aliasin, MPI OMPI 1.5.1 + knem 0.9.5, DIMM Speed (MHz) 1333, DIMM Capacity (GB) 4, DIMM Count 6, Total Capacity (GB) 24, System SM X8DTT, OS SLES11 SP1, Kernel 2.6.32.12-0.7-default, Notes Turbo ON, HT OFF, -mca btl_sm_use_knem 1
333s, AMD Opteron 6276, Trial 9, Nodes 1, CXX openCC 4.2.5.2-1, CXXFLAGS -O3 -OPT:Ofast -OPT:rsqrt=2 -march=bdver1 -mavx -mfma4, MPI OpenMPI 1.5.3, DIMM Speed (MHz) 1600, DIMM Capacity (GB) 8, DIMM Count 8, Total Capacity (GB) 64, System SM H8DGT, OS SLES11 SP1, Kernel 2.6.32.28-fam15h-default, Notes hpc
NAMD (41% better)
.636 day/ns, Intel Xeon X5670, Trial 40, NAMD Version 2.7, Nodes 1, MB memory 274.328, CC Intel 11.1, CCFLAGS -ip -fno-rtti -O3 -xSSE4.2 -no-prec-div, MPI OMPI 1.5.1, DIMM Speed (MHz) 1333, DIMM Capacity (GB) 4, DIMM Count 6, Total Capacity (GB) 24, System SM X8DTT, OS SLES11 SP1, Kernel 2.6.32.12-0.7-default, Notes default
.375 day/ns, AMD Opteron 6276, Trial 77, NAMD Version 2.8, Nodes 1, MB memory 260.277, CC opencc 4.2.5-2.1, CCFLAGS -O3 -m64 -march=bdver1 -mfma4 -mavx CG:compute_to=ON -OPT:Olimit=40000, MPI OpenMPI 1.5.3+knem 0.9.6, DIMM Speed (MHz) 1600, DIMM Capacity (GB) 8, DIMM Count 8, Total Capacity (GB) 64, System SM H8DGT, OS SLES11 SP1, Kernel 2.6.32.28-fam15h-default, Notes default
WRF (20% better)
224s, Intel Xeon X5670, Trial 124, STEP NA, NODE 1, FC Intel 11.1.064, NETCDF 4.1.1, FCFLAGS -w -O3 -ip -xSSE4.2 -fp-model fast=2 -no-prec-div -no-prec-sqr, NUMA NA, MPI OMPI 1.5.1 + knem 0.9.5, DIMM # 6, DIMM GB 4, DIMM MHz 2 1333, Total Capacity 24, System SM X8DTT, OS SLES11 SP1, Kernel 2.6.32.12-0.7-default, Notes -mca btl_sm_use_knem 1
180s, AMD Opteron 6276, Trial 436, STEP B2g, NODE 1, FC open64 4.2.5-1, NETCDF 4.1.2, FCFLAGS -O3 -HP -march=bdver1 -mavx -mfma4 -DpgiFortran -OPT:unroll_size=256 LNO:blocking=off -LANG:copyinout=o, NUMA APP FILE, MPI OMPI 1.5.3, DIMM # 8, DIMM GB 8, DIMM MHz 2 1600, Total Capacity 64, System SM H8DGT, OS SLES11 SP1, Kernel 2.6.32.28-fam15h-default, Notes NA
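LAMMPS, NAMD, and WRF report run time (seconds, or days per nanosecond for NAMD), so lower is better and "% better" reads most naturally as the reduction relative to the Intel result. The sketch below applies that interpretation, which is an assumption about how the slide derived its percentages; the function name is illustrative.

```python
def percent_time_reduction(intel_time: float, amd_time: float) -> float:
    """Reduction in run time relative to the Intel result (lower time is better)."""
    return (intel_time - amd_time) / intel_time * 100.0


# (Intel Xeon X5670 time, AMD Opteron 6276 time), as quoted above.
timings = {
    "LAMMPS (s)": (471.0, 333.0),
    "NAMD (day/ns)": (0.636, 0.375),
    "WRF (s)": (224.0, 180.0),
}

reductions = {
    name: percent_time_reduction(intel, amd) for name, (intel, amd) in timings.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% less time")
```

Under this reading, the rounded values agree with the 29%, 41%, and 20% labels on the slide.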
² Greater FLOPS per sq ft
2P 1U AMD Opteron™ processor Model 6276-based server generates up to 239.1 LINPACK GFLOPS. Forty-two 1U servers can fit in a 42U rack, which equals 10,042 GFLOPS per rack. 2P 1U Intel Xeon processor Model X5670-based server generates up to 130.1 LINPACK GFLOPS. Forty-two 1U servers can fit in a 42U rack, which equals 5,464 GFLOPS per rack.
239.1 GFLOPS, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, gfortran compiler v4.6, OMPI 1.5.3, AMD Core Math Library 5.0.0.0
Compiler Flags: -fomit-frame-pointer -O3 -funroll-loops -W -Wall -mavx -mfma4 -fopenmp
130.1 GFLOPS, 2 x Intel Xeon processors Model X5670 in Supermicro 6026TT-BIBQF server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Professional Compiler v11.1, OMPI 1.5.1, Intel Math Kernel Library 10.3, Hyper-Threading disabled, Turbo Boost Technology enabled
Compiler Flags: -O3 -w -ansi-alias -i-static -openmp -nocompchk
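The per-rack figures in footnote ² follow from multiplying the per-server LINPACK result by the number of 1U servers in a 42U rack. A minimal sketch, assuming (as the footnote does) that every rack unit holds a server; the constant and function names are illustrative.

```python
RACK_UNITS = 42        # 42U rack, per the footnote
SERVER_HEIGHT_U = 1    # 1U servers


def per_rack(per_server: float,
             rack_units: int = RACK_UNITS,
             server_height_u: int = SERVER_HEIGHT_U) -> float:
    """Scale a per-server figure to a full rack of identical servers."""
    servers_per_rack = rack_units // server_height_u
    return per_server * servers_per_rack


amd_rack = per_rack(239.1)     # 2P AMD Opteron 6276 server
intel_rack = per_rack(130.1)   # 2P Intel Xeon X5670 server

print(f"AMD:   {amd_rack:,.0f} per rack")
print(f"Intel: {intel_rack:,.0f} per rack")
```

The same multiplication applies to any per-server quantity, such as core count.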
BACKUP SLIDE #37
² Maximum cores per rack
– 2P 1U AMD Opteron™ processor Model 6276-based server has up to 32 cores. Forty-two 1U servers can fit in a 42U rack, which equals 1,344 cores per rack.
– 2P 1U Intel Xeon processor Model X5670-based server has up to 12 cores. Forty-two 1U servers can fit in a 42U rack, which equals 504 cores per rack.
³ STREAM (2P): AMD Opteron processor Model 6276 has 73% higher memory bandwidth than Intel Xeon processor Model X5670
– 73 GB/s, 2 x AMD Opteron™ processors Model 6276 in Supermicro H8DGT server, 64GB (8 x 8GB DDR3-1600) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, x86 Open64 4.2.5-1 Compiler Suite
– 42 GB/s, 2 x Intel Xeon processors Model X5670 in Supermicro X8DTT server, 24GB (6 x 4GB DDR3-1333) memory, SuSE Linux® Enterprise Server 11 SP1 64-bit, Intel Compiler v11.1.064
⁴ Comparison of 12-core AMD Opteron™ processor Model 6234 expected price of $377 at launch with 4-core Intel Xeon E5603 price of $188 according to www.intel.com as of 11/4/11.