Transcript
HIGH-PERFORMANCE POWER-EFFICIENT X86-64 SERVER AND DESKTOP PROCESSORS Using the core codenamed “Bulldozer” Sean White 19 August 2011
THE DIE Overview, Floorplan, and Technology
2 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 |
THE DIE | Photograph
THE DIE | What’s on it?
Eight “Bulldozer” cores
– High-performance, power-efficient AMD64 cores
– First generation of a new execution core family from AMD (Family 15h)
– Two cores in each Bulldozer module
– 128 KB of Level1 Data Cache, 16 KB/core, 64-byte cacheline, 4-way associative, write-through
– 256 KB of Level1 Instruction Cache, 64 KB/Bulldozer module, 64-byte cacheline, 2-way associative
– 8 MB of Level2 Cache, 2 MB/Bulldozer module, 64-byte cacheline, 16-way associative
Integrated Northbridge which controls:
– 8 MB of Level3 Cache, 64-byte cacheline, 16-way associative, MOESI
– Two 72-bit wide DDR3 memory channels
– Four 16-bit receive/16-bit transmit HyperTransport™ links
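The per-die cache totals on this slide follow from the per-core and per-module sizes. The sketch below checks that arithmetic and derives the number of sets in each cache from size, associativity and line size. The helper name `num_sets` and the constants are illustrative only and are not taken from AMD documentation.

```python
# Sketch: re-derive the cache totals quoted on the slide and compute set counts.
KB = 1024
MB = 1024 * KB

CORES = 8
MODULES = CORES // 2   # two cores per Bulldozer module
LINE_BYTES = 64        # 64-byte cacheline for every level listed


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Sets in a set-associative cache: size / (ways * line size)."""
    return size_bytes // (ways * line_bytes)


l1d_total = CORES * 16 * KB      # 16 KB per core
l1i_total = MODULES * 64 * KB    # 64 KB per module
l2_total = MODULES * 2 * MB      # 2 MB per module

print("L1D total:", l1d_total // KB, "KB, sets per cache:", num_sets(16 * KB, 4))
print("L1I total:", l1i_total // KB, "KB, sets per cache:", num_sets(64 * KB, 2))
print("L2 total:", l2_total // MB, "MB, sets per cache:", num_sets(2 * MB, 16))
```

Under these figures each 16 KB, 4-way data cache has 64 sets, each 64 KB, 2-way instruction cache has 512 sets, and each 2 MB, 16-way L2 has 2048 sets.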
THE DIE | Floorplan (315 mm²)
[Figure: die floorplan. Labels: four Bulldozer Modules, four 2MB L2 Cache blocks, four 2MB L3 Cache blocks, Northbridge, four HyperTransport™ Phys, DDR3 Phy, MiscIO]
THE DIE | Process Technology
32-nm Silicon-On-Insulator (SOI) Hi-K Metal Gate (HKMG) process from GlobalFoundries
11-metal-layer-stack
Low-k dielectric
Dual strain liners and eSiGe to improve performance.
Multiple VT (HVT, RVT, LVT) and long-channel transistors.
Layer    Type   Pitch
CPP(1)   -      130 nm
M01      1x     104 nm
M02      1x     104 nm
M03      1x     104 nm
M04      1.3x   130 nm
M05      1.3x   130 nm
M06      2x     208 nm
M07      2x     208 nm
M08      2x     208 nm
M09      4x     416 nm
M10      16x    1.6 µm
M11      16x    1.6 µm
(1) Contacted poly pitch
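The Type column reads like a multiple of the 1x (M01) pitch. As a quick check, and assuming that interpretation (the slide does not define the column), the sketch below divides each listed pitch by the M01 pitch.

```python
# Sketch: ratio of each listed pitch to the M01 (1x) pitch.
# Assumption: the "Type" column is a multiple of the M01 pitch.
pitches_nm = {
    "CPP": 130, "M01": 104, "M02": 104, "M03": 104,
    "M04": 130, "M05": 130, "M06": 208, "M07": 208,
    "M08": 208, "M09": 416, "M10": 1600, "M11": 1600,  # 1.6 µm = 1600 nm
}

base = pitches_nm["M01"]
ratios = {layer: pitch / base for layer, pitch in pitches_nm.items()}

for layer, ratio in ratios.items():
    print(f"{layer}: {ratio:.2f}x")
```

The 2x and 4x layers come out exact, while the layers labelled 1.3x and 16x come out at 1.25 and roughly 15.4, which suggests those labels are rounded class names rather than exact multipliers.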
THE BULLDOZER CORE A few highlights
THE BULLDOZER CORE | What’s different about this core?
A Bulldozer module contains 2 cores
Functions with high utilization that can’t be shared without significant compromises exist for each core
– Ex: Integer pipelines, Level1 data caches
Other functions are shared between the cores
– Ex: Floating point pipelines, Level2 cache
This allows two cores to each use a larger, higher-performance function (ex: floating point unit) as they need it for less total die area than having separate, smaller functions for each core
[Figure: “Bulldozer” module block diagram. Labels: Fetch, Decode, two Integer Schedulers, FP Scheduler, eight Pipelines, two 128-bit FMACs, two L1 DCaches, Shared L2 Cache]
THE BULLDOZER CORE | Microarch: Shared Frontend
Decoupled predict and fetch pipelines
Prediction-directed instruction prefetch
Instruction cache: 64K Byte, 2-way
32-Byte fetch
Instruction TLBs:
– Level1: 72 entries, mixed page sizes
– Level2: 512 entries, 4-way, 4K pages
Branch fusion
[Figure: Bulldozer module block diagram. Labels: Prediction Queue, ICache, L1 BTB, L2 BTB, Fetch Queue, Ucode ROM, 4 x86 Decoders; per core: Int Scheduler, Instr Retire, EX/MUL, EX/DIV, two AGen, Ld/ST Unit, L1 DTLB, L1 DCache; shared: FP Scheduler, two 128-bit FMAC, two MMX, FP Ld Buffer, Data Prefetcher, Shared L2 Cache]
THE BULLDOZER CORE | Microarch: Separate Integer Units
Thread retire logic
Physical Register File (PRF)-based register renaming
Unified scheduler per core
Way-predicted 16K Byte Level1 Data cache
Data TLB: 32-entry fully associative
Fully out-of-order load/store
– 2 128-bit loads/cycle
– 1 128-bit store/cycle
– 40-entry Load queue
– 24-entry Store queue
[Figure: Bulldozer module block diagram. Labels: Prediction Queue, ICache, L1 BTB, L2 BTB, Fetch Queue, Ucode ROM, 4 x86 Decoders; per core: Int Scheduler, Instr Retire, EX/MUL, EX/DIV, two AGen, Ld/ST Unit, L1 DTLB, L1 DCache; shared: FP Scheduler, two 128-bit FMAC, two MMX, FP Ld Buffer, Data Prefetcher, Shared L2 Cache]
THE BULLDOZER CORE | Microarch: Shared FPU
Co-processor organization
Reports completion back to parent core
Dual 128-bit Floating Point Multiply/Accumulate (FMAC) pipelines
Dual 128-bit Packed Integer pipelines
PRF-based register renaming
A single floating point scheduler for both cores
[Figure: Bulldozer module block diagram. Labels: Prediction Queue, ICache, L1 BTB, L2 BTB, Fetch Queue, Ucode ROM, 4 x86 Decoders; per core: Int Scheduler, Instr Retire, EX/MUL, EX/DIV, two AGen, Ld/ST Unit, L1 DTLB, L1 DCache; shared: FP Scheduler, two 128-bit FMAC, two MMX, FP Ld Buffer, Data Prefetcher, Shared L2 Cache]
THE BULLDOZER CORE | New Floating Point Instructions
New Instructions | Applications/Use Cases
SSSE3, SSE4.1, SSE4.2 (AMD and Intel) | • Video encoding and transcoding • Biometrics algorithms • Text-intensive applications
AESNI, PCLMULQDQ (AMD and Intel) | • Applications using AES encryption • Secure network transactions • Disk encryption (MSFT BitLocker) • Database encryption (Oracle) • Cloud security
AVX (AMD and Intel) | Floating point intensive applications: • Signal processing / Seismic • Multimedia • Scientific simulations • Financial analytics • 3D modeling
FMA4 (AMD Unique) | HPC applications
XOP (AMD Unique) | • Numeric applications • Multimedia applications • Algorithms used for audio/radio
THE BULLDOZER CORE | Microarch: Shared Level 2 Cache
16-way unified L2 cache
L2 TLB and page walker
– 1024-entry, 8-way
– Services both Instruction and Data requests
Multiple data prefetchers
23 outstanding L2 cache misses for memory system concurrency
[Figure: Bulldozer module block diagram. Labels: Prediction Queue, ICache, L1 BTB, L2 BTB, Fetch Queue, Ucode ROM, 4 x86 Decoders; per core: Int Scheduler, Instr Retire, EX/MUL, EX/DIV, two AGen, Ld/ST Unit, L1 DTLB, L1 DCache; shared: FP Scheduler, two 128-bit FMAC, two MMX, FP Ld Buffer, Data Prefetcher, Shared L2 Cache]
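Why the number of outstanding L2 misses matters for memory system concurrency can be illustrated with Little's Law: sustained bandwidth is roughly the number of requests in flight times the line size, divided by the miss latency. The sketch below uses the 23 outstanding misses and 64-byte cacheline from the slides. The 100 ns latency is a hypothetical value chosen only for illustration and is not a figure from the presentation.

```python
# Sketch: Little's Law applied to outstanding cache misses.
# bandwidth = (requests in flight * bytes per request) / latency
OUTSTANDING_L2_MISSES = 23   # from the slide
LINE_BYTES = 64              # cacheline size from the die overview
ASSUMED_LATENCY_NS = 100.0   # hypothetical miss latency, not from the slides


def sustained_miss_bandwidth_gbs(outstanding, line_bytes, latency_ns):
    """Bytes per nanosecond, which is numerically equal to GB/s."""
    return outstanding * line_bytes / latency_ns


bw = sustained_miss_bandwidth_gbs(OUTSTANDING_L2_MISSES, LINE_BYTES, ASSUMED_LATENCY_NS)
print(f"Upper bound per module at {ASSUMED_LATENCY_NS:.0f} ns: {bw:.2f} GB/s")
```

The point of the sketch is the proportionality: halving the number of misses in flight halves the bandwidth a module can pull at a given latency.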
THE NORTHBRIDGE A few of its highlights
THE NORTHBRIDGE | Simplified Block Diagram
[Figure: northbridge block diagram. Labels: 4 Bulldozer modules (CPU C0/C1 pairs, each with an L2 Cache), Synchronization, System Request Queue (SRQ), Crossbar, L3 CTL and L3 Arrays (8 MB L3 Cache), APIC (interrupts), Memory CTL, DRAM CTL0 and DRAM CTL1 with DDR PHYs (2 DRAM channels), HT CTLs with HT/PCI-E PHYs (4 HyperTransport™ links)]
THE NORTHBRIDGE | SRQ, Crossbar, Memory Controller
System Request Queue and Crossbar
The northbridge SRQ and crossbar form the traffic hub of the design, routing from any requestor to the appropriate destination. Examples: A core request to the L3 cache; a request from another die or chip from a HyperTransport™ link to the memory controller, etc.
Memory Controller
Each memory controller enforces cache coherency and proper order of operations for the memory space it “owns”, distributing this responsibility across multi-die or multi-chip systems. This allows multi-die or multi-chip systems to scale rather than bottleneck at a single coherency/ordering choke point in the system.
The memory controller also sends the read/write requests to the DRAM controllers.
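The idea that each memory controller “owns” part of the memory space can be sketched as an address-to-home-node lookup followed by a routing decision. The address map, node numbering and function names below are hypothetical and only illustrate the concept; the real address decoding and routing tables are not described in these slides.

```python
# Sketch: route a request to the memory controller that owns the address.
from bisect import bisect_right

# Hypothetical address map: (base physical address, owning node), sorted by base.
ADDRESS_MAP = [
    (0x0_0000_0000, 0),
    (0x4_0000_0000, 1),
    (0x8_0000_0000, 2),
    (0xC_0000_0000, 3),
]
BASES = [base for base, _ in ADDRESS_MAP]


def home_node(phys_addr):
    """Return the node whose memory controller owns this (non-negative) address."""
    idx = bisect_right(BASES, phys_addr) - 1
    return ADDRESS_MAP[idx][1]


def route(requesting_node, phys_addr):
    """Decide where the crossbar would send the request in this toy model."""
    owner = home_node(phys_addr)
    if owner == requesting_node:
        return "local memory controller"
    return f"HyperTransport link toward node {owner}"


print(route(0, 0x1_0000_0000))
print(route(0, 0x9_0000_0000))
```

Because ordering and coherency for an address are resolved only at its owning node, adding nodes adds ordering points instead of loading a single one.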
THE NORTHBRIDGE | DRAM Controllers, L3 Cache, Probe Filter
DDR3 DRAM Controllers
– Two per die
– Supports Unbuffered (UDIMM), Registered (RDIMM) and Load-Reduced (LRDIMM) memory, 1.50V, 1.35V and 1.25V depending on the processor package, with frequencies up to DDR3-1866
– New power-saving features: Low-voltage DDR3, fast self-refresh entry with internal clocks gated, aggressive precharge power down, tristate addr/cmd/bank w/chip-select deassertion, throttle activity for thermal or power reduction, put RX circuits in standby when no reads active, etc.
L3 Cache and Probe Filter
– Up to 8 MB of Level3 cache, normally shared, can be partitioned
– L3 data ECC protected (single-bit correct, double-bit detect)
– Probe Filter: the northbridge can filter coherency traffic to improve system bandwidth
– Space in the L3 array is used to hold Probe Filter data when it’s enabled
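For a sense of scale, peak DRAM bandwidth can be estimated from the transfer rate and the data width of a channel. The sketch below assumes that 64 of the 72 bits in each channel carry data and the remaining 8 carry ECC, which is the usual arrangement for ECC DIMMs but is not stated on the slide. It also uses the nominal 1866 MT/s figure implied by the DDR3-1866 name.

```python
# Sketch: theoretical peak DDR3 bandwidth per die.
# Assumption: 64 data bits per 72-bit channel (8 bits of ECC).
def ddr3_peak_gbs(mega_transfers_per_s, data_bits=64, channels=1):
    """Peak bandwidth in GB/s (10^9 bytes/s)."""
    bytes_per_transfer = data_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


one_channel = ddr3_peak_gbs(1866)
both_channels = ddr3_peak_gbs(1866, channels=2)
print(f"DDR3-1866, one channel: {one_channel:.1f} GB/s")
print(f"DDR3-1866, two channels: {both_channels:.1f} GB/s")
```

These are theoretical peaks; sustained bandwidth depends on access pattern, refresh and the power-saving features listed above.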
THE NORTHBRIDGE | Four 16-bit HyperTransport™ Links
High-speed unidirectional signaling, 16 bits receive/16 bits transmit
– A 16-bit link can be unganged into two 8-bit links
Up to 6.4 Gigatransfers/sec/bit (16-bit link = 12.8 GByte/s RX, 12.8 GByte/s TX)
– Maximum link frequency depends on package, chipset capability and printed circuit board signal integrity
Number of links varies with package
– AM3+ has a single link to the chipset
– C32 has three links for chipset and/or dual-processor coherent interconnect
– G34 has four external links for chipset and/or multi-processor coherent interconnect (plus internal links between the dice in the multi-chip module)
In retry mode, bit errors on the link are detected by CRC and the affected packets are reissued for highly reliable operation.
Power consumption is minimized based on configuration (ex: link width/frequency) and activity.
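The 12.8 GByte/s figure follows directly from the transfer rate and link width. A minimal sketch of that arithmetic, with an illustrative function name:

```python
# Sketch: per-direction HyperTransport link bandwidth.
def ht_link_gbytes_per_s(gigatransfers_per_s, width_bits=16):
    """Each transfer moves width_bits bits in one direction."""
    return gigatransfers_per_s * width_bits / 8


print("16-bit link at 6.4 GT/s:", ht_link_gbytes_per_s(6.4), "GByte/s per direction")
print("8-bit unganged link at 6.4 GT/s:", ht_link_gbytes_per_s(6.4, width_bits=8), "GByte/s per direction")
```

Unganging a 16-bit link into two 8-bit links keeps the aggregate bandwidth the same while giving two independent connections.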
POWER MANAGEMENT A few key capabilities
POWER MANAGEMENT | Key Features and Functions
The Bulldozer core is architected to be power-efficient
– Minimize silicon area by sharing functionality between two cores
All blocks and circuits have been designed to minimize power (not just in the Bulldozer core)
– Extensive flip-flop clock-gating throughout design
– Circuits power-gated dynamically
Numerous power-saving features under firmware/software control
– Core C6 State (CC6)
– Core P-states/AMD Turbo CORE
– Application Power Management (APM)
– DRAM power management
– Message Triggered C1E
POWER MANAGEMENT | Core C6 State (CC6)
Core C6: if a core isn’t active, remove power
Implemented in this physical design by a power gating ring that isolates the Core VSS for each Bulldozer module from the “Real” VSS
CC6 entry: when both Bulldozer cores in the module are idle, flush caches and dump register state to CC6 save space, then gate Core VSS
CC6 exit: ungate Core VSS, reload CC6 saved state, resume execution (ex: service interrupts, etc.)
[Figure: power gating ring around a Bulldozer module. Labels: PG I/O, Core I/O, Core VSS, “Real” VSS, Power Gating Ring Control, RUN, WAKE, Power Gating FETs, Small FET, Large FET]
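The CC6 entry and exit sequence described above can be sketched as a small state machine. This is a toy model under stated assumptions: the class, method names and placeholder register contents are invented for illustration, and the real sequence is carried out by hardware and firmware in ways the slide does not detail.

```python
# Sketch: toy model of CC6 entry/exit for one two-core Bulldozer module.
class BulldozerModule:
    def __init__(self):
        self.core_idle = [False, False]
        self.power_gated = False
        self.cc6_save_space = None
        # Placeholder register contents, purely illustrative.
        self.registers = {"core0": "state-A", "core1": "state-B"}
        self.log = []

    def set_idle(self, core, idle):
        self.core_idle[core] = idle
        # CC6 entry requires BOTH cores in the module to be idle.
        if all(self.core_idle) and not self.power_gated:
            self._enter_cc6()

    def interrupt(self, core):
        # An interrupt wakes the module if needed, then the core runs.
        if self.power_gated:
            self._exit_cc6()
        self.core_idle[core] = False
        self.log.append(f"core {core} services interrupt")

    def _enter_cc6(self):
        self.log.append("flush caches")
        self.cc6_save_space = dict(self.registers)
        self.registers = None  # state is lost once power is removed
        self.log.append("save register state")
        self.power_gated = True
        self.log.append("gate Core VSS")

    def _exit_cc6(self):
        self.power_gated = False
        self.log.append("ungate Core VSS")
        self.registers = dict(self.cc6_save_space)
        self.log.append("reload saved state")


module = BulldozerModule()
module.set_idle(0, True)   # one idle core is not enough to gate the module
module.set_idle(1, True)   # both idle: the module enters CC6
module.interrupt(1)        # wake, restore state, resume
print(module.log)
```

The model makes the module-level granularity explicit: a single idle core does not gate power, because the gating ring surrounds the whole module.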
POWER MANAGEMENT | P-states, AMD Turbo CORE
Core P-states specify multiple frequency and voltage points of operation
– Higher frequency P-states deliver greater performance but require higher voltage and thus more power
– The hardware and operating system vary which P-state a core is in to deliver performance as needed, but use lower frequency P-states to save power whenever possible
AMD Turbo CORE: when the processor is below its power/thermal limits the frequency and voltage can be boosted above the normal maximum and stay there until it gets back to the power/thermal limits
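Why lower P-states save so much power follows from the usual CMOS approximation that dynamic power scales with frequency times voltage squared. The sketch below applies that relation to a hypothetical P-state table. The frequencies and voltages are made up for illustration and are not Bulldozer specifications, and the selection policy is a simplification of what hardware and the operating system do.

```python
# Sketch: relative dynamic power across P-states, using P ~ f * V^2.
# Hypothetical (name, GHz, volts) entries, NOT actual product values.
P_STATES = [
    ("P0", 3.6, 1.30),
    ("P1", 3.0, 1.20),
    ("P2", 2.2, 1.05),
    ("P3", 1.4, 0.90),
]


def relative_dynamic_power(freq_ghz, volts, ref=P_STATES[0]):
    """Dynamic power relative to the reference P-state."""
    _, ref_freq, ref_volts = ref
    return (freq_ghz * volts ** 2) / (ref_freq * ref_volts ** 2)


def pick_p_state(required_ghz):
    """Lowest-frequency P-state that still meets the demand, else P0."""
    for name, freq, _ in reversed(P_STATES):
        if freq >= required_ghz:
            return name
    return P_STATES[0][0]


for name, freq, volts in P_STATES:
    print(name, f"{relative_dynamic_power(freq, volts):.2f}")
print("Demand of 2.0 GHz ->", pick_p_state(2.0))
```

Because voltage enters squared, dropping both frequency and voltage reduces power faster than it reduces performance, which is why the lowest adequate P-state is preferred.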
TURBO CORE | Base, All Core Boost and Max Boost
All Core Boost
When there is TDP headroom in a given workload, AMD Turbo CORE technology is automatically activated and can increase clock speeds across all cores.
Max Turbo Boost
When a lightly threaded workload sends half the Bulldozer modules into C6 sleep state but also requests max performance, AMD Turbo CORE technology can increase clock speeds for the active Bulldozer modules.
[Figure: three panels labelled “Base frequency with TDP headroom”, “All core boost activated”, and “Max turbo activated”]
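The two boost conditions can be written as a simple decision function. This is only a sketch of the policy as the slide words it: the function name, its parameters and the wattages in the usage lines are hypothetical, and the actual Turbo CORE control loop is not described here.

```python
# Sketch: which boost level applies, following the slide's two conditions.
def turbo_level(estimated_power_w, tdp_w, modules_in_c6,
                total_modules=4, max_perf_requested=True):
    # No headroom below the TDP limit: stay at base frequency.
    if estimated_power_w >= tdp_w:
        return "base"
    # At least half the modules asleep and max performance requested.
    if max_perf_requested and modules_in_c6 * 2 >= total_modules:
        return "max turbo"
    # Headroom with most modules active.
    return "all core boost"


print(turbo_level(125, 125, modules_in_c6=0))
print(turbo_level(100, 125, modules_in_c6=0))
print(turbo_level(80, 125, modules_in_c6=2))
```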
THE 1-SOCKET DESKTOP AM3+ PROCESSOR CODENAMED “ZAMBEZI”
940-pin lidded micro PGA package
1.27mm pin pitch
31 row x 31 column pin array
C4 die attach
ZAMBEZI | Performance Desktop Processor
“Zambezi”, for the AM3+ Platform
AM3+ socket infrastructure adds:
– Support for low-voltage DRAM
– Increased ILDT current for higher frequency HyperTransport™ link (2.0A maximum per Gen3 link)
– Increase in IDDR current (4.0A maximum)
Older AM3 processors are plug-in compatible with AM3+ motherboards
2 memory channels, unbuffered DIMMs, up to DDR3-1866
1 HyperTransport™ link, up to 5.2 GT/s
For use with AMD 9-Series chipsets
– AMD 990FX, AMD 990X, AMD 970, AMD SB950
[Figure: Zambezi block diagram. Labels: four 2-core modules, each with L2; Northbridge; L3 Cache; DRAM CTL’s; PHY; 2 Memory Channels; four HT PHYs, one with an HT Link and three marked NC]
THE 1 OR 2-SOCKET SERVER C32 PROCESSOR CODENAMED “VALENCIA”
1207-land lidded micro LGA package
1.10mm land pitch
35 row x 35 column land array
C4 die attach
VALENCIA | 1-2 Socket Server Processor
“Valencia”, for the C32 Platform
Valencia is compatible with existing C32 motherboards (AMD Opteron™ 4000 series processor-based platform) with appropriate BIOS update
2 memory channels, UDIMM, RDIMM or LRDIMM, up to DDR3-1600
3 HyperTransport™ links, up to 6.4 GT/s
– Most designs use only 2 links to achieve lower Thermal Design Power (TDP)
Advanced Platform Management Link (APML)
1 or 2 socket systems
For use with AMD server chipsets
– AMD SR5690, AMD SR5670, AMD SR5650, AMD SP5100
[Figure: Valencia block diagram. Labels: four 2-core modules, each with L2; Northbridge; L3 Cache; DRAM CTL’s; PHY; 2 Memory Channels; HT PHYs with HT Links and NC markings; APML]
VALENCIA | 2-socket system example
System design optimized to provide maximum performance for minimum cost and power for 1-2 socket servers
Up to 16 cores in a two socket system
Two DDR3 Memory channels per socket
AMD SR5690: 42 PCI Express® lanes
AMD SR5670: 30 PCI Express® lanes
AMD SR5650: 22 PCI Express® lanes
SP5100 Southbridge: SATA, PCI, USB
Green = Coherent HyperTransport™ links
Red = Non-coherent HyperTransport™ links
Blue = PCI Express® expansion I/O
[Figure: 2-socket system diagram. Labels: two Valencia processors, SR56x0 Northbridge, PCI-Express expansion I/O, SP5100 Southbridge]
THE 1 TO 4-SOCKET SERVER G34 PROCESSOR CODENAMED “INTERLAGOS”
1944-land lidded LGA package
1.00mm land pitch
57 row x 40 column land array
C4 die attach
INTERLAGOS | 1-4 Socket Server Processor
“Interlagos”, for the G34 Platform
Two die in a multichip module for 1 to 4 socket systems
Compatible with existing G34 motherboards (AMD Opteron™ 6000 series processor-based platform) with appropriate BIOS update
• Up to 16 MB of combined Level3 cache
• 4 memory channels, UDIMM, RDIMM or LRDIMM, up to DDR3-1600
• 4 external HyperTransport™ links, up to 6.4 GT/s (plus internal die-to-die links)
• Advanced Platform Management Link (APML)
• For use with AMD server chipsets
• AMD SR56x0, AMD SP5100
OS views it as one multi-core processor with up to 16 cores
[Figure: Interlagos block diagram. Labels: Master Die and Slave Die, each with four 2-core modules with L2, Northbridge, L3 Cache, DRAM CTL’s, PHY, 2 Memory Channels and HT PHYs; APML; “2 and ½ HT Links”; “1 and ½ HT Links”]
INTERLAGOS | 4-socket system example
System design optimized to provide maximum performance for 1-4 socket servers
Up to 64 cores in a four socket system to support highly multi-threaded workloads
– Database, Web, Virtualization, Cloud Computing, High-Performance Computing, etc.
16 DDR3 channels w/4 sockets
– To support demanding memory-bandwidth-intensive workloads
High-performance PCI Express® links via SR56x0 chipset to support demanding I/O-intensive workloads (high-speed network, storage, etc.)
Green = Coherent HyperTransport™ links
Red = Non-coherent HyperTransport™ links
Blue = PCI Express® expansion I/O
[Figure: 4-socket system diagram. Labels: four Interlagos processors, two SR56x0, PCI-Express, SP5100 Southbridge]
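The core and memory channel counts quoted for the example systems follow from the per-die figures given earlier (eight cores and two DDR3 channels per die). A small sketch of that bookkeeping, with an illustrative `Package` type that is not part of any AMD tool:

```python
# Sketch: system totals from per-die resources and package configuration.
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    name: str
    dies: int            # dice per package
    max_sockets: int     # largest supported system
    cores_per_die: int = 8
    dram_channels_per_die: int = 2

    def system_totals(self, sockets):
        if not 1 <= sockets <= self.max_sockets:
            raise ValueError(f"{self.name} supports 1 to {self.max_sockets} sockets")
        total_dies = self.dies * sockets
        return {
            "cores": total_dies * self.cores_per_die,
            "dram_channels": total_dies * self.dram_channels_per_die,
        }


ZAMBEZI = Package("Zambezi", dies=1, max_sockets=1)
VALENCIA = Package("Valencia", dies=1, max_sockets=2)
INTERLAGOS = Package("Interlagos", dies=2, max_sockets=4)

print(VALENCIA.system_totals(2))
print(INTERLAGOS.system_totals(4))
```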
Trademark Attribution AMD, the AMD Arrow logo, AMD Opteron and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. PCI Express is a registered trademark of PCI-SIG. ©2011 Advanced Micro Devices, Inc. All rights reserved.