Transcript
Memory Systems Architecture and Performance Analysis Spring 2005 ENEE 759H Lecture4.fm Bruce Jacob David Wang University of Maryland ECE Dept.
ENEE 759H, Spring 2005 Memory Systems: Architecture and Performance Analysis System Controller
SLIDE 1
Credit where credit is due: Slides contain original artwork (© Jacob, Wang 2005)
UNIVERSITY OF MARYLAND
SLIDE 2
System Controller
[Figure: the System Controller (North Bridge) sits between the CPU (running OS/drivers/etc.), RAM, the AGP 3D graphics processor with its graphics memory (Z-buffer, texture), and the I/O Controller (Southbridge), which serves the hard disk, keyboard, mouse, and Ethernet card. RAM holds game AI, collision detection / geometry info, multi-megabyte textures, and Ethernet packets.]
Heavy demand placed on memory system
Heavier still in SMP/SMT/CMP system
System Controller == System traffic cop
SLIDE 3
Memory Request Overview
[Figure: Progression of a Memory Read Transaction Request Through Memory System. The processor core (DTLB, L1 cache, L2 cache, BIU) connects to the system controller (physical to memory addr mapping, memory request scheduling, read data buffer, I/O to memory traffic) and to the DRAM core; the path is labeled A1 to A4 and B1 to B8.]
Stages of instruction execution: Fetch, Decode, Exec, Mem, WB
Proceeding through the memory hierarchy in a modern processor
Part A: Searching on-chip for data
[A1] Virtual to physical address translation (DTLB access)
[A2] L1 D-Cache access. If miss, then proceed to
[A3] L2 Cache access. If miss, then send to BIU
[A4] Bus Interface Unit (BIU) obtains data from main memory [A4 + B]
Part B: Going off-chip for data
[B1] BIU arbitrates for ownership of address bus **
[B2] Request sent to system controller
[B3] Physical addr. to memory addr. translation
[B4] Memory request scheduling **
[B5] Memory addr. setup (RAS/CAS)
[B6, B7] DRAM dev. obtains data and returns it to controller
[B8] System controller returns data to CPU
** Steps not required for some processor/system controllers; protocol dependent.
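Step [B3], the physical address to memory address translation, can be illustrated with a minimal Python sketch. The field widths and their ordering below are hypothetical, chosen only for illustration; the slides do not give the controller's actual bit layout, and `DramAddress` and `map_physical_address` are made-up names.

```python
from typing import NamedTuple

# Hypothetical field widths, lowest bits first. Not taken from the slides.
OFFSET_BITS = 3    # byte within an 8-byte data bus word
COLUMN_BITS = 10
BANK_BITS = 2
ROW_BITS = 13
RANK_BITS = 2


class DramAddress(NamedTuple):
    rank: int
    row: int
    bank: int
    column: int


def map_physical_address(paddr: int) -> DramAddress:
    """Split a physical address into rank/row/bank/column fields."""
    bits = paddr >> OFFSET_BITS
    column = bits & ((1 << COLUMN_BITS) - 1)
    bits >>= COLUMN_BITS
    bank = bits & ((1 << BANK_BITS) - 1)
    bits >>= BANK_BITS
    row = bits & ((1 << ROW_BITS) - 1)
    bits >>= ROW_BITS
    rank = bits & ((1 << RANK_BITS) - 1)
    return DramAddress(rank, row, bank, column)


print(map_physical_address(0x001CA980))
```

Placing the bank bits just above the column bits is one common choice because consecutive pages then fall in different banks; a real controller may order the fields differently.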
SLIDE 4
“Memory Latency”
[Figure: CPU, memory controller, and DRAM, with the request path labeled A, B, C, D, E1, E2/E3, and F.]
A: Transaction request may be delayed in queue
B: Transaction request sent to memory controller
C: Transaction converted to command sequences (may be queued)
D: Command(s) sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU
“DRAM Latency” = A + B + C + D + E + F
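The latency sum on this slide can be written out as a short Python sketch. Every cycle count here is a made-up placeholder rather than a measurement, and `dram_latency` is a hypothetical helper; the point is only that the E term changes with the state of the DRAM row while A, B, C, D, and F stay the same.

```python
# All cycle counts are illustrative placeholders, not measured values.
DRAM_ACCESS_CYCLES = {
    "E1": 3,          # CAS only: the needed row is already open
    "E2": 3 + 3,      # RAS + CAS: the bank is idle, so the row must be opened
    "E3": 3 + 3 + 3,  # PRE + RAS + CAS: a different row is open and must be closed first
}


def dram_latency(a, b, c, d, case, f):
    """Return A + B + C + D + E + F, with E selected by the row state."""
    return a + b + c + d + DRAM_ACCESS_CYCLES[case] + f


for case in ("E1", "E2", "E3"):
    print(case, dram_latency(a=2, b=1, c=2, d=1, case=case, f=4))
```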
SLIDE 5
Small System Topologies
[Figure: three topologies drawn from CPUs, a system controller with its memory controller, DRAM, and I/O. A legend symbol represents the point of synchronization* (* for local access).]
Classic small system topology (lots of systems)
Point-to-point processor-controller system topology (AMD Athlon/Alpha EV6/PPC 970)
Integrated system controller system topology (AMD Opteron/Alpha EV7 etc.)
SLIDE 6
System Controller: Athlon
[Figure: block diagram with two CPUs, BIU0, BIU1, MRO, MCT, DRAM, APC, AGP, and PCI.]
MRO: Memory Request Organizer
APC: AGP PCI Controller block
MCT: Memory Controller (SDRAM/DDR/DRDRAM)
SLIDE 7
MRO: Memory Request Organizer
• Request crossbar responsible for scheduling memory read and write requests from BIU, PCI, AGP
• Serves as the coherence point
• Requests are reordered to minimize page conflict and maximize page hits
• Anti-starvation mechanism by aging of entries
• Arbitration bypassed during idle conditions to improve latency
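The reordering and aging bullets above can be sketched in Python. This is a toy model under stated assumptions, not the Athlon MRO's actual policy: `Request`, `pick_next`, `schedule`, and the `MAX_AGE` threshold are all hypothetical, and the idle-bypass path is not modeled.

```python
from dataclasses import dataclass

MAX_AGE = 4  # hypothetical: times a request may be passed over before it is forced out


@dataclass
class Request:
    name: str
    bank: int
    row: int
    age: int = 0


def pick_next(queue, open_rows):
    """Pop the next request: a starving entry first, else a page hit, else the queue head."""
    starving = [r for r in queue if r.age >= MAX_AGE]
    if starving:
        choice = max(starving, key=lambda r: r.age)   # anti-starvation by aging
    else:
        hits = [r for r in queue if open_rows.get(r.bank) == r.row]
        choice = hits[0] if hits else queue[0]        # prefer page hits
    queue.remove(choice)
    for r in queue:
        r.age += 1                                    # everything left behind gets older
    open_rows[choice.bank] = choice.row               # the chosen row is now the open page
    return choice


def schedule(requests, open_rows=None):
    """Return request names in the order the toy scheduler would issue them."""
    queue = list(requests)
    open_rows = dict(open_rows or {})
    order = []
    while queue:
        order.append(pick_next(queue, open_rows).name)
    return order


reqs = [Request("a", 0, 7), Request("b", 0, 5), Request("c", 0, 5), Request("d", 0, 7)]
print(schedule(reqs, {0: 5}))
```

With row 5 open in bank 0, the two row-5 requests are issued before the row-7 requests, so the page is switched once instead of three times. The aging rule caps how long a page conflict can be deferred when page hits keep arriving.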
SLIDE 8
AMD Athlon Controller:
Chip Version        Tech & Voltage   Max Core Speed   Die Size (pad limited)   No. of pins
SDRAM, 1P, 2xAGP    0.35um, 3.3V     100 MHz          107 mm2                  492
SDRAM, 2P, 2xAGP    0.35um, 3.3V     100 MHz          130 mm2                  656
DDR, 1P, 4xAGP      0.25um, 2.5V     133 MHz          133 mm2                  553
DDR, 2P, 4xAGP      0.25um, 2.5V     133 MHz          107 mm2                  492
RDRAM, 1P, 4xAGP    0.25um, 2.5V     133 MHz
SLIDE 9
Cache Coherency I
[Figure: block diagram with two CPUs, BIU0, BIU1, MRO, MCT, and DRAM; an arrow is labeled CPU Read Request.]
Request: I would like data for cacheline 0x001CA980
SLIDE 10
Cache Coherency II
[Figure: block diagram with two CPUs, BIU0, BIU1, MRO, MCT, and DRAM; an arrow is labeled Snoop Request.]
Snoop Request: Do you have cacheline 0x001CA980?
Memory Fetch: Give me data for 0x001CA980.
SLIDE 11
Cache Coherency IIIa
[Figure: block diagram with two CPUs, BIU0, BIU1, MRO, MCT, and DRAM; arrows are labeled Data and Snoop Response.]
Snoop Response: No
SDRAM MCT: RAS to rank 2, bank 0, row 0x00842
SDRAM MCT: CAS to rank 2, bank 0, col 0x0C3
SDRAM MCT: Here’s the data.
SLIDE 12
Cache Coherency IIIb
[Figure: block diagram with two CPUs, BIU0, BIU1, MRO, MCT, and DRAM; arrows are labeled CPU Data, Snoop Response, and Stale Data (trash).]
Snoop Response: Yes, I have this cache line
SDRAM MCT: RAS to rank 2, bank 0, row 0x00842
SDRAM MCT: CAS to rank 2, bank 0, col 0x0C3
MRO: Here’s the data.
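The exchange on the Cache Coherency slides can be condensed into a small Python sketch. It assumes that a snoop hit means the other CPU holds the current copy of the line, so the DRAM copy fetched in parallel is stale and thrown away. The function `read_line` and the dictionary-based caches are hypothetical stand-ins, not the Athlon's actual protocol.

```python
def read_line(addr, requester, caches, dram):
    """Resolve a read at the coherence point; return (data, source)."""
    # Snoop every CPU other than the requester for the line.
    for cpu, cache in caches.items():
        if cpu != requester and addr in cache:
            # Snoop hit: use the cached copy and discard the stale DRAM data.
            return cache[addr], f"cache of {cpu}"
    # Snoop miss everywhere: the DRAM copy is the most current one.
    return dram[addr], "DRAM"


LINE = 0x001CA980
dram = {LINE: "old value"}

# Case IIIa: the other CPU does not hold the line.
print(read_line(LINE, "cpu0", {"cpu0": {}, "cpu1": {}}, dram))

# Case IIIb: the other CPU holds a newer copy.
print(read_line(LINE, "cpu0", {"cpu0": {}, "cpu1": {LINE: "new value"}}, dram))
```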
SLIDE 13
Why worry about CC? Part 1
[Figure: two CPUs and DRAM around the “Point of Synchronization”, comparing the distance to data in cache with the distance to data in DRAM.]
What if distance to DRAM is shorter than distance to cache (in another CPU)?
SLIDE 14
Why worry about CC? Part 2
[Figure: Intel P6 system bus read transaction latency breakdown. Bus cycles 1 through 18 are divided into arbitration, command, error (checking), snoop, response, and data phases.]
Data returned by system to CPU is assumed to be the “most current copy”.
Read transaction latency is bound by max(snoop_latency, dram_latency)
[Figure: two topologies, each with cpu 0 through cpu 3, a system controller (northbridge), and RAM.]
Processors can grab request address off of shared bus in shared multi-drop topology
System controller rebroadcasts request address to aid in snoop for point-to-point topology
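The bound max(snoop_latency, dram_latency) stated on this slide can be tried out with a few numbers. The nanosecond values below are invented for illustration, and `read_latency` is a hypothetical helper. The sketch assumes the snoop and the DRAM access start together and that data cannot be returned until both have resolved.

```python
def read_latency(snoop_latency, dram_latency):
    """Latency of a read whose snoop and DRAM access run in parallel."""
    return max(snoop_latency, dram_latency)


SNOOP_NS = 60  # illustrative value only

for dram_ns in (90, 70, 50, 30):
    print(f"dram={dram_ns} ns -> read={read_latency(SNOOP_NS, dram_ns)} ns")
```

Once the DRAM access is faster than the snoop, further DRAM speedups no longer shorten the read. This is the situation the Part 1 slide asks about.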
SLIDE 15
Multiple Clock Domains I
[Figure: System Controller connected to the CPU, RAM, High Bandwidth I/O, and Low Bandwidth I/O.]
CPU: command: 133 MHz, address: 133*2 MHz, data: 133*4 MHz
High Bandwidth I/O: AGP: 66*4 MHz
Low Bandwidth I/O: PCI: 33 MHz
RAM: DDR: 133*2 MHz, DDR: 166*2 MHz
Most clock domains are integer multiples of each other
SLIDE 16
Multiple Clock Domains II
[Figure: System Controller connected to the CPU, RAM, High Bandwidth I/O, and Low Bandwidth I/O.]
CPU: Processor bus 66/100/133 MHz
High Bandwidth I/O: AGP: 66*2 MHz (32 bit)
Low Bandwidth I/O: PCI: 33 MHz (32 bit)
RAM: SDRAM: 100/133 MHz
What if clock domains are not integer multiples of each other?
SLIDE 17
Multiple Clock Domains III
AMD Athlon Chipset Gearbox Logic
[Figure: timing diagram with FastClk (100 MHz), SlowClk (66 MHz), Data (D0, D1, D2), LatchEn, and Latched Data (D0, D1, D2) waveforms.]
[Figure: Gear Box Module joining the fast clock domain (FastClk) and the slow clock domain (SlowClk), drawn with D flip-flops (Dff), combinational logic (Cmb Logic), and the LatchEn and CtlMask signals.]
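One way to see why a gearbox needs a repeating control pattern is to work out, for a fixed clock ratio, which fast-domain cycles line up with a slow-domain sampling edge. The Python sketch below is an idealized illustration and not the Athlon chipset's logic: `transfer_mask` is a hypothetical helper, the 100 MHz : 66 MHz pair is treated as an exact 3:2 ratio, and setup and hold margins are ignored.

```python
from math import gcd


def transfer_mask(fast, slow):
    """Mark the fast-domain cycles whose data the slow domain samples.

    fast and slow give the clock ratio (for example 3 and 2). The mask covers
    one repeating window of `fast` fast cycles and `slow` slow cycles.
    """
    g = gcd(fast, slow)
    fast, slow = fast // g, slow // g
    # Slow edge j falls at j/slow of the window; the latest fast edge at or
    # before it is floor(j * fast / slow).
    sampled = {(j * fast) // slow for j in range(slow)}
    return [1 if i in sampled else 0 for i in range(fast)]


print(transfer_mask(3, 2))  # 100 MHz : 66 MHz, treated as 3:2
```

In this model the fast domain has one cycle in every three in which nothing can be handed to the slow domain, so the fast side has to hold or skip data during that cycle.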
SLIDE 18
Multiple Clock Domains IV
[Figure: timing diagram of d0 through d3. Data transfer from 100 MHz clock domain to 133 MHz clock domain (Latency Optimal)]
[Figure: timing diagram of d0 through d3. Data transfer from 100 MHz clock domain to 133 MHz clock domain (Bandwidth Optimal)]
SLIDE 19
Multiple Clock Domains V
[Figure: timing diagram of d0 through d3. Data transfer from 400 MHz clock domain to 800 MHz clock domain (Latency Optimal)]
[Figure: timing diagram of d0 through d3. Data transfer from 400 MHz clock domain to 800 MHz clock domain (Bandwidth Optimal)]
SLIDE 20
Multiple Clock Domains VI
Processor to Processor Bus Interface
[Figure: three timing diagrams, each showing numbered CPU clock ticks against the bus clock.]
CPU Clock Domain (10:2): 500 MHz (harmonic), ticks 0 to 9, Bus Clock - 100 MHz
CPU Clock Domain (11:2): 550 MHz (non-harmonic), ticks 0 to 10, Bus Clock - 100 MHz
CPU Clock Domain (12:2): 600 MHz (harmonic), ticks 0 to 11, Bus Clock - MHz
Fractional multipliers could impact performance, but we may not have a choice
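The harmonic and non-harmonic labels on this slide follow from whether the CPU clock is a whole-number multiple of the bus clock. A minimal Python sketch using exact fractions is shown here; `multiplier`, `is_harmonic`, and `bus_cycles_between_alignments` are hypothetical helper names.

```python
from fractions import Fraction


def multiplier(cpu_mhz, bus_mhz):
    """CPU-to-bus clock multiplier as an exact fraction."""
    return Fraction(cpu_mhz, bus_mhz)


def is_harmonic(cpu_mhz, bus_mhz):
    """True when the CPU clock is an integer multiple of the bus clock."""
    return multiplier(cpu_mhz, bus_mhz).denominator == 1


def bus_cycles_between_alignments(cpu_mhz, bus_mhz):
    """Bus cycles between instants at which a CPU edge and a bus edge coincide."""
    return multiplier(cpu_mhz, bus_mhz).denominator


for cpu in (500, 550, 600):
    kind = "harmonic" if is_harmonic(cpu, 100) else "non-harmonic"
    print(cpu, "MHz: multiplier", multiplier(cpu, 100), kind,
          "aligns every", bus_cycles_between_alignments(cpu, 100), "bus cycle(s)")
```

At 550 MHz the multiplier is 11/2, so CPU and bus edges coincide only on every second bus cycle, and a transfer that arrives on the other bus edge has to wait for synchronization.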
SLIDE 21
Multiple Clock Domains VII
[Figure: AMD Athlon SPEC CPU FP Completion Time. Total Task Time (seconds, 4800 to 5300) versus Cycle Time (ns, 0.58 to 0.68) for the 1466, 1533, 1600, 1666, and 1733 MHz parts; each point is marked as a Harmonic Node or a Non-Harmonic Node.]
SLIDE 22
AMD Opteron
[Figure: Opteron block diagram with processor core, L1 and L2 caches, System Request Queue, crossbar, memory controller, DRAM, and three hypertransport links (to CPU or I/O), annotated with steps 1 through 17 and the clock domain crossing. Brackets mark the L2 Access Latency, the L2 Miss Latency, and the “memory latency”.]
L2 Miss Latency:
1 L1 Access Miss
2 L2 Request
3 L2 Tag
4 Addr to NB
5 Clock Boundary
6 SRQ
7 GART/Addr Decode
8 Crossbar
9 Coh./Order Check
10 Memory Controller
11 Req to DRAM Pins
12 ... DRAM Access
13 Data to MCT
14 NB Route
15 Clock Boundary
16 CPU route/mux/ECC
17 Write L1 D$ & FWD
L2 Access Latency: L2 Request, L2 Tag, L2 Data, Route/Mux/ECC, W L1 D$ & FWD
NB = northbridge
SLIDE 23
Summary
• System Controller is a “traffic cop”
• Traffic cop may have to deal with clock domain synchronization issues
• Handles cache coherency for small-scale SMP configurations
• “Memory Latency” depends on lots of little things, not just speed of DRAM.