Transcript
ENEE 759H, Spring 2005: Memory Systems: Architecture and Performance Analysis
Fully Buffered DIMM Memory System
Bruce Jacob, David Wang, University of Maryland ECE Dept.

SLIDE 1: Title
Credit where credit is due: slides contain original artwork (© Jacob, Wang 2005)
SLIDE 2: DRAM Datarate Trends

[Figure: DRAM device datarate (Mb/s, log scale) vs. new generations of DRAM
devices, 1998-2006. Commodity DRAM device datarates are roughly doubling every
3 years: SDRAM (min burst: 1), DDR SDRAM (min burst: 2), DDR2 SDRAM (min
burst: 4). XDR (min burst: 16) reaches 3.2 Gb/s using a differential pair.]
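The trend line on this slide is easy to state as a formula. A minimal sketch;
the 1998 baseline of ~100 Mb/s per pin is an illustrative assumption, not a
number from the slide:

```python
# Sketch: datarate doubling every 3 years (trend from the slide).
# The 1998 SDRAM baseline of ~100 Mb/s per pin is an illustrative assumption.
def projected_datarate(year, base_year=1998, base_rate_mbps=100.0):
    """Project per-pin datarate assuming it doubles every 3 years."""
    return base_rate_mbps * 2 ** ((year - base_year) / 3.0)

for year in (1998, 2000, 2002, 2004, 2006):
    print(year, round(projected_datarate(year)))   # ~100, 159, 252, 400, 635
```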
SLIDE 3: Recall - Multidrop Bus

[Figure: a multidrop bus with one input driving outputs 0, 1, and 2. Each
added load produces more ringing, larger delay, and lower risetime.]

Higher datarates force lower load counts, so capacity is lost:
- SDRAM      (133 MHz):      max 8/6 ranks (loads)
- DDR SDRAM  (400 Mbps):     max 6 ranks
- DDR2       (667/800 Mbps): max 4 ranks
- DDR3       (800+ Mbps):    max 2 ranks
SLIDE 4: DDR2 Memory System

[Figure: a DDR2 memory controller (Ctl) driving Rank 1 and Rank 2 on a shared
bus.]

The question: how do we keep commodity DRAM devices and high datarates, but
maintain or increase capacity (as large servers require)?
SLIDE 5: FB-DIMM Solution

[Figure: controller (Ctl) connected to a chain of FB-DIMMs, each carrying an
Advanced Memory Buffer (AMB). The links are narrow: 14 northbound and 10
southbound bit lanes.]

- Point-to-point chaining of sub-memory systems
- Advanced Memory Buffer (AMB) on each DIMM
- Up to 8 FB-DIMMs per channel
- Uses commodity DRAM devices
SLIDE 6: AMB Block Diagram

[Figure: AMB internals. The point-to-point interface from the controller feeds
pass-through logic and a de-serializer & decode block for command & address;
the data bus interface drives the DRAM data bus; a serializer plus
pass-through and merging logic drive the link back to the controller. Support
blocks: PLL, SMBus controller, LAI (logic analyzer interface) controller, and
a thermal sensor.]
SLIDE 7: Salient Points

• ASIC-to-ASIC signalling at high datarate: differential-pair signalling at 6x
  the DRAM device datarate (e.g. DRAM at 800 Mbps gives an FBD link datarate
  of 4.8 Gbps; see the sketch after this list).
• Asymmetric configuration: higher inbound bandwidth (14 inbound bit-lane
  pairs vs. 10 outbound bit-lane pairs).
• Deskewing is done in the ASIC, not on the board.
• Longer idle-system latency, but much higher pin-bandwidth.
• Keeps the datarate increasing while solving the capacity problem.
• Uses commodity DRAM devices.
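The 6x relationship in the first bullet fixes the raw link rates. A minimal
sketch of the arithmetic, using only figures from this slide:

```python
# Sketch: FBD link rate is 6x the DRAM device datarate (from the slide).
dram_rate_mbps = 800                 # DDR2-800 device datarate
link_rate_mbps = 6 * dram_rate_mbps  # 4800 Mbps (4.8 Gbps) per bit-lane pair

northbound_lanes = 14                # inbound (toward controller)
southbound_lanes = 10                # outbound (away from controller)

print(link_rate_mbps)                     # 4800
print(northbound_lanes * link_rate_mbps)  # 67200 Mbps raw northbound
print(southbound_lanes * link_rate_mbps)  # 48000 Mbps raw southbound
```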
SLIDE 8: Technology Roadmap (ITRS)

Year                               2004       2007       2010       2013       2016
Semi generation (nm)                 90         65         45         32         22
CPU MHz                            3990       6740      12000      19000      29000
MLogic transistors/cm^2            77.2      154.3        309        617       1235
High-perf chip pin count           2263       3012       4009       5335       7100
High-perf chip cost (cents/pin)    1.88       1.61       1.68       1.44       1.22
Memory pin cost (cents/pin)   0.34-1.39  0.27-0.84  0.22-0.34  0.19-0.39  0.19-0.33
Memory pin count                 48-160     48-160     62-208     81-270    105-351

Trend: free transistors & costly interconnects.
SLIDE 9: Choices for Future

[Figure: two organizations. (1) Direct-connect commodity DRAM: the CPU (or its
memory controller) connects straight to the DRAM devices; low bandwidth but
low latency. (2) Indirect connection: the CPU drives a chain of AMBs, each
fronting its own DRAM devices; highest bandwidth and inexpensive DRAM, but
highest latency.]
SLIDE 10: Pin Count Comparison

                                           DDR2      FBD
Datarate (Mbps)                             667     4000
Pin count, data bus                         108        -
  (DDR2: 72 data pins + 18 DQS strobe diff pairs)
Channel pin count (without pwr and gnd)     141       59
Channel pin count (with pwr and gnd)       ~200      ~70
Peak bandwidth (GB/s)                       5.3      8.0
Theoretical efficiency (Mbps per pin)       213      914

~4.5x increase in pin-bandwidth (counting power and ground).
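The efficiency row is just peak bandwidth divided by total channel pins. A
quick check of the table's numbers:

```python
# Sketch: pin-efficiency figures from the table above.
def mbps_per_pin(peak_gb_per_s, channel_pins):
    """Peak bandwidth spread across every channel pin, in Mbps/pin."""
    return peak_gb_per_s * 8 * 1000 / channel_pins

ddr2 = mbps_per_pin(667e-3 * 64 / 8, 200)   # 667 Mbps x 64 data bits = 5.3 GB/s
fbd  = mbps_per_pin(8.0, 70)                # 1.5x of 5.3 GB/s = 8.0 GB/s
print(round(ddr2), round(fbd))              # ~213, ~914 Mbps per pin
print(round(fbd / ddr2, 1))                 # ~4.3x (the slide rounds to ~4.5x)
```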
SLIDE 11: Routing Comparison

1 channel of registered DDR2 SDRAM:  2 routing layers + power plane;
                                     path-length-matched traces;
                                     ~200 traces per channel.
2 channels of FB-DIMM:               2 routing layers including power delivery;
                                     no need to match trace lengths
                                     (deskewing on ASIC);
                                     ~70 traces per channel.
SLIDE 12: FB-DIMM Protocol

[Figure: controller (Ctl) and chained AMBs; 14 northbound and 10 southbound
bit lanes.]

- Pseudo network-like packet protocol
- Fixed packet (frame) size: every frame is 12 FBD beats (one frame time =
  2 DRAM beats = 1 DRAM clock)
- Upstream (to controller): 12 x 14 = 168 bits/frame, 144-bit data payload
  per frame (read data)
- Downstream (from controller): 12 x 10 = 120 bits/frame, 72-bit payload per
  frame (write data)
- Write bandwidth = 0.5x read bandwidth
- Total bandwidth = 1.5x single-module bandwidth
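The 0.5x and 1.5x figures fall straight out of the payload sizes. A minimal
sketch of the frame arithmetic, assuming DDR2-667 devices (one frame per two
DRAM beats):

```python
# Sketch: FB-DIMM frame arithmetic (figures from the slide).
# Assumes DDR2-667 devices; one frame per 2 DRAM beats.
frames_per_sec = 667e6 / 2

read_payload_bits  = 144   # upstream frame (12 beats x 14 lanes = 168 raw bits)
write_payload_bits = 72    # downstream frame (12 beats x 10 lanes = 120 raw bits)

read_gbs  = frames_per_sec * read_payload_bits  / 8e9   # ~6.0 GB/s incl. ECC bits
write_gbs = frames_per_sec * write_payload_bits / 8e9   # ~3.0 GB/s incl. ECC bits
print(write_gbs / read_gbs)                 # 0.5: write bw is half of read bw
print((read_gbs + write_gbs) / read_gbs)    # 1.5: total is 1.5x one direction
# Counting only the 64 data bits of each 72-bit word: 5.3 GB/s read + 2.7 GB/s
# write = 8.0 GB/s, the FBD peak-bandwidth figure on SLIDE 10.
```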
SLIDE 13: Downstream (Southbound) I

[Figure: a southbound frame laid out across bit lanes 0-9.]

- 10 bit lanes x 12 beats = 120 bits per frame
- Carries command + write data: 72 bits of data per frame
- 24 bits of command (max of 3 commands per frame) + 2 bits of command type
- 22 bits of CRC: strong protection in normal mode
- 10 bits of CRC in failover mode (reduced protection)
- Transparent failover: lose a bit lane, keep on going
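The frame bit budget balances exactly in both modes; the check below reflects
my reading that the lost lane's 12 bits come entirely out of the CRC field:

```python
# Sketch: southbound frame bit budget (figures from the slide).
BEATS = 12
data, command, cmd_type = 72, 24, 2

normal   = data + command + cmd_type + 22   # 22-bit CRC, 10 lanes
failover = data + command + cmd_type + 10   # 10-bit CRC,  9 lanes
assert normal   == 10 * BEATS               # 120 bits
assert failover ==  9 * BEATS               # 108 bits: one lane lost, CRC shrunk
```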
SLIDE 14: Downstream (Southbound) II

- DRAM commands:
    Activate (row-bank)
    Write (column)
    Read (column)
    Precharge (all banks)
    Precharge (bank)
    Auto Refresh
    Enter Self Refresh
    Enter Power Down
    Exit Self Refresh / Exit Power Down
- Channel commands:
    Channel NOP (ensures transmission density)
    Sync
    Soft Channel Reset (transient bit-failure recovery)
    Write Config Register
    Read Config Register (allows tuning of the refresh mechanism)
    DRAM CKE per DIMM (per-DIMM power management)
    DRAM CKE per Rank (per-rank power management)
    Debug
SLIDE 15: Write FIFO

[Figure: southbound frames (12 FBD beats per frame) carrying write commands
w0-w7 with read commands r0 and r1 interleaved; two corresponding
burst-of-eight data transfers D0-D3 on the DRAM data bus (2 DRAM beats per
frame); and northbound frames returning the read data.]

- Write FIFO in the AMB is 35 entries deep
- Write data (1 beat per frame) is buffered/queued
- Write-data frames need not be contiguous; they may be separated by an
  arbitrary number of intervening frames
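A toy model of the buffering idea: write-data beats trickle in one per frame,
possibly with gaps, and are drained as a full burst when the write command
issues. The class shape, burst length, and overflow behavior are illustrative,
not from the FBD spec; the slide gives only the depth.

```python
# Sketch: write-data buffering in the AMB, heavily simplified.
from collections import deque

class WriteFIFO:
    DEPTH = 35                     # entries, per the slide

    def __init__(self):
        self.entries = deque()

    def push_beat(self, beat):
        """Queue one beat of write data (one beat arrives per frame)."""
        if len(self.entries) == self.DEPTH:
            raise RuntimeError("controller must not overflow the write FIFO")
        self.entries.append(beat)

    def pop_burst(self, burst_beats=8):
        """Drain a full burst to the DRAM data bus once the write issues."""
        return [self.entries.popleft() for _ in range(burst_beats)]

fifo = WriteFIFO()
for i in range(8):                 # beats may arrive in non-contiguous frames
    fifo.push_beat(f"w0 beat {i}")
print(fifo.pop_burst())            # burst-of-eight to the DRAM devices
```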
SLIDE 16: Upstream (Northbound)

[Figure: controller (Ctl) and chained AMBs; 14 northbound and 10 southbound
bit lanes.]

- 14 bit lanes: 12 x 14 = 168 bits/frame upstream (to controller)
- Two data payloads (72 or 64 bits each) per frame
- The last AMB on the channel initiates northbound frames
- Idle frames carry a permuting data pattern (LFSR)
- Failover modes:
    14-bit channel, 12 bits CRC per payload; failover to a 13-bit channel
    with 6 bits CRC per payload
    13-bit channel, 6 bits CRC per payload; failover to a 12-bit channel
    with ECC coverage only
    12-bit channel, 6 bits CRC per payload; no failover
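If each payload is 72 bits, the three channel widths tile exactly; pairing the
CRC widths with the payloads this way is my reading of the slide:

```python
# Sketch: northbound frame bit budgets (figures from the slide).
BEATS = 12
assert 14 * BEATS == 2 * (72 + 12)  # 14 lanes: two 72-bit payloads, 12-bit CRCs
assert 13 * BEATS == 2 * (72 + 6)   # 13 lanes: two 72-bit payloads,  6-bit CRCs
assert 12 * BEATS == 2 * 72         # 12 lanes: payloads only, ECC coverage only
```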
SLIDE 17: Command Scheduling I

- Short channel: variable read latency capability.

[Figure: two southbound command sequences r0, r1 and their read data D0-D3
returning in northbound frames. In the first, the returning data collides
("Oops"); in the second, the controller delays the second read so the returns
interleave cleanly.]

- Delay a read request until the controller knows that an empty northbound
  frame exists
- Read data merges into the FBD stream: no rank-to-rank turnarounds
SLIDE 18: Command Scheduling II

- Long channel: fixed latency. A longer channel means longer read latency for
  all requests: each AMB buffers/delays its read data so every return takes
  as long as one from the far end of the channel ("token passing").

[Figure: reads r0, r1 issued down the chain; read data D0-D3 is buffered and
delayed at the AMBs before returning northbound.]
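A toy comparison of the two scheduling policies: with variable latency each
DIMM responds with its own round-trip delay, while with fixed latency every
response is padded to the farthest DIMM's delay. The ~2 ns per-hop figure is
quantified on the next slide; the DRAM access time here is a placeholder.

```python
# Sketch: variable vs. fixed read-latency scheduling on an FBD channel.
HOP_NS = 2.0          # AMB pass-through, each direction (see SLIDE 19)
DRAM_NS = 45.0        # illustrative placeholder for device access time

def variable_latency(dimm_index):
    """Short channel: each DIMM returns data as soon as it can."""
    return 2 * HOP_NS * (dimm_index + 1) + DRAM_NS

def fixed_latency(n_dimms):
    """Long channel: all returns padded to the farthest DIMM's latency."""
    return variable_latency(n_dimms - 1)

for i in range(8):
    print(i, variable_latency(i), fixed_latency(8))  # fixed = worst case for all
```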
SLIDE 19: Latency

[Figure: the AMB block diagram of SLIDE 6, annotated with latencies. The
pass-through logic adds ~2 ns per AMB. The path through the de-serializer &
decode to the DRAM command & address pins costs ~8.1 ns @ 667 Mbps; the data
bus interface and serializer path costs ~5.0 ns @ 667 Mbps.]
SLIDE 20: Hypothetical Latency Distribution

[Figure: histogram of the number of accesses at a given latency value (log
scale, 1 to 1e+08) vs. memory access latency (0-800 ns), comparing Direct
Connect against FB-DIMM.]

Longer idle-system latency, but higher pin-bandwidth. Lower average latency?
SLIDE 21: Power Impact

DIMM organization                    9 x8    18 x8 or 18 x4    36 x4
DRAM power (~0.3W per device)        2.7W         5.4W         10.8W
AMB power                              4W           4W            4W
Total FBD power                     ~6.7W        ~9.4W        ~14.8W
Power overhead                      ~148%         ~74%          ~37%
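The overhead row is just AMB power relative to the DRAM power beside it. A
quick check of the table:

```python
# Sketch: FBD power overhead per DIMM (figures from the table above).
AMB_W = 4.0
PER_DEVICE_W = 0.3          # assumed DRAM device power, per the slide

for devices in (9, 18, 36):
    dram_w = devices * PER_DEVICE_W
    total_w = dram_w + AMB_W
    overhead = AMB_W / dram_w
    print(devices, dram_w, total_w, f"{overhead:.0%}")
# 9  -> 2.7W DRAM,  6.7W total, 148% overhead
# 18 -> 5.4W DRAM,  9.4W total,  74% overhead
# 36 -> 10.8W DRAM, 14.8W total, 37% overhead
```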
SLIDE 22: Summary

• Read data frames can be merged, achieving 100% read bandwidth utilization.
• The FBD channel relies on a deep channel (many DIMMs) to obtain "full
  bandwidth efficiency".
• Longer latency, particularly with long-channel configurations.
• Asymmetric configuration: peak write bandwidth = 0.5x peak read bandwidth;
  peak read + write bandwidth = 1.5x single-channel DIMM bandwidth.