Transcript
A HOLISTIC APPROACH Bruce Jacob University of Maryland SLIDE 1
A Holistic Approach to DRAM (and Systems) Prof. Bruce Jacob Electrical & Computer Engineering University of Maryland, College Park
OUTLINE
• Anecdotes, Vision
• Our Past & Present Work
• Anecdotes Revisited
• Conclusions
SLIDE 2
Anecdote I: System Issues
[Figure: Cycles per Instruction (CPI, 0 to 3) versus System Bandwidth (GB/s = Channels * Width * 800MHz), with bars for 32-Byte, 64-Byte, and 128-Byte Bursts. Organizations on the x-axis run from 1 chan x 1 byte (0.8 GB/s) through 1.6, 3.2, 6.4, and 12.8 GB/s up to 4 chan x 8 bytes (25.6 GB/s).]
Benchmark = GCC (SPEC 2000), 2 banks/channel
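The bandwidth formula on the x-axis can be checked with a few lines of C. This is a minimal sketch: the function name `system_bandwidth_gbs` is illustrative and does not come from the talk, and the only constant used is the 800 MHz rate stated on the slide.

```c
#include <assert.h>

/* System bandwidth in GB/s, following the slide's formula:
 * Channels * Width (bytes per channel) * 800 MHz.
 * The function name and signature are illustrative assumptions. */
double system_bandwidth_gbs(int channels, int width_bytes)
{
    const double rate_ghz = 0.8; /* 800 MHz, as stated on the slide */
    assert(channels > 0 && width_bytes > 0);
    return channels * width_bytes * rate_ghz;
}
```

Under this formula, 1 channel x 8 bytes, 2 channels x 4 bytes, and 4 channels x 2 bytes all evaluate to 6.4 GB/s, which is why several organizations share one x-axis position while their CPI can still differ.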
SLIDE 3
Anecdote II: DDR’s DLL
Delay D_CLK of clock from clock input pad to output drivers
[Figure: block diagram (CK_EXT, CK Bufs, CK_INT, DRAM Arrays, Data, DQ_EXT) and timing diagram for a READ command (CK_EXT, CK_INT, CMD, DQ_EXT), marking D_CLK, D_DQ, and D_CLK + D_DQ against the "Ideally aligned" position.]
SLIDE 4
Anecdote II: DDR’s DLL
Point of DLL: to align DQ output with system clock (minimize internal skew by eliminating D_CLK)
[Figure: the block diagram with a DLL added after the clock buffers (labels: CK_EXT, CK Bufs, DLL, CK_INT, DRAM Arrays, Data, DQ_EXT; delays D_CLK, D_DLL, D_CLK + D_DLL, D_DQ). The timing diagram for a READ command carries the labels "Delay introduced by DLL", "Delayed", and "Aligned".]
SLIDE 5
Anecdote II: DDR’s DLL
A handful of alternatives:
[Figure: six ways to time the DATA and strobe lines between the memory controller (MC) and DRAM (D); boxes labeled DLL and V mark the added delay elements.]
• Unassisted
• DLL on DRAM
• DLL on MC
• DLL on module (DIMM)
• Read clock (RCLK)
• Static delay w/ recalibration
SLIDE 6
Anecdote III: Circuit v System
tRRD & tFAW limitations:
[Figure: timing diagram with Cmd, Internal cmd, and Data rows, marking tRRD and tFAW between row activations.]
R = Row Activation Command
C = Column Read Command
tDQS limitations:
[Figure: timing diagram with Clock, Cmd, Internal Cmd, and Data rows, marking tDQS between data bursts.]
SLIDE 7
Vision
Must make circuit-level decisions considering system-level ramifications
Must make system-level decisions considering circuit-level ramifications
(holistic approach)
SLIDE 8
Past Work: Device-Level
FPM, EDO, SDRAM, ESDRAM, DDR:
[Figure labels: CPU and caches, 128-bit 100MHz bus, Memory Controller, eight x16 DRAMs.]
Rambus, Direct Rambus, SLDRAM:
[Figure labels: CPU and caches, 128-bit 100MHz bus, Memory Controller, Fast, Narrow Channel, DRAMs, DIMM.]
[Cuppu et al. ISCA 1999]
SLIDE 9
Past Work: Device-Level
Average Latencies
[Figure: stacked bars of Avg Time per Access (ns, 0 to 500) for PERL, by DRAM Architecture: FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, and DRDRAM. Components: Bus Wait Time, Refresh Time, Data Transfer Time, Data Transfer Time Overlap, Column Access Time, Row Access Time, Bus Transmission Time. Annotations: "Newer DRAMs", "Critical word arrival times".]
[Cuppu et al. ISCA 1999]
SLIDE 10
Past Work: Device-Level
Bandwidth-Enhancing Techniques I:
[Figure: stacked bars of Cycles Per Instruction (CPI, 0 to 5) for PERL, by DRAM Architecture: FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, and DRDRAM, each shown for Yesterday’s CPU, Today’s CPU, and Tomorrow’s CPU. Components: Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, Processor Execution. Annotation: "Newer DRAMs".]
[Cuppu et al. ISCA 1999]
SLIDE 11
Past Work: Device-Level
Bandwidth-Enhancing Techniques II:
[Figure: Execution Time in CPI (PERL), stacked bars of Cycles Per Instruction (CPI, 0 to 5) by DRAM Architecture (10GHz CPU): FPM/interleaved, EDO/interleaved, SDRAM & DDR, SLDRAM x1/x2, RDRAM x1/x2. Components: Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, Processor Execution.]
[Cuppu et al. ISCA 1999]
SLIDE 12
Past Work: System-Level
Even when we restrict our focus …
[Figure: memory organizations built from controllers (C) and DRAMs (D):]
• One independent channel, banking degrees of 1, 2, 4, ...
• Two independent channels, banking degrees of 1, 2, 4, ...
• Four independent channels, banking degrees of 1, 2, 4, ...
800 MHz Channels:            1, 2, 4
Data Bits per Channel:       8, 16, 32, 64
Banks per Channel (Indep.):  1, 2, 4, 8
Bytes per Burst:             32, 64, 128
[Cuppu & Jacob ISCA 2001]
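To get a feel for how large even this restricted space is, the four parameter lists can be multiplied out. The sketch below assumes every combination of the listed values is a candidate organization (the slide does not say whether all of them were simulated); the array and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Parameter values as listed on the slide. */
static const int CHANNELS[]        = {1, 2, 4};
static const int BITS_PER_CHAN[]   = {8, 16, 32, 64};
static const int BANKS_PER_CHAN[]  = {1, 2, 4, 8};
static const int BYTES_PER_BURST[] = {32, 64, 128};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Size of the full cross product (assumption: all combinations count). */
size_t design_space_size(void)
{
    return COUNT(CHANNELS) * COUNT(BITS_PER_CHAN) *
           COUNT(BANKS_PER_CHAN) * COUNT(BYTES_PER_BURST);
}

/* Number of organizations whose total data width
 * (channels * bytes per channel) equals total_bytes. These all have
 * the same peak bandwidth but differ in channels, banks, and burst. */
size_t count_with_total_width(int total_bytes)
{
    size_t n = 0, c, w;
    assert(total_bytes > 0);
    for (c = 0; c < COUNT(CHANNELS); c++)
        for (w = 0; w < COUNT(BITS_PER_CHAN); w++)
            if (CHANNELS[c] * BITS_PER_CHAN[w] / 8 == total_bytes)
                n += COUNT(BANKS_PER_CHAN) * COUNT(BYTES_PER_BURST);
    return n;
}
```

Under that assumption there are 144 organizations in total, and 36 of them share a total width of 8 bytes, so equal peak bandwidth still leaves many distinct design points.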
SLIDE 13
Past Work: System-Level
... the design space is FAR from regular …
[Figure: Cycles per Instruction (CPI, 0 to 3) for GCC versus System Bandwidth (GB/s = Channels * Width * 800MHz), with bars for 32-Byte, 64-Byte, and 128-Byte Bursts. Organizations on the x-axis run from 1 chan x 1 byte (0.8 GB/s) through 1.6, 3.2, 6.4, and 12.8 GB/s up to 4 chan x 8 bytes (25.6 GB/s).]
[Cuppu & Jacob ISCA 2001]
SLIDE 14
Past Work: System-Level
... and the cost of poor judgment is high.
[Figure: Cycles per Instruction (CPI, 0 to 10) for the Worst, Average, and Best Organization across SPEC 2000 Benchmarks: bzip, gcc, mcf, parser, perl, vpr, and average.]
[Cuppu & Jacob ISCA 2001]
SLIDE 15
An Aside
Past work used first-order models.
Present work uses models accurate to second & third order effects …
SLIDE 16
[ Definition: Zero’th Order ]
    ...
    if ( INSTR.is_loadstore ) {
        if (L1_cache_miss( INSTR.daddr )) {
            if (L2_cache_miss( INSTR.daddr )) {
                cycles += DRAM_LATENCY;
                /* OR */
                INSTR.ready = now() + DRAM_LATENCY;
            }
        }
    }
    ...
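The zeroth-order definition above can be turned into a small compilable sketch. The function name, the boolean parameters standing in for the cache lookups, and the value of `DRAM_LATENCY` are all assumptions made for illustration; only the control flow comes from the slide.

```c
#include <assert.h>
#include <stdbool.h>

#define DRAM_LATENCY 100UL /* cycles; hypothetical constant */

/* Zeroth-order memory model: every access that misses both caches is
 * charged the same fixed DRAM_LATENCY, regardless of bank state,
 * queueing, or bus contention. The three flags stand in for
 * INSTR.is_loadstore, L1_cache_miss(), and L2_cache_miss(). */
unsigned long zeroth_order_cycles(bool is_loadstore, bool l1_miss, bool l2_miss)
{
    if (is_loadstore && l1_miss && l2_miss)
        return DRAM_LATENCY;
    return 0UL;
}
```

A simulator built this way would add the returned value to its cycle count for each instruction. Every DRAM access costs the same, so effects such as row hits versus bank conflicts are invisible to it.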
SLIDE 17
An Aside
Past work used first-order models.
Present work uses models accurate to second & third order effects …
SLIDE 18
Past & Present Work
• SimBed: C-language embedded-system model
• CMP$im: Pin tool to model CMP cache (and DRAM) systems. Executes apps to 1 trillion references without sampling
• EmPower: SystemC model of SoC, incorporating heat and power
• DRAMsim: C-language DRAM model (DRAM, BIU, MC; GEMS, M5, etc.)
• SYSim: C-language system model. Models both performance & energy
[Figure: the tools arranged around a CPU and cache ($) diagram; a legend marks "Recent development work".]
SLIDE 19
Past & Present Work
• SimBed: C-language embedded-system model (CASES 2001, IEEETC 2003)
• CMP$im: Pin tool to model CMP cache (and DRAM) systems (HPCA 2006)
• EmPower: SystemC model of SoC, heat & power (SPIE 2005)
• DRAMsim: C-language DRAM model (SIGARCH 2005)
• SYSim: C-language system model (MKP book)
• System-level study (ISCA 2001)
• Device-level study (ISCA 1999, IEEETC 2001)
• Analytical model (IEEETC 1996)
IEEETC 1996:                  System-level analytical tool for cost/performance
ISCA 1999, IEEETC 2001:       DRAM device-level characterization
CASES 2001, IEEETC 2003:      Performance & energy modeling of RTOS, CPU, memory
ISCA 2001, IEEE Micro 2003:   DRAM system-level characterization
SPIE 2005:                    SystemC modeling of energy in systems-on-chip
SIGARCH 2005:                 DRAMsim released to community
ISPASS 2005, HPCA 2006:       Characterization of bioinformatics workloads
SLIDE 20
DRAMsim
Execution of a Load Instruction
[Figure: Processor Core with DTLB, L1 cache, L2 cache, and BIU (Bus Interface Unit); System Controller with physical to memory addr mapping, memory request scheduling, and read data buffer, plus I/O-to-memory traffic; DRAM System with DRAM core. Steps [A1] to [A4] and [B1] to [B8] are marked along the path.]
Stages of instruction execution: IF, ID, EX, MEM, WB
Part A: Searching on-chip for data (CPU clocking domain)
[A1] Virtual to physical address translation (DTLB access)
[A2] L1 D-Cache access. If miss then proceed to
[A3] L2 Cache access. If miss then send to BIU
[A4 + B] Bus Interface Unit (BIU) obtains data from main memory
Part B: Going off-chip for data (DRAM clocking domain)
[B1] BIU arbitrates for ownership of address bus **
[B2] Request sent to system controller
[B3] Phys. addr. to memory addr. translation
[B4] Mem. request scheduling **
[B5] Mem. addr. setup (RAS/CAS)
[B6, B7] DRAM dev. obtains data and returns to controller
[B8] System controller returns data to CPU
** Steps not required for some processor/system controllers; protocol dependent.
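Step [B3], the physical-to-memory address translation, amounts to slicing the physical address into DRAM coordinates. The sketch below uses a hypothetical bit layout (byte offset, then column, bank, rank, and row from low to high bits); it is not the mapping DRAMsim uses, and all names and field widths are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths for an 8-byte-wide channel. */
enum { OFFSET_BITS = 3, COLUMN_BITS = 10, BANK_BITS = 3, RANK_BITS = 1 };

struct mem_addr {
    uint32_t rank, bank, row, column;
};

/* Return the low `bits` bits of *addr and shift them out. */
static uint32_t take_bits(uint64_t *addr, unsigned bits)
{
    uint32_t field = (uint32_t)(*addr & ((1ULL << bits) - 1));
    *addr >>= bits;
    return field;
}

/* Split a physical address into rank, bank, row, and column. */
struct mem_addr map_physical_address(uint64_t paddr)
{
    struct mem_addr m;
    assert(paddr < (1ULL << 48));          /* keeps the row within 32 bits */
    (void)take_bits(&paddr, OFFSET_BITS);  /* drop the byte offset */
    m.column = take_bits(&paddr, COLUMN_BITS);
    m.bank   = take_bits(&paddr, BANK_BITS);
    m.rank   = take_bits(&paddr, RANK_BITS);
    m.row    = (uint32_t)paddr;            /* remaining high bits */
    return m;
}
```

With the column bits lowest, consecutive addresses fall in the same row of the same bank. Placing the bank bits lowest instead would spread consecutive addresses across banks. Which ordering performs better depends on the controller policy and the workload.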
SLIDE 21
DRAMsim
READ:
[Figure: cmd and data timing for a read, marking tRC, tRAS, tRP, tRCD, tCAS, tCAC, and tDATA, with steps numbered 1 to 4.]
1 Active: Open Row. tRCD time later, a CAS command may be issued to the DRAM chip.
2 CAS: Column Read command. tCAS time later, data begins to be placed onto the Data bus. We use tCAC to factor out command transmission time.
3 Data: The number of cycles that the data transmits over the Data bus.
4 Precharge: Close the Row. This command may be issued tRAS time after the Active command. After tRP time, another Active command may be issued.
WRITE:
[Figure: cmd and data timing for a write, marking tRAS, tRP, tRCD, tCWD, tDATA, and tRTR, with steps numbered 1 to 5.]
CWD: Column Write Delay, the number of cycles that the controller must wait before placing the data onto the data bus.
RTR: Retirement delay, for systems with write delay buffers (RDRAM).
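The read protocol above implies that the latency of a single read depends on the state of the bank when the request arrives. The following sketch adds up the named parameters for three cases. The struct, enum, and function names and the example values are hypothetical. It ignores queueing and command transmission time, and for the row-conflict case it assumes tRAS has already elapsed.

```c
#include <assert.h>

/* Timing parameters in cycles; field names follow the slide. */
struct dram_timing {
    unsigned tRCD;  /* Active to CAS */
    unsigned tCAS;  /* CAS to first data */
    unsigned tRP;   /* Precharge to next Active */
    unsigned tRAS;  /* Active to Precharge */
    unsigned tDATA; /* duration of the data burst */
};

/* Hypothetical example values, not taken from any datasheet. */
static const struct dram_timing EXAMPLE_TIMING = {3, 3, 3, 8, 4};

enum bank_state { ROW_HIT, BANK_IDLE, ROW_CONFLICT };

/* Cycles from the first command to the last data beat of one read. */
unsigned read_latency(const struct dram_timing *t, enum bank_state s)
{
    unsigned lat;
    assert(t != 0);
    lat = t->tCAS + t->tDATA;      /* CAS, then the data burst */
    if (s != ROW_HIT)
        lat += t->tRCD;            /* row must be opened first */
    if (s == ROW_CONFLICT)
        lat += t->tRP;             /* another row must be closed first */
    return lat;
}

/* Minimum time between two activations of the same bank. */
unsigned row_cycle_time(const struct dram_timing *t)
{
    assert(t != 0);
    return t->tRAS + t->tRP;       /* tRC, as drawn on the slide */
}
```

With the example values, a row hit takes 7 cycles, an idle bank 10, and a row conflict 13. A model that tracks bank state produces this kind of spread, while a fixed-latency model cannot.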
SLIDE 22
DRAMsim
Memory Access Latency Distribution
2 GHz CPU, 200 MBPS 8 Byte wide DDR SDRAM
[Figure: two histograms, MCF and GCC, of Number of Accesses at each Latency Value versus Memory Access Latency in CPU cycles (0 to 1500). Annotations: "Minimum Latency ~ 180 CPU cycles", "CAS Hit Latency", "Bank Conflict Latency".]
Most accesses satisfied immediately. Latency distribution favors low latency values. Dominant latency “modes” evident.
Most accesses must be pipelined. Long queueing delay, large range of latency values.
SLIDE 23
DRAMsim
http://www.ece.umd.edu/dramsim
SLIDE 24
Accuracy: Why? Benefit: Insights (Anecdote II, revisited)
[Figure: six ways to time the DATA and strobe lines between the memory controller (MC) and DRAM (D); boxes labeled DLL and V mark the added delay elements.]
• Unassisted
• DLL on DRAM
• DLL on MC
• DLL on module (DIMM)
• Read clock (RCLK)
• Static delay w/ recalibration
SLIDE 25
Accuracy: Why? Benefit: Insights (Anecdote II, revisited)
SCHEME      COST             EFFECTIVENESS (Uncertainty in read)
No DLL      0                D_CLK + Xmit + wire + Recv + Clk skew
on DRAM     16xDLL           Xmit + wire + Recv + Clk skew
on MC       2xDLL, 16xVern   wire + Recv
on DIMM     2xDLL, 16xVern   wire + Recv + Clk skew
Read CLK    2xDLL, 16xVern   wire + Recv
Static      16xVern          Xmit + wire + Recv
• Cost = for 2-DIMM system, 8 DRAM parts per DIMM (note: “cost” applies to both die area and power)
• Uncertainty = very rough, intuitive idea
SLIDE 26
Anecdote III, revisited
[Figure: command bus (R, C, P), internal cmd (row access, precharge), and data bus (data burst) over time, with the current draw in abstract units plotted beneath: the quiescent current draw of the active device plus the current draw profile due to device activity.]
R = Row Activation Command
C = Column Read Command
P = Precharge Command
Power consumption in DRAM devices:
• Row activation, data read-out, bank precharge: all are relatively expensive operations
• Current draw of operation additive to quiescent value
… So what’s the big deal?
SLIDE 27
Anecdote III, revisited
tRRD & tFAW: protocol-level limitations placed upon device to limit maximum current draw
[Figure: cmd, internal cmd, and data rows with tRRD and tFAW marked, above the current draw in abstract units over time, showing overlapping current profiles.]
• Severely limits bus efficiency from single rank
• Problem worsens in future: parameters defined in nanoseconds, not cycles
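Why spacing activations apart limits the peak current can be seen with a deliberately crude model. The sketch below treats each row activation as a rectangular current pulse of fixed length and amplitude, added to the quiescent draw. This pulse shape, the abstract units, and the function names are all assumptions made for illustration, not measured device behavior.

```c
#include <assert.h>

/* Largest number of activation pulses that overlap at any instant when
 * one activation is issued every spacing_ns and each draws extra
 * current during the half-open interval [start, start + pulse_ns). */
long max_overlapping_pulses(long pulse_ns, long spacing_ns)
{
    assert(pulse_ns > 0 && spacing_ns > 0);
    return (pulse_ns + spacing_ns - 1) / spacing_ns; /* ceiling division */
}

/* Peak current in abstract units: quiescent draw plus the additive
 * contribution of every pulse that overlaps. */
long peak_current(long quiescent, long pulse_amp, long pulse_ns, long spacing_ns)
{
    assert(quiescent >= 0 && pulse_amp >= 0);
    return quiescent + pulse_amp * max_overlapping_pulses(pulse_ns, spacing_ns);
}
```

In this model, widening the spacing between activations from 10 ns to 40 ns for a 40 ns pulse cuts the number of overlapping pulses from four to one. That is the kind of effect a minimum activation spacing enforces.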
SLIDE 28
Anecdote III, revisited
• tRRD & tFAW: Problem worsens in future: parameters defined in nanoseconds, not cycles
[Figure: cmd, internal cmd, and data timing with tRRD and tFAW marked, repeated at 800Mbps, 1.2Gbps, and 2.4Gbps.]
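Because tRRD and tFAW are fixed in nanoseconds, the number of clock cycles they occupy grows as the data rate rises. The sketch below converts a nanosecond-defined parameter (passed in picoseconds to stay in integer arithmetic) into whole cycles. It assumes a DDR device whose command clock runs at half the per-pin data rate. The function name is illustrative, and the 30 ns value in the note below is a hypothetical tFAW.

```c
#include <assert.h>

/* Convert a timing parameter in picoseconds to whole clock cycles,
 * rounded up, for a DDR device with the given per-pin data rate.
 * Assumes the command clock runs at half the data rate. */
unsigned long timing_ps_to_cycles(unsigned long t_ps, unsigned long datarate_mbps)
{
    unsigned long clk_mhz, scaled;
    assert(datarate_mbps > 0 && datarate_mbps % 2 == 0);
    assert(t_ps <= 1000000UL && datarate_mbps <= 8000UL); /* avoid overflow */
    clk_mhz = datarate_mbps / 2;
    scaled = t_ps * clk_mhz;                 /* ps * MHz = cycles * 1e6 */
    return (scaled + 999999UL) / 1000000UL;  /* ceiling division */
}
```

For a hypothetical 30 ns tFAW, this gives 12 cycles at 800 Mbps, 18 at 1.2 Gbps, and 36 at 2.4 Gbps. Over the same range a data burst of fixed length keeps the same cycle count, so a growing share of each window goes unused.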
SLIDE 29
Max. Sustainable Bandwidth
[Figure: Maximum Sustainable Bandwidth (GB/s, 0 to 9) for 164.gzip versus Datarate (Mbps), with a Peak Bandwidth line. Legend: Queue Depth 4, 2, 0 (FIFO); Bank Count: Filled = 16 Banks, Outline = 8 Banks; Simulated tFAW values: tFAW = tRC / 2, tFAW = tRC.]
Datarate (Mbps):  533.33  666.66  800  933.33  1066.7  1200  1333.3
Peak BW (GBps):   4.3     5.3     6.4  7.5     8.5     9.6   10.7
tRC = 60ns, burst of eight, 8B wide channel
SLIDE 30
Max. Sustainable Bandwidth
tFAW Impact for Two Different Simulated Values: (tFAW = tRC/2), (tFAW = tRC)
[Figure: Maximum Sustainable Bandwidth (GB/s, 0 to 9) for 164.gzip versus Datarate (Mbps), with a Peak Bandwidth line. Legend: Queue Depth 4, 2, 0 (FIFO); Bank Count: Filled = 16 Banks, Outline = 8 Banks; Simulated tFAW values: tFAW = tRC / 2, tFAW = tRC. Annotations: "Inflection Point For tFAW = tRC / 2"; HI and MED curves = High and Moderate Levels of Reordering (e.g. system sophistication); "No Reordering".]
Datarate (Mbps):  533.33  666.66  800  933.33  1066.7  1200  1333.3
Peak BW (GBps):   4.3     5.3     6.4  7.5     8.5     9.6   10.7
tRC = 60ns, burst of eight, 8B wide channel
16 BANKS improves bandwidth over 8 BANKS by ~10% (how does this compare with incremental cost?)
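A back-of-the-envelope bound helps place the bandwidth ceiling. tFAW is the window within which at most four row activations may be issued to a rank. If every burst needs its own activation (a close-page assumption made here, not stated on the slide), one rank can deliver at most four bursts per tFAW. The sketch below computes that cap next to the channel's peak bandwidth. The function names are illustrative, and this is arithmetic under stated assumptions, not a reproduction of the simulation.

```c
#include <assert.h>

/* Cap on single-rank bandwidth (GB/s) when each burst needs its own
 * activation and at most four activations fit in any tFAW window.
 * Bytes per nanosecond is numerically equal to GB/s. */
double tfaw_bandwidth_cap_gbs(double tfaw_ns, int burst_length, int channel_bytes)
{
    assert(tfaw_ns > 0.0 && burst_length > 0 && channel_bytes > 0);
    return 4.0 * burst_length * channel_bytes / tfaw_ns;
}

/* Peak channel bandwidth (GB/s) at a given per-pin data rate. */
double peak_bandwidth_gbs(double datarate_mbps, int channel_bytes)
{
    assert(datarate_mbps > 0.0 && channel_bytes > 0);
    return datarate_mbps * channel_bytes / 1000.0;
}

/* The smaller of the two limits. */
double sustainable_bound_gbs(double datarate_mbps, double tfaw_ns,
                             int burst_length, int channel_bytes)
{
    double peak = peak_bandwidth_gbs(datarate_mbps, channel_bytes);
    double cap = tfaw_bandwidth_cap_gbs(tfaw_ns, burst_length, channel_bytes);
    return peak < cap ? peak : cap;
}
```

With the slide's parameters (tRC = 60 ns, burst of eight, 8B wide channel), the cap is about 8.5 GB/s for tFAW = tRC/2 and about 4.3 GB/s for tFAW = tRC. These equal the peak bandwidths listed for 1066.7 Mbps and 533.33 Mbps respectively, so under these assumptions raising the data rate beyond those points cannot raise single-rank bandwidth.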
SLIDE 31
But Wait, There’s More …
tDQS: protocol-level limitation placed upon ranks to prevent data-bus collisions on rank hand-off
[Figure: Clock, Cmd, Internal Cmd, and Data rows, with tDQS marked between data bursts.]
• Severely limits bus efficiency from multiple ranks
• Luckily, it is defined in cycles and not nanoseconds
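The cost of rank hand-offs can be estimated with a simple ratio: if the data bus idles for tDQS cycles every time the controller switches ranks, efficiency depends on how many bursts are read between switches. The sketch below is an illustrative model with hypothetical names; it ignores every other source of idle time.

```c
#include <assert.h>

/* Fraction of data-bus cycles that carry data when the controller
 * reads `bursts_per_switch` bursts from one rank, then pays
 * `tdqs_cycles` of idle time to hand the bus to another rank. */
double rank_switch_efficiency(unsigned bursts_per_switch,
                              unsigned burst_cycles,
                              unsigned tdqs_cycles)
{
    double busy;
    assert(bursts_per_switch > 0 && burst_cycles > 0);
    busy = (double)bursts_per_switch * burst_cycles;
    return busy / (busy + tdqs_cycles);
}
```

For example, with a 4-cycle burst and a hypothetical tDQS of 2 cycles, switching ranks after every burst gives about 67% efficiency, while switching after every eight bursts gives about 94%.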
SLIDE 32
Solution I: Scheduling
Problems created by tFAW + tRRD + tDQS
• tFAW + tRRD: Must spread out ACT commands
• tDQS: Must switch ranks infrequently
Salient point: tFAW does not place limit on total number of open banks
Problem can be solved with scheduling: row-column command decoupling (RCCD)
• Schedule ACT commands far before their corresponding READ commands
• Schedule large number of bank-reads before switching ranks
[patent pending]
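Any scheduler that spreads out ACT commands has to answer one question: when is the next activation legal? The sketch below shows that constraint check only. It is not the RCCD algorithm, whose details the slides do not give. It assumes tRRD is the minimum gap between consecutive activations and tFAW is the window that may contain at most four, and all names are hypothetical.

```c
#include <assert.h>

#define ACT_WINDOW 4 /* tFAW covers any four consecutive activations */

/* Issue times (ns) of the most recent activations to one rank,
 * kept as a small ring buffer indexed by activation number. */
struct act_history {
    long times[ACT_WINDOW];
    unsigned count; /* total activations recorded so far */
};

/* Earliest time >= now at which another ACT may be issued. */
long earliest_act(const struct act_history *h, long now, long trrd, long tfaw)
{
    long t = now;
    assert(h != 0 && trrd >= 0 && tfaw >= 0);
    if (h->count > 0) {
        long last = h->times[(h->count - 1) % ACT_WINDOW];
        if (last + trrd > t)
            t = last + trrd;          /* tRRD after the previous ACT */
    }
    if (h->count >= ACT_WINDOW) {
        long fourth_last = h->times[h->count % ACT_WINDOW];
        if (fourth_last + tfaw > t)
            t = fourth_last + tfaw;   /* tFAW after the fourth-last ACT */
    }
    return t;
}

/* Record an activation issued at time `when`. */
void record_act(struct act_history *h, long when)
{
    assert(h != 0);
    h->times[h->count % ACT_WINDOW] = when;
    h->count++;
}

/* Issue n activations as early as allowed, starting at time 0, and
 * return the issue time of the last one. */
long schedule_acts(unsigned n, long trrd, long tfaw)
{
    struct act_history h = {{0, 0, 0, 0}, 0};
    long t = 0;
    unsigned i;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        t = earliest_act(&h, t, trrd, tfaw);
        record_act(&h, t);
    }
    return t;
}
```

With a hypothetical tRRD of 10 ns and tFAW of 60 ns, the first four activations issue at 0, 10, 20, and 30 ns, but the fifth must wait until 60 ns. Column reads to rows that are already open are not subject to this check, which is what makes issuing ACT commands well ahead of their READ commands attractive.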
SLIDE 33
Solution I: Scheduling
[Figure: IPC Speedup Relative to FCFS Scheduling (1 to 1.8) for BRR and RCCD. Benchmarks: art, ammp, applu, mgrid, bzip2, equake, vortex, gcc, mesa, galgel, gzip, twolf. Annotations: "Increasing Bandwidth Utilization", "negligible speedup".]
SLIDE 34
Solution I: Scheduling
[Figure: four histograms of Number of Accesses at Given Latency Value (log scale, 1 to 1e+08) versus Memory Access Latency (ns, 0 to 800): 179.art FCFS, 179.art RCCD, 188.ammp FCFS, 188.ammp RCCD.]
SLIDE 35
Solution II: Topology, etc.
Problems solved by tFAW + tRRD + tDQS
• tFAW + tRRD: Instantaneous current draw in device
• tDQS: Bus collisions on rank handoffs
Any alternative solution will do …
[Figure: MC connected to four DIMMs, with labels "CMD & write data" and "read data".]
• Topology eliminates collisions (can account for static DIMM-DIMM skew with Vernier-type solution)
• Note: solution requires source-synchronous clocking
SLIDE 36
Interesting Side Note
Fully Buffered DIMM
SLIDE 37
Nth Order Effects: Heat, EMI
EmPower: First Target Application
SLIDE 38
Nth Order Effects: Heat, EMI
EmPower: Initial Results
SLIDE 39
Summary
No longer appropriate to optimize subsystems in isolation: local optima do not yield globally optimal system
Systemic behaviors: unanticipated interactions yielding inefficiencies
Specific instances:
• tFAW + tRRD + tDQS severely limits BW
• Choice of DLL on DDR SDRAMs to de-skew parts
Many problems can be addressed by system-level solutions; can be better than circuit-level solutions
SLIDE 40
Et Cetera
(CURRENT) MEMSYS GRAD STUDENTS:
• Dave Wang: DRAMsim, tFAW + tRRD + tDQS studies, etc.
• Aamer Jaleel: CMP$im, bioinformatics, etc.
• Brinda Ganesh: DRAMsim, FB-DIMM power mgmt
• Samuel Rodriguez: SRAM circuit-level details
• Ankush Varma: SystemC system-on-chip energy model
• Sadagopan Srinivasan: SoC memory system issues
• Nuengwong Tuaycharoen: SYSim development
• Hongxia Wang: SRAM circuit integrity issues
• Joe Gross: DRAMsim II development
CONTACT INFO:
• Prof. Bruce Jacob, ECE Dept., University of Maryland, College Park, MD
• www.ece.umd.edu/~blj/
SLIDE 41
DRAM: Brief Primer
[Figure: a DRAM cell, with Word Line, Bit Line, switching element (transistor), and storage element (capacitor), shown next to a Dual In-line Memory Module (DIMM) (printed circuit board w/ DRAM chips on it).]
SLIDE 42
DRAM: Brief Primer
[Figure: typical PC-style desktop system. Labels: CPUs with Primary Cache and Secondary Cache, Backside bus, Frontside bus, North Bridge Chipset with Memory Controller, DRAM bus, Memory modules of DRAM (DRAM Arrays), AGP bus, Graphics Co-Processor, PCI bus, SCSI Controller, SCSI bus, Hard Drive/s, Network Interface, South Bridge Chipset with I/O Controller, Keyboard, Mouse, Other Low-BW I/O Devices.]
The memory system (in blue) … and DRAM’s typical place within it.
SLIDE 43
DRAM: Brief Primer
The DRAM device
[Figure: Memory Array of cells, each a switching element and a storage element (capacitor), at the intersections of Word Lines and Bit Lines, with Row Decoder, Sense Amps, Column Decoder, and Data In/Out Buffers.]
SLIDE 44
DRAM: Brief Primer
Access Protocol
READ:
[Figure: cmd and data timing for a read, marking tRC, tRAS, tRP, tRCD, tCAS, tCAC, and tDATA, with steps numbered 1 to 4.]
1 Active: Open Row. tRCD time later, a CAS command may be issued to the DRAM chip.
2 CAS: Column Read command. tCAS time later, data begins to be placed onto the Data bus. We use tCAC to factor out command transmission time.
3 Data: The number of cycles that the data transmits over the Data bus.
4 Precharge: Close the Row. This command may be issued tRAS time after the Active command. After tRP time, another Active command may be issued.
WRITE:
[Figure: cmd and data timing for a write, marking tRAS, tRP, tRCD, tCWD, tDATA, and tRTR, with steps numbered 1 to 5.]
CWD: Column Write Delay, the number of cycles that the controller must wait before placing the data onto the data bus.
RTR: Retirement delay, for systems with write delay buffers (RDRAM).