Transcript
ECE 4100/6100 Advanced Computer Architecture Lecture 11 DRAM Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology With adaptations and additions by S. Yalamanchili for ECE 4100/6100 – Spring 2009
Reading
• Section 5.3
• Suggested Readings
Main Memory Storage Technologies
• DRAM: "Dynamic" Random Access Memory
  – Highest densities
  – Optimized for cost/bit → main memory
• SRAM: "Static" Random Access Memory
  – Densities 1/4 to 1/8 of DRAM
  – Speeds 8-16x faster than DRAM
  – Cost 8-16x more per bit
  – Optimized for speed → caches
The DRAM Cell
[Figure: 1T1C DRAM cell. A word line (control) gates a storage capacitor onto a bit line (information); stack capacitor vs. trench capacitor. Source: Memory Arch Course, Insa. Toulouse]
• Why DRAMs
  – Higher density than SRAMs
• Disadvantages
  – Longer access times
  – Leaky, needs to be refreshed
  – Cannot be easily integrated with CMOS
SRAM Cell
[Figure: 6-transistor SRAM cell with one wordline and a complementary pair of bit lines]
• Bit is stored in a latch using 6 transistors
• To read:
  – set bitlines to 2.5v
  – drive wordline, bitlines settle to 0v / 5v
• To write:
  – set bitlines to 0v / 5v
  – drive wordline, bitlines "overpower" latch transistors
One DRAM Bank
[Figure: one DRAM bank. The address feeds a row decoder, which drives a wordline of the cell array, and a column decoder; the bitlines feed sense amps and I/O gating, which produce the data out]
Example: 512Mb 4-bank DRAM (x4)
[Figure: a x4 DRAM chip with four banks, each 16384 x 2048 x 4. Address multiplexing: BA[1:0] selects the bank, A[13:0] selects one of 16K rows via the row decoder, A[10:0] selects among 2K columns via the column decoder; sense amps and I/O gating drive data out on D[3:0]]
• A DRAM page = 2k x 4 = 1KB
DRAM Cell Array
[Figure: cell array grid with wordlines 0 through 1023 crossing bitlines 0 through 15, one cell at each intersection]
DRAM Basics
• Address multiplexing (see the sketch below)
  – Send row address when RAS asserted
  – Send column address when CAS asserted
• DRAM reads are self-destructive
  – Rewrite after a read
• Memory array
  – All bits within an array work in unison
• Memory bank
  – Different banks can operate independently
• DRAM rank
  – Chips inside the same rank are accessed simultaneously
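To make the address-multiplexing idea concrete, here is a minimal C sketch (not from the lecture) that splits a flat address into bank, row, and column fields. The field widths loosely follow the 512Mb x4 example above; the widths and the bit placement are illustrative assumptions, since real controllers choose their own mappings.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative field widths, loosely following the 512Mb x4 example:
       2 bank bits (BA[1:0]), 14 row bits (A[13:0]), 11 column bits (A[10:0]).
       Placing the column in the low bits is an assumption. */
    #define COL_BITS  11
    #define ROW_BITS  14
    #define BANK_BITS  2

    int main(void) {
        uint32_t addr = 0x01234567;                    /* example word address */
        uint32_t col  = addr & ((1u << COL_BITS) - 1);
        uint32_t row  = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);
        uint32_t bank = (addr >> (COL_BITS + ROW_BITS)) & ((1u << BANK_BITS) - 1);

        /* The row field is sent while RAS is asserted,
           then the column field while CAS is asserted. */
        printf("bank=%u row=%u col=%u\n", bank, row, col);
        return 0;
    }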
Examples of DRAM DIMM Standards
[Figure: two DIMM layouts built from x8 chips. A x64 (no ECC) DIMM uses eight x8 chips covering data bits D0-D63; a x72 (ECC) DIMM adds a ninth x8 chip for check bits CB0-CB7 alongside D0-D63]
DRAM Ranks
[Figure: a dual-rank DIMM. Two ranks of x8 chips share the data lines D0-D63; the memory controller selects a rank with chip selects CS0 and CS1]
DRAM Ranks
[Figure: rank organizations for a 64b channel. A single rank of eight x8 chips; a single rank of sixteen x4 chips; a dual-rank module with two sets of eight x8 chips]
DRAM Organization
[Figure: DRAM organization. Source: Memory Systems Architecture Course, B. Jacobs, Maryland]
Organization of DRAM Modules
[Figure: a memory controller connected over a channel (addr/cmd bus plus data bus) to multi-banked DRAM chips. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]
DRAM Configuration Example
[Figure: DRAM configuration example. Source: MICRON DDR3 DRAM]
Memory Read Timing: Conventional
[Figure: conventional read timing. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]
Memory Read Timing: Fast Page Mode
[Figure: fast page mode read timing. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]
Memory Read Timing: Burst
[Figure: burst read timing. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]
Memory Controller
[Figure: the core sends a transaction request to the memory controller (MC), which converts it to DRAM commands and sends them to the DRAM]
• Consider all of the steps a LD instruction must go through!
  – Virtual → physical → rank/bank
• Scheduling policies are increasingly important (a sketch follows)
  – Give preference to references in the same page?
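As a sketch of the "same page" preference, the hypothetical scheduler below scans the request queue for a row-buffer hit before falling back to the oldest request. The Request and BankState types are invented for illustration and do not come from the slides.

    #include <stddef.h>

    /* Hypothetical types, invented for illustration. */
    typedef struct { int bank; int row; } Request;
    typedef struct { int open_row; } BankState;

    /* Prefer a request whose row is already open in its bank (a row-buffer
       hit needs no new RAS); otherwise issue the oldest request (index 0).
       Returns the queue index to issue, or -1 if the queue is empty. */
    int pick_next(const Request *q, size_t n, const BankState *banks) {
        for (size_t i = 0; i < n; i++)
            if (q[i].row == banks[q[i].bank].open_row)
                return (int)i;
        return n > 0 ? 0 : -1;
    }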
Integrated Memory Controllers
[Figure: integrated memory controllers. *From http://chip-architect.com/news/Shanghai_Nehalem.jpg]
DRAM Refresh
• Leaky storage
• Periodic refresh across DRAM rows
• Rows are inaccessible while being refreshed
• Read, and write the same data back
• Example (checked in the snippet below):
  – 4k rows in a DRAM
  – 100ns read cycle
  – Decay in 64ms
  – 4096 * 100ns = 410µs to refresh once
  – 410µs / 64ms = 0.64% unavailability
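The slide's overhead figure is easy to reproduce; the short C check below uses exactly the slide's numbers (4096 rows, 100ns per row, 64ms decay).

    #include <stdio.h>

    int main(void) {
        double rows        = 4096;
        double t_row       = 100e-9;   /* 100ns read cycle per row */
        double t_retention = 64e-3;    /* cells decay in 64ms */

        double busy = rows * t_row;    /* time to refresh every row once */
        printf("refresh pass: %.1f us\n", busy * 1e6);                /* 409.6 us */
        printf("unavailable:  %.2f%%\n", 100.0 * busy / t_retention); /* 0.64%   */
        return 0;
    }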
DRAM Refresh Styles
• Bursty: refresh all rows back to back, then wait out the retention period; 410µs (= 100ns * 4096) of unavailability in every 64ms window
• Distributed: spread the row refreshes evenly, one 100ns refresh every 15.6µs across each 64ms window
DRAM Refresh Policies
• RAS-Only Refresh
  – The memory controller asserts RAS and drives the row address to be refreshed on the address bus; the DRAM module refreshes that row
• CAS-Before-RAS (CBR) Refresh
  – The memory controller asserts CAS first (with WE# held high), then asserts RAS; no address is involved
  – An address counter inside the DRAM selects the row to refresh and is incremented afterward
Types of DRAM
• Asynchronous DRAM
  – Normal: responds to RAS and CAS signals (no clock)
  – Fast Page Mode (FPM): the row remains open after RAS for multiple CAS commands
  – Extended Data Out (EDO): changes the output drivers to latches, so data can be held on the bus for a longer time
  – Burst Extended Data Out (BEDO): an internal counter drives the address latch, providing data in burst mode
• Synchronous DRAM
  – SDRAM: all of the above, with a clock; adds predictability to DRAM operation
  – DDR, DDR2, DDR3: transfer data on both edges of the clock
  – FB-DIMM: DIMMs connected by point-to-point links instead of a bus, allowing more DIMMs to be incorporated in server-based systems
• RDRAM
  – Low pin count
Main Memory Organizations
[Figure: three organizations of registers, ALU, and cache. (a) a one-word-wide bus to memory; (b) a wide bus to a wide memory; (c) multiple memory banks on a shared bus]
• The processor-memory bus may have a width of one or more memory words
• Multiple memory banks can operate in parallel
  – Transfer from memory to the cache is subject to the width of the processor-memory bus
• Wide memory comes with constraints on expansion
  – Use of error correcting codes requires the complete "width" to be read to recompute the codes on writes
  – Minimum expansion unit size is increased
Word Level Interleaved Memory
[Figure: timing of word-interleaved accesses. Bank 0 holds words 0 and 4, bank 1 words 1 and 5, bank 2 words 2 and 6, bank 3 words 3 and 7; each access takes τ, and accesses to different banks overlap in time]
• Memory is organized into multiple, concurrent banks
• Word-level interleaving across banks
• A single address generates multiple, concurrent accesses
• Well matched to cache line access patterns
• Assuming a word-wide bus, the cache miss penalty is T_address + T_mem_access + #words * T_transfer cycles (see the worked example below)
• Note the effect of a split transaction vs. a locked bus
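As a worked example with made-up latencies (these numbers are not from the slide): if T_address = 1 cycle, T_mem_access = 20 cycles, and T_transfer = 1 cycle per word, a 4-word line costs 1 + 20 + 4 * 1 = 25 cycles with interleaving, whereas a single bank that repeats the full access for every word would cost 4 * (1 + 20 + 1) = 88 cycles.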
Sequential Bank Operation
[Figure: high-order interleaving. The n-m higher-order address bits select the bank; the m lower-order bits select the word within the bank, so an access walks word by word through one module]
• Implement using DRAMs with page mode access (contrasted with low-order interleaving in the sketch below)
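A minimal C sketch (illustrative, with an assumed 4-bank system) contrasting the two bank-selection schemes: sequential operation takes the bank number from the high-order bits, while the word-interleaved scheme of the next slide takes it from the low-order bits.

    #include <stdint.h>

    #define BANK_BITS 2   /* assumed: 4 banks */

    /* High-order (sequential) banking: the top bits pick the bank, so
       consecutive addresses stay in one bank and suit page mode access. */
    uint32_t bank_high_order(uint32_t addr, int addr_bits) {
        return addr >> (addr_bits - BANK_BITS);
    }

    /* Low-order (word) interleaving: the bottom bits pick the bank, so
       consecutive addresses rotate across banks and can overlap in time. */
    uint32_t bank_low_order(uint32_t addr) {
        return addr & ((1u << BANK_BITS) - 1);
    }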
Concurrent Bank Operation
[Figure: low-order interleaving. The m lower-order address bits select the module, the n-m higher-order bits select the word within the module; the modules share ADDR and DATA connections to independent banks]
• Supports arbitrary accesses
• Needs sources of multiple, independent accesses
  – Lock-up free caches, data speculation, write buffers, pre-fetching
Concurrent Bank Operation
[Figure: each memory bank access takes τ; independently addressed banks overlap in time]
• Each bank can be addressed independently
  – Sequence of addresses
• Differences with interleaved memory
  – Flexibility in addressing
  – Requires greater address bandwidth
  – Separate controllers and memory buses
• Support for non-blocking caches with multiple outstanding misses
Data Skewing for Concurrent Access
[Figure: a 3-ordered 8-vector with C = 2, spread across four modules: module 0 holds a2 and a6, module 1 holds a1 and a5, module 2 holds a0 and a4, module 3 holds a3 and a7]
• How can we guarantee that data can be accessed in parallel?
  – Avoid bank conflicts
• Storage scheme:
  – A set of rules that determines, for each array element, the address of the module and the location within the module
  – Design a storage scheme to ensure concurrent access
  – d-ordered n vector: the ith element is in module (d·i + C) mod M (verified in the sketch below)
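The figure's placement can be checked directly from the formula; this tiny C loop (illustrative) prints the module of each element for the slide's example, d = 3, C = 2, M = 4:

    #include <stdio.h>

    int main(void) {
        int d = 3, C = 2, M = 4, N = 8;
        for (int i = 0; i < N; i++)      /* a0->2, a1->1, a2->0, a3->3, ... */
            printf("a%d -> module %d\n", i, (d * i + C) % M);
        return 0;
    }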
Conflict-Free Access
• Conflict-free access to the elements of the vector holds if
  – M >= N
  – M >= N·gcd(M, d) (checked in the sketch below)
• Multi-dimensional arrays are treated as arrays of 1-d vectors
• Conflict-free access for various patterns in a matrix requires
  – M >= N·gcd(M, δ1) for columns
  – M >= N·gcd(M, δ2) for rows
  – M >= N·gcd(M, δ1 + δ2) for forward diagonals
  – M >= N·gcd(M, δ1 - δ2) for backward diagonals
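A small helper (illustrative, not from the slides) makes the condition testable; it checks M >= N·gcd(M, d) for a given module count and stride:

    #include <stdio.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    /* Conflict-free access to N elements at stride d across M modules
       requires M >= N * gcd(M, d). */
    static int conflict_free(int M, int N, int d) {
        return M >= N * gcd(M, d);
    }

    int main(void) {
        printf("%d\n", conflict_free(17, 16, 1));  /* prime M: 1 (conflict free) */
        printf("%d\n", conflict_free(16, 16, 4));  /* gcd = 4: 0 (conflicts)     */
        return 0;
    }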
Conflict-Free Access
• Implications for M = N = an even number?
• For non-power-of-two values of M, indexing and address computation must be efficient
• Vectors that are accessed are scrambled
  – Unscrambling of vectors is a non-trivial performance issue
• Data dependencies can still reduce bandwidth far below O(M)
Avoiding Bank Conflicts: Compiler Techniques
• Many banks

    int x[256][512];
    int i, j;
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks, word accesses conflict: since 512 is a multiple of 128, every x[i][j] for a fixed j falls in the same bank
• Solutions (see the padding sketch below):
  – Software: loop interchange
  – Software: adjust the array size to a prime # ("array padding")
  – Hardware: prime number of banks (e.g. 17)
  – Data skewing
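As a sketch of the array-padding solution (the padded size is an illustrative choice): growing each row from 512 to 513 words makes the row stride coprime with 128 banks, so walking down a column visits a different bank each step.

    /* Original: 512-word rows; with 128 banks, x[i][j] for fixed j maps to
       bank (i*512 + j) % 128 = j % 128, i.e. the same bank for every i. */
    int x[256][512];

    /* Padded: 513-word rows (513 is odd, so gcd(513, 128) = 1); now
       (i*513 + j) % 128 changes with i, spreading a column across banks.
       The extra column is never referenced; it only changes the stride. */
    int x_padded[256][513];

    void scale(void) {
        for (int j = 0; j < 512; j++)
            for (int i = 0; i < 256; i++)
                x_padded[i][j] = 2 * x_padded[i][j];
    }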
Study Guide: Glossary
• Asynchronous DRAM
• Bank and rank
• Bit line
• Burst mode access
• Conflict free access
• Data skewing
• DRAM
• High-order and low-order interleaving
• Leaky transistors
• Memory controller
• Page mode access
• RAS and CAS
• Refresh
• RDRAM
• SRAM
• Synchronous DRAM
• Word interleaving
• Word line
Study Guide
• Differences between SRAM/DRAM in operation and performance
• Given a memory organization, determine the miss penalty in cycles
• Cache basics
  – Mappings from main memory to locations in the cache hierarchy
  – Computation of the CPI impact of miss penalties, miss rate, and hit times
  – Computation of the CPI impact of update strategies
• Find a skewing scheme for concurrent accesses to a given data structure
  – For example, diagonals of a matrix
  – Sub-blocks of a matrix
• Evaluate the CPI impact of various optimizations
• Relate the mapping of data structures to main memory (such as matrices) to cache behavior and the behavior of optimizations