ECE 4100/6100 Advanced Computer Architecture
Lecture 11: DRAM
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology
With adaptations and additions by S. Yalamanchili for ECE 4100/6100 – Spring 2009

Reading
• Section 5.3
• Suggested readings

Main Memory Storage Technologies
• DRAM: "Dynamic" Random Access Memory
  – Highest densities
  – Optimized for cost/bit → main memory
• SRAM: "Static" Random Access Memory
  – Densities ¼ to 1/8 of DRAM
  – Speeds 8–16x faster than DRAM
  – Costs 8–16x more per bit
  – Optimized for speed → caches

The DRAM Cell
[Figure: 1T1C DRAM cell – word line (control), storage capacitor, bit line (information); stack capacitor vs. trench capacitor. Source: Memory Arch Course, INSA Toulouse]
• Why DRAMs
  – Higher density than SRAMs
• Disadvantages
  – Longer access times
  – Leaky; needs to be refreshed
  – Cannot be easily integrated with CMOS logic

SRAM Cell
[Figure: 6T SRAM cell – wordline crossing a bit line and its complement]
• A bit is stored in a latch using 6 transistors
• To read:
  – Precharge the bitlines to 2.5V
  – Drive the wordline; the bitlines settle to 0V / 5V
• To write:
  – Set the bitlines to 0V / 5V
  – Drive the wordline; the bitlines "overpower" the latch transistors

One DRAM Bank
[Figure: address → row decoder → wordline across the cell array; bitlines → sense amps → I/O gating → column decoder → data out]

Example: 512Mb 4-Bank DRAM (x4)
• Address multiplexing: A[13:0] selects one of 16384 rows; A[10:0] selects one of 2048 columns; BA[1:0] selects one of the 4 banks
• Each bank is 16384 x 2048 x 4 bits, with its own row decoder, column decoder, sense amps, and I/O gating
• A DRAM page = 2k x 4 bits = 1KB
• Data out on D[3:0] for a x4 DRAM chip

DRAM Cell Array
[Figure: cell array with wordline0–wordline1023 crossing bitline0–bitline15]

DRAM Basics
• Address multiplexing
  – Send the row address when RAS is asserted
  – Send the column address when CAS is asserted
• DRAM reads are self-destructive
  – The row must be rewritten after a read
• Memory array
  – All bits within an array work in unison
• Memory bank
  – Different banks can operate independently
• DRAM rank
  – Chips inside the same rank
are accessed simultaneously

Examples of DRAM DIMM Standards
[Figure: a x64 (no ECC) DIMM built from eight x8 chips driving D[63:0]; a x72 (ECC) DIMM adds a ninth x8 chip for the check bits CB[7:0]]

DRAM Ranks
[Figure: two ranks (Rank0, Rank1) of x8 chips sharing the data bus; the memory controller selects a rank with chip selects CS0 and CS1]

DRAM Ranks
[Figure: a 64-bit channel built three ways – a single rank of eight x8 chips, a single rank of sixteen x4 chips, or a dual rank with eight x8 chips per rank]

DRAM Organization
[Figure. Source: Memory Systems Architecture Course, B. Jacobs, University of Maryland]

Organization of DRAM Modules
[Figure: memory controller driving an address/command bus and a data bus (one channel) to multi-banked DRAM chips. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]

DRAM Configuration Example
[Figure. Source: MICRON DDR3 DRAM]

Memory Read Timing: Conventional
[Figure. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]

Memory Read Timing: Fast Page Mode
[Figure. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]

Memory Read Timing: Burst
[Figure. Source: Memory Systems Architecture Course, Bruce Jacobs, University of Maryland]

Memory Controller
[Figure: the core sends a transaction request to the memory controller, which converts it into DRAM commands]
• Consider all of the steps a LD instruction must go through!
  – Virtual → physical → rank/bank
• Scheduling policies are increasingly important
  – Give preference to references in the same page?
Integrated Memory Controllers
[Die photo from http://chip-architect.com/news/Shanghai_Nehalem.jpg]

DRAM Refresh
• Storage cells are leaky
• Periodic refresh across the DRAM rows
• A row is unavailable while it is being refreshed
• Refresh = read a row, then write the same data back
• Example:
  – 4K rows in a DRAM
  – 100ns read cycle
  – Data decays in 64ms
  – 4096 x 100ns ≈ 410µs to refresh the whole array once
  – 410µs / 64ms ≈ 0.64% unavailability

DRAM Refresh Styles
• Bursty: refresh all 4096 rows back to back (410µs = 100ns x 4096) once every 64ms
• Distributed: refresh one row (100ns) every 15.6µs (64ms / 4096), spreading the overhead across the interval

DRAM Refresh Policies
• RAS-only refresh
  – The memory controller places a row address on the address bus and asserts RAS; the addressed row is refreshed
• CAS-before-RAS (CBR) refresh
  – The controller asserts CAS before RAS, with WE held high; no address is involved
  – An internal counter in the DRAM supplies the row address and is incremented after each refresh

Types of DRAM
• Asynchronous DRAM
  – Normal: responds to RAS and CAS signals (no clock)
  – Fast Page Mode (FPM): the row remains open after RAS for multiple CAS commands
  – Extended Data Out (EDO): the output drivers are replaced by latches, so data can be held on the bus for a longer time
  – Burst Extended Data Out (BEDO): an internal counter drives the address latch, providing data in burst mode
• Synchronous DRAM
  – SDRAM: all of the above, plus a clock; adds predictability to DRAM operation
  – DDR, DDR2, DDR3: transfer data on both edges of the clock
  – FB-DIMM: DIMMs connected using point-to-point links instead of a bus.
    Allows more DIMMs to be incorporated in server-based systems
• RDRAM
  – Low pin count

Main Memory Organizations
[Figure: three organizations – a one-word-wide memory bus, a wide bus between cache and memory, and multiple interleaved memory banks on a one-word bus]
• The processor–memory bus may have the width of one or more memory words
• Multiple memory banks can operate in parallel
  – Transfer from memory to the cache is subject to the width of the processor–memory bus
• Wide memory comes with constraints on expansion
  – Error-correcting codes require the complete "width" to be read in order to recompute the codes on writes
  – The minimum expansion unit size is increased

Word Level Interleaved Memory
[Figure: four banks with word interleaving (words 0, 4 in bank 0; 1, 5 in bank 1; 2, 6 in bank 2; 3, 7 in bank 3); the timing diagram shows accesses of latency τ overlapped across banks]
• Memory is organized into multiple, concurrent banks
• Word-level interleaving across banks
• A single address generates multiple, concurrent accesses
• Well matched to cache line access patterns
• Assuming a word-wide bus, the cache miss penalty is
  Taddress + Tmem_access + #words x Ttransfer cycles
• Note the effect of a split transaction vs.
locked bus

Sequential Bank Operation
[Figure: the n–m higher-order address bits select the bank; the m lower-order bits select the word within the bank, so consecutive words fall in the same bank]
• Implement using DRAMs with page mode access

Concurrent Bank Operation
[Figure: address and data buses shared by independently addressed modules]
• Supports arbitrary accesses
• Needs sources of multiple, independent accesses
  – Lock-up free caches, data speculation, write buffers, pre-fetching

Concurrent Bank Operation
[Figure: overlapped accesses of latency τ to independently addressed banks]
• Each bank can be addressed independently
  – Sequence of addresses
• Differences from interleaved memory
  – Flexibility in addressing
  – Requires greater address bandwidth
  – Separate controllers and memory buses
• Supports non-blocking caches with multiple outstanding misses

Data Skewing for Concurrent Access
[Figure: a 3-ordered 8-vector with C = 2 stored across four modules – module 0 holds a2, a6; module 1 holds a1, a5; module 2 holds a0, a4; module 3 holds a3, a7]
• How can we guarantee that data can be accessed in parallel?
  – Avoid bank conflicts
• Storage scheme:
  – A set of rules that determine, for each array element, the module and the location within that module
  – Design the storage scheme to ensure concurrent access
  – d-ordered n vector: the ith element is in module (d·i + C) mod M

Conflict-Free Access
• Conflict-free access to the elements of the vector if
  – M >= N for unit stride
  – M >= N·gcd(M, d) for stride d
• Multi-dimensional arrays are treated as arrays of 1-d vectors
• Conflict-free access for various patterns in a matrix requires
  – M >= N·gcd(M, δ1) for columns
  – M >= N·gcd(M, δ2) for rows
  – M >= N·gcd(M, δ1 + δ2) for forward diagonals
  – M >= N·gcd(M, δ1 - δ2) for backward diagonals

Conflict-Free Access
• Implications for M = N = an even number?
• For non-power-of-two values of M, indexing and address computation must be efficient
• Vectors that are accessed are scrambled
  – Unscrambling the vectors is a non-trivial performance issue
• Data dependencies can still reduce bandwidth far below O(M)

Avoiding Bank Conflicts: Compiler Techniques
• Many banks

  int x[256][512];
  for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
      x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the accesses conflict on word accesses
• Solutions:
  – Software: loop interchange
  – Software: adjust the array size to a prime # ("array padding")
  – Hardware: a prime number of banks (e.g., 17)
  – Data skewing

Study Guide: Glossary
• Asynchronous DRAM
• Bank and rank
• Bit line
• Burst mode access
• Conflict-free access
• Data skewing
• DRAM
• High-order and low-order interleaving
• Leaky transistors
• Memory controller
• Page mode access
• RAS and CAS
• Refresh
• RDRAM
• SRAM
• Synchronous DRAM
• Word interleaving
• Word line

Study Guide
• Differences between SRAM and DRAM in operation and performance
• Given a memory organization, determine the miss penalty in cycles
• Cache basics
  – Mappings from main memory to locations in the cache hierarchy
  – Computation of the CPI impact of miss penalties, miss rates, and hit times
  – Computation of the CPI impact of update strategies
• Find a skewing scheme for concurrent accesses to a given data structure
  – For example, the diagonals of a matrix
  – Sub-blocks of a matrix
• Evaluate the CPI impact of various optimizations
• Relate the mapping of data structures (such as matrices) to main memory to cache behavior and the behavior of optimizations