
A Holistic Approach to DRAM (and Systems), Prof. Bruce Jacob


A HOLISTIC APPROACH
Bruce Jacob, University of Maryland

SLIDE 1
A Holistic Approach to DRAM (and Systems)
Prof. Bruce Jacob
Electrical & Computer Engineering
University of Maryland, College Park

OUTLINE
• Anecdotes, Vision
• Our Past & Present Work
• Anecdotes Revisited
• Conclusions

SLIDE 2
Anecdote I: System Issues
[Figure: Cycles per Instruction (CPI, 0 to 3) versus System Bandwidth (GB/s = Channels * Width * 800MHz) at 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 GB/s, for channel/width organizations from 1 chan x 1 byte to 4 chan x 8 bytes. Series: 32-Byte Burst, 64-Byte Burst, 128-Byte Burst.]
Benchmark = GCC (SPEC 2000), 2 banks/channel

SLIDE 3
Anecdote II: DDR's DLL
DCLK: delay of the clock from the clock input pad to the output drivers
[Diagram: CKEXT, CK Bufs, CKINT, DRAM Arrays, Data, DQEXT, with delay DCLK. Timing: CKEXT, CKINT, CMD READ, DQEXT; DQEXT arrives DCLK + DDQ after the clock edge, compared with "Ideally aligned".]

SLIDE 4
Anecdote II: DDR's DLL
Point of DLL: to align DQ output with system clock (minimize internal skew by eliminating DCLK)
[Diagram: CKEXT, CK Bufs, DLL, CKINT, DRAM Arrays, Data, DQEXT, with delays DCLK, DDLL, and DCLK + DDLL. Timing: CKEXT, CKINT ("Delay introduced by DLL", "Delayed"), CMD READ, DQEXT, DDQ ("Aligned").]

SLIDE 5
Anecdote II: DDR's DLL
A handful of alternatives:
• Unassisted
• DLL on DRAM
• DLL on MC
• DLL on module (DIMM)
• Read clock (RCLK)
• Static delay w/ recalibration
[Diagrams: one per alternative, showing MC, DATA, strobe, D, V, DLL, DIMM, RCLK.]

SLIDE 6
Anecdote III: Circuit v System
tRRD & tFAW limitations:
[Timing diagram: Cmd, Internal cmd, Data; intervals tRRD and tFAW. R = Row Activation Command, C = Column Read Command.]
tDQS limitations:
[Timing diagram: Clock, Cmd, Internal Cmd, Data; interval tDQS.]

SLIDE 7
Vision
Must make circuit-level decisions considering system-level ramifications
Must make system-level decisions considering circuit-level ramifications
(holistic approach)

SLIDE 8
Past Work: Device-Level
FPM, EDO, SDRAM, ESDRAM, DDR:
[Diagram: CPU and caches, 128-bit 100MHz bus, Memory Controller, DIMM of eight x16 DRAMs.]
Rambus, Direct Rambus, SLDRAM:
[Diagram: CPU and caches, 128-bit 100MHz bus, Memory Controller, Fast, Narrow Channel, DRAMs.]
[Cuppu et al. ISCA 1999]

SLIDE 9
Past Work: Device-Level
Average Latencies
[Figure: PERL. Avg Time per Access (ns, 0 to 500) by DRAM Architecture: FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, DRDRAM. Components: Bus Wait Time, Refresh Time, Data Transfer Time, Data Transfer Time Overlap, Column Access Time, Row Access Time, Bus Transmission Time. Annotations: Critical word arrival times; Newer DRAMs.]
[Cuppu et al. ISCA 1999]

SLIDE 10
Past Work: Device-Level
Bandwidth-Enhancing Techniques I:
[Figure: PERL. Cycles Per Instruction (CPI, 0 to 5) by DRAM Architecture: FPM, EDO, SDRAM, ESDRAM, DDR, SLDRAM, RDRAM, DRDRAM. Components: Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, Processor Execution. Annotations: Yesterday's CPU, Today's CPU, Tomorrow's CPU; Newer DRAMs.]
[Cuppu et al. ISCA 1999]

SLIDE 11
Past Work: Device-Level
Bandwidth-Enhancing Techniques II:
[Figure: Execution Time in CPI (PERL), 0 to 5, by DRAM Architecture (10GHz CPU): FPM/interleaved, EDO/interleaved, SDRAM & DDR, SLDRAM x1/x2, RDRAM x1/x2. Components: Stalls due to Memory Bandwidth, Stalls due to Memory Latency, Overlap between Execution & Memory, Processor Execution.]
[Cuppu et al. ISCA 1999]

SLIDE 12
Past Work: System-Level
Even when we restrict our focus …
[Diagrams: One independent channel, Two independent channels, Four independent channels; each with banking degrees of 1, 2, 4, ...]
800 MHz Channels: 1, 2, 4
Data Bits per Channel: 8, 16, 32, 64
Banks per Channel (Indep.): 1, 2, 4, 8
Bytes per Burst: 32, 64, 128
[Cuppu & Jacob ISCA 2001]

SLIDE 13
Past Work: System-Level
... the design space is FAR from regular …
[Figure: GCC. Cycles per Instruction (CPI, 0 to 3) versus System Bandwidth (GB/s = Channels * Width * 800MHz) at 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 GB/s, for channel/width organizations from 1 chan x 1 byte to 4 chan x 8 bytes. Series: 32-Byte Burst, 64-Byte Burst, 128-Byte Burst.]
[Cuppu & Jacob ISCA 2001]

SLIDE 14
Past Work: System-Level
... and the cost of poor judgment is high.
[Figure: Cycles per Instruction (CPI, 0 to 10) for SPEC 2000 Benchmarks: bzip, gcc, mcf, parser, perl, vpr, average. Series: Worst Organization, Average Organization, Best Organization.]
[Cuppu & Jacob ISCA 2001]

SLIDE 15
An Aside
Past work used first-order models.
Present work uses models accurate to second & third order effects …

SLIDE 16
[ Definition: Zero'th Order ]
    ...
    if ( INSTR.is_loadstore ) {
        if (L1_cache_miss( INSTR.daddr )) {
            if (L2_cache_miss( INSTR.daddr )) {
                cycles += DRAM_LATENCY;
                    OR
                INSTR.ready = now() + DRAM_LATENCY;
            }
        }
    }
    ...

SLIDE 17
An Aside
Past work used first-order models.
Present work uses models accurate to second & third order effects …

SLIDE 18
Past & Present Work
• SimBed: C-language embedded-system model
• CMP$im: Pin tool to model CMP cache (and DRAM) systems
• EmPower: SystemC model of SoC, incorporating heat and power
• DRAMsim: C-language DRAM model
• SYSim: C-language system model
[Diagram labels: CPU, $, DRAM, BIU, MC; GEMS, M5, etc. Annotations: Executes apps to 1 trillion references without sampling; Models both performance & energy; Recent development work.]

SLIDE 19
Past & Present Work
• SimBed: C-language embedded-system model (CASES 2001, IEEETC 2003)
• CMP$im: Pin tool to model CMP cache (and DRAM) systems (HPCA 2006)
• EmPower: SystemC model of SoC, heat & power (SPIE 2005)
• DRAMsim: C-language DRAM model (SIGARCH 2005)
• SYSim: C-language system model (MKP book)
• System-level study (ISCA 2001)
• Device-level study (ISCA 1999, IEEETC 2001)
• Analytical model (IEEETC 1996)

IEEETC 1996: System-level analytical tool for cost/performance
ISCA 1999, IEEETC 2001: DRAM device-level characterization
CASES 2001, IEEETC 2003: Performance & energy modeling of RTOS, CPU, memory
ISCA 2001, IEEE Micro 2003: DRAM system-level characterization
SPIE 2005: SystemC modeling of energy in systems-on-chip
SIGARCH 2005: DRAMsim released to community
ISPASS 2005, HPCA 2006: Characterization of bioinformatics workloads

SLIDE 20
DRAMsim
Execution of a Load Instruction
[Diagram: Processor Core, DTLB, L1 cache, L2 cache, BIU (Bus Interface Unit), System Controller (physical to memory addr mapping, memory request scheduling, read data buffer), I/O-to-memory traffic, DRAM System, DRAM core; steps [A1] to [A4] and [B1] to [B8].]
Part A: Searching on-chip for data (CPU clocking domain)
Part B: Going off-chip for data (DRAM clocking domain)
Stages of instruction execution: IF, ID, EX, MEM, WB
[A1] Virtual to physical address translation (DTLB access)
[A2] L1 D-Cache access. If miss then proceed to
[A3] L2 Cache access. If miss then send to BIU
[A4 + B] Bus Interface Unit (BIU) obtains data from main memory
[B1] BIU arbitrates for ownership of address bus **
[B2] Request sent to system controller
[B3] Phys. addr. to memory addr. translation
[B4] Mem. request scheduling **
[B5] Mem. addr. setup (RAS/CAS)
[B6, B7] DRAM dev. obtains data and returns to controller
[B8] System controller returns data to CPU
** Steps not required for some processor/system controllers; protocol dependent.

SLIDE 21
DRAMsim
READ:
[Timing diagram: cmd and data rows; intervals tRC, tRAS, tRP, tRCD, tCAS, tCAC, tDATA; events 1 to 4.]
1 Active: Open Row. tRCD time later, a CAS command may be issued to the DRAM chip.
2 CAS: Column Read command. tCAS time later, data begins to be placed onto the data bus. We use tCAC to factor out command transmission time.
3 Data: The number of cycles that the data transmits over the data bus.
4 Precharge: Close the Row. This command may be issued tRAS time after the Active command. After tRP time, another Active command may be issued.
WRITE:
[Timing diagram: cmd and data rows; intervals tRTR, tRAS, tRP, tRCD, tCWD, tDATA; events 1 to 5.]
CWD: Column Write Delay, the number of cycles that the controller must wait before placing the data onto the data bus.
RTR: Retirement delay; this applies to systems with write delay buffers (RDRAM).

SLIDE 22
DRAMsim
Memory Access Latency Distribution
2 GHz CPU, 200 MBPS 8 Byte wide DDR SDRAM
[Figure: two panels, MCF and GCC. Number of Accesses at each Latency Value versus Memory Access Latency in CPU cycles (0 to 1500).]
Annotations:
• Minimum Latency ~ 180 CPU cycles
• Most accesses satisfied immediately. Latency distribution favors low latency values.
• Dominant latency "modes" evident: CAS Hit Latency, Bank Conflict Latency
• Most accesses must be pipelined. Long queueing delay, large range of latency values.

SLIDE 23
DRAMsim
http://www.ece.umd.edu/dramsim

SLIDE 24
Accuracy: Why?
Benefit: Insights (Anecdote II, revisited)
• Unassisted
• DLL on DRAM
• DLL on MC
• DLL on module (DIMM)
• Read clock (RCLK)
• Static delay w/ recalibration
[Diagrams: one per alternative, showing MC, DATA, strobe, D, V, DLL, DIMM, RCLK.]

SLIDE 25
Accuracy: Why?
Benefit: Insights (Anecdote II, revisited)

SCHEME      COST               EFFECTIVENESS (Uncertainty in read)
No DLL      0                  DCLK + Xmit + wire + Recv + Clk skew
on DRAM     16xDLL             Xmit + wire + Recv + Clk skew
on MC       2xDLL, 16xVern     wire + Recv
on DIMM     2xDLL, 16xVern     wire + Recv + Clk skew
Read CLK    2xDLL, 16xVern     wire + Recv
Static      16xVern            Xmit + wire + Recv

• Cost = for 2-DIMM system, 8 DRAM parts per DIMM (note: "cost" applies to both die area and power)
• Uncertainty = very rough, intuitive idea

SLIDE 26
Anecdote III, revisited
[Figure: command bus (R, C, P), internal cmd (row access, precharge), data bus (data burst), and current draw in abstract units versus time. R = Row Activation Command, C = Column Read Command, P = Precharge Command. Shown: quiescent current draw of active device; current draw profile due to device activity.]
Power consumption in DRAM devices:
• Row activation, data read-out, bank precharge: all are relatively expensive operations
• Current draw of operation additive to quiescent value
… So what's the big deal?
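The additive current draw described on Slide 26 can be sketched in a few lines of C. This is a rough illustration only, not DRAMsim code: it assumes every row activation adds one rectangular current pulse on top of the quiescent draw, and the function name `peak_current`, the pulse length, and every current value are hypothetical, expressed in the same abstract units the slide uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each activation starting at act_start[i] adds
 * pulse_amp units of current for pulse_len time units on top of the
 * quiescent draw. Real current profiles are not rectangular. */
double peak_current(const double *act_start, size_t n,
                    double pulse_len, double pulse_amp, double quiescent)
{
    assert(pulse_len > 0.0);
    double peak = quiescent;
    /* With rectangular pulses the total only rises at a pulse start,
     * so it is enough to evaluate the sum at each start time. */
    for (size_t i = 0; i < n; i++) {
        double total = quiescent;
        for (size_t j = 0; j < n; j++) {
            /* Pulse j is active at the start of pulse i. */
            if (act_start[j] <= act_start[i] &&
                act_start[i] < act_start[j] + pulse_len) {
                total += pulse_amp;
            }
        }
        if (total > peak) {
            peak = total;
        }
    }
    return peak;
}
```

Under these assumptions, with a quiescent draw of 1 unit and pulses of 1 unit lasting 25 time units, activations at times 0, 10, 20, and 30 overlap three deep and peak at 4 units, while the same four activations spaced 30 apart never overlap and peak at 2 units. Limiting how closely activations may be packed is therefore one way to cap the peak draw.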
SLIDE 27
Anecdote III, revisited
tRRD & tFAW: protocol-level limitations placed upon device to limit maximum current draw
[Figure: cmd, internal cmd, data, and current draw in abstract units versus time; intervals tRRD and tFAW; overlapping current profiles.]
• Severely limits bus efficiency from single rank
• Problem worsens in future: parameters defined in nanoseconds, not cycles

SLIDE 28
Anecdote III, revisited
• tRRD & tFAW: Problem worsens in future: parameters defined in nanoseconds, not cycles
[Figure: three timing diagrams (cmd, internal cmd, data; intervals tRRD and tFAW) at 800Mbps, 1.2Gbps, and 2.4Gbps.]

SLIDE 29
Max. Sustainable Bandwidth
[Figure: 164.gzip. Maximum Sustainable Bandwidth (GB/s, 0 to 9) versus Datarate (Mbps), with the Peak Bandwidth line.]
Datarate (Mbps):   533.33   666.66   800   933.33   1066.7   1200   1333.3
Peak BW (GBps):    4.3      5.3      6.4   7.5      8.5      9.6    10.7
Queue Depth: 4, 2, 0 (FIFO)
Bank Count: Filled = 16 Banks, Outline = 8 Banks
Simulated tFAW values: tFAW = tRC / 2, tFAW = tRC
tRC = 60ns, burst of eight, 8B wide channel

SLIDE 30
Max. Sustainable Bandwidth
tFAW Impact for Two Different Simulated Values: (tFAW = tRC/2) and (tFAW = tRC)
[Figure: 164.gzip. Maximum Sustainable Bandwidth (GB/s, 0 to 9) versus Datarate (Mbps), same axes and legend as Slide 29. Annotations: Peak Bandwidth; Inflection Point For tFAW = tRC / 2; HI and MED = High and Moderate Levels of Reordering (e.g. system sophistication); No Reordering.]
tRC = 60ns, burst of eight, 8B wide channel
16 BANKS improves bandwidth over 8 BANKS by ~10% (how does this compare with incremental cost?)

SLIDE 31
But Wait, There's More …
tDQS: protocol-level limitation placed upon ranks to prevent data-bus collisions on rank hand-off
[Timing diagram: Clock, Cmd, Internal Cmd, Data; interval tDQS.]
• Severely limits bus efficiency from multiple ranks
• Luckily, it is defined in cycles and not nanoseconds

SLIDE 32
Solution I: Scheduling
Problems created by tFAW + tRRD + tDQS
• tFAW + tRRD: Must spread out ACT commands
• tDQS: Must switch ranks infrequently
Salient point: tFAW does not place limit on total number of open banks
Problem can be solved with scheduling: row-column command decoupling (RCCD)
• Schedule ACT commands far before their corresponding READ commands
• Schedule large number of bank-reads before switching ranks
[patent pending]

SLIDE 33
Solution I: Scheduling
[Figure: IPC Speedup Relative to FCFS Scheduling (1 to 1.8). Series: BRR, RCCD. Benchmarks: art, ammp, applu, mgrid, bzip2, equake, vortex, gcc, mesa, galgel, gzip, twolf. Annotations: Increasing Bandwidth Utilization; negligible speedup.]

SLIDE 34
Solution I: Scheduling
[Figure: four panels, 179.art FCFS, 179.art RCCD, 188.ammp FCFS, 188.ammp RCCD. Number of Accesses at Given Latency Value (log scale, 1 to 1e+08) versus Memory Access Latency (ns, 0 to 800).]

SLIDE 35
Solution II: Topology, etc.
Problems solved by tFAW + tRRD + tDQS
• tFAW + tRRD: Instantaneous current draw in device
• tDQS: Bus collisions on rank handoffs
Any alternative solution will do …
[Diagram: MC and four DIMMs; CMD & write data path; read data path.]
• Topology eliminates collisions (can account for static DIMM-DIMM skew with Vernier-type solution)
• Note: solution requires source-synchronous clocking

SLIDE 36
Interesting Side Note
Fully Buffered DIMM

SLIDE 37
Nth Order Effects: Heat, EMI
EmPower: First Target Application

SLIDE 38
Nth Order Effects: Heat, EMI
EmPower: Initial Results

SLIDE 39
Summary
No longer appropriate to optimize subsystems in isolation: local optima do not yield globally optimal system
Systemic behaviors: unanticipated interactions yielding inefficiencies
Specific instances:
• tFAW + tRRD + tDQS severely limits BW
• Choice of DLL on DDR SDRAMs to de-skew parts
Many problems can be addressed by system-level solutions; can be better than circuit-level solutions

SLIDE 40
Et Cetera
(CURRENT) MEMSYS GRAD STUDENTS:
• Dave Wang: DRAMsim, tFAW + tRRD + tDQS studies, etc.
• Aamer Jaleel: CMP$im, bioinformatics, etc.
• Brinda Ganesh: DRAMsim, FB-DIMM power mgmt
• Samuel Rodriguez: SRAM circuit-level details
• Ankush Varma: SystemC system-on-chip energy model
• Sadagopan Srinivasan: SoC memory system issues
• Nuengwong Tuaycharoen: SYSim development
• Hongxia Wang: SRAM circuit integrity issues
• Joe Gross: DRAMsim II development
CONTACT INFO:
• Prof. Bruce Jacob, ECE Dept., University of Maryland, College Park, MD
• www.ece.umd.edu/~blj/ [email protected]

SLIDE 41
DRAM: Brief Primer
[Diagram: DRAM cell with Word Line, Bit Line, Switching element (transistor), Storage element (capacitor).]
Dual In-line Memory Module (DIMM) (printed circuit board w/ DRAM chips on it)

SLIDE 42
DRAM: Brief Primer
[Diagram: CPU, Primary Cache, Secondary Cache, Backside bus, Frontside bus, Graphics Co-Processor, AGP bus, Memory Controller, North Bridge Chipset, DRAM bus, DRAM memory modules, PCI bus, SCSI Controller, SCSI bus, Hard Drive/s, Network Interface, I/O Controller, South Bridge Chipset, Keyboard, Mouse, Other Low-BW I/O Devices.]
The memory system (in blue) … and DRAM's typical place within it.
(typical PC-style desktop system)

SLIDE 43
DRAM: Brief Primer
The DRAM device
[Diagram: Memory Array, Row Decoder, Column Decoder, Sense Amps, Data In/Out Buffers, Word Lines, Bit Lines; cell detail with Word Line, Bit Line, Switching element, Storage element (capacitor).]

SLIDE 44
DRAM: Brief Primer
Access Protocol
READ:
[Timing diagram: cmd and data rows; intervals tRC, tRAS, tRP, tRCD, tCAS, tCAC, tDATA; events 1 to 4.]
1 Active: Open Row. tRCD time later, a CAS command may be issued to the DRAM chip.
2 CAS: Column Read command. tCAS time later, data begins to be placed onto the data bus. We use tCAC to factor out command transmission time.
3 Data: The number of cycles that the data transmits over the data bus.
4 Precharge: Close the Row. This command may be issued tRAS time after the Active command. After tRP time, another Active command may be issued.
WRITE:
[Timing diagram: cmd and data rows; intervals tRTR, tRAS, tRP, tRCD, tCWD, tDATA; events 1 to 5.]
CWD: Column Write Delay, the number of cycles that the controller must wait before placing the data onto the data bus.
RTR: Retirement delay; this applies to systems with write delay buffers (RDRAM).
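The READ relationships in the Access Protocol slide reduce to simple sums, sketched below in C. This is an illustration under stated assumptions rather than DRAMsim's implementation: it assumes a read to a closed row with the CAS command issued as early as tRCD allows, and the struct `dram_timing`, its field names, and the three helper functions are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Timing parameters named after the Access Protocol slide. Any unit
 * (cycles or ns) works as long as all fields use the same one. */
struct dram_timing {
    unsigned t_rcd;   /* Active to CAS command */
    unsigned t_cas;   /* CAS command to first data on the bus */
    unsigned t_data;  /* duration of the data burst */
    unsigned t_ras;   /* Active to earliest Precharge */
    unsigned t_rp;    /* Precharge to next Active */
};

/* Time from the Active command until the first data appears. */
unsigned read_first_data(const struct dram_timing *t)
{
    assert(t != NULL);
    return t->t_rcd + t->t_cas;
}

/* Time from the Active command until the burst has finished. */
unsigned read_last_data(const struct dram_timing *t)
{
    assert(t != NULL);
    return t->t_rcd + t->t_cas + t->t_data;
}

/* Row cycle time tRC: minimum spacing between Active commands
 * to the same bank, made up of tRAS followed by tRP. */
unsigned row_cycle(const struct dram_timing *t)
{
    assert(t != NULL);
    return t->t_ras + t->t_rp;
}
```

With made-up values of tRCD = 3, tCAS = 3, tDATA = 4, tRAS = 8, and tRP = 3 cycles, the first data arrives 6 cycles after the Active command, the burst completes after 10 cycles, and the same bank can be activated again after 11 cycles.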