
ENEE 759H, Spring 2005 - Memory Systems: Architecture and Performance Analysis


Bruce Jacob and David Wang, University of Maryland, ECE Dept. (Lecture15.fm)

SLIDE 1: Fully Buffered DIMM Memory System

Credit where credit is due: slides contain original artwork (© Jacob, Wang 2005).

SLIDE 2: DRAM Datarate Trends

[Figure: DRAM device datarate (Mb/s, log scale) vs. year of introduction, 1998-2006. Commodity DRAM datarates roughly double every 3 years across generations: SDRAM (min burst: 1), DDR SDRAM (min burst: 2), DDR2 SDRAM (min burst: 4), and XDR (min burst: 16) at 3.2 Gb/s per differential pair.]

SLIDE 3: Recall - Multidrop Bus

Every rank added to a multidrop bus is another electrical load: more ringing, larger delay, lower risetime. Higher datarates therefore come at a loss of capacity:
- SDRAM: max 8/6 ranks (loads) at 133 MHz
- DDR SDRAM: max 6 ranks at 400 Mbps
- DDR2: max 4 ranks at 667/800 Mbps
- DDR3: max 2 ranks at 800+ Mbps

SLIDE 4: DDR2 Memory System

[Figure: a DDR2 channel: the controller (Ctl) shares a multidrop bus with Rank 1 and Rank 2.]

The question: how can we keep using commodity DRAM devices and keep the datarate high, yet maintain or increase capacity (as large servers require)?
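The capacity squeeze on slides 3-4 can be made concrete: as datarates rise, the allowed rank count falls, so the maximum channel capacity shrinks at any fixed per-rank size. A minimal sketch; the 4 GB per-rank figure is an illustrative assumption, not a slide number:

```python
# Max ranks per channel at each generation's datarate (slide 3).
MAX_RANKS = {"SDRAM-133": 8, "DDR-400": 6, "DDR2-800": 4, "DDR3-800+": 2}

def max_channel_capacity_gb(generation, gb_per_rank=4):
    """Upper bound on channel capacity, holding per-rank capacity fixed.
    gb_per_rank is a hypothetical constant used only for illustration."""
    return MAX_RANKS[generation] * gb_per_rank

for gen, ranks in MAX_RANKS.items():
    print(f"{gen}: {ranks} ranks -> {max_channel_capacity_gb(gen)} GB max")
```

At a fixed rank size, the channel's capacity ceiling falls 4x from SDRAM to DDR3: this is the capacity problem FB-DIMM targets.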
SLIDE 5: FB-DIMM Solution

- Point-to-point chaining of sub-memory systems through an Advanced Memory Buffer (AMB) on each module: 10 lanes downstream, 14 lanes upstream.
- Up to 8 FB-DIMMs per channel.
- Uses commodity DRAM devices.

SLIDE 6: AMB Block Diagram

[Figure: AMB internals. Point-to-point interfaces from and to the controller feed pass-through logic; a de-serializer & decoder extracts command & address; a data bus interface with serializer and pass-through/merging logic drives the on-DIMM data bus. Support blocks: PLL, SMBus controller, LAI, thermal sensor.]

SLIDE 7: Salient Points

• ASIC-to-ASIC signalling at high datarate: differential-pair signalling at 6x the datarate of the DRAM devices (i.e., DRAM at 800 Mbps gives an FBD link datarate of 4.8 Gbps).
• Asymmetric configuration: higher inbound bandwidth (14 inbound pin-pairs, 10 outbound pin-pairs).
• Deskewing is done in the ASIC, not on the board.
• Longer idle-system latency, but much higher pin bandwidth.
• Lets datarates keep increasing while solving the capacity problem.
• Uses commodity DRAM devices.

SLIDE 8: Technology Roadmap (ITRS)

Year                                   2004       2007       2010       2013       2016
Semi generation (nm)                     90         65         45         32         22
CPU clock (MHz)                        3990       6740      12000      19000      29000
Logic density (Mtransistors/cm^2)      77.2      154.3        309        617       1235
High-perf chip pin count               2263       3012       4009       5335       7100
High-perf chip cost (cents/pin)        1.88       1.61       1.68       1.44       1.22
Memory pin cost (cents/pin)       0.34-1.39  0.27-0.84  0.22-0.34  0.19-0.39  0.19-0.33
Memory pin count                     48-160     48-160     62-208     81-270    105-351
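One way to read the roadmap numbers: high-performance pin counts roughly triple from 2004 to 2016 while cost per pin falls only modestly, so the packaging (pin) cost per chip roughly doubles. A quick sketch over the table's rows:

```python
# Per-chip pin cost from the ITRS rows: pins x (cents/pin) / 100 = dollars.
years = [2004, 2007, 2010, 2013, 2016]
pin_counts = [2263, 3012, 4009, 5335, 7100]
cents_per_pin = [1.88, 1.61, 1.68, 1.44, 1.22]

for year, pins, cost in zip(years, pin_counts, cents_per_pin):
    print(f"{year}: {pins} pins x {cost} c/pin = ${pins * cost / 100:.2f}")
```

Pin cost per chip climbs from roughly $43 to roughly $87: transistors get cheap faster than interconnect does.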
Trend: free transistors, costly interconnects.

SLIDE 9: Choices for the Future

[Figure: three candidate topologies.]
- Direct connect (CPU drives the DRAM directly): commodity DRAM, low bandwidth, low latency.
- Indirect connection through a memory controller: highest bandwidth.
- FB-DIMM (CPU to a chain of AMBs): inexpensive DRAM, highest latency.

SLIDE 10: Pin Count Comparison

                                            DDR2       FBD
Datarate (Mbps)                              667      4000
Pin count (data bus)        72 data pins + 18 DQS diff pairs
Channel pin count (w/o pwr and gnd)      108-141        59
Channel pin count (with pwr and gnd)        ~200       ~70
Peak bandwidth (GB/s)                        5.3       8.0
Theoretical efficiency (Mbps per pin)        213       914

Result: ~4.5x increase in pin bandwidth (counting power and ground pins).

SLIDE 11: Routing Comparison

- 1 channel of registered DDR2 SDRAM: 2 routing layers plus a power plane; path-length-matched traces; ~200 traces per channel.
- 2 channels of FB-DIMM: 2 routing layers including power delivery; no need to match path lengths (deskewing happens on the ASIC); ~70 traces per channel.
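The efficiency row of the pin-count table follows directly from the other rows (note also that the FBD datarate of 4000 Mbps is just 6x the DDR2-667 device rate). A sketch reproducing the bandwidth-per-pin numbers, using the approximate with-power-and-ground pin counts:

```python
def mbps_per_pin(peak_bandwidth_gb_s, pin_count):
    """Peak bandwidth in GB/s -> megabits per second per pin."""
    return peak_bandwidth_gb_s * 8 * 1000 / pin_count

ddr2 = mbps_per_pin(5.3, 200)   # ~212 Mb/s per pin (table: 213)
fbd = mbps_per_pin(8.0, 70)     # ~914 Mb/s per pin
print(f"pin-bandwidth ratio: {fbd / ddr2:.1f}x")  # ~4.3x (slide: ~4.5x)
```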
SLIDE 12: FB-DIMM Protocol

- Pseudo network-like packet protocol with a fixed packet (frame) size.
- Every frame is 12 link beats (which span 2 DRAM beats, i.e., 1 DRAM clock).
- Upstream (to controller): 12 x 14 = 168 bits per frame, with a 144-bit data payload (read data).
- Downstream (from controller): 12 x 10 = 120 bits per frame, with a 72-bit data payload (write data).
- Write bandwidth = 0.5x read bandwidth; total bandwidth = 1.5x the bandwidth of a single module.

SLIDE 13: Downstream (Southbound) I

- 10 bit lanes x 12 beats = 120 bits per frame, carrying command plus write data.
- 72 bits of data per frame.
- 24 bits of command (max of 3 commands per frame); 2 bits of command type.
- 22 bits of CRC for strong protection in normal mode; 10 bits of CRC in failover mode (reduced protection).
- Transparent failover: lose a bit lane and keep going.

SLIDE 14: Downstream (Southbound) II

DRAM commands: Activate (row, bank); Read (column); Write (column); Precharge (bank); Precharge (all banks); Auto Refresh; Enter Self Refresh; Enter Power Down; Exit Self Refresh and Exit Power Down.

Channel commands:
- Channel NOP (ensures transmission density)
- Sync; Soft Channel Reset (transient bit-failure recovery)
- Write Config Register; Read Config Register (allow tuning of the refresh mechanism)
- DRAM CKE per DIMM; DRAM CKE per rank (clock-enable control per DIMM or per rank)
- Debug

SLIDE 15: Write FIFO

[Figure: southbound frames (12 FBD beats each) carry write commands w0-w7 interleaved with reads r0-r1; read data D0-D3 returns in northbound frames.]
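The 0.5x/1.5x bandwidth figures from slide 12 and the frame layout above reduce to a little arithmetic; this sketch just checks the numbers:

```python
BEATS_PER_FRAME = 12               # 12 link beats = 2 DRAM beats = 1 DRAM clock
NORTH_LANES, SOUTH_LANES = 14, 10  # lanes toward / away from the controller

north_bits = BEATS_PER_FRAME * NORTH_LANES  # 168 bits per upstream frame
south_bits = BEATS_PER_FRAME * SOUTH_LANES  # 120 bits per downstream frame

READ_PAYLOAD = 144    # 2 DRAM beats of a 72-bit channel per upstream frame
WRITE_PAYLOAD = 72    # 1 DRAM beat of write data per downstream frame

# A burst-of-eight read fills 4 upstream frames; the matching write needs
# 8 downstream frames, so write bandwidth is half of read bandwidth and
# read + write together is 1.5x a single module's bandwidth.
assert north_bits == 168 and south_bits == 120
assert WRITE_PAYLOAD / READ_PAYLOAD == 0.5
```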
- Read data returns as burst-of-eight bursts in northbound frames (2 DRAM beats, i.e., 12 FBD beats, per frame).
- The AMB's write FIFO is 35 entries deep; write data arrives at 1 beat per frame and is buffered/queued.
- Write-data frames need not be contiguous: they may be separated by an arbitrary number of intervening frames.

SLIDE 16: Upstream (Northbound)

- 14 bit lanes; 12 x 14 = 168 bits per frame (to controller).
- 2 data payloads (72 or 64 bits each) per frame.
- The last AMB on the channel initiates northbound frames.
- Idle frames carry a permuting data pattern (LFSR).
- CRC/failover modes:
  - 14-bit channel with 12 bits of CRC per payload; fails over to a 13-bit channel with 6 bits of CRC per payload.
  - 13-bit channel with 6 bits of CRC per payload; fails over to a 12-bit channel with ECC coverage only.
  - 12-bit channel with 6 bits of CRC per payload; no failover.

SLIDE 17: Command Scheduling I

- Short channel: variable read latency capability.
- [Figure: two reads r0, r1 whose return data would collide in the northbound frame stream ("Oops"); the controller delays the second read until it knows an empty northbound frame exists.]
- Read data from different DIMMs merges into the northbound stream (FBD merging): no rank-to-rank turnaround penalty.

SLIDE 18: Command Scheduling II

- Long channel: fixed read latency; a longer channel means longer read latency for all requests.
- AMBs nearer the controller buffer/delay their return data ("token passing") so every read sees the same, worst-case latency.
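The short-channel vs. long-channel trade on slides 17-18 can be sketched as a latency model. The ~2 ns per-hop pass-through delay comes from the AMB latency figures on the next slide; the 40 ns base DRAM access time is a hypothetical placeholder:

```python
def variable_read_latency_ns(dimm_index, base_ns=40.0, hop_ns=2.0):
    """Short channel: a read pays only for the AMB hops it traverses,
    down to DIMM `dimm_index` (0-based) and back up.
    base_ns is an assumed DRAM access time, not a slide figure."""
    return base_ns + 2 * dimm_index * hop_ns

def fixed_read_latency_ns(channel_depth, base_ns=40.0, hop_ns=2.0):
    """Long channel: every read sees the worst-case (last-DIMM) latency,
    which simplifies scheduling ("token passing") at a latency cost."""
    return variable_read_latency_ns(channel_depth - 1, base_ns, hop_ns)

print(variable_read_latency_ns(0), fixed_read_latency_ns(8))  # 40.0 68.0
```

With 8 DIMMs, the fixed scheme charges every read the 68 ns worst case, while the variable scheme lets the first DIMM answer in 40 ns under these assumptions.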
SLIDE 19: Latency

[Figure: AMB block diagram annotated with latencies: ~2 ns through the pass-through logic; ~8.1 ns @ 667 Mbps on the command & address path; ~5.0 ns @ 667 Mbps on the data-return path.]

SLIDE 20: Hypothetical Latency Distribution

[Figure: histogram (log-scale access count) of memory access latency, 0-800 ns, for direct connect vs. FB-DIMM.]

Longer idle-system latency, but higher pin bandwidth: might the average latency under load be lower?

SLIDE 21: Power Impact

Configuration                        9 x8    18 x8 or 18 x4    36 x4
DRAM power (~0.3 W per device)      2.7 W         5.4 W       10.8 W
AMB power                             4 W           4 W          4 W
Total FBD power                     ~6.7 W        ~9.4 W      ~14.8 W
Power overhead                      ~148%          ~74%         ~37%

SLIDE 22: Summary

• Read-data frames can be merged, achieving 100% read-bandwidth utilization.
• The FBD channel relies on a deep channel to obtain "full bandwidth efficiency."
• Longer latency, particularly with long-channel configurations.
• Asymmetric configuration: peak write bandwidth = 0.5x peak read bandwidth; peak read + write bandwidth = 1.5x the bandwidth of a single-channel DIMM.
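The slide-21 power table follows from its two stated assumptions (~0.3 W per DRAM device, ~4 W per AMB); this sketch reproduces it:

```python
def fbd_dimm_power(device_count, dram_w=0.3, amb_w=4.0):
    """Total per-DIMM power and the AMB's overhead relative to DRAM power,
    using the slide's ~0.3 W/device and ~4 W/AMB assumptions."""
    dram_total = device_count * dram_w
    return dram_total + amb_w, 100.0 * amb_w / dram_total

for devices in (9, 18, 36):
    total, overhead = fbd_dimm_power(devices)
    print(f"{devices} devices: ~{total:.1f} W total, ~{overhead:.0f}% overhead")
```

The fixed 4 W AMB cost dominates small DIMMs (~148% overhead at 9 devices) but amortizes on large ones (~37% at 36 devices).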