
Introduction to HPC Architecture - Memory, Lecture 5





COSC 6365, Lecture 5 (updated 2013-01-29; originally Lecture 6, 2011-02-02)
Introduction to HPC Architecture - Memory
Lennart Johnsson, Dept. of Computer Science

CPU Transistor Counts 1971 - 2008 and Moore's Law

Recent points on the transistor-count curve (date of introduction in parentheses):
• AMD RV870 GPU, 2.15 B (2009)
• AMD Cayman GPU, 2.64 B (2010)
• nVidia GF100 GPU, 3.0 B (2010)
• Sparc T3, 16-core, 1.0 B (2010)
• Intel Westmere, 6-core, 1.17 B (2010)
• IBM Power7, 1.2 B (2010)
• Intel Nehalem-EX, 8-core, 2.3 B (2010)
• Intel Westmere-EX, 10-core, 2.6 B (2011)
• Intel Itanium, 8-core, 3.1 B (2012)
Source: http://en.wikipedia.org/wiki/Transistor_count

Processor Die Area and Transistor Count

Processor                   Transistors     Year  Manufact.   Process  Area
Dual-Core Itanium 2         1,700,000,000   2006  Intel       90 nm    596 mm²
POWER6                        789,000,000   2007  IBM         65 nm    341 mm²
Six-Core Opteron 2400         904,000,000   2009  AMD         45 nm    346 mm²
RV870                       2,154,000,000   2009  AMD         40 nm    334 mm²
16-Core SPARC T3            1,000,000,000   2010  Sun/Oracle  40 nm    377 mm²
Six-Core Core i7            1,170,000,000   2010  Intel       32 nm    240 mm²
8-Core POWER7               1,200,000,000   2010  IBM         45 nm    567 mm²
4-Core Itanium Tukwila      2,000,000,000   2010  Intel       65 nm    699 mm²
8-Core Xeon Nehalem-EX      2,300,000,000   2010  Intel       45 nm    684 mm²
Cayman (GPU)                2,640,000,000   2010  AMD         40 nm    389 mm²
GF100 (GPU)                 3,000,000,000   2010  nVidia      40 nm    529 mm²
Tahiti (GPU)                4,310,000,000   2011  AMD         28 nm    365 mm²
10-Core Xeon Westmere-EX    2,600,000,000   2011  Intel       32 nm    512 mm²
8-Core Itanium Poulson      3,100,000,000   2012  Intel       32 nm    544 mm²
Sandy Bridge, 8C            2,270,000,000   2012  Intel       32 nm    435 mm²
GK110 (GPU)                 7,100,000,000   2013  nVidia      28 nm    ?

Sources: http://en.wikipedia.org/wiki/Transistor_count, http://en.wikipedia.org/wiki/Sandy_Bridge

Where have all the transistors gone?

Recall: Energy Consumption

"We are on the Wrong side of a Square Law", Fred Pollack, 1999. New goal for CPU design: "Double Valued Performance every 18 months, at the same power level" (Pollack, F. (1999). New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies. Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, Haifa, Israel.)

Product                  Normalized performance  Normalized power  EPI on 65 nm at 1.33 V (nJ)
i486                     1.0                     1.0               10
Pentium                  2.0                     2.7               14
Pentium Pro              3.6                     9                 24
Pentium 4 (Willamette)   6.0                     23                38
Pentium 4 (Cedarmill)    7.9                     38                48
Pentium M (Dothan)       5.4                     7                 15
Core Duo (Yonah)         7.7                     8                 11

Source: Ed Grochowski, Murali Annavaram, Energy per Instruction Trends in Intel® Microprocessors, http://support.intel.co.jp/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf
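The EPI column can be cross-checked from the two normalized columns, since energy per instruction is proportional to power divided by performance. A minimal sketch of that check (the values come from the table above; anchoring the proportionality at the i486's 10 nJ is an assumption for illustration):

```c
#include <stdio.h>

/* Cross-check of the Grochowski/Annavaram EPI table: EPI is
 * proportional to (normalized power)/(normalized performance),
 * anchored here at the i486's 10 nJ per instruction. */
int main(void) {
    struct { const char *name; double perf, power, epi_nj; } t[] = {
        {"i486",                   1.0,  1.0, 10},
        {"Pentium",                2.0,  2.7, 14},
        {"Pentium Pro",            3.6,  9.0, 24},
        {"Pentium 4 (Willamette)", 6.0, 23.0, 38},
        {"Pentium 4 (Cedarmill)",  7.9, 38.0, 48},
        {"Pentium M (Dothan)",     5.4,  7.0, 15},
        {"Core Duo (Yonah)",       7.7,  8.0, 11},
    };
    for (int i = 0; i < 7; i++) {
        double predicted = 10.0 * t[i].power / t[i].perf; /* nJ */
        printf("%-24s table %2.0f nJ, power/perf predicts %5.1f nJ\n",
               t[i].name, t[i].epi_nj, predicted);
    }
    return 0;
}
```

The predictions land within a nJ or two of the table: the Pentium 4 generations paid a steep energy price per instruction, which Dothan and Yonah then walked back.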
Recall: Solution to heat dissipation and double valued performance

Typical High Level Architecture

"CPU" = processing die: several cores, each with its own processing logic and private Level 1 and Level 2 caches, plus a Level 3 cache shared among the cores.
• Processing logic clock rates range from ~100+ MHz to 3+ GHz; execution widths from one operation per cycle to eight
• Typical latencies: Level 1 cache 1 - 5 cycles, Level 2 cache 8 - 15 cycles, Level 3 cache 25 - 50 cycles
• Bandwidth on the order of 20 - 100 GB/s between cores and caches, and 20 - 50 GB/s from the chip to DRAM

Processor examples

Intel Westmere/Gulftown

Intel Westmere, 2010:
• 1.17 billion transistors, 248 mm² in 32 nm
• Hyperthreading (2 threads/core)
• 2.26 - 3.33 GHz, 10.0 - 13.3 GF/core, 6 cores
• Shared L3 cache, memory controller, misc. I/O and QPI on die
• L1 cache: 32 kB + 32 kB per core (8-way, 64 B cache line, 64 sets)
• L2 cache: 256 kB per core (8-way, 64 B cache line, 512 sets)
• L3 cache: 12 MB shared (16-way, 64 B cache line, 12,288 sets, 36-cycle latency)
• 60 - 130 W TDP, 0.9 - 0.6 GF/W (TDP = Thermal Design Power)

Sources:
http://www.pcper.com/comments.php?nid=8348
http://www.intel.com/Assets/PDF/prodbrief/323501.pdf
http://code.google.com/p/likwid-topology/wiki/Intel_Nehalem_Westmere
http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon

Intel Sandy Bridge (2012)
• Two threads per core ("Hyperthreading")
• L1 cache: 32 kB + 32 kB per core, latency 3 cycles, 8-way
• L2 cache: 256 kB per core, latency 8 cycles, 8-way
• 64-byte cache line size
• Two load/store operations per CPU cycle per memory channel
• 256-bit/cycle ring bus between cores, graphics, cache and the System Agent Domain
• Advanced Vector Extensions (AVX) 256-bit instruction set
• DDR3 up to 1600 MHz
• 40 desktop versions: 1.6 - 3.6 GHz, 1 - 6 cores; L3 cache 1 - 2.5 MB/core, latency 35 - 45 cycles, 12-way
• 45 server versions: 1.8 - 3.6 GHz, 2 - 8 cores; L3 cache 1.5 - 2.5 MB/core, shared, latency 35 - 45 cycles, 12-way
• 57 mobile versions: 1.1 - 2.8 GHz, 1 - 4 cores; L3 cache up to 20 MB, latency 35 - 45 cycles, 12-way

Sources: Sandy Bridge-E hits the market with more cores, more threads, http://arstechnica.com/business/news/2011/11/sandy-bridge-e-hits-the-market-with-more-cores-more-threads.ars; Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge; Intel Sandy Bridge, http://www.7-cpu.com/cpu/SandyBridge.html; Sandy Bridge for servers, http://realworldtech.com/page.cfm?ArticleID=RWT072811020122&p=1

AMD Magny-Cours (two 6-core "Istanbul" dies)

AMD Magny-Cours, 2010:
• 2 billion transistors, 2 x 346 mm² in 45 nm
• 1.7 - 2.3 GHz, 6.8 - 9.2 GF/core, 12 cores
• L1 cache: 64 kB + 64 kB per core (2-way, 64 B cache line, 512 sets, 3 cycles)
• L2 cache: 512 kB per core (16-way, 64 B cache line, 512 sets, 12 cycles)
• L3 cache: 2 x 6 MB shared (48-way, 64 B cache line, 2048 sets)
• 90 - 140 W TDP, 0.9 - 0.8 GF/W

Sources: Conway et al., "Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor," IEEE Micro, pp. 16-29, March/April 2010; http://www.3dprofessor.org/Reviews%20Folder%20Pages/Istanbul/SMISP1.htm; http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon
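The (associativity, line size, set count) triples quoted on these slides are mutually consistent with the stated capacities: sets = capacity / (ways x line size). A small check using the Westmere figures above:

```c
#include <stdio.h>

/* sets = capacity / (associativity * line size); cross-checks the
 * cache geometries quoted on the Westmere slide. */
static void geom(const char *name, long capacity, int ways, int line) {
    printf("%-8s %8ld B, %2d-way, %d B lines -> %ld sets\n",
           name, capacity, ways, line, capacity / ((long)ways * line));
}

int main(void) {
    geom("L1 data", 32L * 1024,        8,  64); /*    64 sets */
    geom("L2",      256L * 1024,       8,  64); /*   512 sets */
    geom("L3",      12L * 1024 * 1024, 16, 64); /* 12288 sets */
    return 0;
}
```

The same arithmetic reproduces the Magny-Cours figures (e.g., 64 kB / (2 x 64 B) = 512 sets).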
AMD Bulldozer - Interlagos

A Bulldozer module holds 2 cores; the Orochi die holds 8 cores (4 modules).

The Bulldozer core is architected to be power-efficient:
• Minimize silicon area by sharing functionality between the two cores of a module
• All blocks and circuits have been designed to minimize power (not just in the Bulldozer core)
• Extensive flip-flop clock-gating throughout the design; circuits power-gated dynamically
• Numerous power-saving features under firmware/software control: Core C6 state (CC6), core P-states/AMD Turbo CORE, Application Power Management (APM), DRAM power management, Message Triggered C1E

The die: 315 mm², eight "Bulldozer" cores (high-performance, power-efficient AMD64 cores, two per module)
• L1D cache: 128 kB total, 16 kB per core, 64 B cache line, 4-way associative, write-through
• L1I cache: 256 kB total, 64 kB per Bulldozer module, 64 B cache line, 2-way associative
• L2 cache: 8 MB total, 2 MB per Bulldozer module, 64 B cache line, 16-way associative
• Integrated Northbridge controlling: the L3 cache (8 MB, 64 B cache line, 16-way associative, MOESI), two 72-bit wide DDR3 memory channels, and four 16-bit receive/16-bit transmit HyperTransport™ links

Source: http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.9-Desktop-CPUs/HC23.19.940-Bulldozer-White-AMD.pdf

IBM PowerPC (Blue Gene/P)

PowerPC 450: 208 M transistors, 90 nm, 16 W; 4 flops/cycle at 850 MHz. System-on-a-Chip design: integrates processors, memory and networking logic into a single chip.
• L1 cache: 32 kB + 32 kB (64-way, 32 B cache line, 16 sets, round-robin replacement, write-through, 4-cycle latency, 8 B/cycle)
• L2 cache: 2 kB prefetch buffer (16 lines of 128 B, fully associative, 12-cycle latency, 8 B/cycle peak; avg. 4.6 B/cycle = 128 B / (12 + 128/8) cycles)
• L3 cache: shared, 4 banks of 2 MB each (each bank has an L3 directory and a 15-entry 128 B combining buffer, 35-cycle latency)
• Memory: 2 GB DDR2 @ 400 MHz (4 banks of 512 MB), 86-cycle latency

Sources: https://computing.llnl.gov/tutorials/bgp/; http://workshops.alcf.anl.gov/gs10/files/2010/01/Morozov-BlueGeneP-Architecture.pdf

IBM Blue Gene/Q (2012)
• 360 mm² in Cu-45 technology (SOI), ~1.47 billion transistors
• 16 user + 1 service processors, plus 1 redundant processor; all processors are symmetric
• Each processor 4-way multi-threaded, 64-bit PowerISA™, 1.6 GHz
• L1 I/D cache = 16 kB/16 kB; L1 prefetch engines
• Each processor has a quad FPU (4-wide double precision, SIMD)
• Peak performance 204.8 GFLOPS @ 55 W
• Central shared L2 cache: 32 MB, eDRAM; multiversioned cache supporting transactional memory and speculative execution; supports atomic operations
• Dual memory controller: 16 GB external DDR3 memory, 1.33 Gb/s, 2 x 16-byte-wide interface (+ECC)
• Chip-to-chip networking: router logic integrated into the BQC chip
• External I/O: PCIe Gen2 interface

Source: Ruud Haring, http://www.hotchips.org/hc23
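The PowerPC 450 slide's "avg. 4.6 B/cycle" figure is the effective rate for moving one 128 B line through an 8 B/cycle port after a 12-cycle latency: line / (latency + line/width). A sketch of that arithmetic, which generalizes to any latency/bandwidth pair:

```c
#include <stdio.h>

/* Effective bytes/cycle when a whole line must be fetched:
 * line_bytes / (latency_cycles + line_bytes / width). Reproduces the
 * PowerPC 450 L2 prefetch-buffer figure of ~4.6 B/cycle. */
int main(void) {
    double line = 128.0;   /* bytes per transfer */
    double latency = 12.0; /* cycles before the first byte arrives */
    double width = 8.0;    /* bytes per cycle once streaming */
    double eff = line / (latency + line / width);
    printf("effective bandwidth: %.2f B/cycle (peak %.1f B/cycle)\n",
           eff, width);
    return 0;
}
```

The gap between 8 B/cycle peak and 4.6 B/cycle achieved is the general pattern: fixed access latency eats a large fraction of peak bandwidth whenever transfers are short.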
IBM Power7

IBM Power7, 2010:
• 1.2 billion transistors, 567 mm² in 45 nm
• 4-way simultaneous multi-threading (SMT)
• 3.0 - 4.25 GHz, 24 - 34 GF/core
• 4, 6 or 8 cores; up to 4 chips per MCM; 12 execution units per core
• L1 instruction cache: 32 kB per core (4-way, 128 B cache line, 64 sets, 2-3 cycle latency)
• L1 data cache: 32 kB per core (8-way, 128 B cache line, 64 sets, 2-3 cycle latency, write-through)
• L2 cache: 256 kB per core (8-way, 128 B cache line, 256 sets, 8-cycle latency, write-back)
• L3 cache: 4 MB per core (8-way, 128 B cache line, 4,096 sets, 25-cycle latency)
• 200 W TDP, < 1.36 GF/W (peak)

Sources:
http://www-03.ibm.com/systems/resources/pwrsysperf_OfGigaHertzandCPWs.pdf
http://www.tandvsolns.co.uk/DVClub/1_Nov_2010/Ludden_Power7_Verification.pdf
http://www-05.ibm.com/fr/events/hpc_summit_2010/P7_HPC_Summit_060410.pdf
http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars
http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf

Itanium 8-core "Poulson" (2012)

Per core:
  Devices                      89 million
  Area                         20 mm²
  Threads                      2+2
  Instruction queue size       96 x 2
  Max instruction issue/cycle  12
  Pipeline stages              4+7
  Pipeline hazard resolution   Replay

Per chip:
  Process                                  32 nm
  Devices                                  3.1 billion
  Area                                     544 mm²
  Power (max TDP)                          170 W
  Itanium® cores                           8
  Last-level cache size                    32 MB
  Intel® QuickPath Interconnect links      4 full + 2 half
  Intel® QPI link speed                    6.4 GT/s
  Intel® Scalable Memory Interface links   4
  Intel® SMI link speed                    6.4 GT/s

Source: Poulson: An 8 Core 32 nm Next Generation Intel® Itanium® Processor, http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.7-Server/HC23.19.721-Poulson-Chin-Intel-Revised%202.pdf

Oracle SPARC T3/Niagara 3

Oracle SPARC T3/Niagara 3, 2010:
• 1 billion transistors, 371 mm² in 40 nm
• 8-way simultaneous multi-threading (SMT)
• 1.65 GHz, 13 GF/core, 16 cores
• 2 SIMD units per core, 4 ops per SIMD unit
• L1 cache: 8 kB (data) + 16 kB (instructions) per core
• L2 cache: 384 kB per core (6 MB total)
• L3 cache: none
• 75 - 140 W TDP, < 1.51 GF/W (peak)

Sources: http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars; http://www.theregister.co.uk/2010/09/20/oracle_sparc_t3_chip_servers/print.html; http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t3-chip-ds-173097.pdf

Intel Atom Processor (Mobile)

Intel Atom S1260, Q4 2012:
• 32 nm, 2 GHz
• Hyperthreading (2 threads/core), 2 cores, 4 GF/core
• L1 cache: 24 kB, 6-way (data) + 32 kB, 4-way (instructions) per core
• L2 cache: 512 kB per core, 8-way
• 8.5 W TDP, 0.94 GF/W

Sources: http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-processor-s1200-datasheet-vol-1.pdf; http://www.anandtech.com/show/6509/intel-launches-centerton-atom-s1200-family-first-atom-for-servers
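The GF/core and GF/W figures on these slides follow from peak flops/cycle x clock rate and peak GFLOPS / TDP. A sketch for Power7, where 8 flops/cycle is an assumption on my part (it reproduces the slide's 24 - 34 GF/core at 3.0 - 4.25 GHz and the < 1.36 GF/W bound):

```c
#include <stdio.h>

/* Peak GFLOPS = flops/cycle * GHz; efficiency = peak GFLOPS / TDP.
 * flops/cycle = 8 is assumed for Power7; it matches the slide's
 * 24 - 34 GF/core range and the < 1.36 GF/W (peak) figure. */
int main(void) {
    double flops_per_cycle = 8.0;
    double ghz_lo = 3.0, ghz_hi = 4.25;
    int cores = 8;
    double tdp_w = 200.0;
    double gf_core_hi = flops_per_cycle * ghz_hi;
    printf("per core:  %.0f - %.0f GF\n",
           flops_per_cycle * ghz_lo, gf_core_hi);
    printf("chip peak: %.0f GF -> %.2f GF/W at %.0f W TDP\n",
           cores * gf_core_hi, cores * gf_core_hi / tdp_w, tdp_w);
    return 0;
}
```

The same template fits the SPARC T3 slide (1.65 GHz x 8 flops/cycle ≈ 13 GF/core) and the Atom slide (2 GHz x 2 flops/cycle = 4 GF/core), under the same kind of flops/cycle assumption.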
ARM Cortex A15

• ARM Cortex-A15 MPCore supports up to 4 cores, six independent power domains, and an optional SIMD/NEON™ unit
• L1 cache per core:
  – Instruction cache: 32 kB, 2-way set associative, 64 B cache line, LRU replacement policy, parity per 16 bits
  – Data cache: 32 kB, 2-way set associative, 64 B cache line, LRU replacement policy, 32-bit ECC, write-back and write-through, 1 - 2 cycle latency
• L2 cache, shared: 512 kB - 4 MB, 16-way associative, cache coherent, random replacement policy, inclusive of the L1 caches, 3 - 8 cycle latency, option to include parity or ECC, power domains
• Advanced bus interface supporting up to 32 GB/s, exposing the Accelerator Coherence Port (ACP) for enhanced peripheral and accelerator SoC integration

Source: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0438g/DDI0438G_cortex_a15_r3p2_trm.pdf

GPUs - AMD 5870 (2010)
• 1600 PEs, 20 SIMD engines (SE)
• 2.72 TF SP, 0.544 TF DP
• Memory bandwidth 147 GB/s
• 8 kB L1 and 32 kB data share for each SE; 64 kB global data share
• Four 128 kB L2 caches
• Up to 272 billion 32-bit fetches/second
• Up to 1 TB/s L1 texture fetch bandwidth; up to 435 GB/s between L1 and L2
• 225 W

Texas Instruments TMS320C6678
• Multi-core KeyStone SoC: 8 fixed/floating-point C66x DSP cores (CorePacs) at 1.25 GHz, each with private L1 and L2 caches
• 4.0 MB shared L2 behind the Multicore Shared Memory Controller (MSMC): low-latency, high-bandwidth memory access; 64-bit DDR3 interface
• 320 GMAC, 160 GFLOPS, 60 GFLOPS double precision, at 10 W
• Multicore Navigator: queue manager, packet DMA
• Network coprocessor: IPv4/IPv6 network interface solution; packet accelerator and crypto engine; 3-port GigE switch (Layer 2); 1.5 M pps (1 Gb Ethernet wire rate); IPSec, SRTP, encryption fully offloaded
• HyperLink: 50 GBaud expansion port, transparent to software
• TeraNet on-chip interconnect; IP interfaces (SGMII); peripherals and I/O: SRIO x4, PCIe x2, EMIF16, TSIP, 2x I2C, SPI, UART; system elements: power management, SysMon, debug, EDMA

Source: Eric Stotzer, TI

Movidius Myriad
• 65 nm media processor, 180 MHz
• L1 cache: 1 kB; L2 cache: 128 kB per core

Source: http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.8-Video/HC23.19.811-1TOPS-Media-Moloney-Movidius.pdf

Cache Summary
• Server processors today typically have three levels of on-chip cache
• Typical cache characteristics (a microbenchmark sketch for measuring these latencies follows the next few slides):
  – Level 1: separate data and instruction caches; 16 - 64 kB; 2-way to 8-way (PowerPC exception: 64-way; Movidius: 1 kB); 1 - 5 cycles latency
  – Level 2: shared cache for data and instructions (Itanium exception); 256 kB - 512 kB; 8-way to 16-way (PowerPC exception: 2 kB, fully associative; Movidius: 128 kB); ~8 - 15 cycles latency
  – Level 3: shared for data and instructions; 1 - 6 MB/core, typically shared among all cores; 8 MB - 32 MB per chip; 8-way to 48-way; ~25 - 50 cycles latency
• Mobile CPUs typically do NOT have on-chip level 3 cache

Threads/socket
• Intel Sandy Bridge: 16
• AMD Interlagos: 32
• IBM Power 7: 32
• IBM BG/Q: 64
• Oracle T3: 128
• AMD/ATI: 320 64-bit
• nVidia Fermi: 256 64-bit

Energy Observations
(Slide material from http://www.lbl.gov/cs/html/Manycore_Workshop09/GPU%20Multicore%20SLAC%202009/dallyppt.pdf)

Mobile CPUs die size
• ARM Cortex A9 in 40 nm technology: 4.6 mm² power optimized (0.5 W); 6.7 mm² performance optimized (1.9 W)
• Atom in 45 nm technology: 26 mm² (2.5 W)
Note: 1 - 10% of a typical server CPU area, 0.2 - 5% of a typical server CPU power.

Source: http://www.cs.virginia.edu/~skadron/cs8535_s11/ARM_Cortex.pdf
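The per-level latencies in the cache summary above are the kind of numbers a pointer-chasing microbenchmark exposes: a chain of dependent loads through a randomly shuffled ring defeats prefetching, so time per hop approaches the latency of whichever level the working set fits in. A minimal sketch, assuming a 64-bit platform (timer granularity and TLB effects are ignored; vary n to sweep the hierarchy):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long rng_state = 88172645463325252ULL;
static unsigned long long rng(void) { /* xorshift64: ample for shuffling */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

int main(void) {
    size_t n = 1 << 20;              /* 2^20 slots * 8 B = 8 MB: past L1/L2 */
    size_t *ring = malloc(n * sizeof *ring);
    size_t *perm = malloc(n * sizeof *perm);
    if (!ring || !perm) return 1;
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)(rng() % (i + 1));
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)            /* link slots into one cycle */
        ring[perm[i]] = perm[(i + 1) % n];

    size_t hops = 10 * n, p = 0;
    clock_t t0 = clock();
    for (size_t i = 0; i < hops; i++) p = ring[p]; /* dependent loads */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.1f ns per load (sink %zu)\n", 1e9 * sec / hops, p);
    free(ring); free(perm);
    return 0;
}
```

With n small enough to fit in L1 the time per hop lands in the 1 - 5 cycle range of the summary; at 8 MB it reflects L3 or DRAM latency, depending on the chip.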
Memory systems

Memory terminology (logical)
• Registers
• Cache
• Main memory
• Disk
• Archive

Memory hierarchy (logical)
• Registers (CMOS), optimized for latency and bandwidth: low latency (1 cycle), high bandwidth, at least three operands per core cycle. Example: 3 64-bit operands @ 3 GHz = 72 GB/s.
• Cache(s) (CMOS), optimized for latency: today typically 3 levels, with all three on the same die as the cores. Level 1 and Level 2 are typically private to a core, while Level 3 is often shared.
  – Latencies (typical): Level 1: 1 - 5 core cycles; Level 2: 5 - 15 core cycles; Level 3: 15 - 40 core cycles
  – Bandwidths (typical): 8 B - 32 B wide data paths operating at a clock rate up to that of the cores; ~20 GB/s - 100 GB/s
• Main memory (CMOS), optimized for cost, leakage and pin-out:
  – Latencies (typical): ~200+ core cycles
  – Bandwidth (typical): ~10 GB/s

Memory terminology
• SRAM: Static Random Access Memory
• DRAM: Dynamic Random Access Memory
• SDRAM: Synchronous DRAM
• RDRAM: Rambus DRAM

IC Complexity
Source: Gordon E. Moore, Intel, ISSCC, February 10, 2003

First 1,024-bit DRAM Chip - 1970 (Intel)
• 1970s processes usually had only nMOS transistors: inexpensive, but they consume power while idle
• 1980s to present: CMOS processes for low idle power

First 256-bit Static RAM - 1970: the Fairchild 4100
First 65,536-bit Dynamic Memory Chip - 1977: IBM Corporation
First 294,912-bit Dynamic RAM Memory - 1981: IBM Corporation
(Source for these four slides: http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt)

Intel 22 nm SRAM
Source: http://download.intel.com/pressroom/kits/events/idffall_2009/pdfs/IDF_MBohr_Briefing.pdf

SRAM
Source: Mark Bohr, IDF, Moscow, April 2006

DRAM: Samsung 1-Gbit 80-nm DDR3
The Samsung sample has a die size of 127 mm², and the DRAM cell size is 0.053 µm². Samsung's DRAM cell is an 8F² cell with a metal-insulator-metal (MIM) capacitor. The DDR3 device uses a spherical recess-access transistor with a gate length of 47 nm. The wordline width and pitch are 47 nm and 165 nm, respectively; the bitline width and pitch are 40 nm and 160 nm. The Micron 1-Gbit DDR3 sample's die size is 102 mm², and it has 38 percent cell efficiency. Micron's 6F² cell, with an area of 0.0365 µm², is the smallest DRAM cell that Semiconductor Insights has analyzed. The cell efficiency for DDR3 designs ranges from 33 to 45 percent, whereas the cell efficiency for DDR2 designs in the same technology is much higher, measuring between 41 and 54 percent. The wide internal data bus and the related circuitry, including the data read/write amplifiers and the multiplexing circuits that support the 8-bit prefetch architecture, consume precious silicon area. Extra pipeline stages to support DDR3's high-speed I/O, improved on-die termination circuitry and other features add to the die size overhead. Given DDR3's relatively low cell efficiency, higher-density DDR3 designs (1 Gbit and higher) make the most sense from a cost perspective.

Source: http://www.eetimes.com/design/other/4004747/Under-the-Hood-1-Gbit-DDR3-SDRAMs-square-off
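Cell efficiency here is simply cell area x bit count divided by die area. A quick check against the two 1-Gbit parts described above reproduces the quoted Micron figure and places the Samsung part inside the 33 - 45 percent range:

```c
#include <stdio.h>

/* cell efficiency = (bits * cell area) / die area; cross-checks the
 * figures quoted for the two 1-Gbit DDR3 samples above. */
int main(void) {
    double bits = 1024.0 * 1024 * 1024;   /* 1 Gbit = 2^30 bits */
    double um2_per_mm2 = 1e6;
    struct { const char *name; double cell_um2, die_mm2; } d[] = {
        {"Samsung 8F2", 0.0530, 127.0},
        {"Micron 6F2",  0.0365, 102.0},
    };
    for (int i = 0; i < 2; i++) {
        double array_mm2 = bits * d[i].cell_um2 / um2_per_mm2;
        printf("%s: cell array %.1f mm2 of %.0f mm2 die -> %.0f%%\n",
               d[i].name, array_mm2, d[i].die_mm2,
               100.0 * array_mm2 / d[i].die_mm2);
    }
    return 0;
}
```

This prints roughly 45% for Samsung and 38% for Micron, matching the article's numbers; the rest of the die is I/O, sensing, and prefetch circuitry.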
DDR4
Sources: http://www.xbitlabs.com/news/memory/display/20120912171441_Samsung_Demonstrates_DDR4_Memory_Modules_DDR4_Roadmap.html; http://semiaccurate.com/2012/02/28/ddr4-shows-up-in-the-wild/samsubg-ddr4-die

(Three slides of material from Eli Harari's ISSCC 2012 plenary: http://isscc.org/media/2012/plenary/Eli_Harari/SilverlightLoader.html)

DRAM refresh issues: http://www.newelectronics.co.uk/electronics-technology/dram-refresher-problems-the-technology-is-set-to-encounter/34922, 2011-06-29

Samsung 4 Gb DDR3 (announced January 30, 2009)
Samsung announced this week that it has developed the world's highest-density DRAM chip. Using its 50 nm technology, Samsung has made the world's first 4 Gb DDR3 DRAM chip. The South Korean electronics company said that its low-power 4 Gb DDR3 is of the 'green' variety, which it is pitching as a selling point to data center managers because it "will not only provide a reduction in electricity bills, but also a cutback in installment fees, maintenance fees and repair fees involving power suppliers and heat-emitting equipment." Samsung's new 4 Gb DDR3 DRAM chips operate at 1.35 V, and the company even does the handy math for us by saying it's a 20 percent improvement over 1.5 V DDR3. Its maximum speed is 1.6 gigabits per second (Gbps). The company goes on to explain that 4 Gb DDR3 can consume 40 percent less power than 2 Gb DDR3 (in the case of 16 GB module configurations) because of its higher density and because it uses only half as many DRAM chips (32 vs. 64). Since 2010 it has been produced in 40 nm technology.
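Samsung's "20 percent improvement" at 1.35 V follows from dynamic power scaling with the square of the supply voltage (P proportional to f·C·V² at fixed frequency and capacitance). A sketch of that check:

```c
#include <stdio.h>

/* Dynamic power scales as V^2 at fixed frequency and capacitance,
 * so 1.35 V vs 1.5 V DDR3 saves about 20 percent, as claimed. */
int main(void) {
    double v_old = 1.50, v_new = 1.35;
    double ratio = (v_new * v_new) / (v_old * v_old);
    printf("P(1.35V)/P(1.5V) = %.2f -> %.0f%% reduction\n",
           ratio, 100.0 * (1.0 - ratio));
    return 0;
}
```

The ratio is 0.81, i.e. roughly the advertised 20 percent; the further 40 percent module-level saving comes from halving the chip count.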
Memory Die size
• Samsung 4 Gbit DDR3: 104.7 mm² in 46 nm technology; 69.30 mm² in 35 nm technology; power reduction 30%
• Hynix 4 Gbit DDR3: 30.9 mm² in 23 nm technology
• Hynix 2 Gbit DDR4: 43.15 mm² in 38 nm technology
Source: http://www.datastorage.ch/web4archiv/objects/objekte/marketingcenter/1/pcm1102220014gbddr3sdramb-die78fbga1.pdf

Memory Density
Density has improved nicely, quadrupling about every three years, until recently, when the density increase for DRAM slowed to about half the historic rate, i.e. doubling about every three years. What about memory speed? What about energy consumption?

Processor-Memory Performance Gap
(Chart: relative performance, 1980 - 2004, log scale from 1 to 10,000. CPU performance follows "Moore's Law" while DRAM performance improves far more slowly, opening a widening gap.)
Source: http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view

The "Memory Wall"
• The logic vs DRAM speed gap continues to grow
(Chart: clocks per DRAM access vs clocks per instruction, log scale from 0.01 to 1000, at VAX/1980, PPro/1996 and 2010+: DRAM accesses cost ever more core clocks while clocks per instruction shrink.)
Source: http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view

Memory terminology
• DDR, DDR2, DDR3, DDR4
• Memory rank
• Memory bank

DDR SDRAM
• SDRAM: Synchronous Dynamic Random Access Memory
• DDR: the data rate is 2x the clock; there is one data phase in each half of the memory bus clock (e.g., data items D0 - D7 transfer during bus clocks T0 - T3)
Source: http://www.sbsmn.org/Memory%20training%20Aug%202009.ppt

DRAM Evolution

Standard    Memory clock  Cycle time  I/O bus clock  Data transfers/s  Module     Peak transfer rate
DDR-200     100 MHz       10 ns       100 MHz        200 million       PC-1600    1600 MB/s
DDR-266     133 MHz       7.5 ns      133 MHz        266 million       PC-2100    2100 MB/s
DDR-333     166 MHz       6 ns        166 MHz        333 million       PC-2700    2667 MB/s
DDR-400     200 MHz       5 ns        200 MHz        400 million       PC-3200    3200 MB/s
DDR2-400    100 MHz       10 ns       200 MHz        400 million       PC2-3200   3200 MB/s
DDR2-533    133 MHz       7.5 ns      266 MHz        533 million       PC2-4200   4266 MB/s
DDR2-667    166 MHz       6 ns        333 MHz        667 million       PC2-5300   5333 MB/s
DDR2-800    200 MHz       5 ns        400 MHz        800 million       PC2-6400   6400 MB/s
DDR2-1066   266 MHz       3.75 ns     533 MHz        1066 million      PC2-8500   8533 MB/s
DDR3-800    100 MHz       10 ns       400 MHz        800 million       PC3-6400   6400 MB/s
DDR3-1066   133 MHz       7.5 ns      533 MHz        1066 million      PC3-8500   8533 MB/s
DDR3-1333   166 MHz       6 ns        667 MHz        1333 million      PC3-10600  10667 MB/s
DDR3-1600   200 MHz       5 ns        800 MHz        1600 million      PC3-12800  12800 MB/s
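Every "peak transfer rate" in the table is the I/O bus clock x 2 transfers per clock (DDR) x 8 bytes per transfer for a 64-bit module; the PC-xxxx module name is essentially that number in MB/s. A sketch:

```c
#include <stdio.h>

/* Peak module bandwidth for a 64-bit-wide DDR interface:
 * MB/s = I/O clock (MHz) * 2 transfers/clock * 8 bytes/transfer. */
int main(void) {
    const char *std[] = {"DDR-400", "DDR2-800", "DDR3-1600"};
    double io_mhz[] = {200, 400, 800};
    for (int i = 0; i < 3; i++)
        printf("%-9s: %4.0f MT/s, %5.0f MB/s peak\n",
               std[i], 2 * io_mhz[i], 2 * io_mhz[i] * 8);
    return 0;
}
```

Note that within each generation the memory (core) clock stays at 100 - 200 MHz; the data-rate gains come entirely from running the I/O bus faster and prefetching more bits per core cycle.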
CMOS Technology Basics
• RAM is made out of silicon CMOS technologies, which effectively act as charge-transfer technologies: charging a capacitance C through a resistance R drives the output toward V(1 - e^(-t/RC))
• R is proportional to length/(wire cross section), l/(w*h)
• C is proportional to area/thickness
• If all dimensions but length scale by a factor s, then RC increases in proportion to s!
• Halving the feature sizes but keeping the chip size fixed would, with this simple scaling, make memory slower!!

Memory: DRAM vs SRAM
• DRAM (Dynamic Random Access Memory): one-transistor cell, designed for density (and low cost). Samsung: 0.0092 µm² in 48 nm technology (http://www.electroiq.com/blogs/chipworks_real_chips_blog/2011/01/samsung-s-3x-ddr3-sdram-4f2-or-6f2-you-be-the-judge.html)
• SRAM (Static Random Access Memory): six-transistor cell (bit/bit_b bit lines, word line), designed for speed. Intel: ~0.1 µm² in 22 nm technology (http://download.intel.com/pressroom/kits/events/idffall_2009/pdfs/IDF_MBohr_Briefing.pdf)
Source: http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec18.ppt

DRAM
• A read destroys the value stored, so every read must be a read/write operation
• DRAM is volatile: it must be refreshed frequently, typically at least every 64 ms
• Refresh is typically done one row at a time
• Refresh impacts performance and energy consumption negatively
Source: http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf

Impact of Refresh on Performance
Source: Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory, Jeffrey Stuecheli, Dimitris Kaseridis, Hillery C. Hunter and Lizy K. John, http://lca.ece.utexas.edu/pubs/jeff_micro10.pdf

Refresh penalty as a function of density and temperature (DDR3 DRAM)

Capacity  tRFC    BW overhead (85°C)  Latency overhead (85°C)  BW overhead (95°C)  Latency overhead (95°C)
512 Mb    90 ns   1.3%                0.7 ns                   2.7%                1.4 ns
1 Gb      110 ns  1.6%                1.0 ns                   3.3%                2.1 ns
2 Gb      160 ns  2.5%                2.4 ns                   5.0%                4.9 ns
4 Gb      300 ns  3.8%                5.8 ns                   7.7%                11.5 ns
8 Gb      350 ns  4.5%                7.9 ns                   9.0%                15.7 ns

Source: Elastic Refresh, Stuecheli et al. (ibid.)
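The bandwidth overheads in the table are essentially tRFC divided by the refresh interval tREFI. The tREFI values below are the JEDEC DDR3 figures (7.8 µs up to 85 °C, halved to 3.9 µs at extended temperature), not numbers from the slide; with them, the table's percentages fall out to within rounding:

```c
#include <stdio.h>

/* Refresh bandwidth overhead ~= tRFC / tREFI. tRFC values are from
 * the table above; tREFI (7.8 us @85C, 3.9 us @95C) is the JEDEC
 * DDR3 refresh interval, assumed here. */
int main(void) {
    const char *cap[] = {"512Mb", "1Gb", "2Gb", "4Gb", "8Gb"};
    double trfc_ns[] = {90, 110, 160, 300, 350};
    for (int i = 0; i < 5; i++)
        printf("%-6s tRFC %3.0f ns: %.1f%% @85C, %.1f%% @95C\n",
               cap[i], trfc_ns[i],
               100 * trfc_ns[i] / 7800.0, 100 * trfc_ns[i] / 3900.0);
    return 0;
}
```

The trend is the point: tRFC grows with density while tREFI is fixed, so refresh claims an ever larger slice of bandwidth, and running hot doubles the cost.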
Classical RAM Organization (~square)
A RAM cell array has bit (data) lines crossed by word (row) lines; each intersection holds a 6-T SRAM cell or a 1-T DRAM cell. A row decoder takes the row address and drives one word line; a column selector with I/O circuits takes the column address and picks out the requested data bit or word. One memory row holds a block of data, so the column address selects the requested bit or word from that block.
Source: http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view

Memory Organization
• DRAM is designed for density and high yield
  – Large number of bits per die, 1 - 4 Gbit today
  – Economics require high yield, which means relatively small dies. That in turn implies a very limited number of pads: traditionally 1 bit per chip.
  – Minimum memory becomes large: 64-bit words with 1-bit-wide 4 Gbit chips means 64 x 4 Gbit (32 GB) minimum memory
  – Remedy: more than 1 bit per chip; "by M" chips deliver M bits at a time
Source: David August, Princeton, Fall 2004, http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf

Classical DRAM Organization (~square planes)
The cell array is replicated as M bit planes, each intersection holding a 1-T DRAM cell. The row address opens the same row in every plane, and the column address selects the requested bit from that row in each plane, yielding M data bits per access.
Source: http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view

Classical DRAM Operation
• DRAM organization: N rows x N columns x M bits; reads or writes M bits at a time
• Each M-bit access requires a RAS/CAS cycle: the row address is strobed in with RAS, then the column address with CAS; a second M-bit access repeats the sequence, which sets the cycle time
Source: http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
(Further DRAM basics: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf)

DRAM Logic Diagram; DRAM Read Timing; DRAM Write Timing
Source: David August, Princeton, Fall 2004, http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf

DRAM Latency
RAM latency is given by four numbers, in the format "tCAS-tRCD-tRP-tRAS". For example, latency values given as 2.5-3-3-8 indicate tCAS=2.5, tRCD=3, tRP=3, tRAS=8. (Half-cycle values of latency, such as 2.5, are only possible in double data rate (DDR) RAM, where two parts of each clock cycle are used.)
• tCAS: the number of clock cycles needed to access a certain column of data in SDRAM. CAS latency is known as Column Address Strobe time, sometimes referred to as tCL.
• tRCD (RAS-to-CAS Delay): the number of clock cycles needed between a row address strobe (RAS) and a CAS. It is the time required between the computer defining the row and column of the given memory block and the actual read or write to that location. tRCD stands for Row address to Column address Delay time.
• tRP (RAS Precharge): the number of clock cycles needed to terminate access to an open row of memory and open access to the next row. It stands for Row Precharge time.
• tRAS (Row Active Time): the minimum number of clock cycles needed to access a certain row of data in RAM, between the data request and the precharge command. It is known as active-to-precharge delay. It stands for Row Address Strobe time.
Source: http://en.wikipedia.org/wiki/SDRAM_latency

SDRAM Access Timing
Initially, the row address is sent to the DRAM. After tRCD, the row is open and may be accessed. For SDRAM, multiple column accesses can be in progress at once; each read takes time tCAS. When done accessing the column, a precharge returns the SDRAM to the starting state after time tRP. Two other time limits must also be maintained: tRAS, the time for the refresh of the row to complete before it may be closed again, and tWR, the time that must elapse after the last write before the row may be closed.
Source: http://en.wikipedia.org/wiki/SDRAM_latency

DRAM evolution: timings (cycles, with equivalent time)

          PC-3200 (DDR-400)       PC2-6400 (DDR2-800)     PC3-12800 (DDR3-1600)
          memory 200 MHz          memory 200 MHz          memory 200 MHz
          I/O bus 200 MHz         I/O bus 400 MHz         I/O bus 800 MHz
          Typical      Fast       Typical      Fast       Typical       Fast
tCL       3 (15 ns)    2 (10 ns)  5 (12.5 ns)  4 (10 ns)  9 (11.25 ns)  8 (10 ns)
tRCD      4 (20 ns)    2 (10 ns)  5 (12.5 ns)  4 (10 ns)  9 (11.25 ns)  8 (10 ns)
tRP       4 (20 ns)    2 (10 ns)  5 (12.5 ns)  4 (10 ns)  9 (11.25 ns)  8 (10 ns)
tRAS      8 (40 ns)    5 (25 ns)  16 (40 ns)   12 (30 ns) 27 (33.75 ns) 24 (30 ns)

tCL: /CAS low to valid data out (equivalent to tCAC). tRCD: /RAS low to /CAS low time. tRP: /RAS precharge time (minimum precharge-to-active time). tRAS: row active time (minimum active-to-precharge time). DDR was introduced in 2000, DDR2 in 2003 and DDR3 in 2007.
Source: http://en.wikipedia.org/wiki/Dynamic_random_access_memory

DDR3 speed grades (JEDEC = Joint Electron Device Engineering Council)

JEDEC name  Memory clock  Cycle time  I/O bus clock  Data rate  Module name  Burst transfer rate  Timings (CL-nRCD-nRP)
DDR3-800    100.0 MHz     10.000 ns   400 MHz        800 MT/s   PC3-6400     6400 MB/s            5-5-5, 6-6-6
DDR3-1066   133.3 MHz     7.500 ns    533 MHz        1066 MT/s  PC3-8500     8533 MB/s            5-5-5, 6-6-6, 7-7-7, 8-8-8
DDR3-1333   166.7 MHz     6.000 ns    667 MHz        1333 MT/s  PC3-10600    10667 MB/s           7-7-7, 8-8-8, 9-9-9, 10-10-10
DDR3-1600   200.0 MHz     5.000 ns    800 MHz        1600 MT/s  PC3-12800    12800 MB/s           7-7-7, 8-8-8, 9-9-9, 10-10-10, 11-11-11
DDR3-1800   225.0 MHz     4.444 ns    900 MHz        1800 MT/s  PC3-14400    14400 MB/s           7-7-7, 8-8-8
DDR3-1866   233.3 MHz     4.286 ns    933 MHz        1866 MT/s  PC3-14900    14900 MB/s           7-7-7, 8-8-8, 9-9-9
DDR3-2000   250.0 MHz     4.000 ns    1000 MHz       2000 MT/s  PC3-16000    16000 MB/s           7-8-7, 8-8-8, 9-9-9, 10-10-10
DDR3-2133   266.7 MHz     3.750 ns    1067 MHz       2133 MT/s  PC3-17000    17067 MB/s           9-9-9, 10-10-10
DDR3-2200   275.0 MHz     3.636 ns    1100 MHz       2200 MT/s  PC3-17600    17600 MB/s           9-9-9, 10-10-10
DDR3-2500   312.5 MHz     3.200 ns    1250 MHz       2500 MT/s  PC3-20000    20000 MB/s           11-11-11

(Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf)
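The cycle counts in these tables convert to nanoseconds through the I/O bus clock, and a random access to a closed row costs roughly tRP + tRCD + tCL. A sketch using the DDR3-1600 9-9-9 grade from the table:

```c
#include <stdio.h>

/* Convert CL-tRCD-tRP cycle counts to ns via the I/O bus clock, and
 * estimate a closed-row random access as tRP + tRCD + tCL. */
int main(void) {
    double io_mhz = 800.0;           /* DDR3-1600 I/O bus clock */
    double ns_per_cycle = 1000.0 / io_mhz;
    int cl = 9, trcd = 9, trp = 9;   /* a common 9-9-9 grade */
    printf("tCL = %d cycles = %.2f ns\n", cl, cl * ns_per_cycle);
    printf("closed-row access ~ %.2f ns (tRP + tRCD + tCL)\n",
           (trp + trcd + cl) * ns_per_cycle);
    return 0;
}
```

That yields 11.25 ns for tCL, matching the timings table, and about 34 ns for a worst-case closed-row access: essentially unchanged since DDR-400, which is the memory wall in miniature.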
DRAM internal organization
• Banks are organized into sub-banks, made up of mats, which are in turn made up of sub-arrays
Source: http://www.hpl.hp.com/techreports/2008/HPL-2008-20.pdf

DRAM
• The DDR3 standard specifies 8 banks per chip
• DRAM chips are narrow, with a data path width of up to 16 bits, typically. The width is reflected in the naming of the DRAM and appears as x4, x8, x16 and x32.
• The JEDEC standard specifies a burst length of 8 for DDR3
• For a wider memory interface, multiple DRAM chips are assembled together. Thus, e.g., a 64-bit wide interface requires 16 x4 chips, 8 x8 chips, 4 x16 chips, or 2 x32 chips.

Micron 1 Gb DDR3 SDRAM organizations (all DDR3: 8 banks, burst mode of 8):
• 256 Mb x4
• 128 Mb x8
• 64 Mb x16
Source: http://www.micron.com/parts/dram/ddr3-sdram/~/media/Documents/Products/Data%20Sheet/DRAM/4251Gb_DDR3_SDRAM.ashx

DDR3 vs DDR2
• 2x the bandwidth of DDR2: data rate/pin 1600 MT/s vs 800 MT/s; bus bandwidth 12,800 MB/s vs 6,400 MB/s
• 8 banks vs 4 banks: more open banks for back-to-back accesses, hiding turn-around time and row precharge
Sources: http://www.xbitlabs.com/articles/memory/display/ddr3_2.html; http://www.micron.com/~/media/Documents/Products/Presentation/ddr3_advantages1.ashx
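The x4/x8/x16 widths and the fixed burst length of 8 determine both how many chips a 64-bit channel needs and how much data one burst delivers: 64 bits x 8 transfers = 64 bytes, conveniently one cache line. A sketch of that arithmetic:

```c
#include <stdio.h>

/* A 64-bit DDR3 channel built from xN chips needs 64/N devices, and
 * the fixed burst of 8 delivers 64 bits * 8 = 64 bytes per burst:
 * exactly one typical cache line. */
int main(void) {
    int channel_bits = 64, burst = 8;
    int widths[] = {4, 8, 16, 32};
    for (int i = 0; i < 4; i++)
        printf("x%-2d chips: %2d per 64-bit rank\n",
               widths[i], channel_bits / widths[i]);
    printf("one burst: %d bits x %d = %d bytes\n",
           channel_bits, burst, channel_bits * burst / 8);
    return 0;
}
```

This is why the burst length and the common 64 B cache line size fit together so neatly: one CAS command fills one line.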
DDR2 vs. DDR3: market
Source: http://www.micron.com/~/media/Documents/Products/Presentation/ddr3_advantages1.ashx

DDR2 vs DDR3
Source: http://www.xbitlabs.com/articles/memory/display/ddr3_2.html

DIMMs - Dual In-line Memory Modules
8-byte-wide memory modules (DIMMs); module generations are distinguished by pin count and the position of the keying notch:
• SDRAM: 168-pin
• DDR: 184-pin
• DDR2: 240-pin
• DDR3: 240-pin
Source: http://nik.uni-obuda.hu/sima/letoltes/magyar/SZA2011_osz/esti/Platforms_2011_12_01_E.ppt

DIMMs and Ranks
The number of ranks on a channel is limited due to signal integrity and driver power.
Sources: http://www.kingston.com/ukroot/serverzone/pdf_files/mem_ranks_eng.pdf; Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf

Memory ranks - capacity
Processor chipsets may limit the number of memory ranks per memory channel. Example: max 8 ranks (some Intel chipsets).
• Two pairs of 2 GB dual-rank memory modules: total ranks 2 x 2 x 2 = 8; total memory 2 x 2 x 2 GB = 8 GB
• Four pairs of 2 GB single-rank memory modules: total ranks 4 x 2 x 1 = 8; total memory 4 x 2 x 2 GB = 16 GB
(Further material: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf and http://www.stanford.edu/class/ee282/handouts/lect.06.opt.pdf)

DRAM control is tricky
• The CPU prioritizes memory accesses; transaction requests are sent to the memory controller
• The memory controller translates each transaction into the appropriately timed command sequence
• Transactions differ, as the sketch below illustrates:
  – open-row hit: just a CAS
  – no open row in the bank: activate (ACT/RAS), then CAS
  – wrong row open: write the row back via precharge (PRE), then ACT, then CAS
  – lots of timing issues
• Result: latency varies; often the command sequence can be stalled or even restarted, and the refresh controller always wins
Source: http://www.eng.utah.edu/~cs7810/pres/dram-cs7810-oview-devices-x2.pdf
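The three cases above map naturally onto a per-bank state check in the controller. The following is a toy model of that decision, not any real controller's logic; command names follow the slide's ACT/PRE/CAS vocabulary, and all timing constraints are ignored:

```c
#include <stdio.h>

/* Per-bank command selection for one read transaction: a toy model
 * of the three cases on the slide (row hit / closed bank / conflict).
 * Real controllers also enforce tRCD/tRP/tRAS and arbitrate refresh. */
enum { BANK_CLOSED = -1 };

static void issue(int bank, int open_row, int want_row) {
    if (open_row == want_row) {             /* row hit */
        printf("bank %d: CAS\n", bank);
    } else if (open_row == BANK_CLOSED) {   /* bank idle */
        printf("bank %d: ACT row %d, CAS\n", bank, want_row);
    } else {                                /* row conflict */
        printf("bank %d: PRE (close row %d), ACT row %d, CAS\n",
               bank, open_row, want_row);
    }
}

int main(void) {
    issue(0, 17, 17);          /* open-row hit: lowest latency */
    issue(1, BANK_CLOSED, 4);  /* closed bank: pay tRCD */
    issue(2, 9, 4);            /* conflict: pay tRP + tRCD */
    return 0;
}
```

The spread between the three cases is exactly why access latency varies and why controllers reorder requests to maximize open-row hits.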
References
• Transistor Count, http://en.wikipedia.org/wiki/Transistor_count
• New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies, F. Pollack, 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999, Haifa, Israel, http://dl.acm.org/citation.cfm?id=320080.320082, http://hpc.ac.upc.edu/Talks/dir07/T000065/slides.pdf
• Energy per Instruction Trends in Intel® Microprocessors, Ed Grochowski, Murali Annavaram, http://support.intel.co.jp/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf
• Intel Gulftown die shot, specs revealed, Ryan Shrout, February 3, 2010, PC Perspective, http://www.pcper.com/comments.php?nid=8348
• Intel Xeon Processors 5600 Series, http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/xeon-5600-brief.pdf
• AMD's 12-core Magny-Cours Opteron 6174 vs Intel's 6-core Xeon, Johan De Gelas, March 29, 2010, http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon
• Intel Nehalem Westmere, http://code.google.com/p/likwid-topology/wiki/Intel_Nehalem_Westmere
• Sandy Bridge-E hits the market with more cores, more threads, http://arstechnica.com/business/news/2011/11/sandy-bridge-e-hits-the-market-with-more-cores-more-threads.ars
• Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge
• Intel Sandy Bridge, http://www.7-cpu.com/cpu/SandyBridge.html
• Sandy Bridge for servers, http://realworldtech.com/page.cfm?ArticleID=RWT072811020122&p=1
• Venom GPU System with AMD's Opteron™ (aka Istanbul), http://www.3dprofessor.org/Reviews%20Folder%20Pages/Istanbul/SMISP1.htm
• Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor, Pat Conway, Nathan Kalyanasundharam, Gregg Donley, Kevin Lepak, Bill Hughes, IEEE Micro, pp. 16-29, March/April 2010, http://portal.nersc.gov/project/training/files/XE6-feb-2011/Architecture/Opteron-Memory-Cache.pdf

References (cont'd)
• High-Performance Power-Efficient x86-64 Server and Desktop Processors, Using the core codenamed "Bulldozer", http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.9-Desktop-CPUs/HC23.19.940-Bulldozer-White-AMD.pdf
• Blue Gene/P Architecture, Vitali Morozov, http://workshops.alcf.anl.gov/gs10/files/2010/01/Morozov-BlueGeneP-Architecture.pdf
• Using the Dawn BG system, Blaise Barny, https://computing.llnl.gov/tutorials/bgp
• IBM Blue Gene/Q Compute Chip, Ruud A. Haring, http://www.hotchips.org/hc23
• Of GigaHertz and CPW, Version 2, Mark Funk, Robert Gagliardi, Allan Johnson, Rick Peterson, January 30, 2010, http://www-03.ibm.com/systems/resources/pwrsysperf_OfGigaHertzandCPWs.pdf
• Mainline Functional Verification of IBM's Power7 Processor Core, John Ludden, http://www.tandvsolns.co.uk/DVClub/1_Nov_2010/Ludden_Power7_Verification.pdf
• Two billion-transistor beasts: Power7 and Niagara 3, John Stokes, 2010, http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars
• Power7 Processors: The Beat Goes On, Joel M. Tendler, http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf
• Poulson: An 8 Core 32 nm Next Generation Intel® Itanium® Processor, Steve Undy, http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.7-Server/HC23.19.721-Poulson-Chin-Intel-Revised%202.pdf
• Larry Ellison's First Sparc chip and server, Timothy Prickett Morgan, September 20, 2010, http://www.theregister.co.uk/2010/09/20/oracle_sparc_t3_chip_servers/print.html
• Sparc T3 Processor, http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t3-chip-ds-173097.pdf

References (cont'd)
• A Sub-1W to 2W Low-Power IA Processor for Mobile Internet Devices and Ultra-Mobile PCs in 45nm Hi-κ Metal Gate CMOS, Gianfranco Gerosa, Steve Curtis, Mike D'Addeo, Bo Jiang, Belliappa Kuttanna, Feroze Merchant, Binta Patel, Mohammed Taufique, Haytham Samarchi, ISSCC 2008, Session 13.1, Mobile Processing, http://download.intel.com/pressroom/kits/isscc/ISSC_Intel_Paper_Silverthorne.pdf
• Intel Atom D510 - Processor Information and Comparisons, http://www.diffen.com/difference/Special:Information/Intel_Atom_D510
• 2 GHz Capable Cortex-A9 Dual Core Processor Implementation, September 16, 2009, http://arm.com/files/downloads/Osprey_Analyst_Presentation_v2a.pdf
• ARM Low Power Leadership, CMP Conference, Eric Lalardie, January 28, 2010, http://cmp.imag.fr/aboutus/slides/Slides2010/14_ARM_lalardie_2009.pdf
• Apple's A4 dissected, discussed ... and tantalizing, Paul Boldt, Don Scansen, Tim Whibley, June 17, 2010, http://www.eetimes.com/electronics-news/4200451/Apple-s-A4-dissected-discussed--and-tantalizing
• 1TOPS/W Software Programmable Media Processor, David Moloney, http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.8-Video/HC23.19.811-1TOPS-Media-Moloney-Movidius.pdf
• No exponential is "forever": but "forever" can be delayed!, Gordon E. Moore, ISSCC, February 10, 2003, http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fstamp%2Fstamp.jsp%3Ftp%3D%26arnumber%3D1234194&authDecision=-203
• Concepts in VLSI Design, Computer-Aided Digital VLSI Design, Michael Bushnell, Rutgers, http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt
• 22nm SRAM Announcement, Mark Bohr, Intel, September 2009, http://download.intel.com/pressroom/kits/events/idffall_2009/pdfs/IDF_MBohr_Briefing.pdf

References (cont'd)
• Under the Hood: 1-Gbit DDR3 SDRAMs square off, Young Choi, September 3, 2007, http://www.eetimes.com/design/other/4004747/Under-the-Hood-1-Gbit-DDR3-SDRAMs-square-off
• Mark Bohr, IDF, Moscow, April 2006
• DRAM refresher: Problems the technology is set to encounter, Chris Edwards, June 29, 2011, http://www.newelectronics.co.uk/electronics-technology/dram-refresher-problems-the-technology-is-set-to-encounter/34922
• CSE 431, Computer Architecture, Fall 2005, Mary Jane Irwin, Lecture 18, Memory Hierarchy Review, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
• Samsung's 3x DDR3 SDRAM - 4F2 or 6F2? You Be the Judge, Dick James, January 31, 2011, http://www.electroiq.com/blogs/chipworks_real_chips_blog/2011/01/samsung-s-3x-ddr3-sdram-4f2-or-6f2-you-be-the-judge.html
• Concepts in VLSI Design, Lecture 18, SRAM, David Harris, Mike Bushnell, http://eceweb1.rutgers.edu/~bushnell/vlsidesign/digvlsideslec18.ppt
• Computer Architecture and Organization, COS471a, COS471b/ELE 375, Lecture 20: Memory Technology, David August, Fall 2004, http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
• Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory, Jeffrey Stuecheli, Dimitris Kaseridis, Hillery C. Hunter and Lizy K. John, http://lca.ece.utexas.edu/pubs/jeff_micro10.pdf
• EE282, Lecture 5: Memory, Jacob Leverich, Spring 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
• Dynamic random-access memory, http://en.wikipedia.org/wiki/Dynamic_random_access_memory