Transcript
COSC6365
COSC 6365 Lecture 6, 2011-02-02
Lennart Johnsson 2013-01-29
Introduction to HPC Architecture - Memory. Lecture 5, Lennart Johnsson, Dept. of Computer Science
CPU Transistor Counts 1971 – 2008 and Moore’s Law
AMD RV870 GPU: 2.15 B (2009)
AMD Cayman: 2.64 B (2010)
nVidia GF100 GPU: 3.0 B (2010)
Sparc T3, 16-core: 1.0 B (2010)
Intel Westmere, 6-core: 1.17 B (2010)
IBM Power7: 1.2 B (2010)
Intel Nehalem-EX, 8-core: 2.3 B (2010)
Intel Westmere-EX, 10-core: 2.6 B (2011)
Intel Itanium 8-core: 3.1 B (2012)
Date of introduction: http://en.wikipedia.org/wiki/Transistor_count
Processor Die Area and Transistor Count

Processor                  Transistors    Year  Manufact.   Process  Area
Dual-Core Itanium 2        1,700,000,000  2006  Intel       90 nm    596 mm²
POWER6                       789,000,000  2007  IBM         65 nm    341 mm²
Six-Core Opteron 2400        904,000,000  2009  AMD         45 nm    346 mm²
RV870 (GPU)                2,154,000,000  2009  AMD         40 nm    334 mm²
16-Core SPARC T3           1,000,000,000  2010  Sun/Oracle  40 nm    377 mm²
Six-Core Core i7           1,170,000,000  2010  Intel       32 nm    240 mm²
8-Core POWER7              1,200,000,000  2010  IBM         45 nm    567 mm²
4-Core Itanium Tukwila     2,000,000,000  2010  Intel       65 nm    699 mm²
8-Core Xeon Nehalem-EX     2,300,000,000  2010  Intel       45 nm    684 mm²
Cayman (GPU)               2,640,000,000  2010  AMD         40 nm    389 mm²
GF100 (GPU)                3,000,000,000  2010  nVidia      40 nm    529 mm²
Tahiti (GPU)               4,310,000,000  2011  AMD         28 nm    365 mm²
10-Core Xeon Westmere-EX   2,600,000,000  2011  Intel       32 nm    512 mm²
8-Core Itanium Poulson     3,100,000,000  2012  Intel       32 nm    544 mm²
Sandy Bridge, 8C           2,270,000,000  2012  Intel       32 nm    435 mm²
GK110 (GPU)                7,100,000,000  2013  nVidia      28 nm    ?

http://en.wikipedia.org/wiki/Transistor_count, http://en.wikipedia.org/wiki/Sandy_Bridge
Where have all the transistors gone?
Recall: Energy Consumption
"We are on the Wrong side of a Square Law" - Fred Pollack, 1999
New goal for CPU design: "Double Valued Performance every 18 months, at the same power level" - Fred Pollack
Pollack, F. (1999). New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies. Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture, Haifa, Israel.

Product                 Normalized Perf.  Normalized Power  EPI on 65 nm at 1.33 V (nJ)
i486                    1.0               1.0               10
Pentium                 2.0               2.7               14
Pentium Pro             3.6               9                 24
Pentium 4 (Willamette)  6.0               23                38
Pentium 4 (Cedarmill)   7.9               38                48
Pentium M (Dothan)      5.4               7                 15
Core Duo (Yonah)        7.7               8                 11

Ed Grochowski, Murali Annavaram, "Energy per Instruction Trends in Intel® Microprocessors." http://support.intel.co.jp/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf
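The EPI column can be reproduced from the two normalized columns: energy per instruction scales as normalized power divided by normalized performance, anchored at the i486's 10 nJ. A minimal sketch (the proportionality is my reading of the table, not stated on the slide; published values differ slightly from rounding):

```python
# Sketch: energy per instruction (EPI) follows power/performance.
# Values are the normalized numbers from the Pollack/Grochowski table;
# the 10 nJ i486 entry anchors the absolute scale.
BASE_EPI_NJ = 10.0  # i486 EPI from the table

def epi_nj(norm_perf, norm_power):
    """EPI relative to i486: power grew faster than performance."""
    return BASE_EPI_NJ * norm_power / norm_perf

for name, perf, power in [("Pentium", 2.0, 2.7),
                          ("Pentium Pro", 3.6, 9.0),
                          ("Pentium 4 (Cedarmill)", 7.9, 38.0),
                          ("Core Duo (Yonah)", 7.7, 8.0)]:
    print(f"{name:22s} {epi_nj(perf, power):5.1f} nJ")
```

The computed values land close to the table's 14, 24, 48 and 11 nJ, which is the "wrong side of a square law" point: performance bought with superlinear power.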
Recall: Solution to heat dissipation and double valued performance
Typical High Level Architecture ("CPU" = processing die)
[Figure: several cores on one die, each with processing logic and private Level 1 and Level 2 caches, sharing a Level 3 cache connected to DRAM]
- Processing logic clock rates from ~100+ MHz to 3+ GHz; execution widths from one operation to eight
- Level 1 cache: 1 - 5 cycles
- Level 2 cache: 8 - 15 cycles
- Level 3 cache: 25 - 50 cycles; 20 - 100 GB/s between cores and the cache hierarchy
- DRAM: 20 - 50 GB/s between the Level 3 cache and DRAM
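The latencies above combine into an average memory access time (AMAT) once hit rates are known. A small sketch; the hit rates here are illustrative assumptions, not values from the slide:

```python
# Average memory access time for a three-level cache hierarchy:
# each miss falls through to the next, slower level.
def amat(l1, l2, l3, mem, h1, h2, h3):
    """Latencies in cycles; h1..h3 are per-level hit rates."""
    return l1 + (1 - h1) * (l2 + (1 - h2) * (l3 + (1 - h3) * mem))

# e.g. 3-cycle L1, 12-cycle L2, 40-cycle L3, 200-cycle DRAM,
# with assumed hit rates of 95% / 80% / 90%:
cycles = amat(3, 12, 40, 200, h1=0.95, h2=0.80, h3=0.90)
print(f"AMAT = {cycles:.1f} cycles")  # 4.2 cycles
```

Even a 5% L1 miss rate keeps the average close to the L1 latency, which is why the deep hierarchy works.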
Processor examples
Intel Westmere/Gulftown
[Die photo: six cores flanking a shared L3 cache, with memory controller, misc I/O and QPI blocks]
Intel Westmere, 2010
- 1.17 billion transistors, 248 mm² in 32 nm
- Hyperthreading (2 threads/core)
- 2.26 - 3.33 GHz; 10.0 - 13.3 GF/core; 6 cores
- L1 cache: 32 kB + 32 kB per core (8-way, 64 B cache line, 64 sets)
- L2 cache: 256 kB per core (8-way, 64 B cache line, 512 sets)
- L3 cache: 12 MB shared (16-way, 64 B cache line, 12,288 sets, 36-cycle latency)
- 60 - 130 W TDP; 0.9 - 0.6 GF/W (TDP = Thermal Design Power)
http://www.pcper.com/comments.php?nid=8348 http://www.intel.com/Assets/PDF/prodbrief/323501.pdf http://code.google.com/p/likwid-topology/wiki/Intel_Nehalem_Westmere http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon
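The set counts quoted in these cache listings follow from the standard relation sets = capacity / (associativity × line size). A quick check against the Westmere numbers:

```python
# Cache geometry: a set-associative cache of a given capacity,
# associativity (ways) and line size has a fixed number of sets.
def cache_sets(capacity_bytes, ways, line_bytes):
    return capacity_bytes // (ways * line_bytes)

assert cache_sets(32 * 1024, 8, 64) == 64             # L1: 32 kB, 8-way, 64 B
assert cache_sets(256 * 1024, 8, 64) == 512           # L2: 256 kB, 8-way, 64 B
assert cache_sets(12 * 1024 * 1024, 16, 64) == 12288  # L3: 12 MB, 16-way, 64 B
print("Westmere set counts check out")
```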
Intel Sandy Bridge (2012)
• Two threads per core, "Hyperthreading"
• L1 cache: 32 kB + 32 kB per core, latency 3 cycles, 8-way
• L2 cache: 256 kB per core, latency 8 cycles, 8-way
• 64-byte cache line size
• Two load/store operations per CPU cycle per memory channel
• 256-bit/cycle ring bus between cores, graphics, cache and System Agent Domain
• Advanced Vector Extensions, 256-bit instruction set
• DDR3 up to 1600 MHz
• 40 desktop versions: 1.6 - 3.6 GHz, 1 - 6 cores; L3 cache 1 - 2.5 MB/core, latency 35 - 45 cycles, 12-way
• 45 server versions: 1.8 - 3.6 GHz, 2 - 8 cores; L3 cache 1.5 - 2.5 MB/core shared, latency 35 - 45 cycles, 12-way
• 57 mobile versions: 1.1 - 2.8 GHz, 1 - 4 cores; L3 cache up to 20 MB, latency 35 - 45 cycles, 12-way

Sources:
"Sandy Bridge-E hits the market with more cores, more threads," http://arstechnica.com/business/news/2011/11/sandy-bridge-e-hits-the-market-with-more-cores-more-threads.ars
Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge
Intel Sandy Bridge, http://www.7-cpu.com/cpu/SandyBridge.html
Sandy Bridge for servers, http://realworldtech.com/page.cfm?ArticleID=RWT072811020122&p=1
AMD Istanbul 6-core / Magny-Cours
AMD Magny-Cours, 2010
- 2 billion transistors, 2 × 346 mm² in 45 nm
- 1.7 - 2.3 GHz; 6.8 - 9.2 GF/core; 12 cores
- L1 cache: 64 kB + 64 kB per core (2-way, 64 B cache line, 512 sets, 3 cycles)
- L2 cache: 512 kB per core (16-way, 64 B cache line, 512 sets, 12 cycles)
- L3 cache: 2 × 6 MB shared (48-way, 64 B cache line, 2,048 sets)
- 90 - 140 W TDP; 0.9 - 0.8 GF/W
Conway et al., "Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor," IEEE Micro, pp. 16-29, March/April 2010
http://www.3dprofessor.org/Reviews%20Folder%20Pages/Istanbul/SMISP1.htm
http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon
AMD Bulldozer - Interlagos
Bulldozer module (2 cores); Orochi die (8 cores)

The Bulldozer core is architected to be power-efficient:
- Minimize silicon area by sharing functionality between two cores
- All blocks and circuits have been designed to minimize power (not just in the Bulldozer core)
- Extensive flip-flop clock-gating throughout the design; circuits power-gated dynamically
- Numerous power-saving features under firmware/software control: Core C6 state (CC6), core P-states/AMD Turbo CORE, Application Power Management (APM), DRAM power management, Message Triggered C1E

The die: 315 mm²
- Eight "Bulldozer" cores: high-performance, power-efficient AMD64 cores, two cores in each Bulldozer module
- L1D cache: 128 kB total, 16 kB/core, 64-byte cache line, 4-way associative, write-through
- L1I cache: 256 kB total, 64 kB/Bulldozer module, 64-byte cache line, 2-way associative
- L2 cache: 8 MB total, 2 MB/Bulldozer module, 64-byte cache line, 16-way associative
- Integrated Northbridge, which controls:
  - L3 cache: 8 MB, 64-byte cache line, 16-way associative, MOESI
  - two 72-bit wide DDR3 memory channels
  - four 16-bit receive/16-bit transmit HyperTransport™ links
http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.9-Desktop-CPUs/HC23.19.940-Bulldozer-White-AMD.pdf
IBM PowerPC
PowerPC 450: 208 M transistors, 90 nm, 16 W, 4 flops/cycle, 850 MHz
- L1 cache: 32 kB + 32 kB (64-way, 32 B cache line, 16 sets, round-robin replacement, write-through, 4-cycle latency, 8 B/cycle)
- L2 cache: 2 kB prefetch buffer (16 128-B lines, fully associative, 12-cycle latency, 8 B/cycle; avg. 4.6 B/cycle = 128 B/(12 + 128/8) cycles)
- L3 cache: shared, 4 banks of 2 MB each (each bank has an L3 directory and a 15-entry 128-B combining buffer; 35-cycle latency)
- Memory: 2 GB DDR2 @ 400 MHz (4 banks of 512 MB), 86-cycle latency
https://computing.llnl.gov/tutorials/bgp/ http://workshops.alcf.anl.gov/gs10/files/2010/01/Morozov-BlueGeneP-Architecture.pdf
System-on-a-Chip design: integrates processors, memory and networking logic into a single chip
IBM Blue Gene/Q (2012)
• 360 mm², Cu-45 technology (SOI); ~1.47 B transistors
• 16 user + 1 service processors, plus 1 redundant processor
  - all processors are symmetric
  - each 4-way multi-threaded
  - 64-bit PowerISA™, 1.6 GHz
  - L1 I/D cache = 16 kB/16 kB; L1 prefetch engines
  - each processor has a Quad FPU (4-wide double precision, SIMD)
  - peak performance 204.8 GFLOPS @ 55 W
• Central shared L2 cache: 32 MB
  - eDRAM
  - multiversioned cache; will support transactional memory and speculative execution
  - supports atomic ops
• Dual memory controller
  - 16 GB external DDR3 memory
  - 1.33 Gb/s
  - 2 × 16-byte-wide interface (+ECC)
• Chip-to-chip networking
  - router logic integrated into the BQC chip
• External I/O
  - PCIe Gen2 interface
Source: Ruud Haring, http://www.hotchips.org/hc23
IBM Power7
IBM Power7, 2010
- 1.2 billion transistors, 567 mm² in 45 nm
- 4-way simultaneous multi-threading (SMT)
- 3.0 - 4.25 GHz; 24 - 34 GF/core; 4, 6 or 8 cores; up to 4 chips/MCM; 12 execution units/core
- L1 instruction cache: 32 kB/core (4-way, 128 B cache line, 64 sets, 2-3 cycle latency)
- L1 data cache: 32 kB/core (8-way, 128 B cache line, 64 sets, 2-3 cycle latency, write-through)
- L2 cache: 256 kB/core (8-way, 128 B cache line, 256 sets, 8-cycle latency, write-back)
- L3 cache: 4 MB/core (8-way, 128 B cache line, 4,096 sets, 25-cycle latency)
- 200 W TDP; < 1.36 GF/W (peak)
http://www-03.ibm.com/systems/resources/pwrsysperf_OfGigaHertzandCPWs.pdf http://www.tandvsolns.co.uk/DVClub/1_Nov_2010/Ludden_Power7_Verification.pdf http://www-05.ibm.com/fr/events/hpc_summit_2010/P7_HPC_Summit_060410.pdf http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf
Itanium 8-core (2012)

Core
  Devices: 89 million
  Area: 20 mm²
  Threads: 2+2
  Instruction queue size: 96×2
  Max instruction issue/cycle: 12
  Pipeline stages: 4+7
  Pipeline hazard resolution: Replay

Chip
  Process: 32 nm
  Devices: 3.1 billion
  Area: 544 mm²
  Power (max TDP): 170 W
  Itanium® cores: 8
  Last-level cache size: 32 MB
  Intel® QuickPath Interconnect links: 4 full + 2 half
  Intel® QPI link speed: 6.4 GT/s
  Intel® Scalable Memory Interface links: 4
  Intel® SMI link speed: 6.4 GT/s
Poulson: An 8 Core 32 nm Next Generation Intel®Itanium®Processor, http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.7-Server/HC23.19.721-Poulson-Chin-Intel-Revised%202.pdf
Oracle SPARC T3/Niagara 3, 2010
- 1 billion transistors, 371 mm² in 40 nm
- 8-way simultaneous multi-threading (SMT)
- 1.65 GHz; 13 GF/core; 16 cores; 2 SIMD units/core, 4 ops/SIMD unit
- L1 cache: 8 kB (data) + 16 kB (instruction) per core
- L2 cache: 384 kB/core (6 MB total)
- L3 cache: none
- 75 - 140 W TDP; < 1.51 GF/W (peak)
http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars
http://www.theregister.co.uk/2010/09/20/oracle_sparc_t3_chip_servers/print.html
http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t3-chip-ds-173097.pdf
Intel ATOM Processor (Mobile)
Intel ATOM S1260, Q4 2012
- 32 nm; 2 GHz; Hyperthreading (2 threads/core); 4 GF/core; 2 cores
- L1 cache: 24 kB, 6-way (data) + 32 kB, 4-way (instruction) per core
- L2 cache: 512 kB/core, 8-way
- 8.5 W TDP; 0.94 GF/W
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-processor-s1200-datasheet-vol-1.pdf http://www.anandtech.com/show/6509/intel-launches-centerton-atom-s1200-family-first-atom-for-servers
ARM Cortex-A15
• ARM Cortex-A15 MPCore supports:
  - up to 4 cores
  - six independent power domains
  - optional SIMD/NEON™ unit
• L1 cache per core:
  - instruction cache: 32 kB, 2-way set associative, 64 B cache line, LRU replacement policy, parity per 16 bits
  - data cache: 32 kB, 2-way set associative, 64 B cache line, LRU replacement policy, 32-bit ECC, write-back and write-through, 1 - 2 cycle latency
• L2 cache, shared:
  - 512 kB - 4 MB, 16-way associativity
  - cache coherence; random replacement policy
  - L2 inclusive of the L1 caches
  - 3 - 8 cycle latency
  - option to include parity or ECC
• Power domains
• Advanced bus interface supporting up to 32 GB/s
  - exposes the Accelerator Coherence Port (ACP) for enhanced peripheral and accelerator SoC integration
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0438g/DDI0438G_cortex_a15_r3p2_trm.pdf
GPUs – AMD 5870 (2010)
• 1600 PEs
• 20 SIMD Engines (SE)
• 2.72 TF SP, 0.544 TF DP
• Memory BW 147 GB/s
• 8 kB L1 and 32 kB data share for each SE
• 64 kB global data share
• Four 128 kB L2 caches
• Up to 272 billion 32-bit fetches/second
• Up to 1 TB/s L1 texture fetch bandwidth
• Up to 435 GB/s between L1 & L2
• 225 W
Texas Instruments TMS320C6678
• Multi-core KeyStone SoC
• Fixed/floating-point C66x CorePac: 8 cores @ 1.25 GHz; 4.0 MB shared L2; 320G MAC, 160G FLOP, 60G DFLOPS; 10 W
• Multicore Navigator: queue manager, packet DMA
• Multicore Shared Memory Controller (MSMC): low-latency, high-bandwidth memory access
• HyperLink: 50G-baud expansion port, transparent to software
• Network coprocessor: IPv4/IPv6 network interface solution; 1.5 M pps (1 Gb Ethernet wire rate); IPSec, SRTP, encryption fully offloaded
• 3-port GigE switch (Layer 2)
[Block diagram: eight C66x DSP cores, each with private L1 and L2 caches, connected via TeraNet to the MSMC with 4 MB shared memory and a 64-bit DDR3 interface; network coprocessors (packet accelerator, crypto) and a GbE switch with SGMII ports; peripherals and I/O (EDMA, HyperLink 50, SRIO ×4, PCIe ×2, EMIF 16, TSIP ×2, I²C, SPI, UART); system elements (power management, SysMon, debug)]
Source: Eric Stotzer, TI
Movidius Myriad 65 nm media processor
- 180 MHz
- L1 cache: 1 kB; L2 cache: 128 kB/core
http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.8-Video/HC23.19.811-1TOPS-Media-Moloney-Movidius.pdf
Cache Summary
• Server processors today typically have three levels of on-chip cache
• Typical cache characteristics:
  - Level 1: separate data and instruction caches; 16 - 64 kB, 2-way - 8-way (PowerPC exception: 64-way; Movidius: 1 kB); 1 - 5 cycles latency
  - Level 2: shared cache for data and instructions (Itanium exception); 256 kB - 512 kB, 8-way - 16-way (PowerPC exception: 2 kB, fully associative; Movidius: 128 kB); ~8 - 15 cycles latency
  - Level 3: shared for data and instructions; 1 - 6 MB/core, typically shared among all cores, 8 MB - 32 MB/chip, 8-way - 48-way; ~25 - 50 cycles latency
• Mobile CPUs typically do NOT have on-chip level 3 cache
Threads/socket
• Intel Sandy Bridge: 16
• AMD Interlagos: 32
• IBM Power 7: 32
• IBM BG/Q: 64
• Oracle T3: 128
• AMD/ATI: 320 64-bit
• nVidia Fermi: 256 64-bit
Energy Observations
http://www.lbl.gov/cs/html/Manycore_Workshop09/GPU%20Multicore%20SLAC%202009/dallyppt.pdf
Mobile CPUs die size • ARM Cortex A9 in 40 nm technology: – 4.6 mm2 power optimized (0.5 W) – 6.7 mm2 performance optimized (1.9 W)
• Atom in 45 nm technology: – 26 mm2 (2.5 W)
Note: 1 – 10% of a typical server CPU area, 0.2 – 5% of a typical server CPU power http://www.cs.virginia.edu/~skadron/cs8535_s11/ARM_Cortex.pdf
Memory systems
Memory terminology (logical)
• Registers
• Cache
• Main memory
• Disk
• Archive
Memory hierarchy (logical)
• Registers (CMOS), optimized for latency and bandwidth
  - low latency (1 cycle); high bandwidth, at least three operands per core cycle. Example: 3 × 64-bit operands @ 3 GHz = 72 GB/s
• Cache(s) (CMOS), optimized for latency
  - today typically 3 levels, all on the same die as the cores; Level 1 and Level 2 are typically private to a core, while Level 3 is often shared
  - latencies (typical): Level 1, 1 - 5 core cycles; Level 2, 5 - 15 core cycles; Level 3, 15 - 40 core cycles
  - bandwidths (typical): 8 B - 32 B wide data paths operating at a clock rate up to that of the cores; ~20 - 100 GB/s
• Main memory (CMOS), optimized for cost, leakage and pin-out
  - latencies (typical): ~200 core cycles
  - bandwidth (typical): ~10 GB/s
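The register-bandwidth example above is just operands × operand width × clock rate:

```python
# Register-file bandwidth for the slide's example: three 64-bit
# operands delivered every cycle at 3 GHz.
operands_per_cycle = 3
bytes_per_operand = 8      # 64 bits
clock_hz = 3e9             # 3 GHz

bandwidth = operands_per_cycle * bytes_per_operand * clock_hz
print(f"{bandwidth / 1e9:.0f} GB/s")  # 72 GB/s
```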
Memory terminology
• SRAM: Static Random Access Memory
• DRAM: Dynamic Random Access Memory
• SDRAM: Synchronous DRAM
• RDRAM: Rambus DRAM
IC Complexity
Source: Gordon E Moore, Intel ISSCC February 10, 2003
First 1,024-bit DRAM Chip - 1970 (Intel)
• 1970s processes usually had only nMOS transistors
  - inexpensive, but consume power while idle
• 1980s-present: CMOS processes for low idle power
http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt
First 256-Bit Static RAM -- 1970 The Fairchild 4100
http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt
First 65,536 Bit Dynamic Memory Chip -- 1977 IBM Corporation
http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt
First 294,912 Bit Dynamic RAM Memory -- 1981 IBM Corporation
http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt
Intel 22nm SRAM
http://download.intel.com/pressroom/kits/events/idffall_2009/pdfs/IDF_MBohr_Briefing.pdf
SRAM
Mark Bohr, IDF, Moscow, April 2006
DRAM: Samsung 1-Gbit, 80-nm DDR3
The Samsung sample has a die size of 127 mm², and the DRAM cell size is 0.053 µm². Samsung's DRAM cell is an 8F² cell with a metal-insulator-metal (MIM) capacitor. The DDR3 device uses a spherical recess-access transistor with a gate length of 47 nm. The wordline width and pitch are 47 nm and 165 nm, respectively. The bitline width and pitch are 40 nm and 160 nm.

The Micron 1-Gbit DDR3 sample's die size is 102 mm², and it has 38 percent cell efficiency. Micron's 6F² cell, with an area of 0.0365 µm², is the smallest DRAM cell that Semiconductor Insights has analyzed. The cell efficiency for DDR3 designs ranges from 33 to 45 percent, whereas the cell efficiency for DDR2 designs in the same technology is much higher, between 41 and 54 percent. The wide internal data bus and the related circuitry, including the data read/write amplifiers and multiplexing circuits that support the 8-bit prefetch architecture, consume precious silicon area. Extra pipeline stages to support DDR3's high-speed I/O, improved on-die termination circuitry and other features contribute to the die-size overhead. Given DDR3's relatively low cell efficiency, higher-density DDR3 designs (1 Gbit and higher) make the most sense from a cost perspective.
http://www.eetimes.com/design/other/4004747/Under-the-Hood-1-Gbit-DDR3-SDRAMs-square-off
DDR4
http://www.xbitlabs.com/news/memory/display/20120912171441_Samsung_Demonstrates_DDR4_Memory_Modules_DDR4_Roadmap.html
http://semiaccurate.com/2012/02/28/ddr4-shows-up-in-the-wild/samsubg-ddr4-die
http://isscc.org/media/2012/plenary/Eli_Harari/SilverlightLoader.html
http://isscc.org/media/2012/plenary/Eli_Harari/SilverlightLoader.html
http://isscc.org/media/2012/plenary/Eli_Harari/SilverlightLoader.html
http://www.newelectronics.co.uk/electronics-technology/dram-refresher-problems-the-technology-is-set-to-encounter/34922 2011-06-29
January 30, 2009
Samsung announced this week that it has developed the world’s highest density DRAM chip. Using its 50 nm technology, Samsung has made the world’s first 4 Gb DDR3 DRAM chip. The South Korean electronics company said that its low-power 4Gb DDR3 is of the ‘green’ variety, which it is pitching as a selling point to data center managers because it “will not only provide a reduction in electricity bills, but also a cutback in installment fees, maintenance fees and repair fees involving power suppliers and heat-emitting equipment.” Samsung’s new 4 Gb DDR3 DRAM chips operate at 1.35V, and the company even does the handy math for us by saying it’s a 20 percent improvement over a 1.5V DDR3. Also, its maximum speed is 1.6 gigabits per second (Gbps). The company goes on to explain that 4 Gb DDR3 can consume 40 percent less power than 2 Gb DDR3 (in the case of 16 GB module configurations) because of its higher density and because it uses only half the DRAM (32 vs. 64 chips). Since 2010 produced in 40 nm technology
Memory Die size • Samsung, 4 Gbit DDR3 – 104.7 mm2 in 46 nm technology – 69.30 mm2 in 35 nm technology – Power reduction 30%
• Hynix 4 Gbit DDR3 – 30.9 mm2 in 23 nm technology
• Hynix 2 Gbit DDR4 – 43.15 mm2 in 38 nm technology
http://www.datastorage.ch/web4archiv/objects/objekte/marketingcenter/1/pcm1102220014gbddr3sdramb-die78fbga1.pdf
Memory Density
Density has improved nicely, quadrupling about every three years, until recently, when the density increase for DRAM slowed to about half the historic rate, i.e., doubling about every three years. What about memory speed? What about energy consumption?
Processor-Memory Performance Gap
[Figure: performance vs. year, 1980 - 2004, log scale 1 - 10,000; CPU performance ("Moore's Law") diverging steadily from DRAM performance]
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
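The divergence in the figure compounds annually. The growth rates below are the commonly cited Hennessy & Patterson estimates (~1.55×/year for CPU performance after 1986, ~1.07×/year for DRAM), assumed here since the slide does not state them:

```python
# Sketch of why the processor-memory gap grows: two compound annual
# improvement rates. 1.55x/yr (CPU) and 1.07x/yr (DRAM) are the widely
# quoted Hennessy & Patterson figures, assumed rather than taken from
# this slide.
cpu_rate, dram_rate = 1.55, 1.07

for years in (5, 10, 20):
    gap = (cpu_rate / dram_rate) ** years
    print(f"after {years:2d} years the relative CPU/DRAM gap is ~{gap:,.0f}x")
```

At these rates the gap grows by roughly 45% per year, which is the widening wedge the log-scale plot shows.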
The "Memory Wall"
• Logic vs DRAM speed gap continues to grow
[Figure: log scale 0.01 - 1000; clocks per instruction (Core) falling while clocks per DRAM access (Memory) rises, from VAX/1980 through PPro/1996 to 2010+]
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
Memory terminology
• DDR, DDR2, DDR3, DDR4
• Memory rank
• Memory bank
DDR SDRAM
• DDR: data rate is 2× the clock
  - one data phase in each half of the memory bus clock
[Timing diagram: clock cycles T0 - T4; data items D0 - D7 transferred on both the rising and falling clock edges]
• SDRAM: Synchronous Dynamic Random Access Memory
http://www.sbsmn.org/Memory%20training%20Aug%202009.ppt
DRAM Evolution

Standard name  Memory clock  Cycle time  I/O bus clock  Data transfers/s  Module name  Peak transfer rate
DDR-200        100 MHz       10 ns       100 MHz        200 million       PC-1600      1600 MB/s
DDR-266        133 MHz       7.5 ns      133 MHz        266 million       PC-2100      2100 MB/s
DDR-333        166 MHz       6 ns        166 MHz        333 million       PC-2700      2667 MB/s
DDR-400        200 MHz       5 ns        200 MHz        400 million       PC-3200      3200 MB/s
DDR2-400       100 MHz       10 ns       200 MHz        400 million       PC2-3200     3200 MB/s
DDR2-533       133 MHz       7.5 ns      266 MHz        533 million       PC2-4200     4266 MB/s
DDR2-667       166 MHz       6 ns        333 MHz        667 million       PC2-5300     5333 MB/s
DDR2-800       200 MHz       5 ns        400 MHz        800 million       PC2-6400     6400 MB/s
DDR2-1066      266 MHz       3.75 ns     533 MHz        1066 million      PC2-8500     8533 MB/s
DDR3-800       100 MHz       10 ns       400 MHz        800 million       PC3-6400     6400 MB/s
DDR3-1066      133 MHz       7.5 ns      533 MHz        1066 million      PC3-8500     8533 MB/s
DDR3-1333      166 MHz       6 ns        667 MHz        1333 million      PC3-10600    10667 MB/s
DDR3-1600      200 MHz       5 ns        800 MHz        1600 million      PC3-12800    12800 MB/s
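The peak transfer rates (and the PC module names) in the DRAM evolution table follow from a 64-bit (8-byte) module moving data on both edges of the I/O bus clock:

```python
# Peak module bandwidth: 64-bit module, double data rate on the
# I/O bus clock. MHz * 2 transfers/cycle * 8 bytes = MB/s, which is
# also the number in the "PC-xxxx" module name.
def peak_mb_per_s(io_clock_mhz):
    return io_clock_mhz * 2 * 8

assert peak_mb_per_s(100) == 1600   # DDR-200   -> PC-1600
assert peak_mb_per_s(400) == 6400   # DDR2-800  -> PC2-6400
assert peak_mb_per_s(800) == 12800  # DDR3-1600 -> PC3-12800
print("module names encode the peak MB/s")
```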
CMOS Technology Basics
• RAM is made out of silicon CMOS technologies, which effectively act as charge-transfer technologies
[Figure: RC circuit - a voltage V driving a capacitor C through a resistor R; the capacitor voltage follows V(1 - e^(-t/RC))]
• R is proportional to length/(wire cross section), l/(w·h)
• C is proportional to area/thickness
• If all dimensions but length scale by a factor s, then RC increases in proportion to s!
• Halving the feature sizes but keeping the chip size fixed would, with this simple scaling, make memory slower!!
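One way to see the scaling claim, using the slide's proportionalities R ∝ l/(w·h) and C ∝ area/thickness. Which dimensions shrink is my assumption: here the wire width and the dielectric thickness scale down by s while length and height stay fixed, giving linear growth of RC:

```python
# Wire-delay scaling sketch (arbitrary units) under the slide's model:
#   R ~ l / (w * h)   and   C ~ (l * w) / t
def rc(l, w, h, t):
    return (l / (w * h)) * ((l * w) / t)

s = 2.0
base = rc(l=1.0, w=1.0, h=1.0, t=1.0)
# Shrink width w and dielectric thickness t by s; keep l and h fixed:
# R grows by s, C is unchanged, so the RC delay grows by s.
shrunk = rc(l=1.0, w=1.0 / s, h=1.0, t=1.0 / s)
print(f"RC grows by a factor {shrunk / base:.1f}")  # factor s = 2.0
```

The punchline matches the slide: shrinking a wire's cross section without shortening it makes its delay worse, not better.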
Memory
DRAM (Dynamic Random Access Memory) - designed for density (and low cost)
  Samsung: 0.0092 µm² cell in 48 nm technology
  http://www.electroiq.com/blogs/chipworks_real_chips_blog/2011/01/samsung-s-3x-ddr3-sdram-4f2-or-6f2you-be-the-judge.html
SRAM (Static Random Access Memory) - designed for speed
  [Circuit diagram: 6-T SRAM cell with a word line and complementary bit/bit_b lines]
  Intel: ~0.1 µm² cell in 22 nm technology
  http://download.intel.com/pressroom/kits/events/idffall_2009/pdfs/IDF_MBohr_Briefing.pdf
http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec18.ppt
DRAM
• A read destroys the value stored; every read needs to be a read/write operation
• DRAM is volatile: it must be refreshed frequently, typically at least every 64 ms; refresh is typically done one row at a time
• Refresh impacts performance and energy consumption negatively
http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
Impact of Refresh on Performance
Source: Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory, Jeffrey Stuecheli, Dimitris Kaseridis, Hillery C. Hunter and Lizy K. John, http://lca.ece.utexas.edu/pubs/jeff_micro10.pdf
Refresh penalty as a function of density and temperature

DDR3 DRAM   tRFC    Bandwidth        Latency          Bandwidth        Latency
capacity            overhead (85°C)  overhead (85°C)  overhead (95°C)  overhead (95°C)
512 Mb      90 ns   1.3%             0.7 ns           2.7%             1.4 ns
1 Gb        110 ns  1.6%             1.0 ns           3.3%             2.1 ns
2 Gb        160 ns  2.5%             2.4 ns           5.0%             4.9 ns
4 Gb        300 ns  3.8%             5.8 ns           7.7%             11.5 ns
8 Gb        350 ns  4.5%             7.9 ns           9.0%             15.7 ns

Source: Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory, Jeffrey Stuecheli, Dimitris Kaseridis, Hillery C. Hunter and Lizy K. John, http://lca.ece.utexas.edu/pubs/jeff_micro10.pdf
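The bandwidth-overhead columns are approximately tRFC divided by the refresh interval tREFI: for tRFC out of every tREFI the rank is busy refreshing. The tREFI values of 7.8 µs (up to 85 °C) and 3.9 µs (above 85 °C) are the usual JEDEC DDR3 numbers, assumed here since the slide lists only tRFC:

```python
# Refresh bandwidth overhead ~ tRFC / tREFI. tREFI = 7.8 us (<=85C)
# and 3.9 us (>85C) are standard JEDEC DDR3 values (an assumption;
# the table itself only gives tRFC per capacity).
def refresh_overhead_pct(trfc_ns, trefi_ns):
    return 100.0 * trfc_ns / trefi_ns

print(f"4 Gb @ 85C: {refresh_overhead_pct(300, 7800):.1f}%")  # ~3.8%
print(f"4 Gb @ 95C: {refresh_overhead_pct(300, 3900):.1f}%")  # ~7.7%
print(f"8 Gb @ 85C: {refresh_overhead_pct(350, 7800):.1f}%")  # ~4.5%
```

These reproduce the 4 Gb and 8 Gb rows of the table; halving tREFI at high temperature doubles the overhead, which is why the 95 °C columns are exactly twice the 85 °C ones.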
Classical RAM Organization (~Square)
[Figure: a RAM cell array with a row decoder driving word (row) lines from the row address; bit (data) lines feed a column selector and I/O circuits indexed by the column address; each intersection represents a 6-T SRAM cell or a 1-T DRAM cell]
One memory row holds a block of data, so the column address selects the requested bit or word from that block.
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
Memory Organization
Source: David August, Princeton, Fall 2004 http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
Memory Organization
• DRAM designed for density and high yield
  - large number of bits per die: 1 - 4 Gbit today
  - economics require high yield, which means relatively small dies; that in turn implies a very limited number of pads: traditionally 1 bit per chip
  - minimum memory becomes large: 64-bit words means 64 × 4 Gbit (32 GB) minimum memory with 4-Gbit chips
  - remedy: more than 1 bit per chip; "by M" chips deliver M bits at a time
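The minimum-memory arithmetic above, sketched for "by M" chips:

```python
# Minimum memory for a 64-bit word built from "by M" DRAM chips:
# a rank needs 64/M chips, and each chip contributes its full capacity.
def min_memory_gb(chip_gbit, bits_per_chip):
    chips = 64 // bits_per_chip
    return chips * chip_gbit / 8  # Gbit -> GB

assert min_memory_gb(4, 1) == 32.0  # x1 chips: the slide's 32 GB minimum
assert min_memory_gb(4, 8) == 4.0   # x8 chips shrink the minimum to 4 GB
print("wider chips shrink the minimum memory")
```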
Classical DRAM Organization (~Square Planes)
[Figure: M bit planes, each a RAM cell array with a row decoder driving word (row) lines from the row address; bit (data) lines feed column selector and I/O circuits indexed by the column address; each intersection represents a 1-T DRAM cell]
The column address selects the requested bit from the row in each plane.
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
Classical DRAM Operation
• DRAM organization:
  - N rows × N columns × M bits
  - read or write M bits at a time
  - each M-bit access requires a RAS/CAS cycle
[Timing diagram: for each M-bit access, the row address is presented with RAS and the column address with CAS; the cycle time spans the first and second M-bit accesses, with the M bit planes producing the M-bit output]
http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
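The RAS/CAS scheme amounts to splitting a cell address into a row field and a column field, so the same pins can carry both halves in turn. A sketch; the 14/14-bit split (a square 16K × 16K plane) is illustrative:

```python
# RAS/CAS addressing: a cell address splits into a row part (sent with
# RAS) and a column part (sent with CAS). 14+14 bits is an illustrative
# square organization, not a value from the slide.
ROW_BITS = COL_BITS = 14  # 16384 rows x 16384 columns

def split_address(addr):
    row = addr >> COL_BITS
    col = addr & ((1 << COL_BITS) - 1)
    return row, col

row, col = split_address(0x123456)
assert (row << COL_BITS) | col == 0x123456  # lossless split
print(f"row {row}, column {col}")
```

Multiplexing the address this way is what lets a dense DRAM die get by with so few pins.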
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
DRAM Logic Diagram
Source: David August, Princeton, Fall 2004 http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
DRAM Read Timing
Source: David August, Princeton, Fall 2004 http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
DRAM Write Timing
Source: David August, Princeton, Fall 2004 http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
DRAM Latency
RAM latencies are given by four numbers, in the format "tCAS-tRCD-tRP-tRAS". For example, latency values given as 2.5-3-3-8 would indicate tCAS = 2.5, tRCD = 3, tRP = 3, tRAS = 8. (Half-cycle latency values such as 2.5 are only possible in double data rate (DDR) RAM, where two parts of each clock cycle are used.)
• tCAS: the number of clock cycles needed to access a certain column of data in SDRAM. CAS latency, or Column Address Strobe time, sometimes referred to as tCL.
• tRCD (RAS-to-CAS Delay): the number of clock cycles needed between a row address strobe (RAS) and a CAS. It is the time required between the computer defining the row and column of the given memory block and the actual read or write to that location. tRCD stands for Row address to Column address Delay time.
• tRP (RAS Precharge): the number of clock cycles needed to terminate access to an open row of memory and open access to the next row. It stands for Row Precharge time.
• tRAS (Row Active Time): the minimum number of clock cycles needed to access a certain row of data in RAM between the data request and the precharge command; also known as active-to-precharge delay. It stands for Row Address Strobe time.
http://en.wikipedia.org/wiki/SDRAM_latency
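These timing counts are in bus (command) clock cycles, so converting them to nanoseconds only needs the bus clock. For DDR3-1600 the bus runs at 800 MHz, i.e. 1.25 ns per cycle:

```python
# Convert "9-9-9"-style timing counts from cycles to nanoseconds.
def cycles_to_ns(cycles, bus_clock_mhz):
    return cycles * 1000.0 / bus_clock_mhz

assert cycles_to_ns(9, 800) == 11.25  # DDR3-1600 CL9 -> 11.25 ns
assert cycles_to_ns(5, 400) == 12.5   # DDR2-800 CL5  -> 12.5 ns
assert cycles_to_ns(3, 200) == 15.0   # DDR-400 CL3   -> 15 ns
print("CAS latency in ns has barely moved across generations")
```

The cycle counts grow with each generation while the absolute latency stays near 10 - 15 ns, which is the memory-wall story in miniature.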
SDRAM Access Timing
[Timing diagram: a row access (tRCD) is followed by column accesses (tCAS per read, tWR after a write) and a precharge (tRP); tRAS spans from row activation to precharge.]
Initially, the row address is sent to the DRAM. After tRCD, the row is open and may be accessed. In SDRAM, multiple column accesses can be in progress at once; each read takes time tCAS. When column access is complete, a precharge returns the SDRAM to the starting state after time tRP. Two other time limits must also be maintained: tRAS, the time for the refresh of the row to complete before it may be closed again, and tWR, the time that must elapse after the last write before the row may be closed.
http://en.wikipedia.org/wiki/SDRAM_latency
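The sequence above determines how long a single read takes: a hit in the open row needs only a column access, while a miss must precharge and re-activate first. A toy sketch (function name and structure are assumptions, not from the slides):

```python
# Illustrative: SDRAM read latency in bus clock cycles, best vs worst case.
# Row hit: the row is already open, only the column access (tCAS) is paid.
# Row miss: precharge the open row (tRP), activate the new row (tRCD),
# then do the column access (tCAS).

def read_latency_cycles(t_cas, t_rcd, t_rp, row_hit):
    if row_hit:
        return t_cas
    return t_rp + t_rcd + t_cas

# DDR3-1600 typical 9-9-9 timings:
print(read_latency_cycles(9, 9, 9, row_hit=True))    # 9 cycles
print(read_latency_cycles(9, 9, 9, row_hit=False))   # 27 cycles
```

The 3x gap between the two cases is why memory controllers try hard to schedule accesses that hit already-open rows.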
DRAM evolution (entries are cycles / time)

            PC-3200 (DDR-400)       PC2-6400 (DDR2-800)      PC3-12800 (DDR3-1600)
            introduced 2000         introduced 2003          introduced 2007
            Typical     Fast        Typical      Fast        Typical       Fast
tCL         3 / 15 ns   2 / 10 ns   5 / 12.5 ns  4 / 10 ns   9 / 11.25 ns  8 / 10 ns
tRCD        4 / 20 ns   2 / 10 ns   5 / 12.5 ns  4 / 10 ns   9 / 11.25 ns  8 / 10 ns
tRP         4 / 20 ns   2 / 10 ns   5 / 12.5 ns  4 / 10 ns   9 / 11.25 ns  8 / 10 ns
tRAS        8 / 40 ns   5 / 25 ns   16 / 40 ns   12 / 30 ns  27 / 33.75 ns 24 / 30 ns

tCL:  /CAS low to valid data out (equivalent to tCAC)
tRCD: /RAS low to /CAS low time
tRP:  /RAS precharge time (minimum precharge to active time)
tRAS: Row active time (minimum active to precharge time)
http://en.wikipedia.org/wiki/Dynamic_random_access_memory
JEDEC         Memory     Cycle      I/O bus    Data rate   Module     Burst transfer  Timings
equiv. name   clock      time       clock                  name       rate            (CL-nRCD-nRP)
DDR3-800      100.0 MHz  10.000 ns   400 MHz    800 MT/s   PC3-6400    6400 MByte/s   5-5-5, 6-6-6
DDR3-1066     133.3 MHz   7.500 ns   533 MHz   1066 MT/s   PC3-8500    8533 MByte/s   5-5-5, 6-6-6, 7-7-7, 8-8-8
DDR3-1333     166.7 MHz   6.000 ns   667 MHz   1333 MT/s   PC3-10600  10667 MByte/s   7-7-7, 8-8-8, 9-9-9, 10-10-10
DDR3-1600     200.0 MHz   5.000 ns   800 MHz   1600 MT/s   PC3-12800  12800 MByte/s   7-7-7, 8-8-8, 9-9-9, 10-10-10, 11-11-11
DDR3-1800     225.0 MHz   4.444 ns   900 MHz   1800 MT/s   PC3-14400  14400 MByte/s   7-7-7, 8-8-8
DDR3-1866     233.3 MHz   4.286 ns   933 MHz   1866 MT/s   PC3-14900  14900 MByte/s   7-7-7, 8-8-8, 9-9-9
DDR3-2000     250.0 MHz   4.000 ns  1000 MHz   2000 MT/s   PC3-16000  16000 MByte/s   7-8-7, 8-8-8, 9-9-9, 10-10-10
DDR3-2133     266.7 MHz   3.750 ns  1067 MHz   2133 MT/s   PC3-17000  17067 MByte/s   9-9-9, 10-10-10
DDR3-2200     275.0 MHz   3.636 ns  1100 MHz   2200 MT/s   PC3-17600  17600 MByte/s   9-9-9, 10-10-10
DDR3-2500     312.5 MHz   3.200 ns  1250 MHz   2500 MT/s   PC3-20000  20000 MByte/s   11-11-11
JEDEC = Joint Electron Device Engineering Council
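The columns in the table are related by fixed ratios: in DDR3 the I/O bus runs at four times the memory (array) clock, data transfers on both clock edges (data rate = 2x I/O clock), and a 64-bit (8-byte) module moves 8 bytes per transfer. A sketch of the arithmetic, not from the slides:

```python
# Illustrative: derive a DDR3 module's ratings from its memory clock.
def ddr3_ratings(memory_clock_mhz):
    io_clock_mhz = 4 * memory_clock_mhz    # I/O bus runs at 4x the array clock
    data_rate_mts = 2 * io_clock_mhz       # double data rate: both clock edges
    burst_mbytes_s = data_rate_mts * 8     # 64-bit (8-byte) module interface
    return io_clock_mhz, data_rate_mts, burst_mbytes_s

# DDR3-1600 (PC3-12800): 200 MHz memory clock
print(ddr3_ratings(200.0))   # (800.0, 1600.0, 12800.0)
```

This reproduces the DDR3-1600 row: 800 MHz I/O bus, 1600 MT/s data rate, 12800 MByte/s burst transfer rate.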
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
DRAM
• Banks are organized into sub-banks, made up of mats, which in turn are made up of sub-arrays.
http://www.hpl.hp.com/techreports/2008/HPL-2008-20.pdf
DRAM
• The DDR3 standard specifies 8 banks/chip.
• DRAM chips are narrow, with a data path width of up to 16 bits, typically. The width is reflected in the naming of the DRAM and appears as x4, x8, x16 and x32.
• The JEDEC standard specifies a burst length of 8 for DDR3.
• For a wider memory interface, multiple DRAM chips are assembled together. Thus, e.g., a 64-bit wide interface requires 16 x4 chips, 8 x8 chips, 4 x16 chips, or 2 x32 chips.
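The last bullet is just the interface width divided by the chip width; a small sketch (function name is illustrative):

```python
# Illustrative: number of DRAM chips needed to fill a memory interface.
def chips_per_rank(interface_bits, chip_width_bits):
    assert interface_bits % chip_width_bits == 0
    return interface_bits // chip_width_bits

# A 64-bit interface built from x4, x8, x16 or x32 chips:
for width in (4, 8, 16, 32):
    print(f"x{width}: {chips_per_rank(64, width)} chips")
```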
1 Gb (256M x4) DDR3 SDRAM: 8 banks, burst length of 8
http://www.micron.com/parts/dram/ddr3-sdram/~/media/Documents/Products/Data%20Sheet/DRAM/4251Gb_DDR3_SDRAM.ashx
1 Gb (128M x8) DDR3 SDRAM: 8 banks, burst length of 8
http://www.micron.com/parts/dram/ddr3-sdram/~/media/Documents/Products/Data%20Sheet/DRAM/4251Gb_DDR3_SDRAM.ashx
1 Gb (64M x16) DDR3 SDRAM: 8 banks, burst length of 8
http://www.micron.com/parts/dram/ddr3-sdram/~/media/Documents/Products/Data%20Sheet/DRAM/4251Gb_DDR3_SDRAM.ashx
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
DDR3 vs DDR2
http://www.xbitlabs.com/articles/memory/display/ddr3_2.html
• 2x the bandwidth of DDR2
  – Data rate/pin: 1600 MT/s vs 800 MT/s
  – Bus bandwidth: 12,800 MByte/s vs 6,400 MByte/s
• 8 banks vs 4 banks
  – More open banks for back-to-back access
  – Hide turn-around time
  – Hide row precharge
http://www.micron.com/~/media/Documents/Products/Presentation/ddr3_advantages1.ashx
DDR2 vs. DDR3 - market
http://www.micron.com/~/media/Documents/Products/Presentation/ddr3_advantages1.ashx
DDR2 vs DDR3
http://www.xbitlabs.com/articles/memory/display/ddr3_2.html
DIMMs – Dual In-line Memory Modules
8-byte wide memory modules (DIMMs); generations are distinguished by pin count and keying (notch position):
  SDRAM  168-pin
  DDR    184-pin
  DDR2   240-pin
  DDR3   240-pin
http://nik.uni-obuda.hu/sima/letoltes/magyar/SZA2011_osz/esti/Platforms_2011_12_01_E.ppt
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
DIMMs and Ranks
http://www.kingston.com/ukroot/serverzone/pdf_files/mem_ranks_eng.pdf
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
The number of ranks on a channel is limited due to signal integrity and driver power.
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
Memory ranks – capacity
Processor chipsets may limit the number of memory ranks per memory channel. Example: max 8 ranks (some Intel chipsets).
• Two pairs of 2 GB dual-rank memory modules: total ranks 2 x 2 x 2 = 8; total memory 2 x 2 x 2 GB = 8 GB.
• Four pairs of 2 GB single-rank memory modules: total ranks 4 x 2 x 1 = 8; total memory 4 x 2 x 2 GB = 16 GB.
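The arithmetic above can be sketched as a small checker (names and structure are illustrative, not from the slides):

```python
# Illustrative: check a DIMM configuration against a chipset's rank limit.
def config(pairs, modules_per_pair, ranks_per_module, gb_per_module):
    total_ranks = pairs * modules_per_pair * ranks_per_module
    total_gb = pairs * modules_per_pair * gb_per_module
    return total_ranks, total_gb

MAX_RANKS = 8  # e.g. some Intel chipsets

# Two pairs of 2 GB dual-rank modules vs four pairs of 2 GB single-rank:
dual = config(2, 2, 2, 2)
single = config(4, 2, 1, 2)
print(dual)     # (8, 8)  -> 8 ranks, 8 GB
print(single)   # (8, 16) -> same rank budget, twice the capacity
```

Both configurations hit the 8-rank limit, but single-rank modules reach it with twice the total memory.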
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
Source: Jacob Leverich, 2011, http://www.stanford.edu/class/ee282/handouts/lect.06.opt.pdf
DRAM control is tricky
• CPU prioritizes memory accesses
  – transaction requests are sent to the memory controller
• Memory controller
  – translates each transaction into the appropriately timed command sequence
  – transactions are different:
    – open bank: then it's just a CAS
    – no open bank: then Activate, PRE, RAS, CAS
    – wrong open bank: write-back, and then ACT, PRE, RAS, CAS
    – lots of timing issues
• Result: latency varies
  – often the command sequence can be stalled or even restarted
  – the refresh controller always wins
http://www.eng.utah.edu/~cs7810/pres/dram-cs7810-oview-devices-x2.pdf
References
• Transistor Count, http://en.wikipedia.org/wiki/Transistor_count
• New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies, Pollack, F., 32nd Annual IEEE/ACM International Symposium on Microarchitecture, 1999, Haifa, Israel, http://dl.acm.org/citation.cfm?id=320080.320082, http://hpc.ac.upc.edu/Talks/dir07/T000065/slides.pdf
• Energy per Instruction Trends in Intel® Microprocessors, Ed Grochowski, Murali Annavaram, http://support.intel.co.jp/pressroom/kits/core2duo/pdf/epi-trends-final2.pdf
• Intel Gulftown die shot, specs revealed, Ryan Shrout, February 3, 2010, PC Perspective, http://www.pcper.com/comments.php?nid=8348
• Intel Xeon Processors 5600 Series, http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/xeon-5600-brief.pdf
• AMD's 12-core Magny-Cours Opteron 6174 vs Intel's 6-core Xeon, Johan De Gelas, March 29, 2010, http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon
• Intel Nehalem Westmere, http://code.google.com/p/likwid-topology/wiki/Intel_Nehalem_Westmere
• Sandy Bridge-E hits the market with more cores, more threads, http://arstechnica.com/business/news/2011/11/sandy-bridge-e-hits-the-market-with-more-cores-more-threads.ars
• Sandy Bridge, http://en.wikipedia.org/wiki/Sandy_Bridge
• Intel Sandy Bridge, http://www.7-cpu.com/cpu/SandyBridge.html
• Sandy Bridge for servers, http://realworldtech.com/page.cfm?ArticleID=RWT072811020122&p=1
• Venom GPU System with AMD's Opteron™ (aka Istanbul), http://www.3dprofessor.org/Reviews%20Folder%20Pages/Istanbul/SMISP1.htm
• Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor, Pat Conway, Nathan Kalyanasundharam, Gregg Donley, Kevin Lepak, Bill Hughes, IEEE Micro, pp. 16-29, March/April 2010, http://portal.nersc.gov/project/training/files/XE6-feb-2011/Architecture/Opteron-Memory-Cache.pdf
References (cont'd)
• High-Performance Power-Efficient x86-64 Server and Desktop Processors, Using the core codenamed "Bulldozer", http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.9-DesktopCPUs/HC23.19.940-Bulldozer-White-AMD.pdf
• Blue Gene/P Architecture, Vitali Morozov, http://workshops.alcf.anl.gov/gs10/files/2010/01/MorozovBlueGeneP-Architecture.pdf
• Using the Dawn BG system, Blaise Barny, https://computing.llnl.gov/tutorials/bgp
• IBM Blue Gene/Q Compute Chip, Ruud A. Haring, http://www.hotchips.org/hc23
• Of GigaHertz and CPW, Version 2, Mark Funk, Robert Gagliardi, Allan Johnson, Rick Peterson, January 30, 2010, http://www03.ibm.com/systems/resources/pwrsysperf_OfGigaHertzandCPWs.pdf
• Mainline Functional Verification of IBM's Power7 Processor Core, John Ludden, http://www.tandvsolns.co.uk/DVClub/1_Nov_2010/Ludden_Power7_Verification.pdf
• Two billion-transistor beasts: Power7 and Niagara 3, John Stokes, 2010, http://arstechnica.com/business/news/2010/02/two-billion-transistor-beasts-power7-and-niagara-3.ars
• Power7 Processors: The Beat Goes on, Joel M. Tendler, http://www.ibm.com/developerworks/wikis/download/attachments/104533501/POWER7+-+The+Beat+Goes+On.pdf
• Poulson: An 8 Core 32 nm Next Generation Intel® Itanium® Processor, Steve Undy, http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.7-Server/HC23.19.721-Poulson-ChinIntel-Revised%202.pdf
• Larry Ellison's First Sparc chip and server, Timothy Prickett Morgan, September 20, 2010, http://www.theregister.co.uk/2010/09/20/oracle_sparc_t3_chip_servers/print.html
• Sparc T3 Processor, http://www.oracle.com/us/products/servers-storage/servers/sparc-enterprise/t-series/sparc-t3-chip-ds-173097.pdf
References (cont'd)
• A Sub-1W to 2W Low-Power IA Processor for Mobile Internet Devices and Ultra-Mobile PCs in 45nm Hi-κ Metal Gate CMOS, Gianfranco Gerosa, Steve Curtis, Mike D'Addeo, Bo Jiang, Belliappa Kuttanna, Feroze Merchant, Binta Patel, Mohammed Taufique, Haytham Samarchi, ISSCC 2008, Session 13.1, Mobile Processing, http://download.intel.com/pressroom/kits/isscc/ISSC_Intel_Paper_Silverthorne.pdf
• Intel Atom D510 – Processor Information and Comparisons, http://www.diffen.com/difference/Special:Information/Intel_Atom_D510
• 2 GHz Capable Cortex-A9 Dual Core Processor Implementation, September 16, 2009, http://arm.com/files/downloads/Osprey_Analyst_Presentation_v2a.pdf
• ARM Low Power Leadership, CMP Conference, Eric Lalardie, January 28, 2010, http://cmp.imag.fr/aboutus/slides/Slides2010/14_ARM_lalardie_2009.pdf
• Apple's A4 dissected, discussed, … and tantalizing, Paul Boldt, Don Scansen, Tim Whibley, June 17, 2010, http://www.eetimes.com/electronics-news/4200451/Apple-s-A4-dissected-discussed--and-tantalizing
• 1TOPS/W Software Programmable Media Processor, David Moloney, http://www.hotchips.org/archives/hc23/HC23-papers/HC23.19.8-Video/HC23.19.811-1TOPS-Media-Moloney-Movidius.pdf
• No exponential is "forever": but "forever" can be delayed!, Gordon E. Moore, ISSCC, February 10, 2003, http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fstamp%2Fstamp.jsp%3Ftp%3D%26arnumber%3D1234194&authDecision=-203
• Concepts in VLSI Design, Computer-Aided Digital VLSI Design, Michael Bushnell, Rutgers, http://www.ece.rutgers.edu/~bushnell/vlsidesign/digvlsideslec1.ppt
• 22nm SRAM Announcement, Mark Bohr, Intel, September 2009, http://download.intel.com/pressroom/kits/events/idffall_2009/pdfs/IDF_MBohr_Briefing.pdf
References (cont'd)
• Under the Hood: 1-Gbit DDR3 SDRAMs square off, Young Choi, September 3, 2007, http://www.eetimes.com/design/other/4004747/Under-the-Hood-1-Gbit-DDR3-SDRAMs-square-off
• Mark Bohr, IDF, Moscow, April 2006
• DRAM refresher: Problems the technology is set to encounter, Chris Edwards, June 29, 2011, http://www.newelectronics.co.uk/electronics-technology/dram-refresher-problems-the-technology-is-set-to-encounter/34922
• CSE 431, Computer Architecture, Fall 2005, Mary Jane Irwin, Lecture 18, Memory Hierarchy Review, http://www.cse.psu.edu/research/mdl/mji/mjicourses/431/cse431-18memoryintro.ppt/view
• Samsung's 3x DDR3 SDRAM – 4F2 or 6F2? You Be the Judge, Dick James, January 31, 2011, http://www.electroiq.com/blogs/chipworks_real_chips_blog/2011/01/samsung-s-3x-ddr3-sdram-4f2-or-6f2-you-be-the-judge.html
• Concepts in VLSI Design, Lecture 18, SRAM, David Harris, Mike Bushnell, http://eceweb1.rutgers.edu/~bushnell/vlsidesign/digvlsideslec18.ppt
• Computer Architecture and Organization, COS471a, COS471b/ELE 375, Lecture 20: Memory Technology, David August, Fall 2004, http://www.cs.princeton.edu/courses/archive/fall04/cos471/lectures/20-Memory.pdf
• Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory, Jeffrey Stuecheli, Dimitris Kaseridis, Hillery C. Hunter and Lizy K. John, http://lca.ece.utexas.edu/pubs/jeff_micro10.pdf
• EE282, Lecture 5: Memory, Jacob Leverich, Spring 2011, http://www.stanford.edu/class/ee282/handouts/lect.05.dram.pdf
• Dynamic random-access memory, http://en.wikipedia.org/wiki/Dynamic_random_access_memory