ENEE 759H, Spring 2005 Memory Systems: Architecture and Performance Analysis

Memory Systems Architecture and Performance Analysis
ENEE 759H, Spring 2005, Lecture4.fm
Bruce Jacob, David Wang
University of Maryland, ECE Dept.

SLIDE 1: System Controller

Credit where credit is due: Slides contain original artwork (© Jacob, Wang 2005)

SLIDE 2: System Controller

[Diagram labels: CPU (OS/drivers/etc.); AGP; System Controller (North Bridge); 3D gfx processor; graphics memory (Z-buffer, Texture); Harddisc; I/O Controller (Southbridge); Game AI; Collision detection / geometry info; multi megabyte texture; ethernet packet; RAM; Keyboard; mouse; Ethernet card]

• Heavy demand placed on memory system
• Heavier still in SMP/SMT/CMP system
• System Controller == System traffic cop

SLIDE 3: Memory Request Overview

[Diagram labels: Processor Core; DTLB; L1 cache; L2 cache; BIU (Bus Interface Unit); system controller (physical to memory addr mapping, memory request scheduling, read data buffer, I/O to memory traffic); RAM; DRAM core. Stages of instruction execution: Fetch, Decode, Exec, Mem, WB]

Part A: Searching on-chip for data
[A1] Virtual to physical address translation (DTLB access)
[A2] L1 D-Cache access. If miss, then proceed to [A3]
[A3] L2 Cache access. If miss, then send to BIU
[A4 + B] Bus Interface Unit (BIU) obtains data from main memory

Proceeding through the memory hierarchy in a modern processor

Part B: Going off-chip for data
[B1] BIU arbitrates for ownership of address bus **
[B2] Request sent to system controller
[B3] Physical addr. to memory addr. translation
[B4] Memory request scheduling **
[B5] Memory addr. setup (RAS/CAS)
[B6, B7] DRAM dev. obtains data and returns to controller
[B8] System controller returns data to CPU
** Steps not required for some processor/system controllers; protocol dependent.

Progression of a Memory Read Transaction Request Through Memory System

SLIDE 4: “Memory Latency”

[Diagram: CPU, Mem Controller, and DRAM, with path segments labeled A, B, C, D, E1, E2/E3, F]

A: Transaction request may be delayed in Queue
B: Transaction request sent to Memory Controller
C: Transaction converted to Command Sequences (may be queued)
D: Command/s sent to DRAM
E1: Requires only a CAS, or
E2: Requires RAS + CAS, or
E3: Requires PRE + RAS + CAS
F: Transaction sent back to CPU

“DRAM Latency” = A + B + C + D + E + F

SLIDE 5: Small System Topologies

[Three block diagrams built from CPU, System Controller (memory controller), DRAM, and I/O blocks]

• Classic small system topology (Lots of systems)
• Point-to-point processor-controller system topology (AMD Athlon/Alpha EV6/PPC 970)
• Integrated system controller system topology (AMD Opteron/Alpha EV7 etc.)

Legend: the marked block represents the point of synchronization* (* for local access)

SLIDE 6: System Controller: Athlon

[Block diagram labels: CPU, CPU, BIU0, BIU1, AGP, MRO, MCT, DRAM, APC, PCI]

MRO: Memory Request Organizer
APC: AGP PCI Controller block
MCT: Memory Controller (SDRAM/DDR/DRDRAM)

SLIDE 7: MRO: Memory Request Organizer

• Request crossbar responsible for scheduling memory read and write requests from BIU, PCI, AGP
• Serves as the coherence point
• Requests are reordered to minimize page conflicts and maximize page hits
• Anti-starvation mechanism by aging of entries
• Arbitration bypassed during idle conditions to improve latency

SLIDE 8: AMD Athlon Controller

Chip Version         Tech & Voltage    Max Core Speed   Die Size (pad limited)   No. of pins
SDRAM, 1P, 2xAGP     0.35 µm, 3.3 V    100 MHz          107 mm²                  492
SDRAM, 2P, 2xAGP     0.35 µm, 3.3 V    100 MHz          130 mm²                  656
DDR, 1P, 4xAGP       0.25 µm, 2.5 V    133 MHz          133 mm²                  553
DDR, 2P, 4xAGP       0.25 µm, 2.5 V    133 MHz          107 mm²                  492
RDRAM, 1P, 4xAGP     0.25 µm, 2.5 V    133 MHz

SLIDE 9: Cache Coherency I

[Block diagram labels: CPU, CPU, BIU0, BIU1, MRO, MCT, DRAM; arrow: Read Request]

Request: I would like data for cacheline 0x001CA980

SLIDE 10: Cache Coherency II

[Block diagram labels: CPU, CPU, BIU0, BIU1, MRO, MCT, DRAM; arrow: Snoop Request]

Snoop Request: Do you have cacheline 0x001CA980?
Memory Fetch: Give me data for 0x001CA980.

SLIDE 11: Cache Coherency IIIa

[Block diagram labels: CPU, CPU, BIU0, BIU1, MRO, MCT, DRAM; arrows: Data, Snoop Response]

Snoop Response: No
SDRAM MCT: RAS to rank 2, bank 0, row 0x00842
SDRAM MCT: CAS to rank 2, bank 0, col 0x0C3
SDRAM MCT: Here’s the data.

SLIDE 12: Cache Coherency IIIb

[Block diagram labels: CPU, CPU, BIU0, BIU1, MRO, MCT, DRAM; arrows: Data, Snoop Response; Stale Data goes to trash]

Snoop Response: Yes, I have this cache line
SDRAM MCT: RAS to rank 2, bank 0, row 0x00842
SDRAM MCT: CAS to rank 2, bank 0, col 0x0C3
MRO: Here’s the data.

SLIDE 13: Why worry about CC? Part 1

[Diagram labels: CPU, CPU, “Point of Synchronization”, DRAM; Distance to data in cache; Distance to data in DRAM]

What if distance to DRAM is shorter than distance to cache (in another CPU)?

SLIDE 14: Why worry about CC? Part 2

[Timing diagram over bus cycles 1 to 18 with phases: arbitration, command, error (checking), snoop, response, data]

• Data returned by system to CPU is assumed to be the “most current copy”.
• Read transaction latency is bound by max(snoop_latency, dram_latency)
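The two latency relations in the slides, the sum "DRAM Latency" = A + B + C + D + E + F from slide 4 and the bound max(snoop_latency, dram_latency) from slide 14, can be sketched in a few lines of Python. This is a minimal illustration only: the timing constants, the `RowState` names, and the function names are assumptions made for the example and do not come from the lecture.

```python
from enum import Enum


class RowState(Enum):
    """Hypothetical names for the three DRAM access cases E1, E2, E3."""
    OPEN_HIT = "E1"   # target row already open: CAS only
    CLOSED = "E2"     # bank precharged: RAS + CAS
    CONFLICT = "E3"   # a different row is open: PRE + RAS + CAS


# Illustrative timings in nanoseconds (assumed values, not from the slides).
T_PRECHARGE = 15.0
T_ROW_ACTIVATE = 15.0
T_COLUMN_READ = 15.0


def dram_access_ns(state):
    """Component E: time spent in the DRAM device for the given row state."""
    if state is RowState.OPEN_HIT:
        return T_COLUMN_READ
    if state is RowState.CLOSED:
        return T_ROW_ACTIVATE + T_COLUMN_READ
    return T_PRECHARGE + T_ROW_ACTIVATE + T_COLUMN_READ


def dram_latency_ns(a, b, c, d, state, f):
    """"DRAM Latency" = A + B + C + D + E + F, with E chosen by row state."""
    return a + b + c + d + dram_access_ns(state) + f


def read_latency_ns(snoop_latency, dram_latency):
    """The read completes only after both the snoop and the DRAM fetch finish."""
    return max(snoop_latency, dram_latency)


if __name__ == "__main__":
    dram = dram_latency_ns(5.0, 10.0, 5.0, 5.0, RowState.CLOSED, 10.0)
    print("DRAM path:", dram, "ns")
    print("Read latency with an 80 ns snoop:", read_latency_ns(80.0, dram), "ns")
```

With these assumed numbers the snoop, not the DRAM, sets the read latency, which is the situation slide 13 asks about.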
Caption: Intel P6 system bus read transaction latency breakdown

[Two topology diagrams, each with cpu 0, cpu 1, cpu 2, cpu 3, a system controller (northbridge), and RAM]

• Processors can grab the request address off the shared bus in a shared multi-drop topology
• System controller rebroadcasts the request address to aid in snoop for a point-to-point topology

SLIDE 15: Multiple Clock Domains I

[Block diagram of the System Controller and its clock domains]

CPU bus: command: 133 MHz; address: 133*2 MHz; data: 133*4 MHz
High Bandwidth I/O: AGP: 66*4 MHz
Low Bandwidth I/O: PCI: 33 MHz
RAM: DDR: 133*2 MHz; DDR: 166*2 MHz

Most clock domains are integer multiples of each other

SLIDE 16: Multiple Clock Domains II

[Block diagram of the System Controller and its clock domains]

Processor bus: 66/100/133 MHz
High Bandwidth I/O: AGP: 66*2 MHz (32 bit)
Low Bandwidth I/O: PCI: 33 MHz (32 bit)
RAM: SDRAM: 100/133 MHz

What if clock domains are not integer multiples of each other?

SLIDE 17: Multiple Clock Domains III

AMD Athlon Chipset Gearbox Logic

[Timing diagram signals: FastClk (100MHz), SlowClk (66MHz), Data (D0, D1, D2), Latched Data (D0, D1, D2), LatchEn, CtlMask]
[Circuit diagram: Slow clock domain (Dff clocked by SlowClk, Cmb Logic); Gear Box Module (CtlMask, LatchEn, FastClk); Fast clock domain (Dff clocked by FastClk, Cmb Logic)]

SLIDE 18: Multiple Clock Domains IV

[Timing diagram, d0 to d3: Data transfer from 100 MHz clock domain to 133 MHz clock domain (Latency Optimal)]
[Timing diagram, d0 to d3: Data transfer from 100 MHz clock domain to 133 MHz clock domain (Bandwidth Optimal)]

SLIDE 19: Multiple Clock Domains V

[Timing diagram, d0 to d3: Data transfer from 400 MHz clock domain to 800 MHz clock domain (Latency Optimal)]
[Timing diagram, d0 to d3: Data transfer from 400 MHz clock domain to 800 MHz clock domain (Bandwidth Optimal)]

SLIDE 20: Multiple Clock Domains VI

Processor to Processor Bus Interface

[Three timing diagrams of CPU clock cycles against the bus clock]

• CPU Clock Domain (10:2): 500 MHz (harmonic), CPU cycles 0 to 9, Bus Clock - 100 MHz
• CPU Clock Domain (11:2): 550 MHz (non-harmonic), CPU cycles 0 to 10, Bus Clock - 100 MHz
• CPU Clock Domain (12:2): 600 MHz (harmonic), CPU cycles 0 to 11, Bus Clock - MHz

Fractional multipliers could impact performance, but we may not have a choice

SLIDE 21: Multiple Clock Domains VII

[Plot: AMD Athlon SPEC CPU FP Completion Time. Y axis: Total Task Time (seconds), ticks 4800 to 5300. X axis: Cycle Time (ns), ticks 0.58 to 0.68. Labeled points: 1466 MHz, 1533 MHz, 1600 MHz, 1666 MHz, 1733 MHz. Legend: Non-Harmonic Node, Harmonic Node]

SLIDE 22: AMD Opteron L2 Miss Latency

[Block diagram with numbered steps 1 to 17: Opteron (Processor Core, L1 Cache, L2, clock domain crossing, System Request Queue, crossbar, memory controller, three hypertransport links), DRAM, CPU or I/O; the span through the memory controller and DRAM is marked “memory latency”]

L2 Access Latency path, in order: L1 Access Miss; L2 Request; L2 Tag; L2 Data; Route/Mux/ECC; Write L1 D$ & FWD

Memory latency path, in order: L1 Access Miss; L2 Request; Addr to NB; Clock Boundary; SRQ; GART/Addr Decode; Crossbar; Coh./Order Check; Memory Controller; Req to DRAM Pins; ... DRAM Access; Data to MCT; NB Route; Clock Boundary; CPU route/mux/ECC; Write L1 D$ & FWD

NB = northbridge

SLIDE 23: Summary

• System Controller is a “traffic cop”
• Traffic cop may have to deal with clock domain synchronization issues
• Handles Cache Coherency for small-scale SMP configurations
• “Memory Latency” depends on lots of little things, not just speed of DRAM.
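The harmonic versus non-harmonic distinction from the clock-domain slides (10:2, 11:2, 12:2 against a 100 MHz bus) can be checked with a short Python sketch. It assumes ideal clocks whose rising edges coincide at time zero, and it counts only the wait until the next CPU rising edge; it does not model the Athlon gearbox logic or any synchronizer margin. The function names are hypothetical.

```python
from fractions import Fraction


def is_harmonic(cpu_mhz, bus_mhz):
    """True when the CPU clock is an integer multiple of the bus clock."""
    return Fraction(cpu_mhz, bus_mhz).denominator == 1


def sync_wait_cpu_cycles(cpu_mhz, bus_mhz, bus_edges=4):
    """Wait, in CPU cycles, from each bus rising edge to the next CPU rising edge.

    Assumes both clocks have a rising edge at time zero. A wait of 0 means
    the two edges are aligned.
    """
    ratio = Fraction(cpu_mhz, bus_mhz)  # CPU cycles per bus cycle
    waits = []
    for k in range(bus_edges):
        edge_time = k * ratio           # bus edge position in CPU cycles
        waits.append(-edge_time % 1)    # distance up to the next whole CPU cycle
    return waits


if __name__ == "__main__":
    for cpu_mhz in (500, 550, 600):
        print(cpu_mhz, "MHz:", is_harmonic(cpu_mhz, 100),
              [str(w) for w in sync_wait_cpu_cycles(cpu_mhz, 100)])
```

Under these assumptions the 550 MHz case lands every second bus edge half a CPU cycle away from a CPU edge, which is one simple way to see why a fractional multiplier can cost time at the bus interface.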