A 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM, MPEG-4 Accelerator and 3D Rendering Engine for Mobile Applications

Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae, In-Cheol Park and Hoi-Jun Yoo
Dept. of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Korea

Outline
• Introduction
• System Architecture Overview
• Low Power Block Design
  – 32Bit RISC
  – MPEG-4 Accelerator
  – 3D Rendering Engine
  – Embedded DRAM Frame Buffer
• Features of Test Chip
• Conclusions

Requirements for Future Mobile Information Terminals
• Multimedia Signal Processing
  – mp3, 2D Image Processing, etc.
  – 3D Graphics
• Low Power Features
  – Battery-driven Products
• Low Cost Solutions
  – Major Factor for Consumer Electronics

Target Specifications
[Diagram: Low Power Multimedia Processor at the center, with an LCD Display]
• System Power
  – < 200mW
• 3D Image Rendering
  – > 2 Mpolygons/sec
  – 256 x 256 Resolution with 24b True Color
  – 16b Z-Buffering
  – Alpha Blending
  – Double Buffering
• MPEG-4 Video Decoding
  – Simple Profile
  – QCIF (176 x 144)
  – 15 frames/sec
• Others
  – mp3, etc.

The Proposed Solution
H/W & S/W Mixed Solution Optimized at Architecture / Circuit Level
• High Perform. CPU
  – Large Area
  – High Power Consumption
• Optimized Perform. CPU + Dedicated H/W + Embedded DRAM
  – Distribution of Computational Load
  – Small Area
  – Programmability
  – Circuit Level Low Power Techniques

Architecture Overview
[Block diagram. Labels: ARM9, MAC, Clk Gen., DLL, 80MHz, 20MHz, B.W. Equalizer, Dual-Port SRAM (2KB), MC Accelerator, 3D Rendering Engine, Frame Buffer, Frame Buffer + Z-Buffer, YCrCb to RGB, SAM, Color Out (24b), Ext. I/O 32b, bus widths 128b, 512b and 2048b]
• On-Chip Wide Bus between Logic / eDRAM
• Data Buffering
• Slow / Wide Data Transaction (512b @ 20MHz)
• Fast / Narrow Data Transaction (32b @ 80MHz)

Multimedia Enhancement in RISC
[Figure: 5-Stage Pipeline (F, D, EX, MEM, WB); Execution Units (REG File, ALU, MUL); multiplier built as a Tree Structure with 4:2 Adders]
• 1-Cycle 32b x 32b Multiplication
• 2-Cycle 32b x 32b Multiplication and Accumulation
• 23% Cycle Reduction Compared with Conventional ARM Architecture

Bandwidth Equalizer
[Figure. Labels: From RISC, 32b @ 80MHz; Flow Cont.; DP-SRAM (2KB), 512b; To Dedicated H/W, 512 (32)b @ 20MHz; CELL, BL, BL2, SEBLSA, BLSA, SAE, CS, DB]
• Single Ended for Tight Bit Pitch
• Act as A Row Cache
• WBE : Wide Bus Enable
• STR : Cache Store
• DDO : Direct Data Out

Motion Compensation (MC) Accelerator
[Figure. Labels: FB Cont and Frame Buffer #0 (512b x 128row x 9bank); FB Cont and Frame Buffer #1 (512b x 128row x 9bank); 512b; Adaptive Fetch Control; FB Buffer; MUXing Logic; Half-Pel ALU #0 to #7; Pixel ALU #0 to #7; Data Alignment; Pixel Buffer 128b (16 Pixels); 20MHz]
• Parallel Operation @ 20MHz

Frame Buffer for MCA
[Figure. Labels: Bank #0, #1, #2 to #8; 128b I/O; Partial Activation Control; Partial I/O Control; GWL, /GWL, GWL Driver; SWL Driver; SWDL; PA Cont; RX; S/A and DB (x32)]
• 9-Bank with 128b I/O
• Sub-wordline with Partial Activation, Partial I/O Scheme

Spatial Locality
[Plot: Distribution of Motion Vectors for Class A/B (MVx, MVy); axes MVx and MVy, ticks at ±4, ±8 and ±16]
[Figure: macroblocks at MB Addr = N and MB Addr = N+1. Labels: Blocks to be Reconstructed, Needed, Previously Used, Commonly Used, Newly Needed, Re-usable Block]
• 70~90% are Confined in 8x8 Boundary
• Large Spatial Locality

Distributed Nine-Tiled Block Mapping : Low Power Technique (1)
[Figure: Frame Image with blocks 0 to 8 (A Block in A Row); 1-Bank layout with Row Conflicts versus 9-Banks (1-Macro), Bank #0 to Bank #8]
• Increasing Re-usability
• Minimizing Cell Core Activation in DRAM

Partial Activation Scheme : Low Power Technique (2)
[Figure: Screen and SAM Transfer; Bank #0 to Bank #3, each with a GWL drv and segments 0 to 3; Normal Operation versus DNTBM + Partial ACT; Necessary Data versus Unnecessary Data]
• Up to 31% Power Reduction Compared with 1-Bank Structure

Adaptive Fetch Control Scheme : Low Power Technique (3)
[Figure: Block-by-Block Reconstruction from blocks 1 to 4; FB Buffer; Muxing Logic; Adaptive Fetch Control; Valid Data versus Garbage Data; PE #0 to PE #7]
• No Switching in Datapath

3D Rendering Engine
[Figure. Labels: Bandwidth Equalizer; Polygon Buffer; 1-Edge Processor (Left / Right; RGB, X, Y, Z); R/G/B Unit; Z-Unit; 8-Pixel Processors PP0 to PP7 (Shading, Depth Comparison, Blending, Update); Old Pixel (RGBZ); New Pixel (RGBZ); Frame-Buffer Interface, 1280b; 20MHz]
• Calculating 8 Pixels/Cycle
• Parallel Datapath for RGB and Z

ViSTA Architecture
• Virtually Spanning 2D Array (ViSTA) Architecture
• Previous Work (ISSCC2000 TP14.7) : Parallel
  – 8 EP's
  – 64 PP's
• This Work (ViSTA) : 1/8 Scaling
  – 1 EP
  – 8 PP's
[Figure. Labels: Control; EP; PP; M; Interface; Virtually Spans 2D Array; 8-Stage Pipelined EP; Dynamic Bus Reconfiguration]

Frame Buffer for 3DRE
[Figure. Labels: From Pixel Processors; FBI; Write 256b, Read 256b; buses of 640b, 640b, 768b and 384b; Depth Buffer, Frame Buffer #0 and Frame Buffer #1, built from twelve 512kb DRAM blocks; 24b SAM; True Color; SCLK]
• 1280b x 20MHz = 3.2GB/sec
• Concurrent Data Transfer
• Interchangeable Double Color-Buffers

Single Bitline Writing Scheme : Low Power Technique (4)
[Figure: waveforms of BL (Real), /BL (Ref.), WL, BIS_0 and BIS_1 between Vcc, Vcc/2 and GND; 8Kb Cell Array with 2K Column]
• No Transitions in /BL
• Periphery & Control : 30.02 mW, 30.47 mW with Single Bitline Writing
• Data Sensing : 19.3 mW, 15.0 mW with Single Bitline Writing
• 20% Power Reduction in Data Sensing

System Power Consumption
[Bar chart: Power (mW) for Conventional Design - I (Ext. FB), Conventional Design - II (Embedded FB) and Proposed System; bar segments Logic, eDRAM Macro and Data I/O; axis ticks from 25 to 175, then 400~700 and 1000]
• By Embedding DRAM
• This Work : 160mW

Die Photograph
[Die photo. Labels: MC Frame Buffer #1, MC Frame Buffer #2, Internal DRAM, YCrCb to RGB, SAM, 3DRE Z-Buffer, 3DRE Frame Buffer, MCA, Bandwidth Equalizer, DLL, 32bit RISC, 3D Rendering Engine]
• 0.18um EML Technology with 3-poly, 6-metal
• 240pin QFP
• 84mm2 (14 x 7 Including I/O Cells)

Chip Features (Physical)

  Block                  Clock      Supply   Power     Area
  ARM9                   80MHz      1.5V     12mW      1.7mm2
  B.W. Equalizer         80/20MHz   1.5V     4mW       1.6mm2
  MC Accelerator         20MHz      1.5V     4.6mW     2.3mm2
  Frame Buffer           20MHz      2.5V     11.7mW    5.25mm2
                                             < 40mW
  3D Rendering Engine    20MHz      1.5V     36mW      5mm2
  Frame Buffer           80MHz      2.5V     84mW      16.4mm2
                                             < 140mW

• I/O Cells : 3.3V

Conclusions
• Low Power Multimedia Processor for Mobile Applications
  – Optimized H/W & S/W Mixed System
  – Multimedia Signal Processing
    • Not only 2D Image, But also 3D Graphics
  – Low Power Techniques
    • Distributed Nine-Tiled Block Mapping
    • Partial Activation, Partial I/O scheme
    • Adaptive Fetch Control Scheme
    • Single Bitline Writing Scheme
  – 160mW, 84mm2
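
Several figures quoted on the slides can be cross-checked with simple arithmetic. The Python sketch below is not part of the presentation. It recomputes the 3DRE frame-buffer bandwidth (1280b x 20MHz), the Data Sensing power reduction (19.3 mW to 15.0 mW), and two power subtotals from the Chip Features slide. The grouping of blocks into an "MPEG-4 path" and a "3D path", each including the ARM9 and the B.W. Equalizer, is an assumption made here to compare against the "< 40mW" and "< 140mW" marks; the slides do not spell that grouping out. All variable names are illustrative.

```python
# Cross-check of numbers quoted on the slides (illustrative sketch only).

# Frame Buffer for 3DRE: 1280-bit interface clocked at 20 MHz.
BUS_WIDTH_BITS = 1280
CLOCK_HZ = 20_000_000
bandwidth_gbytes_per_s = BUS_WIDTH_BITS * CLOCK_HZ / 8 / 1e9

# Single Bitline Writing: Data Sensing power before and after, in mW.
sensing_before_mw = 19.3
sensing_after_mw = 15.0
sensing_reduction = (sensing_before_mw - sensing_after_mw) / sensing_before_mw

# Per-block power from the Chip Features slide, in mW.
power_mw = {
    "ARM9": 12.0,
    "B.W. Equalizer": 4.0,
    "MC Accelerator": 4.6,
    "MC Frame Buffer": 11.7,
    "3D Rendering Engine": 36.0,
    "3D Frame Buffer": 84.0,
}

# ASSUMPTION: each path shares the ARM9 and the B.W. Equalizer.
mpeg4_path = ["ARM9", "B.W. Equalizer", "MC Accelerator", "MC Frame Buffer"]
render_path = ["ARM9", "B.W. Equalizer", "3D Rendering Engine", "3D Frame Buffer"]

mpeg4_total_mw = sum(power_mw[name] for name in mpeg4_path)
render_total_mw = sum(power_mw[name] for name in render_path)

if __name__ == "__main__":
    print(f"3DRE frame-buffer bandwidth: {bandwidth_gbytes_per_s:.1f} GB/s")
    print(f"Data Sensing reduction: {sensing_reduction:.1%}")
    print(f"MPEG-4 path total: {mpeg4_total_mw:.1f} mW")
    print(f"3D path total: {render_total_mw:.1f} mW")
```

Under that assumed grouping both subtotals fall below the limits marked on the slide, and the computed Data Sensing reduction is slightly above the rounded 20% quoted.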