An 80/20MHz 160mW Multimedia Processor Integrated with Embedded DRAM, MPEG-4 Accelerator and 3D Rendering Engine for Mobile Applications
Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae, In-Cheol Park and Hoi-Jun Yoo
Dept. of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Korea
Outline
• Introduction
• System Architecture Overview
• Low Power Block Design
– 32Bit RISC
– MPEG-4 Accelerator
– 3D Rendering Engine
– Embedded DRAM Frame Buffer
• Features of Test Chip
• Conclusions
Requirements for Future Mobile Information Terminals
• Multimedia Signal Processing
– mp3, 2D Image Processing, etc.
– 3D Graphics
• Low Power Features
– Battery-driven Products
• Low Cost Solutions
– Major Factor for Consumer Electronics
Target Specifications
[Figure: Low Power Multimedia Processor driving an LCD Display.]
• System Power
– < 200mW
• 3D Image Rendering
– > 2 Mpolygons/sec
– 256 x 256 Resolution with 24b True Color
– 16b Z-Buffering
– Alpha Blending
– Double Buffering
• MPEG-4 Video Decoding
– Simple Profile
– QCIF (176 x 144)
– 15 frames/sec
• Others
– mp3, etc.
The Proposed Solution
H/W & S/W Mixed Solution Optimized at Architecture / Circuit Level
• High Perform. CPU
– Large Area
– High Power Consumption
• Optimized Perform. CPU + Dedicated H/W + Embedded DRAM
– Distribution of Computational Load
– Small Area
– Programmability
– Circuit Level Low Power Techniques
Architecture Overview
[Figure: block diagram. Labels: ARM9 with MAC (80MHz) and Ext. I/O on a 32b bus; Clk Gen. / DLL supplying 80MHz and 20MHz; B.W. Equalizer with Dual-Port SRAM (2KB) and a 512b port; MC Accelerator with Frame Buffer, YCrCb to RGB and SAM, Color Out (24b); 3D Rendering Engine (20MHz) with Frame Buffer + Z-Buffer and SAM, Color Out (24b); further bus widths labelled 128b and 2048b.]
• On-Chip Wide Bus between Logic / eDRAM
• Data Buffering
– Fast / Narrow Data Transaction (32b @ 80MHz)
– Slow / Wide Data Transaction (512b @ 20MHz)
Multimedia Enhancement in RISC
[Figure: 5-Stage Pipeline (F, D, EX, MEM, WB) shown with overlapped instructions; Execution Units: REG File, ALU and a MUL built from 4:2 Add stages.]
Tree Structure with 4:2 Adders
• 1-Cycle 32b x 32b Multiplication
• 2-Cycle 32b x 32b Multiplication and Accumulation
• 23% Cycle Reduction Compared with Conventional ARM Architecture
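The tree of 4:2 adders named on this slide can be illustrated at word level. The sketch below is only an illustration under stated assumptions: it builds each 4:2 stage from two carry-save (3:2) steps, treats operands as unsigned, and uses hypothetical function names. It does not describe the chip's actual circuit or its cycle timing.

```python
def carry_save_add(a, b, c):
    """3:2 compression on whole words: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def compress_4_2(a, b, c, d):
    """4:2 compression modelled as two chained 3:2 steps (an assumption)."""
    s1, c1 = carry_save_add(a, b, c)
    return carry_save_add(s1, c1, d)


def tree_multiply(x, y, width=32):
    """Unsigned multiply by reducing partial products with a 4:2 tree.

    Returns (product, number_of_tree_levels).
    """
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    # One partial product per multiplier bit.
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(width)]
    levels = 0
    while len(rows) > 2:
        reduced = []
        for i in range(0, len(rows), 4):
            group = rows[i:i + 4]
            if len(group) == 4:
                reduced.extend(compress_4_2(*group))
            elif len(group) == 3:
                reduced.extend(carry_save_add(*group))
            else:
                reduced.extend(group)  # one or two rows pass through
        rows = reduced
        levels += 1
    # A single carry-propagate addition finishes the product.
    return sum(rows), levels
```

In this model, 32 partial products shrink to 16, 8, 4 and then 2 rows, so only the last step needs a full carry-propagate adder.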
Bandwidth Equalizer
[Figure: DP-SRAM (2KB) with Flow Cont. between the RISC port and the dedicated H/W port; cell with bitlines BL and BL2; sense amplifiers SEBLSA and BLSA (SAE); signals WBE, STR, CS, DDO and DB.]
• From RISC: 32b @ 80MHz
• To Dedicated H/W: 512 (32)b @ 20MHz
• DP-SRAM (2KB), 512b
• Act as A Row Cache
• Single Ended for Tight Bit Pitch
• WBE : Wide Bus Enable
• STR : Cache Store
• DDO : Direct Data Out
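The width conversion this slide describes (32b words at 80MHz on one side, 512b lines at 20MHz on the other) can be sketched as simple word packing. This is a minimal sketch, not the SRAM's real behaviour: the low-word-first ordering and all function names are assumptions made for illustration.

```python
WORD_BITS = 32
LINE_BITS = 512
WORDS_PER_LINE = LINE_BITS // WORD_BITS  # 16 narrow words per wide line
FAST_MHZ = 80  # RISC-side clock, from the slide
SLOW_MHZ = 20  # dedicated-hardware clock, from the slide


def pack_line(words):
    """Pack sixteen 32b words into one 512b line (word 0 in the low bits)."""
    if len(words) != WORDS_PER_LINE:
        raise ValueError("a line needs exactly 16 words")
    line = 0
    for i, word in enumerate(words):
        line |= (word & 0xFFFFFFFF) << (i * WORD_BITS)
    return line


def unpack_line(line):
    """Split a 512b line back into sixteen 32b words."""
    return [(line >> (i * WORD_BITS)) & 0xFFFFFFFF
            for i in range(WORDS_PER_LINE)]


def slow_cycles_to_fill_line():
    """20MHz cycles that elapse while the 80MHz side writes one full line."""
    return WORDS_PER_LINE * SLOW_MHZ // FAST_MHZ
```

Under these assumptions, sixteen back-to-back 32b writes at 80MHz take as long as four 20MHz cycles, after which the line can move across the wide port in a single slow cycle.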
Motion Compensation (MC) Accelerator
[Figure: block diagram. Labels: Frame Buffer #0 and Frame Buffer #1 (each 512b x 128row x 9bank, each with FB Cont); 512b bus; Adaptive Fetch Control; FB Buffer; MUXing Logic; Data Alignment; Half-Pel ALU #0 to #7; Pixel ALU #0 to #7; Pixel Buffer, 128b (16 Pixels).]
• Parallel Operation @ 20MHz
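The Half-Pel ALUs above are only named in the figure, so the following is a software sketch of the half-sample interpolation that MPEG-4 motion compensation uses (bilinear averaging with a rounding-control bit), not a description of the hardware. The function names and the choice to produce 8 samples per call, mirroring the eight ALUs, are assumptions.

```python
def half_pel_sample(ref, x2, y2, rounding=0):
    """Sample `ref` (2D list of ints) at (x2, y2) given in half-pel units.

    Coordinates are assumed non-negative and inside the reference area.
    """
    x, y = x2 >> 1, y2 >> 1
    half_x, half_y = x2 & 1, y2 & 1
    a = ref[y][x]
    if not half_x and not half_y:
        return a  # integer position: no interpolation
    if half_x and not half_y:
        return (a + ref[y][x + 1] + 1 - rounding) >> 1
    if half_y and not half_x:
        return (a + ref[y + 1][x] + 1 - rounding) >> 1
    # Diagonal half position: average of the four neighbours.
    return (a + ref[y][x + 1] + ref[y + 1][x] + ref[y + 1][x + 1]
            + 2 - rounding) >> 2


def predict_row(ref, x2, y2, count=8, rounding=0):
    """Interpolate `count` horizontally adjacent samples of one row."""
    return [half_pel_sample(ref, x2 + 2 * i, y2, rounding)
            for i in range(count)]
```

Each output sample depends only on its own neighbours, which is why a row of samples can be computed by independent units in parallel.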
Frame Buffer for MCA
• 9-Bank with 128b I/O (Bank #0 to #8)
• Sub-wordline with Partial Activation, Partial I/O Scheme
[Figure: bank structure. Labels: GWL Driver (GWL, /GWL); four SWL Drivers (SWDL); PA Cont (RX); four S/A x32 groups with DB; Partial Activation Control; Partial I/O Control; 128b I/O.]
Spatial Locality
• 70~90% are Confined in 8x8 Boundary
[Figure: Distribution of Motion Vectors for Class A/B (MVx, MVy); both axes run from -16 to 16 with ticks at ±4, ±8 and ±16.]
• Large Spatial Locality
[Figure: Blocks to be Reconstructed and the reference blocks Needed for MB Addr = N and MB Addr = N+1, marked Previously Used, Commonly Used (Re-usable Block) and Newly Needed.]
Distributed Nine-Tiled Block Mapping : Low Power Technique (1)
[Figure: Frame Image tiled into blocks 0 to 8; a 1-Bank layout (Row Conflicts) shown against a 9-Banks (1-Macro) layout, Bank #0 to Bank #8 (Increasing Re-usability).]
• A Block in A Row
• Minimizing Cell Core Activation in DRAM
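The slide does not give the exact block-to-bank mapping, so the sketch below assumes one plausible nine-tiled layout: 8x8 blocks are assigned to banks in a repeating 3x3 pattern. With that assumption, the up to four reference blocks overlapped by one motion-compensated 8x8 block always fall in different banks. The mapping formula and the function names are hypothetical.

```python
def bank_of_block(bx, by):
    """Assumed mapping: repeating 3x3 tile of banks 0..8 over the frame."""
    return (by % 3) * 3 + (bx % 3)


def blocks_touched(px, py, mvx, mvy, size=8):
    """Reference blocks overlapped by a size x size block at (px, py)
    displaced by an integer-pel motion vector (mvx, mvy)."""
    x0, y0 = px + mvx, py + mvy
    bxs = range(x0 // size, (x0 + size - 1) // size + 1)
    bys = range(y0 // size, (y0 + size - 1) // size + 1)
    return [(bx, by) for by in bys for bx in bxs]


def banks_needed(px, py, mvx, mvy):
    """Banks that must be accessed for one prediction block."""
    return sorted(bank_of_block(bx, by)
                  for bx, by in blocks_touched(px, py, mvx, mvy))
```

For example, `banks_needed(16, 16, 3, -2)` touches four reference blocks in four distinct banks, so no two of them compete for the same bank's open row.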
Partial Activation Scheme : Low Power Technique (2)
[Figure: Screen data held in Bank #0 to Bank #3, each row split into segments 0 to 3 with its own GWL drv and SAM Transfer. Panels: Normal Operation; DNTBM + Partial ACT (Partial Activation). Legend: Necessary Data, Unnecessary Data.]
• Up to 31% Power Reduction Compared with 1-Bank Structure
Adaptive Fetch Control Scheme : Low Power Technique (3)
[Figure: Block-by-Block Reconstruction of blocks 1 to 4; FB Buffer holding Valid Data and Garbage Data; Adaptive Fetch Control and Muxing Logic feeding PE #0 to PE #7.]
• No Switching in Datapath
3D Rendering Engine
[Figure: block diagram. Labels: Bandwidth Equalizer; Polygon Buffer; 1-Edge Processor (Left / Right, RGB X Y Z); 8-Pixel Processors PP0 to PP7, with R/G/B Unit (Shading, Blending) and Z-Unit (Depth Comparison, Update); Frame-Buffer Interface, 1280b; Old Pixel (RGBZ) in, New Pixel (RGBZ) out.]
• 20MHz
• Calculating 8 Pixels/Cycle
• Parallel Datapath for RGB and Z
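The pixel-processor steps named on this slide (depth comparison, blending, update of the RGBZ value) can be sketched in software as follows. This is a hedged illustration only: the "smaller Z is nearer" convention, the 8-bit alpha, the rounding in the blend and every name below are assumptions rather than details from the slides.

```python
from dataclasses import dataclass


@dataclass
class Pixel:
    r: int
    g: int
    b: int
    z: int


def blend_channel(new, old, alpha):
    """Alpha blend of one 8-bit channel, alpha in 0..255 (assumed format)."""
    return (new * alpha + old * (255 - alpha) + 127) // 255


def process_pixel(old, new, alpha=255):
    """Depth-compare, then blend and update; smaller z is assumed nearer."""
    if new.z >= old.z:
        return old  # hidden pixel: frame buffer value is kept
    return Pixel(
        blend_channel(new.r, old.r, alpha),
        blend_channel(new.g, old.g, alpha),
        blend_channel(new.b, old.b, alpha),
        new.z,
    )


def process_span(old_span, new_span, alpha=255):
    """Process one group of pixels; 8 per call models 8 pixels per cycle."""
    return [process_pixel(o, n, alpha) for o, n in zip(old_span, new_span)]
```

In this sketch the colour path and the Z path never depend on each other's arithmetic, only on the outcome of the depth test, which is consistent with running RGB and Z in parallel datapaths.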
ViSTA Architecture
• Virtually Spanning 2D Array (ViSTA) Architecture
• Previous Work (ISSCC2000 TP14.7)
– Parallel EP
– 8 EP's
– 64 PP's
• This Work (ViSTA)
– 1/8 Scaling
– 8-Stage Pipelined EP
– 1 EP
– 8 PP's
– Virtually Spans 2D Array
– Dynamic Bus Reconfiguration
[Figure: previous work drawn as eight EPs driving a 2D array of PP and M cells under a common Control; this work drawn as one EP, a row of PPs, and M cells behind an Interface.]
Frame Buffer for 3DRE
• From Pixel Processors: 1280b x 20MHz = 3.2GB/sec
• Concurrent Data Transfer (Read / Write)
• Interchangeable Double Color-Buffers
• True Color
[Figure: FBI with bus labels 640b, 768b, 384b and 256b; Depth Buffer, Frame Buffer #0 and Frame Buffer #1, each drawn as four 512kb DRAM blocks; 24b SAM clocked by SCLK.]
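The 3.2GB/sec figure on this slide follows directly from the bus width and clock. The short sketch below reproduces that arithmetic and applies the same formula to the two bus configurations quoted elsewhere in the deck; the function name is hypothetical and GB is taken as 10^9 bytes, which is what makes the slide's number come out exactly.

```python
def bandwidth_gbytes_per_sec(bus_bits, clock_mhz):
    """Peak bandwidth of a bus moving `bus_bits` per cycle, in 10^9 bytes/s."""
    return bus_bits * clock_mhz * 1e6 / 8 / 1e9


# Frame-buffer interface of the 3D rendering engine (from the slide).
fbi = bandwidth_gbytes_per_sec(1280, 20)

# The two bus styles quoted in the architecture overview.
fast_narrow = bandwidth_gbytes_per_sec(32, 80)
slow_wide = bandwidth_gbytes_per_sec(512, 20)
```

The same formula gives 0.32GB/sec for the 32b @ 80MHz bus and 1.28GB/sec for the 512b @ 20MHz bus, so the wide on-chip buses deliver more bandwidth despite the lower clock.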
Single Bitline Writing Scheme : Low Power Technique (4)
[Figure: waveforms of WL, BL (Real), /BL (Ref.), BIS_0 and BIS_1 between GND, Vcc/2 and Vcc; No Transitions in /BL. Power bars for an 8Kb Cell Array with 2K Column, split into Periphery & Control and Data Sensing; labelled values 30.02 mW, 19.3 mW, 30.47 mW and 15.0 mW; one bar marked Single Bitline Writing.]
• 20% Power Reduction in Data Sensing
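The reduction quoted here can be checked against the power labels in the figure, with one loud assumption: the layout of the chart was lost, so the sketch reads 19.3 mW as Data Sensing power without single bitline writing and 15.0 mW as Data Sensing power with it. That pairing is an interpretation, not something the text states.

```python
def relative_reduction(before_mw, after_mw):
    """Fractional power saving when going from `before_mw` to `after_mw`."""
    return (before_mw - after_mw) / before_mw


# Assumed pairing of the figure's Data Sensing labels (see lead-in).
SENSING_CONVENTIONAL_MW = 19.3
SENSING_SINGLE_BITLINE_MW = 15.0

sensing_saving = relative_reduction(SENSING_CONVENTIONAL_MW,
                                    SENSING_SINGLE_BITLINE_MW)
```

Under that reading the saving comes to roughly 22%, which is in line with the rounded 20% stated on the slide.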
System Power Consumption
[Figure: bar chart of Power (mW), axis ticks from 25 to 175, each bar split into Logic, eDRAM Macro and Data I/O, for Conventional Design - I (Ext. FB), Conventional Design - II (Embedded FB) and Proposed System. Labelled values: 1000, 400~700 and 160mW (This Work). Annotation: By Embedding DRAM.]
Die Photograph
[Figure: die photo with labelled blocks: 32bit RISC, Bandwidth Equalizer, DLL, MCA, MC Frame Buffer #1, MC Frame Buffer #2, Internal DRAM, YCrCb to RGB, SAM, 3D Rendering Engine, 3DRE Frame Buffer and 3DRE Z-Buffer.]
• 0.18um EML Technology with 3-poly, 6-metal
• 240pin QFP
• 84mm2 (14 x 7 Including I/O Cells)
Chip Features (Physical)
• ARM9: 80MHz, 1.5V, 12mW, 1.7mm2
• B.W. Equalizer: 80/20MHz, 1.5V, 4mW, 1.6mm2
• MC Accelerator: 20MHz, 1.5V, 4.6mW, 2.3mm2
• Frame Buffer (MC): 20MHz, 2.5V, 11.7mW, 5.25mm2
– < 40mW
• 3D Rendering Engine: 20MHz, 1.5V, 36mW, 5mm2
• Frame Buffer (3DRE): 80MHz, 2.5V, 84mW, 16.4mm2
– < 140mW
• I/O Cells : 3.3V
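The per-block power figures on this slide can be summed as a sanity check. The sketch below assumes a particular assignment of the six power values to blocks, inferred from the order of the labels and from the clock rates given earlier in the deck; that assignment, the two groupings and all names are assumptions.

```python
# Assumed block-to-value assignment (the slide's layout was lost).
BLOCK_POWER_MW = {
    "ARM9": 12.0,
    "B.W. Equalizer": 4.0,
    "MC Accelerator": 4.6,
    "MC Frame Buffer": 11.7,
    "3D Rendering Engine": 36.0,
    "3DRE Frame Buffer": 84.0,
}

# Hypothetical groupings: the blocks active for each kind of workload.
MPEG4_PATH = ["ARM9", "B.W. Equalizer", "MC Accelerator", "MC Frame Buffer"]
RENDER_PATH = ["ARM9", "B.W. Equalizer", "3D Rendering Engine",
               "3DRE Frame Buffer"]


def total_power_mw(blocks=None):
    """Sum the listed power of the named blocks (all blocks by default)."""
    names = BLOCK_POWER_MW if blocks is None else blocks
    return round(sum(BLOCK_POWER_MW[name] for name in names), 1)
```

With these assumptions the itemized blocks add up to 152.3mW, below the 160mW headline figure, and the two groupings give 32.3mW and 136.0mW, which sit under the slide's "< 40mW" and "< 140mW" labels.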
Conclusions
• Low Power Multimedia Processor for Mobile Applications
– Optimized H/W & S/W Mixed System
– Multimedia Signal Processing
  • Not only 2D Image, But also 3D Graphics
– Low Power Techniques
  • Distributed Nine-Tiled Block Mapping
  • Partial Activation, Partial I/O Scheme
  • Adaptive Fetch Control Scheme
  • Single Bitline Writing Scheme
• 160mW, 84mm2