Transcript
05 - Microarchitecture, RF and ALU Andreas Ehliar
September 15, 2015
Microarchitecture Design
- Step 1: Partition each assembly instruction into microoperations and allocate each microoperation to the corresponding hardware module.
- Step 2: Collect all microoperations allocated to a module and specify the hardware multiplexing for RTL coding of the module.
- Step 3: Fine-tune the intermodule specifications of the ASIP architecture and finalize the top-level connections and pipeline.
Hardware Multiplexing
- Reusing one hardware module for several different operations
- Example: Signed and unsigned 16-bit multiplication
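One common way to share a multiplier between signed and unsigned 16-bit operands is to extend each operand to 17 bits (sign extension or zero extension, chosen by a control bit) and use a single 17 x 17 signed multiplier. The Python model below is a behavioral sketch of that idea, not RTL; the names `extend17` and `mul16` are illustrative assumptions.

```python
def extend17(x, signed):
    """Extend a 16-bit pattern to a value that fits in 17-bit two's complement.

    signed=True copies bit 15 into bit 16 (sign extension);
    signed=False forces bit 16 to 0 (zero extension).
    """
    x &= 0xFFFF
    if signed and (x & 0x8000):
        return x - 0x10000      # negative value, representable in 17 bits
    return x                    # non-negative value, bit 16 is 0


def mul16(a, b, a_signed, b_signed):
    """One shared 17 x 17 signed multiply; returns the 32-bit product pattern."""
    product = extend17(a, a_signed) * extend17(b, b_signed)
    return product & 0xFFFFFFFF


# 0xFFFF is -1 when signed and 65535 when unsigned.
signed_result = mul16(0xFFFF, 0x0002, True, True)      # (-1) * 2
unsigned_result = mul16(0xFFFF, 0x0002, False, False)  # 65535 * 2
```

The same multiply serves both cases; only the extension step in front of it changes.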
Hardware Multiplexing
Possible functions:
1. A + C
2. A + D
3. B + C
4. B + D
5. A * C
6. A * D
7. B * C
8. B * D
9. SAT(A + C)
10. SAT(A + D)
11. SAT(B + C)
12. SAT(B + D)
13. SAT(A * C)
14. SAT(A * D)
15. SAT(B * C)
16. SAT(B * D)
[Figure: multiplexed datapath. Pre-processing: multiplexers MA and MB select opa and opb from the inputs A, B, C, and D under Control[0] and Control[1]. Kernel processing: MUL and ADD. Post-processing 1: multiplexer MP1 (Control[2]) selects result1. Post-processing 2: saturation and multiplexer MP2 (Control[3]).] [Liu2008]
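The 16 functions follow from four independent one-bit choices: which first operand, which second operand, add or multiply, and saturate or not. A minimal behavioral sketch in Python is shown below. The assignment of control bits to multiplexers and the 16-bit saturation range are assumptions made for illustration, not taken from the figure.

```python
def saturate16(x):
    """Clamp an integer to the signed 16-bit range."""
    return max(-32768, min(32767, x))


def datapath(a, b, c, d, control):
    """Behavioral model of the multiplexed datapath.

    Assumed control encoding (hypothetical, for illustration only):
      bit 0: opb select (0 -> C, 1 -> D)
      bit 1: opa select (0 -> A, 1 -> B)
      bit 2: kernel select (0 -> ADD, 1 -> MUL)
      bit 3: post-processing (0 -> bypass, 1 -> saturate)
    """
    opb = d if control & 0b0001 else c
    opa = b if control & 0b0010 else a
    result1 = opa * opb if control & 0b0100 else opa + opb
    return saturate16(result1) if control & 0b1000 else result1


# Four control bits select 2 * 2 * 2 * 2 = 16 functions.
all_results = [datapath(30000, 2, 10000, 3, ctrl) for ctrl in range(16)]
```

In hardware both kernel units would compute in parallel and a multiplexer would pick one result; the conditional expression above only models the selected value.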
Hardware multiplexing
- Hardware multiplexing can be implemented either by SW or by configuring the HW
- A processor is basically a very neat design pattern for multiplexing different HW units
- Perhaps the most important skill of a good VLSI designer
Typical design pattern for datapath modules
[Figure: the collected microoperations pass through pre-operations (I, II, ..., X, Y), kernel operations, and post-operations (I, II, ..., X, Y), producing results and verifications.] [Liu2008]
Discussion break:
- Which of these units is the most area expensive?
  - 17 x 17 bit multiplier
  - 32-bit adder/subtracter
  - 32-bit 16-to-1 mux
  - 32-bit adder
  - 8 KiB memory
Area properties (a.k.a. what to optimize)
- Relative areas of a few different components
  - 32-bit adder: 0.2 to 1 area units
  - 32-bit adder/subtracter: 0.3 to 2 area units
  - 32-bit 16-to-1 mux: 0.5 to 0.6 area units
  - 17 x 17 bit multiplier: 1.3 to 3.7 area units
  - 8 KiB memory (32 bit wide): 33 area units
- Exam tips: You are typically supposed to minimize the area of the units you design. That is, don't use more multipliers than necessary, avoid extra adders, and don't worry about small 2-to-1 multiplexers. (And don't add extra SRAM memories if you can avoid it...)
Performance properties
- Relative maximum frequencies
  - 32-bit adder: 0.1 to 1
  - 32-bit adder/subtracter: 0.1 to 0.9
  - 32-bit 16-to-1 mux: 0.31 to 0.9
  - 17 x 17 bit multiplier: 0.11 to 0.44
  - 8 KiB memory (32 bit wide): 0.53
Optimizing memory size is often the most important task
- MP3 decoder example
  - All memories in the chip together are three times the size of the DSP core itself
  - (I/O pads are also larger than the DSP core itself)
Microarchitecture design of an instruction
- A typical convolution instruction:
    conv ACRx,DM0(ARy%++),DM1(ARz++)
- Required microoperations:
  - Instruction decoding
  - Perform addressing calculation
  - Read memories
  - Perform signed multiplication
  - Add guard bits to the result of the multiplication
  - Accumulate the result
  - Set flags
  - For a combined repeat/conv instruction:
    - PC <= PC while in the loop
    - PC <= PC + 1 as the last step in the loop
    - No saturation/rounding during the iteration
    - Saturate/round after final loop iteration
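The data-path microoperations of a repeated conv instruction can be sketched as a small behavioral model. This is a minimal Python sketch under stated assumptions: 16-bit memories, a circular buffer covering addresses `bot` to `top` inclusive for the modulo-addressed operand, an accumulator with enough guard bits that it never overflows inside the loop, and saturation to 32 bits at the end. Instruction decoding, PC handling, rounding, and flags are left out, and the function names are hypothetical.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def saturate(x, bits):
    """Clamp x to the signed range of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))


def repeat_conv(dm0, dm1, ary, arz, bot, top, count, acc=0):
    """Model `count` iterations of conv ACRx,DM0(ARy%++),DM1(ARz++)."""
    for _ in range(count):
        # Read both memories at the current addresses.
        x = to_signed16(dm0[ary])
        y = to_signed16(dm1[arz])
        # Addressing: modulo post-increment and linear post-increment.
        ary = bot if ary == top else ary + 1
        arz += 1
        # Signed multiply and accumulate. The guard bits are assumed wide
        # enough, so nothing is saturated or rounded inside the loop.
        acc += x * y
    # Saturate once, after the final iteration (assumed 32-bit result).
    return saturate(acc, 32), ary, arz


result, next_ary, next_arz = repeat_conv(
    [1, 2, 3, 4], [10, 20, 30, 40], ary=2, arz=0, bot=0, top=3, count=4)
```

Starting at address 2 of the circular buffer, the modulo pointer wraps from 3 back to 0 while the linear pointer simply counts up.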
The register file (RF)
- While preparing to execute a subroutine, the RF gets data from the data memories through load instructions.
- While running a subroutine, the register file is used as a computing buffer.
- After running the subroutine, the results in the RF are stored into the data memories through store instructions.
General register file
- Connected to almost all parts of the core
[Figure: block diagram of the core showing the register file (RF), the execution unit (ALU/MAC), the data memory (DM), the AGU, the instruction decoder, program flow control, the PC FSM, and the program memory (PM). Signal labels: results and flags, results, MEM ctrl, operation ctrl, operand & result control, configuration and status, program address, instruction, and target addresses and configuration vectors from the register file.] [Liu2008]
Register file schematic
[Figure: a write circuit (ctrl_reg_in) selects the input to registers 1 to n from the register file, memory 1, memory 2, the ALU, the ports, ..., or the MAC. A read circuit selects the operands OPA and OPB (ctrl_o_a, ctrl_o_b). A store circuit is also shown.] [Liu2008]
Register file speed
- Almost (but not quite) the same speed as a very fast 32-bit adder (in this particular technology)
- Also note that it is possible to use special register file memories (but at an increased verification cost)
Read before write or write before read
- A processor architect has to decide how the register file should work when reading and writing the same register
- Read before write
  - The old value is read
- Write before read
  - The new value is read (more costly in terms of the timing budget)
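The two policies differ only in what the read port returns when the read and write addresses match in the same cycle. The Python class below is a behavioral sketch with one read port and one write port; the class and parameter names are assumptions. Write before read is modeled as a bypass from the write data to the read port, which is one possible way to implement it and adds logic to the read path.

```python
class RegisterFile:
    """Behavioral model of a register file with one write and one read port."""

    def __init__(self, entries=32, write_before_read=False):
        self.regs = [0] * entries
        self.write_before_read = write_before_read

    def cycle(self, raddr, we, waddr, wdata):
        """Simulate one clock cycle; return the value seen by the reader."""
        if self.write_before_read and we and raddr == waddr:
            # Bypass: the value being written is forwarded to the read port.
            value = wdata & 0xFFFF
        else:
            # The reader sees the value stored before this clock edge.
            value = self.regs[raddr]
        if we:
            self.regs[waddr] = wdata & 0xFFFF
        return value


rbw = RegisterFile()                          # read before write
wbr = RegisterFile(write_before_read=True)    # write before read
old_value = rbw.cycle(raddr=3, we=True, waddr=3, wdata=0x1234)
new_value = wbr.cycle(raddr=3, we=True, waddr=3, wdata=0x1234)
```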
Physical design: fan-out problem
- Fan-out of the control signal for the first stage: 16*16*2 = 512
- Fan-out of the control signal for the second stage: 16*8*2 = 256
- Fan-out of the control signal for the third stage: 16*4*2 = 128
- Fan-out of the control signal for the fourth stage: 16*2*2 = 64
- Fan-out of the control signal for the fifth stage: 16*1*2 = 32
[Figure: multiplexer tree in five stages, from the 32 registers in a register file down to the selected operand.] [Liu2008]
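The per-stage numbers follow a simple pattern that can be reproduced in a few lines of Python. The sketch below assumes one reading of the formula: a binary tree of 2-to-1 multiplexers over 32 registers, a 16-bit data width, and the constant factor of 2 taken as given (its meaning is not spelled out in the text). The function name is hypothetical.

```python
def mux_tree_fanout(registers=32, width=16, factor=2):
    """Fan-out of each select signal in a binary multiplexer tree.

    Assumed reading of the formula: `width` data bits, one 2-to-1
    multiplexer per pair of inputs in a stage, and a constant factor
    of 2 whose meaning is not spelled out in the text.
    """
    fanouts = []
    inputs = registers
    while inputs > 1:
        muxes = inputs // 2          # 2-to-1 multiplexers in this stage
        fanouts.append(width * muxes * factor)
        inputs = muxes               # each multiplexer feeds the next stage
    return fanouts


stage_fanouts = mux_tree_fanout()
```

The select signal closest to the registers drives the most loads, so that is where buffering matters most.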
Register File in Verilog
reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries

// Synchronous write port
always @(posedge clk) begin
   if (we)
      rf[waddr] <= wdata;
end

// Combinational read ports for the two operands
always @* begin
   op_a = rf[opaddr_a];
   op_b = rf[opaddr_b];
end
Special Purpose Registers
- Sometimes we need special purpose registers (SPR or SR)
  - BOT/TOP for modulo addressing
  - AR for address register
  - SP
  - I/O
  - Core configuration registers
  - etc
- Should these be included in the general purpose register file?
Special Purpose Registers as normal registers
- Convenient for the programmer. Special purpose registers can be accessed like any normal register.
  - Example: add bot0,1 ; Move ringbuffer bottom one word
  - Example 2 (from ARM): pop pc
- Drawbacks:
  - Wastes entries in the general purpose register file
  - Harder to use specialized register file memories
Special purpose registers need special instructions
- Special instructions required to access SRs
- Example:
    move r0,bot0 ; Move ringbuffer bottom one word
    (nop)        ; May need nop(s) here
    add r0,1
    (nop)        ; May need nop(s) here
    move bot0,r0
  (Move is encoded as move from/to special purpose register here)
- Advantage:
  - Easier to meet timing, as special purpose registers can more easily be located anywhere in the core
  - Can scale easily to hundreds of special purpose registers if required. (Common on large and complex processors such as ARM/x86)
- Drawback:
  - Inconvenient for special registers you need to access all the time
Conclusions: SPRs
- Only implement an SPR as a normal register if you believe it will be read/written via normal instructions very often
ALU in general
- ALU: Arithmetic and Logic Unit
  - Arithmetic, Logic, Shift/rotate, others
  - No guard bits for iterative computing
  - One guard bit for single step computing
  - Get operands from and send result to RF
  - Handles single precision computing
Separate ALU or ALU in MAC
[Figure: two alternatives, (a) and (b), each drawn between register files. One shows a multiplier, an ALU, and an accumulator as separate blocks; the other shows a multiplier and a combined ALU and accumulator. A block labeled DTU also appears.] [Liu2008]
ALU high level schematic
[Figure: operands A[15:0] and B[15:0] enter a block for masking, guard, carry-in, and other preprocessing. The schematic also shows a shift unit and a logic unit, followed by saturation and flag processing (flags FA/FC, FS, FZ) that produces Result[15:0].] [Liu2008]
Pre-processing
- Select operands: from one of the sources
  - Register file, control path, HW constant
- Typical operand pre-processing:
  - Guard: one guard (does not support iterative computing)
  - Invert: Conditional/non-conditional invert
  - Supply constant 0, 1, -1
  - Mask operand(s)
  - Select proper carry input
Post-processing
- Select result from multiple components
  - From AU, logic unit, shift unit, and others
- Saturation operation
  - Decide to generate carry-out flag or saturation
  - Perform saturation on result if required
- Flag operation
  - Flag computing and prediction
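With one guard bit, the saturation decision reduces to comparing the guard bit with the sign bit of the 16-bit result. The Python functions below are a behavioral sketch of that post-processing step, assuming a 16-bit data width and a 17-bit adder result; the function names are illustrative.

```python
def saturate_guarded(result17):
    """Saturate a 17-bit two's-complement result (16 bits + 1 guard bit).

    Returns (result16, sat_flag). Overflow occurred when the guard bit
    (bit 16) differs from the sign bit of the 16-bit result (bit 15).
    """
    result17 &= 0x1FFFF
    guard = (result17 >> 16) & 1
    sign = (result17 >> 15) & 1
    if guard == sign:
        return result17 & 0xFFFF, False      # fits in 16 bits
    if guard == 0:
        return 0x7FFF, True                  # positive overflow
    return 0x8000, True                      # negative overflow


def add_sat16(a, b):
    """Add two 16-bit patterns using one guard bit, then saturate."""
    def sext17(x):
        x &= 0xFFFF
        return x | 0x10000 if x & 0x8000 else x
    return saturate_guarded(sext17(a) + sext17(b))


positive_overflow = add_sat16(0x7FFF, 0x0001)   # 32767 + 1
negative_overflow = add_sat16(0x8000, 0xFFFF)   # -32768 + (-1)
```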
General instructions
Mnemonic  Operation    opa   opb   Carry in  Carry out
ADD       Addition     +     +     0         Cout/SAT
SUB       Subtraction  +     -     1         Cout/SAT
ABS       Absolute     +/-         A[15]     SAT
CMP       Compare      +     -     1         SAT
NEG       Negate       -           1         SAT
INC       Increment    +     1     0         SAT
DEC       Decrement    +     -1    0         SAT
AVG       Average      +     +     0         SAT
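Most of these instructions can share one adder if the pre-processing step conditionally inverts an operand, supplies a constant, and selects the carry-in, using the identity -X = ~X + 1. The Python model below sketches this for a 16-bit ALU with one guard bit. The function names and the exact operand choices for ABS and NEG (a zero second operand) are assumptions; CMP uses the same setup as SUB, and AVG is left out.

```python
MASK17 = 0x1FFFF  # 16 data bits plus one guard bit


def sext17(x):
    """Sign-extend a 16-bit pattern to 17 bits (adds the guard bit)."""
    x &= 0xFFFF
    return x | 0x10000 if x & 0x8000 else x


def invert17(x):
    """Bitwise inversion within 17 bits."""
    return ~x & MASK17


def alu_add(op, a, b=0):
    """Map an operation onto one adder by choosing opa, opb and carry-in."""
    a17, b17 = sext17(a), sext17(b)
    a_sign = (a & 0xFFFF) >> 15                 # A[15]
    opa, opb, cin = {
        "ADD": (a17, b17, 0),
        "SUB": (a17, invert17(b17), 1),         # A + ~B + 1 = A - B
        "ABS": (invert17(a17) if a_sign else a17, 0, a_sign),
        "NEG": (invert17(a17), 0, 1),           # ~A + 1 = -A
        "INC": (a17, 1, 0),
        "DEC": (a17, MASK17, 0),                # MASK17 is -1 in 17 bits
    }[op]
    return (opa + opb + cin) & MASK17           # 17-bit result before saturation


difference = alu_add("SUB", 5, 3)
absolute = alu_add("ABS", 0xFFFB)   # 0xFFFB is -5 as a 16-bit value
```

The result is still 17 bits wide here; saturation and flag generation belong to the post-processing step.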
Special Instructions
Mnemonic  Description                               Operation
MAX       Select larger value                       RF <= max(OpA,OpB)
MIN       Select smaller value                      RF <= min(OpA,OpB)
DTA       Difference of two absolute values         GR <= |OpA| − |OpB|
ADT       Absolute of the difference of two values  GR <= |OpA−OpB|
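The four special instructions can be written down as a short reference model. The Python sketch below treats both operands as signed 16-bit values and returns plain integers without saturation; both choices, and the function names, are assumptions for illustration.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def special(op, opa, opb):
    """Reference result of MAX, MIN, DTA or ADT as an unbounded integer."""
    a, b = to_signed16(opa), to_signed16(opb)
    operations = {
        "MAX": max(a, b),
        "MIN": min(a, b),
        "DTA": abs(a) - abs(b),      # difference of two absolute values
        "ADT": abs(a - b),           # absolute of the difference
    }
    return operations[op]


widest_adt = special("ADT", 0x8000, 0x7FFF)   # |-32768 - 32767|
```

The last value does not fit in a signed 16-bit register, which is why such operations need guard bits and saturation in the datapath.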
Adder with carry in for RTL synthesis (safe solution)
FAO[17:0] = {A[15], A[15:0], "1"} + {B[15], B[15:0], CIN}   (18b full adder)
Result[16:0] <= FAO[17:1]
[Liu2008]
- Full adder may have no carry in
- One guard bit
- We need 2 extra bits in the adder
- LSB of the 18b result will not be used
- MSB of the 18b result will be the guard
- Works on all synthesis tools
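The trick works because appending a constant 1 to one operand and CIN to the other makes the lowest bit position generate a carry into bit 1 exactly when CIN is 1. The Python functions below model the 18-bit adder bit by bit and compare it with a plain sign-extended 17-bit addition; the function names are illustrative assumptions.

```python
def safe_add(a, b, cin):
    """Model of the 18-bit adder that folds the carry-in into an extra LSB."""
    a &= 0xFFFF
    b &= 0xFFFF
    a15, b15 = a >> 15, b >> 15
    x = (a15 << 17) | (a << 1) | 1            # {A[15], A[15:0], "1"}
    y = (b15 << 17) | (b << 1) | (cin & 1)    # {B[15], B[15:0], CIN}
    fao = (x + y) & 0x3FFFF                   # FAO[17:0]
    return fao >> 1                           # Result[16:0] = FAO[17:1]


def reference_add(a, b, cin):
    """17-bit two's-complement sum of sign-extended A, B and the carry-in."""
    def sext17(v):
        v &= 0xFFFF
        return v | 0x10000 if v & 0x8000 else v
    return (sext17(a) + sext17(b) + (cin & 1)) & 0x1FFFF


corner_cases = (0x0000, 0x0001, 0x7FFF, 0x8000, 0xFFFF)
mismatches = [(a, b, cin)
              for a in corner_cases for b in corner_cases for cin in (0, 1)
              if safe_add(a, b, cin) != reference_add(a, b, cin)]
```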
Adder for RTL synthesis (modern version)
- {Cout,R[15:0]} = {1'b0,A[15:0]} + {1'b0,B[15:0]} + Cin;
- Cout is 1 bit wide
- Important: Cin is 1 bit wide!
- Modern synthesis tools can usually handle this case without creating two adders
  - (I've had to resort to the "safe" version shown on the previous slide in a few cases though. For example when combining an adder with other logic in an FPGA.)
Example: Implement an 8 bit ALU
OP  Instruction     Function
0   NOP             No change of flags
1   A+B             A + B (without saturation)
2   A-B             A − B (without saturation)
3   SAT(A+B)        A + B (with saturation)
4   SAT(A-B)        A − B (with saturation)
5   SAT(ABS(A))     |A| (absolute operation, saturation)
6   SAT(ABS(A+B))   |A + B| (absolute operation, saturation)
7   SAT(ABS(A-B))   |A − B| (absolute operation, saturation)
8   CLR S           Clear S flag (other flags unchanged)
- There should be a negative, zero, and saturation flag!
- Discussion topic: How many adders are needed for each operation?
- Discussion topic: How many guard bits are needed for each operation?
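Before designing the hardware, it can help to have a reference model of the nine operations to test against. The Python sketch below is one possible model under assumptions that the exercise text leaves open: operands are signed 8-bit values, unsaturated results wrap around, NOP and CLR S produce no result, the S flag is sticky and is only set by saturating operations, and N and Z follow the result. It uses unbounded Python integers, so it says nothing about how many adders or guard bits the hardware needs.

```python
def sat8(x):
    """Clamp to the signed 8-bit range; also report whether clamping occurred."""
    clamped = max(-128, min(127, x))
    return clamped, clamped != x


def wrap8(x):
    """Keep the low 8 bits and reinterpret them as a signed value."""
    x &= 0xFF
    return x - 0x100 if x & 0x80 else x


def alu8(op, a, b, flags):
    """Return (result, flags) for signed 8-bit a and b; flags has keys N, Z, S."""
    flags = dict(flags)
    if op == 0:                       # NOP: nothing changes
        return None, flags
    if op == 8:                       # CLR S: only the S flag changes
        flags["S"] = False
        return None, flags
    value = {1: a + b, 2: a - b, 3: a + b, 4: a - b,
             5: abs(a), 6: abs(a + b), 7: abs(a - b)}[op]
    if op in (1, 2):
        result = wrap8(value)         # no saturation: result wraps around
    else:
        result, saturated = sat8(value)
        flags["S"] = flags["S"] or saturated   # assumed sticky
    flags["N"] = result < 0
    flags["Z"] = result == 0
    return result, flags


clear = {"N": False, "Z": False, "S": False}
wrapped, wrapped_flags = alu8(1, 100, 100, clear)
clamped, clamped_flags = alu8(3, 100, 100, clear)
```

Intermediate values such as A + B (from -256 to 254) and |A − B| (up to 255) exceed the 8-bit signed range, which is the starting point for the guard bit discussion.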