Microarchitecture, RF, and ALU


05 - Microarchitecture, RF and ALU
Andreas Ehliar
September 15, 2015

Microarchitecture Design

- Step 1: Partition each assembly instruction into microoperations, allocate each microoperation into corresponding hardware modules.
- Step 2: Collect all microoperations allocated in a module and specify hardware multiplexing for RTL coding of the module
- Step 3: Fine-tune intermodule specifications of the ASIP architecture and finalize the top-level connections and pipeline.

Hardware Multiplexing

- Reusing one hardware module for several different operations
- Example: Signed and unsigned 16-bit multiplication

Hardware Multiplexing

[Figure, from [Liu2008]: Pre-processing: multiplexers MA and MB, steered by Control[0] and Control[1], select opa and opb from the inputs A, B, C and D. Kernel processing: MUL and ADD. Post-processing 1: multiplexer MP1, steered by Control[2], selects result1. Post-processing 2: saturation, with multiplexer MP2 steered by Control[3].]

Possible functions:

 1. A + C
 2. A + D
 3. B + C
 4. B + D
 5. A * C
 6. A * D
 7. B * C
 8. B * D
 9. SAT(A + C)
10. SAT(A + D)
11. SAT(B + C)
12. SAT(B + D)
13. SAT(A * C)
14. SAT(A * D)
15. SAT(B * C)
16. SAT(B * D)

Hardware multiplexing

- Hardware multiplexing can be implemented either by SW or by configuring the HW
- A processor is basically a very neat design pattern for multiplexing different HW units
- Perhaps the most important skill of a good VLSI designer

Typical design pattern for datapath modules

[Figure, from [Liu2008]: collected micro operations are mapped onto pre-operations (I, II, ..., X, Y), kernel operations, and post-operations (I, II, ..., X, Y), followed by results and verifications.]

Discussion break:

- What is most area expensive of these units?
    - 17 x 17 bit multiplier
    - 32-bit adder/subtracter
    - 32-bit 16-to-1 mux
    - 32-bit adder
    - 8 KiB memory

Area properties (a.k.a. what to optimize)

- Relative areas of a few different components
    - 32-bit adder: 0.2 to 1 area units
    - 32-bit adder/subtracter: 0.3 to 2 area units
    - 32-bit 16-to-1 mux: 0.5 to 0.6 area units
    - 17 x 17 bit multiplier: 1.3 to 3.7 area units
    - 8 KiB memory (32 bit wide): 33 area units

Exam tips: You are typically supposed to minimize the area of the units you design. That is, don't use more multipliers than necessary, avoid extra adders, and don't worry about small 2-to-1 multiplexers. (And don't add extra SRAM memories if you can avoid it...)

Performance properties

- Relative maximum frequencies
    - 32-bit adder: 0.1 to 1
    - 32-bit adder/subtracter: 0.1 to 0.9
    - 32-bit 16-to-1 mux: 0.31 to 0.9
    - 17 x 17 bit multiplier: 0.11 to 0.44
    - 8 KiB memory (32 bit wide): 0.53

Optimizing memory size is often the most important task

- MP3 decoder example
    - The memories in the chip are 3 times the size of the DSP core itself
    - (I/O pads are also larger than the DSP core itself)

Microarchitecture design of an instruction

- A typical convolution instruction:
    - conv ACRx,DM0(ARy%++),DM1(ARz++)
- Required microoperations:
    - Instruction decoding
    - Perform addressing calculation
    - Read memories
    - Perform signed multiplication
    - Add guard bits to the result of the multiplication
    - Accumulate the result
    - Set flags
- For a combined repeat/conv instruction:
    - PC <= PC while in the loop
    - PC <= PC + 1 as the last step in the loop
    - No saturation/rounding during the iteration
    - Saturate/round after final loop iteration

The register file (RF)

- The RF gets data from data memories by running load
  instructions while preparing for the execution of a subroutine.
- While running a subroutine, the register file is used as computing buffers.
- After running the subroutine, results in the RF will be stored into data memories by running store instructions.

General register file

[Figure, from [Liu2008]: the register file (RF) inside the core, together with the execution unit (ALU/MAC), DM, AGU, instruction decoder, PC FSM and PM. Labelled signals: results, results and flags, MEM ctrl, operation ctrl, operand & result control, configuration and status, program address, instruction, program flow control, and target addresses and configuration vectors from the register file.]

- Connected to almost all parts of the core

Register file schematic

[Figure, from [Liu2008]: a write circuit with inputs from the register file, memory 1, memory 2, the ALU, ports, ..., and the MAC; a store circuit holding register 1 to register n; a read circuit producing OPA and OPB. Control signals: ctrl_reg_in, ctrl_o_a, ctrl_o_b.]

Register file speed

- Almost (but not quite) the same speed as a very fast 32-bit adder (in this particular technology)
- Also note that it is possible to use special register file memories (but at an increased verification cost)

Read before write or write before read

- A processor architect has to decide how the register file should work when reading and writing the same register
- Read before write
    - The old value is read
- Write before read
    - The new value is read (more costly in terms of the timing budget)

Physical design: fan-out problem

[Figure: operand selection from 32 registers in a register file, in five stages.]

Fan-out of the control signal:

- For the first stage: 16*16*2 = 512
- For the second stage: 16*8*2 = 256
- For the third stage: 16*4*2 = 128
- For the fourth stage: 16*2*2 = 64
- For the fifth stage: 16*1*2
  = 32
- Output: selected operand [Liu2008]

Register File in Verilog

    reg [15:0] rf[31:0]; // 16 bit wide RF with 32 entries

    // Synchronous write when write enable is set
    always @(posedge clk) begin
        if (we) rf[waddr] <= wdata;
    end

    // Combinational read of both operands
    always @* begin
        op_a = rf[opaddr_a];
        op_b = rf[opaddr_b];
    end

Special Purpose Registers

- Sometimes we need special purpose registers (SPR or SR)
    - BOT/TOP for modulo addressing
    - AR for address register
    - SP
    - I/O
    - Core configuration registers
    - etc
- Should these be included in the general purpose register file?

Special Purpose Registers as normal registers

- Convenient for the programmer. Special purpose registers can be accessed like any normal register.
    - Example: add bot0,1 ; Move ringbuffer bottom one word
    - Example 2 (from ARM): pop pc
- Drawbacks:
    - Wastes entries in the general purpose register file
    - Harder to use specialized register file memories

Special purpose registers need special instructions

- Special instructions required to access SRs
- Example:

    move r0,bot0 ; Move ringbuffer bottom one word
    (nop)        ; May need nop(s) here
    add r0,1
    (nop)        ; May need nop(s) here
    move bot0,r0

  (Move is encoded as a move from/to a special purpose register here)
- Advantage:
    - Easier to meet timing as special purpose registers can more easily be located anywhere in the core
    - Can scale easily to hundreds of special purpose registers if required.
      (Common on large and complex processors such as ARM/x86)
- Drawback:
    - Inconvenient for special registers you need to access all the time

Conclusions: SPRs

- Only place an SPR in the normal register file if you believe it will be read/written via normal instructions very often

ALU in general

- ALU: Arithmetic and Logic Unit
    - Arithmetic, Logic, Shift/rotate, others
    - No guard bits for iterative computing
    - One guard bit for single step computing
    - Get operands from and send result to RF
    - Handles single precision computing

Separate ALU or ALU in MAC

[Figure, from [Liu2008]: (a) register file, multiplier, ALU and accumulator as separate units; (b) register file, multiplier, and a combined ALU and accumulator.]

ALU high level schematic

[Figure, from [Liu2008]: inputs A[15:0] and B[15:0]; masker, guard, carry-in, and other preprocessing; shift unit; logic unit; saturation and flag processing; outputs FA/FC, FS, FZ and Result[15:0].]

Pre-processing

- Select operands: from one of the sources
    - Register file, control path, HW constant
- Typical operand pre-processing:
    - Guard: one guard bit (does not support iterative computing)
    - Invert: Conditional/non-conditional invert
    - Supply constant 0, 1, -1
    - Mask operand(s)
    - Select proper carry input

Post-processing

- Select result from multiple components
    - From AU, logic unit, shift unit, and others
- Saturation operation
    - Decide to generate carry-out flag or saturation
    - Perform saturation on result if required
- Flag operation
    - Flag computing and prediction

General instructions

    Mnemonic  Operation    opa   opb   Carry in  Carry out
    ADD       Addition     +     +     0         Cout/SAT
    SUB       Subtraction  +     -     1         Cout/SAT
    ABS       Absolute     +/-         A[15]     SAT
    CMP       Compare      +     -     1         SAT
    NEG       Negate       -           1         SAT
    INC       Increment    +     1     0         SAT
    DEC       Decrement    +     -1    0         SAT
    AVG       Average      +     +     0         SAT

Special Instructions

    Mnemonic  Description                               Operation
    MAX       Select larger value                       RF <= max(OpA,OpB)
    MIN       Select smaller value                      RF <= min(OpA,OpB)
    DTA       Difference of two absolute values         GR <= |OpA| − |OpB|
    ADT       Absolute of the difference of two values  GR <= |OpA−OpB|

Adder with carry in for RTL synthesis (safe solution)

    {A[15], A[15:0], "1"} + {B[15], B[15:0], CIN}  ->  18b full adder  ->  FAO[17:0]
    Result[16:0] <= FAO[17:1]                                            [Liu2008]

- Full adder may have no carry in
- One guard bit
- We need 2 extra bits in the adder
- LSB of the 18b result will not be used
- MSB of the 18b result will be the guard
- Works on all synthesis tools

Adder for RTL synthesis (modern version)

- {Cout,R[15:0]}={1'b0,A[15:0]}+{1'b0,B[15:0]}+Cin;
- Cout is 1 bit wide
- Important: Cin is 1 bit wide!
- Modern synthesis tools can usually handle this case without creating two adders
    - (I've had to resort to the "safe" version shown on the previous slide in a few cases though. For example when combining an adder with other logic in an FPGA.)

Example: Implement an 8 bit ALU

    OP  Instruction     Function
    0   NOP             No change of flags
    1   A+B             A + B (without saturation)
    2   A-B             A − B (without saturation)
    3   SAT(A+B)        A + B (with saturation)
    4   SAT(A-B)        A − B (with saturation)
    5   SAT(ABS(A))     |A| (absolute operation, saturation)
    6   SAT(ABS(A+B))   |A + B| (absolute operation, saturation)
    7   SAT(ABS(A-B))   |A − B| (absolute operation, saturation)
    8   CLR S           Clear S flag (other flags unchanged)

- There should be a negative, zero, and saturation flag!
- Discussion topic: How many adders are needed for each operation?
- Discussion topic: How many guard bits are needed for each operation?
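
As a starting point for the 8 bit ALU example, here is a minimal behavioural sketch in Verilog. It is not the slide author's solution and is not area-optimized: it is written for readability and lets the synthesis tool infer several adders. The module name, port names, the synchronous reset, the 10-bit internal width, and the policy that the saturation flag is sticky until CLR S are all assumptions made for illustration.

```verilog
// Behavioural sketch of the 8 bit ALU exercise (hypothetical names throughout).
// Assumption: flags and result are registered; S is sticky until CLR S.
module alu8_sketch (
    input  wire              clk,
    input  wire              rst,     // synchronous reset (assumed)
    input  wire        [3:0] op,      // opcode 0..8 from the table
    input  wire signed [7:0] a,
    input  wire signed [7:0] b,
    output reg         [7:0] result,
    output reg               flag_n,  // negative
    output reg               flag_z,  // zero
    output reg               flag_s   // saturation
);
    // 10-bit signed intermediates are wide enough for every case here:
    // A+B and A-B lie in [-256, 255], and |A+B| can reach 256.
    wire signed [9:0] ax = a;         // sign-extended operands
    wire signed [9:0] bx = b;

    wire is_sub = (op == 4'd2) || (op == 4'd4) || (op == 4'd7);
    wire is_abs = (op >= 4'd5) && (op <= 4'd7);
    wire is_sat = (op >= 4'd3) && (op <= 4'd7);

    // op 5 uses A alone; the other arithmetic ops use A+B or A-B.
    wire signed [9:0] sum = (op == 4'd5) ? ax : (is_sub ? ax - bx : ax + bx);
    wire signed [9:0] mag = (sum < 0) ? -sum : sum;   // absolute value
    wire signed [9:0] val = is_abs ? mag : sum;

    // Out of range for an 8-bit signed result?
    wire ovf = (val > 10'sd127) || (val < -10'sd128);

    // Saturating ops clip to +127 / -128; the others wrap around.
    wire [7:0] res = (is_sat && ovf) ? ((val > 10'sd127) ? 8'h7F : 8'h80)
                                     : val[7:0];

    always @(posedge clk) begin
        if (rst) begin
            result <= 8'd0;
            flag_n <= 1'b0;
            flag_z <= 1'b0;
            flag_s <= 1'b0;
        end else if (op >= 4'd1 && op <= 4'd7) begin
            result <= res;
            flag_n <= res[7];
            flag_z <= (res == 8'd0);
            if (is_sat && ovf) flag_s <= 1'b1;   // sticky (assumed policy)
        end else if (op == 4'd8) begin
            flag_s <= 1'b0;                      // CLR S: other flags unchanged
        end
        // op 0 (NOP) and undefined opcodes leave result and flags unchanged
    end
endmodule
```

This version deliberately leaves the two discussion topics open. Answering them would mean replacing the separate `ax - bx`, `ax + bx` and `-sum` expressions with shared adder hardware, using conditional operand inversion and carry-in selection as on the pre-processing slide, and then checking how narrow the intermediate width can be made.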
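
The two adder slides can be compared side by side with a small sketch. The module and port names below are hypothetical. Note one deliberate difference from the "modern version" slide: that slide zero-extends the operands to obtain a carry out, whereas this sketch sign-extends in both forms so that the two results can be compared bit for bit.

```verilog
// Sketch: carry-in handling for a 16-bit adder with one guard bit.
// Names are illustrative assumptions, not taken from the slides.
module adder_cin_sketch (
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire        cin,
    output wire [16:0] result_safe,    // guard bit + 16-bit sum
    output wire [16:0] result_direct   // same value, carry-in added directly
);
    // "Safe" form: append a constant 1 to A and cin to B as an extra LSB.
    // The extra position produces a carry into bit 1 exactly when cin is 1,
    // so the adder itself needs no carry-in port.
    wire [17:0] fao = {a[15], a, 1'b1} + {b[15], b, cin};
    assign result_safe = fao[17:1];    // LSB of the 18b result is not used

    // Direct form: rely on the synthesis tool to merge the 1-bit cin
    // into a single adder.
    assign result_direct = {a[15], a} + {b[15], b} + cin;
endmodule
```

In simulation, `result_safe` and `result_direct` should be identical for all inputs, since bits [17:1] of `2*(X+Y) + 1 + cin` equal `X + Y + cin` modulo 2^17. Whether the direct form maps to a single adder depends on the tool, which is the point made on the slide.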
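
The read before write versus write before read choice from the register file slides can be sketched by extending the register file fragment shown earlier. With a synchronous write and a combinational read, a read of the register being written in the same cycle returns the old value. Forwarding the write data gives write before read. The module name and the `WRITE_BEFORE_READ` parameter are assumptions introduced for this example.

```verilog
// Sketch: 16 bit wide register file with 32 entries and an optional bypass.
// WRITE_BEFORE_READ is a hypothetical parameter, not from the slides.
module rf_bypass_sketch (
    input  wire        clk,
    input  wire        we,         // write enable
    input  wire [4:0]  waddr,
    input  wire [15:0] wdata,
    input  wire [4:0]  opaddr_a,
    input  wire [4:0]  opaddr_b,
    output wire [15:0] op_a,
    output wire [15:0] op_b
);
    parameter WRITE_BEFORE_READ = 1;   // 0: read before write (old value)

    reg [15:0] rf [31:0];

    // Synchronous write
    always @(posedge clk) begin
        if (we) rf[waddr] <= wdata;
    end

    // Bypass: forward wdata when the register being written is also read.
    wire hit_a = WRITE_BEFORE_READ && we && (waddr == opaddr_a);
    wire hit_b = WRITE_BEFORE_READ && we && (waddr == opaddr_b);

    assign op_a = hit_a ? wdata : rf[opaddr_a];
    assign op_b = hit_b ? wdata : rf[opaddr_b];
endmodule
```

The address comparators and the extra multiplexer sit in the read path after `wdata` becomes valid, which illustrates why the slides describe write before read as more costly in terms of the timing budget.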