Transcript
Custom computing systems • difference engine: Charles Babbage 1832 - compute maths tables
• digital orrery: MIT 1985 - special-purpose engine, found Pluto's motion chaotic
• Splash2: Supercomputing Research Center 1993 - multi-FPGA engine, for video processing, DNA computing etc
• Harp1: Oxford University 1995 - FPGA + microprocessor (transputer)
• SONIC, UltraSonic: Sony + Imperial College 1999-2002 - multi-FPGA, professional video processing
• MaxWorkstation, MaxNode: 2011 - 1-4 FPGA cards, up to 192GB memory, used by JP Morgan… wl 2017 2.1
The Exaflop Supercomputer (2018) • 1 exaflop = 10^18 FLOPS • using processor cores with 8 FLOPS/clock at 2.5GHz • 50M CPU cores
• what about power? - assume power envelope of 100W per chip - Moore’s Law scaling: 6 cores/chip today → ~100 cores/chip - 500k CPU chips
• 50MW (just for CPUs!), 100MW likely in total • ‘Jaguar’ power consumption: 6MW source: Maxeler
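A quick back-of-envelope check of these figures, using only the numbers stated above:
\[
\text{cores} = \frac{10^{18}\,\text{FLOPS}}{8\,\text{FLOPS/clock}\times 2.5\times10^{9}\,\text{clock/s}} = 5\times10^{7}\;(\text{50M}),\qquad
\text{chips} = \frac{5\times10^{7}\,\text{cores}}{100\,\text{cores/chip}} = 5\times10^{5}
\]
\[
\text{CPU power} \approx 5\times10^{5}\,\text{chips}\times 100\,\text{W/chip} = 50\,\text{MW}
\]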
wl 2017 2.2
The Exaflop Supercomputer (2018), continued • same figures as the previous slide, with two questions added: How do we program this? Who pays for this?
wl 2017 2.3
Technology comparison
[Chart comparing technologies; legend: DSP = Digital Signal Processor, Dedicated HW = ASIC/FPGA]
wl 2017 2.4
Intel 6-Core X5680 “Westmere”
[Annotated die photo: each core contains execution units, L1 data cache, branch prediction, instruction fetch & L1 instruction cache, instruction decode and microcode, out-of-order scheduling & retirement, memory ordering and execution, paging, and L2 cache & interrupt servicing; the uncore, shared by the six cores, contains the shared L3 cache, the memory controller, and the I/O and QPI links.]
wl 2017 2.5
A special purpose computer • a chip customised for a specific application • no instructions → no instruction decode logic • no branches → no branch prediction • explicit parallelism → no out-of-order scheduling • data streamed on-chip → no multi-level caches
[Diagram: a MyApplication chip connected to (lots of) memory and to the rest of the world.]
source: Maxeler
wl 2017 2.6
A special purpose computer • but we have more than one application • impractical to optimise machines for only one application - need to run many applications in a typical system
[Diagram: separate MyApplication and OtherApplication chips, each with its own memory and its own network connection to the rest of the world.]
source: Maxeler
wl 2017 2.7
A special purpose computer • use reconfigurable chip: reprogram at runtime to implement: - different applications, or - different versions of the same application
[Diagram: a reconfigurable chip optimised for Config 1 of Application A (blocks B, C, D, E), connected to memory and network.]
source: Maxeler
wl 2017 2.8
Instruction processors
source: Maxeler
wl 2017 2.9
Dataflow/stream processors
source: Maxeler
wl 2017 2.10
Accelerating real applications • CPUs are good for: - latency-sensitive, control-intensive, non-repetitive code • dataflow engines are good for: - high throughput repetitive processing on large data volumes
A system should contain both.
Lines of code: total application 1,000,000 | kernel to accelerate 2,000 | software to restructure 20,000
source: Maxeler
wl 2017 2.11
Custom computing in a PC
[Diagram: PC architecture with processor (register file, L1$, L2$), DIMMs, North/South Bridge, PCI bus and disk.]
Where is the Custom Architecture? • on-chip, with access to the register file • as a co-processor with access to the level-1 cache • next to the level-2 cache • in an adjacent processor socket, connected using QPI/HyperTransport • as the memory controller or North/South Bridge • as main memory (DIMMs) • as a peripheral on the PCI Express bus • inside a peripheral, e.g. a customisable disk controller
wl 2017 2.12
Embedded systems
[Diagram: processor with register file exchanging data and instructions with a custom architecture.]
• partition programs into software and hardware (custom architecture) - hardware/software co-design • System-on-Chip: SoC (covered later) • custom architecture as an extension of the processor instruction set
wl 2017 2.13
Where to locate custom architecture? • depends on the application - avoid system bottleneck for the application
• possible bottlenecks - memory access latency - memory access bandwidth - memory size - processor local memory size - processor ALU resource - processor ALU operation latency - various bus bandwidths
wl 2017 2.14
Bottleneck example: Bing page ranking
source: Microsoft
wl 2017 2.15
Reconfigurable computing with FPGAs
[Diagram: Xilinx Virtex-6 FPGA fabric with logic cells (~10^5 elements), DSP blocks, IO blocks, and Block RAM (20TB/s).]
wl 2017 2.16
High density compute with FPGAs: examples • 1U Form Factor for racks
DFE: Data Flow Engine
source: Maxeler
wl 2017 2.17
How could we program it? • schematic entry of circuits • Hardware Description Languages - VHDL, Verilog, SystemC
• object-oriented languages - C/C++, Python, Java, and related languages
• dataflow languages: e.g. MaxJ, OpenSPL • functional languages: e.g. Haskell, Ruby • high level interface: e.g. Mathematica, MatLab • schematic block diagram e.g. Simulink • domain specific languages (DSLs)
wl 2017 2.18
Accelerator programming models
[Diagram: levels of abstraction, from domain-specific languages (DSLs) and higher-level libraries at the top, down to a flexible compiler system (MaxCompiler/Ruby) targeting the possible applications.]
wl 2017 2.19
Acceleration development flow
[Flowchart: Start → Original Application → Identify code for acceleration and analyze bottlenecks → Transform app, architect and model performance → Write accelerator code → Integrate with Host code → Simulate → Functions correctly? (NO: revise the accelerator code; YES: Build for Hardware) → Meets performance goals? (NO: revisit the transformation and architecture; YES: Accelerated Application).]
source: Maxeler
wl 2017 2.20
Acceleration development flow
[Same flowchart as above, with the annotation 'Mainly for project' next to the Simulate step.]
source: Maxeler
wl 2017 2.21
Customisation techniques • FPGA technology offers customisation opportunities - some data may remain constant: e.g. algebraic simplification - adopt different data structures: e.g. number representation - transform: e.g. enhance parallelism, pipelining, serialisation
• reuse possibilities (more next lecture) - description: repeating unit, parametrisation - transforms: patterns, laws, proofs
• example: polynomial evaluation for numbers ai, x: y = a0 + a1 x + a2 x^2 + a3 x^3 (repeat many times)
wl 2017 2.22
Performance estimation • clocked circuit: no combinational loops • gates have delay; speed is limited by propagation delay through the slowest combinational path • slowest path: usually the carry path • clock rate: approx. 1/(delay of slowest path), assuming - edge-triggered design - register propagation delay, set-up time, clock skew etc. negligible
• lowest level: logic gates, do not worry about transistors
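As an illustration with hypothetical numbers (not from the slides): if the slowest combinational path has a total gate delay of 5 ns, the estimate gives
\[
f_{\text{clock}} \approx \frac{1}{t_{\text{slowest path}}} = \frac{1}{5\,\text{ns}} = 200\,\text{MHz}
\]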
wl 2017 2.23
First polynomial evaluator • compute y = a0 + a1 x + a2 x^2 + a3 x^3 • simplification: assume x constant
[Diagram: direct evaluation circuit with inputs a3, a2, a1, a0 and x. Multipliers form the powers of x and the products ai × x^i, and a chain of adders accumulates them: y = 0; for i = 0 .. 3: y = y + ai × x^i]
• problems: speed? size? repeating units? wl 2017 2.24
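A minimal Verilog sketch of this direct evaluator (a hypothetical module written for illustration; the module name, the 16-bit default width W and the truncation of every result to W bits are assumptions, not part of the slides):

module poly_direct #(parameter W = 16)
  (input  [W-1:0] x, a0, a1, a2, a3,
   output [W-1:0] y);
  // powers of x, each result truncated to W bits (sketch assumption)
  wire [W-1:0] x2 = x  * x;           // x^2
  wire [W-1:0] x3 = x2 * x;           // x^3
  // direct sum of the four terms, as in the circuit above
  assign y = a0 + a1*x + a2*x2 + a3*x3;
endmodule

As written it uses five multipliers and three adders and has no single repeating unit, which motivates the customisations on the following slides.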
Customisation possibilities 1. exploit algebraic properties 2. enhance parallelism
3. pipelining
Other possibilities • serialisation
• customise data representation - non-standard word-length, e.g. 18 bits rather than 32 bits - non-standard arithmetic, e.g. logarithmic, residue… wl 2017 2.25
1. Algebraic property: Horner’s Rule
• given: a x + b x = (a + b) x
• then: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x (a1 + x (a2 + a3 x))
[Diagrams: the circuit for the left-hand side of each identity is transformed into the circuit for the right-hand side; the direct evaluator becomes a chain of multiply-then-add stages.]
wl 2017 2.26
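A minimal Verilog sketch of the Horner form (hypothetical module, with the same 16-bit truncating datapath assumption as the earlier sketch):

module poly_horner #(parameter W = 16)
  (input  [W-1:0] x, a0, a1, a2, a3,
   output [W-1:0] y);
  // Horner chain: one multiply-then-add per stage, a single repeating unit
  wire [W-1:0] s2 = a2 + a3*x;        // a2 + a3*x
  wire [W-1:0] s1 = a1 + s2*x;        // a1 + x*(a2 + a3*x)
  assign       y  = a0 + s1*x;        // a0 + x*(a1 + x*(a2 + a3*x))
endmodule

Compared with the direct evaluator, this needs only three multipliers and three adders, and every stage is the same multiply-then-add unit, which is what makes the reuse and pipelining transformations below possible.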
2. Enhance parallelism
[Diagram: the evaluation circuit restructured for parallel operation, drawn from repeated units labelled R.]
wl 2017 2.27
3. Pipelining • split up combinational circuit: add pipeline registers • shorter cycle time, assembly-line parallelism, lower power
• pipelined design (if regular: systolic array – more later) - mandatory: same number of additional registers for all inputs - preferable: balance delay in different stages - preferable: addition of registers preserves regularity
[Diagram: combinational stages f, g and h separated by pipeline registers. Source: M Spivey]
wl 2017 2.28
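A minimal Verilog sketch of a pipelined version of the Horner evaluator above (hypothetical module, same width assumption): pipeline registers separate the three multiply-add stages, and the remaining inputs are delayed so that, as the slide requires, every input passes through the same number of additional registers.

module poly_horner_pipe #(parameter W = 16)
  (input              clk,
   input  [W-1:0]     x, a0, a1, a2, a3,
   output reg [W-1:0] y);
  // pipeline registers between the three multiply-add stages
  reg [W-1:0] s2, s1;
  // delay registers balancing the inputs used by later stages
  reg [W-1:0] x_d1, x_d2, a1_d1, a0_d1, a0_d2;
  always @(posedge clk) begin
    // stage 1: innermost multiply-add
    s2    <= a2 + a3*x;
    x_d1  <= x;    a1_d1 <= a1;   a0_d1 <= a0;
    // stage 2
    s1    <= a1_d1 + s2*x_d1;
    x_d2  <= x_d1; a0_d2 <= a0_d1;
    // stage 3: result emerges three cycles after its inputs
    y     <= a0_d2 + s1*x_d2;
  end
endmodule

A new evaluation can start every clock cycle and each result emerges three cycles later; the cycle time is now set by one multiply-add stage rather than the whole chain.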
Horner’s Rule for pipelining?
• given: P and Q are registers, R is a computational component, and the circuit identity drawn below (written concisely later as [P, Q] ; R = R ; Q)
• then: applying the identity repeatedly moves the registers through the chain of R components, giving a pipelined Horner evaluator
[Diagrams: the identity and its repeated application to the chain of R components.]
wl 2017 2.29
Pipelined incrementer: Verilog
• parameterize: - G groups of N bits - width = G*N - bits per stage = N
[Diagram: 1-stage pipelined incrementer. The input a[15..0] is split into groups a[3..0], a[7..4], a[11..8], a[15..12]; cin enters the carry chain and cout leaves it; each group has its own incrementer with a pipeline register, and upper and lower register triangles skew the inputs and outputs so that sum[15..0] emerges correctly.]
• Verilog implementation: - decompose into: • upper register triangle • chain of incrementers + register (1-stage pipeline) • lower register triangle - only top level shown - need to manage array indices
module incr_pipe #(parameter G=4, N=4)               // G groups of N bits
  (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);
  wire [G:0]     carry;                               // carry chain
  wire [G*N-1:0] temp1;                               // output of upper delay triangle
  wire [G*N-1:0] temp2;                               // input of lower delay triangle
  genvar i;                                           // loop counter
  assign carry[G] = 1;                                // prime carry input
  upper_tri_delay #(G, N) tru (temp1, inp,   clk);    // upper reg triangle
  lower_tri_delay #(G, N) trl (outp,  temp2, clk);    // lower reg triangle
  generate
    for (i = 0; i < G; i = i + 1)                     // for each group
    begin                                             // 1-stage pipelined incrementer
      incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                            temp1[(i+1)*N-1:i*N], carry[G-i], clk);
    end
  endgenerate
endmodule
wl 2017 2.30
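The listing above shows only the top level; upper_tri_delay, lower_tri_delay and incr_stage are separate modules. As a hedged illustration, one plausible incr_stage is sketched below, assuming its ports follow the positional order used in the instantiation above (carry out, registered sum, data in, carry in, clock); the actual module may differ:

module incr_stage #(parameter N = 4)
  (output reg         cout,   // carry out of this group
   output reg [N-1:0] sum,    // incremented group, one cycle later
   input      [N-1:0] a,      // group of input bits
   input              cin,    // carry in
   input              clk);
  // one pipeline stage: add the incoming carry to this group and register the result
  always @(posedge clk)
    {cout, sum} <= a + cin;
endmodule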
Concise parametric representation
• given: P and Q are registers, and [P, Q] ; R = R ; Q
• then: [P^n, Q^n] ; rdr^n R = rdr^n (Q^2 ; R)
[Diagrams: the circuits denoted by the two sides of each equation.]
wl 2017 2.31
Pipelined incrementer: Verilog vs Ruby
• parameterize: - G groups of N bits - width = G*N - bits per stage = N
• Verilog: the incr_pipe module shown on the previous slide
• Ruby:
Pipelined_incrementer G N =
  snd (tri G (tri N D)) ;            # upper reg triangle
  row G (row N (halfadd ; snd D)) ;  # 1-stage pipelined incrementer
  fst (tri~ G (tri~ N D))            # lower reg triangle
* can generate Verilog or MaxJ!
wl 2017 2.32