Transcript
Custom computing systems • difference engine: Charles Babbage 1832 - compute maths tables
• digital orrery: MIT 1985 - special-purpose engine, found Pluto's motion chaotic
• Splash2: Supercomputing Research Center 1993 - multi-FPGA engine, for video processing, DNA computing etc
• Harp1: Oxford University 1995 - FPGA + microprocessor (transputer)
• SONIC, UltraSonic: Sony + Imperial College 1999-2002 - multi-FPGA, professional video processing
• MaxWorkstation, MaxNode: 2011 - 1-4 FPGA cards, up to 192GB memory, used by JP Morgan… wl 2017 2.1
The Exaflop Supercomputer (2018) • 1 exaflop = 10^18 FLOPS • using processor cores with 8 FLOPS/clock at 2.5GHz • 50M CPU cores
• what about power? - assume power envelope of 100W per chip - Moore’s Law scaling: 6 cores/chip today → ~100 cores/chip - 500k CPU chips
• 50MW (just for CPUs!), 100MW likely in total • ‘Jaguar’ power consumption: 6MW source: Maxeler
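A quick back-of-envelope check of these figures, using only the numbers stated above:
\[
\text{cores} = \frac{10^{18}\,\text{FLOPS}}{8\,\text{FLOPS/clock}\times 2.5\times10^{9}\,\text{clock/s}} = 5\times10^{7}\;(\text{50M}),\qquad
\text{chips} = \frac{5\times10^{7}\,\text{cores}}{100\,\text{cores/chip}} = 5\times10^{5}
\]
\[
\text{CPU power} \approx 5\times10^{5}\,\text{chips}\times 100\,\text{W/chip} = 50\,\text{MW}
\]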
wl 2017 2.2
The Exaflop Supercomputer (2018), continued • same figures as the previous slide, with two questions added: How do we program this? Who pays for this?
wl 2017 2.3
Technology comparison
[Chart comparing technologies; legend: DSP = Digital Signal Processor, Dedicated HW = ASIC/FPGA]
wl 2017 2.4
Intel 6-Core X5680 “Westmere”
[Annotated die photo: each core contains execution units, L1 data cache, branch prediction, instruction fetch & L1 instruction cache, instruction decode and microcode, out-of-order scheduling & retirement, memory ordering and execution, paging, and L2 cache & interrupt servicing; the uncore, shared by the six cores, contains the shared L3 cache, the memory controller, and the I/O and QPI links.]
wl 2017 2.5
A special purpose computer • a chip customised for a specific application • no instructions → no instruction decode logic • no branches → no branch prediction • explicit parallelism → no out-of-order scheduling • data streamed on-chip → no multi-level caches
[Diagram: a MyApplication chip connected to (lots of) memory and to the rest of the world.]
source: Maxeler
wl 2017 2.6
A special purpose computer • but we have more than one application • impractical to optimise machines for only one application - need to run many applications in a typical system
[Diagram: separate MyApplication and OtherApplication chips, each with its own memory and its own network connection to the rest of the world.]
source: Maxeler
wl 2017 2.7
A special purpose computer • use reconfigurable chip: reprogram at runtime to implement: - different applications, or - different versions of the same application
[Diagram: a reconfigurable chip optimised for Config 1 of Application A (blocks B, C, D, E), connected to memory and network.]
source: Maxeler
wl 2017 2.8
Instruction processors
source: Maxeler
wl 2017 2.9
Dataflow/stream processors
source: Maxeler
wl 2017 2.10
Accelerating real applications • CPUs are good for: - latency-sensitive, control-intensive, non-repetitive code • dataflow engines are good for: - high throughput repetitive processing on large data volumes
A system should contain both.
Lines of code: total application 1,000,000 | kernel to accelerate 2,000 | software to restructure 20,000
source: Maxeler
wl 2017 2.11
Custom computing in a PC
[Diagram: PC architecture with processor (register file, L1$, L2$), DIMMs, North/South Bridge, PCI bus and disk.]
Where is the Custom Architecture? • on-chip, with access to the register file • as a co-processor with access to the level-1 cache • next to the level-2 cache • in an adjacent processor socket, connected using QPI/HyperTransport • as the memory controller or North/South Bridge • as main memory (DIMMs) • as a peripheral on the PCI Express bus • inside a peripheral, e.g. a customisable disk controller
wl 2017 2.12
Embedded systems
[Diagram: processor with register file exchanging data and instructions with a custom architecture.]
• partition programs into software and hardware (custom architecture) - hardware/software co-design • System-on-Chip: SoC (covered later) • custom architecture as an extension of the processor instruction set
wl 2017 2.13
Where to locate custom architecture? • depends on the application - avoid system bottleneck for the application
• possible bottlenecks - memory access latency - memory access bandwidth - memory size - processor local memory size - processor ALU resource - processor ALU operation latency - various bus bandwidths
wl 2017 2.14
Bottleneck example: Bing page ranking
source: Microsoft
wl 2017 2.15
Reconfigurable computing with FPGAs
[Diagram: Xilinx Virtex-6 FPGA fabric with logic cells (~10^5 elements), DSP blocks, IO blocks, and Block RAM (20TB/s).]
wl 2017 2.16
High density compute with FPGAs: examples • 1U Form Factor for racks
DFE: Data Flow Engine
source: Maxeler
wl 2017 2.17
How could we program it? • schematic entry of circuits • Hardware Description Languages - VHDL, Verilog, SystemC
• object-oriented languages - C/C++, Python, Java, and related languages
• dataflow languages: e.g. MaxJ, OpenSPL • functional languages: e.g. Haskell, Ruby • high level interface: e.g. Mathematica, MatLab • schematic block diagram e.g. Simulink • domain specific languages (DSLs)
wl 2017 2.18
Accelerator programming models
[Diagram: levels of abstraction, from domain-specific languages (DSLs) and higher-level libraries at the top, down to a flexible compiler system (MaxCompiler/Ruby) targeting the possible applications.]
wl 2017 2.19
Acceleration development flow
[Flowchart: Start → Original Application → Identify code for acceleration and analyze bottlenecks → Transform app, architect and model performance → Write accelerator code → Integrate with Host code → Simulate → Functions correctly? (NO: revise the accelerator code; YES: Build for Hardware) → Meets performance goals? (NO: revisit the transformation and architecture; YES: Accelerated Application).]
source: Maxeler
wl 2017 2.20
Acceleration development flow
[Same flowchart as above, with the annotation 'Mainly for project' next to the Simulate step.]
source: Maxeler
wl 2017 2.21
Customisation techniques • FPGA technology offers customisation opportunities - some data may remain constant: e.g. algebraic simplification - adopt different data structures: e.g. number representation - transform: e.g. enhance parallelism, pipelining, serialisation
• reuse possibilities (more next lecture) - description: repeating unit, parametrisation - transforms: patterns, laws, proofs
• example: polynomial evaluation for numbers ai, x: y = a0 + a1 x + a2 x^2 + a3 x^3 (repeat many times)
wl 2017 2.22
Performance estimation • clocked circuit: no combinational loops • gates have delay; speed is limited by propagation delay through the slowest combinational path • slowest path: usually the carry path • clock rate: approx. 1/(delay of slowest path), assuming - edge-triggered design - register propagation delay, set-up time, clock skew etc. negligible
• lowest level: logic gates, do not worry about transistors
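As an illustration with hypothetical numbers (not from the slides): if the slowest combinational path has a total gate delay of 5 ns, the estimate gives
\[
f_{\text{clock}} \approx \frac{1}{t_{\text{slowest path}}} = \frac{1}{5\,\text{ns}} = 200\,\text{MHz}
\]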
wl 2017 2.23
First polynomial evaluator • compute y = a0 + a1 x + a2 x^2 + a3 x^3 • simplification: assume x constant
[Diagram: direct evaluation circuit with inputs a3, a2, a1, a0 and x. Multipliers form the powers of x and the products ai × x^i, and a chain of adders accumulates them: y = 0; for i = 0 .. 3: y = y + ai × x^i]
• problems: speed? size? repeating units? wl 2017 2.24
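A minimal Verilog sketch of this direct evaluator (a hypothetical module written for illustration; the module name, the 16-bit default width W and the truncation of every result to W bits are assumptions, not part of the slides):

module poly_direct #(parameter W = 16)
  (input  [W-1:0] x, a0, a1, a2, a3,
   output [W-1:0] y);
  // powers of x, each result truncated to W bits (sketch assumption)
  wire [W-1:0] x2 = x  * x;           // x^2
  wire [W-1:0] x3 = x2 * x;           // x^3
  // direct sum of the four terms, as in the circuit above
  assign y = a0 + a1*x + a2*x2 + a3*x3;
endmodule

As written it uses five multipliers and three adders and has no single repeating unit, which motivates the customisations on the following slides.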
Customisation possibilities 1. exploit algebraic properties 2. enhance parallelism
3. pipelining
Other possibilities • serialisation
• customise data representation - non-standard word-length, e.g. 18 bits rather than 32 bits - non-standard arithmetic, e.g. logarithmic, residue… wl 2017 2.25
1. Algebraic property: Horner’s Rule
• given: a x + b x = (a + b) x
• then: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x (a1 + x (a2 + a3 x))
[Diagrams: the circuit for the left-hand side of each identity is transformed into the circuit for the right-hand side; the direct evaluator becomes a chain of multiply-then-add stages.]
wl 2017 2.26
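A minimal Verilog sketch of the Horner form (hypothetical module, with the same 16-bit truncating datapath assumption as the earlier sketch):

module poly_horner #(parameter W = 16)
  (input  [W-1:0] x, a0, a1, a2, a3,
   output [W-1:0] y);
  // Horner chain: one multiply-then-add per stage, a single repeating unit
  wire [W-1:0] s2 = a2 + a3*x;        // a2 + a3*x
  wire [W-1:0] s1 = a1 + s2*x;        // a1 + x*(a2 + a3*x)
  assign       y  = a0 + s1*x;        // a0 + x*(a1 + x*(a2 + a3*x))
endmodule

Compared with the direct evaluator, this needs only three multipliers and three adders, and every stage is the same multiply-then-add unit, which is what makes the reuse and pipelining transformations below possible.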
2. Enhance parallelism
[Diagram: the evaluation circuit restructured for parallel operation, drawn from repeated units labelled R.]
wl 2017 2.27
3. Pipelining • split up combinational circuit: add pipeline registers • shorter cycle time, assembly-line parallelism, lower power
• pipelined design (if regular: systolic array – more later) - mandatory: same number of additional registers for all inputs - preferable: balance delay in different stages - preferable: addition of registers preserves regularity
[Diagram: combinational stages f, g and h separated by pipeline registers. Source: M Spivey]
wl 2017 2.28
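A minimal Verilog sketch of a pipelined version of the Horner evaluator above (hypothetical module, same width assumption): pipeline registers separate the three multiply-add stages, and the remaining inputs are delayed so that, as the slide requires, every input passes through the same number of additional registers.

module poly_horner_pipe #(parameter W = 16)
  (input              clk,
   input  [W-1:0]     x, a0, a1, a2, a3,
   output reg [W-1:0] y);
  // pipeline registers between the three multiply-add stages
  reg [W-1:0] s2, s1;
  // delay registers balancing the inputs used by later stages
  reg [W-1:0] x_d1, x_d2, a1_d1, a0_d1, a0_d2;
  always @(posedge clk) begin
    // stage 1: innermost multiply-add
    s2    <= a2 + a3*x;
    x_d1  <= x;    a1_d1 <= a1;   a0_d1 <= a0;
    // stage 2
    s1    <= a1_d1 + s2*x_d1;
    x_d2  <= x_d1; a0_d2 <= a0_d1;
    // stage 3: result emerges three cycles after its inputs
    y     <= a0_d2 + s1*x_d2;
  end
endmodule

A new evaluation can start every clock cycle and each result emerges three cycles later; the cycle time is now set by one multiply-add stage rather than the whole chain.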
Horner’s Rule for pipelining?
• given: P and Q are registers, R is a computational component, and the circuit identity drawn below (written concisely later as [P, Q] ; R = R ; Q)
• then: applying the identity repeatedly moves the registers through the chain of R components, giving a pipelined Horner evaluator
[Diagrams: the identity and its repeated application to the chain of R components.]
wl 2017 2.29
Pipelined incrementer: Verilog
• parameterize: - G groups of N bits - width = G*N - bits per stage = N
[Diagram: 1-stage pipelined incrementer. The input a[15..0] is split into groups a[3..0], a[7..4], a[11..8], a[15..12]; cin enters the carry chain and cout leaves it; each group has its own incrementer with a pipeline register, and upper and lower register triangles skew the inputs and outputs so that sum[15..0] emerges correctly.]
• Verilog implementation: - decompose into: • upper register triangle • chain of incrementers + register (1-stage pipeline) • lower register triangle - only top level shown - need to manage array indices
module incr_pipe #(parameter G=4, N=4)               // G groups of N bits
  (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);
  wire [G:0]     carry;                               // carry chain
  wire [G*N-1:0] temp1;                               // output of upper delay triangle
  wire [G*N-1:0] temp2;                               // input of lower delay triangle
  genvar i;                                           // loop counter
  assign carry[G] = 1;                                // prime carry input
  upper_tri_delay #(G, N) tru (temp1, inp,   clk);    // upper reg triangle
  lower_tri_delay #(G, N) trl (outp,  temp2, clk);    // lower reg triangle
  generate
    for (i = 0; i < G; i = i + 1)                     // for each group
    begin                                             // 1-stage pipelined incrementer
      incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                            temp1[(i+1)*N-1:i*N], carry[G-i], clk);
    end
  endgenerate
endmodule
wl 2017 2.30
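The listing above shows only the top level; upper_tri_delay, lower_tri_delay and incr_stage are separate modules. As a hedged illustration, one plausible incr_stage is sketched below, assuming its ports follow the positional order used in the instantiation above (carry out, registered sum, data in, carry in, clock); the actual module may differ:

module incr_stage #(parameter N = 4)
  (output reg         cout,   // carry out of this group
   output reg [N-1:0] sum,    // incremented group, one cycle later
   input      [N-1:0] a,      // group of input bits
   input              cin,    // carry in
   input              clk);
  // one pipeline stage: add the incoming carry to this group and register the result
  always @(posedge clk)
    {cout, sum} <= a + cin;
endmodule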
Concise parametric representation
• given: P and Q are registers, and [P, Q] ; R = R ; Q
• then: [P^n, Q^n] ; rdr^n R = rdr^n (Q^2 ; R)
[Diagrams: the circuits denoted by the two sides of each equation.]
wl 2017 2.31
Pipelined incrementer: Verilog vs Ruby
• parameterize: - G groups of N bits - width = G*N - bits per stage = N
• Verilog: the incr_pipe module shown on the previous slide
• Ruby:
Pipelined_incrementer G N =
  snd (tri G (tri N D)) ;            # upper reg triangle
  row G (row N (halfadd ; snd D)) ;  # 1-stage pipelined incrementer
  fst (tri~ G (tri~ N D))            # lower reg triangle
* can generate Verilog or MaxJ!
wl 2017 2.32