Transcript
Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs
Michael Gordon, William Thies, and Saman Amarasinghe
Massachusetts Institute of Technology
ASPLOS, October 2006, San Jose, CA
http://cag.csail.mit.edu/streamit
1
Multicores Are Here!
[Chart: number of cores per chip vs. year of introduction, 1970 to 20??, on a log scale from 1 to 512. Uniprocessors (4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Power4, PA-8800, PExtreme, Power6, Yonah) sit at one core; multicores climb from 2 upward (Opteron, Xbox360, Tanglewood, Cell, Opteron 4P, Xeon MP, Broadcom 1480, Niagara, Raza XLR, Cavium Octeon, Raw, Cisco CSR-1, Intel Tflops, Picochip PC102, Ambric AM2045).]
2
Multicores Are Here!
For uniprocessors, C was the common machine language:
• Portable
• High Performance
• Composable
• Malleable
• Maintainable
[Same chart of cores per chip vs. year as the previous slide.]
3
Multicores Are Here!
What is the common machine language for multicores?
[Same chart of cores per chip vs. year as the previous slides.]
4
Common Machine Languages

Uniprocessors: common properties
• Single flow of control
• Single memory image
Uniprocessors: differences
• Register file
• ISA
• Functional units
• Register allocation
• Instruction selection
• Instruction scheduling

Multicores: common properties
• Multiple flows of control
• Multiple local memories
Multicores: differences
• Number and capabilities of cores
• Communication model
• Synchronization model

von-Neumann languages represent the common properties and abstract away the differences.
Need common machine language(s) for multicores.
5
Streaming as a Common Machine Language
• Regular and repeating computation
• Independent filters with explicit communication
  – Segregated address spaces and multiple program counters
• Natural expression of parallelism:
  – Producer/consumer dependencies
  – Enables powerful, whole-program transformations
[Stream graph of an FM radio: AtoD → FMDemod → Scatter → {LPF1, LPF2, LPF3, HPF1, HPF2, HPF3} → Gather → Adder → Speaker]
6
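For concreteness, here is a sketch of how the slide's graph could be written in StreamIt. The filter names follow the slide; the filter definitions and parameters (e.g., Adder(6)) are assumptions for illustration and would carry the actual signal-processing bodies:

  void->void pipeline FMRadio {
    add AtoD();                     // source: analog-to-digital converter
    add FMDemod();                  // FM demodulation
    add splitjoin {                 // scatter
      split duplicate;              // every band filter sees the full signal
      for (int i = 1; i <= 3; i++)
        add LPF(i);                 // low-pass bands LPF1..LPF3 (assumed filters)
      for (int i = 1; i <= 3; i++)
        add HPF(i);                 // high-pass bands HPF1..HPF3 (assumed filters)
      join roundrobin;              // gather: interleave one item per band
    }
    add Adder(6);                   // sum the six bands into one sample
    add Speaker();                  // sink
  }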
Types of Parallelism
• Task parallelism
  – Parallelism explicit in the algorithm
  – Between filters without a producer/consumer relationship
• Data parallelism
  – Between iterations of a stateless filter
  – Peel iterations of the filter and place them within a scatter/gather pair (fission)
  – Can't parallelize filters with state
• Pipeline parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized
[Diagram: a stream graph with regions labeled Task (parallel branches), Data (a fissed filter between scatter and gather), and Pipeline (a producer/consumer chain)]
7–8
Types of Parallelism (Traditionally)
• Task parallelism: thread (fork/join) parallelism
• Data parallelism: data-parallel loop (forall)
• Pipeline parallelism: usually exploited in hardware
[Same diagram as the previous slide]
9
Problem Statement
Given:
• A stream graph with a compute and communication estimate for each filter
• The computation and communication resources of the target machine
Find:
• A schedule of execution for the filters that best utilizes the available parallelism to fit the machine resources
10
Our 3-Phase Solution
1. Coarsen: fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters
Compiling to a 16-core architecture yields an 11.2x mean throughput speedup over a single core.
11
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
12
The StreamIt Project
• Applications
  – DES and Serpent [PLDI 05]
  – MPEG-2 [IPDPS 06]
  – SAR, DSP benchmarks, JPEG, …
• Programmability
  – StreamIt Language (CC 02)
  – Teleport Messaging (PPOPP 05)
  – Programming Environment in Eclipse (P-PHEC 05)
• Domain Specific Optimizations
  – Linear Analysis and Optimization (PLDI 03)
  – Optimizations for bit streaming (PLDI 05)
  – Linear State Space Analysis (CASES 05)
• Architecture Specific Optimizations
  – Compiling for Communication-Exposed Architectures (ASPLOS 02)
  – Phased Scheduling (LCTES 03)
  – Cache Aware Optimization (LCTES 05)
  – Load-Balanced Rendering (Graphics Hardware 05)
[Compiler flow: StreamIt Program → Front-end → Annotated Java → Stream-Aware Optimizations → backends: Uniprocessor (C/C++), Cluster (MPI-like C/C++), Raw (C per tile + msg code), IBM X10 (Streaming X10 runtime). The Annotated Java can also run on a simulator (Java library).]
13
Model of Computation
• Synchronous Dataflow [Lee '92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
• Static I/O rates
  – Compiler decides on an order of execution (schedule)
  – Static estimation of computation
[Example graph: A/D → Band Pass → Duplicate → four Detect filters → four LEDs]
14
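Because the I/O rates are declared statically, the compiler can rate-match the graph and fix an execution order at compile time. A minimal sketch with two hypothetical filters (not from the talk) whose rates do not match one-for-one:

  // Produces 2 items for every 1 consumed.
  float->float filter Up2 {
    work push 2 pop 1 {
      float x = pop();
      push(x);
      push(x);
    }
  }

  // Produces 1 item for every 3 consumed.
  float->float filter Down3 {
    work push 1 pop 3 {
      push((pop() + pop() + pop()) / 3);
    }
  }

  float->float pipeline RateMatched {
    add Up2();
    add Down3();
  }

  // Steady-state schedule derived by the compiler: run Up2 three times
  // (6 items produced) and Down3 twice (6 items consumed) per iteration.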
Example StreamIt Filter
[Diagram: a sliding window of N input items (positions 0–11 shown) feeding the FIR filter, which pushes one output item per input item]

float->float filter FIR (int N, float[N] weights) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateless
15
Example StreamIt Filter
[Same diagram as the previous slide; here the weights parameter becomes a mutable field of the filter]

float->float filter FIR (int N) {
  float[N] weights;
  work push 1 pop 1 peek N {
    float result = 0;
    weights = adaptChannel(weights);
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    pop();
    push(result);
  }
}

Stateful
16
StreamIt Language Overview
• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable: simple structures composed to create complex graphs
  – Malleable: change program behavior with small modifications
[Diagram: the composition constructs — filter; pipeline (a sequence of streams); splitjoin (splitter → parallel computation → joiner); feedback loop (joiner → body → splitter). Any StreamIt construct may appear wherever a filter may.]
17
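A small sketch of how these constructs compose; the filters here are hypothetical placeholders, not from the talk (a feedbackloop composes similarly from a joiner, a body, and a splitter):

  float->float filter Scale(float k) {
    work push 1 pop 1 { push(k * pop()); }
  }

  float->float splitjoin TwoBands {
    split duplicate;        // splitter: copy the stream to both branches
    add Scale(0.5);         // each branch may be any StreamIt construct,
    add pipeline {          // e.g., a nested pipeline
      add Scale(2.0);
      add Scale(0.25);
    }
    join roundrobin(1, 1);  // joiner: interleave the two results
  }

  float->float pipeline Example {
    add Scale(1.0);
    add TwoBands();         // constructs nest wherever a filter fits
  }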
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
18
Baseline 1: Task Parallelism
[Stream graph: Splitter → two parallel pipelines, each BandPass → Compress → Process → Expand → BandStop → Joiner → Adder]
• Inherent task parallelism between the two processing pipelines
• Task parallel model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
• Executing this on a 2-core machine gives ~2x speedup over a single core
• What about 4, 16, 1024, … cores?
19
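As a sketch, this task parallelism appears directly in the StreamIt source as a splitjoin of two pipelines. The filter definitions are assumed here; in a real benchmark each branch would take different parameters:

  float->float splitjoin TwoChannels {
    split duplicate;   // both pipelines process the same input signal
    add pipeline {
      add BandPass();
      add Compress();
      add Process();
      add Expand();
      add BandStop();
    }
    add pipeline {
      add BandPass();
      add Compress();
      add Process();
      add Expand();
      add BandStop();
    }
    join roundrobin(1, 1);
  }
  // Downstream, an Adder sums the two channels. The branches share no
  // producer/consumer dependence, so they can run on separate cores.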
Evaluation: Task Parallelism
Target: the Raw microprocessor, with 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA; cycle-accurate simulator.
Parallelism: not matched to target! Synchronization: not matched to target!
[Bar chart: throughput normalized to single-core StreamIt (y-axis 0–19) for BitonicSort, ChannelVocoder, DCT, DES, FFT, FilterBank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean]
20
Baseline 2: Fine-Grained Data Parallelism
• Each of the filters in the example is stateless
• We can introduce data parallelism
  – Example: 4 cores
• Fine-grained data parallel model:
  – Fiss each stateless filter N ways (N is the number of cores)
  – Remove scatter/gather if possible
• Each fission group occupies the entire machine
[Stream graph: every filter — BandPass, Compress, Process, Expand, BandStop, Adder — replicated 4 ways between its own splitter and joiner]
21
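What fission looks like at the source level, sketched by hand for a hypothetical stateless, non-peeking filter. The compiler performs this transformation internally; peeking filters instead require duplicating items and decimating:

  float->float filter F {
    work push 1 pop 1 { push(2 * pop()); }
  }

  // F fissed 4 ways: a roundrobin scatter deals one item to each replica
  // in turn, and a roundrobin gather restores the original output order.
  float->float splitjoin FissedF {
    split roundrobin(1);
    for (int i = 0; i < 4; i++)
      add F();
    join roundrobin(1);
  }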
Evaluation: Fine-Grained Data Parallelism
Good parallelism! Too much synchronization!
[Bar chart: throughput normalized to single-core StreamIt (y-axis 0–19), Task vs. Fine-Grained Data, for BitonicSort, ChannelVocoder, DCT, DES, FFT, FilterBank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean]
22
Outline
• StreamIt language overview
• Mapping to multicores
  – Baseline techniques
  – Our 3-phase solution
23
Phase 1: Coarsen the Stream Graph
[Stream graph: Splitter → two parallel pipelines, each BandPass (peeks) → Compress → Process → Expand → BandStop (peeks) → Joiner → Adder]
• Before data parallelism is exploited
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
24
Phase 1: Coarsen the Stream Graph
[Coarsened graph: Splitter → two fused filters (BandPass+Compress+Process+Expand), each followed by its peeking BandStop → Joiner → Adder]
• Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
• Benefits:
  – Reduces global communication and synchronization
  – Exposes inter-node optimization opportunities
25
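Sketched in StreamIt terms, each branch of the splitjoin now has the shape below. The fused filter is synthesized by the compiler, so its name here is hypothetical:

  add pipeline {
    // BandPass, Compress, Process, and Expand are stateless, so they fuse
    // into one coarse filter that runs them back-to-back on local buffers.
    add Fused_BandPass_Compress_Process_Expand();
    // BandStop peeks, so it is not fused with the upstream Expand.
    add BandStop();
  }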
Phase 2: Data Parallelize
Data parallelize for 4 cores:
[Graph: the Adder is fissed 4 ways within a splitter/joiner pair, occupying the entire chip]
26
Phase 2: Data Parallelize
Data parallelize for 4 cores:
[Graph: each fused filter is fissed 2 ways within its own splitter/joiner pair]
Task parallelism! Each fused filter does equal work, so fiss each filter 2 times to occupy the entire chip.
27
Phase 2: Data Parallelize
Data parallelize for 4 cores:
[Graph: each fused filter and each BandStop is fissed 2 ways; the Adder is fissed 4 ways]
• Task-conscious data parallelization
  – Preserve task parallelism
• Benefits:
  – Reduces global communication and synchronization
Task parallelism! Each filter does equal work, so fiss each filter 2 times to occupy the entire chip (see the sketch below).
28
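A sketch of the task-conscious choice, reusing the hypothetical fused filter from Phase 1 and assuming it does not peek (a peeking filter would be fissed with duplication and decimation rather than a plain roundrobin):

  // 4 cores and 2 task-parallel branches: fiss each fused filter only
  // 4/2 = 2 ways, preserving the task parallelism that already exists
  // instead of spreading every filter across the whole chip.
  float->float splitjoin FissedBranch {
    split roundrobin(1);
    add Fused_BandPass_Compress_Process_Expand();
    add Fused_BandPass_Compress_Process_Expand();
    join roundrobin(1);
  }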
Evaluation: Coarse-Grained Data Parallelism
Good parallelism! Low synchronization!
[Bar chart: throughput normalized to single-core StreamIt (y-axis 0–19), comparing Task, Fine-Grained Data, and Coarse-Grained Task + Data for BitonicSort, ChannelVocoder, DCT, DES, FFT, FilterBank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean]
29
Simplified Vocoder
[Stream graph with per-filter work estimates: Splitter → two AdaptDFT (6 each) → Joiner → RectPolar (20) → Splitter → two pipelines of UnWrap (2) → Diff (1) → Amplify (1) → Accum (1) → Joiner → PolarRect (20)]
• RectPolar and PolarRect are data parallel
• The UnWrap → Diff → Amplify → Accum pipelines are data parallel, but have too little work!
Target: a 4-core machine
30
Data Parallelize
[Graph: RectPolar is fissed 4 ways (work 20 → 5 per replica) within a splitter/joiner pair, and likewise PolarRect; the rest of the graph is unchanged]
Target: a 4-core machine
31
Data + Task Parallel Execution
[Schedule: the filters mapped onto 4 cores over time; one steady state takes 21 time units]
Target: a 4-core machine
32
We Can Do Better!
[Schedule: the same filters packed onto 4 cores over time; one steady state could take only 16 time units]
Target: a 4-core machine
33
Phase 3: Coarse-Grained Software Pipelining
[Diagram: a prologue schedule followed by a new steady state in which filters from different iterations, e.g. the RectPolar replicas, execute concurrently]
• The new steady state is free of dependencies
• Schedule the new steady state using a greedy partitioning
34
Greedy Partitioning
[Diagram: the filters to schedule are packed greedily onto 4 cores; the resulting steady state takes 16 time units]
Target: a 4-core machine
35
Evaluation: Coarse-Grained Task + Data + Software Pipelining
Best parallelism! Lowest synchronization!
[Bar chart: throughput normalized to single-core StreamIt (y-axis 0–19), comparing Task, Fine-Grained Data, Coarse-Grained Task + Data, and Coarse-Grained Task + Data + Software Pipeline for BitonicSort, ChannelVocoder, DCT, DES, FFT, FilterBank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the Geometric Mean]
36
Generalizing to Other Multicores
• Architectural requirements:
  – Compiler-controlled local memories with DMA
  – Efficient implementation of scatter/gather
• To port to other architectures, consider:
  – Local memory capacities
  – Communication-to-computation tradeoff
• Did not use processor-to-processor communication on Raw
37
Related Work
• Streaming languages:
  – Brook [Buck et al. '04]
  – StreamC/KernelC [Kapasi '03, Das et al. '06]
  – Cg [Mark et al. '03]
  – SPUR [Zhang et al. '05]
• Streaming for multicores:
  – Brook [Liao et al. '06]
• Ptolemy [Lee '95]
• Explicit parallelism: OpenMP, MPI, and HPF
38
Conclusions
• The streaming model naturally exposes task, data, and pipeline parallelism
• This parallelism must be exploited at the correct granularity and combined correctly:
  – Task: parallelism not matched to target; synchronization not matched to target
  – Fine-Grained Data: good parallelism; high synchronization
  – Coarse-Grained Task + Data: good parallelism; low synchronization
  – Coarse-Grained Task + Data + Software Pipeline: best parallelism; lowest synchronization
• Good speedups across a varied benchmark suite
• The algorithms should be applicable across multicores
39