Transcript
MicRun:A Framework for Scale-free Graph Algorithms on SIMD Architecture of the Xeon Phi Jie Lin, Qingbo Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li and Lei Luo College of Computer
National University of Defense Technology
10/7/2017
Outline Section 1 Backgrounds & Motivation – Scale-free Graphs & Graph Algorithms – The Xeon Phi Architecture
Section 2 The MicRun Framework – Bucket Grouping Module – Auto-tuning Module
Section 3 Experiments & Conclusions
2
Outline Section 1 Backgrounds & Motivation – Scale-free Graphs & Graph Algorithms – The Xeon Phi Architecture
Section 2 The MicRun Framework – Bucket Grouping Module – Auto-tuning Module
Section 3 Experiments & Conclusions
3
Backgrounds & Motivation • Scale-free Graphs are Widely Used − Social Networks Applications − Chemical Molecular Structures − Reference Citations
• Features of Scale-free Graphs − The Sparsity Characteristic of Graphs − The Connectivity of Vertices Follows Power-law Distribution 6
5
10
10
Number of Vertices
Number of Vertices
4
10
3
10
2
10
1
10
y = x-γ
2
10
0
0
10 0 10
4
10
1
10
2
10 Degree
3
10
(a) Higgs-twitter
4
10
10 0 10
1
2
10
10
3
10
Degree
(b) Soc-pokec
4
Backgrounds & Motivation • Graph Algorithms Sequential computation steps − Load values of source vertices − Load values of edges − Compute (e.g. Addition Minimum et.)
− Update destination vertices
5
Backgrounds & Motivation
• The Xeon Phi Architecture − − − − − −
Architecture: Many Integrated Core (MIC) 512-bit VPU and four hyper-threads supported Frequency is more than 1.50GHz Memory (GDDR5) is more than 8GB 57-72 cores with optimized KNC Instruction set Connect to CPU with PCIE
6
Backgrounds & Motivation • Challenges of Executing Graph Algorithms on Phi − SIMD access locality influenced by access range − Write conflicts can occur in SIMD Parallelism
7
Backgrounds & Motivation • Tiling-and-Grouping Strategy is Commonly Used − Tiling Enhance the data locality − Grouping Remove Parallel conflict − Related Citations Efficient Parallel Graph Processing over CPU and MIC (Chen et al. CGO. 2016) Reusing Data Reorganization of graph Applications. (Jiang et al. IPDPS. 2016) Optimizing scale-free SPVM on the Intel Xeon Phi. (Tang et al. CGO 2015)
8
Backgrounds & Motivation • New Challenges Appear − High Penalty when Using Greedy Grouping − Difficult to Select the Optimal Tile Size 350
2500
soc. blocking time soc. grouping time higgs. blocking time higgs. grouping time
250
2000
File Size (MB)
Time (second)
300
200 150
soc-pokec higgs-twitter
1500
1000
100 500
50 0
0
soc-pokec
higgs-twitter
(a) Time Overhead
orig
128
256
512 1024 2048 4096 8192 16384 Tile Size
(b) Memory Overhead 9
Outline Section 1 Backgrounds & Motivation – Scale-free Graphs & Graph Algorithms – The Xeon Phi Architecture
Section 2 The MicRun Framework – Bucket Grouping Module – Auto-tuning Module
Section 3 Experiments & Conclusions
10
The MicRun Framework • Overview of the Framework and the Modules − − − −
Tiling Module Bucket Grouping Module Auto-tuning Module Graph Algorithms
Workflow of the MicRun Framework.
11
The MicRun Framework • Grouping Module − Bucket Structure is introduced to construct groups − Max-heap Optimization is used to improve efficiency Dest. Vertices
1
2
3 4
5
Source Vertices
6 9
7 8 11
10 12
13
2
O(n )
16 15 14 11 8 9 6 10 nnz in buckets 1 12 2 4 13 5 7 3 Bucket number 1 2 3 4 5 6 7 8
14
15 16
(a) nnz in a tile Group1 Bucket Sequential (Chen. 2016)
Group2
(b) nnz transformed into groups using buckets
2 O(n ) Group3
7-1-2-4 11-3-9-12 14-6-10-13 1-2-3-4
5-6-7-8
Group4
Group5
Group6
SIMD
15-5-8-D
16-D-D-D
NULL
16/20
9-10-11-12 13-14-D-D
15-D-D-D
16-D-D-D 16/2412
The MicRun Framework • Grouping Module − Bucket Structure is introduced to construct groups − Max-heap Optimization is used to improve efficiency Dest. Vertices
1
2
3 4
5
Source Vertices
6 9
7 8 11
10 12
13 14
15 16
(a) nnz in a tile
2 O(n )
16 15 14 11 8 9 6 10 nnz in buckets 1 12 2 4 13 5 7 3 Bucket number 1 2 3 4 5 6 7 8
(b) nnz transformed into groups using buckets
O(n*log(b))
13
The MicRun Framework • Auto-tuning Module − Extract Features Based on the Ideal Graph Application Tsum
t p int q float float Tc , r Tnc , g Tcomp Tnc , s nnztotal j 1 k 1 i 1
sizes of the adjust matrix of graphs is related to the sparsity character The nnzs in the graph can influence the whole memory The number of nnzs in each column is related to the nnzs’ distribution The average stride between nnzs can influence the cache miss The feature tuple is constructed as: (s, n, γ, NC , ST)
− Decision Tree Model is Employed The training target OT is obtained by manually probing
14
Outline Section 1 Backgrounds & Motivation – Scale-free Graphs & Graph Algorithms – The Xeon Phi Architecture
Section 2 The MicRun Framework – Bucket Grouping Module – Auto-tuning Module
Section 3 Experiments & Conclusions
15
Experiments • Platform − − − − −
MIC node on the Tianhe-Ⅱ supercomputer The version of the Xeon Phi is 31S1P 57 X86 cores, 1.10 GHz, 4 hyper threads per core The capacity of L2 cache is 28.5MB Intel ICC 13.0.0, -O3 enabled
• Graph Applications − Bellman-Ford Algorithm − PageRank Algorithm
• Datasets − SNAP Dataset − University of Florida Sparse Matrix Collection 16
Experiments • College of Computer of NUDT
• Hometown of Supercomputers: Tianhe - Ⅱ – No. 1 in TOP500 (2013.6 – 2015.11) – 33.86 PFLOPS, 32,000 CPUs+48,000 MICs
17
Experiments • Bucket Grouping vs. Seq. Grouping (Chen. 2016) − Grouping Time Overhead − SIMD Utilization Ratio
(a) Time Overhead during Grouping Stage
Decrease stably
(b) SIMD utilization by two Grouping Strategies
Converge to 1 faster
18
Experiments • The Execution of two Graph Algorithms
(b) Execution Time of Bellman-Ford
(a) Comparison of Execution Time
1.2x on Average
(c) Execution Time of PageRank 19
Experiments • The Performance of the Auto-tuning Module SPEEDUP ACHIEVED BY OPT. AND AUTO. TILING OVER SEQUENTIAL TILING PERFORMANCE Bellman-Ford Datasets OPT. vs. SEQ. AUTO. vs. SEQ. Val Size Val Size lp_osa_60 1.08 1024 1.03 256 msdoor 1.11 1152 1.05 4096 rajat24 1.18 2048 1.09 256 Si87H76 1.05 128 1.05 128 higgs-twitter 1.26 896 1.13 3072 kron-logn18 1.29 4096 1.29 4096
PageRank OPT. vs. SEQ. AUTO. vs. SEQ. Val Size Val Size 1.07 256 1.07 256 1.14 512 1.14 512 1.09 768 1.09 768 1.14 128 1.03 512 1.33 1024 1.21 640 1.36 2048 1.25 1024
Optimal 0ver Sequential
1.05x ~ 1.36x
Auto-tuning 0ver Sequential
1.03x ~ 1.29x 20
Conclusions • The MicRun Framework − Grouping Module Bucket structure is employed Max-heap mechanism is embedded
− Auto-tuning Module Decision Tree Classifier is introduced
• Future work − Enrich the graph algorithms built-in − Expand the framework to MIMD parallel level
21
The Tianhe-2 supercomputer is available online. All the scientists can collaborate with us to develop new software and access Tianhe-2 through the Internet.
Welcome to contact us ! Email:
[email protected]
22
Thank you! Questions?