Transcript
The University of Tokyo / Keio University
Collaboration with U. of Tsukuba
Evaluation of Graph Application using Tightly Coupled Accelerators
Toshihiro Hanawa¹, Takahiro Kaneda², Hideharu Amano²
Tightly Coupled Accelerators (TCA) Architecture, PEACH2, and PEACH3
[Figure] Block diagram of PEACH2/3 chip (Port S is omitted on PEACH3 board): Nios II CPU and PCIe Gen2 x8 ports, including Port S and ports to the neighboring PEACH2 (Root Complex / Endpoint).
[Figure] Block diagram of computation node of HA-PACS/TCA (same for PEACH3): CPU0 and CPU1 connected via QPI, GPU0-GPU3 on PCIe Gen2 x16, PEACH2 on Gen2 x8, and two InfiniBand HCAs on Gen3 x8.
Evaluation Environment

              HA-PACS/TCA                     PEACH3 Test Env.
CPU           Intel Xeon E5-2680 v2           Intel Xeon E5-2680 v2
GPU           NVIDIA Tesla K20X (Gen2 x16)    NVIDIA Tesla K40m (Gen3 x16)
# of Nodes    2 of 64                         2
CUDA          7.0                             7.0
MPI           MVAPICH2-GDR/2.1rc2             MVAPICH2-GDR/2.1
InfiniBand    QDR 2-rail                      (FDR 1-rail)
PEACH         PEACH2                          PEACH3
PEACH2 vs. PEACH3 Specifications

                           PEACH2                 PEACH3
FPGA Family                Altera Stratix IV GX   Altera Stratix V GX
FPGA Chip                  EP4SGX530NF45C2        ES5GXA7N3F45C2
Process Technology         40 nm                  28 nm
Available LEs              531K                   622K
Port                       PCIe Gen2 x8           PCIe Gen3 x8
Maximum Bandwidth          4 Gbyte/sec            7.9 Gbyte/sec
Operation Frequency        250 MHz                250 MHz
Internal Bus Width         128 bit                256 bit
Usage of LEs               22%                    38%
DRAM on Board (Available)  DDR3 512 Mbyte         DDR3 512 Mbyte
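The maximum-bandwidth row follows from the nominal PCIe line rates and encodings (Gen2: 5 GT/s with 8b/10b coding; Gen3: 8 GT/s with 128b/130b). A minimal sketch of this arithmetic (the function name is illustrative) reproduces the 4 and 7.9 Gbyte/sec entries for x8 links:

```python
def pcie_peak_bw(gt_per_s, payload_bits, total_bits, lanes):
    """Theoretical peak bandwidth in Gbyte/s of a PCIe link:
    line rate (GT/s) scaled by the encoding efficiency,
    divided by 8 bits/byte, times the number of lanes."""
    return gt_per_s * payload_bits / total_bits / 8 * lanes

gen2_x8 = pcie_peak_bw(5, 8, 10, 8)      # Gen2: 8b/10b encoding
gen3_x8 = pcie_peak_bw(8, 128, 130, 8)   # Gen3: 128b/130b encoding

print(f"Gen2 x8: {gen2_x8:.1f} Gbyte/s")  # 4.0
print(f"Gen3 x8: {gen3_x8:.1f} Gbyte/s")  # 7.9
```

The jump from 4 to 7.9 Gbyte/sec comes not only from the higher 8 GT/s line rate but also from the much lighter 128b/130b encoding overhead of Gen3.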
Communications on BFS

[Figure] Example graph traversed level by level (LEVEL 1 onward), illustrating which communications occur during BFS.
[Photo] PEACH3 Test Env.: PEACH3 communication board (PCIe CEM spec., single height).
Level-Synchronized BFS: each thread works on the same depth (level) in synchronization.
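A minimal single-threaded sketch of level-synchronized BFS (the poster's multi-GPU version partitions the work across GPUs, which is not modeled here): the whole frontier at one depth is expanded before any vertex of the next depth is touched.

```python
def level_sync_bfs(adj, root):
    """Level-synchronized BFS: `adj` maps each vertex to its
    neighbor list; returns {vertex: depth} for reachable vertices."""
    depth = {root: 0}
    frontier = [root]             # all vertices at the current level
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for v in frontier:        # expand the whole level in lockstep
            for w in adj[v]:
                if w not in depth:
                    depth[w] = level
                    next_frontier.append(w)
        frontier = next_frontier  # barrier: advance to the next level
    return depth

# Small example: a path 0-1-2 plus a branch 1-3
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(level_sync_bfs(adj, 0))  # {0: 0, 1: 1, 2: 2, 3: 2}
```

The per-level barrier is what the parallel version synchronizes on; all communications (flags, frontier, tree updates) happen once per level rather than per vertex.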
[Photo] HA-PACS/TCA.
Implementation and Evaluation of BFS using PEACH3
Basic Performance of PEACH2 and PEACH3
[Figure] Internal structure of PEACH2: Port N to the CPU & GPU side (Endpoint), Ports E and W to the neighboring PEACH2 (Endpoint / Root Complex), with DMAC and routing function; all links PCIe Gen2 (x8/x16).
[Figure] Evaluation environment: computation nodes (CPU0/CPU1 via QPI, GPUs, memory, PEACH2 boards) connected by PEACH2 links and the InfiniBand network.
GPGPU is now widely used to accelerate scientific and engineering computing, improving performance significantly at lower power consumption. However, the I/O bandwidth bottleneck causes serious performance degradation in GPGPU computing; in particular, the latency of inter-node GPU communication grows significantly because of the several memory copies involved. To solve this problem, TCA (Tightly Coupled Accelerators) enables direct communication among multiple GPUs across computation nodes using PCI Express. The PEACH2 (PCI Express Adaptive Communication Hub ver. 2) chip was developed and has been evaluated on the HA-PACS/TCA cluster, which employs a PEACH2 board in each node. However, PEACH2 performance was limited by its PCI Express Gen2 x8 links. To improve PCI Express performance, we adopt a new FPGA that supports PCI Express Gen3 with a hard IP block, and we have designed and implemented a new hub chip named "PEACH3." A PEACH3 board has also been developed as a PCI Express extension board, similar to the PEACH2 board. The basic bandwidth of PEACH3 is about twice that of PEACH2.
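The basic performance results below come from DMA ping-pong measurements. The hardware path cannot be reproduced here, but a generic sketch with stand-in send/recv callables shows how one-way latency and bandwidth are conventionally derived from round-trip times:

```python
import time

def ping_pong(send, recv, size, iters=1000):
    """Run `iters` ping-pong round trips of `size`-byte messages
    and return (one-way latency in seconds, bandwidth in bytes/s)."""
    buf = bytearray(size)
    start = time.perf_counter()
    for _ in range(iters):
        send(buf)                    # ping: push the buffer to the peer
        recv(buf)                    # pong: wait for the echo
    elapsed = time.perf_counter() - start
    latency = elapsed / iters / 2    # half of the average round trip
    return latency, size / latency

# Stand-in transport: an in-process no-op, not a real interconnect
lat, bw = ping_pong(lambda b: None, lambda b: None, size=4096)
```

Small messages expose latency (dominated by per-transfer overhead), while large messages approach the link's peak bandwidth; the poster's latency and bandwidth plots are the two ends of this same measurement.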
Type of Comm.   Detail of Comm.                                   Message Size (Byte)   Count / 1 BFS
COMM_FLAG       Flag whether an adjacent vertex is found or not   …                     …
COMM_OUT        …                                                 …                     …
COMM_TREE       Vertex information on each level                  …                     …
Final Results
scale: level (2^scale = # of vertices), P: # of GPUs
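Given these definitions, the problem size and the TEPS metric reduce to simple arithmetic. A small sketch, assuming the Graph500 default edgefactor of 16 (the edgefactor is not stated on the poster):

```python
def graph500_size(scale, edgefactor=16):
    """Graph500 problem size: 2^scale vertices and
    edgefactor * 2^scale edges (edgefactor 16 is the default)."""
    n = 2 ** scale
    m = edgefactor * n
    return n, m

def teps(edges_traversed, bfs_seconds):
    """TEPS: traversed edges per second for one BFS run."""
    return edges_traversed / bfs_seconds

n, m = graph500_size(17)   # scale 17, one of the plotted levels
print(n, m)                # 131072 2097152
```

Doubling the scale parameter by one doubles both vertex and edge counts, which is why communication volume per BFS grows quickly across the plotted levels.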
TEPS: Traversed Edges Per Second.
[Figure] Ping-pong latency using DMA (latency in usec vs. data size), including the ideal of PEACH3.
[Figure] Bandwidth (Gbytes/sec) vs. data size (bytes) for GPU to remote GPU, CPU to remote CPU, GPU to remote CPU, and CPU to remote GPU.
[Figure] Ping-pong bandwidth using DMA vs. size (bytes).
[Figure] Average communication time per BFS, by level (13, 15, 17).
[Figure] TEPS on the Graph500 application in comparison with MPI/IB, PEACH2, and PEACH3.
[Figure] Breakdown of communication time using PEACH2 and PEACH3 with Level 18.
HA-PACS Project was supported by MEXT special fund as a program named "Research and Education on Interdisciplinary Computational Science Based on Exascale Computing Technology Development (FY2011-2013)" in U. of Tsukuba, and by the JST/CREST program entitled "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" in the research area of "Development of System Software Technologies for post-Peta Scale High Performance Computing."