
Evaluation of Graph Application using Tightly Coupled Accelerators
Toshihiro Hanawa (1), Takahiro Kaneda (2), Hideharu Amano (2)
(1) The University of Tokyo, (2) Keio University, in collaboration with the University of Tsukuba

GPGPU is now widely used to accelerate scientific and engineering computing, improving performance significantly with less power consumption. However, the I/O bandwidth bottleneck causes serious performance degradation in GPGPU computing; in particular, the latency of inter-node GPU communication grows significantly because of the multiple memory copies involved. To solve this problem, TCA (Tightly Coupled Accelerators) enables direct communication among multiple GPUs across computation nodes using PCI Express. The PEACH2 (PCI Express Adaptive Communication Hub ver. 2) chip was developed and has been evaluated on the HA-PACS/TCA cluster, which carries a PEACH2 board in each node. However, PEACH2 performance was limited by PCI Express Gen2 x8. To improve the PCI Express performance, we adopt a new FPGA that supports PCI Express Gen3 with hard IP, and we have designed and implemented a new hub chip named "PEACH3." A PEACH3 board has also been developed as a PCI Express extension board similar to the PEACH2 board. The basic bandwidth of PEACH3 reaches twice that of PEACH2.

Tightly Coupled Accelerators (TCA) Architecture, PEACH2, and PEACH3

[Figure: Block diagram of the PEACH2/3 chip. Four PCIe Gen2 x8 ports (Port N, Port E, Port W, and Port S; Port S is omitted on the PEACH3 board) link the CPU & GPU side (Endpoint) and neighboring PEACH2 chips (Root Complex / Endpoint); the chip also contains a DMAC, a routing function, on-chip memory, and a Nios II control CPU.]

[Figure: Block diagram of a computation node of HA-PACS/TCA (same for PEACH3). CPU0 and CPU1 are connected by QPI, GPU0 to GPU3 attach via PCIe Gen2 x16, the PEACH2 board attaches via PCIe Gen2 x8, and two InfiniBand HCAs attach via PCIe Gen3 x8 to the InfiniBand network.]

PEACH3 Specification vs. PEACH2

                           PEACH2                 PEACH3
  FPGA family              Altera Stratix IV GX   Altera Stratix V GX
  FPGA chip                EP4SGX530NF45C2        ES5GXA7N3F45C2
  Process technology       40 nm                  28 nm
  Available LEs            531 K                  622 K
  Port                     PCIe Gen2 x8           PCIe Gen3 x8
  Maximum bandwidth        4 Gbyte/sec            7.9 Gbyte/sec
  Operating frequency      250 MHz                250 MHz
  Internal bus width       128 bit                256 bit
  Usage of LEs             22                     38
  DRAM on board (avail.)   DDR3 512 Mbyte         DDR3 512 Mbyte

Evaluation Environment

                 HA-PACS/TCA                     PEACH3 test environment
  CPU            Intel Xeon E5-2680 v2           Intel Xeon E5-2680 v2
  GPU            NVIDIA Tesla K20X (Gen2 x16)    NVIDIA Tesla K40m (Gen3 x16)
  # of nodes     2 of 64                         2
  CUDA           7.0                             7.0
  MPI            MVAPICH2-GDR/2.1rc2             MVAPICH2-GDR/2.1
  InfiniBand     InfiniBand QDR 2-rail           (InfiniBand FDR 1-rail)
  PEACH          PEACH2                          PEACH3

[Photo: PEACH3 communication board (PCIe CEM spec., single height).]

Implementation and Evaluation of BFS using PEACH3

Level-synchronized BFS: every thread works on the same depth (level) of the BFS tree and synchronizes before moving on to the next level.

Communications on BFS (scale: 2^scale = # of vertices, P: # of GPUs):
  COMM_FLAG   flags indicating whether an adjacent vertex has been found or not
  COMM_TREE   vertex information on each level
  COMM_OUT    the final results

[Figure: GPU-to-GPU communications on BFS, illustrated on a small example graph traversed at LEVEL 1, LEVEL 2, and LEVEL 3.]
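To make the level-synchronized scheme concrete, here is a minimal, hypothetical sketch in C with MPI of a level-synchronized BFS on a 1-D partitioned graph. It is not the authors' PEACH2/PEACH3 code: the per-level "found" flags are merged with an MPI_Allreduce (standing in for the role COMM_FLAG plays in the breakdown above), the COMM_TREE and COMM_OUT exchanges are omitted, and the actual implementation drives PEACH2/PEACH3 DMA rather than MPI collectives. The structure LocalGraph and the function bfs_level_sync are invented names for illustration.

```c
/* Hypothetical sketch: level-synchronized BFS on a 1-D partitioned graph.
 * Each rank owns the vertices [v_begin, v_end) and their adjacency lists,
 * and the level[] array of size n_global is replicated on every rank.     */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  n_global;           /* total number of vertices                   */
    int  v_begin, v_end;     /* range of vertices owned by this rank       */
    int *row_ptr;            /* CSR row pointers for the owned vertices    */
    int *col_idx;            /* CSR column indices (global vertex ids)     */
} LocalGraph;

void bfs_level_sync(const LocalGraph *g, int root, int *level, MPI_Comm comm)
{
    int n = g->n_global;
    unsigned char *cur  = calloc(n, 1);  /* frontier of the current level  */
    unsigned char *next = calloc(n, 1);  /* vertices discovered this level */

    for (int v = 0; v < n; v++) level[v] = -1;
    level[root] = 0;
    cur[root] = 1;

    for (int depth = 0; ; depth++) {
        /* Expand: scan the owned part of the frontier and flag
         * every neighbor that has not been visited yet.                   */
        memset(next, 0, n);
        for (int v = g->v_begin; v < g->v_end; v++) {
            if (!cur[v]) continue;
            for (int e = g->row_ptr[v - g->v_begin];
                     e < g->row_ptr[v - g->v_begin + 1]; e++) {
                int w = g->col_idx[e];
                if (level[w] < 0) next[w] = 1;
            }
        }

        /* Level synchronization: merge the "found" flags of all ranks so
         * every rank sees the same frontier for the next level
         * (the counterpart of COMM_FLAG in the poster's breakdown).        */
        MPI_Allreduce(MPI_IN_PLACE, next, n, MPI_UNSIGNED_CHAR, MPI_MAX, comm);

        /* Commit the newly discovered vertices and advance the frontier.   */
        int added = 0;
        for (int v = 0; v < n; v++)
            if (next[v] && level[v] < 0) { level[v] = depth + 1; added = 1; }
        if (!added) break;               /* no rank found a new vertex      */

        unsigned char *tmp = cur; cur = next; next = tmp;
    }
    free(cur);
    free(next);
}
```

Because every rank blocks on the same Allreduce in each iteration, all workers are guaranteed to be processing the same depth, which is the property that allows the per-level communication time to be split into COMM_FLAG, COMM_TREE, and COMM_OUT as in the breakdown below.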
Basic Performance of PEACH2 and PEACH3

[Figure: Ping-pong latency using DMA. Latency (usec) versus data size (8 to 32768 bytes) for GPU to remote GPU, GPU to remote CPU, CPU to remote GPU, and CPU to remote CPU transfers on PEACH2 and PEACH3.]

[Figure: Ping-pong bandwidth using DMA. Bandwidth (Gbyte/sec, 0 to 8) versus data size (up to 524288 bytes), together with the ideal PEACH3 bandwidth.]

[Figure: TEPS (traversed edges per second) on the Graph500 application in comparison with MPI/IB, PEACH2, and PEACH3 (series GPUtoGPU(MPI/IB), GPUtoGPU(PEACH2), and GPUtoGPU(PEACH3)), plotted against level (11 to 17).]

[Figure: Breakdown of the average communication time per BFS into COMM_FLAG, COMM_TREE, and COMM_OUT for MPI/IB, PEACH2, and PEACH3, with Level 18.]

Acknowledgments

The HA-PACS Project was supported by the MEXT special fund as a program named "Research and Education on Interdisciplinary Computational Science Based on Exascale Computing Technology Development (FY2011-2013)" at the University of Tsukuba, and by the JST/CREST program entitled "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" in the research area "Development of System Software Technologies for post-Peta Scale High Performance Computing."
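As a closing illustration of the kind of ping-pong microbenchmark behind the latency and bandwidth figures above, the sketch below times a GPU-to-remote-GPU ping-pong with a CUDA-aware MPI such as MVAPICH2-GDR (the MPI listed in the evaluation environment), passing device pointers directly to MPI_Send and MPI_Recv. This only reflects the MPI-over-InfiniBand configuration used as the MPI/IB reference; the PEACH2/PEACH3 points are measured through the TCA DMA interface, which is not shown here. Buffer sizes, iteration count, and output format are arbitrary assumptions, not the authors' benchmark settings.

```c
/* Hypothetical GPU-to-remote-GPU ping-pong with a CUDA-aware MPI
 * (e.g. MVAPICH2-GDR, typically run with CUDA support enabled, MV2_USE_CUDA=1).
 * Rank 0 and rank 1 bounce a message held in GPU memory and report
 * one-way latency and bandwidth per message size.                       */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_BYTES (512 * 1024)   /* up to 524288 bytes, as in the plots */
#define ITERS     1000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    char *dbuf;                            /* message buffer in GPU memory */
    cudaMalloc((void **)&dbuf, MAX_BYTES);
    cudaMemset(dbuf, 0, MAX_BYTES);

    for (int bytes = 8; bytes <= MAX_BYTES; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(dbuf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(dbuf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(dbuf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(dbuf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / ITERS / 2.0;   /* seconds */
        if (rank == 0)
            printf("%8d bytes  latency %8.2f usec  bandwidth %8.2f Mbyte/s\n",
                   bytes, one_way * 1e6, bytes / one_way / 1e6);
    }

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}
```

The same skeleton, with the two MPI calls replaced by TCA DMA descriptor transfers, is the shape of measurement that would produce the PEACH2/PEACH3 curves; only the communication layer changes, which is what the comparison in the plots isolates.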