MPI Optimisation and Tuning - National Computational Infrastructure

NCI and Raijin
National Computational Infrastructure

Our Partners

Accelerators
• General purpose, highly parallel processors
  – high FLOPs/watt and FLOPs/$
  – unit of execution: the kernel
  – separate memory subsystem
• GPGPU example: NVIDIA Tesla K80 – 2 x 2496 cores (562 MHz base / 875 MHz boost), 2 x 12 GB RAM, 500 GB/s memory bandwidth, 2.91 Tflops double precision, 8.74 Tflops single precision
• Coprocessor example: Intel Xeon Phi 7120X (MIC architecture) – 61 cores (244 threads), 16 GB RAM, 352 GB/s memory bandwidth, 1.2 Tflops double precision

Dell C4130 Node
[Diagram: two 12-core CPUs joined by QPI, with PCIe Gen3 x16/x8 links to an IB FDR HCA and four K80 cards exposing GPU0-GPU7]

                Raijin node                    Dell C4130 node
  Processor     2 SandyBridge Xeon E5-2670     2 Haswell Xeon E5-2670 v3
  #Cores        16                             24
  Memory        32 GB                          128 GB
  Network       InfiniBand FDR                 InfiniBand FDR
  Accelerator   none                           4 NVIDIA Tesla K80s

NVIDIA Tesla K80 GPU
[Diagram: Tesla K80 board - two GK210 GPUs, each with 12 GB GDDR5, behind an on-board PCIe switch on a PCIe Gen3 connector]
› cores: 4992
› memory: 24 GB (48 x 256M)
› memory BW: 480 GB/s (384-bit wide)
› clock base: 562 MHz
› clock max: 875 MHz
› power: 300 W max.
› SP: 5.61/8.74 TFLOPs (base / boost)
› DP: 1.87/2.91 TFLOPs (base / boost)
› architecture: Kepler
› PCIe: Gen 3 (15.7 GB/s)

GK210 and SMX
› number of SMX: 13/15 (13 of the 15 on the die enabled)
› manufacturing: TSMC 28 nm
› register file size: 512 KB
› shared mem / L1 cache: 128 KB
› transistor count: 7.1 B
Per SMX:
› single-precision cores: 192
› double-precision units: 64
› special-function units: 32
› load/store units: 32

Software Stack
  Item            Software
  OS              CentOS
  Kernel          3.14.46.el6.x86_64
  OFED            Mellanox OFED 2.1
  Host compiler   Intel-CC/12.1.9.293
  MPI library     OpenMPI/1.6.5
  MKL library     Intel-MKL/12.1.9.293
  CUDA toolkit    CUDA 6.5
  CUDA driver     340.87

HPL
[Chart: HPL GFLOPS and speedups]
  Raijin node          302 GFLOPS    1.00x
  Haswell node         742 GFLOPS    2.46x
  Haswell + 2 K80s    2084 GFLOPS    6.90x
  Haswell + 4 K80s    6659 GFLOPS   22.04x
  2 nodes, 4 K80s     5122 GFLOPS   16.95x
  2 nodes, 8 K80s    13960 GFLOPS   46.20x
• Binary version: hpl-2.1_cuda-6.5 from NVIDIA
• Auto boost is used in all tests (manually setting the clocks may give better results)
• Some experiments are not fully tuned (e.g. the half-GPU runs)
• Speedups are relative to one Raijin node

Power Consumption (Watts)
[Charts: system power (W) over time during HPL for the Haswell+2K80, Haswell+4K80, 2-node 4-K80 and 2-node 8-K80 runs]
• System power readings are taken from ipmi-sensors
• As a reference, two Raijin nodes consume ~600 W

GPU Autoboost
[Chart: GPU clock (MHz) and GPU power (W) over time during the run]
• HPL benchmarking on a single node using 8 GPUs
• Power consumption shown is for the GPUs only
• The clock ranges from 374 MHz to 875 MHz under auto boost (see the sketch below for pinning clocks manually)
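Since the HPL runs above rely on auto boost, and manually set clocks may give better results, the sketch below shows one way to query and pin the application clocks with nvidia-smi. It is an illustration only: the device index, the 2505 MHz memory / 875 MHz graphics values (the K80 maximums) and the permissions required are assumptions that depend on the local configuration.

    # Query the clocks supported by GPU 0 and its current clock state
    nvidia-smi -i 0 -q -d SUPPORTED_CLOCKS
    nvidia-smi -i 0 -q -d CLOCK

    # Pin the application clocks (memory,graphics in MHz) instead of
    # relying on auto boost; typically needs root or suitably relaxed
    # nvidia-smi permissions
    nvidia-smi -i 0 -ac 2505,875

    # Restore the default application clocks afterwards
    nvidia-smi -i 0 -rac

Whether pinned clocks actually beat auto boost is workload dependent; the HPL numbers above were all collected with auto boost enabled.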
NAMD
[Chart: NAMD speedups relative to one Raijin node for the apoa1, f1atpase and stmv benchmarks on Raijin, Haswell (24 cores), Haswell + 2 K80s and Haswell + 4 K80s; the Haswell node gives roughly 2x and the best GPU configuration reaches about 13.7x]
• GPU version: NAMD 2.10_Linux_x86_64-multicore-CUDA
• CPU version: NAMD 2.10
• Speedups are relative to one Raijin node

NAMD STMV Comparison with Raijin
[Chart: NAMD stmv on Raijin - days/ns and power (W) versus number of nodes, from 4 to 36; 24 nodes reach about 0.696 days/ns]
• The performance of 24 Raijin nodes using MPI is similar to one GPU node: 0.696 days/ns versus 0.681 days/ns
• The 24 Raijin nodes consume 5463 W, compared with 3111 W for the GPU node

HPL Tuning on a GPU node
[Chart: HPL GFLOPS on one GPU node - fermi code 3804, naïve run of the NVIDIA hpl-2.1 binary 5936, highly tuned 6659]
• HPL running on a single node with 8 GPUs, using the same input
• The code version matters: moving from the fermi code to the NVIDIA hpl-2.1 binary
• Tuning matters: an optimised binary alone is not sufficient

Hybrid Programming Model
• NUMA-aware, accelerator-aware: 1 billion vs 1000 x 1000 x 1000
• MPI + OpenMP + CUDA
• Accelerator programming options: CUDA, OpenMP 4.0, OpenCL, OpenACC, MIC

Locality and Affinity Display
Multi-GPU and IB affinity, NUMA locality
[Diagram: node r3596 - two 12-core CPUs joined by QPI, PCIe Gen3 x16/x8 links to the IB FDR HCA and four K80s (GPU0-GPU7); connected to r3597 over 56 Gb/s InfiniBand]

Execution Model of HPL

    module load openmpi/1.6.5 cuda/6.5 ...
    export OMP_NUM_THREADS=3
    export MKL_NUM_THREADS=3
    mpirun -np 16 --bind-to-none ... ./run_script

    # run_script
    export OMP_NUM_THREADS=3
    export MKL_NUM_THREADS=3
    case $OMPI_COMM_WORLD_LOCAL_RANK in
    [0]) export CUDA_VISIBLE_DEVICES=0
         numactl --physcpubind=0,2,4 ./xhpl ;;
    [1]) export CUDA_VISIBLE_DEVICES=1
         numactl --physcpubind=6,8,10 ./xhpl ;;
    ...
    [7]) export CUDA_VISIBLE_DEVICES=7
         numactl --physcpubind=19,21,23 ./xhpl ;;
    esac

Resource Utilisation
• A program's view from a computer scientist: the program maps onto the machine through four resources
  – Computation: CPU, ILP, parallelism
  – Memory: caching, conflicts, locality
  – Communication: bandwidth, latency
  – I/O: I/O caching, granularity

PAPI CUDA Component
The papi/5.4.1-cuda module on Raijin supports CUDA counters. Sample output:

    Process GPU results on 8 GPUs...
    PAPI counterValue  4432  --> cuda:::device:0:inst_executed
    PAPI counterValue  9977  --> cuda:::device:0:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:1:inst_executed
    PAPI counterValue 10228  --> cuda:::device:1:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:2:inst_executed
    PAPI counterValue  9961  --> cuda:::device:2:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:3:inst_executed
    PAPI counterValue  9885  --> cuda:::device:3:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:4:inst_executed
    PAPI counterValue  9942  --> cuda:::device:4:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:5:inst_executed
    PAPI counterValue  9852  --> cuda:::device:5:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:6:inst_executed
    PAPI counterValue  9836  --> cuda:::device:6:elapsed_cycles_sm
    PAPI counterValue  4432  --> cuda:::device:7:inst_executed
    PAPI counterValue  9757  --> cuda:::device:7:elapsed_cycles_sm
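Before instrumenting a code, it can help to check which CUDA counters the PAPI component actually exposes on a node. A minimal sketch using the standard PAPI command-line utilities is shown below; the module name is taken from the slide above, and depending on the environment a matching cuda module may also need to be loaded.

    # Load the PAPI build that includes the CUDA component
    module load papi/5.4.1-cuda

    # List the components compiled into this PAPI build; "cuda" should appear
    papi_component_avail

    # List native events; CUDA events are named cuda:::device:<n>:<counter>
    papi_native_avail | grep "cuda:::"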
Performance Modelling
• Performance modelling (or performance expectation)
  – estimate the baseline performance
  – estimate the potential benefit
  – identify the critical resources
• Benchmarking is not performance modelling
• Combine performance tools with analytical methods

Example profile breakdown:
  Compute   61.3% of walltime - 17.5% in scalar numeric ops, 2.5% in vector numeric ops, 80.0% in memory accesses
  MPI       31.8% of walltime - 57.6% in collective calls (process rate 12.6 MB/s), 42.4% in point-to-point calls (process rate 108 MB/s)
  I/O        6.9% of walltime - 0% in reads (process read rate 0 MB/s), 100% in writes (process write rate 28.9 MB/s)

Computational Intensity
Computational intensity (CI) = the number of arithmetic operations per memory load or store.

  Example loop                        CI     Key factor
  A(:) = B(:) + C(:)                  0.33   memory
  A(:) = c * B(:)                     0.5    memory
  A(:) = B(:) * C(:) + D(:)           0.5    memory
  A(:) = B(:) * C(:) + D(:) * E(:)    0.6    memory
  A(:) = c * B(:) + d * C(:)          1.0    still memory
  A(:) = c + B(:) * (d + B(:) * e)    2.0    calculation

For example, the last loop performs four floating-point operations per element against only two memory operations (load B, store A), giving a CI of 2.0.

Working and To Do
• Profiling tools
  – nvprof, nsight, etc.
  – PAPI CUDA components
• CUDA-aware MPI
  – OpenMPI 1.10.0 built with CUDA awareness
  – GPU Direct RDMA
• PBS scheduling and GPUs
  – resource utilisation
  – nvidia-smi permissions

References
• Tesla K80 GPU Accelerator Board Specification, NVIDIA, January 2015
• NVIDIA's CUDA Compute Architecture: Kepler GK110/210, NVIDIA white paper
• GPU Performance Analysis and Optimisation, NVIDIA, 2015
• OpenMPI with RDMA Support and CUDA
• GPU Hardware Execution Model and Overview, University of Utah, 2011
• NCI: http://www.nci.org.au
• NVIDIA CUDA: http://www.nvidia.com/object/cuda_home_new.html