Transcript
NCI and Raijin
National Computational Infrastructure
Our Partners
Accelerators
• General purpose, highly parallel processors
  – High FLOPs/watt and FLOPs/$
  – Unit of execution: the kernel
  – Separate memory subsystem
• GPGPU
  – Example: NVIDIA Tesla GPU (K80)
    2 x 2496 cores (562 MHz / 875 MHz)
    2 x 12 GB RAM
    500 GB/s memory bandwidth
    2.91 TFLOPS double precision
    8.74 TFLOPS single precision
• Coprocessor
  – Example: Intel Xeon Phi 7120X (MIC architecture)
    61 cores (244 threads)
    16 GB RAM
    352 GB/s memory bandwidth
    1.2 TFLOPS double precision
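The bullets above note that the unit of execution on an accelerator is a kernel and that the device has its own memory subsystem. The CUDA C sketch below (an illustration added here, not part of the original slides; the kernel name vec_add and the array size are arbitrary) shows both points with a vector add: buffers are allocated in the separate device memory, copied over PCIe, a kernel is launched across many threads, and the result is copied back.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Kernel: the unit of execution on the GPU; each thread adds one element. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Separate memory subsystem: allocate on the device and copy across PCIe. */
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch the kernel over n threads. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

On a K80 node with the CUDA 6.5 toolkit listed later this would be built with something like nvcc -arch=sm_37 vec_add.cu.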
Dell C4130 Node
[Figure: node block diagram - CPU0 and CPU1 (12 cores each) linked by QPI, with the InfiniBand FDR HCA and four Tesla K80 cards (presenting as GPU0-GPU7) attached over PCIe Gen 3 (x16 and x8 links)]
Raijin Node vs. Dell C4130 Node

              Raijin Node                      Dell C4130 Node
Processor     2 x Sandy Bridge Xeon E5-2670    2 x Haswell Xeon E5-2670 v3
#Cores        16                               24
Memory        32 GB                            128 GB
Network       InfiniBand FDR                   InfiniBand FDR
Accelerator   None                             4 x NVIDIA Tesla K80
NVIDIA Tesla K80 GPU
[Figure: Tesla K80 board layout - two GK210 GPUs, each with 12 GB GDDR5, behind an on-board PCIe switch feeding the PCIe Gen 3 connector]

› cores: 4992
› memory: 24 GB (48 x 256 MB)
› memory BW: 480 GB/s (384-bit wide)
› clock base: 562 MHz
› clock max: 875 MHz
› power: 300 W max.
› SP: 5.61 / 8.74 TFLOPS (base / boost)
› DP: 1.87 / 2.91 TFLOPS (base / boost)
› Architecture: Kepler
› PCIe: Gen 3 (15.7 GB/s)
GK210 and SMX

› Number of SMX: 13 of 15
› Manufacturing: TSMC 28 nm
› Register File Size: 512 KB
› Shared Mem / L1 Cache: 128 KB
› Transistor Count: 7.1 B

Per SMX:
› single-prec cores: 192
› double-prec units: 64
› special-func units: 32
› load/store units: 32
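The figures above can also be read back at run time from the CUDA runtime. The following sketch is added here as an illustration (not from the slides): it queries cudaGetDeviceProperties and prints the SMX count, memory size, clocks and bus width of every visible GPU; on a K80 node each GK210 should report 13 multiprocessors and roughly 12 GB of memory.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; d++) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);

        /* Clocks are reported in kHz, memory in bytes. */
        printf("GPU %d: %s (compute capability %d.%d)\n", d, p.name, p.major, p.minor);
        printf("  SMX count        : %d\n", p.multiProcessorCount);
        printf("  global memory    : %.1f GB\n", p.totalGlobalMem / 1e9);
        printf("  core clock (max) : %d MHz\n", p.clockRate / 1000);
        printf("  memory clock     : %d MHz\n", p.memoryClockRate / 1000);
        printf("  memory bus width : %d-bit\n", p.memoryBusWidth);
    }
    return 0;
}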
Software Stack

Item           Software
OS             CentOS
Kernel         3.14.46.el6.x86_64
OFED           Mellanox OFED 2.1
Host Compiler  Intel-CC/12.1.9.293
MPI Library    OpenMPI/1.6.5
MKL Library    Intel-MKL/12.1.9.293
CUDA Toolkit   CUDA 6.5
CUDA Driver    340.87
HPL GFLOPS and Speedups
[Figure: actual GFLOPS (bars) with the speedup/acceleration (X) on a secondary axis]

Configuration    GFLOPS   Speedup
Raijin             302      1.00
Haswell            742      2.46
Haswell+2K80s     2084      6.90
Haswell+4K80s     5122     16.95
2 node 4K80       6659     22.04
2 node 8K80      13960     46.20

• Binary version: hpl-2.1_cuda-6.5 from NVIDIA
• Auto boost is used in all the tests (a manual clock setting may give better results)
• Some experiments are not fully tuned (e.g., half of the GPUs)
• Speedups are relative to one Raijin node
Power Consumption (Watts)
[Figure: system power over time (0-3500 W) for the 2-node 4K80, 2-node 8K80, Haswell+2K80 and Haswell+4K80 runs]

• System power reading from ipmi-sensors
• As a reference, 2 Raijin nodes consume ~600 W
GPU Autoboost
[Figure: GPU clock (MHz) and GPU power (W) over time during the run]

• HPL benchmarking on a single node using 8 GPUs
• Power consumption is for the GPUs only
• The clock ranges from 374 MHz to 875 MHz
NAMD
[Figure: speedup (acceleration, X) over one Raijin node for the apoa, f1atpase and stmv benchmarks on Raijin, Haswell (24 cores), Haswell+2K80s and Haswell+4K80s; Raijin is the 1.00 baseline, the Haswell node alone gives roughly 2x, and the K80 configurations reach up to ~13.7x]

• GPU version: NAMD 2.10_Linux_x86_64-multicore-CUDA
• CPU version: NAMD 2.10
• Speedups are relative to one Raijin node
NAMD STMV Comparison with Raijin
[Figure: STMV performance (days/ns) and power (W) versus number of Raijin nodes, 4 to 36; 24 nodes reach about 0.696 days/ns]

• The performance of 24 Raijin nodes using MPI is similar to one GPU node: 0.696 days/ns vs 0.681 days/ns
• The power consumption is 5463 W for the 24 Raijin nodes compared to 3111 W for the GPU node
HPL Tuning on a GPU node
[Figure: GFLOPS for three code/tuning variants - fermi: 3804, naïve: 5936, highly-tuned: 6659]

• HPL running on a single node with 8 GPUs, with the same input
• Code version does matter - from the fermi code to the NVIDIA hpl-2.1 binary
• Tuning does matter - an optimised binary is not sufficient
Hybrid Programming Model
NUMA-aware, accelerator-aware: 1 billion vs 1000 x 1000 x 1000
MPI + OpenMP + CUDA (a minimal sketch of this combination follows below)

Accelerator Programming
• CUDA
• OpenMP 4.0
• OpenCL
• OpenACC
• MIC
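The following is a hedged sketch, added here for illustration, of the MPI + OpenMP + CUDA combination: each MPI rank picks the GPU matching its node-local rank, host work runs in an OpenMP parallel region, and the kernel (here an arbitrary scale kernel) runs on the rank's own GPU. The local-rank calculation via MPI_Comm_split_type assumes an MPI-3 library; with OpenMPI one could equally read the OMPI_COMM_WORLD_LOCAL_RANK environment variable, as the HPL run script later in the talk does.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, local_rank, ngpus;
    MPI_Comm local;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks sharing a node get consecutive local ranks 0..N-1. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &local);
    MPI_Comm_rank(local, &local_rank);

    /* Bind this rank to one of the node's GPUs (e.g. 8 on a C4130). */
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(local_rank % ngpus);

    /* Host-side work shared among OpenMP threads. */
    const int n = 1 << 20;
    float *x;
    cudaMallocHost((void **)&x, n * sizeof(float));
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = (float)i;

    /* Device-side work on the GPU owned by this rank. */
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
    cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("rank %d (local %d) on GPU %d: x[1] = %.1f\n",
           rank, local_rank, local_rank % ngpus, x[1]);

    cudaFree(d_x);
    cudaFreeHost(x);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

Launched with one rank per GPU and each rank pinned to nearby cores (as in the HPL run script shown later), this gives NUMA-aware and accelerator-aware placement.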
Locality and Affinity Display
Multi-GPU and IB affinity, NUMA locality
[Figure: topology of node r3596 - CPU0 and CPU1 (12 cores each) joined by QPI, the four K80 cards (GPU0-GPU7) and the InfiniBand FDR HCA attached over PCIe Gen 3 (x16 and x8 links); the HCA connects to r3597 over 56 Gb/s InfiniBand]
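One way to feed such a display (an illustrative sketch added here, not the tool used on the slide) is to ask the CUDA runtime where each GPU sits on the PCIe bus and then match that against the NUMA layout reported by the operating system, e.g. the output of numactl --hardware.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; d++) {
        char busid[32];
        /* PCI identifier in domain:bus:device.function form; this shows
           which socket's PCIe root each GPU hangs off. */
        cudaDeviceGetPCIBusId(busid, sizeof(busid), d);
        printf("GPU %d: PCI %s\n", d, busid);
    }
    return 0;
}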
Execution Model of HPL

module load openmpi/1.6.5 cuda/6.5 ...
export OMP_NUM_THREADS=3
export MKL_NUM_THREADS=3
mpirun -np 16 --bind-to-none ... ./run_script

# run_script
export OMP_NUM_THREADS=3
export MKL_NUM_THREADS=3
case $OMPI_COMM_WORLD_LOCAL_RANK in
  [0]) export CUDA_VISIBLE_DEVICES=0
       numactl --physcpubind=0,2,4 ./xhpl ;;
  [1]) export CUDA_VISIBLE_DEVICES=1
       numactl --physcpubind=6,8,10 ./xhpl ;;
  ...
  [7]) export CUDA_VISIBLE_DEVICES=7
       numactl --physcpubind=19,21,23 ./xhpl ;;
esac
Resource Utilisation
• A program view from a computer scientist: how the program maps onto the machine
  – Computation: CPU, ILP, parallelism
  – Memory: caching, conflicts, locality
  – Communication: bandwidth, latency
  – I/O: I/O caching, granularity
PAPI CUDA Component
The papi/5.4.1-cuda module on Raijin supports CUDA counters. Sample output (processing GPU results on 8 GPUs):

PAPI counterValue 4432  --> cuda:::device:0:inst_executed
PAPI counterValue 9977  --> cuda:::device:0:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:1:inst_executed
PAPI counterValue 10228 --> cuda:::device:1:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:2:inst_executed
PAPI counterValue 9961  --> cuda:::device:2:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:3:inst_executed
PAPI counterValue 9885  --> cuda:::device:3:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:4:inst_executed
PAPI counterValue 9942  --> cuda:::device:4:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:5:inst_executed
PAPI counterValue 9852  --> cuda:::device:5:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:6:inst_executed
PAPI counterValue 9836  --> cuda:::device:6:elapsed_cycles_sm
PAPI counterValue 4432  --> cuda:::device:7:inst_executed
PAPI counterValue 9757  --> cuda:::device:7:elapsed_cycles_sm
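A hedged sketch of how such counters can be read through the PAPI C API is shown below; it is added here for illustration and is not the code that produced the output above. It assumes the CUDA component of the papi/5.4.1-cuda module is available, reuses the event names from the sample output, and keeps error handling to a minimum.

#include <stdio.h>
#include <papi.h>
#include <cuda_runtime.h>

__global__ void busy(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = x[i] * 2.0f + 1.0f;
}

int main(void)
{
    long long values[2];
    int eventset = PAPI_NULL;

    /* Create a CUDA context first so the CUDA component can see device 0. */
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);

    /* Event names follow the CUDA component naming used in the sample output. */
    PAPI_add_named_event(eventset, "cuda:::device:0:inst_executed");
    PAPI_add_named_event(eventset, "cuda:::device:0:elapsed_cycles_sm");

    PAPI_start(eventset);
    busy<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    PAPI_stop(eventset, values);

    printf("inst_executed     = %lld\n", values[0]);
    printf("elapsed_cycles_sm = %lld\n", values[1]);

    cudaFree(d_x);
    PAPI_shutdown();
    return 0;
}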
Performance Modelling
• Performance modelling (or performance expectation)
  – estimate baseline performance
  – estimate potential benefit
  – identify critical resources
• Benchmarking is not performance modelling
• Combine performance tools with analytical methods

Example profile breakdown:
Compute: 61.3% of walltime - 17.5% in scalar numeric ops, 2.5% in vector numeric ops, 80.0% in memory accesses
MPI: 31.8% of walltime - 57.6% in collective calls (process rate 12.6 MB/s), 42.4% in point-to-point calls (process rate 108 MB/s)
I/O: 6.9% of walltime - 0% in reads (process read rate 0 MB/s), 100% in writes (process write rate 28.9 MB/s)
Computational Intensity
Computational intensity (CI) = number of calculation operations per memory load/store

Example loop                          CI     Key factor
A(:) = B(:) + C(:)                    0.33   Memory
A(:) = c * B(:)                       0.5    Memory
A(:) = B(:) * C(:) + D(:)             0.5    Memory
A(:) = B(:) * C(:) + D(:) * E(:)      0.6    Memory
A(:) = c * B(:) + d * C(:)            1.0    Still memory
A(:) = c + B(:) * (d + B(:) * e)      2.0    Calculation
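As a rough, illustrative check of why these loops stay memory bound (my numbers, derived from the K80 figures quoted earlier, not from the slides): at the boost clock a K80 delivers about 2.91 TFLOPS in double precision against 480 GB/s of memory bandwidth, i.e. about 60 Gdoubles/s, so it can issue roughly 2910 / 60 ≈ 48 floating-point operations for every double it streams from memory. A loop therefore needs a computational intensity around 48 before it stops being limited by memory traffic; even the best loop in the table (CI = 2.0) is far below that, which is why data reuse through caches and shared memory matters so much.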
Working and To Do
• Profiling Tools
  - nvprof, nsight, etc.
  - PAPI CUDA components
• CUDA-aware MPI (a hedged sketch follows below)
  - OpenMPI 1.10.0 built with CUDA awareness
  - GPU Direct RDMA
• PBS Scheduling and GPUs
  - resource utilisation
  - nvidia-smi permissions
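As an illustration of what CUDA awareness buys (a minimal sketch, assuming the MPI library is built with CUDA support, as the OpenMPI 1.10.0 item above implies), device pointers can be handed straight to MPI calls, so the exchange below sends a GPU buffer without an explicit staging copy through host memory.

#include <mpi.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;
    float *d_send, *d_recv;
    cudaMalloc((void **)&d_send, n * sizeof(float));
    cudaMalloc((void **)&d_recv, n * sizeof(float));
    cudaMemset(d_send, 0, n * sizeof(float));

    /* With a CUDA-aware MPI the device pointers go straight into MPI;
       a non-CUDA-aware build would need cudaMemcpy to host buffers first. */
    int partner = rank ^ 1;               /* pair ranks 0-1, 2-3, ... */
    if (partner < size) {
        MPI_Sendrecv(d_send, n, MPI_FLOAT, partner, 0,
                     d_recv, n, MPI_FLOAT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0)
        printf("exchanged %d floats directly from GPU memory\n", n);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}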
References
• Tesla K80 GPU Accelerator Board Specification, NVIDIA, January 2015
• NVIDIA's CUDA Compute Architecture: Kepler GK110/210 (white paper)
• GPU Performance Analysis and Optimisation, NVIDIA, 2015
• OpenMPI with RDMA Support and CUDA
• GPU Hardware Execution Model and Overview, University of Utah, 2011
• NCI: http://www.nci.org.au
• NVIDIA CUDA: http://www.nvidia.com/object/cuda_home_new.html