Transcript
CASPER AND GPUS – MODERATOR: DANNY PRICE, SCRIBE: RICHARD PRESTAGE
• Frameworks – MPI, heterogeneous “large” systems
• Pipelines – hashpipe, psrdada, bifrost, htgs
• Data transport – DPDK, libVMA, NTOP
• Applications – correlators, beamformers, spectrometers, FRB
• Hardware configurations and future hardware roadmaps
USEFUL LINKS
• hashpipe - https://github.com/david-macmahon/hashpipe
• psrdada - http://psrdada.sourceforge.net
• bifrost - https://github.com/ledatelescope/bifrost
• htgs - https://pages.nist.gov/HTGS/
• DPDK - http://www.dpdk.org
• libVMA - https://github.com/Mellanox/libvma
• NTOP - https://github.com/ntop/PF_RING
APPLICATIONS
• FRB searching (Dan) - Building systems for GBT, Arecibo, FAST. Using Heimdall.
• Building the whole FPGA/switch/GPU processing engine. Have they built the whole “ultimate CASPER backend”? Not yet. There is a SETI GPU, an FRB GPU, etc. Heimdall dedispersion is the hardest computational task, but overall the search is still swamped by the number of candidates.
• Beamformers – Max Planck beamformer on MeerKAT (commensal backend).
• Packet capture and beamforming in bifrost.
• DiFX (reported by Jonathan Weintroub) used some aspect of MPI to move the existing DiFX X-engine onto GPUs? [From discussions with Arash after the meeting: he did need to hand-port FFTW to cuFFT, and some aspects of the X-engine to CUDA kernels; see the sketch below.]
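[Editor's note: an illustrative sketch, not the actual DiFX code, of what “hand-porting FFTW to cuFFT” amounts to for a batched complex-to-complex FFT; the function and variable names here are made up.]

    /* FFTW (CPU) version, roughly:
     *   plan = fftwf_plan_many_dft(1, &nchan, nbatch, in, NULL, 1, nchan,
     *                              out, NULL, 1, nchan, FFTW_FORWARD, FFTW_MEASURE);
     *   fftwf_execute(plan);
     * cuFFT (GPU) equivalent: */
    #include <cufft.h>
    #include <cuda_runtime.h>

    void forward_fft(cufftComplex *d_in, cufftComplex *d_out, int nchan, int nbatch)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, nchan, CUFFT_C2C, nbatch);    /* nbatch transforms of length nchan */
        cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD);  /* runs on the GPU, async w.r.t. the host */
        cudaDeviceSynchronize();
        cufftDestroy(plan);
    }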
• Dan: Use GPU correlators for ~2^8 (256) antennas. Not needed for small numbers of antennas (e.g. VLBI).
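[Editor's note: rough scaling behind that rule of thumb: the X-engine cost grows with the number of baselines, N(N+1)/2 including autos, times 4 polarization products. For N = 256 that is (256×257/2) × 4 ≈ 1.3×10^5 complex multiply-accumulates per channel per time sample, which maps well onto a GPU; for an 8-station VLBI-style array it is only 144, so a GPU buys little.]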
DATA TRANSPORT
• DPDK etc. give zero-copy operation, bypassing kernel space. Data goes from the NIC to GPU memory, saving one hop. RDMA – direct to GPU with InfiniBand; RoCE = similar over Ethernet, a layer above RDMA. All still have to go through system memory.
• DPDK – has to have an Intel NIC (or clone) – can get 80 Gb/s into the GPU (2x 40 Gb NICs); a receive-loop sketch follows below.
[Edit: DPDK does support some Mellanox / Broadcom / Cisco / Chelsio chipsets.]
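[Editor's note: a minimal sketch of the DPDK receive path being discussed – one port, one RX queue, polled in a tight loop with no kernel involvement. Error handling is omitted; PORT, BURST and the pool sizes are placeholders, and exact setup varies between DPDK releases.]

    #include <stdint.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define PORT  0
    #define BURST 32

    int main(int argc, char **argv)
    {
        rte_eal_init(argc, argv);                         /* take the NIC away from the kernel */

        struct rte_mempool *pool = rte_pktmbuf_pool_create("rx_pool", 8192, 256, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(PORT, 1, 1, &conf);         /* 1 RX queue, 1 TX queue */
        rte_eth_rx_queue_setup(PORT, 0, 1024, rte_eth_dev_socket_id(PORT), NULL, pool);
        rte_eth_tx_queue_setup(PORT, 0, 1024, rte_eth_dev_socket_id(PORT), NULL);
        rte_eth_dev_start(PORT);

        struct rte_mbuf *bufs[BURST];
        for (;;) {
            uint16_t n = rte_eth_rx_burst(PORT, 0, bufs, BURST);   /* poll; no interrupts */
            for (uint16_t i = 0; i < n; i++) {
                /* payload is at rte_pktmbuf_mtod(bufs[i], void *);
                 * copy it into a ring buffer / pinned host memory destined for the GPU */
                rte_pktmbuf_free(bufs[i]);
            }
        }
    }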
• libVMA: the equivalent for Mellanox NICs; 40 Gb/s per NIC. Using SPEAD packets.
• Would like a SPEAD reader using DPDK for psrdada, bifrost, etc.
• Dan – the bottleneck into PCs is packets/sec, not bits/sec, so you want giant packets (jumbo frames = 9 kB packets).
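[Editor's note: rough numbers, ignoring headers: 40 Gb/s in 1500-byte packets is about 3.3 million packets/sec, while 9000-byte jumbo frames bring that down to roughly 0.56 million packets/sec, i.e. about 6x less per-packet work for the NIC, kernel and capture thread.]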
• NICs now support interrupt coalescing – the NIC will wait for e.g. 10 packets before it interrupts the CPU. Dave’s hashpipe uses this. Kernel tuning parameters are critical – we need a CASPER memo for this. Danny – maybe one exists. Application code also needs to be bound to the correct processor, and threads need to be locked to the correct core (see the sketch below).
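[Editor's note: a minimal sketch of the core pinning being described, using the standard Linux pthread affinity call; the core number is a placeholder and in practice should sit on the same NUMA node as the NIC.]

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread (e.g. a packet-capture thread) to one CPU core. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }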
• Dan: action item – a group to get together and identify memo(s) of “required reading” before attempting to develop HPC code. Group to consist of: John Ford, Dave MacMahon, Danny Price. “How to do high-speed data transport”.
HOW TO DO HIGH-SPEED DATA TRANSPORT: A READING LIST FOR THE CURIOUS CASPERITE
• Digital signal processing using stream high performance computing: A 512-input broadband correlator for radio astronomy, J Kocz, LJ Greenhill, BR Barsdell et al., arXiv:1401.8288
• A Scalable Hybrid FPGA/GPU FX Correlator, J Kocz, LJ Greenhill, BR Barsdell et al., Journal of Astronomical Instrumentation, 2014
• The Breakthrough Listen Search for Intelligent Life: A Wideband Data Recorder System for the Robert C. Byrd Green Bank Telescope, D MacMahon, DC Price, M Lebofsky et al., arXiv:1707.06024
• An Efficient Real-time Data Pipeline for the CHIME Pathfinder Radio Telescope X-Engine, A Recnik, K Bandura, N Denman et al., arXiv:1503.06189
HARDWARE CONFIGURATIONS
• Danny: Breakthrough uses 4U servers from SuperMicro, dual Xeons, and captures raw voltages to disk. After observations, the data are played back through NVIDIA 1080 gaming cards – one per node.
• Typically BTL/GBT use one GPU per box. Others use 2 or 4 GPUs per box. The CHIME correlator uses AMD; code written in OpenCL.
• Dan – NVIDIA is into supercomputing; AMD is selling chips to gamers. Can run OpenCL on NVIDIA.
• CUDA gives you the cuFFT, cuBLAS and Thrust libraries. Does AMD have equivalents?
• The number of PCI Express lanes the CPU can support is important. An AMD CPU + NVIDIA GPU combination may be beneficial.
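[Editor's note: for scale, a PCIe 3.0 x16 slot carries roughly 15.75 GB/s (~126 Gb/s) per direction, so two 40 Gb NICs plus the host-to-GPU copies already occupy a sizeable fraction of one socket's lanes; hence the interest in CPUs that provide more lanes.]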
• POWER8/9 have “Bluelink” connections. NICs which use Bluelink may be developed. IBM has shown a lot of dedication to giving the GPU as high-speed an interconnect as possible.
• Vendors: very cheap 10/40 Gb transceivers from FiberStore (fs.com). They also sell 100 Gb switches.
PIPELINES
• HTGS does the inverse of bifrost. Bifrost binds a thread to an operation; in HTGS you define nodes in a graph, and the nodes are bound to CPU threads. The aim is to overlap data transport and computation and get a hybrid, multicore pipeline. HTGS uses an explicit graph representation throughout.
• Hashpipe – originally developed for GUPPI (Paul D.), generalized by Dave MacMahon. Not as sophisticated as bifrost/HTGS. Provides support for metadata. Hashpipe does not support forking ring buffers. Simple and straightforward, well documented, CASPER tutorials available.
• PSRDADA is similar to hashpipe. Low level. To be simple and conservative, use hashpipe or PSRDADA; bifrost has so far been used in a single instrument. (A generic sketch of the shared ring-buffer pattern follows below.)
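[Editor's note: none of these APIs are reproduced here, but the pattern they share is a ring buffer of large data blocks with one thread per stage; a minimal pthreads sketch of that idea, with made-up names, no error handling, and the consumer side elided.]

    #include <pthread.h>
    #include <stdint.h>

    #define NBLOCKS     8
    #define BLOCK_BYTES (32u*1024*1024)

    /* The ring a capture (net) thread fills and a compute (GPU) thread drains.
     * The buffers, mutex and condition variables must be initialised before use. */
    struct ring {
        uint8_t *data[NBLOCKS];                /* NBLOCKS buffers of BLOCK_BYTES each */
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_full, not_empty;
    };

    uint8_t *ring_open_write(struct ring *r)   /* block until a free slot exists */
    {
        pthread_mutex_lock(&r->lock);
        while (r->count == NBLOCKS)
            pthread_cond_wait(&r->not_full, &r->lock);
        uint8_t *blk = r->data[r->head];
        pthread_mutex_unlock(&r->lock);
        return blk;
    }

    void ring_close_write(struct ring *r)      /* mark the block filled, wake the consumer */
    {
        pthread_mutex_lock(&r->lock);
        r->head = (r->head + 1) % NBLOCKS;
        r->count++;
        pthread_cond_signal(&r->not_empty);
        pthread_mutex_unlock(&r->lock);
    }
    /* ...ring_open_read()/ring_close_read() mirror these on the consumer side... */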
• HTGS: just starting to prototype its use in Green Bank. Unique in using a graph representation, maintained through analysis and execution. It can also use multiple GPUs – formulate a subgraph and encapsulate it into an execution pipeline graph, bound to a GPU.
• Should put a link to Tim’s thesis on the CASPER website. The link to the paper is https://link.springer.com/article/10.1007/s11265-017-1262-6
GPU ROADMAP
• Vega for AMD is coming out next week; Volta for NVIDIA. Volta has tensor cores – 4x4 matrix multiplications, 16-bit inputs, 32-bit outputs (designed for AI training/inferencing; see the sketch below). CUDA 9 formalizes some of the threading models – you can write CUDA kernels that work on a thread block. No announcement on the GTX line, but they will probably announce a Volta GTX soon. Consumer cards will stick with DDR RAM. The SLI bridge can communicate between cards.
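[Editor's note: a minimal sketch of what the tensor cores expose through CUDA 9's warp-level WMMA API: each warp-wide operation multiplies 16x16 fp16 tiles and accumulates in fp32 (the 4x4 multiply is the underlying hardware granularity). Requires sm_70 (Volta); the single-tile case and kernel name are just the simplest illustration.]

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    /* One warp computes C = A x B with A, B 16x16 fp16 tiles and C a 16x16 fp32 tile
     * (row-major A/C, column-major B, leading dimension 16). */
    __global__ void wmma_16x16(const half *a, const half *b, float *c)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, a, 16);
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }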
FPGA ROADMAP
• Latest generation is UltraScale+. Some chips are in production. Lots more memory on chip: 10s of Mbytes -> Gbits. 26 Gbit/s links, 100 Gb Ethernet on eval boards.
• $7k for a VCU118 eval board with a $20k chip on it. Not engineered for industrial applications. HBM (high bandwidth memory): super-high-bandwidth DRAM, connected over the substrate.
• FPGAs with high-speed ADCs/DACs on chip: 8x 3 Gsps ADCs/DACs. Not generally available yet; will be out at the end of the year.
• Working on 7 nm chips – no date for availability yet. Dan: for performance/$, use the latest generation family, but a medium-size chip.
• Can buy VCU118 boards in bulk. Power to the FPGA is throttled to 60 W (?). May be a problem for full utilization, but looks encouraging. Full investigation not complete.