
CASPER and GPUs


Moderator: Danny Price. Scribe: Richard Prestage.

• Frameworks – MPI, heterogeneous "large" systems
• Pipelines – hashpipe, psrdada, bifrost, HTGS
• Data transport – DPDK, libVMA, NTOP
• Applications – correlators, beamformers, spectrometers, FRB
• Hardware configurations and future hardware roadmaps

USEFUL LINKS

• hashpipe - https://github.com/david-macmahon/hashpipe
• psrdada - http://psrdada.sourceforge.net
• bifrost - https://github.com/ledatelescope/bifrost
• HTGS - https://pages.nist.gov/HTGS/
• DPDK - http://www.dpdk.org
• libVMA - https://github.com/Mellanox/libvma
• NTOP (PF_RING) - https://github.com/ntop/PF_RING

APPLICATIONS

• FRB searching (Dan) – building systems for GBT, Arecibo and FAST, using Heimdall.
• Building the whole FPGA/switch/GPU processing engine: has anyone built the whole "ultimate CASPER backend"? Not yet – there is a SETI GPU code, an FRB GPU code, etc. Heimdall dedispersion is the hardest computational task, but the search is still swamped overall by the number of candidates.
• Beamformers – Max Planck beamformer on MeerKAT (commensal backend).
• Packet capture and beamforming in bifrost.
• DiFX (reported by Jonathan Weintroub) used some aspect of MPI to move the existing DiFX X-engine onto the GPU(?). [From discussions with Arash after the meeting: he did need to hand-port FFTW to cuFFT, and some aspects of the X-engine to CUDA kernels.]
• Dan: use GPU correlators for ~2^8 (256) antennas; not needed for small numbers of antennas (e.g. VLBI). A sketch of the X-engine cross-multiply step is given after this list.
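The GPU X-engine discussion above comes down to a cross-multiply-and-accumulate over every antenna pair. The kernel below is a minimal sketch of that step for a single frequency channel and time sample, not the DiFX port discussed above: the kernel name xcorr_accumulate, the choice of 256 antennas and the one-thread-per-baseline mapping are assumptions for illustration, and a production correlator would tile the work for shared-memory and register reuse.

```cuda
// Minimal CUDA sketch of an FX-correlator X-engine step: for one frequency
// channel and one time sample, accumulate the visibility for every antenna
// pair (baseline). Illustrative only -- not a production correlator kernel.
#include <cuda_runtime.h>
#include <cuComplex.h>

#define NANT 256                           // ~2^8 antennas, as discussed
#define NBASELINE (NANT * (NANT + 1) / 2)  // including autocorrelations

__global__ void xcorr_accumulate(const cuFloatComplex* __restrict__ voltages, // [NANT] for this chan/time
                                 cuFloatComplex* __restrict__ vis)            // [NBASELINE] accumulators
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per baseline
    if (b >= NBASELINE) return;

    // Unpack the (i, j) antenna pair from the triangular baseline index.
    int i = 0;
    int rem = b;
    while (rem >= NANT - i) { rem -= NANT - i; ++i; }
    int j = i + rem;

    // vis(i,j) += v_i * conj(v_j)
    cuFloatComplex p = cuCmulf(voltages[i], cuConjf(voltages[j]));
    vis[b] = cuCaddf(vis[b], p);
}

// Host-side launch for a single channel/time slice (error checking omitted):
//   int threads = 256;
//   int blocks  = (NBASELINE + threads - 1) / threads;
//   xcorr_accumulate<<<blocks, threads>>>(d_voltages, d_vis);
```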
DATA TRANSPORT

• DPDK and similar frameworks give zero-copy operation, bypassing kernel space: data goes from the NIC to GPU memory, saving one hop. RDMA gives direct-to-GPU transfers over InfiniBand; RoCE is the equivalent over Ethernet, a layer above RDMA. All still have to go through system memory.
• DPDK has to have an Intel NIC (or clone) – can get 80 Gb/s into the GPU (2x 40 Gb NICs). [Edit: DPDK does also support some Mellanox / Broadcom / Cisco / Chelsio chipsets.]
• libVMA: the equivalent for Mellanox NICs; 40 Gb/s per NIC, using SPEAD packets.
• Would like a SPEAD reader using DPDK for psrdada, bifrost, etc.
• Dan – the bottleneck into PCs is packets/sec, not bits/sec, so you want giant packets (jumbo frames = 9 kB packets).
• NICs now support interrupt coalescing – the NIC will wait for e.g. 10 packets before it interrupts the CPU; Dave's hashpipe uses this. Kernel tuning parameters are critical – we need a CASPER memo for this (Danny: maybe one exists). Application code also needs to be bound to the correct processor, and threads need to be locked to the correct core (see the sketch after the PIPELINES section below).
• Dan: action item – a group to get together and identify the memo(s) of "required reading" before attempting to develop HPC code: "How to do high-speed data transport". Group to consist of John Ford, Dave MacMahon and Danny Price.

HOW TO DO HIGH-SPEED DATA TRANSPORT: A READING LIST FOR THE CURIOUS CASPERITE

• Digital signal processing using stream high performance computing: A 512-input broadband correlator for radio astronomy, J. Kocz, L. J. Greenhill, B. R. Barsdell et al., arXiv:1401.8288
• A Scalable Hybrid FPGA/GPU FX Correlator, J. Kocz, L. J. Greenhill, B. R. Barsdell et al., Journal of Astronomical Instrumentation, 2014
• The Breakthrough Listen Search for Intelligent Life: A Wideband Data Recorder System for the Robert C. Byrd Green Bank Telescope, D. MacMahon, D. C. Price, M. Lebofsky et al., arXiv:1707.06024
• An Efficient Real-time Data Pipeline for the CHIME Pathfinder Radio Telescope X-Engine, A. Recnik, K. Bandura, N. Denman et al., arXiv:1503.06189

HARDWARE CONFIGURATIONS

• Danny: Breakthrough Listen uses 4U servers from Supermicro with dual Xeons and captures raw voltages to disk. After observations, the data are played back through NVIDIA GTX 1080 gaming cards – one per node.
• Typically BTL/GBT use one GPU per box; others use 2–4 GPUs per box. The CHIME correlator uses AMD, with code written in OpenCL.
• Dan – NVIDIA is into supercomputing; AMD is selling chips to gamers. OpenCL can be run on NVIDIA.
• CUDA gives you the cuFFT, cuBLAS and Thrust libraries. Does AMD have equivalents?
• The number of PCI Express lanes the CPU can support is important; an AMD CPU + NVIDIA GPU combination may be beneficial.
• POWER8/9 have "BlueLink" connections, and NICs using BlueLink may be developed. IBM has shown a lot of dedication to giving the GPU as fast an interconnect as possible.
• Vendors: very cheap 10/40 Gb transceivers from FiberStore (fs.com), which also sells 100 Gb switches.

PIPELINES

• HTGS does the inverse of bifrost: bifrost binds a thread to an operation, whereas HTGS defines nodes in a graph and binds each node to a CPU thread. The aim is to overlap data transport and computation, giving a hybrid, multicore pipeline. It uses an explicit graph representation throughout.
• Hashpipe – originally developed for GUPPI (Paul D.) and generalized by Dave MacMahon. Not as sophisticated as bifrost/HTGS, but it provides support for metadata. Hashpipe does not support forking ring buffers. Simple and straightforward, well documented, with CASPER tutorials available.
• PSRDADA is similar to hashpipe: low level. To be simple and conservative, use hashpipe or PSRDADA; bifrost is in use in a single instrument.
• HTGS is just starting prototype use in Green Bank. It is unique in using a graph representation, maintained through analysis and execution. It can also use multiple GPUs – formulate a subgraph and encapsulate it into an execution pipeline graph bound to a GPU.
• Should put a link to Tim's thesis on the CASPER website. Link to the paper: https://link.springer.com/article/10.1007/s11265-017-1262-6
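Hashpipe, PSRDADA and bifrost all revolve around shared-memory ring buffers serviced by worker threads pinned to specific cores, as noted under DATA TRANSPORT. The host-side sketch below (plain C++, no CUDA calls, since this layer of a pipeline runs on the CPU) illustrates that pattern only; it is not the hashpipe or PSRDADA API. The block count, block size, core numbers and single producer/consumer pair are assumptions chosen to keep the example short, and the spin-waits stand in for the semaphore or condition-variable signalling a real pipeline would use.

```cuda
// Minimal single-producer / single-consumer ring buffer with each thread
// pinned to its own core, in the spirit of hashpipe/PSRDADA (not their API).
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <array>
#include <cstdio>
#include <thread>

constexpr size_t NBLOCKS = 8;            // number of ring-buffer blocks (assumption)
constexpr size_t BLOCK_BYTES = 1 << 20;  // 1 MiB per block (assumption)

struct Ring {
    std::array<std::array<char, BLOCK_BYTES>, NBLOCKS> blocks;
    std::atomic<size_t> head{0};  // next block to fill
    std::atomic<size_t> tail{0};  // next block to process
};

// Lock the calling thread to one core so the scheduler cannot migrate it.
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void producer(Ring* r) {
    pin_to_core(2);                               // e.g. the core handling NIC interrupts
    for (int i = 0; i < 64; ++i) {
        while (r->head.load() - r->tail.load() == NBLOCKS) {}  // ring full: spin
        auto& blk = r->blocks[r->head.load() % NBLOCKS];
        blk[0] = static_cast<char>(i);            // stand-in for a packet-capture fill
        r->head.fetch_add(1);                     // publish the block
    }
}

static void consumer(Ring* r) {
    pin_to_core(3);                               // e.g. the core driving the GPU
    for (int i = 0; i < 64; ++i) {
        while (r->tail.load() == r->head.load()) {}            // ring empty: spin
        auto& blk = r->blocks[r->tail.load() % NBLOCKS];
        std::printf("processed block %d tag %d\n", i, blk[0]); // stand-in for GPU work
        r->tail.fetch_add(1);                     // release the block
    }
}

int main() {
    static Ring ring;                             // static: too large for the stack
    std::thread p(producer, &ring), c(consumer, &ring);
    p.join();
    c.join();
    return 0;
}
```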
GPU ROADMAP

• Vega for AMD is coming out next week; Volta for NVIDIA. Volta has tensor cores – 4x4 matrix multiplications with 16-bit inputs and 32-bit outputs, designed for AI training/inferencing. CUDA 9 formalizes some of the threading models – you can write CUDA kernels that cooperate across a thread block. No announcement on the GTX line, but a Volta GTX will probably be announced soon. Consumer cards will stick with DDR RAM. An SLI bridge can communicate between cards. (A tensor-core sketch is given after the FPGA ROADMAP section.)

FPGA ROADMAP

• Latest generation is UltraScale+; some chips are in production. Lots more memory on chip (10s of Mbytes -> Gbits). 26 Gbit links; 100 Gb Ethernet on eval boards.
• $7k for a VCU118 eval board with a $20k chip on it; not engineered for industrial applications. HBM (high-bandwidth memory): super-high-bandwidth DRAM connected over the substrate.
• FPGAs with high-speed ADCs/DACs on chip: 8x 3 GSPS ADCs/DACs. Not generally available yet; will be out at the end of the year.
• Working on 7 nm chips – no date for availability yet. Dan: for performance/$, use the latest-generation family but a medium-size chip.
• Can buy VCU118 boards in bulk. Power to the FPGA is throttled to 60 W (?), which may be a problem for full utilization, but it looks encouraging. Full investigation not complete.
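The Volta tensor cores mentioned under GPU ROADMAP are exposed in CUDA 9 through the warp matrix (WMMA) API, which presents the hardware's mixed-precision multiply-accumulate as 16x16x16 tile operations built from the 4x4 units. The kernel below is a minimal sketch of one FP16-in / FP32-out tile product; the kernel name and single-warp launch are illustrative assumptions, and it requires a compute capability 7.0 (Volta) or newer GPU.

```cuda
// Minimal CUDA 9+ WMMA sketch: one warp multiplies a pair of 16x16 FP16 tiles
// and accumulates into a 16x16 FP32 tile using tensor cores.
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

__global__ void wmma_tile_16x16x16(const half* a, const half* b, float* c) {
    // Per-warp fragments for A, B and the accumulator.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // c += a * b on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Host-side launch: a single warp handles one 16x16x16 tile.
//   wmma_tile_16x16x16<<<1, 32>>>(d_a, d_b, d_c);   // compile with -arch=sm_70
```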