Tesla Heterogeneous Computing


© NVIDIA Corporation 2011

Overview
- Developer Technology Group
- Application development
- COSMO collaboration
- Extreme-scale & co-design (Echelon)
- CUDA

Developer Technology (Devtech)
- Developer education: samples, presentations, papers, books, on-site
- Application development: working with strategic developers; porting and optimisation
- Next-generation: future algorithms and needs; future architectures and programming models

Professional CUDA Applications Available Now (available and announced)
- Tools: CUDA C/C++, PGI CUDA Fortran, PGI Accelerators, CAPS HMPP, Platform LSF Cluster Manager, Bright Cluster Manager, TauCUDA Perf Tools, ParaTools VampirTrace, Parallel Nsight Visual Studio IDE, Allinea DDT Debugger, AccelerEyes Jacket (MATLAB), PGI CUDA x86, TotalView Debugger, MATLAB, Wolfram Mathematica
- Libraries: CUDA FFT, CUDA BLAS, EM Photonics CULAPACK, Thrust C++ template library, NVIDIA NPP performance primitives, MAGMA (LAPACK), NVIDIA RNG & SPARSE, video libraries
- Oil & gas: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, ffA SVI Pro, VSG Open Inventor, Seismic City RTM, Tsunami RTM, Paradigm RTM, Panorama Tech, Paradigm SKUA
- Bio-chemistry: AMBER, NAMD, HOOMD, TeraChem, BigDFT, ABINIT, GROMACS, LAMMPS, VMD, GAMESS, CP2K, Acellera ACEMD, DL-POLY
- Bioinformatics: CUDA-BLASTP, MUMmerGPU, CUDA-MEME, PIPER Docking, CUDA-SW++ (Smith-Waterman), GPU-HMMER, CUDA-EC, HEX Protein Docking, OpenEye ROCS
- CAE: ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometech Particleworks, Remcom XFdtd 7.0, ANSYS Mechanical, FluiDyna OpenFOAM, Metacomp CFD++, MSC.Software Marc 2010.2, LSTC LS-DYNA 971

Professional CUDA Applications Available Now (available and announced)
- Video: Adobe Premiere Pro CS5, ARRI (various apps), GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, MainConcept CUDA Encoder, Elemental Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH
- Rendering: Bunkspeed Shot (iray), Refractive Software Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU
- Finance: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Management Solutions, Aquimin AlphaVision, Hanweck Options Analytics, Murex MACS
- EDA: Agilent EMPro 2010, CST Microwave, Agilent ADS SPICE, Acceleware FDTD Solver, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC, Acceleware EM Solution, Rocketick Verilog Sim
- Other: Siemens 4D Ultrasound, Digisens Medical, Schrödinger Core Hopping, Useful Progress Med, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, Digital Vision, Anarchy Photo, The Foundry Kronos, MVTec Machine Vision, Works Zebra, Zeany

Example: ANSYS and NVIDIA Collaboration (NVIDIA GPU status)
- Structural mechanics — available today: ANSYS Mechanical (single GPU); updates for 2011: ANSYS Mechanical (multi-GPU intent)
- Fluid dynamics — research evaluation: ANSYS CFD (FLUENT + CFX)
- Electromagnetics — product evaluation: ANSYS Nexxim, ANSYS HFSS, ANSYS Maxwell
- NVIDIA provides business and engineering investments in ANSYS technology developments

ANSYS presentation at NVIDIA GTC 2010 (Sep 20–23, 2010, San Jose Convention Center, San Jose, California, USA): "Accelerating System Level Signal Integrity Simulation with GPU", Dr. Ekanathan Palamadai, ANSYS
- Nexxim R13 convolution results (lower is better): Intel Nehalem 8-core CPU, OpenMP: 108 h; NVIDIA Tesla C2050 GPU: 4 h
- Speedup: single precision ~27x, double precision ~13x

COSMO & Devtech

Working with COSMO
- Collaborating with CSCS and SCS; regular contact
- Planning and design: create an easy path to GPU; maximise potential GPU performance
- Implementation: estimate target performance on the GPU; consult on development

Core Components on the GPU
- Stencils: leverage work from seismic (RTM)
- Tridiagonal solve: leverage work from UC Davis, NVIDIA Research and Devtech
- Challenge is to ensure the proximity of data

Echelon: Extreme-Scale

NVIDIA Research Leadership
- Bill Dally, Michael Garland, Steve Keckler, David Kirk, David Luebke
- Exploring challenging topics spanning many domains: graphics, science, languages, circuit design, computer architecture
- Supporting advances through collaboration with academic and industrial research institutions

Echelon Team

System Sketch
- (Figure: Echelon system — Dragonfly interconnect (optical fiber); high-radix router modules (RM); DRAM cubes and NV RAM behind memory controllers (MC); NIC; self-aware OS and self-aware runtime; locality-aware compiler & autotuner; processor chip (PC) with SM0–SM127 and latency cores C0–C7 on a NoC)
- Node 0 (N0): 20 TF, 1.6 TB/s, 256 GB; Module 0 (M0): 160 TF, 12.8 TB/s, 2 TB; Cabinet 0 (C0): 2.6 PF, 205 TB/s, 32 TB

Power is THE Problem
1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application

The High Cost of Data Movement
- Fetching operands costs more than computing on them
- (Figure, approximate 28 nm energies: 64-bit DP operation 20 pJ; moving data 20 mm over 256-bit buses 26 pJ; across the chip 256 pJ; 256-bit access to an 8 kB SRAM 50 pJ; efficient off-chip link 500 pJ; DRAM Rd/Wr 1–16 nJ)
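The energy figures above make the imbalance concrete. Relative to a 20 pJ double-precision operation, fetching operands over an efficient off-chip link or from DRAM costs roughly

```latex
\frac{500\,\mathrm{pJ}}{20\,\mathrm{pJ}} = 25\times,
\qquad
\frac{16\,\mathrm{nJ}}{20\,\mathrm{pJ}} = 800\times
```

so one DRAM round trip costs on the order of hundreds of arithmetic operations — which is why optimising the storage hierarchy and tailoring memory to the application follow directly from point 1.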
Echelon building blocks:

Lane: 4 DFMAs, 20 GFLOPS
- Per lane: L0 D$, L0 I$, main registers, operand registers, four DFMA units, two load/store units (LSI)

SM: 8 lanes, 160 GFLOPS
- Shared L1$ and switch connecting the eight lanes (P)

Chip: 128 SMs, 20.48 TFLOPS, plus 8 latency processors
- 1024 SRAM banks of 256 KB each; SMs, latency processors (LP), memory controllers (MC) and network interface (NI) on a NoC; 128 SMs at 160 GFLOPS each

Node MCM: 20 TF, 256 GB
- GPU chip (20 TF DP, 256 MB on-chip SRAM), DRAM stacks (1.4 TB/s DRAM bandwidth), NV memory, 150 GB/s network bandwidth

Cabinet: 128 nodes, 2.56 PF, 38 kW
- 32 modules, 4 nodes/module, central router module(s), Dragonfly interconnect

System: to ExaScale and Beyond
- Dragonfly interconnect; 400 cabinets is ~1 EF and ~15 MW

CUDA

CUDA 4.0: Highlights
- Easier parallel application porting: share GPUs across multiple threads; single-thread access to all GPUs; no-copy pinning of system memory; new CUDA C/C++ features; Thrust templated primitives library; NPP image/video processing library; layered textures
- Faster multi-GPU programming: Unified Virtual Addressing; NVIDIA GPUDirect™ v2.0; peer-to-peer access; peer-to-peer transfers; GPU-accelerated MPI
- New & improved developer tools: auto performance analysis; C++ debugging; GPU binary disassembler; cuda-gdb for MacOS

NVIDIA GPUDirect™
- Version 1.0, for applications that communicate over a network: direct access to GPU memory for 3rd-party devices; eliminates unnecessary system-memory copies and CPU overhead; supported by Mellanox and QLogic; up to 30% improvement in communication performance
- Version 2.0, for applications that communicate within a node: peer-to-peer memory access, transfers and synchronization; MPI implementations natively support GPU data transfers; less code, higher programmer productivity
- Details @ http://www.nvidia.com/object/software-for-tesla-products.html
- Before GPUDirect v2.0, a GPU1-to-GPU2 transfer required a copy into main memory, staged by the CPU over PCIe and the chipset; with GPUDirect v2.0, peer-to-peer communication transfers directly between GPU memories

Unified Virtual Addressing
- Easier to program with a single address space
- Without UVA there are multiple memory spaces: system memory, GPU0 memory and GPU1 memory each have their own 0x0000–0xFFFF range
- With UVA, a single address space spans CPU and GPU memories across PCIe
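A short sketch of how peer-to-peer access and UVA look in host code. This is a minimal illustration using the CUDA 4.0-era runtime API; the device ordinals 0 and 1, the buffer size, and the omission of error checking are assumptions for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 20;               // 1 MB per buffer (illustrative)
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 map GPU 1's memory?

    float *buf0 = NULL, *buf1 = NULL;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);       // direct PCIe path, no sysmem staging

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Under UVA every pointer lives in one address space, so the runtime
    // can infer source and destination from the pointers themselves:
    cudaMemcpy(buf1, buf0, bytes, cudaMemcpyDefault);

    printf("GPU0 -> GPU1 copy done%s\n", canAccess ? " (peer-to-peer)" : " (staged)");
    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```

Note that `cudaDeviceEnablePeerAccess` is called while GPU 0 is current, because it grants the *current* device access to the named peer; when peer access is unavailable, the same `cudaMemcpyDefault` call still works, falling back to a staged copy through system memory.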
NVIDIA CUDA Summary — new in CUDA 4.0
- Platform: GPUDirect™ (v2.0) peer-to-peer communication; Unified Virtual Addressing
- Programming model: C++ new/delete; C++ virtual functions
- Parallel libraries: Thrust C++ library; NVIDIA library support
- Tools: Parallel Nsight Pro 1.5; NVIDIA tools support

NVIDIA CUDA Summary — platform
- Hardware features: ECC memory, double precision, native 64-bit architecture, concurrent kernel execution, dual copy engines, 6 GB per GPU supported
- Operating system support: MS Windows 32/64, Linux 32/64, Mac OS X 32/64
- Designed for HPC: cluster management, GPUDirect, Tesla Compute Cluster (TCC), multi-GPU support
- C support: NVIDIA C compiler, CUDA C parallel extensions, function pointers, recursion, atomics, malloc/free
- C++ support: classes/objects, class inheritance, polymorphism, operator overloading, class templates, function templates, virtual base classes, namespaces
- Fortran support: CUDA Fortran (PGI)
- Libraries: templated performance primitives library, complete math.h, complete BLAS library (levels 1, 2 and 3), sparse matrix math library, RNG library, FFT library (1D, 2D and 3D), video decoding library (NVCUVID), video encoding library (NVCUVENC), image processing library (NPP), video processing library (NPP)
- 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL)
- Developer tools: Parallel Nsight for MS Visual Studio, cuda-gdb debugger with multi-GPU support, CUDA/OpenCL Visual Profiler, CUDA memory checker, CUDA C SDK, CUDA disassembler, NVML, CUPTI
- 3rd-party developer tools: Allinea DDT, RogueWave TotalView, Vampir, Tau, CAPS HMPP

Conclusion

GPU Computing is the Future
1. GPU computing is #1 today — on the Top500 AND dominant on the Green500
2. GPU computing enables ExaScale — at reasonable power
3. The GPU is the computer — a general-purpose computing engine, not just an accelerator
4. The real challenge is software

Thank you

GPU Technology Conference 2011
- Oct. 11–14 | San Jose, CA — 3rd annual GPU Technology Conference
- New for 2011: co-located with the Los Alamos HPC Symposium; 300+ research scientists from national labs
- 2010 highlights: 280 hours of sessions, 100+ research posters, 42 countries represented
- www.gputechconf.com

Resources

NVIDIA CUDA Developer Resources (http://developer.nvidia.com)
- Development tools: CUDA Toolkit (complete GPU computing development kit); cuda-gdb (GPU hardware debugging); Visual Profiler (GPU hardware profiler for CUDA C and OpenCL); Parallel Nsight (integrated development environment for Visual Studio)
- SDKs and code samples: GPU Computing SDK (CUDA C/C++, DirectCompute, OpenCL code samples and documentation); books (Programming Massively Parallel Processors; CUDA by Example; GPU Computing Gems); optimization guides (best practices for GPU computing and graphics development)
- Engines & libraries: math libraries (CUFFT, CUBLAS, CUSPARSE, CURAND); NPP image libraries (performance primitives for imaging); video libraries (NVCUVID / NVCUVENC); app acceleration engines (ray tracing: OptiX, iray); 3rd-party libraries (CULA LAPACK, VSIPL, …)

CUDA Math Libraries
- High-performance math routines for your applications: cuFFT (Fast Fourier Transforms), cuBLAS (complete BLAS), cuSPARSE (sparse matrix), cuRAND (random number generation, RNG)
- Included in the CUDA Toolkit (free download): www.nvidia.com/getcuda
- For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216

CUDA 3rd Party Ecosystem
- Cluster tools — management: Platform HPC, Platform Symphony, Bright Cluster Manager, Ganglia Monitoring System, Moab Cluster Suite; job scheduling: Altair PBS Pro, TORQUE, Platform LSF
- Parallel language and API solutions: PGI CUDA Fortran, PGI Accelerator, PGI CUDA x86, CAPS HMPP, pyCUDA (Python), Tidepowerd CUDA.net (.NET), JCuda (Java), Khronos OpenCL, Microsoft DirectCompute
- 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL), NAG
- MPI libraries: OpenMPI
- Parallel tools — debuggers: MS Visual Studio with Parallel Nsight, Allinea DDT Debugger, TotalView Debugger; performance tools: ParaTools VampirTrace, TauCUDA Performance Tools, PAPI, HPC Toolkit
- Compute platform providers — cloud: Amazon EC2, Peer 1; OEMs: Dell, HP, IBM; InfiniBand: Mellanox, QLogic

GPU Computing Research & Education (http://research.nvidia.com)
- World-class research, leadership and teaching; proven research vision
- Academic partnerships / fellowships: Johns Hopkins University, Mass. Gen. Hospital/NE Univ., University of Cambridge, Nanyang University, North Carolina State University, Harvard University, Czech Technical University, Swinburne University of Technology, University of Utah, CSIRO, Technische Universität München, University of Tennessee, SINTEF, UCLA, University of Maryland, HP Labs, University of New Mexico, University of Illinois at Urbana-Champaign, ICHEC, University of Warsaw-ICM, Tsinghua University, Barcelona Supercomputing Center, VSB-Technical University of Ostrava, Tokyo Institute of Technology, Clemson University, Chinese Academy of Sciences, Fraunhofer SCAI, National Taiwan University, Karlsruhe Institute of Technology, Georgia Institute of Technology, and more coming shortly

GPGPU Education: 350+ universities
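As a closing illustration of the CUDA C programming model the deck builds on — one thread per element, explicit host/device transfers, a kernel launch — here is a minimal vector-add sketch. All names and sizes are illustrative, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element: the basic CUDA C pattern.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover all n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("hc[0] = %.1f\n", hc[0]);   // 3.0 if the launch succeeded

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```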