Transcript
© NVIDIA Corporation 2011
Overview (Developer Technology Group)
- Developer Technology
- Application development
- Extreme scale
- CUDA
- COSMO collaboration
- Extreme-scale & co-design
Developer Technology (Devtech)
- Developer education: samples, presentations, papers, books, on-site
- Application development: working with strategic developers; porting and optimisation
- Next-generation: future algorithms and needs; future architectures and programming models
Professional CUDA Applications Available Now
(The slide marks each product as available, announced or future.)

Tools: CUDA C/C++, PGI Accelerators, PGI CUDA Fortran, PGI CUDA x86, CAPS HMPP, Platform LSF Cluster Manager, Bright Cluster Manager, Parallel Nsight Visual Studio IDE, Allinea DDT Debugger, TotalView Debugger, TauCUDA Perf Tools, ParaTools VampirTrace, AccelerEyes Jacket (MATLAB), MATLAB, Wolfram Mathematica

Libraries: CUDA FFT, CUDA BLAS, NVIDIA RNG & SPARSE libraries, NVIDIA NPP Performance Primitives, CUDA video libraries, Thrust C++ template library, EMPhotonics CULAPACK, MAGMA (LAPACK)

Oil & Gas: Headwave Suite, OpenGeoSolutions OpenSEIS, GeoStar Seismic Suite, Acceleware RTM Solver, StoneRidge RTM, Seismic City RTM, Tsunami RTM, Paradigm RTM, Paradigm SKUA, Panorama Tech, ffA SVI Pro, VSG Open Inventor

Bio-Chemistry: AMBER, NAMD, GROMACS, LAMMPS, HOOMD, TeraChem, BigDFT, ABINIT, VMD, GAMESS, CP2K, Acellera ACEMD, DL-POLY, OpenEye ROCS

BioInformatics: CUDA-BLASTP, CUDA-MEME, CUDA-EC, CUDA SW++ (Smith-Waterman), MUMmerGPU, GPU-HMMER, PIPER Docking, HEX Protein Docking

CAE: ACUSIM AcuSolve 1.8, ANSYS Mechanical, Autodesk Moldflow, FluiDyna OpenFOAM, Metacomp CFD++, MSC.Software Marc 2010.2, LSTC LS-DYNA 971, Prometech Particleworks, Remcom XFdtd 7.0
Professional CUDA Applications Available Now (continued)
(The slide marks each product as available, announced or future.)

Video: Adobe Premiere Pro CS5, ARRI various apps, GenArts Sapphire, TDVision TDVCodec, Black Magic Da Vinci, MainConcept CUDA Encoder, Elemental Video, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos

Rendering: Bunkspeed Shot (iray), Refractive Software Octane, Random Control Arion, ILM Plume, Autodesk 3ds Max, Cebas finalRender, mental images iray (OEM), NVIDIA OptiX (SDK), Caustic Graphics, Weta Digital PantaRay, Lightworks Artisan, Chaos Group V-Ray GPU, Works Zebra Zeany

Finance: NAG RNG, Numerix Risk, SciComp SciFinance, RMS Risk Management Solutions, Aquimin AlphaVision, Hanweck Options Analytics, Murex MACS

EDA: Agilent EMPro 2010, Agilent ADS SPICE, CST Microwave, Acceleware FDTD Solver, Acceleware EM Solution, Synopsys TCAD, SPEAG SEMCAD X, Gauda OPC, Rocketick Verilog Sim

Other: Siemens 4D Ultrasound, Digisens Medical, Schrodinger Core Hopping, Useful Progress Med, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, Digital Anarchy Photo, MVTec Machine Vision
Example: ANSYS and NVIDIA Collaboration

ANSYS product areas and NVIDIA GPU status (ranging from available today and updates for 2011 to product evaluation and research evaluation):
- Structural mechanics: ANSYS Mechanical, single GPU (available today); ANSYS Mechanical, multi-GPU intent (updates for 2011)
- Fluid dynamics: ANSYS CFD (FLUENT + CFX)
- Electromagnetics: ANSYS Nexxim, ANSYS HFSS, ANSYS Maxwell

NVIDIA provides business and engineering investments in ANSYS technology developments.
ANSYS Presentation at NVIDIA GTC 2010
Sep 20-23, 2010, San Jose Convention Center, San Jose, California, USA
"Accelerating System Level Signal Integrity Simulation with GPU", Dr. Ekanathan Palamadai, ANSYS
Nexxim R13 convolution results (lower is better): Intel Nehalem 8-core CPU with OpenMP, 108 hours; NVIDIA Tesla C2050 GPU, 4 hours. Speedup of roughly 27x in single precision and roughly 13x in double precision.
COSMO & Devtech
Working with COSMO
- Collaborating with CSCS and SCS; regular contact
- Planning and design: create an easy path to the GPU; maximise potential GPU performance
- Implementation: estimate target performance on the GPU; consult on development
Core Components on the GPU
- Stencils: leverage work from seismic (RTM)
- Tridiagonal solve: leverage work from UC Davis, NVIDIA Research and Devtech
- The challenge is to ensure the proximity of data (see the sketch below)
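As a rough illustration of the stencil pattern above, a minimal CUDA sketch might look like the following. It is a generic 3-point stencil on a 2D field, not COSMO's actual operators; the kernel name and coefficients are made up for illustration.

    // Minimal sketch: a 3-point stencil in x applied to a 2D field.
    // Each thread updates one grid point from its immediate neighbours,
    // so performance hinges on keeping neighbouring data close to the
    // threads (caches, shared memory): the data-proximity challenge above.
    __global__ void stencil_x(const float *in, float *out, int nx, int ny)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // x index
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // y index
        if (i > 0 && i < nx - 1 && j < ny) {
            int idx = j * nx + i;
            out[idx] = 0.25f * in[idx - 1]
                     + 0.50f * in[idx]
                     + 0.25f * in[idx + 1];
        }
    }

A tridiagonal solve in the vertical direction follows the same locality argument: each column's data must stay resident close to the threads that process it.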
Echelon: Extreme-Scale
NVIDIA Research
Leadership: Bill Dally, Michael Garland, Steve Keckler, David Kirk, David Luebke
Exploring challenging topics spanning many domains: graphics, science, languages, circuit design, computer architecture
Supporting advances through collaboration with academic and industrial research institutions
Echelon Team
System sketch: the Echelon system is built from processor chips (PC) with SM0-SM127 and latency cores C0-C7 connected by a NoC to L2 SRAM banks, memory controllers (MC), DRAM cubes, NV RAM and a NIC, managed by a self-aware OS and self-aware runtime together with a locality-aware compiler and autotuner. Node 0 (N0): 20 TF, 1.6 TB/s, 256 GB. Module 0 (M0): 160 TF, 12.8 TB/s, 2 TB (nodes N0-N7). Cabinet 0 (C0): 2.6 PF, 205 TB/s, 32 TB (modules M0-M15). Cabinets connect through high-radix router modules (RM) over a Dragonfly interconnect (optical fibre).
Power is THE Problem
1. Data movement dominates power
2. Optimize the storage hierarchy
3. Tailor memory to the application
The High Cost of Data Movement
Fetching operands costs more than computing on them (28 nm process, 64-bit DP, 256-bit buses):
- 64-bit double-precision operation: ~20 pJ
- Moving 64 bits 20 mm across the chip: ~26 pJ
- 256-bit access to an 8 kB SRAM: ~50 pJ
- Moving data across the full chip: ~256 pJ
- Efficient off-chip link: ~500 pJ to ~1 nJ
- DRAM read/write: ~16 nJ
These ratios, not arithmetic cost, set the power budget at exascale.
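A rough worked example of why these figures dominate at exascale, using the per-operation energies above:

    10^18 FLOP/s x 20 pJ/FLOP       = 20 MW   for the arithmetic alone
    10^18 accesses/s x 16 nJ/access = 16 GW   if every operand came from DRAM

So within a power budget of a few tens of megawatts, nearly all operands have to be served from registers and nearby SRAM rather than DRAM, which is exactly what the storage-hierarchy and memory-tailoring points above are about.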
Echelon building blocks:
- Lane: 4 DFMA units, 20 GFLOPS; L0 data and instruction caches, main registers, operand registers, load/store units (LSI).
- SM: 8 lanes, 160 GFLOPS, sharing an L1 cache and switch.
- Chip: 128 SMs at 160 GFLOPS each, 20.48 TFLOPS, plus 8 latency processors (LP); 1024 SRAM banks of 256 KB each, memory controllers (MC) and a network interface (NI) on the NoC.
- Node MCM: 20 TF and 256 GB, with 150 GB/s network bandwidth; the GPU chip provides 20 TF double precision and 256 MB of on-chip SRAM, with 1.4 TB/s of DRAM bandwidth to DRAM stacks plus NV memory.
- Cabinet: 128 nodes, 2.56 PF, 38 kW; 32 modules with 4 nodes per module, central router module(s), Dragonfly interconnect.
- System: to exascale and beyond; with the Dragonfly interconnect, 400 cabinets is ~1 EF and ~15 MW.
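A quick check of the arithmetic behind these headline figures, using the numbers from the preceding slides:

    128 SMs x 160 GFLOPS    = 20.48 TFLOPS per chip (one node)
    128 nodes x 20 TFLOPS   = 2.56 PFLOPS per cabinet
    400 cabinets x 2.56 PF  ≈ 1.02 EFLOPS
    400 cabinets x 38 kW    ≈ 15.2 MW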
CUDA
CUDA 4.0: Highlights

Easier parallel application porting:
• Share GPUs across multiple threads
• Single-thread access to all GPUs
• No-copy pinning of system memory
• New CUDA C/C++ features
• Thrust templated primitives library (see the sketch after this list)
• NPP image/video processing library
• Layered textures

Faster multi-GPU programming:
• Unified Virtual Addressing
• NVIDIA GPUDirect™ v2.0
• Peer-to-peer access
• Peer-to-peer transfers
• GPU-accelerated MPI

New and improved developer tools:
• Auto performance analysis
• C++ debugging
• GPU binary disassembler
• cuda-gdb for MacOS
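As a small illustration of the Thrust templated primitives library listed above, the following minimal sketch (not taken from the presentation) sorts and reduces a vector on the GPU without hand-written kernels:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdlib>

    int main()
    {
        // Fill a host vector with random integers.
        thrust::host_vector<int> h(1 << 20);
        for (size_t i = 0; i < h.size(); ++i)
            h[i] = std::rand();

        thrust::device_vector<int> d = h;             // copy host -> device
        thrust::sort(d.begin(), d.end());             // parallel sort on the GPU
        int sum = thrust::reduce(d.begin(), d.end()); // parallel reduction
        return sum == 0 ? 1 : 0;                      // use the result
    }

The device allocation, host-to-device copy, sort and reduction are all expressed through STL-style templates.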
NVIDIA GPUDirect™

Version 1.0, for applications that communicate over a network:
• Direct access to GPU memory for 3rd-party devices
• Eliminates unnecessary system-memory copies and CPU overhead
• Supported by Mellanox and QLogic
• Up to 30% improvement in communication performance

Version 2.0, for applications that communicate within a node:
• Peer-to-peer memory access, transfers and synchronization
• MPI implementations natively support GPU data transfers
• Less code, higher programmer productivity

Details @ http://www.nvidia.com/object/software-for-tesla-products.html
Before GPUDirect v2.0, a transfer between GPU1 memory and GPU2 memory required a copy through system memory, across PCIe and the chipset. With GPUDirect v2.0 peer-to-peer communication, data moves directly between the two GPUs' memories over PCIe, bypassing system memory.
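A minimal sketch of the peer-to-peer path described above, using the CUDA 4.0 runtime API. It assumes two Fermi-class GPUs (devices 0 and 1) that can reach each other over PCIe; the buffer size is arbitrary.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        if (!canAccess) {
            printf("P2P not supported between GPU 0 and GPU 1\n");
            return 0;
        }

        const size_t bytes = 1 << 20;
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaMalloc((void**)&buf0, bytes);
        cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 reach GPU 1's memory

        cudaSetDevice(1);
        cudaMalloc((void**)&buf1, bytes);
        cudaDeviceEnablePeerAccess(0, 0);   // let GPU 1 reach GPU 0's memory

        // Direct GPU-to-GPU copy; no staging through system memory.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();
        return 0;
    }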
Unified Virtual Addressing: easier to program with a single address space. Without UVA there are multiple memory spaces: system memory, GPU0 memory and GPU1 memory each have their own 0x0000 to 0xFFFF range. With UVA, the CPU and the GPUs on PCIe share one address space, so a pointer itself identifies where its data lives.
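A minimal sketch of what UVA means in practice for CUDA 4.0 code; the allocation size is arbitrary, and the point is that cudaMemcpyDefault lets the runtime infer the copy direction from the pointers themselves.

    #include <cuda_runtime.h>

    int main()
    {
        const size_t bytes = 1 << 20;
        float *h = 0, *d = 0;
        cudaMallocHost((void**)&h, bytes);   // pinned host allocation
        cudaMalloc((void**)&d, bytes);       // device allocation

        // With UVA (Fermi GPU, 64-bit OS), the direction is inferred
        // from the pointers: no cudaMemcpyHostToDevice needed.
        cudaMemcpy(d, h, bytes, cudaMemcpyDefault);

        // The runtime can also tell which space a pointer belongs to.
        cudaPointerAttributes attr;
        cudaPointerGetAttributes(&attr, d);

        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }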
NVIDIA CUDA Summary

New in CUDA 4.0: GPUDirect™ v2.0 peer-to-peer communication, Unified Virtual Addressing, C++ new/delete and C++ virtual functions in device code (see the sketch after this summary), Thrust C++ library, Parallel Nsight Pro 1.5.

Platform:
- Hardware features: ECC memory, double precision, native 64-bit architecture, concurrent kernel execution, dual copy engines, 6 GB per GPU supported
- Operating system support: MS Windows 32/64, Linux 32/64, Mac OS X 32/64
- Designed for HPC: cluster management, GPUDirect, Tesla Compute Cluster (TCC), multi-GPU support

Programming model:
- C support: NVIDIA C compiler, CUDA C parallel extensions, function pointers, recursion, atomics, malloc/free
- C++ support: classes/objects, class inheritance, polymorphism, operator overloading, class templates, function templates, virtual base classes, namespaces
- Fortran support: CUDA Fortran (PGI)

Parallel libraries (NVIDIA library support):
- Templated performance primitive library
- Complete math.h, complete BLAS library (1, 2 and 3), sparse matrix math library, RNG library, FFT library (1D, 2D and 3D)
- Video decoding library (NVCUVID), video encoding library (NVCUVENC), image processing library (NPP), video processing library (NPP)
- 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL)

Tools (NVIDIA tools support):
- Parallel Nsight for MS Visual Studio, cuda-gdb debugger with multi-GPU support, CUDA/OpenCL Visual Profiler, CUDA memory checker, CUDA C SDK, CUDA disassembler, NVML, CUPTI
- 3rd-party developer tools: Allinea DDT, RogueWave TotalView, Vampir, Tau, CAPS HMPP
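A minimal sketch of the device-side C++ new/delete listed under "New in CUDA 4.0" above; the Particle type is hypothetical, and an sm_20 (Fermi) or later GPU is assumed.

    #include <cuda_runtime.h>

    struct Particle {
        float x, y, z;
        __device__ Particle() : x(0.f), y(0.f), z(0.f) {}
    };

    // Each thread allocates and frees an object on the device heap.
    __global__ void makeParticles()
    {
        Particle *p = new Particle();
        p->x = threadIdx.x;
        delete p;
    }

    int main()
    {
        makeParticles<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }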
Conclusion
GPU computing is the future:
1. GPU computing is #1 today: on the Top500 and dominant on the Green500, at reasonable power.
2. GPU computing enables exascale.
3. The GPU is the computer: a general-purpose computing engine, not just an accelerator.
4. The real challenge is software.
Thank you
GPU Technology Conference 2011: Oct. 11-14 | San Jose, CA. 3rd annual GPU Technology Conference.
New for 2011: co-located with the Los Alamos HPC Symposium; 300+ research scientists from national labs.
2010 highlights: 280 hours of sessions, 100+ research posters, 42 countries represented.
www.gputechconf.com
Resources
NVIDIA CUDA Developer Resources

Development tools:
- CUDA Toolkit: complete GPU computing development kit
- cuda-gdb: GPU hardware debugging
- Visual Profiler: GPU hardware profiler for CUDA C and OpenCL
- Parallel Nsight: integrated development environment for Visual Studio

SDKs and code samples:
- GPU Computing SDK: CUDA C/C++, DirectCompute and OpenCL code samples and documentation
- Books: Programming Massively Parallel Processors, CUDA by Example, GPU Computing Gems
- Optimization guides: best practices for GPU computing and graphics development

Engines & libraries:
- Math libraries: CUFFT, CUBLAS, CUSPARSE, CURAND
- 3rd-party libraries: CULA LAPACK, VSIPL, …
- NPP image libraries: performance primitives for imaging
- App acceleration engines, ray tracing: OptiX, iray
- Video libraries: NVCUVID / NVCUVENC
http://developer.nvidia.com
CUDA Math Libraries: high-performance math routines for your applications:
- cuFFT: Fast Fourier Transform library
- cuBLAS: complete BLAS library (see the sketch below)
- cuSPARSE: sparse matrix library
- cuRAND: random number generation (RNG) library
Included in the CUDA Toolkit (free download): www.nvidia.com/getcuda
For more information on CUDA libraries: http://www.nvidia.com/object/gtc2010-presentation-archive.html#session2216
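As a small illustration of using one of these Toolkit libraries, here is a minimal cuBLAS sketch (not from the presentation) that runs a single-precision AXPY on the GPU; the vector length, placeholder data and alpha value are arbitrary.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc((void**)&x, n * sizeof(float));
        cudaMalloc((void**)&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));   // placeholder data
        cudaMemset(y, 0, n * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 2.0f;
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha*x + y on the GPU

        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }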
CUDA 3rd Party Ecosystem

Cluster tools:
- Cluster management: Platform HPC, Platform Symphony, Bright Cluster Manager, Ganglia Monitoring System, Moab Cluster Suite, Altair PBS Pro
- Job scheduling: Altair PBS Pro, TORQUE, Platform LSF

Parallel language and API solutions:
- PGI CUDA Fortran, PGI Accelerator, PGI CUDA x86, CAPS HMPP, pyCUDA (Python), Tidepowerd CUDA.net (.NET), JCuda (Java), Khronos OpenCL, Microsoft DirectCompute

Parallel tools:
- Parallel debuggers: MS Visual Studio with Parallel Nsight, Allinea DDT Debugger, TotalView Debugger
- Parallel performance tools: ParaTools VampirTrace, TauCUDA Performance Tools, PAPI, HPC Toolkit

Libraries:
- 3rd-party math libraries: CULA Tools (EM Photonics), MAGMA heterogeneous LAPACK, IMSL (Rogue Wave), VSIPL (GPU VSIPL), NAG
- MPI libraries: OpenMPI

Compute platform providers:
- Cloud providers: Amazon EC2, Peer 1
- OEMs: Dell, HP, IBM
- InfiniBand providers: Mellanox, QLogic
GPU Computing Research & Education: http://research.nvidia.com

World-class research leadership and teaching, proven research vision:
Johns Hopkins University, Mass. Gen. Hospital/NE Univ, University of Cambridge, Nanyang University, North Carolina State University, Harvard University, Technical University-Czech, Swinburne University of Tech., University of Utah, CSIRO, Technische Univ. Munich, University of Tennessee, SINTEF, UCLA, University of Maryland, HP Labs, University of New Mexico, University of Illinois at Urbana-Champaign, ICHEC, University of Warsaw-ICM, Tsinghua University, Barcelona SuperComputer Center, Tokyo Institute of Technology, Clemson University, VSB-Technical University of Ostrava, Chinese Academy of Sciences, Fraunhofer SCAI, National Taiwan University, Karlsruhe Institute of Technology, Georgia Institute of Technology, and more coming shortly.

Academic partnerships / fellowships. GPGPU education at 350+ universities.