
Mixing Graphics and Compute with Multiple GPUs





Agenda
 Compute and graphics interoperability
 Interoperability at a system level
 Application design considerations

Putting graphics and compute together
 Compute and visualize the same data
 The application talks to CUDA for compute and to OpenGL/DirectX for graphics

Compute/graphics interoperability
 A set of compute API functions:
— Graphics sets up the objects
— Register/unregister the objects with the compute context
— Map/unmap the objects to/from the compute context every frame
 CUDA linear memory and CUDA arrays are mapped to OpenGL/DX buffers and textures

Simple OpenGL-CUDA interop sample: setup and registration of buffer objects

```cpp
GLuint imagePBO;
cudaGraphicsResource_t cudaResourceBuf;

// OpenGL buffer creation
glGenBuffers(1, &imagePBO);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, imagePBO);
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_DYNAMIC_DRAW);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

// Registration with CUDA
cudaGraphicsGLRegisterBuffer(&cudaResourceBuf, imagePBO,
                             cudaGraphicsRegisterFlagsNone);
```

Simple OpenGL-CUDA interop sample: setup and registration of texture objects

```cpp
GLuint imageTex;
cudaGraphicsResource_t cudaResourceTex;

// OpenGL texture creation
glGenTextures(1, &imageTex);
glBindTexture(GL_TEXTURE_2D, imageTex);
// set texture parameters here
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8UI_EXT, width, height, 0,
             GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, NULL);
glBindTexture(GL_TEXTURE_2D, 0);

// Registration with CUDA
cudaGraphicsGLRegisterImage(&cudaResourceTex, imageTex, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsNone);
```
Simple OpenGL-CUDA interop sample: per-frame map and unmap

```cpp
unsigned char *memPtr;
cudaArray *arrayPtr;

while (!done) {
    cudaGraphicsMapResources(1, &cudaResourceTex, cudaStream);
    cudaGraphicsMapResources(1, &cudaResourceBuf, cudaStream);
    cudaGraphicsSubResourceGetMappedArray(&arrayPtr, cudaResourceTex, 0, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);

    doWorkInCUDA(arrayPtr, memPtr, cudaStream);  // asynchronous

    cudaGraphicsUnmapResources(1, &cudaResourceTex, cudaStream);
    cudaGraphicsUnmapResources(1, &cudaResourceBuf, cudaStream);

    doWorkInGL(imagePBO, imageTex);              // asynchronous
}
```

Interoperability behavior: single GPU
 The resource is shared between the CUDA and OpenGL contexts.
 Tasks are serialized on the one GPU: compute, then draw.

Interoperability behavior: multiple GPUs
 Each context owns its own copy of the resource.
 Tasks are still serialized.
 Map/unmap may perform a host-side synchronization while the two copies are reconciled.

Interoperability behavior: multiple GPUs, improvements
 If one of the APIs is a producer and the other a consumer, the tasks can overlap. With CUDA as the producer, CUDA can compute the next frame while OpenGL/DX draws the previous one.

Simple OpenGL-CUDA interop sample: mapping hints
 Use a mapping hint with cudaGraphicsResourceSetMapFlags(): cudaGraphicsMapFlagsReadOnly or cudaGraphicsMapFlagsWriteDiscard.

```cpp
unsigned char *memPtr;
cudaGraphicsResourceSetMapFlags(cudaResourceBuf, cudaGraphicsMapFlagsWriteDiscard);

while (!done) {
    cudaGraphicsMapResources(1, &cudaResourceBuf, cudaStream);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);
    doWorkInCUDA(memPtr, cudaStream);            // asynchronous
    cudaGraphicsUnmapResources(1, &cudaResourceBuf, cudaStream);
    doWorkInGL(imagePBO);                        // asynchronous
}
```

Interoperability behavior: multiple GPUs, improvements
 Use ping-pong buffers to ensure graphics and compute are not stepping on each other's toes.
[Timeline: while CUDA fills buffer A on one GPU, OpenGL/DX draws buffer B on the other; the buffers swap every frame.]

Simple OpenGL-CUDA interop sample: ping-pong buffers

```cpp
unsigned char *memPtr;
int count = 0;
cudaGraphicsResourceSetMapFlags(cudaResourceBuf1, cudaGraphicsMapFlagsWriteDiscard);
cudaGraphicsResourceSetMapFlags(cudaResourceBuf2, cudaGraphicsMapFlagsWriteDiscard);

while (!done) {
    cudaResourceBuf = (count % 2) ? cudaResourceBuf1 : cudaResourceBuf2;
    imagePBO        = (count % 2) ? imagePBO2        : imagePBO1;  // GL draws the other buffer

    cudaGraphicsMapResources(1, &cudaResourceBuf, cudaStream);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);
    doWorkInCUDA(memPtr, cudaStream);            // asynchronous
    cudaGraphicsUnmapResources(1, &cudaResourceBuf, cudaStream);
    doWorkInGL(imagePBO);                        // asynchronous
    count++;
}
```

Simple OpenGL-CUDA interop sample: what now?
 You can continue using a single-threaded application if map/unmap and the other calls are CPU-asynchronous.
 If they are CPU-synchronous, this won't be possible: use one thread per API, as in the following examples.

Application example: pseudocode, multithreaded CUDA-centric application
 Adobe Premiere Pro with an OpenGL plugin

Main CUDA thread:

```cpp
mainCtx = wglCreateContext(hDC);
wglMakeCurrent(hDC, mainCtx);
// Register OpenGL objects with CUDA
// (oglDone[0..1] start signaled so CUDA can run ahead)
int count = 0;
while (!done) {
    SignalWait(oglDone[count]);   // wait until GL released buffer 'count'
    cudaGraphicsMapResources(1, &cudaResourceBuf[count], NULL);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size,
                                         cudaResourceBuf[count]);
    doWorkInCUDA(memPtr, NULL);
    cudaGraphicsUnmapResources(1, &cudaResourceBuf[count], NULL);
    EventSignal(cudaDone[count]); // hand buffer 'count' to the GL thread
    count = (count + 1) % 2;
}
```

OpenGL worker thread:

```cpp
int count = 0;
while (!done) {
    SignalWait(cudaDone[count]);  // wait until CUDA produced buffer 'count'
    doWorkInGL(imagePBO[count]);
    EventSignal(oglDone[count]);  // release buffer 'count' back to CUDA
    count = (count + 1) % 2;
}
```
Application example: pseudocode, multithreaded OpenGL-centric application
 Autodesk Maya with a CUDA plug-in
 See S0364, "Interacting with Huge Particle Simulations in Maya with the GPU", Wil B., GTC 2012 Proceedings

Main OpenGL thread:

```cpp
mainCtx   = wglCreateContext(hDC);
workerCtx = wglCreateContextAttribsARB(hDC, mainCtx, …);
wglMakeCurrent(hDC, mainCtx);
// Create OpenGL objects
EventSignal(setupCompleted);
int count = 0;
while (!done) {
    SignalWait(cudaDone[count]);
    doWorkInGL(imagePBO[count]);
    endGLSync[count] = glFenceSync(…);
    EventSignal(oglDone[count]);
    count = (count + 1) % 2;
}
```

CUDA worker thread:

```cpp
SignalWait(setupCompleted);
wglMakeCurrent(hDC, workerCtx);
// Register OpenGL objects with CUDA
int count = 1;
while (!done) {
    SignalWait(oglDone[count]);
    glWaitSync(endGLSync[count]);  // wait until GL has finished with the buffer
    cudaGraphicsMapResources(1, &cudaResourceBuf[count], cudaStream[N]);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size,
                                         cudaResourceBuf[count]);
    doWorkInCUDA(memPtr, cudaStream[N]);
    cudaGraphicsUnmapResources(1, &cudaResourceBuf[count], cudaStream[N]);
    cudaStreamSynchronize(cudaStream[N]);
    EventSignal(cudaDone[count]);
    count = (count + 1) % 2;
}
```

Application design considerations
 Use cudaD3D[9|10|11]GetDevices / cudaGLGetDevices to choose the right device to provision for multi-GPU environments.
 Avoid synchronized GPUs for CUDA!
 CUDA-OpenGL interop can perform slower if the OpenGL context spans multiple GPUs.

Application design considerations
 Allow users to specify the GPUs!
— Typical heuristics: TCC mode, GPU number, available memory, number of processing units
— Affecting factors: OS, ECC, TCC mode
 Don't make your users dig through driver control panels; API interop hides all the complexity.

If not cross-GPU API interop, then what?
 Move the data yourself: cudaMemcpy from the CUDA context on GPU 1 into system RAM, then glBufferSubData from RAM into the OpenGL context's buffer on GPU 2.

Compute/graphics interoperability: what's new with CUDA 5.0?
 cudaD3D[9|10|11]SetDirect3DDevice / cudaGLSetGLDevice are no longer required.
 All mipmap levels are shared.
 Interoperate with OpenGL and DirectX at the same time.
 Lots and lots of Windows WDDM improvements.

Conclusions/Resources
 The driver will do all the heavy lifting, but…
 Scalability and final performance are up to the developer, and…
 For fine-grained control you might want to move the data yourself.
 CUDA samples and documentation: http://developer.nvidia.com/cuda-downloads