
Mixing Graphics and Compute with Multiple GPUs





Agenda
 Compute and graphics interoperability
 Interoperability at a system level
 Application design considerations

Putting graphics and compute together
 Compute and visualize the same data
 The application talks to CUDA for compute and to OpenGL/DirectX for graphics

Compute/graphics interoperability
 A set of compute API functions:
— Graphics sets up the objects
— Register/unregister the objects with the compute context
— Map/unmap the objects to/from the compute context every frame
 CUDA linear memory and CUDA arrays are mapped to OpenGL/DX buffers and textures

Simple OpenGL-CUDA interop sample: setup and registration of buffer objects

```cpp
GLuint imagePBO;
cudaGraphicsResource_t cudaResourceBuf;

// OpenGL buffer creation
glGenBuffers(1, &imagePBO);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, imagePBO);
glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL, GL_DYNAMIC_DRAW);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

// Registration with CUDA
cudaGraphicsGLRegisterBuffer(&cudaResourceBuf, imagePBO,
                             cudaGraphicsRegisterFlagsNone);
```

Simple OpenGL-CUDA interop sample: setup and registration of texture objects

```cpp
GLuint imageTex;
cudaGraphicsResource_t cudaResourceTex;

// OpenGL texture creation
glGenTextures(1, &imageTex);
glBindTexture(GL_TEXTURE_2D, imageTex);
// set texture parameters here
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8UI_EXT, width, height, 0,
             GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, NULL);
glBindTexture(GL_TEXTURE_2D, 0);

// Registration with CUDA
cudaGraphicsGLRegisterImage(&cudaResourceTex, imageTex, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsNone);
```
Simple OpenGL-CUDA interop sample: per-frame map and unmap

```cpp
unsigned char *memPtr;
cudaArray *arrayPtr;

while (!done) {
    cudaGraphicsMapResources(1, &cudaResourceTex, cudaStream);
    cudaGraphicsMapResources(1, &cudaResourceBuf, cudaStream);
    cudaGraphicsSubResourceGetMappedArray(&arrayPtr, cudaResourceTex, 0, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);

    doWorkInCUDA(arrayPtr, memPtr, cudaStream);  // asynchronous

    cudaGraphicsUnmapResources(1, &cudaResourceTex, cudaStream);
    cudaGraphicsUnmapResources(1, &cudaResourceBuf, cudaStream);

    doWorkInGL(imagePBO, imageTex);              // asynchronous
}
```

Interoperability behavior: single GPU
 The resource is shared between the CUDA and OpenGL contexts.
 Tasks are serialized on the one GPU: compute, then draw.

Interoperability behavior: multiple GPUs
 Each context owns its own copy of the resource.
 Tasks are still serialized.
 Map/unmap may perform a host-side synchronization while the two copies are reconciled.

Interoperability behavior: multiple GPUs, improvements
 If one of the APIs is a producer and the other a consumer, the tasks can overlap. With CUDA as the producer, CUDA can compute the next frame while OpenGL/DX draws the previous one.

Simple OpenGL-CUDA interop sample: mapping hints
 Use a mapping hint with cudaGraphicsResourceSetMapFlags(): cudaGraphicsMapFlagsReadOnly or cudaGraphicsMapFlagsWriteDiscard.

```cpp
unsigned char *memPtr;
cudaGraphicsResourceSetMapFlags(cudaResourceBuf, cudaGraphicsMapFlagsWriteDiscard);

while (!done) {
    cudaGraphicsMapResources(1, &cudaResourceBuf, cudaStream);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);
    doWorkInCUDA(memPtr, cudaStream);            // asynchronous
    cudaGraphicsUnmapResources(1, &cudaResourceBuf, cudaStream);
    doWorkInGL(imagePBO);                        // asynchronous
}
```

Interoperability behavior: multiple GPUs, improvements
 Use ping-pong buffers to ensure graphics and compute are not stepping on each other's toes.
[Timeline: while CUDA fills buffer A on one GPU, OpenGL/DX draws buffer B on the other; the buffers swap every frame.]

Simple OpenGL-CUDA interop sample: ping-pong buffers

```cpp
unsigned char *memPtr;
int count = 0;
cudaGraphicsResourceSetMapFlags(cudaResourceBuf1, cudaGraphicsMapFlagsWriteDiscard);
cudaGraphicsResourceSetMapFlags(cudaResourceBuf2, cudaGraphicsMapFlagsWriteDiscard);

while (!done) {
    cudaResourceBuf = (count % 2) ? cudaResourceBuf1 : cudaResourceBuf2;
    imagePBO        = (count % 2) ? imagePBO2        : imagePBO1;  // GL draws the other buffer

    cudaGraphicsMapResources(1, &cudaResourceBuf, cudaStream);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);
    doWorkInCUDA(memPtr, cudaStream);            // asynchronous
    cudaGraphicsUnmapResources(1, &cudaResourceBuf, cudaStream);
    doWorkInGL(imagePBO);                        // asynchronous
    count++;
}
```

Simple OpenGL-CUDA interop sample: what now?
 You can continue using a single-threaded application if map/unmap and the other calls are CPU-asynchronous.
 If they are CPU-synchronous, this won't be possible: use one thread per API, as in the following examples.

Application example: pseudocode, multithreaded CUDA-centric application
 Adobe Premiere Pro with an OpenGL plugin

Main CUDA thread:

```cpp
mainCtx = wglCreateContext(hDC);
wglMakeCurrent(hDC, mainCtx);
// Register OpenGL objects with CUDA
// (oglDone[0..1] start signaled so CUDA can run ahead)
int count = 0;
while (!done) {
    SignalWait(oglDone[count]);   // wait until GL released buffer 'count'
    cudaGraphicsMapResources(1, &cudaResourceBuf[count], NULL);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size,
                                         cudaResourceBuf[count]);
    doWorkInCUDA(memPtr, NULL);
    cudaGraphicsUnmapResources(1, &cudaResourceBuf[count], NULL);
    EventSignal(cudaDone[count]); // hand buffer 'count' to the GL thread
    count = (count + 1) % 2;
}
```

OpenGL worker thread:

```cpp
int count = 0;
while (!done) {
    SignalWait(cudaDone[count]);  // wait until CUDA produced buffer 'count'
    doWorkInGL(imagePBO[count]);
    EventSignal(oglDone[count]);  // release buffer 'count' back to CUDA
    count = (count + 1) % 2;
}
```
Application example: pseudocode, multithreaded OpenGL-centric application
 Autodesk Maya with a CUDA plug-in
 See S0364, "Interacting with Huge Particle Simulations in Maya with the GPU", Wil B., GTC 2012 Proceedings

Main OpenGL thread:

```cpp
mainCtx   = wglCreateContext(hDC);
workerCtx = wglCreateContextAttribsARB(hDC, mainCtx, …);
wglMakeCurrent(hDC, mainCtx);
// Create OpenGL objects
EventSignal(setupCompleted);
int count = 0;
while (!done) {
    SignalWait(cudaDone[count]);
    doWorkInGL(imagePBO[count]);
    endGLSync[count] = glFenceSync(…);
    EventSignal(oglDone[count]);
    count = (count + 1) % 2;
}
```

CUDA worker thread:

```cpp
SignalWait(setupCompleted);
wglMakeCurrent(hDC, workerCtx);
// Register OpenGL objects with CUDA
int count = 1;
while (!done) {
    SignalWait(oglDone[count]);
    glWaitSync(endGLSync[count]);  // wait until GL has finished with the buffer
    cudaGraphicsMapResources(1, &cudaResourceBuf[count], cudaStream[N]);
    cudaGraphicsResourceGetMappedPointer((void **)&memPtr, &size,
                                         cudaResourceBuf[count]);
    doWorkInCUDA(memPtr, cudaStream[N]);
    cudaGraphicsUnmapResources(1, &cudaResourceBuf[count], cudaStream[N]);
    cudaStreamSynchronize(cudaStream[N]);
    EventSignal(cudaDone[count]);
    count = (count + 1) % 2;
}
```

Application design considerations
 Use cudaD3D[9|10|11]GetDevices / cudaGLGetDevices to choose the right device to provision for multi-GPU environments.
 Avoid synchronized GPUs for CUDA!
 CUDA-OpenGL interop can perform slower if the OpenGL context spans multiple GPUs.

Application design considerations
 Allow users to specify the GPUs!
— Typical heuristics: TCC mode, GPU number, available memory, number of processing units
— Affecting factors: OS, ECC, TCC mode
 Don't make your users dig through driver control panels; API interop hides all the complexity.

If not cross-GPU API interop, then what?
 Move the data yourself: cudaMemcpy from the CUDA context on GPU 1 into system RAM, then glBufferSubData from RAM into the OpenGL context's buffer on GPU 2.

Compute/graphics interoperability: what's new with CUDA 5.0?
 cudaD3D[9|10|11]SetDirect3DDevice / cudaGLSetGLDevice are no longer required.
 All mipmap levels are shared.
 Interoperate with OpenGL and DirectX at the same time.
 Lots and lots of Windows WDDM improvements.

Conclusions/Resources
 The driver will do all the heavy lifting, but…
 Scalability and final performance are up to the developer, and…
 For fine-grained control you might want to move the data yourself.
 CUDA samples and documentation: http://developer.nvidia.com/cuda-downloads