GPU JIT (aka Shader Compiler) From IL to ISA
Gang Chen, AMD

OUTLINE
• Role of the JIT
• Benefit of IL and JIT
• Inside JIT
• What is coming next
• Future of the JIT

3 | GPU JIT (aka Shader Compiler) from IL to ISA | June 2011

ROLE OF THE JIT
• The final compilation layer on GPU
• Invoked by driver/runtime
  – Typically at the beginning of the application
    • “CreateShader” or “CreateKernel”
  – Could happen at any time before “dispatch/draw”
  – JIT binary is inside every computer with AMD GPU or APU
• Also inside the developer tools
  – AMD GPU ShaderAnalyzer
  – AMD Stream KernelAnalyzer
• “Shader” is a historical name for GPU program
  – JIT is an essential part in both compute and graphics software stack

[Diagram: DirectX®, OpenCL™, OpenGL → AMD IL → SC: GPU JIT → GPU ISA]

GPU AMD-IL EXAMPLE

    il_cs_1_0
    dcl_num_thread_per_group 64, 1, 1
    dcl_literal l0, 64, 4, 0, 0
    dcl_raw_uav_id(0) ; dst buffer
    dcl_raw_uav_id(1) ; src buffer
    imad r0.x, vThreadGrpId.x, l0.x, vTidInGrp.x
    ishl r0.x, r0.x, l0.y
    uav_raw_load_id(1) r1, r0.x
    uav_raw_store_id(0) mem, r0.x, r1
    end

• Bears a lot of resemblance to Microsoft DirectX assembly language
• A superset in order to support OpenCL™, OpenGL, and multimedia

GPU ISA EXAMPLE
• Features
  – VLIW, X/Y/Z/W functional units
  – Banked register, X/Y/Z/W banks
  – Instruction clauses
    • ALU clause
    • Fetch clause
  – Clause temps
  – Previous vector-value register
• IL and ISA are quite different!
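To make the AMD-IL example above more concrete, here is a minimal Python sketch of what each thread appears to compute: a flat thread index (the `imad`), a byte offset obtained by shifting left by 4 (the `ishl`), and a copy from the src buffer to the dst buffer. The function names and the use of byte strings as buffers are illustrative assumptions, not part of the slides. The sketch also assumes that a raw load into the full register `r1` reads four consecutive dwords (16 bytes).

```python
THREADS_PER_GROUP = 64  # dcl_num_thread_per_group 64, 1, 1  (l0.x)
SHIFT = 4               # l0.y: shifting left by 4 multiplies by 16 bytes
BYTES_PER_THREAD = 16   # assumption: r1 holds four 32-bit dwords


def byte_offset(group_id: int, tid_in_group: int) -> int:
    """Model of the imad + ishl pair in the IL example."""
    flat_id = group_id * THREADS_PER_GROUP + tid_in_group  # imad
    return flat_id << SHIFT                                # ishl


def run_copy_kernel(src: bytes, num_groups: int) -> bytearray:
    """Emulate every thread: load 16 bytes from src, store them to dst."""
    dst = bytearray(len(src))
    for group_id in range(num_groups):
        for tid in range(THREADS_PER_GROUP):
            off = byte_offset(group_id, tid)
            # uav_raw_load_id(1) followed by uav_raw_store_id(0)
            dst[off:off + BYTES_PER_THREAD] = src[off:off + BYTES_PER_THREAD]
    return dst


if __name__ == "__main__":
    data = bytes(range(256)) * 8          # 2048 bytes = 2 groups * 64 threads * 16 bytes
    copied = run_copy_kernel(data, num_groups=2)
    print(byte_offset(1, 3), copied == data)
```

Under these assumptions the kernel is a plain buffer copy, which makes it easier to see how the handful of IL instructions map onto the ALU, fetch, and store clauses in the ISA listing.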
    00 ALU: ADDR(32) CNT(5)
          0  z: LSHL     ____, R1.x, 6
          1  y: ADD_INT  ____, R0.x, PV0.z
          2  x: LSHL     R1.x, PV1.y, 2
    01 TEX: ADDR(48) CNT(1)
          3  VFETCH R0, R1.x, fc155
             FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
    02 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4) MARK VPM
    03 END
    END_OF_PROGRAM

BENEFIT OF IL & JIT
• This virtual-machine approach has been working well for PC graphics
  – Enables backward and forward compatibility
    • New games run on old hardware
    • Old games run on new hardware
  – Enables hardware innovation
  – Makes the GPU easier to program
• Flip side
  – May hinder full utilization of hardware capability
  – JIT compile-time budget
• Will it stay this way (especially for compute)?

INSIDE JIT | Compilation Flow
• More like the back-end part of a CPU compiler
• As an example, in contrast to the Open64 compiler
  – No loop-nesting optimization, no alias analysis, no auto-vectorization
  – No multiple iterations of optimization
• Approximate compilation flow
  – IL to IR
  – Loop unrolling and function inlining
  – SSA construction
  – Optimizations
  – Local instruction scheduling and register allocation
  – Global register allocation
  – Assemble

INSIDE JIT | Optimizations
• Classical optimizations
  – Global value numbering
    • Full redundancy elimination
    • Constant folding
  – Global code motion
  – Copy propagation
  – Dead-code elimination
  – Dead-branch removal
  – Peephole optimization

INSIDE JIT | GPU-Unique Optimizations
• Re-association on floating-point operations
  – Graphics applications are more amenable to such transformation
• Channel remapping
• Instruction scheduling
  – Delicate balance between data parallelism and instruction-level parallelism
• Two-pass register allocation
  – Redistribute temps into register banks
  – Local register allocation during scheduling
  – Benefits both instruction scheduling and register allocation
  – Global register allocation only on cross-block temps
• Load (or store) combining
  – Loading 2/3/4 adjacent dwords is as fast as loading 1 dword

  – Many kernels with large basic blocks and simple control flow

INSTRUCTION SCHEDULING | ILP versus Data Parallelism
• The fundamental problem is the gap between memory bandwidth and peak FLOPS
  – HD6970 max memory bandwidth: 175 GB/s; peak FLOPS: 2703 GFLOP/s
  – Most effort in both HW and SW is spent on this problem
• One important part of the solution is to use lots of threads

[Figure labels: ALU; overlapped fetch and ALU; many outstanding fetches]

INSTRUCTION SCHEDULING | ILP versus Data Parallelism (cont’d)
• In order to maximize threads (data parallelism)
  – You want to minimize registers, which is the typical limiting factor
  – All active threads must be resident on the GPU
• However, you may hit other hardware limits when maximizing threads
  – Number of outstanding loads/stores
  – Similar issue on the input/output side for a graphics stage
  – May worsen cache performance
• There are ALU-limited kernels
• Conclusion: the instruction scheduler must strike a very fine balance between packing instructions and using registers

TIPS FOR DEVELOPER
• http://developer.amd.com/gpu/AMDAPPSDK/documentation/Pages/default.aspx
  – Several optimization case studies
• Memory optimizations usually provide the bigger boost
  – JIT can do very little on this
• Load/store float4 or float2
• Provide a sufficient amount of ILP for ALU-intensive kernels
  – That is why thread-fusion and loop unrolling are very helpful
  – Operate on float4 or float2
• Avoid spilling if possible
  – KernelAnalyzer will tell you in the “Scratch Reg” column of the compiler statistics
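The gap between memory bandwidth and peak FLOPS, and the tip that memory optimizations usually give the bigger boost, can be made concrete with a little arithmetic on the HD6970 figures quoted in the instruction-scheduling slides. The sketch below is a simplified bound in the style of a roofline model, not something taken from the slides: it assumes 32-bit floats and assumes that a kernel is limited either by peak compute or by memory bandwidth, ignoring caches and latency. All function names are illustrative.

```python
# Figures quoted on the slide for the HD6970.
PEAK_GFLOPS = 2703.0         # peak GFLOP/s
PEAK_BANDWIDTH_GBS = 175.0   # max memory bandwidth, GB/s
BYTES_PER_FLOAT = 4          # assumption: 32-bit floats


def flops_per_byte(gflops: float = PEAK_GFLOPS,
                   gbytes_per_s: float = PEAK_BANDWIDTH_GBS) -> float:
    """FLOPs the GPU can issue in the time it takes to move one byte."""
    return gflops / gbytes_per_s


def flops_per_float(gflops: float = PEAK_GFLOPS,
                    gbytes_per_s: float = PEAK_BANDWIDTH_GBS) -> float:
    """Same ratio, expressed per 32-bit float moved."""
    return flops_per_byte(gflops, gbytes_per_s) * BYTES_PER_FLOAT


def attainable_gflops(kernel_flops_per_byte: float,
                      gflops: float = PEAK_GFLOPS,
                      gbytes_per_s: float = PEAK_BANDWIDTH_GBS) -> float:
    """Simplified upper bound: the lower of the compute and bandwidth limits."""
    return min(gflops, kernel_flops_per_byte * gbytes_per_s)


if __name__ == "__main__":
    print(f"FLOPs per byte moved:  {flops_per_byte():.1f}")
    print(f"FLOPs per float moved: {flops_per_float():.1f}")
    # A kernel doing one FLOP per float loaded (0.25 FLOP/byte):
    print(f"Bound at 0.25 FLOP/byte: {attainable_gflops(0.25):.2f} GFLOP/s")
```

With these numbers, a kernel needs roughly 60 FLOPs per float it moves before it stops being bandwidth-bound under this model, which is one way to read why the slides put memory optimizations ahead of ALU tuning.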
WHAT IS COMING NEXT
• New architecture
  – No longer VLIW, no clause, and no register bank
  – Want to know more?
    • Go to session 2620 “Graphics Core Next” by Mike Houston and Mike Mantor
    • Go to the closing keynote “Evolution of AMD’s Graphics Core and Preview of Graphics Core Next” by Eric Demers
• New IL for compute
  – Closely matches the new architecture
• Significant part of JIT has been rewritten for the new architecture
  – New register allocator
  – New instruction scheduler
  – Different optimizations

EVOLUTION OF GPU
• As the application field widens, the GPU instruction set is getting richer and richer

    CPU                                                  GPU 2006       GPU 2011
    IEEE float (single & double)                         N              Y
    Hardware exception                                   N              Y
    Flat address space                                   N              Coming soon
    Atomic operations                                    N              Y
    Read & write cache                                   N              Y
    Byte & short operations, multimedia operations       Very limited   Extensive

FUTURE OF THE JIT
• Tremendous amount of work to support new hardware features
• Less emphasis on optimizations
  – Hardware will get better (easier to get performance)
  – High-level compiler will do more optimizations
• More emphasis on productivity features
  – Support very large GPU programs, and more language features
    • E.g. dynamic library
  – Improved compilation
  – Support debugger and profiler
  – Expose tuning knob to the end users

QUESTIONS

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. Microsoft and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions.

© 2011 Advanced Micro Devices, Inc. All rights reserved.