GPU JIT (aka Shader Compiler) From IL to ISA Gang Chen AMD
OUTLINE
Role of the JIT
Benefit of IL and JIT
Inside JIT
What is coming next
Future of the JIT
3 | GPU JIT (aka Shader Compiler) from IL to ISA | June 2011
ROLE OF THE JIT
The final compilation layer on GPU
Invoked by driver/runtime
– Typically at the beginning of the application: “CreateShader” or “CreateKernel”
– Could happen at any time before “dispatch/draw”
– JIT binary is inside every computer with AMD GPU or APU
Also inside the developer tools
– AMD GPU ShaderAnalyzer
– AMD Stream KernelAnalyzer
“Shader” is a historical name for GPU program
– JIT is an essential part in both compute and graphics software stack
[Diagram: DirectX®, OpenCL™, OpenGL → AMD IL → SC: GPU JIT → GPU ISA → GPU]
AMD-IL EXAMPLE
il_cs_1_0
dcl_num_thread_per_group 64, 1, 1
dcl_literal l0, 64, 4, 0, 0
dcl_raw_uav_id(0) ; dst buffer
dcl_raw_uav_id(1) ; src buffer
imad r0.x, vThreadGrpId.x, l0.x, vTidInGrp.x
ishl r0.x, r0.x, l0.y
uav_raw_load_id(1) r1, r0.x
uav_raw_store_id(0) mem, r0.x, r1
end
Bears a lot of resemblance to Microsoft DirectX assembly language
A superset in order to support OpenCL™, OpenGL, and multimedia
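To make the IL listing easier to follow, here is a minimal Python sketch of what each thread appears to do: compute a flat thread index, scale it to a byte offset, and copy one element from the source buffer to the destination buffer. The function names are hypothetical, and the 16-byte element size is an assumption read off the shift amount in l0.y (a shift by 4).

```python
THREADS_PER_GROUP = 64  # from: dcl_num_thread_per_group 64, 1, 1
SHIFT = 4               # from: l0.y in dcl_literal l0, 64, 4, 0, 0
ELEMENT_BYTES = 1 << SHIFT  # assumption: one element is 16 bytes (four dwords)


def run_thread(group_id, tid_in_group, src, dst):
    """Emulate one thread of the IL example (hypothetical helper)."""
    # imad r0.x, vThreadGrpId.x, l0.x, vTidInGrp.x
    index = group_id * THREADS_PER_GROUP + tid_in_group
    # ishl r0.x, r0.x, l0.y
    byte_offset = index << SHIFT
    # uav_raw_load_id(1) r1, r0.x  followed by  uav_raw_store_id(0) mem, r0.x, r1
    dst[byte_offset:byte_offset + ELEMENT_BYTES] = src[byte_offset:byte_offset + ELEMENT_BYTES]


def run_dispatch(num_groups, src):
    """Run every thread of every group sequentially and return the dst buffer."""
    dst = bytearray(len(src))
    for group_id in range(num_groups):
        for tid in range(THREADS_PER_GROUP):
            run_thread(group_id, tid, src, dst)
    return dst
```

Running run_dispatch(2, src) on a 2048-byte source buffer copies the whole buffer, one 16-byte element per thread.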
GPU ISA EXAMPLE
Features
– VLIW, X/Y/Z/W functional units
– Banked register, X/Y/Z/W banks
– Instruction clauses: ALU clause, fetch clause
– Clause temps
– Previous vector-value register
IL and ISA are quite different!
00 ALU: ADDR(32) CNT(5)
      0  z: LSHL ____, R1.x, 6
      1  y: ADD_INT ____, R0.x, PV0.z
      2  x: LSHL R1.x, PV1.y, 2
01 TEX: ADDR(48) CNT(1)
      3  VFETCH R0, R1.x, fc155 FORMAT(32_32_32_32_FLOAT) FETCH_TYPE(NO_INDEX_OFFSET)
02 MEM_RAT_CACHELESS_STORE_DWORD__NI: RAT(0)[R1], R0, ARRAY_SIZE(4) MARK VPM
03 END
END_OF_PROGRAM
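The previous vector-value register (PV) in the ALU clause can be traced by hand. The sketch below is an illustrative Python model of the three ALU instruction groups only, not of the hardware: a destination of ____ is modeled as a result that is visible to the next group through PV and never written to a general-purpose register. The function name is hypothetical, and which input register holds which thread ID is not stated in the listing, so both inputs are treated as plain integers.

```python
MASK32 = 0xFFFFFFFF  # results are kept to 32 bits


def run_alu_clause(r0_x, r1_x):
    """Trace clause 00 of the ISA example; returns the final value of R1.x."""
    # Group 0, z slot: LSHL ____, R1.x, 6   (result only reachable as PV0.z)
    pv0 = {"z": (r1_x << 6) & MASK32}
    # Group 1, y slot: ADD_INT ____, R0.x, PV0.z   (result only reachable as PV1.y)
    pv1 = {"y": (r0_x + pv0["z"]) & MASK32}
    # Group 2, x slot: LSHL R1.x, PV1.y, 2   (result written back to R1.x)
    r1_x = (pv1["y"] << 2) & MASK32
    return r1_x
```

With r0_x = 5 and r1_x = 3 the trace gives ((3 << 6) + 5) << 2 = 788, and no intermediate value ever occupies a register.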
BENEFIT OF IL & JIT
This virtual-machine approach has been working well for PC graphics
– Enable backward and forward compatibility: new games run on old hardware, old games run on new hardware
– Enable hardware innovation
– Make GPU easier to program
Flip side
– May hinder full utilization of hardware capability
– JIT compile-time budget
Will it stay this way (especially for compute)?
INSIDE JIT | Compilation Flow
More like the back-end part of a CPU compiler
As an example, in contrast to the Open64 compiler
– No loop-nesting optimization, no alias analysis, no auto-vectorization
– No multiple iterations of optimization
Approximate compilation flow
– IL to IR
– Loop unrolling and function inlining
– SSA construction
– Optimizations
– Local instruction scheduling and register allocation
– Global register allocation
– Assemble
INSIDE JIT | Optimizations
Classical optimizations
– Global value numbering: full redundancy elimination, constant folding
– Global code motion
– Copy propagation
– Dead-code elimination
– Dead-branch removal
– Peephole optimization
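Three of the classical optimizations listed above can be illustrated on a single basic block. The sketch below is a deliberately simplified Python model, not the shader compiler's algorithm: it does constant folding, copy propagation, and dead-code elimination in one forward and one backward pass, and it is not global value numbering. The tuple-based IR, the op names, and the function name are all hypothetical, and the input is assumed to be in SSA form (each name defined once).

```python
def optimize(block, live_out):
    """Fold constants, propagate copies, then drop dead instructions.

    block: list of (dest, op, a, b) with op in {"mov", "add", "mul"};
           operands are ints (constants) or strings (names); b is None for mov.
    live_out: names whose values are needed after the block.
    Assumes SSA form, so a propagated name can never be redefined later.
    """
    env = {}      # name -> constant or name it is a copy of
    folded = []
    for dest, op, a, b in block:
        # Replace operands by whatever they are known to be equal to.
        a = env.get(a, a) if isinstance(a, str) else a
        b = env.get(b, b) if isinstance(b, str) else b
        if op == "mov":
            env[dest] = a                       # copy / constant propagation
            folded.append((dest, "mov", a, None))
        elif isinstance(a, int) and isinstance(b, int):
            value = a + b if op == "add" else a * b
            env[dest] = value                   # constant folding
            folded.append((dest, "mov", value, None))
        else:
            folded.append((dest, op, a, b))

    # Backward pass: keep only instructions whose result is still needed.
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(folded):
        if dest in live:
            kept.append((dest, op, a, b))
            live.update(x for x in (a, b) if isinstance(x, str))
    kept.reverse()
    return kept
```

For example, a block that computes w = (gid * 64 + tid) + 64 * 4 through several temporaries and copies shrinks to three instructions once only w is live on exit.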
INSIDE JIT | GPU-Unique Optimizations
Re-association on floating-point operations
– Graphics applications are more amenable to such transformation
Channel remapping
– Redistribute temps into register banks
– Benefit both instruction scheduling and register allocation
Instruction scheduling
– Delicate balance between data parallelism and instruction-level parallelism
Two-pass register allocation
– Local register allocation during scheduling
– Global register allocation only on cross-block temps
– Many kernels with large basic blocks and simple control-flow
Load (or store) combining
– Loading 2/3/4 adjacent dwords is as fast as loading 1 dword
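Load combining can be sketched as grouping adjacent dword offsets into wider loads of up to four dwords. The Python below is a minimal illustration under simplifying assumptions: offsets are already known constants, there are no alignment constraints, and no stores intervene between the loads. The function name and the (start, width) representation are hypothetical.

```python
def combine_loads(dword_offsets, max_width=4):
    """Group dword offsets into (start, width) loads of at most max_width dwords."""
    groups = []
    for offset in sorted(set(dword_offsets)):
        if groups:
            start, width = groups[-1]
            # Extend the current load if this dword is directly adjacent
            # and the load has not yet reached the maximum width.
            if offset == start + width and width < max_width:
                groups[-1] = (start, width + 1)
                continue
        groups.append((offset, 1))
    return groups
```

Seven scalar loads at dword offsets 0, 1, 2, 3, 4, 8, 9 become three loads: (0, 4), (4, 1), and (8, 2).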
INSTRUCTION SCHEDULING | ILP versus Data Parallelism
The fundamental problem is the gap between memory bandwidth and peak FLOPS
– HD6970 max memory bandwidth: 175 GB/s; peak FLOPS: 2703 GFLOP/s
– Most effort in both HW and SW is spent on this problem
One important part of the solution is to use lots of threads
– Overlapped fetch and ALU
– Many outstanding fetches
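The size of the gap follows directly from the two HD6970 figures quoted on this slide. The short Python calculation below works out how many floating-point operations the ALUs can perform in the time it takes to bring in one byte, and one 4-byte float, from memory. The function name is hypothetical; the inputs are the slide's numbers.

```python
BYTES_PER_FLOAT = 4  # single-precision float


def flops_per_byte(peak_gflops, bandwidth_gb_per_s):
    """Peak arithmetic throughput available per byte of memory traffic."""
    return peak_gflops / bandwidth_gb_per_s


per_byte = flops_per_byte(2703.0, 175.0)   # about 15.4 FLOPs per byte
per_float = per_byte * BYTES_PER_FLOAT     # about 61.8 FLOPs per float loaded
```

A kernel that does far fewer than about 60 operations per float it loads is therefore limited by memory at peak rates, which is why hiding fetch latency behind many threads matters so much.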
INSTRUCTION SCHEDULING | ILP versus Data Parallelism (cont’d)
In order to maximize threads (data parallelism)
– You want to minimize registers, which is the typical limiting factor
– All active threads must be resident on the GPU
However, you may hit other hardware limits when maximizing threads
– Number of outstanding loads/stores
– Similar issue on the input/output side for a graphics stage
– May worsen cache performance
There are ALU-limited kernels
Conclusion: it is a very fine balance for the instruction scheduler between packing instructions and using registers
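The trade-off between registers and resident threads can be made concrete with a toy model. In the Python sketch below, the register budget and the hardware cap on resident thread groups are hypothetical parameters chosen only for illustration, not figures for any particular AMD GPU; the point is the inverse relationship between registers per thread and the number of thread groups that can stay resident.

```python
def resident_groups(regs_per_thread, regs_available=256, hw_max_groups=32):
    """Thread groups that fit at once, given per-thread register use.

    regs_available and hw_max_groups are illustrative placeholders,
    not values taken from hardware documentation.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(hw_max_groups, regs_available // regs_per_thread)
```

Under these placeholder numbers, a kernel using 16 registers per thread keeps 16 groups resident, while one using 64 keeps only 4, leaving far fewer threads available to hide fetch latency.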
TIPS FOR DEVELOPER
http://developer.amd.com/gpu/AMDAPPSDK/documentation/Pages/default.aspx
– Several optimization case studies
Memory optimizations usually provide the bigger boost
– JIT can do very little on this
– Load/store float4 or float2
Provide sufficient amount of ILP for ALU-intensive kernel
– That is why thread-fusion and loop unrolling are very helpful
– Operate on float4 or float2
Avoid spilling if possible
– KernelAnalyzer will tell you in the “Scratch Reg” column of the compiler statistics
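Why unrolling and float4-style code expose ILP can be seen in a dot product. The Python sketch below only illustrates the dependency structure and is not GPU code: the first version has one accumulator, so every add depends on the previous one; the second keeps four independent accumulators, like the four lanes of a float4, which a VLIW scheduler could pack into one instruction group. Function names are hypothetical.

```python
def dot_single(a, b):
    """One accumulator: a serial chain of dependent adds."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_unrolled4(a, b):
    """Four accumulators: four independent chains, combined at the end."""
    if len(a) != len(b) or len(a) % 4 != 0:
        raise ValueError("inputs must have equal length, a multiple of 4")
    acc = [0.0, 0.0, 0.0, 0.0]
    for i in range(0, len(a), 4):
        for lane in range(4):
            # The four lane updates in one iteration do not depend on each other.
            acc[lane] += a[i + lane] * b[i + lane]
    return (acc[0] + acc[1]) + (acc[2] + acc[3])
```

The unrolled version adds the products in a different order, so for general floating-point inputs its result may differ from the serial one in the last bits; this is the same re-association the compiler can only apply when the application tolerates it.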
WHAT IS COMING NEXT
New architecture
– No longer VLIW, no clause, and no register bank
– Want to know more?
  Go to session 2620 “Graphics Core Next” by Mike Houston and Mike Mantor
  Go to the closing keynote “Evolution of AMD’s Graphics Core and Preview of Graphics Core Next” by Eric Demers
New IL for compute
– Closely matches the new architecture
Significant part of JIT has been rewritten for the new architecture
– New register allocator
– New instruction scheduler
– Different optimizations
EVOLUTION OF GPU
As the application field widens, GPU instruction set is getting richer and richer

Feature                                        | CPU | GPU 2006     | GPU 2011
IEEE float, single & double                    |     | N            | Y
Hardware exception                             |     | N            | Y
Flat address space                             |     | N            | Coming soon
Atomic operations                              |     | N            | Y
Read & write cache                             |     | N            | Y
Byte & short operations, multimedia operations |     | Very limited | Extensive
FUTURE OF THE JIT
Tremendous amount of work to support new hardware features
Less emphasis on optimizations
– Hardware will get better (easier to get performance)
– High-level compiler will do more optimizations
More emphasis on productivity features
– Support very large GPU programs, and more language features, e.g. dynamic library
– Improved compilation
– Support debugger and profiler
– Expose tuning knob to the end users
QUESTIONS
Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. Microsoft and DirectX are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. © 2011 Advanced Micro Devices, Inc. All rights reserved.