Preview only show first 10 pages with watermark. For full document please download

Ray Tracing With Reduced-precision Bounding Volume Hierarchies

   EMBED


Share

Transcript

THE UNIVERSITY OF CALGARY Ray Tracing with Reduced-Precision Bounding Volume Hierarchies by Jeffrey A. Mahovsky A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY DEPARTMENT OF COMPUTER SCIENCE CALGARY, ALBERTA August, 2005 c Jeffrey A. Mahovsky 2005 THE UNIVERSITY OF CALGARY FACULTY OF GRADUATE STUDIES The undersigned certify that they have read, and recommend to the Faculty of Graduate Studies for acceptance, a thesis entitled “Ray Tracing with Reduced-Precision Bounding Volume Hierarchies” submitted by Jeffrey A. Mahovsky in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Supervisor, Dr. Brian Wyvill Department of Computer Science Dr. Laurence Turner Schulich School of Engineering External Examiner Dr. Peter Shirley School of Computing University of Utah Dr. Lisa Higham Department of Computer Science Dr. John Kendall Department of Computer Science Date ii Abstract Bounding volume hierarchies (BVHs) are an effective method for improving the performance of ray tracing algorithms. A BVH consists of a tree of bounding volumes enclosing the objects within the scene. Bounding volumes are typically axis-aligned bounding boxes, or AABBs. This research first presents up-to-date surveys of literature relevant to BVHs and ray tracing hardware. A new method is derived for determining if a ray and an AABB overlap using Pl¨ ucker coordinates. When used in a BVH, this method improves ray tracing performance by up to 27%. A reduced-precision, hierarchical, integer representation for the BVH’s AABB coordinates is investigated that reduces BVH memory requirements by 63% to 75%. Unfortunately, significant computational overhead is incurred because the integer representation must be converted back to floating-point during ray tracing. Coherent ray tracing is employed to amortize the conversion cost over many rays, reducing the overhead to a negligible level. Lastly, reduced-precision, integer ray-AABB overlap tests are derived and hardware designs are compared to the equivalent floating-point tests. These integer tests are guaranteed to produce identical images to the higher-precision floating-point tests because they correctly handle uncertainty due to numerical imprecision. The integer tests are shown to reduce the circuit size and complexity of a hardware ray-AABB test by roughly half. iii Acknowledgments This research was made possible by the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Alberta Informatics Circle of Research Excellence (iCORE). I would like to thank my supervisor, Dr. Brian Wyvill, for providing guidance, helpful feedback, and moral support during this project’s numerous changes and re-writes. Special thanks goes to Dr. Al Davis and Ali Ibrahim (both from the University of Utah) for assisting me with the hardware designs and analysis in Chapter 7 and providing the circuit performance data in Table 7.1. I wish to thank my parents, Larry and Gloria Mahovsky, for providing support and encouragement during this work. I also wish to thank their little kitty, Sam, for providing comic relief. Thanks to my Calgary relatives, Gordon and Pat Mahovsky and Lorna and Eric Sommerfeldt for providing numerous home-cooked Sunday meals. Finally, I wish to thank my friends for their willingness to go hiking with me year-round in all kinds of weather, day or night, especially Peter MacMurchy, Martin Fuhrer, and Wendy Osborn. Thanks to the Graphics Jungle lunch room crowd for great conversation and humour. iv Table of Contents Approval Page ii Abstract iii Acknowledgments iv Table of Contents v 1 Introduction 1.1 Problem Statement . . . . . . . . . . 1.2 Original Contributions . . . . . . . . 1.3 Thesis Overview . . . . . . . . . . . . 1.4 Publications and the Contributions of . . . . 1 1 1 2 3 . . . . . . . . . . . . . . . Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Previous Work 2.1 Ray Tracing . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Bounding Volume Hierarchies (BVH) . . . . . . . . . . . . 2.3 Ray-Axis Aligned Bounding Box (AABB) Tests . . . . . . 2.3.1 Slabs Method . . . . . . . . . . . . . . . . . . . . . 2.3.2 Woo’s Method . . . . . . . . . . . . . . . . . . . . . 2.3.3 Separating Axis Theorem Line Segment-AABB Test 2.4 Survey of BVH Literature . . . . . . . . . . . . . . . . . . 2.4.1 BVH Algorithms . . . . . . . . . . . . . . . . . . . 2.4.2 Supporting Algorithms . . . . . . . . . . . . . . . . 2.5 Survey of Ray Tracing Hardware Literature . . . . . . . . 2.6 Software Optimization versus Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 8 10 11 14 14 16 16 21 22 36 3 Ray-AABB Overlap Tests with Pl¨ ucker Coordinates 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Pl¨ ucker Coordinates . . . . . . . . . . . . . . . . . . . 3.3 Pl¨ ucker Ray-AABB Overlap Test . . . . . . . . . . . . 3.4 Numerical Issues . . . . . . . . . . . . . . . . . . . . . 3.5 Ray-AABB Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 38 38 42 47 49 . . . . . 53 53 53 56 60 62 4 Ray-BVH Traversal with the Plu-AABB 4.1 Introduction . . . . . . . . . . . . . . . . 4.2 Test Scenes . . . . . . . . . . . . . . . . 4.3 BVH Construction . . . . . . . . . . . . 4.4 BVH-Plu versus BVH-Slabs . . . . . . . 4.5 Child Node Testing Order . . . . . . . . v Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Memory-Conserving BVHs with Coherent Ray Tracing 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Hierarchical BVH Encoding . . . . . . . . . . . . . . . . . 5.3 Hierarchical BVH Traversal . . . . . . . . . . . . . . . . . 5.4 Hierarchical BVH Rendering Performance . . . . . . . . . 5.5 Reducing Overhead with Coherent Ray Tracing . . . . . . 5.6 Rendering Performance with Coherent Ray Tracing . . . . 5.7 Single vs. Double Precision Floating Point . . . . . . . . . 5.8 Out-of-Core Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 66 67 69 70 75 78 84 84 6 Ray-BVH Traversal with Integer 6.1 Overview . . . . . . . . . . . . . 6.2 Integer Slabs-AABB Test . . . . 6.2.1 Scene and Ray Setup . . 6.2.2 Ray-AABB Tests . . . . 6.2.3 BVH Efficiency . . . . . 6.3 Integer Plu-AABB Test . . . . . 6.3.1 Scene and Ray Setup . . 6.3.2 Ray-AABB Tests . . . . 6.3.3 BVH Efficiency . . . . . Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Integer Hardware for Ray-AABB Tests 7.1 Introduction . . . . . . . . . . . . . . . . . . 7.2 Hardware Design Issues . . . . . . . . . . . . 7.3 Circuit Area, Delay, and Pipeline Stages . . 7.4 Hardware Designs . . . . . . . . . . . . . . . 7.4.1 Floating-point Slabs-AABB Pipeline 7.4.2 Integer Slabs-AABB Pipeline . . . . 7.4.3 Floating-point Plu-AABB Pipeline . 7.4.4 Integer Plu-AABB Pipeline . . . . . 7.5 Performance Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 88 90 90 93 97 103 103 104 111 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 116 116 117 120 121 125 128 134 139 8 Conclusions and Future Work 145 8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 A List of Abbreviations 149 B BVH Performance vs. Maximum Triangles per Node 151 C BVH Construction C++ Code 160 D BVH-Plu-DSA Traversal C++ Code 167 E BVH-Slabs-DSA Traversal C++ Code 183 vi Bibliography 186 vii List of Tables 3.1 3.2 Single-precision ray-AABB performance . . . . . . . . . . . . . . . . . . . . Double-precision ray-AABB performance . . . . . . . . . . . . . . . . . . . 51 52 4.1 4.2 4.3 4.4 Rendering times, BVH-Plu versus BVH-Slabs DSA algorithm . . . . . . . . . . . . . . . . . Performance increase from DSA algorithm . . DSA vs. distance-order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 63 64 64 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 Hierarchically-encoded BVH performance (scenes 1-4) . . Hierarchically-encoded BVH performance (scenes 5-8) . . BVH performance with coherent ray tracing (scenes 1-2) BVH performance with coherent ray tracing (scenes 3-4) BVH performance with coherent ray tracing (scenes 5-6) BVH performance with coherent ray tracing (scenes 7-8) BVH performance (256-ray coherent ray tracing) . . . . . 2Lucy hierarchically-encoded BVH performance . . . . . 2Lucy BVH performance with coherent ray tracing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 74 79 80 81 82 83 85 86 6.1 6.2 6.3 6.4 6.5 6.6 6.7 BVH BVH BVH BVH BVH BVH BVH size versus precision for Lucy scene . . . . . . . . . . performance vs. precision (Slabs-AABB), scenes 1-3 . performance vs. precision (Slabs-AABB), scenes 4-6 . performance vs. precision (Slabs-AABB), scenes 7-8 . performance vs. precision (Plu-AABB), scenes 1-3 . . performance vs. precision (Plu-AABB), scenes 4-6 . . performance vs. precision (Plu-AABB), scenes 7-8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 100 101 102 113 114 115 7.1 7.2 Circuit area, delay and pipeline stages . . . . . . . . . . . . . . . . . . . . Ray-AABB pipeline area and performance . . . . . . . . . . . . . . . . . . 118 140 B.1 B.2 B.3 B.4 B.5 B.6 B.7 B.8 Spheres scene - Performance vs. Maximum triangles per node . . . . . Bunny scene - Performance vs. Maximum triangles per node . . . . . . Branch scene - Performance vs. Maximum triangles per node . . . . . . Poppy scene - Performance vs. Maximum triangles per node . . . . . . Powerplant-Front scene - Performance vs. Maximum triangles per node Powerplant-North scene - Performance vs. Maximum triangles per node Powerplant-Boiler scene - Performance vs. Maximum triangles per node Lucy scene - Performance vs. Maximum triangles per node . . . . . . . 152 153 154 155 156 157 158 159 viii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . List of Figures 2.1 2.2 2.3 2.4 2.5 Ray tracing . . . . . . . . . . . . . . . Bounding volume hierarchy . . . . . . Axis-aligned bounding box (AABB) . . 2D “slabs” ray-plane distance intervals 2D “slabs” ray-box test example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 8 11 12 12 3.1 3.2 3.3 3.4 3.5 3.6 Relative orientation of ray and line . . . . . . . . Derivation of the side relation . . . . . . . . . . . Ray-polygon intersection with Pl¨ ucker coordinates Axis-aligned bounding box and MMM rays . . . . C++ code for “MMM” Plu-AABB test . . . . . . Largest intersection distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 40 42 43 46 46 4.1 4.2 4.3 Test scenes 1-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Test scenes 5-8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Empty BVH child . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 55 57 5.1 5.2 5.3 5.4 Pseudocode for regular BVH traversal . Pseudocode for 8-bit BVH traversal . . Pseudocode for coherent BVH traversal 2Lucy test scene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 71 77 85 6.1 6.2 6.3 6.4 Integer vs. floating-point slabs interval, cases 1-3 . . Integer vs. floating-point slabs interval, cases 4-6 . . Pseudocode for FP Plu-AABB test, case “MMM” . . Pseudocode for integer Plu-AABB test, case “MMM” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 96 105 106 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 7.10 7.11 7.12 7.13 7.14 Overview of pipelines . . . . . . . . . . . . . . . FP Slabs-AABB pipeline stages 1-7 . . . . . . . FP Slabs-AABB pipeline stage 13 . . . . . . . . Integer Slabs-AABB pipeline stages 1-4 (part A) Integer Slabs-AABB pipeline stages 1-4 (part B) Integer Slabs-AABB pipeline stage 5 . . . . . . FP Plu-AABB pipeline stage 1-7 (part A) . . . FP Plu-AABB pipeline stage 1-7 (part B) . . . FP Plu-AABB pipeline stages 8-12 . . . . . . . FP Plu-AABB pipeline stages 13-20 . . . . . . . Integer Plu-AABB pipeline stage 1 (part A) . . Integer Plu-AABB pipeline stage 1 (part B) . . Integer Plu-AABB pipeline stages 2-4 . . . . . . Integer Plu-AABB pipeline stage 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 123 124 126 127 128 130 131 132 133 135 136 137 138 ix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 1 Introduction 1.1 Problem Statement The problem tackled by this research is to reduce the memory usage and complexity of ray tracing using bounding volume hierarchies (BVH.) BVHs are a popular and effective method for dramatically increasing the performance of ray tracing algorithms. Scene geometry is enclosed in a tree of bounding volumes, providing an efficient method for determining which scene objects a ray intersects. Typically, bounding volumes are axis-aligned bounding boxes (AABBs), although other shapes may also be used. BVH implementations usually employ 32-bit or 64-bit floating-point arithmetic for the BVH data structure and for performing ray-AABB intersection tests. The use of 32-bit or 64-bit floating-point causes the BVH to require large amounts of memory for complex scenes. The complexity of hardware ray-AABB testing circuitry is also high, because floating-point arithmetic hardware is more complex and requires more chip area than simpler integer hardware. 1.2 Original Contributions This research makes several original contributions to the field of ray tracing: 1. A comprehensive, up-to-date survey of literature related to BVHs. 2. A comprehensive, up-to-date survey of ray tracing hardware. 3. An efficient ray-axis aligned bounding box (AABB) overlap test based on Pl¨ ucker coordinates. 1 2 4. A simple method for traversing BVH nodes in the order they appear along the ray. A similar method was independently developed by Mansfield [Man02] during the writing of this research. 5. An investigation into the use of a low-precision, integer, relative encoding scheme for BVH node coordinates. This reduces the memory consumption of the BVH by 63% or more. A similar idea was originally proposed by Rusinkiewicz and Levoy [RL00], but no investigation was performed. 6. The use of coherent ray tracing to mitigate the computational overhead from using the low-precision BVH node coordinate encoding scheme. 7. Robust, lower-precision, integer ray-AABB tests derived from the traditional “slabs” ray-AABB test and the new Pl¨ ucker-AABB test. These methods are guaranteed to produce correct images regardless of the precision used for the BVH. 8. Hardware designs are presented that implement the robust integer ray-AABB tests. Performance estimates are given, showing the advantage of the integer methods versus using floating-point. 1.3 Thesis Overview This research is divided into six main chapters: Chapter 2: This chapter begins with an overview of ray tracing, BVHs, and ray-AABB intersection tests such as the Slabs-AABB test. Surveys of BVH and ray tracing hardware literature are presented. Chapter 3: A new method for performing ray-AABB overlap tests based on Pl¨ ucker coordinates is presented. The method’s performance is compared to other ray-AABB tests by testing sets of random rays against random AABBs. 3 Chapter 4: The chapter begins by reviewing BVH construction. Test scenes used throughout this research are introduced. BVH traversal using the new Pl¨ ucker-AABB test from Chapter 3 is tested against the Slabs-AABB test. An improved BVH traversal method that traverses BVH nodes in their order along the ray is presented and tested. Chapter 5: Relative, low-precision, integer encoding for BVH node coordinates is tested. This saves a substantial amount of memory, but is found to have high run-time overhead. Coherent ray tracing is employed to reduce the overhead to negligible levels. Chapter 6: Robust, lower-precision, integer ray-AABB tests based on the Slabs-AABB and Pl¨ ucker-AABB tests are introduced. Their effectiveness when used in a BVH for various test scenes is evaluated. Chapter 7: Hardware designs are presented for ray-AABB circuits based on the robust integer algorithms from Chapter 6. Performance estimates are made based on the properties of the circuit components. Hardware for the robust integer methods is compared to floatingpoint implementations. 1.4 Publications and the Contributions of Others Three papers and one technical report based on this research were published or submitted to journals. The thesis chapters are not verbatim copies of these publications. The material has been re-organized, extensively edited, and expanded with additional results and analysis. 1. Jeffrey Mahovsky and Brian Wyvill. “Fast Ray-axis Aligned Bounding Box Overlap Tests with Pl¨ ucker Coordinates.” Journal of Graphics Tools, 9(1):35-46, 2004. Chapter 3 presents an extensively updated and revised version of the material in this paper. Brian Wyvill acted in an advisory role, providing assistance with the expression of ideas and editing of the paper. The ideas presented within are my own, as is the implementation and testing. 4 2. Jeffrey Mahovsky. “Ray Tracing Bounding Volume Hierarchies with the Plu-AABB Test.” University of Calgary Technical Report 2004-759-24. April 2005. Chapter 4 is based on this technical report. 3. Jeffrey Mahovsky and Brian Wyvill. “Memory-Conserving Bounding Volume Hierarchies with Coherent Ray Tracing.” Submitted to Computer Graphics Forum, May 2005. Chapter 5 presents an expanded version of the material in this paper, with new results showing the effectiveness of the technique when running with insufficient physical memory. Brian Wyvill acted in an advisory role, assisting with the editing of the paper. The ideas and implementation are my own. 4. Jeffrey Mahovsky, Brian Wyvill, Al Davis, and Ali Ibrahim. “Efficient Ray-Bounding Volume Hierarchy Traversal with Integer Arithmetic Hardware.” Submitted to ACM Transactions on Graphics, May 2005. Chapters 6 and 7 present the integer Slabs-AABB test and hardware described in this paper. These chapters were greatly expanded to also present the integer Plu-AABB test and hardware. Brian Wyvill acted in an advisory role, assisting with the editing of the paper and also was instrumental in obtaining the assistance of Al Davis and his graduate student, Ali Ibrahim. Al Davis designed the initial hardware pipeline for the Plu-AABB test, based on my discussions with him. Al also taught me how to design a basic hardware pipeline, thus the Slabs-AABB pipeline is my own design. He also contributed a section in the paper providing background information on hardware implementation issues. I have re-written the section in my own words for this thesis. Al Davis and Ali Ibrahim provided the essential circuit performance data on which the performance estimates are based. The major ideas and implementations are my own, with the exceptions mentioned above. Chapter 2 Previous Work This chapter begins by reviewing the ray tracing method for generating realistic images of three-dimensional scenes. The use of bounding volume hierarchies (BVHs) to accelerate ray tracing is discussed. Existing methods for performing the ray-bounding box tests needed for ray-BVH traversal are reviewed. Two literature surveys are presented that cover other relevant research. The first reviews BVH-related literature, and the second surveys ray tracing hardware. Finally, a discussion of hardware versus software approaches for increasing ray tracing performance is given. 2.1 Ray Tracing Ray tracing (or recursive ray tracing) generates images by tracing the paths of rays of light through a virtual three-dimensional environment [App68, Whi79]. Primary rays are traced from a virtual camera or eye and are tested for intersection with the objects in the scene (Figure 2.1.) When a ray hits an object, an intersection point is computed and shadow rays are cast to the lights to determine the illumination contribution of each light source. If a shadow ray hits an object before reaching the light source, then the intersection point is within a shadow. When a ray hits a reflective object, a reflected ray is traced that computes the reflected light. This accurately computes reflections for virtually any kind of three-dimensional object. If the object has refractive properties (e.g. is made of glass or similar material), a transmitted or refracted ray is traced into the object. This accurately simulates the optical properties of objects such as lenses. Reflected and transmitted rays are often called secondary rays. 5 6 Light source Light source Camera / Eye Shadow ray Shadow ray Primary ray Shadow rays Transmitted ray Reflected ray Reflected ray Ray-object Intersection point Transmitted ray Sphere Figure 2.1: Ray tracing. Reflected and transmitted rays may strike other objects in the scene. When this happens, another intersection point is computed and further shadow, reflected, and transmitted rays are traced. Hence, the ray tracing algorithm is recursive, allowing the accurate simulation of complex optical phenomena such as mirrors reflecting each other. The ray tracing algorithm is actually the reverse of what happens in reality. Light normally comes from light sources, and a small amount of that light enters the camera, forming an image. Simulating this in a ray tracer is impractical because only a tiny fraction of the light coming from the light sources would actually hit the camera lens and contribute to the rendered image. Instead, rays are traced from the camera into the scene, and shadow rays are traced towards the light sources. 7 Ray tracing was first used by Appel [App68] for shading of CAD (Computer Aided Design) and architectural models. These drawing were output on a printer/plotter. Shading was indicated by various patterns of dots and other shapes using a halftoning algorithm. The shading was fairly coarse because lighting was only computed for regularly-spaced locations on the page. Appel’s ray tracing algorithm was not recursive because it only computed shadow rays, and not reflected or transmitted rays. Whitted [Whi79] improved Appel’s technique by adding secondary rays and recursion. Whitted was able to produce stunning colour imagery, but the computer hardware at that time was barely up to the task. Simple scenes required upwards of an hour to render on a Digital Equipment Corporation VAX 11/780 minicomputer. Since then, a large amount of effort has been made to improve the performance of ray tracing. Kay and Greenberg [KG79] developed an approximate transparency algorithm that avoids the computational cost of ray tracing. Transparent surfaces are processed in backto-front order using a z-buffering hidden surface removal algorithm. The algorithm exploits assumptions made about the object geometry and viewing parameters. Ray tracing is not a complete simulation of the interaction of light with an environment. Its most serious drawback is that it does not compute light that is diffusely reflected from objects within the scene - it only considers light coming directly from the light sources and from the reflected/transmitted ray directions. Fortunately, other algorithms based on ray tracing are able to solve this problem, such as path tracing [Kaj86] and photon mapping [Jen01]. The findings of this research also apply to these methods because they perform similar ray tracing functions. For a more detailed discussion of ray tracing techniques, see Glassner [Gla89] and Shirley and Morley [SM03]. 8 Figure 2.2: Bounding volume hierarchy (BVH). Left: The scene, showing objects and bounding boxes. Right: The BVH. 2.2 Bounding Volume Hierarchies (BVH) A simple ray tracer will test each ray against each object in the scene for ray-object intersection. This is inefficient when the scene contains more than a handful of objects, because there can be millions of rays and potentially millions of objects. Portions of the scene can be enclosed within bounding volumes to reduce the number of ray-object tests [RW80]. Consider a scene consisting of a room containing a chair. If the chair contains 5,000 triangles, each ray would need to be tested for intersection with each of the 5,000 triangles. If the chair is fully enclosed by a bounding volume, the ray tracer first tests the bounding volume for ray intersection. If the ray intersects the bounding volume, then the ray is tested against the 5,000 triangles within. Only a small fraction of the rays might hit the bounding volume, thus the total number of ray-object tests is dramatically reduced. This method can be improved by enclosing different parts of the chair in smaller bounding volumes that are enclosed by the chair’s bounding volume. If the ray hits the chair’s bounding volume, it is then tested against the bounding volumes enclosing the chair parts. 9 This forms a bounding volume hierarchy, or BVH. There are two types of BVH nodes: branch or interior nodes that contain other BVH nodes, and leaf nodes that contain the scene objects. Branch nodes typically contain two children, although some BVH implementations allow more. Leaf nodes may contain any number of objects, although the BVH is usually constructed such that the number of objects within a leaf node is below a certain value. BVH nodes may overlap each other in space, but the objects contained within a BVH node are fully contained within that node. An object will only belong to one BVH child node. Figure 2.2 shows a BVH for a simple scene. Ray-BVH traversal is a depth-first tree traversal, beginning at the root node of the BVH. If the ray hits a BVH branch node, its children are tested for ray intersection. If the ray hits a leaf node, the scene objects within are tested for ray intersection. Ray tracing systems typically construct BVHs using an automatic algorithm. They can be constructed manually, but this is labour-intensive for complex models. BVH construction is discussed in the survey in Section 2.4 and in Chapter 4. There are several other schemes for reducing the number of ray-object tests [AK89]. Unlike the BVH, these schemes typically partition space instead of scene objects. Thus, nodes or cells do not overlap in space, but objects may overlap multiple nodes or cells. Grids [Vat84, FTI86, CW88] partition the scene into a three-dimensional array of cells. The cells contain either empty space or scene objects. In some implementations, grid cells may also contain subgrids [JW89]. Octrees [Gla84] are a recursive spatial subdivision that divides the scene into eight equal subspaces. These subspaces are recursively subdivided until some criteria is met. Octree nodes contain either child nodes, empty space, or scene objects. Binary Space Partitioning (BSP) trees and k-d trees [FS88] recursively divide the scene into two equal (for BSP) or non-equal (for k-d tree) subspaces. Like octrees, the subspaces are recursively subdivided until some criteria is met. BSP or k-d tree nodes contain either 10 empty space, child nodes, or scene objects. More exotic schemes also exist, such as 5-dimensional spatial subdivision [AK87]. Both the three-dimensional scene space and the two-dimensional ray direction space are subdivided. These schemes are not investigated in this work because of their implementation complexity and because there is not a clear performance advantage to using them. 2.3 Ray-Axis Aligned Bounding Box (AABB) Tests AABBs are the most popular bounding volume used for BVHs because they are both simple and relatively efficient. Computing the smallest AABB that encloses a set of objects is easy, because only the minimum and maximum extents of the objects need to be known. There are other types of bounding volumes that may enclose objects more tightly, but these often require more computation to create and/or more computation to test for ray intersection. An AABB structure contains six coordinate values which are the minimum and maximum extents of the box along each coordinate axis. The coordinates are named xmin, xmax, ymin, ymax, zmin, and zmax (Figure 2.3.) Additional data is needed for ray-BVH traversal, and this is described in Chapter 4. A ray data structure must be defined before the details of the ray-AABB tests can be presented. A ray contains several parts: 1. An origin O with coordinates (Ox , Oy , Oz ). ~ with components (Di , Dj , Dk ). 2. A direction vector D 3. A length (rlen), which is zero or a positive finite value, or a value of -1 to indicate an infinitely-long ray. Alternatively, a 1-bit boolean value rinf can be included to indicate an infinite ray. 11 Y (xmin,ymax,zmin) (xmax,ymax,zmin) (xmax,ymax,zmax) (xmin,ymax,zmax) (xmin,ymin,zmin) (xmax,ymin,zmin) X (xmax,ymin,zmax) (xmin,ymin,zmax) Z Figure 2.3: Axis-aligned bounding box (AABB). 2.3.1 Slabs Method The most commonly-used ray-AABB test is the Slabs-AABB test [KK86]. This test works by computing three distance intervals along the ray to the pairs of planes (slabs) which define the box faces. For the ray to hit the AABB, the intersection of these intervals must be non-empty and must overlap the ray (Figure 2.5.) The first step in the Slabs-AABB test is to compute six distances along the ray (t0x , t1x , t0y , t1y , t0z , t1z ), one for each plane defining a box face. A 2D example is given in Figure 2.4. For brevity, the derivations will focus on the x-axis formulas. The y- and z-axis are similar. Hence, the intersection distances to the x-axis planes of the box are given by: xmin − Ox Di xmax − Ox = Di t0x = t1x 12 t1y ymax t1x t0x t0y ymin ray xmin ray xmax Figure 2.4: 2D “slabs” ray-plane distance intervals shown as [t0x , t1x ] and [t0y , t1y ]. tfary tnearx tfarx ymax tnearx tfarx tfary ymax box tneary tneary ymin ray xmin xmax ymin ray xmin xmax Figure 2.5: 2D “slabs” ray-box test example: left: hit - [tnearx , tf arx ] and [tneary , tf ary ] overlap, right: miss - [tnearx , tf arx ] and [tneary , tf ary ] do not overlap. To avoid division, we can multiply by pre-computed inverse ray direction vector components Di−1 , Dj−1 , and Dk−1 . These inverse components will be termed “IRDVCs.” t0x = (xmin − Ox ) ∗ Di−1 t1x = (xmax − Ox ) ∗ Di−1 Note that the IRDVCs may have values of +Infinity or -Infinity if the corresponding ray component is the floating-point value +0 or −0. Values of infinity for the IRDVCs are acceptable for the Slabs-AABB algorithm, as the IEEE floating-point standard handles them properly [Smi98]. Care must be taken to ensure that the implementation of SlabsAABB properly handles +0 and −0 ray components [WBMS05], because it means the 13 difference between +Infinity and -Infinity when IRDVCs are computed, or when dividing. Next, three intervals are formed: [tnearx , tf arx ], [tneary , tf ary ] and [tnearz , tf arz ]. If Di−1 is positive, then tnearx = t0x and tf arx = t1x , else tnearx = t1x and tf arx = t0x . The rule is similar for [tneary , tf ary ] and [tnearz , tf arz ]. A 2D example is shown in Figure 2.5. By forming the intersection of these three intervals, the near and far intersection distances of the ray and AABB are obtained: [tnear, tf ar] = [max(tnearx , tneary , tnearz ), min(tf arx , tf ary , tf arz )] For the ray to hit the AABB in front of the ray origin, the following conditions must be satisfied: tf ar ≥ 0 tf ar ≥ tnear tnear ≤ rlen There is a problem with the Slabs-AABB test that is revealed when examining the equations for t0x and t1x . If the subtraction result is zero and the IRDVC is infinity, then the result of the multiplication is Not a Number, or NaN [Gol91]. The NaN result might cause a failure of the algorithm because it is impossible to properly compare values to a NaN. All comparisons with NaN produce a false result, even when comparing NaN to itself. (Interestingly, one way to check for x == NaN is to compare x == x and check for a false result. This is not recommended because a compiler might not understand this trick and optimize the expression out of the code. Fortunately, an isnan() function is available on many platforms.) The NaN problem also happens when dividing by ray components instead of multiplying by IRDVCs because the division becomes 00 , which is also NaN. The NaN problem can be avoided by handling axis-parallel rays as a special case, or by ensuring the ray-AABB test handles NaNs properly (possibly by exploiting the fact that 14 comparison with NaN always produces false, causing NaNs to be ignored.) My implementation of Slabs-AABB is not affected by NaNs (see Appendix E.) Other traversal algorithms that compute distances along the ray to planes (such as k-d trees) must also consider these infinity and NaN issues. Floating-point imprecision can also cause problems with the Slabs-AABB test. For example, edge- and corner-grazing intersections might produce incorrect results because the distance values can become very close to each other at the box edges. 2.3.2 Woo’s Method Woo [Woo90] proposed a ray-AABB test that only needs to compute intersection distances to three planes instead of six. The algorithm first determines which three candidate planes to test, based on the relative positions of the ray origin and the AABB faces. Intersection distances are computed to the three candidate planes, and the furthest intersection distance becomes the potential ray-AABB intersection point. The coordinates of the intersection point are computed and compared to the AABB to check if the intersection point is outside the box, in which case the ray misses the box. The code required to implement this test is more complex than for Slabs-AABB, and seems to require more operations. Therefore, it is not tested in this research. 2.3.3 Separating Axis Theorem Line Segment-AABB Test Another algorithm exists that can be applied to the problem of ray-AABB overlap tests and BVH traversal, if somewhat awkwardly. This algorithm is the line segment-AABB overlap test based on the Separating Axis Theorem (SAT) [GLGT99], called SAT-AABB. The Separating Axis Theorem states that for any two convex and disjoint polyhedra that there is an axis on which the projections of the two polyhedra are also disjoint. The polyhedra can be separated by an axis that is orthogonal (or a plane that is parallel) to 15 a face or an edge of either polyhedra. A line segment can be considered to be a convex polyhedra with no faces that encloses no volume. Thus, the line segment is considered to be an “edge” for the SAT-AABB test. The SAT-AABB test first projects the line segment and the AABB onto the x axis and tests if the projections overlap. If they do not, the line segment and AABB do not intersect. This test is repeated for the y and z axes. Next, a new axis a is formed by taking the cross product of the line’s direction with the x axis. The AABB and line are projected onto a and the projections are tested for overlap. The process repeats for the y and z axes. The SAT-AABB test has not seen much use in ray tracing for several possible reasons. First, it requires that the ray be represented as a line segment center point and forward/backward half-distance vector, which is incompatible with the normal ray origin and direction vector representation. Conversion between the two different representations is possible, but adds additional complexity which can make integration into a ray tracer somewhat challenging. It may be possible to resolve this problem above by adjusting the algorithm to accommodate the standard ray origin and direction vector input, making it more useful for ray tracing. Second, it cannot handle rays/line segments of infinite length. (Although, if the scene boundaries are known, infinite-length rays can be converted to finite-length rays.) Primary and secondary rays are typically of infinite length, as well as rays that sample infinitelydistant light sources. Third, it does not compute an intersection distance along the line/ray to the AABB. This would be a serious problem if the child nodes in a BVH are to be tested in the order of their distance along the ray. It is no longer a problem because the new DSA technique can be used (see Section 4.5), which eliminates the need for intersection distances. In Chapter 3, the performance of the SAT technique is compared to the Slabs-AABB 16 and Pl¨ ucker ray-AABB overlap tests. The SAT technique is not used for tests involving actual ray-BVH traversal because of the first two problems listed above. 2.4 Survey of BVH Literature This section presents a survey of literature relevant to BVHs, including ray-box tests and bounding volume construction techniques. The section is divided into two subsections. The first presents BVH algorithms. The second presents algorithms used in conjunction with a BVH, such as algorithms to compute bounding volumes. Comparing the effectiveness and performance of the various algorithms is difficult because the publications do not use consistent test scenes, testing methods, or similar computer systems. 2.4.1 BVH Algorithms 1. [RW80] A 3-Dimensional Representation for Fast Rendering of Complex Scenes. This paper introduces the idea of using a tree of bounding volumes to represent objects and speed up ray tracing. Unlike later BVH methods, the bounding volumes (oriented parallelepipeds) actually represent the geometry within the scene instead of functioning as containers for geometry. For example, a triangle is represented as the intersection of three bounding boxes. This simplifies the system because an explicit triangle primitive is not required, but complicates the BVH traversal because the bounding volumes are more complex than simple boxes. The authors state that the method provides logarithmic run-time. The BVH was created manually with an editor, which may be labour-intensive for complex scenes. 2. [WHG84] Improved Computational Methods for Ray Tracing. This paper examines the effectiveness of different types of bounding volumes used in a BVH, and the tradeoffs between using simple, loosely-fitting bounding volumes and more complex, 17 tightly fitting bounding volumes. Manual construction of the BVH with a graphical editor is discussed. The research proposes a hybrid ray tracing / z-buffering algorithm [FvDFH95]. Z-buffering is used instead of shooting primary rays, increasing performance but also increasing complexity because both z-buffering code and ray tracing code must be implemented. The resulting image is an “item buffer” containing item IDs of the closest objects instead of colours. Ray tracing is used for secondary rays. 3. [KK86] Ray Tracing Complex Scenes. The paper introduces the “slabs” ray-bounding volume intersection test. A bounding volume is defined by slabs, which are pairs of planes in 3D space. At least three slabs are needed to enclose an object, forming a convex bounding volume. The slabs’ plane normals are restricted to the three coordinate axes and diagonals, making the bounding volumes larger than necessary but easier to test for ray intersection. The amount of bounding volume expansion from this simplification depends on the shape that is enclosed by the bounding volume. Each BVH node can contain two or more children, and the children are traversed in ray intersection distance order. A heap is used for BVH traversal instead of recursive function calls. The heap facilitates the sorting of the BVH nodes by distance order. The “median-cut” top-down, recursive BVH construction scheme is also introduced. 4. [GS87] Automatic Creation of Object Hierarchies for Ray Tracing. An algorithm is presented that builds BVHs incrementally by adding objects one-at-a-time and choosing their location in the BVH based on minimizing the increase in BVH cost. This type of construction technique is often called “bottom-up” BVH construction. A new BVH node or object is inserted at a location that minimizes the increase in the parent nodes’ surface area. The method produces different BVHs for different object insertion orders, so several different object insertion orders are compared. The method is not quantitatively compared to the “top-down” BVH construction method that recursively divides the BVH nodes based on a splitting plane [KK86]. 18 5. [Gla88] Spacetime Ray Tracing for Animation. A hybrid BVH method is presented that combines a spatial subdivision (e.g. an octree) with the tight-fitting bounds of a BVH. An octree is constructed in top-down, recursive fashion, and then bounding volumes are wrapped around the geometry contained within the octree nodes. Unlike a regular BVH, the bounding volumes cannot overlap, and bounding volumes do not always fully enclose their objects. This means that an object may overlap multiple BVH nodes, potentially reducing efficiency because an object may be tested for ray intersection multiple times. The BVH is extended to four dimensions (including time). This is more efficient for rendering animation sequences because the BVH need only be constructed once for the entire animation, instead of once for each frame. The nonoverlapping nature of the 4D spacetime bounding volumes also reduces the amount of BVH traversal and the number of ray-object intersection tests. Unfortunately, this technique is more complex than a regular BVH. 6. [Gla89] An Introduction to Ray Tracing. Glassner’s book contains a section explaining BVHs, which reviews the techniques presented in the other BVH papers published before 1989. 7. [CS90] Efficient Traversal of Well-Behaved Hierarchical Trees of Extents for RayTracing Complex Scenes. This paper introduces a BVH algorithm where the BVH nodes cannot overlap, and therefore objects can span multiple BVH nodes. This is the opposite of a traditional BVH, and more closely resembles a k-d or BSP tree. Having BVH objects potentially overlapping multiple nodes can reduce efficiency, because the objects may be tested for ray-object intersections multiple times. Additionally, the bounding boxes completely fill the empty space within their parent nodes instead of shrinking to fit the objects. This is also similar to a k-d or BSP tree, and permits optimization of the BVH traversal. The traversal algorithm uses 6-bit codes to determine which faces of the bounding boxes to test for intersection. The traversal 19 distinguishes between “internal” and “external” BVH nodes, testing all internal nodes before external nodes. Internal nodes contain the ray origin, and external nodes do not contain the ray origin. The BVH is constructed using a recursive, median-cut algorithm. 8. [Hai91] Efficiency Improvements for Hierarchy Traversal in Ray Tracing. Several BVH traversal optimizations are presented: • Do not traverse BVH nodes that are further away than the current closest rayobject intersection distance. • Cache the last object hit by a shadow ray, and test new shadow rays against this object before performing a full BVH traversal. • Cache the closest object last intersected by a regular ray and test it first for ray-object intersection. • Precompute the BVH traversal for common ray origins such as eye rays and light sources and store the node/objects traversed in an ordered list. Rays from these origins are tested against the list instead of the BVH. • Perform bottom-up BVH traversal for rays originating from an object. Ordered traversal lists can be constructed for individual objects. • Sort bounding volumes by characteristics, e.g. test larger bounding volumes first. Unfortunately, no test results are given, so the effectiveness of these optimizations is unclear. 9. [KC93] An Efficient Hierarchical Traversal Algorithm for Ray Tracing. This paper improves the BVH traversal method described by Kay and Kajiya [KK86]. BVHs are usually traversed in a depth-first fashion starting at the root node, even for secondary rays. BVH traversal for secondary rays may be improved by starting at the bottom of 20 the BVH and working upwards. Secondary ray traversal begins at the node containing the object hit by the primary ray. This complicates the traversal somewhat, but test results show up to an 18% speed-up overall. A plane-sweep method is also introduced that reduces the number of intersection tests for eye rays. 10. [ST94] Two Optimization Methods for Ray Tracing. In a BVH node containing two AABB children, the children may have several faces coincident with the parent’s faces. This can be exploited to reduce the number of ray-plane computations when using the Slabs-AABB test by more than half. The coincident faces of the child AABBs are flagged during BVH construction, requiring additional data in the BVH nodes. The authors’ testing shows a speed-up of approximately 20%. The BVH is constructed in a top-down fashion. 11. [Smi98] Efficiency Issues for Ray Tracing. Several BVH optimizations are proposed: • The Slabs-AABB test actually works properly when a ray is parallel to an AABB face, because IEEE-compliant floating-point arithmetic handles values of infinity correctly. Thus, rays parallel to AABB faces do not need to be handled as special cases. • The BVH can be encoded as an array by using skip pointers because the traversal order is always the same in the author’s implementation. This eliminates recursion overhead because moving to the next BVH node only requires incrementing a pointer by the skip value. (Traversing the BVH in the same order regardless of the ray direction is inefficient, as shown in Chapter 4.) • Regular rays and shadow rays are handled separately, because shadow rays can end their traversal as soon as any intersection is found, unlike regular rays which require the closest intersection. • Object caching similar to Haines’ technique [Hai91] is also discussed. 21 12. [SM03] Realistic Ray Tracing, Second edition. Chapter 9 of Shirley’s book describes top-down BVH construction and BVH traversal using the Slabs-AABB test. The book provides C++ code for the BVH construction and traversal algorithms, which reveals important details that are not discussed by other publications. 2.4.2 Supporting Algorithms 1. [Rit90] An Efficient Bounding Sphere. A two-pass, linear-time algorithm is introduced for finding an approximate minimum bounding sphere for a set of points. The resulting sphere is approximately 5% larger than the ideal minimum bounding sphere. The first pass identifies three pairs of points (one pair for each axis) that are at the minimum and maximum positions along each axis. The pair with the maximum separation forms the initial sphere. The second pass compares each point to the initial sphere and incrementally increases the sphere’s size and adjusts the center point to enclose any points that are outside. Later work shows that there are cases where this algorithm fails [Wu92]. 2. [Woo90] Fast Ray-Box Intersection. Woo’s ray-box test is described in Section 2.3.2. 3. [Tru92] Rectangular Bounding Volumes for Popular Primitives. This article presents methods for computing AABBs for transformed shapes such as cubes, polygons, cylinders, cones, conics, spheres, and torii. The shapes are assumed to be positioned and oriented within the scene by applying a cumulative transformation matrix (CTM) to standard primitive shapes. The coefficients of the CTM are used to obtain the AABB’s coordinates. 4. [Wu92] A Linear-time Simple Bounding Volume Algorithm. An O(N) algorithm is presented that computes an oriented bounding box that encloses a set of N points. The algorithm rotates the coordinate system containing the points and then computes 22 the minimum orthogonal bounding volume. The algorithm can be adapted to produce a better bounding sphere for cases where Ritter’s bounding sphere algorithm [Rit90] produces a poor solution, such as when the point distribution is not orthogonal to the x, y, and z axes. 5. [St¨ u96] Bounding Volume Construction Using Point Clouds. Computing the bounding volumes for complex objects, such as CSG (Constructive Solid Geometry) objects [FvDFH95], can be difficult. A method is proposed that generates a cloud of points around complex objects. The bounding volume that encloses the points also encloses the object, simplifying the generation of tight-fitting bounding volumes. The paper’s results show a nearly five-fold improvement in rendering speed from having better bounding volumes. 6. [GLGT99] A Framework for Fast and Accurate Collision Detection for Haptic Interaction. This paper describes a line segment-box overlap test based on the Separating Axis Theorem. The test is described in Section 2.3.3. 7. [WBMS05] An Efficient and Robust Ray-Box Intersection Algorithm. This paper describes an improved version of the Slabs-AABB test. The algorithm correctly handles +0 and −0 ray direction vector components, unlike other algorithms which do not consider the sign of zero. The properties of IEEE floating-point arithmetic are used to eliminate the need for special cases when the direction components are +0 or −0. Ray-AABB face distances are computed by multiplying by precomputed ray direction vector component inverses, instead of dividing by the direction components. 2.5 Survey of Ray Tracing Hardware Literature Although ray tracing hardware has not yet appeared in desktop PCs, a large amount of research has been performed. This section presents a survey of literature describing 23 customized ray tracing hardware and systems designed for ray tracing. It does not include GPU (Graphics Processing Unit)-based ray tracing [Pur04] or more general multi-processor or transputer architectures. It is difficult to compare the effectiveness of the various hardware techniques, primarily because different test scenes and testing methods were used. Thus, the results are summarized but are not compared, unless the authors explicitly compare their system to another. 1. [CWVB83] Design and Analysis of a Parallel Ray Tracing Computer. The authors propose a hardware ray tracing system constructed from a 3D grid of processors, each with local memory. Each processor node corresponds to a grid cell in a uniform grid subdivision, and each node is connected via communication links to its 6 neighbours. Rays travel between nodes as they travel through the grid. The authors estimate that a system with 125,000 Intel 8086 microprocessors (a 50 by 50 by 50 grid) could achieve real-time ray tracing for $50 million (1983 dollars.) 2. [CWBV86] Multiprocessor Ray Tracing. This paper is a follow-up to Cleary et al’s earlier paper [CWVB83]. A theoretical analysis for both 2D and 3D arrays of processors is presented. The authors conclude that a 2D array of processors is superior to a 3D array because a 2D array has better theoretical performance and is easier to cool. The actual hardware tested was the CM 2 (Calgary Mesh Machine) which consisted of a 3 by 3 grid of processor boards containing 10 MHz Motorola 68000 CPUs with 512 kilobytes of RAM per CPU. Detailed analysis showed that duplicating the entire scene database per processor is a better approach for achieving high performance, with each processor tracing a fraction of the rays independently of the other processors. 3. [PK87b, PK87a] (The Feasibility of ) A VLSI Chip for Ray Tracing Bicubic Patches. 24 This paper proposes a VLSI chip for performing ray-bicubic patch intersection tests. The ray is tested against the bounding box of the patch, and the patch is repeatedly subdivided until the size of the bounding box falls below a certain value. The circuitry requires 42% - 50% of the space on a 36 square millimetre chip when implemented with 2 micron NMOS chip technology. It completes a ray/patch intersection test every 15 microseconds when running at 20 MHz. 12-bit fixed-point precision is used for the patch control points. 24-bit precision can also be used, in which case a ray/patch intersection would require 140 microseconds. 4. [Ler87] Towards a Z-buffer and Ray-tracing Multimode System based on Parallel Architecture and VLSI chips. A real-time (interactive) rendering system is presented that contains two major hardware components: CUBI 7 and CRISTAL. CUBI 7 is a real-time z-buffering renderer [FvDFH95], and the CRISTAL module is used for ray tracing. CUBI 7 also mixes z-buffer rendered pixels with ray traced pixels from CRISTAL. To save computation, ray tracing is only used for image regions that require it, such as refractive polygons. CRISTAL consists of 16-128 parallel microprocessors (National Semiconductor’s NS 32032 CPU and NS 16081 FPU.) No hardware was built and no simulation results are presented. 5. [NYTN87] SIGHT – A Dedicated Computer Graphics Machine. SIGHT is an interactive, real-time (interactive) image generation system based on a ray tracing algorithm. It contains an array of 64 custom processing elements (PEs), each approximately 2.3 to 4.3 times faster than a Digital Equipment Corporation VAX 11/780 minicomputer with floating-point accelerator. Hence, the authors claim the maximum computation speed is 275 times faster (4.3 * 64) than a VAX 11/780. Scene geometry is defined with constructive solid geometry (CSG). The scene database is duplicated on each PE, providing a high degree of scalability. The PEs perform the ray tracing calculations using 32-bit floating-point arithmetic. No spatial subdivision or BVH scheme is 25 used, so rays are tested against each object in the scene. No hardware was actually constructed before publication of the paper. 6. [YNT91] A Dedicated Graphics Processor SIGHT-2. SIGHT-2 is an improved version of SIGHT [NYTN87]. The PEs are now 10 times faster than a VAX 11/780, and the system uses 16 PEs instead of 64. As with SIGHT, the PEs use 32-bit floating-point arithmetic. Each PE maintains its own copy of the scene in its own 64 megabyte RAM. The system has a total of 1GB of RAM, which was very impressive for that time. The PEs are designed for ray tracing but function more like microprocessors because they execute a microcode software program. A complete hardware system was constructed, consisting of a large cabinet filled with circuit boards with custom VLSI chips. Tests showed that a system containing 16 PEs is 158.8 times faster than a VAX. 7. [Sch88] Ray Tracing Rational B-Spline Patches in VLSI. A hardware implementation of the OSLO subdivide-and-intersect b-spline algorithm [CLR80] is presented in this paper. No hardware was actually constructed or simulated, but detailed schematics and estimates of component counts are given. No other results are given. 8. [YHDD89] A Hardware Algorithm for Fast Realistic Image Synthesis. This paper describes a hardware-based radiosity [GTGB84] system. Ray tracing is used for both preprocessing (computation of form factors, which is the mutual visibility between surface patches) and rendering. Geometry may be defined with B´ezier surfaces, allowing curved surfaces and patches. CORDIC (COrdinate Rotation DIgital Computer) [Vol59] hardware is used to rotate predefined hemispheres of rays and place them above a surface. Ray-patch intersection tests are performed similarly as described in Schneider’s design [Sch88]. The system does not use any spatial subdivision or BVHs. Only a software simulation was constructed. 26 9. [BSC89] A VLSI Chip for Ray Tracing Bicubic Patches. This paper shares similarities with Pulleyblank and Kapenga’s work [PK87b], but provides more detail about the system architecture. Patches are recursively divided into four subpatches, and then the bounding boxes of their subpatches are checked for ray intersection. The system uses a Slabs-AABB technique for ray-AABB intersection tests. Software simulations showed significant image artifacting due to the use of integer arithmetic. Specifically, arithmetic overflows occurred within the Slabs-AABB test when rays were nearly parallel to box faces. The authors propose that these cases should be handled in software and not by the chip. The hardware is described in detail, but only brief and incomplete statistics are presented for the actual hardware implementation. The authors estimate the entire chip requires 30 thousand transistors, and would be 30 square millimetres in size. The system does not use any spatial subdivision or BVHs, except for the ray-patch intersection. 10. [BD89] A VLSI System Architecture for High-Speed Radiative Transfer 3D Image Synthesis. This is a hardware-based radiosity system that uses ray tracing, with similarities to Yilmaz et al’s system [YHDD89]. The environments consist of only diffuse polygonal surfaces. Form factors are computed with ray tracing, but a linear system solver is used for the radiosity equations. CORDIC hardware rotates polygons to a convenient orientation that simplifies ray-polygon tests. No spatial subdivision or BVHs are used, so every ray must be tested against every patch. No results are presented. 11. [PH89] The Pixel Machine: A Parallel Image Computer. The Pixel Machine consists of a pipeline of pipe nodes and an n x m array of pixel nodes. Non-parallel algorithms run on the pipe nodes, and parallel algorithms run on the pixel nodes. Each pixel node has access to every m-th pixel on every n-th scanline of the frame buffer. The pipe nodes and pixel nodes are based on programmable AT&T DSP32 digital signal 27 processors containing floating-point hardware and 4 kilobytes of on-chip RAM. Nodes also contain 32 kilobytes of external RAM, for a total of 36 kilobytes of storage. The Pixel Machine is able to perform polygon rendering, ray tracing, and volume rendering. When ray tracing, each pixel node has a copy of the scene data. If the 36 kilobytes of node memory is insufficient for the scene data, the node requests memory pages from the host. Pipe nodes are used for computation of bounding volumes and transformations of objects during preprocessing. Ray tracing performance was 2-3 orders of magnitude faster than a typical workstation. 12. [AML+ 89] A Cellular Architecture for Ray Tracing. This hardware system implements discrete ray tracing, where the scene geometry is comprised only of voxels with surface normals. The hardware is a four-stage pipeline. The first three stages transform scene geometry and converts it into a voxel representation. The last stage is an array of computational cells, with one cell per pixel. Each cell computes ray-voxel intersections for a subset of the voxel array. Rays are passed between the cells in the 2D cell network, which is mapped to a virtual 3D network. Each cell is assigned an (x,y) voxel position within the scene and a complete column of voxels in the zdirection. Cells contain a 20-bit ALU (arithmetic/logic unit). The system is prone to communication deadlocks because there is limited storage for rays within the network. No hardware was constructed, but the system was simulated on a transputer. Unfortunately, no results are presented. 13. [Sun92], [SK95] (Tracing Rays with the) Area Sampling Machine. The Area Sampling Machine (ASM) is actually a z-buffering hardware renderer that can simulate ray tracing. Rays are represented as beams / frustums, and the scene geometry is scanconverted into each ray. The frustum is the beam that passes through one screen pixel. For each primary “ray”, a 32 by 32 pixel image is rendered, providing a high degree of anti-aliasing (1024 samples per pixel.) Only one pixel is rendered for reflected 28 rays. The scene database is repeatedly broadcast to the array of renderers. The algorithm runs in linear time because no logarithmic search structure such as a BVH is used to accelerate the ray-scene intersection tests. No hardware was built, but a cycle-accurate simulation was performed in software. Performance was estimated at 8.6 minutes to render a 512 by 512 pixel image of chessboard consisting of 14,537 polygons and 3 light sources. The authors estimate the speed-up from the ASM to be the same order as from the Pixel Machine [PH89], but at a much lower cost. 14. [HA96] TigerSHARK: A Hardware Accelerated Ray-tracing Engine. TigerSHARK is based on an array of general-purpose DSPs (Digital Signal Processors). 16 Texas Instruments TMS320C32 floating-point DSPs would be placed on a PCI add-in card installed within a desktop PC. These DSPs cost approximately $10 each at the time the paper was written. Each DSP is assigned a ray to work on, and the system uses ray coherence to optimize memory accesses. Only one of the 16 DSPs is attached to main memory, and the 15 other DSPs share the data retrieved by that DSP. All 16 DSPs work on similar (coherent) rays, and thus all need similar data. DSPs are connected to each other via serial links and have their own small local memories. Ray tracing is accelerated with bounding volumes. No results are given, but the authors state a 4-DSP prototype was being developed. 15. [Wri99] U.S. Patent 5,933,146 - Method and Apparatus for Constructing an Image of a Notional Scene by a Process of Ray Tracing. This patent details a VLSI chip for ray tracing, designed by Advanced Rendering Technology, Inc. The chip uses a hierarchy of bounding spheres for an acceleration technique. Groups of rays are processed at once, reducing memory bandwidth requirements. Each chip has a local ray memory / ray stack, intersection units, an eye ray computation unit, a redirection unit (computes secondary ray directions), and an output generation unit that outputs pixel RGB values. 20-bit logarithmic arithmetic is used for intersection calculations, 29 which simplifies multiplication and division but complicates addition and subtraction. The chip also supports path tracing, a global illumination technique based on ray tracing [Kaj86]. Multiple chips can be connected together to increase performance. 16. [Hal99] The AR250: A New Architecture for Ray Traced Rendering. The AR250 ray tracing chip is the successor to the chip described by Wrigley [Wri99]. It contains a custom RISC processor core and 32 32-bit IEEE-compatible floating-point units, unlike the previous design which used 20-bit logarithmic arithmetic. The chip also implements multi-dimensional noise, trigonometric, and square root functions. It is implemented on an 0.35 micron chip process, contains 650,000 gates, is 106 square millimetres in size, and runs at 50 MHz. 16 AR250 chips are connected to form a RenderDrive system, a standalone LAN-connected unit containing a host system with CPU, an embedded operating system, and flash disc boot memory. RenderDrive is not designed for real-time ray tracing, but rather as an accelerator for commercial 3D rendering software. RenderDrive reads compressed scene descriptions in Pixar’s Renderman RIB format from a LAN and renders frames in a batch-mode fashion. RenderDrive achieves nearly linear speedup with the number of AR250 processors, and renders a test scene 102 times faster than a software ray tracer running on a 333 MHz Pentium II. 17. [Hal01] The AR350: Today’s Ray Trace Rendering Processor. The AR350 is a refinement of the AR250 chip [Hal99]. The AR350 contains 64 32-bit IEEE-compatible floating point units, 1.8 million gates, is 110 square millimetres in size, and is manufactured on an 0.22 micron process. A RenderDrive system now contains 40 AR350 chips instead of 16 AR250s. No performance results were given. 18. [KSS+ 01] 3DCGiRAM: An Intelligent Memory Architecture for Photo-realistic Image Synthesis. The 3DCGiRAM consists of modules that merge memory and logic to 30 provide high bandwidth between memory and processing elements. Modules store objects within a grid subdivision and perform grid traversal and ray-object intersections with the objects stored within the module. Each module consists of a grid traversal unit, intersection calculation units (ICUs), a secondary ray generator (SRG), an intensity calculation unit and a frame buffer cache. Modules are connected together via a ray distributor which transfers rays from the CPU to the modules. The system is designed for two-pass operation, running in radiosity mode first and then using ray tracing for view-dependent image generation. Estimated performance for a scene with 419,188 patches/triangles, a 64 by 64 by 64 cell grid size, and 512 by 512 pixel resolution was 1 frame per second using 16 200MHz 3DCGiRAM modules. This is nearly 25 times faster than a software ray tracer running on an 800 MHz Pentium III CPU. 19. [DTB+ 01] Interactive Ray Tracing on Reconfigurable SIMD MorphoSys. MorphoSys is a reconfigurable SIMD (Single Instruction Multiple Data) multimedia processor that is targeted at portable devices such as cell phones and PDAs [LSL+ 99]. The device consists of 64 reconfigurable cells (RCs) controlled by a TinyRISC core. The RCs each have a 16-bit fixed-point ALU and 1 kilobyte of RAM. 64 coherent (similar) rays are processed at once, and all RCs are fed the same scene data, avoiding the need to duplicate the scene within each RC’s memory. Performance is increased with a BSP tree spatial subdivision scheme. RCs can be selectively disabled to prevent unnecessary work when performing coherent traversal of the BSP tree, since not all rays will traverse the same BSP nodes. The device is highly programmable, allowing for several different types of geometry. Rendering performance of 13.3 frames per second was achieved for 256 by 256 pixel resolution at 300MHz for 64-way parallelism. Only 0.36 frames per second was achieved for single-ray ray tracing, thus the design scales well with the number of RCs. 31 20. [SN01, Sun99] FPGA Implementation of the Ray Tracing Algorithm Used in the XPATCH Software. This paper describes an FPGA (Field Programmable Gate Array) implementation of the Slabs-AABB test. The XPATCH software simulates radar signatures of CAD models with ray tracing. The FPGAs are part of a Hardware Object Technology Development System (HOT DS) PCI expansion card. A C++ interface allows the XPATCH software to interact with the ray-AABB circuit on the PCI card. The authors present statistics for the number of FPGA cells required to implement the circuit in both a 4096-cell Xilinx XC6216 FPGA (75% required), and an 16384-cell XC6264 (19% required). The arithmetic circuits used 8 to 18-bit integer precisions. 21. [TL01] Reconfigurable Designs for Ray Tracing. This paper is a brief feasibility study for FPGA-based ray tracing. A ray tracer was designed based on a ray-sphere intersector and was implemented on a Xilinx Virtex XCV1000-5 FPGA on an RC-1000PP development board. A 32-bit fixed-point number format (8.24) is used instead of floating-point. The design runs at 16 MHz and computes roughly 5 million raysphere intersections per second. An equivalent software implementation running on an 800 MHz Pentium III only completed 2 million intersections per second. No spatial subdivision scheme or BVH is used. The authors estimate that 15.4 billion intersections per second are required to ray trace a 2000-object scene in real-time (25 frames per second) at 640 by 480 pixel resolution. They extrapolate that real-time performance could be achieved if the system comprised 10 FPGAs running at 100MHz, and software techniques reduced the number of intersections by a factor of 15. 22. [Pur01] SHARP Ray Tracing Architecture. The SHARP ray tracing architecture is implemented on a Smart Memories device which integrates an array of memory and computation units into a single chip [MPJ+ 00]. This eliminates memory-access bottlenecks faced by regular CPUs. The system uses a grid spatial subdivision scheme 32 because it is simple to implement and traversal hardware is efficient. Grid traversal performance is increased by using 4 by 4 by 4 cell “bricks” of grid cells. Each ray has a hash table of intersection-tested triangles to avoid duplicating intersection calculations for triangles overlapping multiple grid cells (called “parallel mailboxing.”) The author discusses bandwidth issues for the ray tracing algorithm and demonstrates how the high bandwidth available within Smart Memories can solve the bandwidth problems. Performance is estimated at 50 frames per second for a conference room scene rendered at 512 by 512 pixel resolution with simple shading. 23. [ABT+ 02] Interactive Ray Tracing Using a SIMD Reconfigurable Architecture. This is a continuation of Du et al’s work [DTB+ 01], using the newer MorphoSys-IIG chip instead of MorphoSys-I. The design does not use floating-point due to power and chip area requirements, but block floating-point (where several numbers share the same exponent) is used for some intermediate calculations. A rendering speed of 12.5 frames per second was achieved at 200 by 200 pixel resolution using a simulated 100MHz MorphoSys chip. However, they assume that an average of only 0.5 rays per pixel are traced, with interpolation or adaptive sampling filling in the gaps. Although the resolution is low, 200 by 200 pixels is adequate for their target application, which is a ray-traced video game on a hand-held device. 24. [LWH02] Design of a Pipelined Architecture for Ray/B´ezier Patch Intersection Computation. This paper presents a hardware algorithm for ray tracing B´ezier patches based on patch subdivision. Patches that are hit by the ray are recursively subdivided and tested for ray-patch intersection. The subdivision stops when the subdivided patches are sufficiently small, producing an accurate ray-patch intersection point. A pipelined hardware architecture is proposed but only a software implementation was tested. The brief test results only identify the bottlenecks within the system and do not show overall system performance. The authors state that a system throughput of 33 over 106 ray-patch intersections per second is achievable. No comparisons are made with other designs, such as Pulleyblank and Kapenga’s [PK87b] or Bouatouch et al [BSC89]. 25. [Fen02] The Design and Implementation of a Hardware Accelerated Ray Tracer Using the TM3a FPGA Prototyping System. This publication describes an FPGAbased hardware ray tracing system implemented on the TM3a (Transmogrifier 3a) [TM3]. The TM3a contains four interconnected Xilinx Virtex2000 FPGAs, each with 2 megabytes of memory. The ray tracer uses a BVH to increase performance, but does not implement a ray-AABB test. Only a ray-triangle test is implemented in hardware. AABBs are broken into triangles when tested for intersection. The system uses a mixture of fixed-point and pseudo floating-point, where values can be scaled by 1, 16, 256, or 4096 depending on a 2-bit flag. The hardware consists of 38 pipeline stages, runs at 50 MHz, and can complete one ray-triangle intersection per clock cycle, or 50 million ray-triangle tests per second. Performance is compared to an optimized software implementation on an 800MHz Pentium III CPU (achieving 36 million ray-triangle tests per second) and the POV-Ray ray tracer [POV] (achieving 4 million.) The author concludes that pipelined memory reads are the main contributor to the high performance of the hardware because software needs to wait for reads to finish before computations can start. 26. [SWS02] SaarCOR - A Hardware Architecture For Ray Tracing. SaarCOR is a hardware ray tracing chip design targeted towards high-performance real-time applications such as gaming and virtual reality. SaarCOR consists of three units: RGS (ray generation and shading), RTC (ray tracing core) and RTC-MI (memory interface). Multiple RGS and RTC units can operate in parallel, allowing the design to scale. SaarCOR implements a BSP/k-d tree spatial subdivision scheme. 64 similar rays are traced at once to reduce the memory bandwidth requirements and improve perfor- 34 mance. The chip also implements a form of multithreading to hide memory access delays. SaarCOR is designed to operate at 533 MHz and interface to regular PC133 SDRAM. Each chip contains 192 floating-point units (simplified and not fully IEEEcompliant), but even with this much hardware, the complexity is only half that of then state-of-the-art GPUs. Performance estimates from a cycle-accurate hardware simulator show that a scene from the Quake 3 Arena video game [Sof] containing 34,000 triangles can be rendered at a rate of 235 frames per second at 1024 by 768 pixel resolution. 27. [SLS03] A Virtual Memory Architecture for Real-time Ray Tracing Hardware. This paper describes a virtual memory scheme that permits rendering of scenes larger than the local memory of a SaarCOR hardware ray tracer. The hardware virtual-memory system implements a caching scheme to minimize accesses to the host computer’s memory over the slow PCI bus. The virtual-memory system is transparent to the ray tracing hardware and software, caching all types of data, such as geometry, textures, and shaders. Only a small amount of local memory/cache is needed on the ray tracing hardware. An 880 megabyte test scene was efficiently rendered with only 64 megabytes of cache memory. 28. [WLH03] A Pipelined Architecture for Ray/B´ezier Patch Intersection Computation. This paper is a superset of Lewis et al’s previous work [LWH02]. Additional results and analysis are presented, such as additional schematics and performance results from a software implementation on a Texas Instruments TMS320C6701 DSP chip. A portion of the algorithm (the “culler” module) was found to be a bottleneck and was implemented in an FPGA. 29. [FR03] A High-speed Ray Tracing Engine Built on a Field-programmable System. This paper is an updated version of Fender’s previous publication [Fen02] using the same 35 (or very similar) FPGA-based ray tracer design. Updated performance results show the FPGA is 0.35 to 23 times faster than a 2 GHz Pentium 4 system running the POV-Ray ray tracer [POV]. The authors estimate that a larger, higher-speed FPGA and an enhanced memory subsystem would increase performance to between 4 and 288 times that of POV-Ray. 30. [PKMS03] U.S. Patent 6,556,200 - Temporal and Spatial Coherent Ray Tracing for Rendering Scenes with Sampled and Geometry Data. This patent proposes a hardware design for coherent ray tracing with a grid subdivision scheme. Each grid cell contains a queue of rays. The system selects grid cells to process based on a heuristic that considers the amount of processing that can be completed for each cell and the ease of retrieving the required data from memory. The proposed hardware implementation is a PCI card containing a ray tracing chip and two levels of memory hierarchy. The first level consists of some EDRAM (Embedded DRAM) which is connected to multiple processing elements within the ray tracing chip. The second level contains regular DRAM connected externally to the chip. The card can also use the host machine’s memory. The higher levels of the memory hierarchy act as a cache for the lower levels. 31. [SWW+ 04] Realtime Ray Tracing of Dynamic Scenes on an FPGA Chip. This paper describes an FPGA implementation of the SaarCOR ray tracing hardware. The FPGA only runs at 90 MHz, but achieves rendering performance 3 to 5 times faster than a software implementation on a 2.66 GHz Pentium 4 CPU. The authors estimate an actual VLSI implementation of their design would be up to two orders of magnitude faster than the FPGA prototype. The device uses 24-bit floating-point numbers with 16-bit mantissas to reduce hardware costs. 32. [Lat03] U.S. Patent 6,597,359 - Hierarchical Space Subdivision Hardware for Ray 36 Tracing. This patent by Raychip, Inc. describes octree-based spatial subdivision hardware for accelerated ray tracing. The patent covers octree construction, construction during rendering (only nodes that are touched by rays are subdivided based on statistical information), node collapsing (merging children back into their parent to save resources), a caching method for scene objects, and a method for octree traversal. A method for clipping triangles against octree node boundaries is also presented. This is used for determining if any part of a triangle is within an octree node. 33. [WSS05] RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing. The RPU is the next generation of hardware ray tracer designs from the same researchers that designed SaarCOR [SWS02]. The largest improvement over SaarCOR is the high degree of programmability, allowing for programmable shading as well as programmable geometry. The design was implemented on an FPGA and runs at 66 MHz. Unfortunately, performance is slower than SaarCOR, and is similar to a software ray tracer running on a 2.66 GHz Pentium 4 CPU. The authors estimate that a VLSI implementation of their design would perform 27 times faster than their FPGA prototype. 2.6 Software Optimization versus Hardware It is clear that many attempts have been made to implement ray tracing in hardware. To date, there is only one commercially-available ray tracing chip - Advanced Rendering Technology’s AR350 [Hal01]. Unfortunately, it is only useful for accelerating batch-mode rendering. Advanced FPGA prototypes exist, such as the RPU [WSS05], but their performance is limited by the FPGA hardware. To achieve high performance, the design must be implemented in a customized VLSI chip, which is much more complex and expensive than an FPGA-based design. Ray tracing hardware has great potential for improving ray 37 tracing performance, but this potential is difficult and expensive to realize. Algorithmic improvements such as spatial subdivision and BVHs have been a far more effective means of improving performance, to date. Some of the older hardware designs focus only on improving the speed of ray-object intersection tests, and do not reduce the number of ray-object tests by using spatial subdivision or BVHs. It is difficult to generalize the benefit from using spatial subdivision or BVHs, but it can be up to several orders of magnitude for complex scenes. If these older hardware systems were actually constructed and benchmarked, it would become clear that a good software implementation using spatial subdivision or a BVH equals or outperforms the hardware for most scenes. It is almost always preferable to use better algorithms before resorting to customized hardware to achieve higher performance. Hence, all recent ray tracing hardware designs incorporate spatial subdivision or BVHs. Chapter 3 Ray-AABB Overlap Tests with Pl¨ ucker Coordinates 3.1 Introduction This chapter presents a new ray-axis aligned bounding box (AABB) overlap test based on Pl¨ ucker coordinates, called the Plu-AABB test. Pl¨ ucker coordinates are a useful tool for determining the relative orientation of lines in 3-D space [TH99, MD02, Sho98]. The Plu-AABB test works by determining the relative orientation of the ray to the box edges. 3.2 Pl¨ ucker Coordinates A directed line L passes through two points A and B in 3-space, from point A to point B. Points A and B have coordinates (Ax , Ay , Az ) and (Bx , By , Bz ). Line L can be expressed as six Pl¨ ucker coordinates L0 ..L5 : L0 = Ax By − Bx Ay L1 = Ax Bz − Bx Az L2 = Ax − Bx L3 = Ay Bz − By Az L4 = Az − Bz L5 = By − Ay A ray R can also be expressed as Pl¨ ucker coordinates R0 ..R5 . The first point in space is the ray origin O with coordinates (Ox , Oy , Oz ). The second point is the origin O plus ~ with components (Di , Dj , Dk ), giving coordinates: the direction vector D (Ox + Di , Oy + Dj , Oz + Dk ) 38 39 R L L R side(R,L) > 0 R L L R side(R,L) < 0 R R L L L R L R side(R,L) = 0 Figure 3.1: Side relation and relative orientation of the ray and line. L is perpendicular to the page and points toward the reader. side(R, L) is positive if the relative orientation is clockwise (top), negative if counterclockwise (middle), and zero if the ray and line intersect or are parallel (same or opposite directions.) Substituting into the Pl¨ ucker coefficient equations gives: R0 = Ox Dj − Di Oy R1 = Ox Dk − Di Oz R2 = −Di R3 = Oy Dk − Dj Oz R4 = −Dk R5 = Dj An important relation is side(R, L), which provides information about the relative orientation of the ray and the line (Figure 3.1). The side relation can also determine the relative orientation of two rays or of two lines, because the Pl¨ ucker coordinate representa- 40 D A N = (A-O) x (B-O) O B D 'above' plane side(R,L) = -(D dot N) side(R,L) > 0 A N = (A-O) x (B-O) O B D D 'below' plane side(R,L) = -(D dot N) side(R,L) < 0 Figure 3.2: Derivation of the side relation. tions of rays and directed lines are similar. An expression can be derived for side(R, L) as the permuted inner product of the ray’s Pl¨ ucker coordinates Ri , and the line’s Pl¨ ucker coordinates Li [YN97, BR94]. Given a line L passing through points A and B, and a ray R ~ we wish to determine the relative orientation of the with origin O and direction vector D, ~ of the ray and the line (Figure 3.2). Using a cross product, compute the surface normal N plane that contains the three points A, B, and O: ~ = (A − O) × (B − O) N 41 The relative orientation of the ray and line (or side relation) is determined by the dot ~ and N ~: product of D ~ ·N ~) side(R, L) = −(D Hence: ~ · ((A − O) × (B − O)) side(R, L) = −D Expanding and simplifying, the Pl¨ ucker coefficient equations and the side relation equation are obtained: side(R, L) = −Di ((Ay − Oy )(Bz − Oz ) − (Az − Oz )(By − Oy )) −Dj ((Az − Oz )(Bx − Ox ) − (Ax − Ox )(Bz − Oz )) −Dk ((Ax − Ox )(By − Oy ) − (Ay − Oy )(Bx − Ox )) side(R, L) = −Di (Ay Bz − By Az ) + Dj (Ax Bz − Bx Az ) −Dk (Ax By − Bx Ay ) + (Ox Dk − Di Oz )(By − Ay ) + (Ox Dj − Di Oy )(Az − Bz ) + (Oy Dk − Dj Oz )(Ax − Bx ) Substituting back in the definitions of the Pl¨ ucker coordinates we get the permuted inner product: side(R, L) = R2 L3 + R5 L1 + R4 L0 + R1 L5 + R0 L4 + R3 L2 The need to store R2 , R5 , and R4 can be removed by substituting for −Di , Dj , −Dk : side(R, L) = −Di L3 + Dj L1 − Dk L0 + R1 L5 + R0 L4 + R3 L2 Pl¨ ucker coordinates can also be used for ray-convex polygon intersection tests (Figure 3.3). If the polygon vertices are defined counterclockwise, then the side relations for the ray and each of the edges must be positive or zero for the ray to intersect the front side 42 A side(R,AB) > 0 side(R,BC) > 0 side(R,CA) > 0 A side(R,AB) < 0 side(R,BC) > 0 side(R,CA) > 0 MISS HIT R C B R C B Figure 3.3: Ray-polygon intersection with Pl¨ ucker coordinates. of the polygon [AC97]. Alternatively, each of the side relations must be negative or zero for the ray to intersect the back side of the polygon. If all the side relations are zero, the ray lies within the plane of the polygon. The intersection distance along the ray is not produced by the side relations and must be computed separately. 3.3 Pl¨ ucker Ray-AABB Overlap Test The Pl¨ ucker coordinate convex polygon test may be applied to the problem of determining if a ray and an AABB overlap. Normally, each face of the box (or planes of the faces, when using the Slabs-AABB method in Section 2.3.1) would be tested to determine if the ray passes through one of them, and hence intersects the box. The number of faces tested can be reduced from six to three by classifying the ray based on the signs of its direction vector components (Di , Dj , Dk ). For a box, only three faces are visible at once from the ray’s point of view. For example, a ray with all three components negative (classified M M M , for “minus minus minus”) will pass through one of the three faces F EHG, CGHD or BF GC if it intersects the box (Figure 3.4.) The other seven ray classes are M M P, M P M, M P P, P M M, P M P, P P M, 43 Y MMM Ray B (xmin,ymax,zmin) F (xmin,ymax,zmax) C (xmax,ymax,zmin) G (xmax,ymax,zmax) A (xmin,ymin,zmin) MMM Ray D (xmax,ymin,zmin) X E (xmin,ymin,zmax) H (xmax,ymin,zmax) Z Figure 3.4: Axis-aligned bounding box and MMM rays. and P P P with the letters M and P corresponding to the signs of the Di , Dj , and Dk components of the ray. Each ray class is tested against a different set of three box faces. The ray class can be determined when the ray is created and stored as 3-bit value within the ray. To determine if a ray intersects the box, the side relation of the ray and several of the box’s edges must be computed. The Pl¨ ucker coefficients for the box edges are simpler than those of an arbitrary convex polygon due to the axis-aligned nature of the faces. When computing the coefficients, the edges may be treated either as rays or as lines passing through two points in space. Equivalent results are produced with either method, the ray form is used here as it results in a shorter derivation. For example, the coefficients for edge F E with origin (xmin, ymax, zmax) and direction 44 (0, −1, 0) are: F E0 = −xmin F E1 = 0 F E2 = 0 F E3 = zmax F E4 = 0 F E5 = −1 Consequently, the side relation of R and edge F E is: side(R, F E) = −R1 − Di ∗ zmax + Dk ∗ xmin Instead of testing each of the four edges per front-facing box face (maximum 12 tests), the test is optimized so that only the edges of the silhouette of the box are tested. Thus, only a maximum of six tests are needed. For example, an M M M ray is tested against box edges BF, F E, EH, HD, DC, and CB. It does not matter that the six edges are not within the same plane, as the convexity of the silhouette and counterclockwise ordering of the edges is preserved as long as the ray is M M M . A different set of edges is evaluated for each of the seven other ray classes. Hence for an M M M ray, the algorithm is as follows: if(side(R,DH) > 0) then MISS else if(side(R,BF) < 0) then MISS else if(side(R,EF) > 0) then MISS else if(side(R,DC) < 0) then MISS else if(side(R,BC) > 0) then MISS else if(side(R,EH) < 0) then MISS else HIT 45 Depending on the directions of the edges, some of the inequalities may be reversed. For example, the side relation and inequality side(R, EF ) > 0 can be rewritten as side(R, F E) < 0. A separate test is needed to determine if the box is on the positive portion of the ray, because the Pl¨ ucker test treats rays like lines, which are infinitely long in both directions. This can be determined by comparing the ray origin to the AABB coordinates. For an M M M ray, if the ray origin Ox < xmin, or Oy < ymin or Oz < zmin, then the ray does not overlap the AABB. The ray is still assumed to be infinitely long in the positive direction, but a similar additional test could be used for a truncated ray if the x, y, and z coordinates of the ray’s endpoint are known. Computation and storage of the ray’s Pl¨ ucker coordinates R0 , R1 and R3 can be eliminated. Before each ray-box test, subtract the ray’s origin O from the box’s edge coordinates, producing new temporary box coordinates used for the intersection test. The ray’s origin is now assumed to be (0, 0, 0), which makes the ray’s R0 , R1 , and R3 Pl¨ ucker coordinates equal to zero in the side relation equations. Hence, no Pl¨ ucker coordinates need to be computed or stored to use this technique despite it being based on Pl¨ ucker coordinates. This permits integration into existing ray tracing systems. C++ code for case “MMM” of the Plu-AABB test is given in Figure 3.5. This code includes the optimization which removes the need to store Pl¨ ucker coordinates for the ray by subtracting the ray origin from the box coordinates. The source code in Appendix D provides full details of the Plu-AABB test for all eight ray direction cases, implemented in a ray-BVH traversal algorithm. The Plu-AABB test does not, by itself, compute an intersection distance. This is easily added by computing the distance along the ray to the three closest planes of the box, once an intersection has been confirmed. The three closest planes are already known from the ray classification. The correct distance is the largest of the three distances (Figure 3.6 46 bool PluAABB_MMM(AABB *aabb, Ray *ray) { if((ray->Ox < aabb->xmin) || (ray->Oy < aabb->ymin) || (ray->Oz < aabb->zmin)) return false; float float float float float float xa ya za xb yb zb = = = = = = aabb->xmin aabb->ymin aabb->zmin aabb->xmax aabb->ymax aabb->zmax if((ray->Di * ya (ray->Dj * xa (ray->Dk * xa (ray->Di * za (ray->Dj * za (ray->Dk * ya return false; return true; - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; - ray->Dj * - ray->Di * - ray->Di * - ray->Dk * - ray->Dk * - ray->Dj * /* miss */ xb yb zb xb yb zb < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) /* hit */ } Figure 3.5: C++ code for “MMM” Plu-AABB test. ymax tx box ty ymin ray xmin xmax Figure 3.6: Largest intersection distance: The two closest planes are xmin and ymin, giving intersection distances tx and ty . The desired intersection distance is tx (the largest.) 47 shows a 2D example.) The distances to the planes can be computed with either division or by multiplication with IRDVCs. The Plu-AABB test has several useful characteristics. The technique is free of division operations, unless an intersection distance is needed. Even then, division can be avoided by multiplying by IRDVCs when computing ray-face intersection distances (Section 2.3.1.) The Slabs-AABB method can also become division-free by multiplying by IRDVCs. The difference is that the Plu-AABB test does not require precomputed IRDVCs to avoid division, provided that an intersection point is unnecessary. Also, subtraction of the ray origin from the AABB coordinates is simply a vector subtraction. Computation of the six side relations are independent of each other, allowing them to be computed in parallel. Computation of a side relation is simply a 2-term dot product when using the code in Figure 3.5. This algorithmic simplicity should permit vectorization of the code, potentially improving performance substantially. This could be implemented using Intel’s SSE vector instructions [Int02] that are available on all Pentium III and newer Pentium-family microprocessors. Similar vector instructions are also available on several PowerPC microprocessors [App05]. 3.4 Numerical Issues An important aspect of any ray-AABB test is numerical issues such as NaNs, infinities, and overall stability. NaNs or infinities never occur within the Plu-AABB test, unlike Slabs-AABB. The PluAABB test does not perform any division, and is composed of only additions, subtractions, multiplications, and comparisons. Therefore, the results of the operations are limited to finite ranges. These ranges are based on the range of the ray direction vector components (always [-1.0, 1.0]) and the range of the box coordinates (the scene boundaries when used for ray tracing.) The Plu-AABB test is not without potential problems, however. The ray-edge tests 48 are still subject to FP imprecision. For example, one of the six ray-box edge tests for the M M M ray code in Figure 3.5 is: float ya = aabb->ymin - ray->Oy; float xb = aabb->xmax - ray->Ox; if(ray->Di * ya - ray->Dj * xb < 0) then MISS First, consider the case where the ray is parallel to the edge (and the z-axis). In this case, ray→Di and ray→Dj are both zero and the result of the ray→Di∗ya−ray→Dj ∗xb computation is exactly zero. This is not a box miss because the ray can still pass through the box when the ray is parallel to an axis. Thus, the algorithm works correctly for axisparallel rays provided the comparison is a “less-than zero” and not a “less-than-or-equal-to zero.” Another potential problem occurs when the ray intersects the edge or is very near the edge, but is not parallel to the edge. This causes a result that is close to zero, and floating-point imprecision can cause problems by introducing a small amount of error into the result. The error might result in a slightly negative value (a miss), when a hit (a zero or positive value) should have been computed. Fortunately, such edge- and corner-grazing intersections are uncommon and the geometry enclosed by the box is seldom touching the box edges or corners. Thus, even if the box is accidentally missed when it should have been hit, there would probably not be any geometry in that part of the box anyway. A partial solution to this problem is discussed in Section 4.3. If the box is accidentally hit when it should have been missed, there is no potential for an error in the rendered image. Unnecessary BVH traversal and ray-object tests will result, however. 49 3.5 Ray-AABB Performance The performance of the Plu-AABB test was compared to the Slabs-AABB and SAT-AABB tests using both single- and double-precision floating-point. The tests measured only rayAABB testing speed by measuring the time required to test a set of rays against a set of AABBs. Performance of the Plu-AABB test in a ray tracer is evaluated in Chapter 4. To test the algorithms, 500,000 pairs of random rays and boxes were generated beforehand and stored in arrays. Each ray/box pair was tested for intersection 100 times, for a total of 50 million intersection tests. It was necessary to repeat the tests 100 times to ensure the run time was long enough to effectively measure. The percentage of ray/box hits was varied between 0%, 50% and 100%. This was done by predetermining which ray/box pairs intersected and then storing the appropriate mixture in the ray/box arrays. The same set of ray/box pairs was used for each test, provided the hit percentages were the same. The correctness of the Plu-AABB test was verified by comparing its results to those of the Slabs-AABB and SAT-AABB tests for all 500,000 ray/box pairs. This comparison was done separately to avoid influencing the run-time of the tests. The testing environment was a PC with a Pentium 4 3.0GHz CPU with 512K cache and HyperThreading, 800 MHz bus, 2GB RAM, running Windows XP, and using Visual C++ .NET 2003. Compiler flags were: /O2 /G7 /GA /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /GS /arch:SSE2 /Yu"stdafx.h" /Fp"Release/JGT.pch" /Fo"Release/" /Fd"Release/vc70.pdb" /W3 /nologo /c /Wp64 /Zi /TP Timings for the methods are presented in Table 3.1 and Table 3.2. The different tests are broken down as follows: • plu: The Plu-AABB test. Determines whether the ray hits the box, but does not compute an intersection distance. 50 • plu-int-div : The same as “plu” except it computes an intersection distance. The intersection distance is computed as an additional step, since the Plu-AABB test does not compute distances. Division is used to compute the intersection distance. • plu-int-mul : Similar to “plu-int-div”, except the intersection distances are computed by multiplying by IRDVCs. • plu-cff : Similar to “plu”, except it uses 3 precomputed Pl¨ ucker coordinates R0 , R1 , and R3 that must be computed once for each ray and carried within the ray. • plu-cff-int-div : The same as “plu-cff”, except an intersection distance is computed as an additional step. Division is used to compute the intersection distance. • plu-cff-int-mul : Similar to “plu-cff-int-div”, except the intersection distances are computed by multiplying by IRDVCs. • slabs-div : The Slabs-AABB test, which determines if the ray hits the box and computes intersection distances using division. • slabs-mul : The Slabs-AABB test, which determines if the ray hits the box and computes intersection distances using multiplication by IRDVCs. • slabs8-div : The Slabs-AABB test, divided into eight optimized ray direction cases. The individual cases are simpler than the “slabs-div” test because several comparisons can be eliminated within the code. The correct distance order of the ray-plane distances is already known for each case. The intersection distances are computed with division. • slabs8-mul : The Slabs-AABB test, divided into eight optimized ray direction cases, and using multiplication by IRDVCs to compute the intersection distances. Source code for this Slabs-AABB variant used for ray-BVH traversal is provided in Appendix E. 51 SINGLE PRECISION hit/miss 0M/50M 25M/25M plu 2.3 2.4 plu-int-div 2.3 3.1 plu-int-mul 2.3 2.8 plu-cff 2.2 2.3 plu-cff-int-div 2.3 3.1 plu-cff-int-mul 2.3 2.8 slabs-div 3.9 4.8 slabs-mul 3.3 3.9 slabs8-div 3.6 4.2 slabs8-mul 3.0 3.4 sat 2.2 2.2 50M/0M 2.5 3.9 3.2 2.4 3.9 3.2 5.6 4.5 4.6 3.7 2.1 Table 3.1: Single-precision ray-AABB performance. Times are measured in seconds, lower is faster. • sat: The separating axis theorem line-segment-AABB test, modified for performing ray-AABB overlap tests. Infinitely-long rays are converted to finite line segments by assigning appropriate endpoints. The results show that the Pl¨ ucker methods are considerably faster than Slabs-AABB, even when computing intersection distances. Times for the single precision Slabs-AABB tests ranged from 3.0-5.6 seconds, while the Plu-AABB tests ranged from 2.2-3.9 seconds. Double precision times for Slabs-AABB ranged from 4.0-6.9 seconds, while the Plu-AABB tests ranged from 3.4-5.1 seconds. The performance difference between the “plu” and “plu-cff” tests was negligible. At most, the difference was 0.1 seconds, or only a few percent. This is reasonable because using precomputed Pl¨ ucker coordinates R0 , R1 and R3 in “plu-cff” does not simplify the ray-AABB test much. Hence, the added overhead of using precomputed Pl¨ ucker coordinates is not worthwhile. From hereon, the Plu-AABB test used in this thesis will be based on “plu.” 52 DOUBLE PRECISION hit/miss 0M/50M 25M/25M plu 3.5 3.6 plu-int-div 3.5 4.3 plu-int-mul 3.5 3.8 plu-cff 3.4 3.6 plu-cff-int-div 3.5 4.3 plu-cff-int-mul 3.5 3.8 slabs-div 4.9 6.0 slabs-mul 4.3 4.8 slabs8-div 4.5 5.2 slabs8-mul 4.0 4.2 sat 3.5 3.7 50M/0M 3.8 5.1 4.1 3.8 5.0 4.1 6.9 5.3 5.9 4.5 3.9 Table 3.2: Double-precision ray-AABB performance. Times are measured in seconds, lower is faster. Multiplying by IRDVCs instead of dividing by ray direction vector components (“mul” vs. “div” tests) to compute intersection distances also improved performance. Performance improved by up to 24% for single precision and up to 31% for double precision. Dividing the Slabs-AABB test into eight optimized cases based on ray direction significantly improved the performance, up to 22% for single precision and up to 18% for double precision. The “sat” test was the fastest test for single precision, and was the third fastest for double precision, after “plu” and “plu-cff.” Unfortunately, the “sat” test is somewhat awkward for ray tracing (see Section 2.3.3), so the slightly slower Plu-AABB test is preferred. Chapter 4 Ray-BVH Traversal with the Plu-AABB Test 4.1 Introduction Chapter 3 showed that the Plu-AABB test is significantly faster than the Slabs-AABB tests. This chapter benchmarks the ray-AABB tests in an actual BVH to determine their overall benefit to ray tracing. First, the test scenes used throughout this thesis are presented. Next, efficient BVH construction is discussed. Following that, rendering performance using the Plu-AABB and Slabs-AABB tests is compared. Finally, a new method to determine optimal BVH traversal order is introduced. 4.2 Test Scenes Before any performance measurements can be made, some test scenes are needed. Eight test scenes are used in this thesis, varying from thousands to millions of triangles (Figures 4.1 and 4.2.) The Bunny [Sta05], Branch [TH05], Poppy [Mac04], Power plant [Uni05], and Lucy [Sta05] models were obtained from other researchers or from the Internet. Test scenes contain triangles and point light sources. Test scene statistics are broken down into the number of triangles, the number of light sources, the number of BVH nodes, and the number of primary, secondary, and shadow rays. Rendering was performed using single-precision floating-point, except where stated otherwise. 53 54 Spheres Triangle mesh spheres - solid, reflective, and refractive Triangles: 24,578 Lights: 2 BVH Nodes: 11,339 Primary rays: 4,194,304 Secondary rays: 1,220,088 Shadow rays: 4,658,758 Bunny Psychedelic reflective Stanford bunny in a colored box Triangles: 69,463 Lights: 2 BVH Nodes: 33,377 Primary rays: 4,194,304 Secondary rays: 1,565,756 Shadow rays: 8,380,660 Branch Branch model from the U of Calgary Graphics Jungle Triangles: 312,706 Lights: 2 BVH Nodes: 155,257 Primary rays: 4,194,304 Shadow rays: 6,057,826 Poppy Poppy model from the U of Calgary Graphics Jungle Triangles: 395,460 Lights: 2 BVH Nodes: 187,551 Primary rays: 4,194,304 Shadow rays: 5,527,312 Figure 4.1: Test scenes 1-4. 55 Powerplant - Front Front view of the UNC power plant model Triangles: 12,748,512 Lights: 1 BVH Nodes: 6,374,789 Primary rays: 4,194,304 Secondary Rays: 1,112,742 Shadow rays: 3,257,733 Powerplant - North North/bottom view of the UNC power plant model Triangles: 12,748,512 Lights: 3 BVH Nodes: 6,374,789 Primary rays: 4,194,304 Secondary Rays: 57,961 Shadow rays: 12,593,457 Powerplant - Boiler Inside the boiler of the UNC power plant model Triangles: 12,748,512 Lights: 1 BVH Nodes: 6,374,789 Primary rays: 4,194,304 Shadow rays: 4,194,304 Lucy Angelic statue Triangles: 28,057,792 Lights: 2 BVH Nodes: 13,548,145 Primary rays: 4,194,304 Secondary rays: 1,556,156 Shadow rays: 11,500,918 Figure 4.2: Test scenes 5-8. 56 4.3 BVH Construction This section focuses on efficient construction of the BVH, which can be challenging because there is no single “best” algorithm. See Section 2.2 for an overview of BVHs and BVH traversal. First, a BVH node data structure must be defined. A BVH node has three parts: 1. The coordinates of the AABB: xmin, ymin, zmin, xmax, ymax, and zmax. 2. The number of triangles enclosed by the node, if a leaf node (numT ris.) This value can be negative to indicate that the node is a branch node (also called an interior node.) A value of −1 denotes a branch/interior node that was split along the x axis, −2 for the y axis, and −3 for the z axis. This information is used to improve the BVH traversal order with the DSA algorithm explained in Section 4.5. 3. A pointer to two children (child0 and child1) if a branch/interior node, or a pointer to a list of objects enclosed by the node, if a leaf node. There is only one pointer used for both children, as both children can be allocated at once as an array of two objects. This eliminates the need for a second pointer and reduces the amount of memory management. The BVH construction technique used for this research is loosely based on Shirley and Morley’s algorithm [SM03]. The BVH was constructed in a recursive, top-down fashion, similar to a k-d or BSP tree. It is possible to construct the BVH in a bottom-up fashion [GS87], but Havran showed that type of BVH to be inefficient [Hav01]. (Unfortunately, Havran did not test top-down BVHs.) Each BVH branch/interior node has two children. Children are constructed by splitting the set of triangles enclosed by a node into two sets, based on comparing each triangle’s center point to a splitting plane. The splitting plane is perpendicular to the x, y, or z 57 (0,0) (0,0) child0 Y-axis split, y = 0.5 child0 Center point X = 0.8, Y = 0.2 Center point X = 0.5, Y = 0.8 child1 Center point X = 0.8, Y = 0.2 Center point X = 0.5, Y = 0.8 child1 (1,1) (1,1) X-axis split, x = 0.5 Child1 contains both objects because the pentagon’s X ≥ 0.5 and the triangle’s X ≥ 0.5. Child0 is empty. Try a Y-axis split instead. Child0 contains the pentagon because its Y < 0.5. Child1 contains the triangle because its Y ≥ 0.5. Thus, an empty child is avoided. Figure 4.3: Empty BVH child, 2D example. axis, and its orientation alternates in round-robin fashion as the BVH is constructed. Each node’s bounding box is the minimum AABB that encloses all of its objects or children. When constructing a BVH, it is important to avoid empty child/leaf nodes which waste memory and cause excess BVH traversal. There is a simple and effective strategy to avoid this problem. If subdividing a node along an axis produces an empty child, the next axis is tried. If subdividing along all 3 axes produces empty children, then perform an arbitrary split by putting half of the objects in each child, regardless of their positions. Figure 4.3 shows a situation that can result in an empty BVH child node. I have observed up to a 30% improvement in rendering speed when using this strategy to avoid empty nodes. The hierarchy termination criteria is 6 or fewer objects per node, and a maximum BVH depth of 60 levels. The value of 6 objects per node was experimentally determined to both minimize rendering time and use memory efficiently. Values of less than 4 increased 58 rendering time, as did values greater than 6. Values of 4-5 slightly improved speed for some cases but significantly increased BVH memory usage. Analysis of rendering performance versus objects per node is presented in Appendix B. Appendix B also shows that a value of 6 objects per node results in approximately half as many BVH nodes as triangles in the scene. None of the scenes tested (Section 4.2) reached the maximum hierarchy depth of 60 when using a value of 6 or fewer objects per node. Limiting the tree depth to a lower value did not seem to improve performance. The BVH algorithm used here does not seem prone to “running away” and producing hierarchies of excessive depths. It is possible to apply heuristics to BVH construction, such as the Surface Area Heuristic [MB89]. The Surface Area Heuristic subdivides a node if the cost of subdividing is less than the cost of not subdividing. The cost value is based on the surface area of the node and the number of objects contained within. The use of heuristics has been analyzed in detail for k-d or BSP trees with good results [Hav01]. Detailed investigation has not been performed for the BVH, although my own (undocumented) experiments suggest that the benefit is similar to that achieved for k-d trees. There are a few potentially unstable situations that can occur with a BVH. The first is when an AABB has zero volume. This can happen, for example, when an AABB contains a single triangle that has its surface normal parallel to a coordinate axis. The ray-AABB test may not behave correctly when testing boxes that are actually only 2-D rectangles. Another potential problem is when the AABB has non-zero volume, but an object is fully contained within a box face. This raises the question of what is considered to be “inside” the box. Is “inside” in the x- axis direction, for instance, defined as [xmin, xmax] or (xmin, xmax)? These problems can be handled with careful programming, but it is better to avoid them in the first place. One solution is to slightly increase the size of the box by adding/subtracting 59 an epsilon value from the AABB coordinates. This research uses an epsilon value of 5∗10−7 for a scene that is scaled to fit a [−1, +1] coordinate system. If the scene is larger, the epsilon value must be scaled accordingly. This expands the box so it is guaranteed to fully and unambiguously enclose all of its objects (such as the triangle contained within the box face), and ensures no boxes have zero volume. The epsilon value of 5 ∗ 10−7 is an educated guess, and may be generous. The machine epsilon for 32-bit floating-point numerics is approximately 6 ∗ 10−8 , which is the effective coordinate system resolution near the coordinate value of 1.0. The resolution improves as numbers approach zero, but we must assume the worst-case scenario. Hence, epsilon should be no smaller than the machine epsilon value of 6 ∗ 10−8 . Testing showed no significant difference in rendering time or BVH efficiency when using the smaller 6∗10−8 epsilon value, so it is better to err on the side of caution and use 5 ∗ 10−7 . If the epsilon value is too large, it will reduce the effectiveness of the hierarchy because the boxes will have been expanded by too much. This causes “false hits” of boxes and excess hierarchy traversal. Expanding the box also partially addresses the problem of the uncertainty in the rayAABB test when a ray grazes an edge or corner (Section 3.4) by providing a margin of safety. I am unsure if all false misses due to edge- and corner-grazing intersections can be eliminated simply by expanding the AABB, or the amount of expansion that is necessary to eliminate the problem. This remains to be investigated. The BVH construction algorithm should avoid subdividing a node along a particular axis if its AABB is too narrow along that axis. Instead, the construction algorithm should try the next axis. An node’s AABB should be at least 2 * epsilon or 1 ∗ 10−6 wide along an axis before attempting a split along that axis. If the AABB is too narrow along all 3 axes, the node is made a leaf node. The AABB width used here is the actual width, not the padded width which adds the 5 ∗ 10−7 epsilon value as previously described. Appendix C presents C++ code that implements the BVH construction technique de- 60 Scene BVH-Slabs Spheres 26.9s Bunny 30.0s Branch 21.3s Poppy 14.7s Powerplant-Front 82.1s Powerplant-North 233s Powerplant-Boiler 441s Lucy 58.6s BVH-Plu 24.6s (9% faster) 26.8s (12% faster) 18.5s (15% faster) 13.2s (11% faster) 70.8s (16% faster) 184s (27% faster) 366s (20% faster) 49.6s (18% faster) Table 4.1: Rendering times for BVH-Plu versus BVH-Slabs ray-BVH traversal. scribed in this section. 4.4 BVH-Plu versus BVH-Slabs The Pl¨ ucker coordinate ray-AABB overlap test from Section 3.5 that is most suitable for ray-BVH traversal is the “plu” test, also called Plu-AABB. This method is broken into eight cases based on the ray classification, it does not compute a ray-AABB intersection distance, and no Pl¨ ucker coordinates are stored within the ray. Storing Pl¨ ucker coordinates increases storage requirements, and the results in Section 3.5 show that the benefit is too small to be worthwhile. The BVH traversal algorithm that uses the Plu-AABB test is called BVH-Plu. The fastest Slabs-AABB algorithm tested in Section 3.5 is the “slabs8-mul” test. This method is divided into eight cases based on the ray classification, and computes ray-plane intersection distances by multiplying by IRDVCs. This algorithm is used for ray-BVH traversal in this chapter, and is called BVH-Slabs. (The “sat” test is actually faster for the single-precision tests in Section 3.5, but its use in a ray tracer is awkward, and was not tested in this chapter.) Both BVH-Plu and BVH-Slabs always traverse child nodes in the same order: child0 and then child1. Section 4.5 discusses improving the traversal order to increase performance. 61 Rendering times for BVH-Slabs and BVH-Plu are compared in Table 4.1. An improvement of 9% to 27% in rendering times was achieved by using BVH-Plu. This is a good, consistent result, however it is less than the speedup achieved in Section 3.5, which shows that single-precision “plu” is between 30% and 48% faster than “slabs8-mul.” There are several reasons for this discrepancy. The tests in Section 3.5 always used rays of infinite length. A ray tracer must also handle finite-length rays. This adds an additional set of comparisons to the BVH-Plu code that were not present in the “plu” test code. Specifically, the endpoint of a finite-length ray must also be checked against the AABB, not just the origin. Thus, some performance is lost due to the additional comparisons. The ray-AABB tests in Section 3.5 were benchmarked by testing pre-generated rays and AABBs. These were stored in two arrays: rays[] and aabbs[]. A loop tested rays[i] with aabbs[i], linearly stepping through both arrays. This is extremely efficient because data is contiguous in memory and is accessed sequentially, letting the CPU reduce some of the memory access delays with tricks such as pre-fetching. In a ray tracer, BVH nodes are typically not accessed or stored in sequential fashion, so the CPU waits longer for data to be fetched from memory. A portion of the rendering time is also spent intersecting geometry, generating rays, computing textures and shading. Profiling shows that the majority of the rendering time is spent traversing the BVH, however. Additionally, BVH traversal is implemented as recursive function calls, while the tests in Section 3.5 were performed with a simple loop. The recursion adds overhead to both BVH-Slabs and BVH-Plu, diminishing the overall percentage improvement from faster ray-AABB testing. The algorithms must also make a choice based on whether the node is a branch or a leaf, adding more overhead. Although this section shows that Plu-BVH is up to 27% faster than BVH-Slabs, these results are only valid for one particular testing environment. Using a different compiler, compiler optimization flags, CPU, or slightly different C++ code may increase or decrease 62 the performance gap between the two methods. With this in mind, it is unwise to conclude that BVH-Plu will always outperform BVH-Slabs. Ideally, a ray tracer will implement both and choose the faster one based on some performance tests that are automatically run by the installation program. 4.5 Child Node Testing Order BVH traversal performance can be greatly improved by traversing a BVH node’s children in their distance order along the ray. This is normally accomplished by computing the distance to each child’s AABB, and traversing the closest child first. This is simple for the BVH-Slabs traversal because the Slabs-AABB test computes the ray-AABB intersection distance. It is a potential problem for the BVH-Plu traversal because the Plu-AABB test does not compute intersection distances. Fortunately, it is not necessary to know the intersection distances to traverse in approximately the distance order. This research introduces a new method called “DSA”, for ”Direction and Split Axis.” Only the direction classification (e.g. M M M ) and the axis of the splitting plane (e.g. x, y, or z) that separated the children during BVH construction are needed. Storing the splitting plane axis consumes negligible storage, as it can be incorporated into the numT ris field in the AABB structure (see Section 4.3.) The DSA child node traversal order is presented in Table 4.2. The DSA algorithm follows the convention that child0 is on the negative side of the splitting plane, and child1 is on the positive side. The DSA logic can be implemented with one comparison test for 6 out of 8 cases. The other two cases (M M M and P P P ) traverse in the same order for all 3 splitting axes. Note that DSA will not always give the same child node ordering as traversing based on actual intersection distances, because the nodes can overlap in odd ways. Regardless of the actual intersection distances, the majority of child0’s objects should be on the negative side of the partition, and the majority of child1’s objects should be on 63 Ray direction Splitting plane MMM x y z MMP x y z MPM x y z MPP x y z PMM x y z PMP x y z PPM x y z PPP x y z Traversal order child1, child0 child1, child0 child1, child0 child1, child0 child1, child0 child0, child1 child1, child0 child0, child1 child1, child0 child1, child0 child0, child1 child0, child1 child0, child1 child1, child0 child1, child0 child0, child1 child1, child0 child0, child1 child0, child1 child0, child1 child1, child0 child0, child1 child0, child1 child0, child1 Table 4.2: DSA algorithm. the positive side. This will almost always be true even if the boxes overlap in strange ways. Combining DSA and the BVH-Plu traversal gives the BVH-Plu-DSA traversal algorithm. Source code implementing BVH-Plu-DSA is presented in Appendix D. Interestingly, a similar technique was independently derived by Mansfield [Man02] during this research. His technique differs from BVH-Plu-DSA because his ray-AABB tests and BVH traversal are not split into eight separate cases based on the ray direction sign bits. Instead, the ray direction vector is stored as a 3-element array. The splitting plane axis for the current node becomes the array index, and the indexed value is sign-tested to determine the node traversal order. Mansfield observed performance increases of up to a 64 Scene BVH-Plu Spheres 24.6s Bunny 26.8s Branch 18.5s Poppy 13.2s Powerplant-Front 70.8s Powerplant-North 184s Powerplant-Boiler 366s Lucy 49.6s BVH-Plu-DSA 21.4s (15% faster) 22.4s (20% faster) 18.9s (2% slower) 12.1s (9% faster) 33.9s (109% faster) 153s (20% faster) 110s (233% faster) 49.6s (–) Table 4.3: Performance increase from DSA algorithm. Scene BVH-Slabs Spheres 26.9s Bunny 30.0s Branch 21.3s Poppy 14.7s Powerplant-Front 82.1s Powerplant-North 233s Powerplant-Boiler 441s Lucy 58.6s BVH-Slabs-Dist BVH-Slabs-DSA 23.7s 23.4s 27.2s 24.6s 20.8s 20.9s 13.0s 13.5s 38.6s 37.8s 231s 190s 135s 133s 62.2s 56.3s BVH-Plu-DSA 21.4s 22.4s 18.9s 12.1s 33.9s 153s 110s 49.6s Table 4.4: DSA vs. distance-order. factor of 2.6, although the increase was highly scene-dependent. To test the effectiveness of DSA, the BVH-Plu-DSA traversal was compared to BVHPlu, which always tests child0 before child1 regardless of the splitting plane axis. Rendering times are presented in Table 4.3. The results show that for some scenes, DSA can improve performance by up to 233%. The Branch scene slows down by 2%. The BVH-Slabs traversal may also be improved by using the DSA traversal order. This algorithm is thus called BVH-Slabs-DSA. It is also useful to compare BVH-Slabs-DSA to a BVH-Slabs algorithm that traverses children based on actual distances along the ray (called BVH-Slabs-Dist.) This will indicate the effectiveness of DSA versus using actual distances. Test results are presented in Table 4.4. The BVH-Slabs-DSA algorithm equals or out- 65 performs BVH-Slabs-Dist on 6 out of 8 tests, confirming the effectiveness of DSA. DSA is particularly effective on Powerplant-north versus traversing in distance order. BVH-PluDSA also outperforms BVH-Slabs-DSA on every test scene, confirming the effectiveness of the Plu-AABB test. Since DSA is at least as effective as using actual distances, it does not matter that the Plu-AABB test does not compute intersection distances to the AABBs like the SlabsAABB tests. Thus, the Plu-AABB test is a suitable alternative to the “slabs”-based tests for ray tracing with BVHs. Chapter 5 Memory-Conserving BVHs with Coherent Ray Tracing 5.1 Introduction Bounding volume hierarchies (BVH) can consume a large amount of memory for complex scenes. In this chapter, a hierarchical scheme for encoding BVHs is presented that reduces the BVH storage requirements by 63% - 75%. The computational overhead of the scheme can be reduced to negligible levels by shooting bundles of rays through the BVH (coherent ray tracing). This gives the speed of a coherency-based ray tracer combined with substantial memory savings. The test scenes in Section 4.2 show that roughly half as many BVH nodes are needed as there are objects in the scene. Hence, for a scene with 28 million objects (such as the Lucy test scene), 13.5 million BVH nodes are needed. With so large an overhead, it is important to minimize the storage requirements of the BVH. Each BVH node consumes 32 bytes when 32-bit floating-point is used. Hence, a BVH containing 13.5 million nodes will consume 432 million bytes of memory. Reducing this overhead allows the rendering of larger scenes. The bulk of the storage within the BVH node is for the 32-bit floating-point axisaligned bounding box (AABB) coordinates. A hierarchical encoding scheme can reduce the coordinates to only a few integer bits, while still maintaining the effectiveness of the BVH. In this chapter, the hierarchical AABB coordinates are represented as 4-bit and 8bit integers - a large savings over 32-bit floating point. Any integer precision can be used, but 4 and 8-bits were chosen because they are a convenient size and pack tightly into the 66 67 BVH node structure. Unfortunately, the hierarchically encoded AABB coordinates must be converted back into regular floating-point coordinates during ray tracing, adding overhead. This overhead can be amortized over many rays with coherent ray tracing, reducing it to acceptable levels. Hierarchical encoding is also used in the QSplat multiresolution point rendering system [RL00]. In QSplat, point samples are encoded as a hierarchy of bounding spheres. Each bounding sphere’s position and size are stored relative to the parent’s values as fractions between 1/13 and 13/13. Recognizing that only a subset of the possible 134 values are valid (because many values represent children that aren’t fully enclosed by their parent an impossible situation for the BVH), only 13 bits are needed to represent the position and radius of each bounding sphere. The authors also propose that the QSplat bounding sphere hierarchy can be used for accelerating ray tracing, but do not discuss the idea further. 5.2 Hierarchical BVH Encoding Examining the BVH node structure (see Section 4.3) identifies ways to save memory. The AABB coordinates consume 4 bytes per component if using single precision floating point, for a total of 24 bytes. The numTris field and the children/object list pointer can be assumed to be 4 bytes each, totaling 8 bytes. Hence, each BVH node consumes 32 bytes. Hierarchical encoding can reduce the 32-bit floating-point values to only 4 or 8 bits each. The hierarchical encoding scheme stores AABB coordinates relative to the parent node’s AABB. Using 8-bit coordinates, each node has its own integer coordinate system in the range [0, 255] in x, y, and z. (When using 4-bit coordinates, the range is [0, 15].) The node’s children are represented by coordinate values in this [0, 255] coordinate system. These children have their own [0, 255] internal coordinate system, and so forth. When rendering, these [0, 255] coordinates are converted back into the normal floating-point world space coordinates for the ray-AABB intersection tests (see Figure 5.2.) This conversion 68 adds some computational overhead (see Section 5.3). A hierarchical encoding scheme can actually provide more precision than 32-bit floating point for nodes deeper in the BVH, as the effective precision increases with every step. Although, the geometry within the BVH is only represented with 32-bit floating point, so having some BVH nodes with effectively more then 32-bit floating point precision is not very useful. One problem is the possibility of rendering errors due to the quantization of the floatingpoint AABB coordinates to lower, integer precisions. For example, the AABB might shrink slightly when its coordinates are quantized to integers, and thus rays can miss the AABB when they should have hit. Fortunately, there is a simple solution to this problem. The corner of the AABB closest to the integer coordinate system origin has its floating-point coordinates rounded down to integer values, and the opposite corner has its coordinates rounded up. This guarantees that the quantized AABB fully encloses the actual AABB, avoiding errors in rendering. It is also important to note that the integer coordinates are relative to the quantized coordinates of the parent node, and not the actual coordinates of the parent. The integer AABBs are now slightly larger than the actual AABBs, so some rays that should have missed might now hit. This causes unnecessary BVH traversal and will increase rendering times. There is a tradeoff between the integer coordinate precision and the amount of excess traversal. The lower the integer precision, the more the AABBs have been enlarged, and the greater the unnecessary BVH traversal. Some additional memory savings can be obtained by shortening the BVH node’s numTris field. It is unlikely that there will be more than a handful of triangles in a BVH child node, so a 32-bit integer is unnecessary. For performance reasons, care must be taken to preserve 32-bit data alignment and to keep the BVH node a multiple of 32 bits in size. The compiler will likely pad the structure to a multiple of 32 bits, even if a shorter datatype is used. 69 For the scenes tested in this research, eight bits were more than sufficient for the numTris field, so 24 bits are wasted in maintaining alignment. Applying these reductions, the memory savings is substantial. Using 8-bit integer coordinates, a BVH node now consumes only 12 bytes instead of 32 bytes. Six bytes are used by the six AABB coordinates, two bytes for the number of triangles (it could be one byte, but alignment must be maintained), and four bytes for the pointer. This gives a memory savings of 63% when compared to using 32-bit floating-point. Using 4 bits per coordinate saves even more. In this case, 3 bytes are needed for the six AABB coordinates, one byte for the number of triangles, and four bytes for the pointer. This gives a total of 8 bytes per node, for a memory savings of 75%. Note that the memory savings figures are for the BVH, not overall. Overall memory savings, counting geometry but not texture maps, is in the 20% to 25% range for most of the scenes tested. This is not a large overall savings, but reducing the BVH node size by 63% to 75% also reduces the memory bandwidth requirements for the memory storing the BVH. Especially in a high-performance hardware ray tracer, fetches of BVH nodes from memory might be the limiting factor in traversal performance, so the bandwidth reduction could be more important than the memory savings. Even when bandwidth-conserving techniques such as coherent ray tracing (Section 5.5) are employed, an additional bandwidth reduction of 63% to 75% is obtained. 5.3 Hierarchical BVH Traversal BVH traversal is straightforward. If a ray hits a node’s AABB, both of the node’s children are traversed. Traversal proceeds recursively until a leaf node containing geometry is reached, and the geometry is tested for intersection. Pseudocode is presented in Figure 5.1. An in-depth discussion of BVH traversal is presented in Chapter 4. The results in this chapter are based on the BVH-Plu-DSA traversal algorithm. 70 procedure Traverse(node, ray) if ray hits AABB if node.numTris != -1 (Intersect ray with geometry contained within the node.) else (Determine near and far child) Traverse(node.child[near], ray) Traverse(node.child[far], ray) Figure 5.1: Pseudocode for regular BVH traversal. Several changes were made to the traversal to accommodate the hierarchically-encoded AABB coordinates. The traversal code must generate the actual floating-point coordinates of the node’s AABB based on the relative integer coordinate values and the parent node’s actual AABB coordinates. These coordinates are then used for the children’s traversal. Pseudocode is presented in Figure 5.2. Hierarchy traversal starts by calling Traverse() with the origin and width parameters set to the scene boundaries. To simplify things, the scene is scaled to fit within a cube of coordinates (−1, −1, −1) to (1, 1, 1). Thus, traversal for the 8-bit case starts with: Traverse(rootnode, ray, -1, -1, -1, 2/255, 2/255, 2/255) Note that the width parameters are 2 255 and not 2. This is an optimization that reduces the number of “divide by 255” operations required to reconstruct the box coordinates from the integer values. In general, n-bit traversal will use values of 5.4 2 2n −1 instead of 2 . 255 Hierarchical BVH Rendering Performance Rendering performance was evaluated by comparing rendering times for 32-bit floating point and 4-bit and 8-bit hierarchical integer BVHs. The test scenes used are the same as described in Section 4.2. 71 procedure Traverse(node, ray, pXOrigin, pYOrigin, pZOrigin, pXWidth, pYWidth, pZWidth) /* Generate the node’s actual box coordinates from the */ /* node’s 8-bit coordinates and the parent’s actual */ /* coordinates */ xmin = pXOrigin + pXWidth * node.xmin8 ymin = pYOrigin + pYWidth * node.ymin8 zmin = pZOrigin + pZWidth * node.zmin8 xmax = pXOrigin + pXWidth * node.xmax8 ymax = pYOrigin + pYWidth * node.ymax8 zmax = pZOrigin + pZWidth * node.zmax8 if ray hits AABB if node.numTris != -1 (Intersect ray with geometry contained within the node.) else /* Compute the width parameters, and scale by 1/255 so the */ /* children’s coordinates are properly scaled when the width */ /* is multiplied by the integer coordinate */ xWidth = (xmax - xmin) * 1/255 yWidth = (ymax - ymin) * 1/255 zWidth = (zmax - zmin) * 1/255 (Determine near and far child) Traverse(node.child[near], ray, xmin, ymin, zmin, xWidth, yWidth, zWidth) Traverse(node.child[far], ray, xmin, ymin, zmin, xWidth, yWidth, zWidth) Figure 5.2: Pseudocode for 8-bit BVH traversal. n-bit traversal uses values of 2n −1 instead of 255. Tests were run three times, taking the lowest time. The times varied by a small fraction of a percent. Images were rendered at 2048x2048 pixel resolution with no anti-aliasing. Images produced with the hierarchical integer BVHs were compared to images produced with 32-bit floating-point to ensure that no pixels differed. During testing, it was discovered that integer to float conversions were a computational 72 bottleneck. For the 8-bit case, the solution was to use a 256-entry look-up table (LUT) for the conversions, containing the floating-point values [0, 255]. For the 4-bit case, two 256-entry tables were used: one to convert the lower 4 bits of a byte to a floating-point value, and one to convert the higher 4 bits. This combines unpacking of the 4-bit values from 8-bit bytes within the node, and conversion to floating-point. Using look-up tables improved performance considerably. Several different types of statistics were recorded for each of the scenes: • BVH memory usage: Total bytes consumed by the BVH nodes. Floating-point nodes consume 32 bytes each, 8-bit nodes consume 12 bytes each, and 4-bit nodes consume 8 bytes. • Ray-Tri tests: Average number of ray-triangle tests per ray, broken down by primary, secondary, and shadow rays. M¨oller and Trumbore’s ray-triangle intersection test was used [MT97]. • Ray-Node tests: Average number of ray-node intersection tests per ray, broken down by primary, secondary, and shadow rays. • Time (s): Rendering time, excluding pre-processing. • Time w/LUT (s): Rendering time using look-up tables for the integer to float conversions and data unpacking. Tables 5.1 and 5.2 present the results. For brevity, actual values are only presented for the 32-bit floating-point case. The 4-bit and 8-bit statistics are given as percentage increases, where 0% means an identical value (within ±0.5%) to the floating-point case. The results show substantial memory savings. Hierarchy memory usage is reduced by 63% for all scenes when using BVH nodes with 8-bit integer values, and by 75% when using 4-bit integers. For complex scenes like Lucy, hundreds of megabytes are saved. 73 Spheres BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT Bunny BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT Branch BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Shadow) Time (s) Time with LUT Poppy BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Shadow) Time (s) Time with LUT 32-bit FP 4-bit Int 8-bit Int 362,848 5.290 12.623 11.115 38.338 68.164 62.082 21.4 90,712 15% 19% 14% 18% 15% 17% 80% 56% 136,068 1% 1% 1% 1% 1% 1% 44% 33% 1,068,064 11.291 16.938 7.472 43.003 60.837 22.341 22.4 267,016 8% 13% 4% 15% 15% 12% 60% 42% 400,524 0% 1% 0% 1% 1% 0% 36% 27% 4,968,224 4.517 4.948 52.466 33.116 18.9 1,242,056 15% 12% 18% 18% 78% 55% 1,863,084 1% 1% 1% 1% 42% 31% 6,001,632 2.672 4.700 23.254 29.124 12.1 1,500,408 13% 14% 16% 14% 67% 48% 2,250,612 1% 1% 1% 1% 40% 29% Table 5.1: Hierarchically-encoded BVH performance (scenes 1-4): 32-bit float vs. 4-bit and 8-bit integer. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer cases are percentage increases relative to 32-bit FP. 74 Power-Front BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT Power-North BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT Power-Boiler BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Shadow) Time (s) Time with LUT Lucy BVH memory usage (bytes) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT 32-bit FP 4-bit Int 8-bit Int 203,993,248 13.945 6.634 11.329 103.226 41.274 89.688 33.9 50,998,312 23% 18% 27% 25% 32% 26% 86% 59% 76,497,468 1% 1% 2% 2% 2% 2% 43% 28% 203,993,248 20.783 16.308 12.997 326.527 162.040 201.019 153 50,998,312 29% 17% 31% 21% 17% 19% 95% 60% 76,497,468 2% 1% 2% 4% 1% 1% 52% 34% 203,993,248 21.019 14.955 362.569 313.087 110 50,998,312 28% 26% 17% 17% 89% 53% 76,497,468 1% 1% 1% 1% 48% 30% 433,540,640 108,385,160 162,577,740 6.536 19% 1% 10.369 18% 1% 12.058 14% 1% 59.852 28% 1% 47.344 14% 1% 63.399 17% 1% 49.6 78% 36% 49% 26% Table 5.2: Hierarchically-encoded BVH performance (scenes 5-8): 32-bit float vs. 4-bit and 8-bit integer. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer cases are percentage increases relative to 32-bit FP. 75 Unfortunately, this memory savings comes at a cost. The rendering times have increased greatly, especially for the 4-bit case. The look-up tables reduce the rendering times by a considerable amount, but the rendering is still 26% to 34% slower for the 8-bit case and 42% to 60% slower for the 4-bit case. Examining the ray-node and ray-tri statistics shows a significant increase in the number of intersection tests for the 4-bit case (4% to 32%), but only a slight increase for the 8-bit case (0% to 4%.) Additional tests were run using 12- and 16-bit precision, but the difference in raytri/node tests compared to the floating-point version was insignificant. Thus, it is generally not worthwhile to use more than 8-bits of precision. Rendering times were nearly unchanged versus the 8-bit case, suggesting that in the 8-bits and greater cases the bottleneck is not from the slight amount of excess BVH traversal but from the reconstruction of the node’s actual coordinates from the hierarchical integer representation. If this bottleneck could be avoided, the rendering times will likely be similar to the floating-point times. Comparing the 4-bit and 8-bit results shows a predictable pattern. The 4-bit results show roughly 5-30 times as much increase in ray-tri/node tests as the 8-bit case. This is unsurprising, given that the relative amount of node expansion due to integer quantization will be 16 times greater than for the 8-bit case. 5.5 Reducing Overhead with Coherent Ray Tracing Coherent ray tracing exploits the similarity between rays to improve performance. For example, the 64 rays for any 8 by 8 pixel region of the screen will be similar, and will intersect similar areas of the scene. Other researchers have used coherency to reduce the memory bandwidth requirements of ray tracing, by shooting 4 rays at once with Intel’s SSE instructions [WSBW01], or by shooting 64 rays at once for a prototype hardware design [SWS02]. These sets of rays are traversed through the hierarchy (either a BVH or BSP/k-d tree) as bundles, and are intersected with the geometry of the scene in bundles. This can 76 reduce the number of hierarchy node and geometry fetches from memory by up to a factor of n, when shooting bundles of n rays. Major rendering speed increases have been observed, as fetches from memory are no longer a bottleneck and the CPU or hardware can run more efficiently. Coherent ray tracing solves the problem of the integer BVH node conversion to floatingpoint overhead by amortizing the conversion cost over many rays. The conversion need only be performed once each time the node is loaded from memory, and not for each ray. This should dramatically reduce the overhead. The problem with coherent ray tracing is that the theoretical performance is seldom reached, because the coherent bundles of rays get broken up as they traverse the hierarchy. This happens because not all rays in a bundle will traverse the same nodes, especially further down in the hierarchy where the nodes are small. Care must be taken to ensure the bundles of rays are as coherent as possible to mitigate this problem, as described below. This implementation of coherent ray tracing does not use any special hardware or vector CPU instructions. Bundles of 64 or 256 rays are shot from the eye and traced through the hierarchy. Bundles of 64 rays are formed by grouping the primary rays passing through an 8 by 8 pixel area of the image. Thus, the 64 rays are similar and are coherent. Bundles of 256 rays are formed from the primary rays corresponding to a 16 by 16 pixel area. As many rays as possible should be shot, but if the bundles are too large, the rendering speed stops increasing because the set of rays becomes larger than the CPU caches. 64 rays fit within level 1 CPU cache (typically 8-16 kilobytes for today’s CPUs.) Assuming approximately 50 bytes per ray, 3200 bytes of storage is needed. Bundles of 256 rays were also tested to evaluate the technique’s scalability, but no further speed increase was observed for the reference floating-point cases. This may be because 256 rays are too large to comfortably fit within the level 1 cache of today’s CPUs, and the cache misses cancel the advantage from the greater coherency. 77 procedure Traverse(node, firstray, stacktop) newstacktop = stacktop /* newstacktop is a local variable */ for i = firstray to stacktop - 1 ray = stack[i] /* stack[] is global */ if ray hits AABB stack[newstacktop] = ray newstacktop++ if newstacktop != stacktop /* did any rays hit the node? */ if node.numTris != -1 (Intersect rays with geometry contained within the node.) else (Determine near and far child) Traverse(node.child[near], stacktop, newstacktop) Traverse(node.child[far], stacktop, newstacktop) Figure 5.3: Pseudocode for coherent BVH traversal. After ray traversal and intersection completes, the intersection results for the rays are examined. When shooting 64 rays per bundle, there will be a maximum of 64 intersection points. For each intersection point, a shadow ray is directed to the first light source. These (up to 64) shadow rays are collected into a bundle that is traced through the scene. The process repeats for each light source. Depending on the material properties of the intersections, bundles of transmitted and/or reflected rays are also created and traced through the scene. The BVH traversal must be modified to accommodate bundles of rays (Figure 5.3.) Instead of passing a single ray to the Traverse() function, the function passes and retrieves rays via a single, global stack of pointers to rays. A parameter is passed indicating the number of ray pointers to read from the top of the stack. The rays are retrieved and are tested for intersection with the BVH node. Intersecting rays have their pointers pushed onto the stack and these are passed to the node’s children. Recall that for a BVH, both 78 children are tested against the same set of rays that hit the parent. Implementors should note that when shooting up to 64 rays into a hierarchy that is up to 60 levels deep, the stack must accommodate 64 * 60 or 3840 ray pointers. One problem with tracing bundles of rays through a BVH is that the rays may have different directions, and thus they do not traverse the child nodes in the same order. Recall that in Section 4.5, the DSA heuristic is used for determining the correct order of child traversal for each ray. With DSA, all rays with the same direction vector component signs will traverse the children in the same order. Hence, rays with the same direction signs are bundled together and traced as a group. Ray-triangle intersection proceeds in a similar fashion, with the ray-triangle testing function reading ray pointers from the top of the stack and intersecting them with each triangle belonging to the node. Some minor performance gains are possible with a coherent form of the M¨oller and Trumbore ray-triangle test [MT97]. A few operations can be performed once for the entire bundles of rays, specifically the subtraction of the triangle vertex coordinates to produce the edge vectors. 5.6 Rendering Performance with Coherent Ray Tracing This section compares coherent ray tracing performance for 32-bit floating-point and hierarchical integer BVHs. The testing method is the same as before, except Nodes loaded and Tris loaded fields have been added to the tables. Nodes loaded indicates the number of BVH nodes loaded from memory and converted back into floating-point form. Tris loaded indicates the number of triangles loaded from memory. These fields monitor the effectiveness of shooting bundles of rays versus single rays. If rays are shot in bundles of 64, the values of nodes loaded and triangles loaded would ideally be 1/64th of the number of nodes and triangles tested for ray intersection. This ideal value is never reached because the ray bundles are broken up as they traverse the hierarchy. For the non-coherent case in the 79 32-bit FP Spheres Tris Loaded (Primary) Tris Loaded (Secondary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Secondary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT Bunny Tris Loaded (Primary) Tris Loaded (Secondary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Secondary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT 4-bit Int 8-bit Int 0.109 0.326 0.271 5.290 12.623 11.856 0.638 1.291 1.141 38.338 68.164 65.209 13.8 14% 16% 12% 15% 19% 13% 18% 15% 16% 18% 15% 16% 16% 16% 1% 1% 1% 1% 1% 1% 1% 1% 1% 1% 1% 1% 3% 1% 0.224 0.807 0.202 11.291 16.938 7.707 0.731 1.979 0.515 43.003 60.837 23.715 14.9 9% 12% 6% 8% 13% 4% 15% 14% 12% 15% 15% 12% 13% 14% 0% 1% 0% 0% 1% 0% 1% 1% 1% 1% 1% 0% 3% 3% Table 5.3: BVH performance with coherent ray tracing (scenes 1-2): 32-bit float vs. 4-bit and 8-bit integer. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer cases are percentage increases relative to 32-bit FP. previous section, the number of nodes loaded equals the number of nodes tested, and the number of triangles loaded equals the number of triangles tested. Additionally, the BVH memory usage fields have been removed because the BVH memory usage is the same as before. 80 32-bit FP 4-bit Int 8-bit Int Branch Tris Loaded (Primary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Shadow) Time (s) Time with LUT Poppy Tris Loaded (Primary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Shadow) Time (s) Time with LUT 0.219 0.403 4.517 5.137 1.014 1.044 52.466 34.037 12.6 11% 10% 15% 12% 17% 15% 18% 17% 18% 17% 1% 1% 1% 1% 1% 1% 1% 1% 6% 2% 0.146 0.353 2.672 5.026 0.483 0.805 23.154 30.836 8.41 10% 9% 13% 13% 15% 12% 16% 14% 16% 16% 1% 1% 1% 1% 1% 1% 1% 1% 6% 3% Table 5.4: BVH performance with coherent ray tracing (scenes 3-4): 32-bit float vs. 4-bit and 8-bit integer. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer cases are percentage increases relative to 32-bit FP. The coherent ray tracing results are presented in Tables 5.3, 5.4, 5.5, and 5.6. Once again, the 4-bit and 8-bit results are expressed as percentage increases versus the floatingpoint reference. These results are much better than the previous, single-ray results in Section 5.4. Exploiting ray coherency reduces the overhead by a large amount. The increase in rendering times for the 8-bit case with LUT ranges from 1% to 4% instead of from 26% to 34%. Unfortunately, the 4-bit times are still increased by 14% to 25%. Hence, 4 bits is generally not sufficient to effectively represent a BVH. 4-bit coordinates may outperform the 8-bit coordinates in some situations, such as when using 8-bit coordinates results in memory 81 32-bit FP Power-Front Tris Loaded (Primary) Tris Loaded (Secondary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Secondary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT Power-North Tris Loaded (Primary) Tris Loaded (Secondary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Secondary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT 4-bit Int 8-bit Int 0.579 0.295 0.707 13.945 6.634 11.667 2.315 1.017 2.612 103.226 41.274 91.533 20.5 26% 21% 27% 23% 18% 27% 23% 28% 23% 25% 32% 25% 25% 25% 2% 1% 2% 1% 1% 2% 2% 2% 2% 2% 2% 2% 3% 2% 0.411 0.558 0.341 20.783 16.308 14.222 5.384 4.124 3.852 326.527 162.040 212.348 81.8 28% 19% 27% 29% 17% 29% 21% 16% 18% 21% 17% 18% 23% 23% 2% 2% 2% 2% 1% 2% 4% 2% 1% 4% 1% 1% 4% 4% Table 5.5: BVH performance with coherent ray tracing (scenes 5-6): 32-bit float vs. 4-bit and 8-bit integer. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer cases are percentage increases relative to 32-bit FP. exhaustion (paging to disk) while using 4-bit coordinates does not. Comparing loaded versus tested statistics for the scenes confirms that coherent ray tracing drastically reduces the number of node and triangle fetches from memory. As a result, rendering times are decreased substantially as well. Comparing the coherent 32-bit 82 32-bit FP Power-Boiler Tris Loaded (Primary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Shadow) Time (s) Time with LUT Lucy Tris Loaded (Primary) Tris Loaded (Secondary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Secondary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) Time with LUT 4-bit Int 8-bit Int 0.968 0.840 21.019 15.377 7.358 6.809 362.569 316.648 58.1 25% 23% 28% 26% 18% 17% 17% 17% 22% 21% 2% 1% 1% 1% 1% 1% 1% 1% 6% 3% 1.469 0.221 1.985 6.536 10.369 12.168 2.306 0.816 2.858 59.852 47.344 64.265 31.8 14% 18% 12% 19% 18% 14% 16% 14% 11% 28% 14% 17% 20% 20% 1% 1% 1% 1% 1% 1% 1% 1% 1% 1% 1% 1% 5% 3% Table 5.6: BVH performance with coherent ray tracing (scenes 7-8): 32-bit float vs. 4-bit and 8-bit integer. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer cases are percentage increases relative to 32-bit FP. floating point rendering times and the non-coherent times in Section 5.4 shows that coherent ray tracing is 44% to 89% faster. This improvement was achieved using a purely softwarebased C++ implementation, and did not require the use of vector CPU instructions as described by Wald et al [WSBW01]. Coherent ray tracing eliminates virtually all the overhead from using the hierarchical integer BVH, and provides a 44% to 89% rendering speed improvement over non-coherent 83 32-bit FP 4-bit Int 8-bit Int Spheres Time (s) Time with LUT Bunny Time (s) Time with LUT Branch Time (s) Time with LUT Poppy Time (s) Time with LUT Power-Front Time (s) Time with LUT Power-North Time (s) Time with LUT Power-Boiler Time (s) Time with LUT Lucy Time (s) Time with LUT 55.3 17% 17% 3% 2% 58.7 14% 13% 2% 2% 48.2 19% 19% 5% 2% 32.6 16% 16% 4% 3% 78.6 24% 26% 2% 2% 328 21% 21% 4% 3% 229 18% 18% 4% 2% 121 19% 19% 3% 2% Table 5.7: BVH performance (256-ray coherent ray tracing). ray tracing as well. Hence, combining the two techniques has yielded an algorithm that is both fast and memory-efficient. Increasing the coherency by increasing the image resolution and the bundle size might reduce the overhead further. To test this, the scenes were also rendered at 4096x4096 pixel resolution, shooting a 16x16 pixel group of rays at once (256 rays). The resolution was doubled so that four times as many rays pass through the same solid angle, increasing coherency by a factor of four. Although 4096x4096 pixel resolution may seem excessive, it is roughly equivalent to 2048x2048 anti-aliased pixels. Today (2005), Apple Computer’s 30-inch Cinema Display sports 2560x1600 pixel resolution [App], and large, high resolution 84 displays like these will become more commonplace as technology improves. For brevity, only the rendering times are presented for the 4096x4096 pixel results in Table 5.7. The memory savings is the same regardless of the size of the ray bundles. The increased coherency reduces the overhead for the 8-bit integer technique to between 2% and 3%. This is approximately the same increase that resulted from shooting 64 rays at 2048x2048 resolution (1% to 4%.) The 4-bit results using 256 rays (13% - 26% increase) are also similar to the 64-ray results. 5.7 Single vs. Double Precision Floating Point Ray tracing is typically performed with 32-bit single precision, but for extremely large and complex scenes, 64-bit double precision may be desired. Double precision also has the advantage of being more resistant (but not immune) to imprecision-related rendering artifacts. For example, the epsilon values used to avoid incorrect surface intersections (also called “surface acne” [WPO96]) can be much smaller. The tests from Sections 5.4 and 5.6 were also run using 64-bit double precision, with similar results (within a few percent). The double precision rendering times were within a few percent of the single precision times. The only significant difference was the memory savings. BVH nodes using 64-bit double precision coordinates require 56 bytes per node instead of 32 bytes per node for 32-bit single precision. Hence, the savings when using 8-bit integers was 79% instead of 63% for 32-bit floating-point. Memory savings when using 4-bit integers was 86% instead of 75%. 5.8 Out-of-Core Results The memory savings from the hierarchical integer BVH may be sufficient to prevent physical memory exhaustion and paging to a swapfile during rendering. A special test scene, 2Lucy 85 2Lucy 1.5 Angelic statues Triangles: 43,057,792 Lights: 2 BVH Nodes: 20,793,659 Primary rays: 4,194,304 Secondary rays: 1,558,117 Shadow rays: 11,504,836 Figure 5.4: 2Lucy test scene. 32-bit FP 8-bit Int BVH memory usage (bytes) 665,397,088 249,523,908 Ray-Tri Tests (Primary) 6.948 1% Ray-Tri Tests (Secondary) 10.383 1% Ray-Tri Tests (Shadow) 12.271 1% Ray-Node Tests (Primary) 67.364 1% Ray-Node Tests (Secondary) 62.641 1% Ray-Node Tests (Shadow) 68.312 1% Time (s) 218s 69.3s (-68%) Table 5.8: 2Lucy hierarchically-encoded BVH performance: 32-bit float vs. 8-bit integer with LUT. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer case are percentage increases relative to 32-bit FP. (Figure 5.4) was devised that is too large for physical memory when using the 32-bit floating-point BVH, but fits within physical memory when using the hierarchical integer BVH. Using the 8-bit hierarchical integer BVH saves over 415 million bytes of memory. The 2Lucy scene was rendered with the 32-bit floating-point BVH and with the 8-bit integer BVH (with LUT.) Single-ray results are presented in Table 5.8, and 64-ray coherent ray tracing results are presented in Table 5.9. 86 Tris Loaded (Primary) Tris Loaded (Secondary) Tris Loaded (Shadow) Ray-Tri Tests (Primary) Ray-Tri Tests (Secondary) Ray-Tri Tests (Shadow) Nodes Loaded (Primary) Nodes Loaded (Secondary) Nodes Loaded (Shadow) Ray-Node Tests (Primary) Ray-Node Tests (Secondary) Ray-Node Tests (Shadow) Time (s) 32-bit FP 8-bit Int 1.857 1% 0.826 1% 2.730 1% 6.948 1% 10.383 1% 12.421 1% 3.054 1% 1.932 1% 4.099 1% 67.364 1% 62.641 1% 69.513 1% 89.9s 37.5s (-58%) Table 5.9: 2Lucy BVH performance with coherent ray tracing: 32-bit float vs. 8-bit integer with LUT. Ray statistics for 32-bit FP are given as per-ray averages, statistics for the integer case are percentage increases relative to 32-bit FP. The single ray results show a large performance increase from using the integer BVH. The integer BVH renders 68% faster than the 32-bit floating-point BVH, despite the overhead of converting the integer AABB coordinates to floating- point. This is because the 8-bit integer BVH reduces the memory consumption sufficiently to avoid physical memory exhaustion and swapfile usage. Recall that in Section 5.4, the rendering times were 26% to 34% slower, not 68% faster. Admittedly, the 2Lucy scene is not exactly the same as the scenes previously tested, but it is similar enough for a meaningful comparison. The floating-point coherent ray tracing results show that coherent ray tracing is highly beneficial when physical memory is exhausted. Previously, coherent ray tracing produced a speed-up of between 44% to 89% relative to tracing single rays. When physical memory was exhausted and the system was using the swapfile, coherent ray tracing produced a speed-up of 142%. Coherent ray tracing reduces the number of swapfile/disk accesses, reducing the performance penalty for swapfile usage. Coherent ray tracing performance with the 8-bit integer BVH is 58% faster than with 87 the floating-point BVH. Again, the 8-bit integer BVH reduces the memory consumption sufficiently to avoid physical memory exhaustion and swapfile usage. The previous results in Section 5.6 showed a 1% to 4% slowdown when using the integer BVH, not a 58% speed-up. Chapter 6 Ray-BVH Traversal with Integer Arithmetic 6.1 Overview Ray tracers commonly use single-precision floating-point arithmetic, including recent hardware ray tracers. Integer arithmetic may be a superior choice for hardware ray-AABB tests and BVH traversal, because of the reduced hardware complexity of integer arithmetic circuits. It is also possible to reduce the precision of the ray-AABB tests to further reduce hardware complexity. The key problem is adapting the ray-AABB tests to work robustly with integer arithmetic, especially if the precision is reduced. Integer versions of the Slabs-AABB and Plu-AABB tests are derived in this chapter and compared. To adapt the ray-AABB tests to work robustly with reduced precision integer arithmetic, a strategy to deal with potential errors due to imprecision is needed. There are two kinds of errors that may occur: false hits and false misses. A false hit occurs when the imprecision of the ray-AABB test causes the test to indicate a ray-AABB hit, when it should actually be a miss. This type of error does not cause incorrect images to be produced, but it does cause unnecessary ray-BVH traversal and ray-triangle tests. False misses are more serious, as geometry that should be tested for ray intersection is ignored, potentially causing incorrect images. A good strategy is to eliminate all false misses and minimize the number of false hits. Hence, if there is uncertainty due to numerical imprecision in the ray-AABB test, then a ray-AABB hit is assumed, erring on the conservative side. Lower precisions result in some excess BVH traversal, and the excess traversal increases as precision is decreased. This ensures that no BVH nodes are accidentally missed due to imprecision. These algorithms 88 89 P recision 32-bit 24-bit 20-bit 16-bit FP int int int Bytes / node 32 26 23 20 N odes 11,702,427 11,702,427 11,702,427 11,698,579 BV H size P ercent (bytes) savings 374,477,664 0% 304,263,102 19% 269,155,821 28% 233,971,580 38% Table 6.1: BVH size versus precision for Lucy scene. guarantee that a correct image is produced regardless of the amount of precision used. AABB coordinates are represented with n-bit integer values instead of 32-bit floatingpoint. Using lower-precision integer values instead of 32-bit floating-point also saves memory. This research tests 16, 20, and 24-bit integer precisions, where each BVH node consumes 20, 23, or 26 bytes. This results in BVH memory savings of 19% to 38%, versus using 32-bit floating-point values. As an example of the memory savings, Table 6.1 compares the memory consumption of the Lucy scene’s BVH for varying precisions. Floating-point arithmetic is still used during integer BVH construction, which uses an algorithm similar to that described in Section 4.3. Floating-point arithmetic is used for simplicity, because integer BVH construction has not yet been investigated. BVH node coordinates are converted to integer values as an additional, final step. This is not the same as the integer BVH encoding in Chapter 5, because the integer coordinates are not represented hierarchically but are simply scaled to fit within a fixed integer range. BVH nodes are not split along a particular axis if they are already smaller than the integer coordinate system grid in that direction. Another axis is chosen or the node becomes a leaf if it is too small in all three dimensions. Previously, a floating-point epsilon value had been used. The triangles contained within the BVH leaf nodes are still represented with floatingpoint values. Ray-triangle tests are computed with 32-bit floating-point arithmetic, using the algorithm described by M¨oller and Trumbore [MT97]. Ideally, the triangles would also be represented with integer values, avoiding the need to mix integer and floating-point 90 within the ray tracer. The development of a robust, integer ray-triangle test remains to be investigated. 6.2 Integer Slabs-AABB Test The Slabs-AABB test is described in Section 2.3.1. Modifying it to use integer arithmetic is particularly challenging because computing the ray-plane distances to the AABB faces involves division or multiplication by IRDVCs (inverse ray direction vector components). The results of these operations can be anywhere from -Infinity to +Infinity when implemented in floating-point. Integer arithmetic has a much more limited range than floating-point and cannot represent infinities. 6.2.1 Scene and Ray Setup The scene must be scaled such that the geometry, camera position (primary ray origins), and light sources all fit within the [(−1, −1, −1), (1, 1, 1)] cube (normalized coordinates.) This is necessary to ensure that conversion to a limited integer coordinate range does not overflow. (One caveat to using normalized coordinates is that calculations involving the ray length must be scaled to compensate for the normalization.) When using 16-bit arithmetic, for example, the ray origin (rx, ry, rz) and AABB coordinates (xmin, ymin, zmin, xmax, ymax, zmax) are multiplied by 16383 to fit within the range [−16383, 16383]. Why this is not [−32767, 32767], as would be expected for 16-bit arithmetic, will be explained shortly. In general, for n-bit arithmetic the range is [−2n−2 − 1, 2n−2 − 1], e.g. [−216−2 − 1, −216−2 − 1] = [−16383, 16383]. After the coordinates of the ray’s origin are multiplied by 16383, they are rounded to the nearest integer values. This produces an integer ray origin (irx, iry, irz). Rounding to integer values introduces a maximum error of ± 21 relative to the actual values. This error can be corrected by adding/subtracting 1 2 to the AABB coordinates during BVH 91 construction, after the coordinates are multiplied by 16383. This enlarges the AABB slightly, eliminating the possibility of false misses due to the imprecision of the integer ray origin. Hence, all integer ray origins can now be considered exact. The conversion of the AABB coordinates to integers also introduces error. These conversions are performed during BVH construction, not during rendering. The error/imprecision can be handled by taking the floor() of xmin, ymin, zmin and the ceil() of xmax, ymax, zmax after multiplication by 16383 and addition/subtraction of 1 2 (to compensate for the imprecision of the ray origin.) This guarantees that the actual AABB is fully enclosed by its integer approximation. The integer AABB coordinates can be considered exact because false misses due to the integer approximations have been eliminated. These conversions to integer increase the AABB’s size. This causes false hits, and the number of false hits increases as the numerical precision or number of bits decreases, because the relative increase in AABB size is greater with lower precision. To summarize, the integer AABB coordinates are given by:  ixmin = ixmax = iymin = iymax = izmin = izmax =  1 (int)f loor 16383 ∗ xmin − 2   1 (int)ceil 16383 ∗ xmax + 2   1 (int)f loor 16383 ∗ ymin − 2   1 (int)ceil 16383 ∗ ymax + 2   1 (int)f loor 16383 ∗ zmin − 2   1 (int)ceil 16383 ∗ zmax + 2 The integer ray-AABB test avoids division by multiplying by IRDVCs. When using 16bit arithmetic, the IRDVCs in integer form are values with range [−32767, 32767]. Thus, the largest IRDVC is scaled to equal ±32767. Mapping the floating-point IRDVCs to a fixed integer range avoids the problem of representing large and possibly infinite values 92 with integers. Also needed are integer floor() and ceil() values of the IRDVCs because they give lower and upper bounds on the actual IRDVCs. Fortunately, these values only need to be computed once for each ray. The integer IRDVCs are: scale = 32767 ∗ min(|ri|, |rj|, |rk|)   scale −1 irif loor = (int)f loor ri   scale −1 iriceil = (int)ceil ri   scale −1 irjf loor = (int)f loor rj   scale −1 irjceil = (int)ceil rj   scale −1 irkf loor = (int)f loor rk   scale −1 irkceil = (int)ceil rk Note that scale and the ray’s direction vector components (ri, rj, rk) are floating-point values. Special care must be taken when one or two of the floating-point ray direction components (ri, rj, rk) are +0 or -0. For this algorithm, +0 and -0 are interpreted as underflow conditions: positive or negative non-zero numbers too small to represent with floatingpoint. The values +0 and -0 can also be interpreted as infinitesimal values approaching zero from the positive or negative side. For this situation, the integer IRDVC floor/ceil values need to be set explicitly as a special case. Examining the equations that produce the integer components reveals the problem. If ri = +0 (and scale = +0), the division scale ri becomes 32767∗(+0) , +0 which gives a floating-point result of NaN (Not a Number). The +0 +0 problem is solved by application of L’Hopital’s rule (which states that a limit of indeterminate form 0 0 can be computed by differentiating both the numerator and de- 93 nominator): 32767 ∗ |ri| → ri→+0 ri lim 32767 ∗ 1 = 32767 ri→+0 1 lim If ri = -0, the result is -32767, since the denominator of the limit becomes negative. Thus, for each +0 floating-point ray component, the corresponding floor/ceil integer IRDVC value is +32767. For each -0 floating-point component, the floor/ceil integer IRDVC value is -32767. For the non-zero ray components when scale = +0, the conversion to integer IRDVCs is of the form +0 , val where val is ri, rj, or rk. This produces a +0 or -0 result. +0 is considered a tiny value greater than zero, so its floor value is 0, and ceil value is 1. -0 is interpreted as a tiny value less than zero, so its floor value is -1 and ceil value is 0. Hence, if the ray component is positive and non-zero, the floor value is 0 and ceil value is 1. If the ray component is negative and non-zero, the floor value is -1 and ceil value is 0. Implementors should be careful not to use the “C” library floor() and ceil() functions for these special cases. My testing shows that floor(-0) returns 0, and ceil(+0) returns 0. In the context of this algorithm, this is incorrect, because floor(-0) should return -1, and ceil(+0) should return 1. Finally, the ray length must also be converted to an integer in order to perform the ray-AABB test: irlen = (int)(rlen ∗ scale ∗ 16383) The multiplication by 16383 is necessary to ensure the integer ray length is scaled by the same amount as the result of the intersection distance calculation, described below. 6.2.2 Ray-AABB Tests Details of the floating-point Slabs-AABB test are given in Section 2.3.1. 94 The first step in the ray-AABB test is to subtract the ray origin from the AABB coordinates: dx0 = ixmin − irx dx1 = ixmax − irx dy0 = iymin − iry dy1 = iymax − iry dz0 = izmin − irz dz1 = izmax − irz The range of both the AABB coordinates and the ray origin are [−16383, 16383], so subtracting them gives a result with range [−32766, 32766]. This is why their ranges are limited to [−16383, 16383] instead of [−32767, 32767] when using 16-bit arithmetic. Otherwise, the subtraction could overflow and 17 bits would be needed to represent the result. Despite being integer, these subtraction results contain no error because the integer AABB coordinates and ray origin are considered exact. Compensation for the error has already been applied by enlarging the AABB (Section 6.2.1.) The next step is to obtain the intersection distances to the planes of the AABB faces. The subtraction results with range [−32766, 32766] are multiplied by the integer IRDVCs, which have the range [−32767, 32767]. This produces a result with range [−1073643522, 1073643522], which fits in 31 bits. These six intersection distances form the three intervals [itnearx , itf arx ], [itneary , itf ary ], [itnearz , itf arz ]. By using the proper floor() and ceil() integer IRDVCs, it is ensured that the integer distances fully enclose the actual distances that would result from using higher precision or floating-point. Figures 6.1 and 6.2 show the six possible cases of the integer computation of the [itnearx , itf arx ] interval, comparing them to the floating-point versions. In the figures, ignore the fact that the integer distances are scaled relative to the floating-point distances. 95 Case 1: iri-1floor ≥ 0, dx0 ≥ 0, dx1 ≥ 0 tfarx = dx1 * ri-1 -1 tnearx = dx0 * ri Ray x origin itfarx = dx1 * iri-1ceil itnearx = dx0 * iri-1floor ixmin ixmax Case 2: iri-1floor ≥ 0, dx0 < 0, dx1 ≥ 0 -1 tnearx = dx0 * ri tfarx = dx1 * ri-1 Ray origin x itfarx = dx1 * iri-1ceil itnearx = dx0 * iri-1ceil ixmin ixmax Case 3: iri-1floor ≥ 0, dx0 < 0, dx1 < 0 -1 tfarx = dx1 * ri-1 tnearx = dx0 * ri x Ray origin itfarx = dx1 * iri-1floor itnearx = dx0 * iri-1ceil ixmin ixmax Figure 6.1: Integer vs. floating-point slabs interval, cases 1-3. This does not matter because all the integer distances are scaled by the same amount (scale) and are only compared with each other, zero, and the scaled ray length irlen. A key problem is when to use the floor() IRDVC value and when to use the ceil() value. This depends on the direction of the ray and the signs of dx0 and dx1. In general, the interval calculation can be written as: itnearx = dxnear ∗ iri−1 near itf arx = dxf ar ∗ iri−1 f ar 96 Case 4: iri-1floor < 0, dx0 < 0, dx1 < 0 tnearx = dx1 * ri-1 tfarx = dx0 * ri -1 x Ray origin itnearx = dx1 * iri-1ceil itfarx = dx0 * iri-1floor ixmin ixmax Case 5: iri-1floor < 0, dx0 < 0, dx1 ≥ 0 tfarx = dx0 * ri -1 tnearx = dx1 * ri-1 Ray origin x itnearx = dx1 * iri-1floor itfarx = dx0 * iri-1floor ixmin ixmax Case 6: iri-1floor < 0, dx0 ≥ 0, dx1 ≥ 0 tfarx = dx0 * ri Ray origin x -1 tnearx = dx1 * ri-1 itnearx = dx1 * iri-1floor itfarx = dx0 * iri-1ceil ixmin ixmax Figure 6.2: Integer vs. floating-point slabs interval, cases 4-6. The logic to determine the appropriate variables is broken into three steps: 1. If iri−1 f loor ≥ 0, then dxnear = dx0 and dxf ar = dx1, else dxnear = dx1 and dxf ar = dx0 −1 −1 −1 2. If dxnear ≥ 0, then iri−1 near = irif loor , else irinear = iriceil . −1 −1 −1 3. If dxf ar ≥ 0, then iri−1 f ar = iriceil , else irif ar = irif loor . The rules are similar for the [itneary , itf ary ] and [itnearz , itf arz ] intervals. 97 Next, the overall [itnear, itf ar] interval is formed by taking min() and max() of the three intervals: itnear = max(itnearx , itneary , itnearz ) itf ar = min(itf arx , itf ary , itf arz ) For the ray to hit the AABB, the following conditions must be satisfied: itf ar ≥ 0 itnear ≤ itf ar itnear ≤ irlen The min() and max() operations and the comparisons can be simplified by truncating the [itnearx , itf arx ] (and other) interval values and ray length to a lower precision. This is important if hardware costs are to be minimized (see Chapter 7.) The 31-bit interval values can have the lower 15 bits removed, shortening them to 16 bits. Actually, any number of bits can be removed. Removing bits causes additional false hits but not false misses, so the integrity of the algorithm is preserved. This truncation cannot produce a false miss because if itnear < itf ar (a hit) when represented with full precision, itnear ≤ itf ar (still a hit) when truncated. At worst, itf ar and itnear will become equal (still a hit), and will not “switch places.” Equal values will remain equal after truncation, and positive values (or zero) will never become negative. 6.2.3 BVH Efficiency The integer ray-AABB test is conservative, eliminating false misses at the expense of increasing false hits. The proportion of false hits increases as the precision is reduced. As the number of false hits is increased, the efficiency of the BVH is reduced due to additional BVH traversal and ray-triangle tests. This section examines the effect of reducing the precision on the number of BVH nodes traversed and the number of ray-triangle tests. Ideally, 98 using lower integer precisions will only result in a small increase in the number of BVH nodes traversed and ray-triangle tests versus using 32-bit floating-point. The test scenes described in Section 4.2 were used. Images were rendered at 2048x2048 pixel resolution. The integer BVH images were compared pixel-for-pixel with the floatingpoint images to verify correctness of the integer BVH technique. Additionally, each integer BVH test result was checked against the corresponding floating-point test to ensure there were no false misses. Billions of ray-AABB tests were performed while rendering the images, and no false misses were recorded. Tables 6.2, 6.3, and 6.4 present the test statistics. The first data column presents the statistics for 32-bit floating-point. The next columns give statistics for the integer method. These are split into three precision categories: 24 bits, 20 bits, and 16 bits. Each precision category is split into three cases. The first case (labeled n/(n+8) bits) uses n-bit arithmetic to compute the [itnear, itf ar] intervals, and trims the [itnear, itf ar] values to (n+8) bits (as described in Section 6.2.2.) For example, when using 16-bit arithmetic, the [itnear, itf ar] values are 31 bits in length. The lower 7 bits are removed, truncating the values to 24 bits. The next cases, labeled n/(n+4) bits and n/n bits, trim the [itnear, itf ar] values to n+4 and n bits, respectively. Several statistics were measured for each test scene and precision: • Ray-Tri Tests (Pri/Sec/Sha): The number of ray-triangle tests performed for primary, secondary, and shadow rays. Statistics for 32-bit floating-point are given as per-ray averages. Integer statistics are given as percentage increases relative to 32-bit floating-point. Percentages have been rounded to the nearest whole number. • Ray-Node Tests (Pri/Sec/Sha): The number of ray-node tests performed for primary, secondary, and shadow rays. Statistics are presented in the same fashion as the ray-tri tests. Timings are not provided because the integer results are a simulation, and are not 99 tuned for speed. The integer times were slower than the 32-bit floating-point times due to the amount of additional bit manipulation needed to accommodate the variable precision settings. In particular, removing the lower bits of the multiplication results requires an expensive shift operation. This shift would not be required in a hardware design, because the lower bits of the multiplier outputs would simply not be connected to anything. A hardware design is proposed in Chapter 7. 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 5.290 12.623 11.115 38.338 68.164 62.082 11.291 16.938 7.472 43.003 60.837 22.341 4.517 4.948 52.466 33.116 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 4% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 22% 15% 2% 2% 2% 2% 1% 1% 1% 0% 2% 3% 1% 0% 1% 0% 24% 15% 2% 2% 2% 2% 1% 1% 1% 0% 2% 3% 1% 0% 1% 0% 63% 20% 6% 3% 4% 3% 2% 1% 1% 1% 5% 5% 2% 1% 1% 0% Table 6.2: BVH performance versus precision (Slabs-AABB), scenes 1-3. Statistics for the 32-bit FP case are given as per-ray averages. Integer statistics are percentage increases relative to 32-bit FP. Spheres Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) Bunny Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) Branch Ray-Tri Tests (Pri) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sha) 32 bit 24/32 24/28 24/24 20/28 20/24 20/20 16/24 16/20 16/16 FP bits bits bits bits bits bits bits bits bits 100 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2.672 4.700 23.254 29.124 13.945 6.634 11.329 103.226 41.274 89.688 20.783 16.308 12.997 326.527 162.040 201.019 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 1% 1% 1% 0% 0% 0% 1% 1% 1% 0% 0% 0% 1% 1% 0% 0% 2% 1% 2% 0% 0% 0% 1% 1% 2% 0% 0% 0% 1% 1% 0% 0% 7% 2% 3% 1% 0% 0% 5% 3% 4% 1% 1% 1% 3% 1% 0% 0% 24% 16% 29% 5% 2% 3% 22% 11% 29% 4% 2% 5% 17% 10% 2% 2% 28% 19% 32% 5% 2% 3% 25% 14% 33% 4% 3% 5% 19% 10% 3% 2% 133% 42% 58% 13% 5% 5% 64% 39% 68% 9% 7% 10% 48% 18% 6% 3% Table 6.3: BVH performance versus precision (Slabs-AABB), scenes 4-6. Statistics for the 32-bit FP case are given as per-ray averages. Integer statistics are percentage increases relative to 32-bit FP. Poppy Ray-Tri Tests (Pri) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sha) Powerplant - Front Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) Powerplant - North Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) 32 bit 24/32 24/28 24/24 20/28 20/24 20/20 16/24 16/20 16/16 FP bits bits bits bits bits bits bits bits bits 101 0% 1% 0% 0% 0% 0% 0% 0% 0% 0% 21.019 14.955 362.569 313.087 6.536 10.369 12.058 59.852 47.344 63.399 0% 0% 0% 0% 0% 0% 1% 1% 0% 0% 1% 0% 0% 0% 0% 0% 6% 2% 0% 0% 6% 0% 2% 1% 0% 0% 8% 9% 0% 0% 7% 0% 2% 1% 0% 0% 12% 11% 1% 0% 22% 1% 3% 3% 0% 1% 86% 22% 4% 1% 125% 2% 41% 13% 0% 7% 155% 182% 7% 7% 143% 3% 42% 14% 1% 7% 555% 25% 60% 50% 4% 10% 234% 1631% 208% 395% 10% 63% 8% 15% 16/16 bits Table 6.4: BVH performance versus precision (Slabs-AABB), scenes 7-8. Statistics for the 32-bit FP case are given as per-ray averages. Integer statistics are percentage increases relative to 32-bit FP. Powerplant - Boiler Ray-Tri Tests (Pri) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sha) Lucy Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) 32 bit 24/32 24/28 24/24 20/28 20/24 20/20 16/24 16/20 FP bits bits bits bits bits bits bits bits 102 103 The results show that for simple scenes, 16-bit integer arithmetic is sufficient. The Spheres and Bunny scenes show an increase of only a few percent in the number of BVH nodes tested. For the other scenes, “20/24” bit precision is enough for efficient operation of the BVH. More precision is needed for the more complex scenes because the BVH nodes are smaller relative to the integer coordinate system resolution. Powerplant-Boiler and Lucy are the most demanding scenes, showing up to a 6% increase in the number of triangles tested. However, the number of ray-node tests only increases by 1%, and there are far more ray-node tests than ray-triangle tests, thus the 6% increase in ray-triangle tests is not as detrimental as it may seem. If the increase in BVH traversal and ray-triangle tests absolutely needs to be minimized, then “24/24” bit precision should be used. 12 bits of precision was also tested, but the results were quite poor. One interesting consequence of using 12 bits is a nearly 50% reduction in the number of BVH nodes for the power plant scenes, saving a large amount of memory. The BVH nodes were too small to represent with the 12-bit coordinate grid, causing early termination of the BVH. Unfortunately, the BVH was so inefficient due to false hits when using 12-bits that this was of little value. The other scenes did not show a significant reduction in the number of BVH nodes. 6.3 Integer Plu-AABB Test The Plu-AABB test (see Chapter 3) and BVH traversal can also be adapted to use integer arithmetic. The derivations here will use 16-bit arithmetic, similar to the integer SlabsAABB section. 6.3.1 Scene and Ray Setup Integer BVH construction is similar to the construction for the integer Slabs-AABB method described in Section 6.2.1. The AABBs are expanded to compensate for error caused by 104 rounding the AABB coordinates and ray origin to integer values. The integer ray setup is much simpler than for the Slabs-AABB method because no IRDVCs need to be computed. The Plu-AABB test needs the ray origin (rx, ry, rz), ray direction (ri, rj, rk) and endpoint (rex, rey, rez). These are converted to the integers (irx, iry, irz), (iri, irj, irk) and (irex, irey, irez). The ray origin and endpoint have range [−16383, 16383] when using 16-bit arithmetic. The conversion to integers is the same as described in Section 6.2.1. The integer ray direction components are mapped to the range [−32767, 32767], with the largest component equal to ±32767. The components are computed as follows: largest = max(|ri|, |rj|, |rk|)   ri iri = round 32767 ∗ largest   rj irj = round 32767 ∗ largest   rk irk = round 32767 ∗ largest 6.3.2 Ray-AABB Tests Pseudocode for case “MMM” of the floating-point Plu-AABB test is given in Figure 6.3. The other seven cases are similar, except some variables and comparisons have been permuted. The pseudocode for the corresponding integer algorithm is presented in Figure 6.4. First, examine Part A and Part B of the floating-point pseudocode. Converting this portion to use 16-bit integer arithmetic is straightforward. The integer values are simply substituted for the floating-point values. The integer values are assumed to be exact because compensation for the rounding error in the integer AABB coordinates and ray origin/endpoint has already been applied by enlarging the AABB during BVH construction. The next stage (Part C) of the Plu-AABB test subtracts the ray origin from the AABB 105 // Part A: The AABB must be on the forward extension of the ray if rx < xmin OR ry < ymin OR rz < zmin then MISS // Part B: If the ray is not infinite, check the ray’s endpoint if (ray is not infinite) AND (rex > xmax OR rey > ymax OR rez > zmax) then MISS // xa ya za xb yb zb Part C: Subtract the ray origin from the AABB coordinates = xmin - rx = ymin - ry = zmin - rz = xmax - rx = ymax - ry = zmax - rz // Part D: Test if ri * ya - rj rj * xa - ri rk * xa - ri ri * za - rk rj * za - rk rk * ya - rj then MISS else HIT the ray against the six silhouette edges of the AABB * xb < 0 OR * yb < 0 OR * zb < 0 OR * xb < 0 OR * yb < 0 OR * zb < 0 Figure 6.3: Pseudocode for floating-point Plu-AABB test, case “MMM.” coordinates. The integer AABB coordinates and integer ray origin coordinates both have range [−16383, 16383], so the subtraction result has range [−32766, 32766]. The subtraction results are error-free because the integer ray origin and AABB coordinates are error free. Again, compensation has already been applied by enlarging the AABB during BVH construction. Next, the ray is tested against the six edges of the AABB (Part D). This part of the PluAABB test involves computation with the integer ray direction components (iri, irj, irk), which contain an error of ± 21 . Interval arithmetic will be used to derive an error bound 106 // Part A: The AABB must be on the forward extension of the ray if irx < ixmin OR iry < iymin OR irz < izmin then MISS // Part B: If the ray is not infinite, check the ray’s endpoint if (ray is not infinite) AND (irex > ixmax OR irey > iymax OR irez > izmax) then MISS // Part C: Subtract the ray origin from the AABB coordinates ixa = ixmin - irx iya = iymin - iry iza = izmin - irz ixb = ixmax - irx iyb = iymax - iry izb = izmax - irz // Part D: Test the ray against if iri * iya - irj * ixb < EBND irj * ixa - iri * iyb < EBND irk * ixa - iri * izb < EBND iri * iza - irk * ixb < EBND irj * iza - irk * iyb < EBND irk * iya - irj * izb < EBND then MISS else HIT the six silhouette edges of the AABB OR OR OR OR OR Figure 6.4: Pseudocode for integer Plu-AABB test, case “MMM.” on the computation results obtained. This error bound will be used in the comparisons for the six ray-edge tests in part D of the pseudocode. Thus, instead of comparing with zero, as in the floating-point case, the comparisons will be made with a constant error bound (EBN D) value. The value of EBN D is chosen conservatively such that all potential false ray-AABB misses are eliminated. EBN D is the same for all six edge tests. A constant error bound value is not as “tight” as one computed based on the actual values of the parameters, but is trivial to implement. A dynamically-computed, optimal error bound may be too expensive to implement, negating the advantage of using integer 107 arithmetic. In the derivation, values that include error information as intervals will be denoted with an uppercase letter, while scalars will be lowercase. An actual implementation using this idea would not use interval arithmetic. Rather, just the scalar values are computed and compared to the error bound derived from this interval arithmetic proof. The first expression (iri ∗ iya − irj ∗ ixb < EBN D) will be analyzed. This expression gives the relative orientation of the ray with respect to the AABB edge. If the result is less than EBND, the ray is on the “outside” of the edge and misses the AABB. To simplify the analysis, the expression (iri ∗ iya − irj ∗ ixb < EBN D) is first split into terms p and q, where: p = iri ∗ iya q = irj ∗ ixb p − q < EBN D The values p and q are 31 bits in length, since multiplying two signed 16-bit values gives a 31-bit result. Recall that iri and irj are 16-bit values with range [−32767, 32767] and iya and ixb are 16-bit values with range [−32766, 32766]. Interval arithmetic is used to determine the error characteristics. Compute the interval P based on p, recalling that the error in the ray’s integer direction component iri is ± 12 , and the error in iya is 0: 1 1 P = [iri − , iri + ] ∗ [iya, iya] 2 2 The rule for interval multiplication is: [a, b] ∗ [c, d] = [min(a ∗ c, a ∗ d, b ∗ c, b ∗ d), max(a ∗ c, a ∗ d, b ∗ c, b ∗ d)] Simplifying the rule for c = d: [a, b] ∗ [c, c] = [min(a ∗ c, b ∗ c), max(a ∗ c, b ∗ c)] 108 Expanding: iya iya , iri ∗ iya + ), 2 2 iya iya max(iri ∗ iya − , iri ∗ iya + )] 2 2 P = [min(iri ∗ iya − Computing the min and max expressions at run-time is impractical because it adds too much overhead. However, a maximum theoretical error bound can be obtained on the multiplication result. Again, this may not be as “tight” as the actual interval result obtained from fully evaluating the min and max expressions. Examining the interval P reveals that iri ∗ iya is common to all four subexpressions. The other portion of the subexpressions is an error term: err = iya 2 The maximum possible value for err is obtained by substituting the maximum possible value for iya (32766). Hence, err = 32766 = 16383 2 Using this simplification, P can be re-written as: P = [iri ∗ iya − err, iri ∗ iya + err] P = [iri ∗ iya − 16383, iri ∗ iya + 16383] Recall that the scalar p equals iri ∗ iya: P = [p − 16383, p + 16383] Q is of the same form and has the same error bounds as P , so its derivation is not shown. Hence, the scalars p and q both have error bounds of ±16383. Finally, p and q are subtracted and compared to the constant EBN D, which is yet to be determined. The result of the subtraction is 32 bits, because p and q are 31-bit values. 109 The error in the subtraction is determined by subtracting the intervals P and Q: P − Q < EBN D [p − 16383, p + 16383] − [q − 16383, q + 16383] < EBN D [p − q − 32766, p − q + 32766] < EBN D Thus, the result of the subtraction p − q has an error of ±32766. Recall that originally, the result’s comparison was with zero and not EBN D. A negative result meant that the ray was oriented “outside” that particular edge and the ray missed the AABB. A zero or positive result meant that the ray intersected the edge or was oriented “inside” the edge. If all six edge comparisons were zero or positive, the AABB was hit by the ray. Since there is an error of ±32766 on the integer result, any values between −32766 and +32766 represent an uncertain situation where it is unclear whether the ray misses or not. These uncertain results are converted to false hits to conservatively avoid missing AABBs that should have been hit. Hence, any result greater than or equal to −32766 is now considered a hit, and any result less than −32766 is guaranteed to be a true miss. Thus, EBN D = −32766. Substituting for p and q, the final expression for the first integer ray-AABB edge test becomes: iri ∗ iya − irj ∗ ixb < −32766 The other five edge tests have a similar form, and use the same EBN D value. In general, the EBN D for a given initial wordlength of n bits for the ray and AABB parameters and ray direction is: EBN D = −(2n−1 − 2) The derivation in this section used n = 16 bits. 110 The technique can be improved by realizing that the 31-bit multiplication results p and q can have their 16 lower bits truncated without adverse effects. Reducing the wordlength is important if hardware costs are to be minimized (see Chapter 7.) This truncation produces the 15-bit values pt and qt . Subtraction of the 15-bit pt and qt values is simpler than subtraction of the 31-bit p and q values. Removing the lower 16 bits of the multiplication results is equivalent to dividing by 65536 and taking the floor() of the result. EBN D must be adjusted to match the reduced length of pt and qt , producing the truncated EBN Dt . First, the error of ±16383 on p and q must be adjusted to match the truncation of bits. One may be tempted to simply divide the error by 65536, giving an error of ±0.24968. This is only partially correct. Some information is lost when the bottom 16 bits of p and q are dropped. This may shift the truncated values pt and qt downwards relative to their 31-bit values by up to 65535 . 65536 correct, but the lower error bound must have The upper error bound of +0.24968 is still 65535 65536 subtracted from it to compensate for the possible downward shift. The lower error bound becomes −0.24968 − 65535 = −1.24997. 65536 Using hiword() to denote the trimming of the lower 16 bits from p and q: pt = hiword(p) qt = hiword(q) pt − qt < EBN Dt Applying interval arithmetic to determine the error: Pt = [pt − 1.24997, pt + 0.24968] Qt = [qt − 1.24997, qt + 0.24968] Pt − Qt < EBN Dt Subtracting the intervals gives: [pt − qt − 1.49965, pt − qt + 1.49965] < EBN Dt 111 Hence, EBN D = −1.49965 if expressed as a decimal value. Substituting for pt , qt , and hiword() gives: hiword(iri ∗ iya) − hiword(irj ∗ ixb) < −1.49965 However, since the comparison with EBN Dt is an integer one, this can be rewritten as: hiword(iri ∗ iya) − hiword(irj ∗ ixb) < −1 This works because the next lowest integer value is −2, which is both less than −1.49965 and less than −1. The integer value of −1 is both not less than −1.49965 and not less than −1, so the correct behavior is preserved. Thus, EBN Dt is −1, which simplifies the comparison. In fact, EBN Dt is always −1 if the lower n bits of p and q are truncated, given an initial wordlength of n bits. It is also possible to truncate fewer than n bits provided EBN D is adjusted accordingly. 6.3.3 BVH Efficiency BVH traversal efficiency using the integer Plu-AABB test was measured with the same method used for the Slabs-AABB test in Section 6.2.3. Results are presented in Tables 6.5, 6.6, and 6.7. The integer Plu-AABB BVH traversal results are somewhat better than those for BVH traversal using the integer Slabs-AABB test. Using “24/24” bit precision increased the number of ray-triangle tests by a maximum of 2%, and the number of ray-node tests by a maximum of 0%. Compare this to 6% and 2%, respectively, for integer Slabs-AABB. As with Slabs-AABB, 16-bit precision was sufficient for the two simplest scenes (Spheres and Bunny), but not for the more complex scenes. The results suggest that the Pl¨ ucker method has an advantage at lower precisions. For example, the “Powerplant - Boiler” scene shows increases of 431% and 19% for “16/16” bit precision, compared to 1631% and 63% for integer Slabs-AABB. However, when “Powerplant - North” is rendered using the integer 112 Plu-AABB method, the increases are 42% and 6%. When rendered with integer SlabsAABB, the increases are only 29% and 5%. Overall, the integer Plu-AABB test is superior to the integer Slabs-AABB test, but the difference between them is less than a factor of two for most of the cases. Adding one bit of precision to either method should approximately halve the excess ray-triangle and ray-node tests. Therefore, a difference of a factor of two or less is somewhat unimportant because it is easy to make up the difference. 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 5.290 12.623 11.115 38.338 68.164 62.082 11.291 16.938 7.472 43.003 60.837 22.341 4.517 4.948 52.466 33.116 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 1% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 3% 2% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 27% 23% 3% 4% 1% 2% 1% 0% 1% 0% 2% 3% 1% 0% 1% 0% 27% 24% 3% 4% 1% 2% 1% 0% 1% 0% 2% 3% 1% 0% 1% 0% 49% 42% 5% 6% 2% 4% 1% 1% 1% 1% 3% 5% 2% 1% 1% 0% Table 6.5: BVH performance versus precision (Plu-AABB), scenes 1-3. Statistics for the 32-bit FP case are given as per-ray averages. Integer statistics are percentage increases relative to 32-bit FP. Spheres Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) Bunny Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) Branch Ray-Tri Tests (Pri) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sha) 32 bit 24/32 24/28 24/24 20/28 20/24 20/20 16/24 16/20 16/16 FP bits bits bits bits bits bits bits bits bits 113 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2.672 4.700 23.254 29.124 13.945 6.634 11.329 103.226 41.274 89.688 20.783 16.308 12.997 326.527 162.040 201.019 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 2% 2% 2% 0% 0% 0% 1% 1% 2% 0% 0% 0% 1% 1% 0% 0% 2% 2% 2% 0% 0% 0% 1% 1% 2% 0% 0% 0% 1% 1% 0% 0% 4% 3% 4% 0% 1% 0% 2% 1% 2% 0% 0% 0% 1% 1% 0% 0% 41% 32% 42% 6% 4% 4% 29% 14% 34% 5% 3% 6% 14% 14% 2% 2% 42% 33% 43% 6% 4% 4% 29% 14% 35% 5% 3% 6% 15% 15% 2% 2% 74% 64% 76% 8% 6% 6% 47% 24% 52% 7% 5% 8% 25% 24% 4% 4% Table 6.6: BVH performance versus precision (Plu-AABB), scenes 4-6. Statistics for the 32-bit FP case are given as per-ray averages. Integer statistics are percentage increases relative to 32-bit FP. Poppy Ray-Tri Tests (Pri) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sha) Powerplant - Front Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) Powerplant - North Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) 32 bit 24/32 24/28 24/24 20/28 20/24 20/20 16/24 16/20 16/16 FP bits bits bits bits bits bits bits bits bits 114 1% 1% 0% 0% 0% 0% 0% 0% 0% 0% 21.019 14.955 362.569 313.087 6.536 10.369 12.058 59.852 47.344 63.399 0% 0% 0% 0% 0% 0% 1% 1% 0% 0% 0% 0% 0% 0% 0% 0% 1% 1% 0% 0% 4% 0% 3% 1% 0% 1% 10% 11% 1% 1% 4% 0% 4% 1% 0% 1% 11% 11% 1% 1% 7% 0% 6% 1% 0% 1% 17% 17% 1% 1% 94% 2% 74% 10% 0% 12% 226% 225% 11% 9% 96% 2% 76% 10% 1% 13% 232% 230% 11% 9% 185% 5% 150% 18% 1% 23% 431% 417% 19% 15% Table 6.7: BVH performance versus precision (Plu-AABB), scenes 7-8. Statistics for the 32-bit FP case are given as per-ray averages. Integer statistics are percentage increases relative to 32-bit FP. Powerplant - Boiler Ray-Tri Tests (Pri) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sha) Lucy Ray-Tri Tests (Pri) Ray-Tri Tests (Sec) Ray-Tri Tests (Sha) Ray-Node Tests (Pri) Ray-Node Tests (Sec) Ray-Node Tests (Sha) 32 bit 24/32 24/28 24/24 20/28 20/24 20/20 16/24 16/20 16/16 FP bits bits bits bits bits bits bits bits bits 115 Chapter 7 Integer Hardware for Ray-AABB Tests 7.1 Introduction This chapter presents hardware designs for the integer Slabs-AABB and Plu-AABB tests described in Chapter 6. Performance estimates of the integer ray-AABB circuits are given and are compared to the equivalent 32-bit floating-point circuits. The results show that there is a substantial advantage to using the integer approach versus floating-point. The circuit area and number of pipeline stages are reduced by approximately 50% or greater, while maintaining similar operating speeds. 7.2 Hardware Design Issues It is difficult to obtain an exact estimate of the benefit of using reduced precision integer arithmetic in a hardware ray tracer. There are many different ways to implement arithmetic circuits [WH04], and each method has size, speed and power consumption tradeoffs. For example, one way to increase the operating frequency of an arithmetic circuit is to add pipeline stages. This also increases the circuit’s area. For example, the Intel XScale processor’s arithmetic circuits run at only 400 MHz. The equivalent circuits in the Pentium 4 processor run at up to 3.8 GHz, but are an order of magnitude larger than those in the XScale [Fle05]. Both chips were designed at approximately the same time and are manufactured using similar technologies. There are also several design and implementation options for hardware devices. The easiest, cheapest to design, and lowest-performing option is to use an FPGA (Field Programmable Gate Array) [Sha98]. These devices consist of an array of reconfigurable cells 116 117 and interconnects that can be software-programmed to perform virtually any function. The circuit functions are specified in a hardware description language such as Verilog or VHDL. FPGAs are excellent for prototyping work because they can be easily re-programmed. The next option is to build a hard-wired VLSI chip. This gives higher performance than an FPGA, but is more complex and expensive to design. Additionally, the design cannot be altered without re-fabricating the chip. The chip functionality is specified in the Verilog or VHDL language. The designer may specify both standard-cell components (e.g. arithmetic and logic circuits) and low-level components such as individual transistors. A synthesis tool automatically lays out the transistors and wiring for the standard cell components. The designer may optimize a chip by replacing high-level standard cells with customized designs consisting of lower-level components and simpler standard cells, but this is labour-intensive and requires additional expertise. Most modern chips are hybrid designs that mix standard cells and customized circuitry. The circuit performance estimates presented in this chapter are based on arithmetic circuits that have been created by a synthesis tool. A full-custom design could improve the performance substantially, however, the relative benefit from using the integer ray-AABB tests versus floating-point should remain approximately the same. A full-custom design is left for future work. 7.3 Circuit Area, Delay, and Pipeline Stages To estimate the relative advantage from using the integer ray-AABB tests, the circuit area, delay, and number of pipeline stages for various arithmetic and logic circuits must first be examined. These statistics are presented in Table 7.1. Area is the amount of space taken up on a chip by the circuit. Delay is the amount of time needed to produce an output after an input is received. The more complex circuits are broken into several pipeline stages. Pipelining is a form of parallel processing that 118 Area (mm2 ) Function Integer Subtract (16-bit) Integer Subtract (20-bit) Integer Subtract (24-bit) Integer Multiply (16-bit) Integer Multiply (20-bit) Integer Multiply (24-bit) Integer Magnitude Compare (20-bit) Integer Magnitude Compare (24-bit) FP Subtract (32-bit) FP Multiply (32-bit) FP Magnitude Compare (32-bit) Scan Latch (1-bit) Scan Latch (20-bit) Scan Latch (24-bit) Scan Latch (32-bit) 2:1 Multiplexor (1-bit) 2:1 Multiplexor (20-bit) 2:1 Multiplexor (24-bit) 2:1 Multiplexor (32-bit) 0.007964 0.011005 0.013677 0.067137 0.089194 0.134324 0.004533 (est.) 0.005439 0.105673 0.174797 0.013083 0.000084 0.001670 0.002004 0.002673 0.000030 0.000608 0.000730 0.000973 Delay (ns) Pipeline stages 1.1 0 1.1 0 1.2 0 2.6 2 3.0 2 3.0 3 0.477 0 0.477 0 2.7 7 2.3 5 0.477 0 0.176 0 0.176 0 0.176 0 0.176 0 0.176 0 0.176 0 0.176 0 0.176 0 Table 7.1: Circuit area, delay and pipeline stages. increases circuit performance, because each pipeline stage may process a different piece of data simultaneously. The operating speed of a pipeline is limited by the delay of the slowest pipeline stage (indicated in the table.) The arithmetic circuits were synthesized using commercial tools (Synopsis Module- and Design-Compiler) using a standard-cell library provided by Artisan. The Cadence Silicon Ensemble tool was used for circuit placement and routing. The synthesized designs assume a 250 nanometre (0.25 micron) TSMC (Taiwan Semiconductor Manufacturing Corporation) bulk CMOS chip process technology with 5 layers of metal interconnect. Currently, the state-of-the-art in chip fabrication is a 90 nanometre (0.09 micron) process. Hence, the circuit area estimates presented here based on the 0.25 micron process will be larger than estimates based on an 0.09 micron process. This does not alter the conclusions because only the relative difference between integer and floating-point circuit designs matters. 119 The subtraction and multiplication circuits receive two inputs and produce one result. The subtraction circuit (labeled as a “-”) is a fast carry look-ahead design [Kor02]. The multiplier (labeled as an “x”) is a radix-8 Booth encoded design. The integer and floating-point magnitude comparators compare two numbers and output three signals: less-than, equal-to (not used for these designs), or greater-than. They are labeled as “<>” in the schematics. Latches are registers that hold data for a single clock cycle. They are often used to buffer the outputs of other circuits. Scan latches are similar to regular latches, except they incorporate testing capabilities. The testing capabilities of the scan latches are necessary for troubleshooting of complex chips. Scan latches are labeled with an “L” in the schematics in Section 7.4. The width of the latch in bits is given after the “L”, e.g. “L:1” for a 1-bit latch. A 2:1 multiplexor (or MUX) is a circuit that takes two values as input and outputs one of them, depending on whether its control input is a “1” or a “0”. The word-wide (multi-bit) muxes are labeled with “2:1”, and the single-bit muxes are labeled as “b:2:1”. The synthesis tool did a poor job of generating the latches and muxes, so their performance estimates are based on data from full-custom commercial designs [Fle05]. This data was based on a different process technology than for the synthesized circuits and has therefore been scaled to match the 250nm synthesis target. AND/OR logic gates are present in the schematics but are not shown in Table 7.1 or factored into the performance estimates because their contribution to area and delay is negligible. Logic inverters (shown as a “bubble” or small circle attached to the inputs of other circuit elements) are also not counted because their contribution is negligible. 120 Input Latches rx ry rz rlen ri-1 rj-1 rk-1 rinf xmin xmax ymin ymax zmin zmax FP Slabs-AABB Pipeline Stage 1-7 Stage 8-12 Stage 13 Input Latches irx iry irz irlen iri-1floor iri-1ceil irj-1floor irj-1ceil irk-1floor irk-1ceil rinf ixmin ixmax iymin iymax izmin izmax 24-bit Integer Slabs-AABB Pipeline Stage 1 Stage 2-4 Stage 5 Input Latches rx ry rz rex rey rez ri rj rk rinf xmin xmax ymin ymax zmin zmax FP Plu-AABB Pipeline Stage 1-7 Stage 8-12 Stage 13-20 Input Latches irx iry irz irex irey irez iri irj irk rinf ixmin ixmax iymin iymax izmin izmax 24-bit Integer Plu-AABB Pipeline Stage 1 Stage 2-4 Stage 5 HIT HIT HIT HIT Figure 7.1: Overview of the floating-point Slabs-AABB, 24-bit integer Slabs-AABB, floating-point Plu-AABB and 24-bit integer Plu-AABB pipelines. 7.4 Hardware Designs Four ray-AABB intersection pipelines were designed. The pipelines implement the floatingpoint Slabs-AABB test (Section 2.3.1), the integer Slabs-AABB test (Section 6.2), the floating-point Plu-AABB test (Section 3.3), and the integer Plu-AABB test (Section 6.3.) An overview of the ray-AABB pipelines is shown in Figure 7.1. The pipelines are fed by sets of input latches containing the ray and AABB parameters. Data flows through the pipelines from left to right, and is processed by the pipeline stages. A “HIT” signal is output from the end of the pipeline if the ray hits the AABB. Data is 121 shifted to the next pipeline stage at the beginning of each clock cycle. One ray-AABB test is completed with each clock cycle, and the pipeline is working on n ray-AABB tests simultaneously, where n is the number of pipeline stages. It takes n clock cycles for any individual ray-AABB test to be completed. The ray and AABB parameters stored within the input latches do not necessarily need to change with each clock cycle. This lets the AABB or ray be held constant. For example, the pipelines might be used to test one ray against many AABBs (traditional single-ray ray tracing), or many rays against one AABB (coherent ray tracing, as in Chapter 5.) Both the ray and AABB can also change, although this is usually not how a ray tracer would operate. The pipelines can run in a non-pipelined fashion, but then a ray-AABB test result is only produced every n clock cycles, where n is the number of pipeline stages. Keeping the pipeline full may not always be possible due to delays in loading rays and/or AABBs from memory, and due to the nature of the BVH traversal. In particular, the BVH traversal requires knowledge of an AABB hit/miss before it can determine which AABBs to test for intersection next. Keeping a pipeline full under these circumstances can be challenging, thus the pipeline should be kept as short as possible. Sundararajan also discusses hardware implementation of the Slabs-AABB test [SN01, Sun99]. The design is implemented on an FPGA, so the results are not comparable to this research which assumes a custom VLSI design. Additionally, no consideration is given to varying the precision or using integer versus floating-point, or the effect of precision on the ray-AABB calculations and BVH traversal. 7.4.1 Floating-point Slabs-AABB Pipeline Stages 1-12 of the 32-bit floating-point Slabs-AABB pipeline are shown in Figure 7.2. In stages 1-7, the ray origin is subtracted from the AABB coordinates. In stages 8-12, the 122 subtraction result is multiplied by the IRDVCs, producing the intersection distances to the planes of the AABB faces. It takes 12 clock cycles (one clock cycle per pipeline stage) for the intersection distances to be produced at stage 12 after the inputs are presented to stage 1. The slowest path through the circuit is through the 7-stage subtractor, and then through the 5-stage multiplier. Latches are needed on the subtractor and multiplier outputs to hold the results stable, so they can be absorbed by the next pipeline stage. There are numerous other latches that are necessary to pass values to stage 13 of the pipeline. Stage 13 of the floating-point pipeline is shown in Figure 7.3. This stage first sorts the six ray-AABB intersection distances based on the sign bits of the IRDVCs, forming the three [tnear, tf ar] intervals. These intervals are then compared to each other and the ray to determine if they all overlap, in which case the ray hits the AABB. The circuit outputs a one-bit “HIT” signal, which is a “1” for a ray-AABB hit, and a “0” for a miss. 123 rinf L:1 L:1 L:1 L:1 L:1 L:1 L:1 L:1 L:1 L:1 L:1 L:1 rlen L:32 L:32 L:32 L:32 L:32 L:32 L:32 L:32 L:32 L:32 L:32 L:32 xmin L:32 - xmin-rx t0x x rx L:32 ri-1sign ri-1 L:32 L:32 L:32 L:32 L:32 L:32 xmax - L:32 - L:32 L:32 L:1 L:1 xmax-rx x L:32 x L:32 L:1 L:1 L:1 t1x rx ymin ymin-ry ry t0y rj-1sign rj-1 L:32 L:32 L:32 L:32 L:32 L:32 ymax - L:32 - L:32 L:32 L:1 L:1 ymax-ry x L:32 x L:32 L:1 L:1 L:1 t1y ry zmin zmin-rz rz t0z rk-1sign rk-1 L:32 L:32 L:32 L:32 L:32 L:32 zmax - L:32 L:32 L:1 zmax-rz x L:1 L:32 L:1 L:1 t1z rz FP Slabs Pipeline Stage 1-6 (Slowest/Critical path) 1st – 6th stage of FP Subtract (2.7 ns per stage) FP Slabs Pipeline Stage 7 7th stage of FP Subtract (2.7 ns) 32-bit output latch (0.176ns) (Total: 2.876 ns) FP Slabs Pipeline Stage 8-11 (Slowest/Critical path) 1st-4th stage of FP Multiply (2.3 ns per stage) FP Slabs Pipeline Stage 12 (Slowest/Critical path) 5th stage of FP Multiply (2.3 ns) 32-bit output latch (0.176ns) (Total: 2.476 ns) Figure 7.2: Stages 1-7 and 8-12 of the Floating-point Slabs-AABB pipeline. L:1 ri-1sign rj-1sign rk-1sign t0x t1x t0y t1y t0z t1z 1 2:1 2:1 2:1 2:1 2:1 < tfarxy tnearxy 2:1 > 2:1 < > 2:1 2:1 tfar rlen tnear tfarsign < > rinf Figure 7.3: Stage 13 of the Floating-point Slabs-AABB pipeline. tfarz tnearz tfary tneary tfarx tnearx <> <> 2:1 <> <> <> <> 0 L:1 HIT FP Slabs Pipeline Stage 13 (Slowest/Critical path) 2:1 mux (0.176 ns) FP Mag Compare (0.477 ns) 2:1 mux (0.176 ns) FP Mag Compare (0.477 ns) 2:1 mux (0.176 ns) FP Mag Compare (0.477 ns) OR gate (0.075 ns) AND gate (0.075 ns) 1-bit output latch (0.176 ns) (Total: 2.285 ns) 124 125 7.4.2 Integer Slabs-AABB Pipeline Stages 1-4 of the 24-bit integer Slabs-AABB pipeline are shown in Figures 7.4 and 7.5. This explanation here will cover x-axis, as the y- and z-axis are similar. First, the integer ray origin irx is subtracted from the integer AABB coordinates ixmin and ixmax, producing the values dx0 and dx1. These values become dxnear and dxf ar based on the sign of iri−1 f loor . −1 The signs of dxnear and dxf ar then choose iri−1 near and irif ar , which are then multiplied by dxnear and dxf ar to produce itnearx and itf arx . These somewhat complex rules are explained in Section 6.2.2. Fortunately, hardware implementation of these rules is simple, requiring only 2:1 multiplexors, which are both small and fast. Only the top 24 bits of the multiplication result are passed to the next stage, the lower bits are discarded. This matches the “24/24”-bit integer method tested in Section 6.2.3. Stage 5 is shown in Figure 7.6. It’s function is similar to stage 13 in the floating-point circuit as it determines whether the three [itnear, itf ar] intervals and the ray overlap, outputting a one-bit “HIT” signal. 126 iri-1floor iri -1 0 1 ceil rinf L:1 L:1 L:1 L:1 irlen L:24 L:24 L:24 L:24 iri-1near 2:1 L:24 itnearx dxnearsign ixmin x L:24 dx0 irx ixmax dx1 2:1 L:24 2:1 L:24 dxnear iri-1floorsign - dxfar itfarx irx x dxfarsign iri-1ceil iri-1floor 2:1 L:24 irj-1floor irj-1near 2:1 irj-1ceil dynearsign iymin L:24 iri-1far L:24 itneary x L:24 dy0 iry iymax dy1 - 2:1 L:24 2:1 L:24 dynear irj-1floorsign dyfar itfary iry x dyfarsign irj-1ceil irj-1floor L:24 irj-1far 2:1 L:24 Figure 7.4: Stages 1-4 of the 24-bit Integer Slabs-AABB pipeline (part A.) 127 irk-1floor irk-1near 0 2:1 1 irk-1ceil dznearsign izmin L:24 itnearz x L:24 dz0 irz izmax dz1 - 2:1 L:24 2:1 L:24 dznear irk-1floorsign dzfar itfarz irz x dzfarsign irk-1ceil irk-1floor L:24 irk-1far 2:1 L:24 Int Slabs Pipeline Stage 1 (Slowest/Critical path) 24-bit Subtract (1.2 ns) 24-bit 2:1 Mux (0.176 ns) 24-bit 2:1 Mux (0.176 ns) 24-bit Output Latch (0.176 ns) (Total: 1.728 ns) Int Slabs Pipeline Stage 2-3 (Slowest/Critical path) 1st – 2nd stage of 24-bit Multiply (3.0 ns per stage) Int Slabs Pipeline Stage 4 (Slowest/Critical path) 3rd stage of 24-bit Multiply (3.0 ns) 24-bit Output Latch (0.176 ns) (Total: 3.176 ns) Figure 7.5: Stages 1-4 of the 24-bit Integer Slabs-AABB pipeline (part B.) 128 < <> 0 itneary 1 itnearxy 2:1 <> itnearx < itnear itnearz 2:1 rinf irlen itfarxy > HIT <> 2:1 > itfarz <> <> itfary > <> itfarx Int Slabs Pipeline Stage 5 (Slowest/Critical path) 24-bit Compare (0.477 ns) 24-bit 2:1 Mux (0.176 ns) 24-bit Compare (0.477 ns) 24-bit 2:1 Mux (0.176 ns) 24-bit Compare (0.477 ns) OR gate (0.075 ns) AND gate (0.075 ns) 1-bit Output Latch (0.176 ns) (Total: 2.109 ns) 2:1 < itfar L:1 itfarsign Figure 7.6: Stage 5 of the 24-bit Integer Slabs-AABB pipeline. 7.4.3 Floating-point Plu-AABB Pipeline Stages 1-7 of the 32-bit floating-point Plu-AABB pipeline are shown in Figures 7.7 and 7.8. This part of the pipeline performs two functions. First, the ray origin and endpoint are compared to the AABB coordinates to determine if the AABB and ray can overlap. If the ray is infinite, the ray endpoint is ignored by using an OR gate on the output of the circuitry that performs the endpoint comparison. The comparisons of the origin and endpoint with the ray origin produces a single-bit hitExtent signal that is used later in the pipeline. Second, the ray origin is subtracted from the AABB coordinates, producing values xa, xb, ya, yb, za, and zb. These values are passed to the next part of the pipeline. There are numerous multiplexors (2:1 muxes) in this stage. These permit the same circuitry to handle all eight cases of the ray direction vector (e.g. “MMM”, “MMP”, etc.), avoiding the need for eight separate ray-AABB pipelines. The muxes are controlled by the sign bits of the (ri, rj, rk) ray direction vector components. 129 The slowest path through this part of the pipeline is through the subtractors. The circuitry that computes the hitExtent signal is fast enough to fit within a single pipeline stage. Hence, six additional latches are needed to move the hitExtent signal to the beginning of stage 8. Numerous latches are also needed to move the (ri, rj, rk) ray direction vector components to the next part of the pipeline (stage 8). Stages 8-12 of the 32-bit floating-point Plu-AABB pipeline are shown in Figure 7.9. These stages select and multiply the appropriate (ri, rj, rk) ray direction vector components with the xa, xb, ya, yb, za, and zb values. Muxes select the appropriate values to multiply based on the sign bits of the (ri, rj, rk) ray direction components. This permits the circuitry to handle all eight ray direction classifications. Multiplication results P 0..P 5 and Q0..Q5 are produced and are passed to stage 13. Five one-bit latches also pass the hitExtent signal to stage 13. Stages 13-20 are shown in Figure 7.10. Multiplication results P 0..P 5 and Q0..Q5 are subtracted to produce the six side() relations that give the relative orientation of the ray and the six AABB silhouette edges. These six values are compared with zero, and the results of the comparisons are ANDed together to determine if the ray passes through the AABB silhouette. This result is then ANDed with hitExtent to produce the final “HIT” signal. 130 zmax zmin 0 1 < 2:1 <> > 0 b:2:1 1 rinf rksign rez < ymax ymin 2:1 <> b:2:1 > rjsign rey < xmax xmin 2:1 <> b:2:1 > risign rex hitExtent < zmax zmin 2:1 <> L:1 L:1 L:1 L:1 b:2:1 < ymax 2:1 <> b:2:1 > rjsign ry < xmax xmin L:1 > rksign rz ymin L:1 2:1 <> b:2:1 > risign rx rksign rjsign risign L:1 ri L:32 L:32 L:32 L:32 L:32 L:32 L:32 rj L:32 L:32 L:32 L:32 L:32 L:32 L:32 rk L:32 L:32 L:32 L:32 L:32 L:32 L:32 Figure 7.7: Stage 1-7 of the Floating-point Plu-AABB pipeline (part A.) 131 xa xmin rx ymin L:32 ya ry zmin L:32 za - L:32 rz xb xmax - L:32 rx yb ymax ry zmax L:32 zb - L:32 rz FP Pluecker Pipeline Stage 1-6 (Slowest/Critical path) 1st-6th stage of Subtract (2.7 ns per stage) FP Pluecker Pipeline Stage 7 (Slowest/Critical path) 7th stage of Subtract (2.7 ns) 32-bit output latch (0.176 ns) Total delay: 2.876 ns Figure 7.8: Stage 1-7 of the Floating-point Plu-AABB pipeline (part B.) 132 hitExtent L:1 L:1 0 1 L:1 L:1 L:1 xa/xb/rjsign Q0 2:1 rj rjsign x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 x L:32 xa/xb/rksign Q3 2:1 xa rk rksign xb/xa/rjsign P1 2:1 rjsign rj xb/xa/rksign P2 2:1 xb rk rksign ya/yb/risign Q1 2:1 risign ri ya/yb/rksign Q4 2:1 ya rksign rk yb/ya/risign P0 2:1 risign yb/ya/rksign ri P5 2:1 yb rk rksign za/zb/risign Q2 2:1 ri risign za/zb/rjsign Q5 2:1 za rj rjsign zb/za/risign P3 2:1 risign ri zb/za/rjsign P4 2:1 zb rjsign FP Pluecker Pipeline Stage 8 (Slowest/Critical path) 32-bit 2:1 Mux (0.176 ns) 1st stage of Multiply (2.3 ns) Total delay: 2.476 ns rj FP Pluecker Pipeline Stage 9-11 (Slowest/Critical path) 2nd-4th stage of Multiply (2.3 ns per stage) FP Pluecker Pipeline Stage 12 (Slowest/Critical path) 5th stage of Multiply (2.3 ns) 32-bit Output Latch (0.176 ns) Total delay: 2.476 ns Figure 7.9: Stages 8-12 of the Floating-point Plu-AABB pipeline. 133 hitExtent L:1 L:1 L:1 L:1 L:1 L:1 L:1 P0 L:32 <> - < zero Q0 hit01 P1 L:32 <> - zero Q1 < P2 L:32 <> - < zero Q2 HIT hit23 P3 L:32 <> - zero Q3 < P4 L:32 <> - < zero Q4 hit45 P5 Q5 L:32 <> - zero FP Pluecker Pipeline Stage 13-18 (Slowest/Critical path) 1st-6th stage of Subtract (2.7 ns per stage) FP Pluecker Pipeline Stage 19 (Slowest/Critical path) 7th stage of Subtract (2.7 ns) 32-bit Output Latch (0.176 ns) Total delay: 2.876 ns < FP Pluecker Pipeline Stage 20 (Slowest/Critical path) Mag compare (0.477 ns) AND gate (0.075 ns) AND gate (0.075 ns) AND gate (0.075 ns) 1-bit Output Latch (0.176 ns) Total delay: 0.878 ns Figure 7.10: Stages 13-20 of the Floating-point Plu-AABB pipeline. L:1 134 7.4.4 Integer Plu-AABB Pipeline The 24-bit integer Plu-AABB pipeline is similar to the 32-bit floating-point version, with a few minor differences. Stage 1 of the integer pipeline is shown in Figures 7.11 and 7.12. This roughly corresponds to stages 1-7 of the floating-point pipeline. Here, the primary difference between the integer and floating-point pipelines is that the muxes found in stages 8-12 of the floatingpoint pipeline have been moved to stage 1 of the integer pipeline, instead of stages 2-4. This is possible because the integer subtraction circuit in stage 1 is less than one stage long, permitting the muxes from the next stage to be combined within the same stage. The floating-point subtractor requires seven full stages, so the muxes must be in a separate stage (8) in the floating-point pipeline. The slowest path through stage 1 is through the subtractor and the muxes. Stages 2-4 of the integer pipeline are shown in Figure 7.13. This stage performs the multiplications that produce the integer P 0..P 5 and Q0..Q5 values. Only the top 24 bits of these values are passed to the next stage. This matches the “24/24”-bit integer method tested in Section 6.3.3. Stage 5 of the integer pipeline (Figure 7.14) is similar to stages 13-20 of the floatingpoint pipeline, except the subtraction results are compared with EBN D (-1) instead of zero. 135 izmax izmin 0 1 < 2:1 <> > 0 1 b:2:1 rinf irksign irez < iymax iymin 2:1 <> b:2:1 > irjsign irey < ixmax ixmin 2:1 <> b:2:1 > irisign irex hitExtent izmin L:1 < izmax 2:1 <> b:2:1 > irksign irz < iymax iymin 2:1 <> b:2:1 > irjsign iry < ixmax ixmin 2:1 <> b:2:1 > irksign irjsign irisign irisign irx iri (for next stage) iri L:24 irj L:24 irk L:24 irj (for next stage) irk (for next stage) Figure 7.11: Stage 1 of the 24-bit Integer Plu-AABB pipeline (part A.) 136 ixa/ixb/irjsign 0 2:1 L:24 1 ixmin irjsign - irx ixa/ixb/irksign 2:1 ixa irksign L:24 ixb/ixa/irjsign 2:1 ixmax irjsign - irx L:24 ixb/ixa/irksign 2:1 ixb irksign L:24 iya/iyb/irisign 2:1 iymin irisign - iry L:24 iya/iyb/irksign 2:1 iya irksign L:24 iyb/iya/irisign 2:1 iymax irisign - iry L:24 iyb/iya/irksign 2:1 iyb irksign L:24 iza/izb/irisign 2:1 izmin irisign - irz L:24 iza/izb/irjsign 2:1 iza irjsign L:24 izb/iza/irisign 2:1 izmax irisign - irz L:24 izb/iza/irjsign 2:1 izb L:24 irjsign Int Pluecker Pipeline Stage 1 (Slowest/Critical path) 24-bit Subtract (1.2 ns) 24-bit 2:1 Mux (0.176 ns) 24-bit Output Latch (0.176ns) Total delay: 1.552 ns Figure 7.12: Stage 1 of the 24-bit Integer Plu-AABB pipeline (part B.) 137 hitExtent (for next stage) hitExtent L:1 Q1 iya/iyb/irisign x L:1 P1 ixb/ixa/irjsign x L:24 iri L:24 irj Q0 ixa/ixb/irjsign x Q3 x x iri izb/iza/irisign L:24 P3 x L:24 irk P0 iyb/iya/irisign L:24 irj ixa/ixb/irksign L:24 iri Q2 iza/izb/irisign x iri iza/izb/irjsign x irk iyb/iya/irksign L:24 P5 x L:24 irj P2 ixb/ixa/irksign L:24 Q5 x L:24 irk Q4 iya/iyb/irksign x irk L:1 P4 izb/iza/irjsign L:24 x L:24 irj Int Pluecker Pipeline Stage 2-3 (Slowest/Critical path) 1st – 2nd stage of 24-bit Multiply (3.0 ns per stage) Int Pluecker Pipeline Stage 4 (Slowest/Critical path) 3rd stage of 24-bit Multiply (3.0 ns) 24-bit Output Latch (0.176 ns) (Total: 3.176 ns) Figure 7.13: Stages 2-4 of the 24-bit Integer Plu-AABB pipeline. 138 P0 - < <> EBND Q0 hit01 P1 - <> EBND Q1 < P2 - < <> EBND Q2 Int Pluecker Pipeline Stage 5 (Slowest/Critical path) Subtract (1.2 ns) Mag Compare (0.477 ns) AND gate (0.075 ns) AND gate (0.075 ns) AND gate (0.075 ns) 1-bit output latch (0.176 ns) Total delay: 2.078 ns hit23 HIT P3 - <> EBND Q3 hitExtent < P4 - < <> EBND Q4 hit45 P5 - <> Q5 EBND < Figure 7.14: Stage 5 of the 24-bit Integer Plu-AABB pipeline. L:1 139 7.5 Performance Estimates Ray-AABB pipeline performance was estimated based on the pipeline designs and the circuit performance parameters presented in Section 7.3. Performance estimates for the integer and floating-point ray-AABB designs are presented in Table 7.2. The table includes estimates for “20/24”-bit integer Slabs-AABB and Plu-AABB in addition to the “24/24”bit versions of these tests. Schematics for the “20/24”-bit hardware tests are not shown because of their similarity to the “24/24”-bit designs. The table summarizes the component counts, total circuit area, number of pipeline stages, pipeline stage delay, clock frequency, pipeline latency, and area-delay product. Lower values are better for all measures except clock frequency. Pipeline stage delay is the delay of the slowest pipeline stage. This determines the clock frequency of the circuit, which is the inverse of the pipeline stage delay. Pipeline latency is the time required to produce a “HIT” signal after the inputs are presented to the first pipeline stage, and is simply the number of stages multiplied by the pipeline stage delay. Area-delay product is the circuit area multiplied by the pipeline stage delay, and is a measure of overall performance which considers both circuit area and clock frequency. Area-delay product is a useful measure of performance for highly parallel systems (such as a ray tracing chip.) For example, circuit A might have an area of 1 mm2 and a delay of 1 ns. Thus, its area-delay product is 1 mm2 ∗ ns. Circuit B might have an area of 0.5 mm2 and a delay of 2 ns, with an area-delay product of 1 mm2 ∗ ns. The area-delay products are the same, despite circuit B only being half the speed of circuit A. However, two circuit Bs can fit within the same chip area as one circuit A, thus the overall performance (and area-delay product) are the same. This assumes that operations can be performed in parallel with 100% efficiency. 24/24 bit Int Plu-AABB Area (mm2 ) (5) 0.0004 20/24 bit Int Plu-AABB Area (mm2 ) (4) 0.0003 (15) 0.0251 (12) 0.0241 (6) 0.5352 (6) 0.0660 (12) 0.0073 (4) 0.0029 (12) 1.6119 (12) 0.1641 (18) 0.0131 (6) 0.0002 (27) 0.0541 (6) 0.0272 (6) 0.0326 (6) 0.0660 (6) 0.0821 (12) 1.0703 (6) 0.0002 (18) 0.0109 Table 7.2: Ray-AABB pipeline area and performance. Values in parenthesis are the total number of components, and areas represent the total area. ADP is the Area-Delay Product. Percentages are relative to 32-bit FP Slabs-AABB. (6) 0.8059 (6) 0.0821 (16) 0.0117 (38) 0.0762 13 20 (154%) 5 (38%) 4 (31%) 5 (38%) 4 (31%) 2.876 2.876 (100%) 3.176 (110%) 3.176 (110%) 3.176 (110%) 3.176 (110%) 348 348 (100%) 315 (91%) 315 (91%) 315 (91%) 315 (91%) 37.388 57.520 (154%) 15.880 (42%) 12.704 (34%) 15.880 (42%) 12.704 (34%) 5.546 10.533 (190%) 3.205 (58%) 2.247 (41%) 6.063 (109%) 4.252 (77%) (18) 0.0175 (10) 0.0097 20/24 bit Int Slabs-AABB Area (mm2 ) (5) 0.0004 (33) 0.0551 (4) 0.0080 Pipe stages: Stage delay (ns): Clock freq. (M Hz): Pipe latency (ns): ADP (mm2 ∗ ns): (45) 0.1203 (6) 0.0002 (58) 0.1550 24/24 bit Int Slabs-AABB Area (mm2 ) (6) 0.0005 (6) 0.0326 (6) 0.0326 (12) 0.0653 (6) 0.6340 (12) 1.2681 (6) 1.0488 (12) 2.0976 (6) 0.0785 (12) 0.1570 1.9284 3.6624 (190%) 1.0090 (52%) 0.7075 (37%) 1.9091 (99%) 1.3388 (69%) 32 bit FP Plu-AABB Area (mm2 ) (20) 0.0017 32 bit FP Slabs-AABB Area (mm2 ) (29) 0.0024 1-bit Latch 20-bit Latch 24-bit Latch 32-bit Latch 1-bit Mux 20-bit Mux 24-bit Mux 32-bit Mux 20-bit Int Subtract 24-bit Int Subtract 20-bit Int Multiply 24-bit Int Multiply 20-bit Int Compare 24-bit Int Compare FP Subtract FP Multiply FP Compare Total area (mm2 ): Component 140 141 The estimates in Table 7.2 account for wiring complexity and delay within the circuit components, but not between the components (e.g. between a multiplier and a scan latch.) The wiring complexity between the pipeline stages is trivial, because the stages would be placed adjacent to each other on a chip. Therefore, ignoring the inter-component wiring does not alter the conclusions significantly. When examining Table 7.2, first compare the 32-bit floating-point Slabs-AABB pipeline to the floating-point Plu-AABB pipeline. The results are surprising, given the superiority of the software Plu-AABB test in Chapters 3 and 4. In hardware, Slabs-AABB is the clear winner, at least when using the pipeline designs presented in this chapter. The circuit area and area-delay product of the Plu-AABB pipeline are 90% greater than the Slabs-AABB pipeline. The pipeline stage delays and clock speeds are identical. The number of pipeline stages and the pipeline latency are 54% greater for the Plu-AABB pipeline. The relatively poor performance of the Plu-AABB pipeline is primarily due to the large number of arithmetic circuits. Twice as many subtraction, multiplication, and comparison circuits are needed, compared to the Slabs-AABB pipeline. The reader should not conclude that the Plu-AABB method is a poor candidate for hardware implementation, because these estimates are based on one particular pipeline design. A better design might be able to reduce the number of arithmetic circuits by re-using them for multiple calculations. If so many arithmetic operations are needed for the Plu-AABB test, why does it perform so well in software? One explanation is that the structure of the Plu-AABB code (Appendix D) causes it to run more efficiently on modern processors, despite potentially performing more arithmetic operations. For example, the Plu-AABB code is able to finish sooner when a “MISS” occurs than the Slabs-AABB code (Appendix E), because it can perform a ray-AABB extent test before performing any calculations. Also, all the comparisons and most of the calculations in the Plu-AABB code are independent, which is not the case for Slabs-AABB where 142 there is much dependency on the results of previous algorithm steps (specifically, the tnear and tf ar values.) This independence likely increases the execution efficiency on today’s deeply-pipelined processors. An interesting exercise would be to test both Plu-AABB and Slabs-AABB in a profiling tool that shows the contents of the processor’s pipeline stages. Unfortunately, the profiling tools that I have used, such as Intel’s VTune 7.1 [Int] are not this advanced. Another explanation for the performance discrepancy is that on a general-purpose CPU, only a small portion of the execution time is consumed by arithmetic operations. Much of the time is spent fetching and decoding instructions and moving data around the CPU. Similarly, only a small amount of chip area is devoted to arithmetic circuitry. Thus, doubling the number of arithmetic operations does not halve the performance. The ray-AABB pipeline is more efficient because the bulk of the chip area is devoted to arithmetic operations, there are no instructions to fetch and decode, and data only passes between adjacent pipeline stages. Thus, doubling the number of arithmetic operations has a much larger impact on performance for the ray-AABB pipeline. Next, examine the integer versus floating-point results for the Slabs-AABB pipelines. The integer pipeline performs substantially better than the floating-point pipeline. The circuit areas for the integer pipelines are 37% (for “20/24”-bit precision) or 52% (“24/24”bit precision) of the 32-bit floating-point circuit area. The integer Slabs-AABB pipelines also have 31% to 38% of the pipeline stages of the floating-point pipeline. A shorter pipeline is desirable, because it is easier to keep it filled with data, and the time needed to produce a result is less. The integer pipeline latency is 34% to 42% of the floating-point pipeline latency. Recall that if the pipelines are kept full, they will produce one ray-AABB test result per clock cycle, but any individual result requires computation time equal to the pipeline latency. Longer pipelines are more difficult to keep filled with data, and are less likely to be able to produce a useable result with each clock cycle. 143 Interestingly, the clock frequency of the integer Slabs-AABB pipeline is slightly lower than the floating-point pipeline. This is due to the choices of arithmetic circuits, and the reader should not make any general conclusions about the clock frequencies of integer versus floating-point arithmetic circuits. The lower clock frequency does not necessarily translate into lower performance, because of the difference in pipeline length between the integer and floating-point designs. For example, the integer pipeline might be able to operate completely full with data, running at 315 MHz and producing 315 million ray-AABB test results per second. The longer floating-point pipeline running at 348 MHz might only be half full with data, producing a ray-AABB test result every second clock cycle on average, or 174 million ray-AABB tests per second. Area-delay product is also reduced to between 41% and 58% of the floating-point value. This suggests that approximately a factor-of-two improvement in overall ray-AABB testing performance is achieved by using the integer Slabs-AABB method. This area-delay product estimate does not consider pipeline efficiency effects, which will likely increase the benefit from using the integer method. Finally, compare the integer Plu-AABB pipelines to all of the other pipelines. The integer Plu-AABB pipelines have the same number of pipeline stages, pipeline stage delay, clock frequency, and pipeline latency as their integer Slabs-AABB counterparts. Unfortunately, the circuit area is nearly double that of the integer Slabs-AABB pipelines. The “24/24” bit Plu-AABB pipeline is virtually the same size as the floating-point Slabs-AABB pipeline. Similarly, the area-delay product of the Pl¨ ucker integer pipelines are nearly double that of the integer Slabs-AABB pipelines. The area-delay product for the “24/24” bit integer Plu-AABB pipeline actually exceeds the area-delay product of the floating-point Slabs-AABB pipeline. Based on all of these results, the integer Slabs-AABB test is preferred for a hardware implementation. A more advanced Plu-AABB pipeline design might close the gap be- 144 tween the two methods, but the integer Slabs-AABB pipeline is already small, efficient and relatively simple. Chapter 8 Conclusions and Future Work 8.1 Conclusions Several contributions to the field of ray tracing have been made in this research. It has been demonstrated that reducing the numerical precision of a ray tracer can save memory and reduce hardware complexity. New methods were devised that ensured the reduced precision could not negatively affect image quality. To summarize: • Detailed, up-to-date surveys of BVH research and ray tracing hardware were presented. • A new, Pl¨ ucker-coordinate method was devised for performing ray-AABB tests. The Plu-AABB test improves ray tracing performance by up to 27% over the traditional Slabs-AABB test, at least under the given testing conditions. A new method for ordered BVH traversal was also tested, showing large performance gains for complex scenes. • A hierarchical, low-precision integer encoding scheme for BVH coordinates was tested. Using only 8 integer bits per AABB coordinate instead of 32-bit floating-point, the memory consumed by the BVH nodes was reduced by 63%. Hundreds of megabytes of memory were saved for large scenes with tens of millions of triangles. The problem with this approach was the computational overhead incurred from converting the 8-bit relative coordinates back to absolute floating-point coordinates. This problem was solved by using coherent ray tracing, which amortizes the conversion cost over many rays. This reduced the performance penalty for using the method to just a few percent. 145 146 • New integer versions of the Slabs-AABB and Plu-AABB tests were devised that are robust when using low precision integer arithmetic. The integer ray-AABB tests are conservative because uncertain hit/miss cases are assumed to be hits. This ensures that the images are rendered correctly, regardless of the precision used. Test results show that 20-24 bit integer arithmetic is sufficient even for large, complex scenes containing millions of triangles. • Hardware designs were presented for the integer Slabs-AABB and Plu-AABB tests to determine the advantage from using reduced precision integer arithmetic. Generally, a factor-of-two or better savings was achieved in circuit size and complexity (such as the number of pipeline stages). This allows twice as many ray-AABB circuits to be placed on a chip and improves the performance of these circuits. This could double the throughput of a hardware ray tracing chip when performing ray-AABB tests. Additionally, the Slabs-AABB test was found to be superior to the Plu-AABB test when implemented in hardware. 8.2 Future Work Ray tracing is often regarded as a thoroughly-explored and mature field. There is, however, significant opportunity for further work in the topics covered by this research: • Pl¨ ucker coordinates can also be used for uniform grid traversal, but this has not yet been properly investigated. Any ray with a certain direction classification (e.g. M M M ) must exit one of three faces within the current cell. Determining which cell face is exited requires performing up to three side() relation computations with the three edges shared by the faces. It may be possible to use Pl¨ ucker-coordinate methods for other spatial hierarchy schemes as well. 147 • Hierarchy construction heuristics have been used successfully to improve the quality of k-d trees and improve rendering speed [Hav01]. Similar techniques should also apply to BVHs, but thorough investigation has yet to be performed. • Ray coherence was successfully used in Chapter 5 for standard recursive ray tracing, improving performance substantially. Most ray tracing hardware also relies on ray coherence to achieve high performance. Unfortunately, global illumination methods such as path tracing trace many random, non-coherent rays. This may make them unsuitable for coherency-based ray tracing systems. Exploiting ray coherency in global illumination algorithms is a problem requiring further research. • Other hierarchical spatial subdivision schemes may be encoded using low-precision, relative coordinates, not just the BVH. For example, the position of a k-d tree node’s splitting plane could be represented with an 8-bit value relative to the node’s dimensions. • The memory-saving coherent ray tracing technique from Chapter 5 could be combined with the integer methods from Chapters 6 and 7. This might produce a method with the advantages of both techniques. Unfortunately, I have been unable to devise a simple way to combine them. • It may be possible to develop robust integer traversal methods for other spatial subdivision schemes, not just the BVH. This would reduce hardware cost for traversal circuitry. • The integer ray-AABB tests may be improved by making them adaptive. The rayAABB test could be performed starting with 8-bits of precision. The precision could be increased by multiples of 8 bits if the result of the test is uncertain due to imprecision. The ray-AABB test could even be refined one bit at a time, re-using the result for the previous precision. 148 • This research investigated integer methods for BVH traversal, but not for ray-object intersection. An integer ray-triangle intersection algorithm would be useful because hardware costs might be reduced, and integer and floating-point would not need to be mixed within the core part of a hardware ray tracer. The ray-triangle test could be implemented with adaptive precision, only increasing precision for an uncertain result. • By using adaptive-precision integer ray-AABB tests and ray-triangle tests, ray tracer precision might be increased to extremely high levels (64 bits or higher) with little overhead. This would allow rendering of larger, more complex scenes and reduce the incidence of image artifacts caused by imprecision. The vast majority of ray-AABB and ray-triangle tests would not require full-precision, possibly only requiring 16 to 32 bits. Appendix A List of Abbreviations AABB: Axis-Aligned Bounding Box - a bounding box where the planes of the faces are perpendicular to the coordinate system axes. ALU: Arithmetic Logic Unit - circuitry that performs arithmetic and logical operations on numbers. BSP: Binary Space Partitioning (tree) - a hierarchical space subdivision scheme for improving ray tracing performance. BVH: Bounding Volume Hierarchy - used for improving the efficiency of ray tracing. CAD: Computer-Aided Design - usually refers to graphical drafting software, but can mean any type of software where the computer assists with design. CMOS: Complementary Metal Oxide Semiconductor - the most common digital circuit technology. CORDIC: COordinate Rotation DIgital Computer - a technique for computing trigonometric functions to any desired precision. DRAM: Dynamic Random Access Memory - inexpensive, high-density semiconductor memory that must be regularly refreshed or it loses its contents. DSP: Digital Signal Processor - a simplified microprocessor designed for performing numerical operations on signals. FPGA: Field-Programmable Gate Array - a chip that consists of a software reprogrammable array of logic cells. FPU: Floating-Point Unit - a hardware unit that performs floating-point arithmetic operations. GPU: Graphics Processing Unit - a chip that performs real-time 3D rendering, typically 149 150 using a scan-conversion algorithm (not ray tracing.) IEEE: Institute of Electrical and Electronics Engineers - an organization responsible for many standards and promotion of the Engineering profession. IRDVC: Inverse Ray Direction Vector Components - if the ray direction components are ri, rj, rk, the IRDVCs are 1 , 1, 1. ri rj rk LAN: Local Area Network - connects multiple computers via wired or wireless connections. NaN: Not a Number - A special floating-point value that represents an otherwise unrepresentable value. PCI: Peripheral Component Interconnect - The industry-standard expansion bus used in virtually all microcomputers. RISC: Reduced Instruction Set Computer - a microprocessor design philosophy that minimizes the complexity of the instruction set to improve performance. SDRAM: Synchronous Dynamic Random Access Memory - a faster type of DRAM, and the most common type of RAM used in microcomputers. SIMD: Single Instruction Multiple Data - CPU instructions that perform the same operation on multiple pieces of data simultaneously. VHDL: VHSIC Hardware Description Language - a language used to specify the design of integrated circuits. VLSI: Very Large Scale Integration - describes an integrated circuit with between 10,000 and 99,999 gates, or more. Appendix B BVH Performance vs. Maximum Triangles per Node This appendix examines the effect of varying the maximum number of triangles per BVH node on the rendering time, the BVH build time (excluding disk I/O and file parsing), and the total number of BVH nodes for each of the test scenes in Section 4.2. The tables reveal that the minimal rendering time is obtained for a maximum triangles per node value of approximately 6. Lower values do not reduce the rendering times and greatly increase the number of BVH nodes. Higher values gradually increase rendering times and reduce the BVH build time and the number of BVH nodes. 151 152 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) 31.5 21.3 21.0 21.4 21.0 21.4 21.8 21.9 21.9 22.6 22.8 23.3 23.5 24.1 24.2 24.4 26.2 27.4 31.2 46.0 79.3 138 268 Build Time (s) 0.04 0.04 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.01 BVH Nodes 49711 29667 22303 16493 13703 11339 9989 8655 7867 7069 6443 5761 5375 5047 4807 4537 3645 3005 2283 1175 591 291 147 Table B.1: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Spheres test scene (24,578 triangles.) 153 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) 79.0 23.0 24.0 24.4 23.8 22.4 22.9 22.9 23.0 23.1 23.2 23.7 23.6 23.8 23.9 24.1 24.8 25.4 26.8 33.3 59.4 156 217 Build Time (s) 0.14 0.12 0.11 0.10 0.10 0.10 0.09 0.09 0.09 0.09 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.08 0.07 0.07 0.06 0.06 0.05 BVH Nodes 156763 88391 63757 47725 39459 33377 29077 25043 22471 20457 18773 17009 15843 14875 13951 12961 10395 8711 6599 3347 1681 839 439 Table B.2: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Bunny test scene (69,463 triangles.) 154 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) 42.2 19.5 19.2 19.1 19.0 18.9 19.0 19.0 19.1 19.1 19.1 19.3 19.3 19.4 19.9 19.5 19.9 20.2 21.0 24.9 34.1 67.0 147 Build Time (s) 0.83 0.49 0.45 0.42 0.41 0.39 0.38 0.37 0.36 0.35 0.35 0.34 0.34 0.34 0.33 0.33 0.32 0.31 0.29 0.26 0.23 0.20 0.18 BVH Nodes 1587595 399785 290305 225385 183973 155257 135045 118623 107393 98721 88885 82867 75099 71321 67185 63743 53181 42959 33809 18035 8961 3851 1973 Table B.3: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Branch test scene (312,706 triangles.) 155 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) 29.0 12.2 12.1 12.1 12.2 12.1 12.3 12.6 12.4 12.5 12.6 12.7 12.8 12.9 12.9 13.0 13.4 13.8 14.6 17.9 24.5 38.1 70.6 Build Time (s) 0.85 0.57 0.53 0.50 0.48 0.46 0.45 0.43 0.42 0.41 0.41 0.40 0.40 0.39 0.39 0.39 0.37 0.36 0.35 0.31 0.28 0.25 0.23 BVH Nodes 1481591 488959 363783 267833 225005 187551 165025 141921 128675 116479 107327 97351 90841 84825 79829 74379 60095 50387 38297 19349 9913 5137 2545 Table B.4: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Poppy test scene (395,460 triangles.) 156 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) Build Time (s) BVH Nodes Out of memory Out of memory Out of memory 35.5 26.9 15577571 34.7 25.5 12241023 34.4 24.1 8930457 34.2 23.5 7652121 33.9 22.8 6374789 34.4 22.5 5728823 34.2 22.0 4839263 34.3 21.7 4459689 34.5 21.4 3924275 34.4 21.2 3683141 34.8 20.9 3310999 35.1 20.8 3128979 35.4 20.6 2883919 35.8 20.5 2740991 36.0 20.3 2510451 37.2 19.8 2035785 38.7 19.5 1699325 41.8 19.0 1282305 57.8 17.9 656809 92.5 17.0 337091 171 16.1 170679 343 15.4 87669 Table B.5: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Powerplant-Front test scene (12,748,512 triangles.) 157 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) Build Time (s) BVH Nodes Out of memory Out of memory Out of memory 155 26.9 15577571 155 25.5 12241023 155 24.1 8930457 155 23.5 7652121 153 22.8 6374789 155 22.5 5728823 155 22.0 4839263 156 21.7 4459689 158 21.4 3924275 158 21.2 3683141 160 20.9 3310999 160 20.8 3128979 162 20.6 2883919 163 20.5 2740991 169 20.3 2510451 175 19.8 2035785 189 19.5 1699325 203 19.0 1282305 284 17.9 656809 507 17.0 337091 977 16.1 170679 2496 15.4 87669 Table B.6: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Powerplant-North test scene (12,748,512 triangles.) 158 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) Build Time (s) BVH Nodes Out of memory Out of memory Out of memory 109 26.9 15577571 109 25.5 12241023 109 24.1 8930457 109 23.5 7652121 110 22.8 6374789 111 22.5 5728823 112 22.0 4839263 112 21.7 4459689 115 21.4 3924275 115 21.2 3683141 116 20.9 3310999 115 20.8 3128979 116 20.6 2883919 117 20.5 2740991 118 20.3 2510451 123 19.8 2035785 129 19.5 1699325 138 19.0 1282305 161 17.9 656809 250 17.0 337091 398 16.1 170679 935 15.4 87669 Table B.7: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Powerplant-Boiler test scene (12,748,512 triangles.) 159 Max Tris 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 20 24 32 64 128 256 512 Rendering Time (s) Build Time (s) BVH Nodes Out of memory Out of memory Out of memory 67.5 76.3 35340445 48.7 72.4 25770479 49.0 69.8 19736663 49.0 68.1 16156559 49.6 66.7 13548145 51.1 65.7 11702781 51.3 64.8 10239551 51.4 64.2 9146131 51.8 63.5 8239011 52.2 63.0 7523509 53.1 62.6 6898983 53.7 62.2 6386311 54.5 61.8 5933093 57.0 61.5 5550657 57.1 61.2 5205801 59.5 60.2 4178239 65.7 59.4 3494515 75.3 58.2 2634061 109 55.8 1329565 175 53.7 670257 228 51.8 336975 302 50.0 169527 Table B.8: Rendering time, BVH build time, and number of BVH nodes versus maximum trianges per node for the Lucy test scene (28,057,792 triangles.) Appendix C BVH Construction C++ Code /* Sample code for BVH construction. The BVH construction code is inspired by the technique presented in "Realistic Ray Tracing", 2nd ed. by Peter Shirley and R. Keith Morley. My C++ implementation has been completely restructured, thus my C++ code does not resemble the code presented in the book. This code is released into the public domain, but please give me (J. Mahovsky) credit somewhere if you use it. There is no warranty of any kind on this code. */ struct Vertex { float x, y, z; // position float ni, nj, nk; // normal // a texture coordinate might go here as well }; struct Tri { Vertex *v0, *v1, *v2; Material *material; }; Tri *tris; // array of all the triangles in the scene, gets // shuffled as the BVH is constructed, this avoids the // need to store lists of pointers to triangles in the // BVH nodes, saving considerable memory and memory // management int numTris; // total number of triangles in the scene struct BBoxNode { float xmin, ymin, zmin; float xmax, ymax, zmax; 160 161 // < 0 if branch node, -1 = x splitting plane, -2 = y, -3 = z int numTris; union { // pointer to first triangle in node (in global *tris, see above) Tri *tris; // pointer to first child, second child is stored after the first BBoxNode *children; }; }; // 32 bytes (on a 32-bit machine) BBoxNode *rootNode; int boxMaxTris = 6; // maximum number of triangles per child node float epsilon = 5e-7; // box padding float minSplitSize = 1e-6; // 2 * epsilon // --- BVH construction -------------------------------------void swapTris(Tri *t0, Tri *t1) { // shuffles triangles around during BVH construction Vertex *v0 = t0->v0; Vertex *v1 = t0->v1; Vertex *v2 = t0->v2; Material *material = t0->material; t0->v0 = t1->v0; t0->v1 = t1->v1; t0->v2 = t1->v2; t0->material = t1->material; t1->v0 = v0; t1->v1 = v1; t1->v2 = v2; t1->material = material; } void Build() { // build the BVH starting with the root node rootNode = new (std::nothrow) BBoxNode; if(rootNode == NULL) { MessageBox(NULL, "Build()", "Memory Allocation Error", MB_OK | MB_ICONERROR); exit(1); 162 } BuildTree(rootNode, 1, 0, numTris - 1, 0); } void BuildTree(BBoxNode *node, int depth, int firstTri, int lastTri, int axis) { // this function would be much shorter if I had used 3-element // arrays instead of x, y, z and v0, v1, v2 // determine the bounding box for the tris in the node Tri *tri = &tris[firstTri]; float float float float float float xmin ymin zmin xmax ymax zmax = = = = = = tri->v0->x; tri->v0->y; tri->v0->z; tri->v0->x; tri->v0->y; tri->v0->z; for(int i = firstTri; i <= lastTri; i++) { tri = &tris[i]; if(xmin if(ymin if(zmin if(xmax if(ymax if(zmax > > > < < < tri->v0->x) tri->v0->y) tri->v0->z) tri->v0->x) tri->v0->y) tri->v0->z) xmin ymin zmin xmax ymax zmax = = = = = = tri->v0->x; tri->v0->y; tri->v0->z; tri->v0->x; tri->v0->y; tri->v0->z; if(xmin if(ymin if(zmin if(xmax if(ymax if(zmax > > > < < < tri->v1->x) tri->v1->y) tri->v1->z) tri->v1->x) tri->v1->y) tri->v1->z) xmin ymin zmin xmax ymax zmax = = = = = = tri->v1->x; tri->v1->y; tri->v1->z; tri->v1->x; tri->v1->y; tri->v1->z; if(xmin if(ymin if(zmin if(xmax if(ymax > > > < < tri->v2->x) tri->v2->y) tri->v2->z) tri->v2->x) tri->v2->y) xmin ymin zmin xmax ymax = = = = = tri->v2->x; tri->v2->y; tri->v2->z; tri->v2->x; tri->v2->y; 163 if(zmax < tri->v2->z) zmax = tri->v2->z; } if(((xmax - xmin < minSplitSize) && (ymax - ymin < minSplitSize) && (zmax - zmin < minSplitSize)) || ((lastTri - firstTri) < boxMaxTris) || (depth == maxDepth)) { // if the node is too small along all 3 axes, or the number of // triangles is less than boxMaxTris, or the BVH depth is too // large, make this node a child node node->numTris = lastTri - firstTri + 1; node->tris = &tris[firstTri]; } else { int midpoint; // index of the first triangle in the second child for(int i = 0; i < 3; i++) // try up to 3 axes { if(axis == 0) // x-axis { if(xmax - xmin >= minSplitSize) { // if the box is large enough along the x-axis, partition // the triangles with respect to the box midpoint float pivot = xmin + (xmax - xmin) * 0.5f; midpoint = firstTri; for(int i = firstTri; i <= lastTri; i++) { // determine min/max x-coordinate for the triangle tri = &tris[i]; float txmin = tri->v0->x; if(tri->v1->x < txmin) txmin = tri->v1->x; if(tri->v2->x < txmin) txmin = tri->v2->x; float txmax = tri->v0->x; if(tri->v1->x > txmax) txmax = tri->v1->x; if(tri->v2->x > txmax) txmax = tri->v2->x; // if the center point of the triangle is on the negative // side of the pivot coordinate, put it in the first // child, else do nothing, which puts it in the second // child 164 float centroid = (txmin + txmax) * 0.5f; if(centroid < pivot) swapTris(tri, &tris[midpoint++]); } // if neither child is empty, break from the for() loop // because a satisfactory split has been achieved if(midpoint != firstTri && midpoint != lastTri + 1) break; } axis = 1; // else try the y-axis } else if(axis == 1) // y-axis { if(ymax - ymin >= minSplitSize) { float pivot = ymin + (ymax - ymin) * 0.5f; midpoint = firstTri; for(int i = firstTri; i <= lastTri; i++) { tri = &tris[i]; float tymin = tri->v0->y; if(tri->v1->y < tymin) tymin = tri->v1->y; if(tri->v2->y < tymin) tymin = tri->v2->y; float tymax = tri->v0->y; if(tri->v1->y > tymax) tymax = tri->v1->y; if(tri->v2->y > tymax) tymax = tri->v2->y; float centroid = (tymin + tymax) * 0.5f; if(centroid < pivot) swapTris(tri, &tris[midpoint++]); } if(midpoint != firstTri && midpoint != lastTri + 1) break; } axis = 2; } else // z-axis { if(zmax - zmin >= minSplitSize) { float pivot = zmin + (zmax - zmin) * 0.5f; midpoint = firstTri; 165 for(int i = firstTri; i <= lastTri; i++) { tri = &tris[i]; float tzmin = tri->v0->z; if(tri->v1->z < tzmin) tzmin = tri->v1->z; if(tri->v2->z < tzmin) tzmin = tri->v2->z; float tzmax = tri->v0->z; if(tri->v1->z > tzmax) tzmax = tri->v1->z; if(tri->v2->z > tzmax) tzmax = tri->v2->z; float centroid = (tzmin + tzmax) * 0.5f; if(centroid < pivot) swapTris(tri, &tris[midpoint++]); } if(midpoint != firstTri && midpoint != lastTri + 1) break; } axis = 0; } } // partition arbitrarily if still an empty child after trying all // 3 axes: just split the set of triangles in half regardless of // their actual positions if(midpoint == firstTri || midpoint == lastTri + 1) midpoint = (firstTri + lastTri) / 2; // denotes a branch node and the split axis node->numTris = -1 - axis; // allocate both children at once, this cuts the memory management // in half and makes the two children contiguous in memory node->children = new (std::nothrow) BBoxNode[2]; if(node->children == NULL) { MessageBox(NULL, "Memory Allocation Error", "BuildTree()", MB_OK | MB_ICONERROR); exit(1); } BuildTree(&node->children[0], depth + 1, firstTri, midpoint - 1, (axis + 1) % 3); BuildTree(&node->children[1], depth + 1, midpoint, lastTri, 166 (axis + 1) % 3); } // pad the box by adding/subtracting epsilon, which expands it // slightly to ensure robustness node->xmin = xmin - epsilon; node->ymin = ymin - epsilon; node->zmin = zmin - epsilon; node->xmax = xmax + epsilon; node->ymax = ymax + epsilon; node->zmax = zmax + epsilon; } Appendix D BVH-Plu-DSA Traversal C++ Code /* Sample code for ’BVH-Plu-DSA’ ray-BVH traversal. The and and for BVH traversal is split into two types: ClosestHit (for primary secondary rays) and ShadowTest (for shadow rays). ClosestHit() ShadowTest() call the appropriate recursive traversal function the given ray direction. This code is released into the public domain, but please give me (J. Mahovsky) credit somewhere if you use it. There is no warranty of any kind on this code. */ struct Ray { float Ox, Oy, Oz; // origin float Di, Dj, Dk; // direction components float rlen; // ray length, -1 if infinite length float Ex, Ey, Ez; // ray endpoint if not infinitely long Tri *triHit; // triangle hit by the ray, NULL if no triangle hit // rlen and Ex, Ey, Ez are modified to contain the hit distance and // point of the currently closest hit triangle, permitting // automatic early termination of the traversal }; // having eight specialized cases makes for a large amount of code // perhaps templates can be used to make the code more concise void ClosestHit(Ray *ray) { // technically, we should check the sign bit here because < 0 // does not properly handle -0 // doing it this way seems to be OK though, the BVH-Plu-DSA // algorithm does not fail if -0 is improperly handled // ... but it may traverse the children in a different order // for rays with -0 components instead of +0, but these rays // are extremely rare 167 168 if(ray->Di < 0) { if(ray->Dj < 0) { if(ray->Dk < 0) Traverse_ClosestHitMMM(rootNode, else Traverse_ClosestHitMMP(rootNode, } else { if(ray->Dk < 0) Traverse_ClosestHitMPM(rootNode, else Traverse_ClosestHitMPP(rootNode, } } else { if(ray->Dj < 0) { if(ray->Dk < 0) Traverse_ClosestHitPMM(rootNode, else Traverse_ClosestHitPMP(rootNode, } else { if(ray->Dk < 0) Traverse_ClosestHitPPM(rootNode, else Traverse_ClosestHitPPP(rootNode, } } ray); ray); ray); ray); ray); ray); ray); ray); } void Traverse_ClosestHitMMM(BBoxNode *node, Ray *ray) { if((ray->Ox < node->xmin) || (ray->Oy < node->ymin) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey > node->ymax) || (ray->Ez > node->zmax)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; if((ray->Di * ya - ray->Dj * xb < 0) || 169 (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * xa xa za za ya - ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * yb zb xb yb zb < < < < < 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { // branch node Traverse_ClosestHitMMM(&node->children[1], ray); Traverse_ClosestHitMMM(&node->children[0], ray); } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitMMP(BBoxNode *node, Ray *ray) { if ((ray->Ox < node->xmin) || (ray->Oy < node->ymin) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey > node->ymax) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * ya xa xb za za yb - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xb yb zb xa ya zb < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -3) { // DSA Traverse_ClosestHitMMP(&node->children[0], ray); Traverse_ClosestHitMMP(&node->children[1], ray); } else { Traverse_ClosestHitMMP(&node->children[1], ray); 170 Traverse_ClosestHitMMP(&node->children[0], ray); } } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitMPM(BBoxNode *node, Ray *ray) { if((ray->Ox < node->xmin) || (ray->Oy > node->ymax) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey < node->ymin) || (ray->Ez > node->zmax)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * ya xb xa za zb ya - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xa yb zb xb yb za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -2) { Traverse_ClosestHitMPM(&node->children[0], ray); Traverse_ClosestHitMPM(&node->children[1], ray); } else { Traverse_ClosestHitMPM(&node->children[1], ray); Traverse_ClosestHitMPM(&node->children[0], ray); } } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitMPP(BBoxNode *node, Ray *ray) 171 { if((ray->Ox < node->xmin) || (ray->Oy > node->ymax) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey < node->ymin) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * ya xb xb za zb yb - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xa yb zb xa ya za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -1) { Traverse_ClosestHitMPP(&node->children[1], ray); Traverse_ClosestHitMPP(&node->children[0], ray); } else { Traverse_ClosestHitMPP(&node->children[0], ray); Traverse_ClosestHitMPP(&node->children[1], ray); } } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitPMM(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy < node->ymin) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey > node->ymax) || (ray->Ez > node->zmax)))) return; float xa = node->xmin - ray->Ox; 172 float float float float float ya za xb yb zb = = = = = node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * yb xa xa zb za ya - - ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xb ya za xb yb zb < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -1) { Traverse_ClosestHitPMM(&node->children[0], ray); Traverse_ClosestHitPMM(&node->children[1], ray); } else { Traverse_ClosestHitPMM(&node->children[1], ray); Traverse_ClosestHitPMM(&node->children[0], ray); } } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitPMP(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy < node->ymin) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey > node->ymax) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; if((ray->Di * yb - ray->Dj * xb < 0) || (ray->Dj * xa - ray->Di * ya < 0) || 173 (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * xb zb za yb - ray->Di ray->Dk ray->Dk ray->Dj * * * * za xa ya zb < < < < 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -2) { Traverse_ClosestHitPMP(&node->children[1], ray); Traverse_ClosestHitPMP(&node->children[0], ray); } else { Traverse_ClosestHitPMP(&node->children[0], ray); Traverse_ClosestHitPMP(&node->children[1], ray); } } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitPPM(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy > node->ymax) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey < node->ymin) || (ray->Ez > node->zmax)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * yb xb xa zb zb ya - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * if(node->numTris < 0) { if(node->numTris == -3) { xa ya za xb yb za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) 174 Traverse_ClosestHitPPM(&node->children[1], ray); Traverse_ClosestHitPPM(&node->children[0], ray); } else { Traverse_ClosestHitPPM(&node->children[0], ray); Traverse_ClosestHitPPM(&node->children[1], ray); } } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void Traverse_ClosestHitPPP(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy > node->ymax) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey < node->ymin) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * yb xb xb zb zb yb - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xa ya za xa ya za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { Traverse_ClosestHitPPP(&node->children[0], ray); Traverse_ClosestHitPPP(&node->children[1], ray); } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } void ShadowTest(Ray *ray) { 175 if(ray->Di < 0) { if(ray->Dj < 0) { if(ray->Dk < 0) Traverse_ShadowTestMMM(rootNode, else Traverse_ShadowTestMMP(rootNode, } else { if(ray->Dk < 0) Traverse_ShadowTestMPM(rootNode, else Traverse_ShadowTestMPP(rootNode, } } else { if(ray->Dj < 0) { if(ray->Dk < 0) Traverse_ShadowTestPMM(rootNode, else Traverse_ShadowTestPMP(rootNode, } else { if(ray->Dk < 0) Traverse_ShadowTestPPM(rootNode, else Traverse_ShadowTestPPP(rootNode, } } ray); ray); ray); ray); ray); ray); ray); ray); } void Traverse_ShadowTestMMM(BBoxNode *node, Ray *ray) { if((ray->Ox < node->xmin) || (ray->Oy < node->ymin) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey > node->ymax) || (ray->Ez > node->zmax)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; if((ray->Di * ya - ray->Dj * xb < 0) || (ray->Dj * xa - ray->Di * yb < 0) || 176 (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * xa za za ya - ray->Di ray->Dk ray->Dk ray->Dj * * * * zb xb yb zb < < < < 0) || 0) || 0) || 0)) if(node->numTris < 0) { Traverse_ShadowTestMMM(&node->children[1], ray); // can terminate the traversal early if something is hit, // because for shadow tests we don’t care which object // is hit if(ray->triHit == NULL) Traverse_ShadowTestMMM(&node->children[0], ray); // else traversal is finished } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } void Traverse_ShadowTestMMP(BBoxNode *node, Ray *ray) { if((ray->Ox < node->xmin) || (ray->Oy < node->ymin) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey > node->ymax) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * ya xa xb za za yb - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * if(node->numTris < 0) { if(node->numTris == -3) { xb yb zb xa ya zb < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) 177 Traverse_ShadowTestMMP(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestMMP(&node->children[1], ray); } else { Traverse_ShadowTestMMP(&node->children[1], ray); if(ray->triHit == NULL) Traverse_ShadowTestMMP(&node->children[0], ray); } } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } void Traverse_ShadowTestMPM(BBoxNode *node, Ray *ray) { if((ray->Ox < node->xmin) || (ray->Oy > node->ymax) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey < node->ymin) || (ray->Ez > node->zmax)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * ya xb xa za zb ya - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xa yb zb xb yb za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -2) { Traverse_ShadowTestMPM(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestMPM(&node->children[1], ray); } else { Traverse_ShadowTestMPM(&node->children[1], ray); if(ray->triHit == NULL) 178 Traverse_ShadowTestMPM(&node->children[0], ray); } } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } void Traverse_ShadowTestMPP(BBoxNode *node, Ray *ray) { if((ray->Ox < node->xmin) || (ray->Oy > node->ymax) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex > node->xmax) || (ray->Ey < node->ymin) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * ya xb xb za zb yb - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xa yb zb xa ya za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -1) { Traverse_ShadowTestMPP(&node->children[1], ray); if(ray->triHit == NULL) Traverse_ShadowTestMPP(&node->children[0], ray); } else { Traverse_ShadowTestMPP(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestMPP(&node->children[1], ray); } } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } 179 void Traverse_ShadowTestPMM(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy < node->ymin) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey > node->ymax) || (ray->Ez > node->zmax)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * yb xa xa zb za ya - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xb ya za xb yb zb < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -1) { Traverse_ShadowTestPMM(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestPMM(&node->children[1], ray); } else { Traverse_ShadowTestPMM(&node->children[1], ray); if(ray->triHit == NULL) Traverse_ShadowTestPMM(&node->children[0], ray); } } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } void Traverse_ShadowTestPMP(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy < node->ymin) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || 180 (ray->Ey > node->ymax) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * yb xa xb zb za yb - - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xb ya za xa ya zb < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -2) { Traverse_ShadowTestPMP(&node->children[1], ray); if(ray->triHit == NULL) Traverse_ShadowTestPMP(&node->children[0], ray); } else { Traverse_ShadowTestPMP(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestPMP(&node->children[1], ray); } } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } void Traverse_ShadowTestPPM(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy > node->ymax) || (ray->Oz < node->zmin) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey < node->ymin) || (ray->Ez > node->zmax)))) return; float xa = node->xmin - ray->Ox; float ya = node->ymin - ray->Oy; float za = node->zmin - ray->Oz; 181 float xb = node->xmax - ray->Ox; float yb = node->ymax - ray->Oy; float zb = node->zmax - ray->Oz; if((ray->Di (ray->Dj (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * * * yb xb xa zb zb ya - ray->Dj ray->Di ray->Di ray->Dk ray->Dk ray->Dj * * * * * * xa ya za xb yb za < < < < < < 0) || 0) || 0) || 0) || 0) || 0)) if(node->numTris < 0) { if(node->numTris == -3) { Traverse_ShadowTestPPM(&node->children[1], ray); if(ray->triHit == NULL) Traverse_ShadowTestPPM(&node->children[0], ray); } else { Traverse_ShadowTestPPM(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestPPM(&node->children[1], ray); } } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } void Traverse_ShadowTestPPP(BBoxNode *node, Ray *ray) { if((ray->Ox > node->xmax) || (ray->Oy > node->ymax) || (ray->Oz > node->zmax) || ((ray->rlen != -1) && ((ray->Ex < node->xmin) || (ray->Ey < node->ymin) || (ray->Ez < node->zmin)))) return; float float float float float float xa ya za xb yb zb = = = = = = node->xmin node->ymin node->zmin node->xmax node->ymax node->zmax - ray->Ox; ray->Oy; ray->Oz; ray->Ox; ray->Oy; ray->Oz; if((ray->Di * yb - ray->Dj * xa < 0) || (ray->Dj * xb - ray->Di * ya < 0) || 182 (ray->Dk (ray->Di (ray->Dj (ray->Dk return; * * * * xb zb zb yb - ray->Di ray->Dk ray->Dk ray->Dj * * * * za xa ya za < < < < 0) || 0) || 0) || 0)) if(node->numTris < 0) { Traverse_ShadowTestPPP(&node->children[0], ray); if(ray->triHit == NULL) Traverse_ShadowTestPPP(&node->children[1], ray); } else { TriIntersect_ShadowTest(node->numTris, node->tris, ray); } } Appendix E BVH-Slabs-DSA Traversal C++ Code /* Sample code for ’BVH-Slabs-DSA’ ray-BVH traversal. The and and for BVH traversal is split into two types: ClosestHit (for primary secondary rays) and ShadowTest (for shadow rays). ClosestHit() ShadowTest() call the appropriate recursive traversal function the given ray direction. For brevity, only the ClosestHit() ’MMM’ cases are shown here. The DSA algorithm works the same as for the BVH-Plu-DSA code. This code is released into the public domain, but please give me (J. Mahovsky) credit somewhere if you use it. There is no warranty of any kind on this code. */ // having eight specialized cases makes for a large amount of code // perhaps templates can be used to make the code more concise float ii, ij, ik; void { ii ij ik // inverse ray direction vector components // (IRDVCs) // these should be put within the ray structure ClosestHit(Ray *ray) = 1.0f / ray->Di; = 1.0f / ray->Dj; = 1.0f / ray->Dk; // need to check the sign bit to determine the case because // doing something like if(ray->Di < 0)... does not properly // handle -0, which causes the algorithm to fail for -0 if(signbit(ray->Di)) { if(signbit(ray->Dj)) { if(signbit(ray->Dk)) Traverse_ClosestHitMMM(rootNode, ray); 183 184 else Traverse_ClosestHitMMP(rootNode, ray); } else { if(signbit(ray->Dk)) Traverse_ClosestHitMPM(rootNode, ray); else Traverse_ClosestHitMPP(rootNode, ray); } } else { if(signbit(ray->Dj)) { if(signbit(ray->Dk)) Traverse_ClosestHitPMM(rootNode, ray); else Traverse_ClosestHitPMP(rootNode, ray); } else { if(signbit(ray->Dk)) Traverse_ClosestHitPPM(rootNode, ray); else Traverse_ClosestHitPPP(rootNode, ray); } } } void Traverse_ClosestHitMMM(BBoxNode *node, Ray *ray) { // [tnear, tfar] are initialized to the valid range // within the length of the ray: // [0, ray->rlen] for finite-length and [0, 4] for infinite // The value 4 is larger than sqrt(2*2 + 2*2 + 2*2) // which is the scene size assuming the scene is normalized // to the [-1, -1, -1] to [1, 1, 1] cube. float tnear = 0; float tfar; if(ray->rlen == -1) tfar = 4; else tfar = ray->rlen; // ’infinite’ ray // Multiply by the IRDVC instead of dividing // The order which xmax and xmin appear here depends on the ray // direction sign. The i direction sign bit is negative, // so the order is xmax, xmin. The order is xmin, xmax for // a positive i direction sign bit. 185 { float t1 = (node->xmax - ray->Ox) * ii; // ’near’ intersection float t2 = (node->xmin - ray->Ox) * ii; // ’far’ intersection if(t1 > tnear) // t1 == NaN has no effect, tnear = t1; // t1 == +/- Inf works properly if(t2 < tfar) // t2 == NaN has no effect, tfar = t2; // t2 == +/- Inf works properly if(tnear > tfar || tfar < 0.0) // miss return; } { float t1 = (node->ymax - ray->Oy) * ij; float t2 = (node->ymin - ray->Oy) * ij; if(t1 > tnear) tnear = t1; if(t2 < tfar) tfar = t2; if(tnear > tfar || tfar < 0.0) return; } { float t1 = (node->zmax - ray->Oz) * ik; float t2 = (node->zmin - ray->Oz) * ik; if(t1 > tnear) tnear = t1; if(t2 < tfar) tfar = t2; if(tnear > tfar || tfar < 0.0) return; } // DSA if(node->numTris < 0) { Traverse_ClosestHitMMM(&node->children[1], ray); Traverse_ClosestHitMMM(&node->children[0], ray); } else { TriIntersect_ClosestHit(node->numTris, node->tris, ray); } } Bibliography [ABT+ 02] M Anido, N. Bagherzadeh, N. Tabrizi, H. Du, and M. Sanchez-Elez. Interactive ray tracing using a SIMD reconfigurable architecture. In 14th Symposium on Computer Architecture and High Performance Computing (SCAB-PAD’02), pages 20–28, 2002. [AC97] John Amanatides and Kia Choi. Ray tracing triangular meshes. Proceedings of the Eighth Western Computer Graphics Symposium, pages 43–52, 1997. [AK87] James Arvo and David Kirk. Fast ray tracing by ray classification. In Maureen C. Stone, editor, Computer Graphics (SIGGRAPH ’87 Proceedings), volume 21, pages 55–64, July 1987. [AK89] James Arvo and David Kirk. A survey of ray tracing acceleration techniques. In Andrew Glassner, editor, An Introduction to Ray Tracing, pages 201–262. 1989. [AML+ 89] A. Atamenia, M. Meriaux, E. Lepretre, S. Degrande, and B. Vidal. A cellular architecture for ray tracing. In D. Grimsdale and A. Kaufman, editors, Fifth Eurographics Workshop on Graphics Hardware, 1989. [App] Apple Computer, Inc. Apple cinema displays. http://www.apple.com/displays/specs.html. [App68] Arthur Appel. Some techniques for shading machine renderings of solids. In AFIPS 1968 Spring Joint Computer Conf., volume 32, pages 37–45, 1968. [App05] Apple Computer, Inc. Velocity engine, 2005. http://developer.apple.com/hardware/ve/. 186 187 [BD89] Jichun Bu and Ed F. Deprettere. A VLSI system architecture for high-speed radiative transfer 3D image synthesis. The Visual Computer, 5(3):121–133, June 1989. [BR94] Jules Bloomenthal and Jon Rokne. Homogeneous coordinates. The Visual Computer, 11:15–26, 1994. [BSC89] K. Bouatouch, Y. Saouter, and J. C. Candela. A VLSI chip for ray tracing bicubic patches. In W. Hansmann, F. R. A. Hopgood, and W. Strasser, editors, Eurographics ’89, pages 107–124. North-Holland, September 1989. [CLR80] E. Cohen, T. Lyche, and R. Riesenfeld. Discrete b-splines and subdivision techniques in computer-aided geometric design and computer graphics. Computer Graphics and Image Processing, 14:87–111, 1980. [CS90] Mark J. Charney and Isaac D. Scherson. Efficient traversal of well-behaved hierarchical trees of extents for ray-tracing complex scenes. The Visual Computer, 6(3):167–178, June 1990. [CW88] John G. Cleary and Geoff Wyvill. Analysis of an algorithm for fast ray tracing using uniform space subdivision. The Visual Computer, 4(2):65–83, July 1988. [CWBV86] J. Cleary, B. Wyvill, G. Birtwistle, and R. Vatti. Multiprocessor ray tracing. Computer Graphics Forum, 5(1):77–80, 1986. [CWVB83] John G. Cleary, Brian Wyvill, Reddy Vatti, and Graham M. Birtwistle. Design and analysis of a parallel ray tracing computer. In W. A. Davis, editor, Proceedings of Graphics Interface ’83, pages 33–38, May 1983. [DTB+ 01] H. Du, N. Tabrizi, N. Bagherzadeh, M. Sanchez-Elez, M. Fernandez, and M. Anido. Interactive ray tracing on reconfigurable SIMD MorphoSys. In DATE 2003, pages 144–149, 2001. 188 [Fen02] Joshua Fender. The design and implementation of a hardware accelerated ray tracer using the TM3a FPGA prototyping system. Bachelor’s thesis, University of Toronto, 2002. [Fle05] Tom Fletcher. Intel Corporation, personal communication, 2005. [FR03] Joshua Fender and Jonathan Rose. A high-speed ray tracing engine built on a field-programmable system. In IEEE International Conference On FieldProgrammable Technology, pages 188–195, December 2003. [FS88] Donald Fussell and K. R. Subramanian. Fast ray tracing using K-D trees. Technical Report TR-88-07, Dept. of Computer Sciences, Univ. of Texas at Austin, March 1988. [FTI86] Akira Fujimoto, Takayuki Tanaka, and Kansei Iwata. ARTS: Accelerated ray tracing system. IEEE Computer Graphics and Applications, 6(4):16–26, 1986. [FvDFH95] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes. Computer Graphics: Principles and Practice in C. Addison-Wesley Professional, second edition, 1995. [Gla84] Andrew S. Glassner. Space subdivision for fast ray tracing. IEEE Computer Graphics and Applications, 4(10):15–22, October 1984. [Gla88] Andrew S. Glassner. Spacetime ray tracing for animation. IEEE Computer Graphics and Applications, 8(2):60–70, 1988. [Gla89] Andrew S. Glassner, editor. An Introduction to Ray Tracing. Academic Press, 1989. [GLGT99] Arthur Gregory, Ming C. Lin, Stefan Gottschalk, and Russell Taylor. A framework for fast and accurate collision detection for haptic interaction. In Pro- 189 ceedings of Virtual Reality Conference 1999, pages 38–45, 1999. [Gol91] David Goldberg. What every computer scientist should know about floatingpoint arithmetic. ACM Computing Surveys, 23(1):5–48, 1991. [GS87] Jeffrey Goldsmith and John Salmon. Automatic creation of object hierarchies for ray tracing. IEEE Computer Graphics and Applications, 7(5):14–20, May 1987. [GTGB84] Cindy M. Goral, Kenneth E. Torrance, Donald P. Greenberg, and Bennett Battaile. Modeling the interaction of light between diffuse surfaces. In SIGGRAPH ’84: Proceedings of the 11th annual conference on Computer graphics and interactive techniques, pages 213–222, New York, NY, USA, 1984. ACM Press. [HA96] Greg Humphreys and C. Scott Ananian. TigerSHARK: A hardware accelerated ray-tracing engine. Senior independent work, Princeton University, May 14 1996. [Hai91] Eric Haines. Efficiency improvements for hierarchy traversal in ray tracing. In James Arvo, editor, Graphics Gems II, pages 267–273. Academic Press, San Diego, 1991. [Hal99] Daniel Hall. The AR250: A new architecture for ray traced rendering. In SIGGRAPH/Eurographics 1999 Hardware Workshop in Computer Graphics, 1999. [Hal01] Daniel Hall. The AR350: Today’s ray trace rendering processor. In In Proceedings of the Eurographics/SIGGRAPH workshop on Graphics hardware Hot 3D Session 1, 2001. 190 [Hav01] Vlastimil Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Czech Technical University, Praha, Czech Republic, April 2001. [Int] Intel Corporation. Intel VTune performance analyzers. http://www.intel.com/software/products/vtune/. [Int02] Intel Corporation. IA-32 Intel architecture software developer’s manual, instruction set reference (volume 2). 2002. [Jen01] Henrik Wann Jensen. Realistic Image Synthesis Using Photon Mapping. A. K. Peters, Natick, MA, 2001. [JW89] David Jevans and Brian Wyvill. Adaptive Voxel Subdivision for Ray Tracing. Proceedings of Graphics Interface 1989, pages 164–172, 1989. [Kaj86] J. T. Kajiya. The rendering equation. In David C. Evans and Rusell J. Athay, editors, Computer Graphics (SIGGRAPH ’86 Proceedings), volume 20, pages 143–150, August 1986. [KC93] Hao-Ren Ke and Ruei-Chuan Chang. An efficient hierarchical traversal algorithm for ray tracing. The Visual Computer, 10(2):79–87, 1993. [KG79] Douglas Scott Kay and Donald Greenberg. Transparency for computer synthesized images. In SIGGRAPH ’79: Proceedings of the 6th annual conference on Computer graphics and interactive techniques, pages 158–164, New York, NY, USA, 1979. ACM Press. [KK86] Timothy L. Kay and James T. Kajiya. Ray tracing complex scenes. Siggraph 86, 20:269–278, 1986. [Kor02] Israel Koren. Computer Arithmetic Algorithms. A. K. Peters, second edition, 2002. 191 [KSS+ 01] H. Kobayashi, K. Suzuki, K. Sano, Y. Kaeriyama, Y. Saida, N. Oba, and T. Nakamura. 3DCGiRAM: an intelligent memory architecture for photorealistic image synthesis. In IEEE International Conference on Computer Design: VLSI in Computers & Processors (ICCD ’01), pages 462–467, Washington - Brussels - Tokyo, September 2001. IEEE. [Lat03] Olin Lathrop. U.S. patent 6,597,359 – Hierarchical space subdivision hardware for ray tracing, 2003. [Ler87] Pascal Leray. Towards a z-buffer and ray-tracing multimode system based on parallel architecture and VLSI chips. In Wolfgang Strasser, editor, Advances in Computer Graphics Hardware I, first Eurographics workshop on Graphics Hardware, pages 141–145, 1987. [LSL+ 99] Guangming Lu, Hartej Singh, Ming-Hau Lee, Nader Bagherzadeh, Fadi J. Kurdahi, and Eliseu M. Chaves Filho. The MorphoSys parallel reconfigurable system. In Euro-Par ’99: Proceedings of the 5th International Euro-Par Conference on Parallel Processing, pages 727–734, London, UK, 1999. SpringerVerlag. [LWH02] Robert R. Lewis, Renwei Wang, and Donald Hung. Design of a pipelined architecture for ray/B´ezier patch intersection computation. In Thirteenth Western Computer Graphics Symposium, 2002. [Mac04] Peter MacMurchy. The use of subdivision surfaces in the modeling of plants. Master’s thesis, University of Calgary, 2004. [Man02] Brandon Mansfield. Ordered BVH hack, 2002. http://datamoon.net/projects/raytracer/index.php. 192 [MB89] J. David MacDonald and Kellogg S. Booth. Heuristics for ray tracing using space subdivision. In Proceedings of Graphics Interface ’89, pages 152–63, Toronto, Ontario, June 1989. Canadian Information Processing Society. [MD02] Stephen Mann and Leo Dorst. Geometric algebra: A computational framework for geometrical applications (part 2). IEEE Computer Graphics and Applications, 22(4):58–67, 2002. [MPJ+ 00] Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, and Mark Horowitz. Smart memories: a modular reconfigurable architecture. In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 161–171, New York, NY, USA, 2000. ACM Press. [MT97] Tomas M¨oller and Ben Trumbore. Fast, minimum storage ray-triangle intersection. Journal of Graphics Tools: JGT, 2(1):21–28, 1997. [NYTN87] Tadashi Naruse, Masaharu Yoshida, Tokiichiro Takahashi, and Seiichiro Naito. SIGHT – a dedicated computer graphics machine. Computer Graphics Forum, 6(4):327–334, December 1987. [PH89] Michael Potmesil and Eric M. Hoffert. The pixel machine: A parallel image computer. In Computer Graphics (SIGGRAPH ’89 Proceedings), volume 23, pages 69–78, July 1989. [PK87a] Ron Pulleyblank and John Kapenga. The feasibility of a VLSI chip for ray tracing bicubic patches. IEEE Computer Graphics and Applications, 7(3):33– 44, 1987. [PK87b] Ron Pulleyblank and John Kapenga. A VLSI chip for ray tracing bicubic patches. In W. Strasser, editor, Advances in Computer Graphics Hardware I, pages 125–140. 1987. 193 [PKMS03] Hanspeter Pfister, Kevin A. Kreeger, Joseph W. Marks, and Chia Shen. U.S. patent 6,556,200 – Temporal and spatial coherent ray tracing for rendering scenes with sampled and geometry data, 2003. [POV] POV-Ray - The persistence of vision raytracer. http://www.povray.org/. [Pur01] Timothy Purcell. SHARP ray tracing architecture. In SIGGRAPH 2001 RealTime Ray Tracing Course Notes, August 2001. [Pur04] Timothy Purcell. Ray Tracing on a Stream Processor. PhD thesis, Stanford University, 2004. [Rit90] Jack Ritter. An efficient bounding sphere. In Graphics Gems, pages 301–303. Academic Press Professional, Inc., San Diego, CA, USA, 1990. [RL00] Szymon Rusinkiewicz and Marc Levoy. QSplat: A multiresolution point rendering system for large meshes. In Proceedings of ACM SIGGRAPH, pages 343–352, 2000. [RW80] Steven M. Rubin and Turner Whitted. A 3-dimensional representation for fast rendering of complex scenes. Computer Graphics, 14(3):110–116, July 1980. [Sch88] Bengt-Olaf Schneider. Ray tracing rational B-spline patches in VLSI. In A. A. M. Kuijk and W. Strasser, editors, Advances in Computer Graphics Hardware II, pages 45–63. Springer-Verlag, New York, 1988. [Sha98] Ashok K. Sharma. Programmable Logic Handbook: PLDs, CPLDs and FPGAs. Mcgraw-Hill, 1998. [Sho98] Ken Shoemake. Pl¨ ucker coordinate tutorial. Ray Tracing News, 11(1), July 1998. 194 [SK95] Kelvin Sung and William J. Kubitz. Tracing rays with the Area Sampling Machine. The Visual Computer, 11(9):477–496, November 1995. ISSN 01782789. [SLS03] J¨org Schmittler, Alexander Leidinger, and Philipp Slusallek. A virtual memory architecture for real-time ray tracing hardware. Computers and Graphics, 27(5):693–699, October 2003. [SM03] Peter Shirley and R. Keith Morley. Realistic Ray Tracing. AK Peters Limited, second edition, 2003. [Smi98] Brian Smits. Efficiency issues for ray tracing. Journal of Graphics Tools, 3(2):1–14, 1998. [SN01] Prasanna Sundararajan and Mohammed Niamat. FPGA implementation of the ray tracing algorithm used in the XPATCH software. In 44th IEEE Midwest Symposium on Circuits and Systems, 2001. [Sof] id Software. Quake 3 Arena. http://www.quake3arena.com/. [ST94] Wolfgang St¨ urzlinger and R. Tobler. Two optimization methods for ray tracing. In Spring Conference of Computer Graphics ’94, pages 104–107, June 1994. [Sta05] Stanford University. The Stanford 3D Scanning Repository, 2005. http://graphics.stanford.edu/data/3Dscanrep/. [St¨ u96] Wolfgang St¨ urzlinger. Bounding volume construction using point clouds. In Spring Conference on Computer Graphics ’96, pages 239–246, June 1996. [Sun92] Kelvin Sung. The Area Sampling Machine. In Third Eurographics Workshop on Rendering, pages 147–160, Bristol, UK, May 1992. 195 [Sun99] Prasanna Sundararajan. Design and implementation of the XPATCH ray tracer on a reconfigurable processor. Master’s thesis, University of Toledo, 1999. [SWS02] J¨org Schmittler, Ingo Wald, and Philipp Slusallek. SaarCOR – A hardware architecture for ray tracing. In Proceedings of the conference on Graphics Hardware 2002, pages 27–36. Saarland University, Eurographics Association, 2002. available at http://www.openrt.de. [SWW+ 04] J¨org Schmittler, Sven Woop, Daniel Wagner, Wolfgang J. Paul, and Philipp Slusallek. Realtime ray tracing of dynamic scenes on an FPGA chip. In Graphics Hardware 2004, 2004. [TH99] Seth Teller and Michael Hohmeyer. Determining the lines through four lines. Journal of Graphics Tools, 4(3):11–22, 1999. [TH05] Julia Taylor-Hell. Biomechanics in botanical trees. Master’s thesis, University of Calgary, 2005. [TL01] Tim Todman and Wayne Luk. Reconfigurable designs for ray tracing. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE Computer Society Press, 2001. [TM3] The Transmogrifier 3a rapid prototyping system. http://www.eecg.utoronto.ca/~tm3. [Tru92] Ben Trumbore. Rectangular bounding volumes for popular primitives. In David Kirk, editor, Graphics Gems III, pages 295–300. Academic Press, 1992. [Uni05] University of North Carolina at Chapel Hill. The Walkthru Project - power plant model release, 2005. http://www.cs.unc.edu/~geom/Powerplant/. 196 [Vat84] Bala Rajareddy Vatti. Multiprocessor ray tracing. Master’s thesis, University of Calgary, 1984. [Vol59] J. E. Volder. The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers, EC-8(3):330–334, 1959. [WBMS05] Amy Williams, Steve Barrus, R. Keith Morley, and Peter Shirley. An efficient and robust ray-box intersection algorithm. Journal of Graphics Tools, 10(1):55–60, 2005. [WH04] N. Weste and D. Harris. CMOS VLSI Design : A Circuits and Systems Perspective. Addison-Wesley, third edition, 2004. [WHG84] Hank Weghorst, Gary Hooper, and Donald P. Greenberg. Improved computational methods for ray tracing. ACM Transactions on Graphics, 3(1):52–69, January 1984. [Whi79] Turner Whitted. An improved illumination model for shaded display. In Computer Graphics (Special SIGGRAPH ’79 Issue), volume 13, pages 1–14, August 1979. [WLH03] Renwei Wang, Robert R. Lewis, and Donald Hung. A pipelined architecture for ray/B´ezier patch intersection computation. Canadian Journal of Electrical and Computer Engineering, 28(1), 2003. [Woo90] Andrew Woo. Fast ray-box intersection. In Andrew Glassner, editor, Graphics Gems, pages 395–396. Academic Press, 1990. [WPO96] Andrew Woo, Andrew Pearce, and Marc Ouellette. It’s really not a rendering bug, you see ... IEEE Computer Graphics and Applications, 16(5):21–25, 1996. 197 [Wri99] Adrian Wrigley. U.S. patent 5,933,146 – Method and apparatus for constructing an image of a notional scene by a process of ray tracing, 1999. [WSBW01] Ingo Wald, Philipp Slusallek, Carsten Benthin, and Markus Wagner. Interactive rendering with coherent ray tracing. In A. Chalmers and T.-M. Rhyne, editors, EG 2001 Proceedings, volume 20(3) of Computer Graphics Forum, pages 153–164. Blackwell Publishing, 2001. [WSS05] Sven Woop, J¨org Schmittler, and Phillip Slusallek. RPU: A programmable ray processing unit for realtime ray tracing. In SIGGRAPH 2005, 2005. [Wu92] Xiaolin Wu. A linear-time simple bounding volume algorithm. In Graphics Gems III, pages 301–306. Academic Press Professional, Inc., San Diego, CA, USA, 1992. [YHDD89] A. C. Yilmaz, S. Hagestein, E. Deprettere, and P. DeWilde. A hardware algorithm for fast realistic image synthesis. In R. L. Grimsdale and W. Strasser, editors, Advances in Computer Graphics Hardware IV, pages 37–60. SpringerVerlag, Berlin, Germany, 1989. [YN97] Fujio Yamaguchi and Masatoshi Niizeki. Some basic geometric test conditions in terms of Pl¨ ucker coordinates and Pl¨ ucker coefficients. The Visual Computer, 13(1):29–41, 1997. ISSN 0178-2789. [YNT91] M. Yoshida, T. Naruse, and T. Takahasi. A dedicated graphics processor SIGHT-2. In R. L. Grimsdale and W. Strasser, editors, Advances in Computer Graphics Hardware IV, pages 151–169. Springer-Verlag, 1991.