ABSTRACT

DRAELOS, MARK THEODORE. The Kinect Up Close: Modifications for Short-Range Depth Imaging. (Under the direction of Edward Grant.)

Microsoft's Kinect contains a diverse set of sensors, most notably a depth camera based on PrimeSense's infrared structured light technology. With a proper calibration of its color and depth cameras, the Kinect can capture detailed color point clouds at up to 30 frames per second. This capability uniquely positions the Kinect for use in fields such as robotics, natural user interfaces, and three-dimensional mapping. Thus, techniques for efficiently calibrating the Kinect depth camera and altering its optical system to improve suitability for imaging smaller scenes at low cost are presented. An application of these techniques to enhance close-range obstacle avoidance in Kinect-based robot navigation is also demonstrated.

To perform depth calibration, a calibration rig and software were developed to automatically map raw depth values to object depths. The calibration rig consisted of a traditional chessboard calibration target with easily locatable features in depth at its exterior corners. These depth features facilitated software extraction of corresponding object depths and raw depth values from paired color and depth images of the rig and thereby enabled automated capture of many thousands of data points. Depth calibration produced the fits f_B2(r) = 1 m / (−0.0027681 r + 3.0163) and f_M2(r) = (0.15257 m) tan(r/2431.3 + 1.1200) based upon Burrus's and Magnenat's functional forms, respectively. Fit f_B2(r) had residuals with µ = −0.22958 mm and σ = 5.7170 mm, and fit f_M2(r) had residuals with µ = 0.065544 µm and σ = 5.5458 mm.
To modify the Kinect's optics for improved short-range imaging, Nyko's Zoom adapter was used due to its simplicity and low cost. Although effective at reducing the Kinect's minimum range, these optics introduced pronounced distortion in depth. A method based on capturing depth images of planar objects at various depths produced an empirical depth distortion model for correcting such distortion in software. After compensation for depth distortion, depth calibration yielded the fit f_B3(r) = 1 m / (−0.0041952 r + 4.5380), which had residuals with µ = 0.19288 mm and σ = 11.845 mm. Together, the modified optics and the empirical depth undistortion procedure demonstrated the ability to improve the Kinect's resolution and decrease its minimum range by approximately 30%.

The ability of modified optics to improve obstacle detection was investigated by examining Kinect-based robot navigation through an environment with an obstacle placed around a blind turn. In separate tests using either the unmodified or the modified Kinect optics, a robot's navigation system was given a waypoint around the blind turn after first demonstrating navigability of the empty environment under both optical configurations. The robot failed to detect and subsequently collided with the obstacle when navigating using the original Kinect optics but successfully detected the obstacle and suspended path plan execution when navigating using the modified Kinect optics. These tests indicated that use of modified optics (with appropriate depth undistortion) does not prevent the Kinect from serving as a navigation sensor and that the modified Kinect's shorter minimum range compared to the unmodified Kinect improves obstacle avoidance.

The success of modifying the Kinect's optics suggests further applications of the Kinect as an extensible yet low-cost depth camera. For example, multi-scale mapping, wherein a robot is equipped with multiple depth imagers optimized for performance at different scales, is likely achievable using several Kinects with varied optics. More forward-looking applications include adaptation of the Kinect's imager to perform laparoscopic depth imaging for use in surgical robotics, medical diagnostics, and education.
© Copyright 2012 by Mark Theodore Draelos All Rights Reserved
The Kinect Up Close: Modifications for Short-Range Depth Imaging
by Mark Theodore Draelos
A thesis submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Master of Science
Electrical Engineering
Raleigh, North Carolina 2012
APPROVED BY:
John Muth
Wesley Snyder
Edward Grant Chair of Advisory Committee
DEDICATION To those who have championed my success.
BIOGRAPHY Mark Draelos was born one afternoon in High Point, NC, and proceeded to disclose almost immediately thereafter an incredible predilection for wires and electrical machines. This fascination matured into a desire to study electrical engineering, stimulated in part by high school participation in FIRST Robotics. Funded by a Park Scholarship, he attended North Carolina State University for an undergraduate education in electrical and computer engineering and physics as a preface to this Master's degree in electrical engineering. Mark aspires to complete an MD/PhD program in which he will study and develop medical robots to enhance the quality of healthcare. Those who know him well are not surprised.
ACKNOWLEDGEMENTS

It has been said that no man is an island, and I am no exception. My success has hinged upon the help of others, many of whom I credit below. I would like to thank:

• Dr. Grant for taking me on first as an undergraduate and later as a graduate student in the Center for Robotics and Intelligent Machines, for sponsoring my summer internships, for the many letters of recommendation he has written on my behalf, and for his support of my further study;
• Matthew Craver for bringing the Nyko Zoom adapter to my attention, which is featured prominently in Chapter 4, and for providing me with projects as an undergraduate researcher;
• Nikhil Deshpande for our rich collaboration that produced Chapter 5 and for helping me work through all manner of research-related conceptual and software problems;
• Zach Nienstedt for listening to my research ideas and for volunteering insights of his own;
• The Park Scholarships program, particularly Eva Holcomb, for putting my fourth scholarship year towards this Master's degree;
• Dr. Devetsikiotis for his support as an instructor during my expedited undergraduate career and for his support as Director of Graduate Programs during my participation in the Accelerated Bachelor's/Master's program; and
• Joshua Hykes for developing the LaTeX template that I have used in preparing this thesis.
TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1 Depth Imaging
  1.1 Common Methods
    1.1.1 Time-of-Flight
    1.1.2 Structured Light
  1.2 Imaging Geometry
    1.2.1 Laser Scanners
    1.2.2 Depth Cameras
  1.3 Applications
    1.3.1 Measurement
    1.3.2 Robotics
    1.3.3 Medicine
  1.4 Limitations
    1.4.1 Range
    1.4.2 Surface Properties
    1.4.3 Shadowing

Chapter 2 Microsoft's Kinect
  2.1 Hardware Features
    2.1.1 Camera System
    2.1.2 Other Sensors and Actuators
    2.1.3 Interface
  2.2 Software Interface
    2.2.1 OpenKinect
    2.2.2 OpenNI
    2.2.3 Kinect for Windows SDK
  2.3 Suitability
    2.3.1 Choice of Driver
    2.3.2 Range
    2.3.3 Distortion

Chapter 3 Kinect Calibration
  3.1 Objectives
  3.2 Procedure
    3.2.1 Calibration Target
    3.2.2 Target Recognition
    3.2.3 Establishing Correspondences
    3.2.4 Curve Fitting
  3.3 Outcomes
    3.3.1 Results
    3.3.2 Discussion

Chapter 4 Modifying the Kinect Optics
  4.1 Objectives
  4.2 Nyko's Lens Adapter
    4.2.1 Suitability
  4.3 Empirical Depth Undistortion
    4.3.1 Theory
    4.3.2 Capturing Snapshots
    4.3.3 Estimating True Depth
    4.3.4 Snapshot Processing
  4.4 Outcomes
    4.4.1 Results
    4.4.2 Discussion
    4.4.3 Improvements

Chapter 5 Navigation Experiments
  5.1 Objectives
  5.2 Experimental Platform
    5.2.1 The CRIMbot
    5.2.2 Robot Operating System (ROS)
    5.2.3 Software Modifications
  5.3 Test Environment
    5.3.1 Maze Design
    5.3.2 Obstacle Placement
  5.4 Procedure
  5.5 Outcomes
    5.5.1 Results
    5.5.2 Discussion
    5.5.3 Improvements

Chapter 6 Future Work
  6.1 Improving Techniques
  6.2 Multi-scale Mapping
  6.3 Laparoscopic Imaging
    6.3.1 Current Systems
    6.3.2 The Possibilities

References

Appendices
  Appendix A Camera Parameters
    A.1 Original Optics
    A.2 Modified Optics
  Appendix B ROS Parameter Tuning
    B.1 Mapping
    B.2 Navigation
      B.2.1 Path Planning
      B.2.2 Obstacle Avoidance
      B.2.3 Localization
  Appendix C OpenNI Kinect Calibration
    C.1 Outcomes
      C.1.1 Results
      C.1.2 Discussion
  Appendix D Software Tools
    D.1 Libraries
    D.2 Architecture
    D.3 Techniques
      D.3.1 High-Fidelity Image Undistortion
      D.3.2 Performance Optimizations
    D.4 User Interfaces
LIST OF TABLES

Table 5.1  Average navigation time into and out of the empty maze from start of CRIMbot motion to achievement of the target pose for both Kinect configurations.
Table A.1  Color and infrared camera lens distortion coefficients estimated for the Kinect's original optics.
Table A.2  Color and infrared camera lens distortion coefficients estimated for the Nyko Zoom optics.
Table B.1  GMapping parameter differences between the Turtlebot and the CRIMbot.
Table B.2  Path planning and execution parameter differences between the Turtlebot and the CRIMbot.
Table B.3  Common costmap parameter differences between the Turtlebot and the CRIMbot.
Table B.4  Global costmap parameter differences between the Turtlebot and the CRIMbot.
Table B.5  Local costmap parameter differences between the Turtlebot and the CRIMbot.
Table B.6  AMCL parameter differences between the Turtlebot and the CRIMbot.
LIST OF FIGURES

Figure 1.1  The geometrical setup of a structured light depth camera located at the origin.
Figure 1.2  Laser scanner imaging geometry based upon spherical coordinates.
Figure 1.3  Depth camera imaging geometry based upon the pinhole camera model.
Figure 2.1  Picture of the Kinect for Xbox 360 as shipped.
Figure 2.2  Picture of the Kinect's internal hardware.
Figure 2.3  Structured light pattern emitted by the Kinect when imaging a planar scene, specifically the one in Fig. 2.5.
Figure 2.4  A sample scene in depth and its raw depth value histogram as viewed with the Kinect.
Figure 2.5  A planar object approximately aligned to the Kinect's image plane in depth and its raw depth value histogram as viewed with the Kinect.
Figure 3.1  An 11 × 8 chessboard calibration pattern.
Figure 3.2  Color and depth images of a 24.5 cm × 18.4 cm depthboard rig with a 11 × 8 chessboard pattern.
Figure 3.3  A 11 × 8 chessboard's 70 interior corners (black) projected onto a depthboard depth image following pose estimation.
Figure 3.4  Plot of measured depth in meters versus raw depth value for 134 depthboard poses.
Figure 3.5  (top) Plot of residuals for the best fit based upon Burrus's reciprocal calibration form (Eq. 3.4).
Figure 3.6  Plot of Burrus's (Eq. 3.2) and Magnenat's (Eq. 3.3) depth calibrations and fitted depth calibrations using the original Kinect optics (Eq. 3.4 and Eq. 3.5).
Figure 4.1  Picture of Nyko's Zoom adapter.
Figure 4.2  A sample scene similar to Fig. 2.4 in depth and its raw depth value histogram as viewed by the Kinect using the Nyko Zoom optics.
Figure 4.3  A planar object approximately aligned to the Kinect's image plane in depth and its raw depth value histogram as viewed by the Kinect using the Nyko Zoom optics.
Figure 4.4  Structured light pattern emitted and captured by the Kinect when using modified optics to image the scene in Fig. 4.3.
Figure 4.5  Adaptation of a swivel chair as a mount for the Kinect with modified optics.
Figure 4.6  Comparison of three methods for estimating the true raw depth value in distorted depth images of planar objects for one particular dataset.
Figure 4.7  Surface plot of cropped and downsampled depth images of large planar objects viewed with the modified Kinect.
Figure 4.8  Undistorted depth image from applying Eq. 4.1 as estimated in Fig. 4.7 to the depth image in Fig. 4.3.
Figure 4.9  (top) Plot of measured depth in meters versus raw depth value for 46 depthboard poses.
Figure 4.10  Plot of residuals for the best fit based upon Burrus's reciprocal calibration form (Eq. 3.4).
Figure 4.11  Plot of fitted depth calibrations using the original Kinect optics (Eq. 3.4 and Eq. 3.5) and the modified Kinect optics (Eq. 4.3).
Figure 5.1  Picture of the CRIMbot with the modified optics attached.
Figure 5.2  Design plan for the maze test environment.
Figure 5.3  Overhead picture of the maze's corridor and blind turn.
Figure 5.4  Maze test environment as mapped using SLAM.
Figure 5.5  Intended navigation trajectories into (left) and out of (right) the maze.
Figure 5.6  Placement of the discoverable obstacle around the blind turn.
Figure 5.7  Object placed as an obstacle in the maze.
Figure 5.8  Visualizations of the CRIMbot's three navigation positions created using ROS's RViz.
Figure 5.9  Visualizations of representative outcomes for the four different test scenarios when navigating into the maze.
Figure 5.10  Overhead pictures of representative outcomes for the four different test scenarios when navigating into the maze.
Figure 5.11  Visualizations of representative outcomes for the two test scenarios that involved navigating out of the maze.
Figure 5.12  Magnified visualization of the CRIMbot finishing the blind turn into the maze and failing to observe the obstacle with the unmodified Kinect.
Figure C.1  Surface plot of cropped and downsampled depth images of large planar objects viewed with the modified optics using the OpenNI Kinect driver.
Figure C.2  (top) Plot of measured depth in meters versus raw depth value for 41 depthboard poses.
Figure C.3  Plot of residuals for best fit from (Eq. C.1).
Figure D.1  Diagram of the Kinect interface's modular architecture.
Figure D.2  Graphical user interface for point cloud measurements.
Figure D.3  Graphical user interface for point cloud and image mass capture.
CHAPTER 1 DEPTH IMAGING
Depth imaging is the process of forming an image in which the quantity of interest is distance. Whereas color image pixels capture object color and density image pixels measure material density, a depth image pixel value represents the distance of the first object encountered along a ray through that pixel. Such depth images are acquired using a number of methods and have many applications, most commonly, the measurement of object position in three dimensions. This chapter gives a broad overview of several depth imaging technologies and applications to provide background for later chapters.
1.1 Common Methods
The contactless formation of depth images exploits one of three techniques: triangulation, interferometry, and time-of-flight [1]. For the purposes of this discussion, only time-of-flight and the structured light subcategory of triangulation are considered. Devices such as the Hokuyo UXM-30LN laser scanner [2] and the Mesa Imaging SwissRanger SR400 depth camera [3] rely upon the former technique whereas the PrimeSense Reference Design depth camera [4] relies upon the latter. The remainder of this section presents the theory of operation behind time-of-flight and structured light depth imaging.
1.1.1 Time-of-Flight
The underlying principle behind time-of-flight depth imagers is the timing of the interval required for light to traverse an unknown distance. Such devices operate by emitting a light pulse (visible or otherwise) at known time t_1 and measuring the return time t_2 of the first reflection. Since the speed of light c is known, the radial distance r to the object that generated the reflection, which is the nearest object along the pulse's ray, is given by

\[ r = \frac{c\,(t_2 - t_1)}{2} = \frac{c\,\Delta t}{2}. \tag{1.1} \]
Combining this distance measurement with the light pulse’s orientation provides sufficient information to compute the three-dimensional position with respect to the imager of where the light pulse encountered an object (Section 1.2.1).
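To make Eq. 1.1 concrete, the short sketch below (illustrative only, not code from this work) converts a measured round-trip time into a range; a 10 ns round trip corresponds to roughly 1.5 m.

```python
# Illustrative only: convert a measured round-trip time to range via Eq. 1.1.
C = 299_792_458.0  # speed of light (m/s)

def tof_range(t1_s: float, t2_s: float) -> float:
    """Radial distance to the first reflector for a pulse emitted at t1 and received at t2."""
    return C * (t2_s - t1_s) / 2.0

print(tof_range(0.0, 10e-9))  # ~1.499 m for a 10 ns round trip
```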
1.1.2 Structured Light
Structured light depth imagers operate using a variation of the stereo method to measure depth. Rather than using two cameras as with stereopsis, a well-defined pattern is projected onto the scene at a known pose and then imaged from a separate known pose fixed in relation to the first. Known features from the pattern are then identified in the captured image to triangulate each feature's depth as depicted in Fig. 1.1. For a feature F projected at angle φ and imaged at angle θ with the camera and projector separated a distance d, the associated depth in the x-z plane is

\[ z = \frac{d}{\cot\theta + \cot\phi}, \tag{1.2} \]

which is derived by noting that the two right triangles with angles φ and θ share a height of z and have bases that sum to d. This solution depends explicitly upon the feature's two angles. Alternatively, if feature F is known to have image coordinate x_{i0} at a single known depth z_0, its depth in any scene is expressible purely in terms of its image coordinate x_i provided that the camera's scaled focal length α_x and principal point component c_x are known. For any depth, the image coordinate x_i of feature F in Fig. 1.1 is

\[ x_i = -\alpha_x \frac{x}{z} + c_x = -\alpha_x \cot\theta + c_x, \tag{1.3} \]

as given by the pinhole camera model geometry (Fig. 1.3) [6, 7].1 Since θ varies with the position of feature F, Eq. 1.2 is used to re-express x_i as

\[ x_i = -\alpha_x \left( \frac{d}{z} - \cot\phi \right) + c_x, \tag{1.4} \]

which captures the feature's identity in φ since φ is fixed for a given feature regardless of that feature's position in world coordinates.2 Thus, if x_{i0} and z_0 satisfy Eq. 1.4 for feature F, then subtracting x_{i0} from x_i for feature F yields

\[ \delta x_i \equiv x_i - x_{i0} = -\alpha_x \left( \frac{d}{z} - \cot\phi \right) + c_x + \alpha_x \left( \frac{d}{z_0} - \cot\phi \right) - c_x = \alpha_x d \left( \frac{1}{z_0} - \frac{1}{z} \right), \tag{1.5} \]

which relies upon the constancy of φ for a given feature. This result relates the change in feature depth from known z_0 to unknown z with the associated change in image coordinate from x_{i0} to x_i. Solving Eq. 1.5 for z in terms of δx_i yields

\[ z = \left( \frac{1}{z_0} - \frac{\delta x_i}{\alpha_x d} \right)^{-1} = \frac{1}{a\,\delta x_i + b}, \tag{1.6} \]

where a = −1/(α_x d) and b = 1/z_0. Thus, a feature's depth in the x-z plane is expressible in terms of its image coordinate shift from a single known calibration point and the camera's intrinsic parameters. Both Eq. 1.2 and Eq. 1.6 provide sufficient information to compute the three-dimensional position with respect to the camera of all features projected onto the scene and hence the scene's object surfaces as well (Section 1.2.2). The concise expression of Eq. 1.6 and its independence of φ, which relaxes the requirement to know φ for every projected feature, render it preferred over Eq. 1.2. Moreover, Eq. 1.2 relies upon transcendental functions that require more computation than the arithmetic of Eq. 1.6.

Figure 1.1: The geometrical setup of a structured light depth camera located at the origin. The projector creates feature F in the scene at a known angle φ, and the camera measures angle θ to triangulate the feature's position in the x-z plane [5, p. 49]. The camera and projector are chosen to lie along the x-axis and face in the ẑ direction without loss of generality, which renders the y-axis irrelevant. x̂_i is oriented oppositely of x̂ to maintain consistency with Fig. 1.3.

1 Section 1.2.2 further discusses the pinhole camera model as drawn in Fig. 1.3. Of significance here is the difference in orientation of x̂ and x̂_i which produces the minus sign.
2 The projection of features onto the scene at known angles produces this result. Furthermore, if the structured light pattern is designed carefully, each feature exists in a one-to-one mapping with φ, meaning that specification of φ uniquely identifies the corresponding feature.
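The sketch below illustrates how Eq. 1.6 turns a feature's pixel shift into depth once a single reference observation (z_0, x_{i0}) is available. It is a minimal example with assumed camera and projector values (α_x, d, z_0), not the parameters of any actual device.

```python
# Minimal sketch of Eq. 1.6 with illustrative, assumed values.
def make_depth_from_shift(alpha_x: float, d: float, z0: float):
    """Return a function mapping a feature's pixel shift (x_i - x_i0) to depth in the x-z plane."""
    a = -1.0 / (alpha_x * d)  # Eq. 1.6
    b = 1.0 / z0
    return lambda delta_xi: 1.0 / (a * delta_xi + b)

depth = make_depth_from_shift(alpha_x=580.0, d=0.075, z0=1.0)  # assumed values
print(depth(0.0))   # 1.0 m at the reference shift
print(depth(10.0))  # a +10 px shift places the feature beyond the 1 m reference plane (~1.3 m here)
```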
1.2 Imaging Geometry
Depth imaging devices capture images using geometries suited to their specific technology. Depth cameras produce traditional rectangular images with pixels uniquely specified by their Cartesian position (x i , y i ) in the camera’s focal plane. Laser scanners, however, produce “angular” images with pixels uniquely specified by the inclination and azimuthal angles (φ, θ ) of their capture. For the purposes of reconstructing object surfaces, the x i -y i and φ-θ pixel arrays are converted into world coordinates in the camera’s frame of reference. This representation is called a “point cloud” since the image, which has regularly spaced samples, is transformed into a set of points in three-dimensional space.3 The mathematics of this transformation depends upon the capture device’s geometry.
1.2.1 Laser Scanners

The laser scanner geometrical setup is directly analogous to the definition of spherical coordinates (Fig. 1.2). Laser scanners emit light pulses along a radial line with known azimuthal angle φ and inclination angle θ.4 The object distance measurement r at that φ and θ defines a point in three-dimensional space with displacement

\[ \vec{r} = r\,\hat{r} = \begin{pmatrix} r\sin\theta\cos\phi \\ r\sin\theta\sin\phi \\ r\cos\theta \end{pmatrix} \tag{1.7} \]

from the laser scanner's imaging origin. The set of points computed in this manner from a laser scan reconstructs the object surfaces presented to the scanner.

Figure 1.2: Laser scanner imaging geometry based upon spherical coordinates. The vector ~r is the laser beam's path from the imaging origin to the nearest object for given φ and θ.

3 The point cloud representation may or may not retain the original image's structure in memory. So-called "dense" point clouds retain neighbor relationships present between pixels in the source image, which can simplify point cloud operations such as surface normal estimation. Other "unordered" or "sparse" point clouds require use of search algorithms to identify neighboring points.
4 Most laser scanners will require an articulating mount to achieve both degrees of freedom in φ and θ. In a common arrangement, the laser scanner sweeps its beam across its field of view in φ, after which the articulating mount adjusts the scanner to a new θ.
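A minimal sketch of Eq. 1.7 follows; it simply converts one scanner sample (r, θ, φ) into Cartesian coordinates and is not tied to any particular scanner.

```python
import numpy as np

# Minimal sketch of Eq. 1.7: one laser-scanner sample to Cartesian coordinates (scanner frame).
def scan_point(r: float, theta: float, phi: float) -> np.ndarray:
    """Point struck by the beam at inclination theta and azimuth phi, range r."""
    return r * np.array([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])

print(scan_point(2.0, np.pi / 2, 0.0))  # a 2 m return along the x-axis -> [2, 0, 0]
```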
1.2.2 Depth Cameras
The depth camera geometrical setup is the traditional pinhole camera model (Fig. 1.3). In general, the projection of a point in three-dimensional space onto a pinhole camera's image plane discards the point's depth. Thus, inverting the projection of a point yields only the ray from the camera's focal point through that point rather than the point itself. The image from a depth camera, however, contains points' depths, and using the depth camera's intrinsic parameters, the camera projection process is easily inverted for each pixel to compute the corresponding point in space.5 For a camera with scaled focal lengths α_x and α_y and principal point (c_x, c_y), a pixel with image coordinates (x_i, y_i) and object depth z defines the displacement

\[ \vec{r} = x\,\hat{x} + y\,\hat{y} + z\,\hat{z} = \begin{pmatrix} -(x_i - c_x)\,z/\alpha_x \\ -(y_i - c_y)\,z/\alpha_y \\ z \end{pmatrix} \tag{1.8} \]

from the camera's focal point [6, 7].6 Each pixel of a depth image is unprojected in this manner to produce a three-dimensional surface reconstruction of the camera's scene.
Figure 1.3: Depth camera imaging geometry based upon the pinhole camera model. The vector ~r defines the ray from the camera's focal point to the nearest object that maps to image coordinates (x_i, y_i). Knowledge of z from the depth image pixel value enables finding the x and y that originally produced x_i and y_i, respectively, up to the precision allowed by the image's discretization.

5 Strictly speaking, inverting the projection process produces the volume containing the point, not the actual point, corresponding to each pixel due to discretization, and this volume increases with depth. Unprojected points at greater depth thus have lower precision than those at lesser depth.
6 The scaled focal lengths for a camera with true focal length f are defined as α_x = m_x f and similarly for α_y, where m_x and m_y are pixels/distance scale factors. The negative signs for the x- and y-components of the unprojected ~r account for the different orientations of the x-y plane and the x_i-y_i plane as shown in Fig. 1.3.
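The unprojection of Eq. 1.8 is illustrated by the sketch below, which converts a metric depth image into a dense point cloud. The intrinsic parameters used are placeholders rather than calibrated Kinect values.

```python
import numpy as np

# Minimal sketch of Eq. 1.8: unproject an H x W metric depth image into an H x W x 3 point cloud.
def unproject(depth_m: np.ndarray, alpha_x: float, alpha_y: float, cx: float, cy: float) -> np.ndarray:
    h, w = depth_m.shape
    yi, xi = np.mgrid[0:h, 0:w]             # per-pixel image coordinates (x_i, y_i)
    x = -(xi - cx) * depth_m / alpha_x      # sign convention of Fig. 1.3
    y = -(yi - cy) * depth_m / alpha_y
    return np.dstack((x, y, depth_m))

# Placeholder intrinsics; a 2 m planar scene unprojects to (0, 0, 2) at the principal point.
cloud = unproject(np.full((480, 640), 2.0), alpha_x=580.0, alpha_y=580.0, cx=320.0, cy=240.0)
print(cloud[240, 320])
```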
1.3 Applications
Applications for depth imaging abound due to its extensibility and the usefulness of determining object position and surface structure. The following subsections briefly describe current and potential uses of depth imaging in diverse fields.
1.3.1 Measurement
Measurement, and closely related verification, directly use depth imaging to capture detailed three-dimensional models that accurately reflect objects’ exterior surfaces, such as in [8, pp. 618–619]. If of sufficient resolution, these surface models are suitable for inspecting tolerances of machined parts, detecting manufacturing defects in an assembled device, or even reproducing the original objects’ shape using three-dimensional printers. Depending upon the level of detail required and the scale of the object of interest, model generation for measurement and verification can use both time-of-flight and structured light techniques.
1.3.2 Robotics
Robotic applications often employ depth imagers as sensors to gather detailed representations of the environment. Robot control algorithms use these representations to avoid obstacles during navigation, produce maps during exploration, and gain feedback when interacting with the environment. For example, Willow Garage’s PR-2 robot uses an articulated Hokuyo UTM-30LX laser scanner to obtain a point cloud of both its immediate and long-range environment, enabling localization using an existing map and navigation around dynamically introduced obstacles [9]. Time-of-flight cameras like Mesa Imaging’s SwissRangers provide depth imaging for similar purposes yet at shorter maximum ranges than possible with laser scanners.
1.3.3 Medicine
It is believed that promising medical applications for depth imaging technologies exist as well. Macroscopic depth imaging of tissue surfaces, for example, could provide useful diagnostic information in registering existing X-ray, magnetic resonance, and computed tomography images. Of particular interest is the integration of depth imaging with laparoscopic minimally invasive surgery systems. Current laparoscopes provide only a monocular view of the surgical
field or at best a stereoscopic view, in the case of the da Vinci Surgical System by Intuitive Surgical [10]. Performing depth imaging with laparoscopes would provide unprecedented three-dimensional visualization of the tissues and instruments within a surgical space and provide detailed telemetry for true robotic surgical systems. Educational and research uses of laparoscopic surgeries recorded with depth imaging may exist as well (Section 6.3).
1.4 Limitations
Based upon their methods of acquiring images, the time-of-flight and structured light depth imaging technologies have inherent functional and practical operating limitations. The following subsections identify and discuss those limitations most immediately affecting common depth imaging applications.
1.4.1 Range
Time-of-flight and structured light depth imagers have both a minimum and a maximum range. For time-of-flight devices, the minimum range is determined by the minimum interval measurable with the on-board timing circuitry. The minimum range of structured light devices, however, is determined by the range at which the reflected structured light pattern exceeds the camera's field of view or is sufficiently bright to saturate the image sensor. The maximum range for both device types depends upon their ability to resolve light reflected from distant objects and, for time-of-flight imagers, the maximum time waited for that reflection as well.
1.4.2 Surface Properties
Since time-of-flight and structured light depth imaging methods rely upon capturing reflections of light projected into the environment, object surfaces that poorly reflect light back to the imager resist accurate depth imaging. Highly reflective surfaces, such as mirrors in the most extreme case, deflect laser beams and structured light patterns in addition to producing specular highlights that complicate identification of structured light features. Reflection in particular can result in overestimated depths corresponding to those object surfaces viewed through reflections rather than the surface of the reflective object itself. Highly absorbing surfaces, such as laser safety curtains, prevent the return of laser beams to the scanner for
timing and yield no structured light features, which produces gaps or “holes” in the resulting depth image.
1.4.3 Shadowing
For structured light imagers, the spacing between the camera and projector causes the casting of shadows by objects near the projector onto those more distant as viewed from the camera's perspective. Since the imaging device cannot estimate the depth of objects for which the structured light pattern is occluded, scenes with a range of actual object depths may have gaps in the depth image along the edges of near objects. Shortening the camera-projector spacing reduces this effect but leads to decreased depth sensitivity as seen from the dependence upon d in Eq. 1.2 and Eq. 1.6.
CHAPTER 2 MICROSOFT’S KINECT
Originally produced as the next-generation input device for Microsoft’s Xbox 360 gaming console, the Kinect contains a diverse set of sensors, most notably a depth camera based on PrimeSense’s infrared structured light technology (Fig. 2.1) [11]. With a proper calibration of its color and depth cameras, the Kinect can capture detailed color point clouds at up to 30 frames per second. This capability uniquely positions the Kinect for use in fields other than entertainment, such as robotics, natural user interfaces, and three-dimensional mapping.
2.1 Hardware Features
Microsoft properly describes the Kinect as a “sensor array” since it contains more than a depth camera. A teardown of the Kinect reveals that its full hardware complement includes a structured light depth camera, a color camera, a microphone array, a three-axis accelerometer, and a tilt motor as shown in Fig. 2.2. The following subsections discuss these sensors, placing most emphasis on the camera system.
2.1.1 Camera System
The Kinect's camera system consists of paired CMOS infrared and color cameras and an infrared structured light projector. Both cameras sit in the Kinect's middle on either side of its standoff whereas the structured light projector is displaced towards the Kinect's left side (as viewed from the front). Using stereo calibration, the cameras are measured as roughly 2.5 cm apart, and a similar method measures the separation between the structured light projector and infrared camera as roughly 7.5 cm [12, 13]. Microsoft's official Kinect programming guide specifies the system's field of view as 43° vertical by 57° horizontal, the color and depth image sizes as VGA and QVGA,1 respectively, the "playable range" as 1.2 m to 3.5 m, and the frame rate as 30 Hz [14, p. 11]. Based upon the PrimeSensor Reference Design specifications, the Kinect's expected depth resolution at 2 m is 1 cm [4].

The Kinect's infrared structured light projector emits the pattern shown in Fig. 2.3, which resembles the illumination scheme described by [15], one of PrimeSense's depth imaging patents. According to the patent, the pattern consists of pseudo-random spots such that any particular local collection of spots (a "speckle feature") is uncorrelated with the remainder of the pattern. Thus, a known speckle feature is uniquely identifiable and locatable in an image of the structured light pattern. The patent's explanation continues by noting the utility of pixel shifts for measuring depth, which suggests that the Kinect relies upon Eq. 1.6 to compute depths using a calibrated reference image as described in Section 1.1.2.

1 The OpenKinect and OpenNI drivers (Section 2.2) can obtain color and depth images from the Kinect in SXGA and VGA sizes. These larger sizes are more consistent with the PrimeSensor Reference Design specifications in [4].

Figure 2.1: Picture of the Kinect for Xbox 360 as shipped. The Kinect's three apertures correspond to a structured light projector, color camera, and infrared camera in order from left to right. Fig. 2.2 reveals the hardware that lies beneath the Kinect's polished plastic exterior. (Retrieved from [11] on 14 Feb 2012 under the open source Creative Commons BY-NC-SA license.)

Figure 2.2: Picture of the Kinect's internal hardware. The two cameras are easily distinguished from the structured light projector due to their more numerous data wires, and close inspection of the circuit boards reveals the presence of PrimeSense's image processor. The cameras and projector mount rigidly to the metal support to maintain fixed relative poses. A fan is also included to cool the laser projector. (Retrieved from [11] on 14 Feb 2012 under the open source Creative Commons BY-NC-SA license.)
A speculative yet detailed discussion of the Kinect’s structured light pattern processing algorithm is given on the ROS.org wiki page for Kinect calibration [13]. Kurt Konolige and Patrick Mihelich postulate that the Kinect uses pixel offsets from a calibrated reference image to transform an infrared image of the structured light pattern into depth. They estimate that the Kinect computes pixel offsets to 1/8 subpixel accuracy using a 9 × 9 pixel correlation
window on a 2×2 downsampled image from the infrared camera. The associated relationship between depth and pixel offset is given on the wiki page as

\[ z = \frac{8\,b\,f}{\text{disparity offset} - \text{Kinect disparity [pixel]}}, \tag{2.1} \]
where b is the camera-projector baseline separation and f is the infrared camera’s focal length. This strongly resembles the functional form of Eq. 1.6 for a general structured light imager and is equivalent after accounting for the scale factor of 8.
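A small numerical sketch of Eq. 2.1 is given below. The baseline, focal length, and disparity offset are assumed round numbers consistent with Section 2.1.1 and [13], not calibrated values from this thesis.

```python
# Minimal sketch of Eq. 2.1 with assumed constants (not calibrated results).
B = 0.075                   # camera-projector baseline (m), roughly 7.5 cm
F = 580.0                   # infrared camera focal length (px), assumed
DISPARITY_OFFSET = 1090.0   # assumed reference offset in 1/8-pixel units

def kinect_depth(raw_disparity: float) -> float:
    """Depth in meters from a raw (1/8-subpixel) Kinect disparity value."""
    return 8.0 * B * F / (DISPARITY_OFFSET - raw_disparity)

print(kinect_depth(600.0))  # ~0.71 m for this particular choice of constants
```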
Figure 2.3: Structured light pattern emitted by the Kinect when imaging a planar scene, specifically the one in Fig. 2.5. The pattern exhibits structure at large scale but random noise at small scale. This image was captured using the Kinect’s infrared camera and is thus an example of the structured light processor’s input.
2.1.2 Other Sensors and Actuators
In addition to the color and depth camera system, the Kinect includes an array of four microphones, a three-axis accelerometer, and a tilt motor. The microphone array is capable of beamforming, source localization, and speech recognition with the assistance of a host computer, although the Kinect can perform on-board echo cancellation and noise suppression [14]. The tilt motor can angle the Kinect ±31° from horizontal and works in conjunction with the accelerometer to level the Kinect on an angled surface [16].2
2.1.3 Interface
The Kinect’s hardware interface consists of a proprietary type-A USB connector which provides a +12 V supply in addition to USB power. Fortunately, the Kinect is packaged with a converter to the standard type-A USB connector that draws extra power from a wall adapter, and use with a computer is consequently possible without hardware modification. The Kinect thus effectively requires separate connections for data and power.
2.2 Software Interface
Several profoundly different software drivers for the Kinect have emerged since its original release. Variations exist in supported imaging modes, datatype of retrieved images, and operating system support. Each driver differs additionally in its level of device abstraction, which directly impacts access to low-level hardware features.
2.2.1 OpenKinect
The OpenKinect driver (http://openkinect.org/), known by its library name libfreenect, was the first Kinect driver available for general use and is open-source, cross-platform, and derived purely from reverse-engineering efforts. libfreenect implements low-level access to the hardware by directly communicating with the Kinect's USB endpoints. Support exists for reading raw camera frames, changing the status LED state, reading the tilt motor joint state, and writing the tilt motor setpoint [16]. The Kinect's audio subsystem, however, requires the upload of a firmware image at runtime, which limits its accessibility and usefulness with libfreenect. Obtainable camera frames include Bayer-encoded color images, 11-bit uncalibrated depth images, 10-bit infrared images, and various other encodings or compressions thereof. libfreenect does not provide skeletal tracking support since that functionality is not performed on board the Kinect.

2 The Microsoft programming guide states a tilt range of ±28°, however [14, p. 11].
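For reference, a minimal capture sketch using libfreenect's Python bindings appears below. The module name freenect, the 11-bit default depth format, and the 2047 "no reading" sentinel are assumptions about the bindings rather than details documented in this chapter.

```python
# Sketch assuming the `freenect` Python bindings from OpenKinect are installed:
# grab one uncalibrated depth frame and one color frame from an attached Kinect.
import freenect
import numpy as np

depth, _ = freenect.sync_get_depth()   # 480 x 640 array of raw (uncalibrated) depth values
video, _ = freenect.sync_get_video()   # 480 x 640 x 3 RGB image

print(depth.dtype, depth.min(), depth.max())
print(np.count_nonzero(depth == 2047))  # 2047 is assumed here to mark "no reading" pixels
```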
2.2.2 OpenNI
OpenNI (http://www.openni.org) is a collaboration between several companies, including PrimeSense, to develop "natural interaction" software and devices [17]. The OpenNI software architecture is open-source, cross-platform, and abstracts high-level functionality, such as skeletal tracking and gesture recognition, from low-level device data capture. Specifically, OpenNI implements a framework into which SensorKinect, the OpenNI Kinect interface module based upon PrimeSense's software, integrates. As a result of this design, the OpenNI framework does not allow low-level access to the Kinect and, consequently, provides access to only color, calibrated depth, and infrared images.3 Nonetheless, the usefulness of calibrated depth images and the lesser importance of the Kinect's auxiliary sensors and actuators result in preference for OpenNI over libfreenect in Willow Garage's ROS software suite [18]. OpenNI additionally integrates with PrimeSense's NITE software (http://www.primesense.com/nite), which provides device-independent skeletal tracking and gesture recognition on the host system.
2.2.3 Kinect for Windows SDK
The Kinect for Windows SDK (http://www.kinectforwindows.org) is Microsoft's response to the OpenKinect and OpenNI Kinect drivers. The SDK provides high-level access to color and calibrated depth images, the tilt motor, advanced audio capabilities, and skeletal tracking but requires Windows 7 (or newer) and the .NET Framework 4.0, which very effectively restricts use to Windows only [19].4 Speech recognition is supported in conjunction with Microsoft's Speech Platform SDK v11 on the host system.

3 The term "calibrated" is used to indicate that depth image pixel values are actual object depths.
4 It is unclear if the Kinect for Windows SDK provides access to infrared images or accelerometer data since access to the documentation requires installing the SDK on a Windows 7 system.
2.3 Suitability
The use of PrimeSense’s structured light depth camera and the availability of software drivers render the Kinect an attractive depth imaging system. With depth and color imaging at 30 Hz and roughly 1 cm depth resolution, the Kinect captures a rich three-dimensional representation of its immediate environment. Moreover, as a consumer product priced at a few hundred dollars, the Kinect costs far less than other common depth camera systems. It was thus decided to explore possible non-gaming applications for the Kinect, particularly those that involved the imaging of smaller scenes.
2.3.1 Choice of Driver
The OpenKinect driver was chosen for investigating the Kinect based upon its simple software interface, low-level hardware access, cross-platform support, and support for multiple programming languages.5 Of particular significance is libfreenect’s ability to retrieve uncalibrated depth images from the Kinect for the purposes of examining depth calibration, which is the topic of Chapter 3. Furthermore, skeletal tracking was not considered important in exploring small scene depth imaging, which eliminated the need for that functionality as provided by the OpenNI and Kinect for Windows SDK drivers.
2.3.2 Range
As is consistent with Section 1.4, the ability of the PrimeSense image processor to resolve the structured light pattern enforces the Kinect's range. Near objects with high infrared reflectivity or external infrared illumination (e.g., incandescent light bulbs or sunlight) obscure the structured light pattern and consequently require a larger minimum range for detection. Far objects with high infrared absorption poorly reflect the structured light pattern and consequently require a smaller maximum range for detection. Specifically, "raw depth values" from the Kinect range from approximately 300 for near objects to 1100 for far objects, as observed by examining the depth histogram for a scene containing near and far objects (Fig. 2.4). Crude measurements suggest that this range corresponds to 0.5 m through 10 m,6 and the Kinect reports out-of-range pixels for objects that exceed these limits. To improve imaging of scenes with smaller dimensions, however, Chapter 4 explores techniques to decrease the Kinect's range requirements.

5 Section D.1 specifically discusses language features and data formats that favored use of libfreenect. Also, since Linux was the preferred development platform, the Kinect for Windows SDK was disfavored, especially given its dependence upon Windows 7 and the .NET Framework.
6 This greatly exceeds the "playable range" reported by Microsoft's Kinect programming guide (Section 2.1.1), which most likely reflects the optimal range for skeletal tracking rather than the Kinect's operable range. For the purposes of depth imaging, the entirety of the Kinect's range is usable.
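A short sketch of how such range limits might be applied in software follows; the 300 to 1100 band is the rough estimate from above, not a formal specification.

```python
import numpy as np

# Illustrative only: mask a raw depth frame to the approximate usable band observed above.
RAW_MIN, RAW_MAX = 300, 1100

def in_range_mask(raw_depth: np.ndarray) -> np.ndarray:
    """Boolean mask of pixels whose raw depth values fall inside the usable band."""
    return (raw_depth >= RAW_MIN) & (raw_depth <= RAW_MAX)

frame = np.random.randint(0, 2048, size=(480, 640))  # stand-in for a captured raw depth image
print(f"{in_range_mask(frame).mean():.1%} of pixels fall inside the usable raw-depth band")
```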
Figure 2.4: A sample scene in depth and its raw depth value histogram as viewed with the Kinect. White represents pixels for which the Kinect reported no depth. The foreground object extends within the Kinect's minimum depth range and the background exceeds the Kinect's maximum depth range. The Kinect's depth range as estimated with this scene has an approximate lower limit of 300 and an approximate upper limit of 1100. The foreground object has an infrared-absorbing cloth covering which may lead to overestimation of the Kinect's lower limit.
2.3.3 Distortion
Additional experimentation indicates that the mapping between raw depth values and object depth is largely independent of (x i , y i ) image coordinates since pixels of planar objects parallel to the Kinect’s image plane receive nearly identical raw depth values (Fig. 2.5). This purposes of depth imaging, the entirety of the Kinect’s range is usable.
indicates that the Kinect either images without significant distortion or compensates for it when interpreting a captured structured light pattern. Distortion is encountered, however, in pursuit of smaller scene imaging and is subsequently corrected as discussed in Chapter 4.
Figure 2.5: A planar object approximately aligned to the Kinect’s image plane in depth and its raw depth value histogram as viewed with the Kinect. The Kinect assigns the object a narrow range of raw depth values between 570 and 630, indicating that depth values do not depend significantly upon pixel coordinates.
CHAPTER 3 KINECT CALIBRATION
Since the Kinect contains a color camera and a depth camera (Section 2.1), calibrating the Kinect requires estimating a camera matrix and a set of distortion coefficients for each camera. Additionally, as noted in Section 2.2, depth images captured using the OpenKinect driver have unitless pixel values related to depth through an unknown mapping. Thus, to accurately image scenes in three dimensions, two camera calibrations and one depth calibration from raw depth values to depth in meters are necessary.1 This chapter presents the development of a procedure to efficiently produce these calibrations.
3.1 Objectives
Calibration procedure development began with the three design goals below in mind:

1. For ease of use, the capture of calibration data should not require the user to perform many manual measurements, particularly for calibrating depth;
2. For extensibility and portability, the procedure should not require modification to the Kinect hardware; and
3. For validity, the procedure should reproduce existing camera and depth calibrations for the unmodified Kinect.

By satisfying these three objectives, the procedure will enable rapid calibration of a single Kinect. Furthermore, such a procedure's portability from one Kinect to another and ease of use would render calibration of multiple Kinects or analyzing modifications to a Kinect a tractable problem.

1 Although the OpenNI driver provides calibrated depth images, the Kinect analysis was started with the OpenKinect driver to examine depth calibration as explained in Section 2.3.1. Moreover, modifying the Kinect's optics as presented in Chapter 4 requires conducting a calibration with even the OpenNI and other calibrated drivers.
3.2 Procedure
The OpenCV library [7] contains built-in support for locating the interior corners of a chessboard calibration rig through its findChessboardCorners function (Fig. 3.1). OpenCV’s
calibrateCamera function can use image and object coordinates from multiple poses of the chessboard interior corners to estimate the camera intrinsic matrix, the lens distortion coefficients, and the interior corners' world coordinates. The estimation of world coordinates is effectively a measurement of interior corners' object depth in meters for each chessboard pose. By annotating the traditional chessboard rig with easily locatable features in depth, a relationship between raw depth values and object depths can be established during pose estimation. The set of pairings between raw depth values and object depths obtained in this manner is suitable for calibrating raw depth values to meters. An outline of the complete calibration procedure based upon this technique proceeds as follows:

1. Capture many paired (i.e., stereo) color and depth images of a chessboard with annotated depth features;
2. Establish correspondences between chessboard interior corners in paired color and depth images; and
3. Relate interior corner object depth from pose estimation and raw depth value to calibrate raw depth values.

This procedure satisfies the first two stated objectives by performing all measurements with computer vision techniques and by operating independently of the Kinect's exact configuration. The following subsections discuss this procedure's implementation details.
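A condensed sketch of the color-image half of this outline, using the OpenCV functions named above, is shown below. The file names and square size are hypothetical; the thesis's actual tooling is described in Appendix D.

```python
import cv2
import numpy as np

PATTERN = (10, 7)   # interior corners of an 11 x 8 chessboard
SQUARE = 0.023      # chessboard square edge length in meters (assumed)

# Object coordinates of the interior corners in the chessboard frame (z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for name in ["pose000.png", "pose001.png"]:          # hypothetical color images of the rig
    gray = cv2.cvtColor(cv2.imread(name), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# calibrateCamera estimates the intrinsic matrix, distortion coefficients, and per-pose
# extrinsics (rvecs/tvecs), from which each interior corner's object depth follows.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points,
                                                 gray.shape[::-1], None, None)
print(rms, K)
```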
Figure 3.1: An 11 × 8 chessboard calibration pattern. Since an m by n chessboard has (m − 1)(n − 1) interior corners, this chessboard has 70 interior corners and 4 exterior corners.
3.2.1 Calibration Target
One particularly simple yet effective depth annotation of a chessboard rig involves physically offsetting the chessboard from its background (Fig. 3.2). The offset creates sharp edges in depth along the perimeter of the chessboard mounting which remain fixed with respect to the chessboard's interior corners. With careful sizing of the chessboard mounting, the chessboard's exterior corners coincide with the sharp edges in depth. The coincidence eliminates the need to measure the positioning of the chessboard with respect to the mounting; only the dimension of a single chessboard square is needed. Thus, the chessboard's exterior corners are annotated in depth and have known coordinates in the chessboard's object frame. This calibration rig is termed a "depthboard" rig.
3.2.2 Target Recognition
Identifying the depthboard target in paired color and depth images requires locating the depthboard's interior and exterior corners in the color and depth images, respectively. Since OpenCV supports the chessboard calibration rig, the findChessboardCorners function is first used to extract the depthboard's interior corner image coordinates from the color image. Estimation of the color camera's intrinsic matrix and distortion coefficients and of world coordinates for each depthboard pose with respect to the color camera is then performed using the calibrateCamera function. (If camera calibration is not required, the solvePnP function, which performs pose estimation given the camera matrix and lens distortion coefficients, may be substituted.) This step constitutes the first phase of target recognition.
Figure 3.2: Color and depth images of a 24.5 cm × 18.4 cm depthboard rig with an 11 × 8 chessboard pattern. Depthboard exterior corners in the depth image correspond to exterior chessboard corners in the color image. The chessboard pattern is suspended from the background on a mount so that its outline is readily extractable from the depth image. This rig supports capture of up to 70 raw depth value and object depth pairs per image.
Recognizing the depthboard's exterior corners in the depth image requires greater effort, however, since its depth (or analogously, "gray level" in the depth image) is unknown. Nonetheless, the depthboard's similar positioning in both the color and depth images (Section 2.1.1) and its approximately quadrilateral shape enable reliable detection. First, the depthboard's position in the depth image is estimated as the average position of its interior corners previously identified in the color image. Position-based thresholding to preserve only the rectangular region centered on the depthboard's estimated depth image position partially eliminates the background. Second, the depthboard raw depth value is estimated by sampling the depth image at the depthboard's estimated position. Depth-based thresholding to preserve only pixels with similar depth values further eliminates the background immediately behind the depthboard (such as the mounting apparatus). These two thresholding operations select a "three-dimensional" volume in which to search for the depthboard's quadrilateral exterior edge. Finally, quadrilateral detection is accomplished using OpenCV's contour finding and approximation capabilities (the findContours and approxPolyDP functions, respectively). The quadrilateral's vertices estimate the depth image coordinates of the depthboard's exterior corners. Depth camera calibration and estimation of world coordinates for each depthboard pose with respect to the depth camera are then performed analogously as for the color camera. Since the depthboard has far fewer exterior corners than interior corners (4 compared to 70 for an 11 × 8 depthboard), pose estimation from the depth image is considered of lesser quality than that from the color image. (Moreover, the availability of only 4 points per target pose in depth images may result in poor depth camera calibration. To improve depth calibration robustness in later steps, it may be desirable to separately calibrate the depth camera using infrared images of the full chessboard and then use the solvePnP function for pose estimation in depth images.) This step concludes the second and final phase of target recognition.
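A rough Python rendering of this two-threshold, contour-based search is given below. It is a sketch rather than the thesis's implementation: the window size, depth tolerance, and polygon-approximation factor are assumed values, and it presumes the OpenCV 4 return convention for findContours.

```python
import numpy as np
import cv2

def find_depthboard_corners(depth, center_xy, half_win=100, depth_tol=30):
    """Estimate the depthboard's four exterior corners in a raw depth image.

    depth     : 16-bit raw depth image (NumPy array)
    center_xy : (x, y) board position estimated from the color-image interior corners
    half_win  : half-size of the positional search window in pixels (assumed)
    depth_tol : allowed raw-depth deviation from the sampled board value (assumed)"""
    cx, cy = int(center_xy[0]), int(center_xy[1])

    # Position-based threshold: keep only a window around the estimated position.
    window = np.zeros(depth.shape, np.uint8)
    window[max(cy - half_win, 0):cy + half_win,
           max(cx - half_win, 0):cx + half_win] = 255

    # Depth-based threshold: keep pixels near the raw value sampled at the center.
    r0 = int(depth[cy, cx])
    near = (np.abs(depth.astype(np.int32) - r0) < depth_tol).astype(np.uint8) * 255
    mask = cv2.bitwise_and(window, near)

    # Contour finding and quadrilateral approximation.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    board = max(contours, key=cv2.contourArea)
    quad = cv2.approxPolyDP(board, 0.02 * cv2.arcLength(board, True), True)
    return quad.reshape(-1, 2)   # ideally four (x, y) exterior-corner estimates
```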
3.2.3 Establishing Correspondences
For the purposes of identifying depth image pixels that correspond to the depthboard's interior corners, target recognition provides two key pieces of information for each depthboard pose: 1. The transformation of world coordinates with respect to the depth camera into depth image coordinates; and
2. The world coordinates with respect to the color camera of the depthboard's interior corners. The first item enables the identification of the depthboard's interior corners in the depth image while the second item constitutes a set of object depth measurements collected by the color camera. Although the depth image-based pose estimates provide object depth measurements, these measurements use 4 points per pose (the depthboard's exterior corners) for computations as previously noted. The color image-based pose estimates, however, use 70 points per pose (the depthboard's interior corners) for computations, which necessarily yields higher accuracy in pose reconstruction. Utilizing these higher quality measurements requires identifying depth image pixels that correspond to the depthboard's interior corners. Since the depthboard rig contains depth annotations for only its exterior corners, direct identification of its interior corners in the depth image is not possible. Nonetheless, because the depthboard's exterior and interior corners share the same coordinate system (Section 3.2.1) and the depthboard pose has been estimated based upon exterior corner position, it is possible to project the object coordinates of the depthboard's interior corners onto the depth image (Fig. 3.3). (This projection suffers from the inaccuracies in poorly constrained depth image pose estimates; the smoothly varying depth pixel values across the depthboard, however, mitigate the errors that result from these inaccuracies when identifying interior corners in the depth image.) OpenCV's projectPoints function provides exactly this functionality. Projecting the depthboard's interior corners onto the depth image identifies pixels in the depth image that correspond to the same corners in the color image. Since the color camera has measured the world coordinates of these corners, the object depths associated with each identified depth pixel are known. By pairing the pixel raw depth values at the corners with measured object depths, a relationship between raw depth values and object depth in meters can be established.
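In code, establishing the correspondences amounts to one cv2.projectPoints call followed by sampling the depth image, roughly as sketched below; the argument names and the nearest-pixel sampling are illustrative assumptions rather than the thesis's implementation.

```python
import numpy as np
import cv2

def collect_depth_pairs(depth, objp, rvec_d, tvec_d, K_depth, dist_depth, depths_color):
    """Pair raw depth values with color-camera object depths for one depthboard pose.

    objp           : (70, 3) interior-corner coordinates in the chessboard frame
    rvec_d, tvec_d : board pose with respect to the depth camera (from exterior corners)
    depths_color   : (70,) interior-corner depths in meters measured via the color camera"""
    img_pts, _ = cv2.projectPoints(objp.astype(np.float64), rvec_d, tvec_d,
                                   K_depth, dist_depth)
    img_pts = img_pts.reshape(-1, 2)

    pairs = []
    h, w = depth.shape
    for (x, y), d in zip(img_pts, depths_color):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < w and 0 <= yi < h:
            pairs.append((int(depth[yi, xi]), float(d)))   # (raw value, depth in meters)
    return pairs
```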
3.2.4 Curve Fitting
After measuring many object depths and collecting the associated raw depth values, the final part of calibration requires analyzing the relationship between raw depth value, object depth, and depth image coordinates. The characterization of the Kinect's original optics in Section 2.3.3 fortunately suggests that raw depth value is independent of image coordinates, and thus, only raw depth value and object depth require consideration.
Projection of the depthboard’s interior corners onto the depth image suffers from the inaccuracies in poorlyconstrained depth image pose estimates. The smoothly varying depth pixel values across the depthboard, however, mitigate the errors that result from these inaccuracies when identifying interior corners in the depth image.
24
Figure 3.3: An 11 × 8 chessboard's 70 interior corners (black) projected onto a depthboard depth image following pose estimation. The correspondence between raw depth values and object depths is established by pairing raw depth values at these image coordinates with depths in world coordinates from pose estimation. The depthboard edges appear blurred as a consequence of averaging several depth images to smooth the depthboard border in preparation for feature extraction.
This analysis is therefore accomplished by fitting a curve to paired raw depth values and object depths, which fills gaps in the captured calibration dataset, allows extrapolation beyond the range of measured depths, and enables a concise software representation of the calibration. Several individuals have developed depth calibrations,
d = f(r),    (3.1)
for the Kinect’s original optical system that transform raw depth values r into depths d in meters without dependence upon image coordinates. Nicholas Burrus has proposed the two-parameter “reciprocal” calibration function f B (r ) =
1 1m = a r + b −0.0030711016r + 3.3309495161
(3.2)
on his wiki [12], which is identical in form to Eq. 1.6. (This suggests that the Kinect directly reports a measure of pixel offset, consistent with [13] and Eq. 2.1.) A fit of this form is easily obtained through least-squares regression of 1/d = a r + b using the paired raw depth values and object depths. MATLAB's [20] polyfit function provides this functionality. Stéphane Magnenat has proposed the more complex "tangent" calibration function
f_M(r) = c tan(a r + b) = (0.1236 m) tan(r/2842.5 + 1.1863)    (3.3)
on the OpenKinect mailing list [21] that he suggests performs "slightly better" than Eq. 3.2. Determining a fit of this form requires non-linear optimization of an error metric rather than linear regression due to the tangent's presence. One suitable approach is optimizing c to yield a least-squares fit of arctan(d/c) = a r + b with minimum mean squared error. MATLAB's fminsearch and polyfit functions provide suitable non-linear optimization and least-squares regression functionality, respectively. Additionally, both proposed calibration functions increase asymptotically as r increases, suggesting that the Kinect's depth resolution decreases as object depth increases. Specifically, Burrus's and Magnenat's calibration functions exhibit asymptotes near r = 1084 and r = 1092, respectively, which limits the useful raw depth values to roughly r < 1080.
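The thesis performed these fits in MATLAB (polyfit and fminsearch); the sketch below shows an equivalent formulation with NumPy and SciPy. The reciprocal fit is a plain linear regression of 1/d, while the tangent fit searches over c with an inner linear fit, scoring each candidate c by mean squared error in the depth domain (one reasonable reading of the criterion above); the search bounds on c are assumed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_reciprocal(r, d):
    """Least-squares fit of 1/d = a*r + b, i.e. Burrus's form f(r) = 1 m / (a*r + b).
    r, d are NumPy arrays of raw depth values and measured depths in meters."""
    a, b = np.polyfit(r, 1.0 / d, 1)
    return a, b

def fit_tangent(r, d, c_bounds=(0.01, 1.0)):
    """Fit Magnenat's form d = c * tan(a*r + b): search over c, and for each
    candidate c obtain a, b from a linear fit of arctan(d/c) = a*r + b."""
    def fit_for_c(c):
        a, b = np.polyfit(r, np.arctan(d / c), 1)
        mse = np.mean((c * np.tan(a * r + b) - d) ** 2)   # score in the depth domain
        return mse, a, b

    res = minimize_scalar(lambda c: fit_for_c(c)[0], bounds=c_bounds, method="bounded")
    c = res.x
    _, a, b = fit_for_c(c)
    return a, b, c
```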
3.3 Outcomes
After performing a separate camera calibration to obtain accurate intrinsic parameters and distortion coefficients for the Kinect’s depth camera, 9369 pairings between raw depth values and object depth in meters were collected from 134 depthboard poses (Fig. 3.4). The following subsections present and discuss the results of applying the calibration procedure.
3.3.1 Results
Camera calibration estimated the color and depth cameras' intrinsic parameters given in Eq. A.1 and Eq. A.2, respectively. Burrus's reciprocal calibration form (Eq. 3.2) produced a best fit of
f_B2(r) = 1 m/(−0.0027681 r + 3.0163)    (3.4)
with residual mean µ = −0.22958 mm and standard deviation σ = 5.7170 mm (Fig. 3.5). Magnenat's tangent calibration form (Eq. 3.3) produced a best fit of
f_M2(r) = (0.15257 m) tan(r/2431.3 + 1.1200)    (3.5)
with residual mean µ = 0.065544 µm and standard deviation σ = 5.5458 mm (Fig. 3.5).
Figure 3.4: Plot of measured depth in meters versus raw depth value for 134 depthboard poses. Measured depths do not increase linearly with raw depth value, as expected given Burrus's and Magnenat's calibration functions (Eq. 3.2 and Eq. 3.3).
3.3.2 Discussion
This depth value calibration procedure enabled rapid collection of many corresponding raw depth values and object depths. With an appropriately sized chessboard, collection of several thousand data points is possible in fewer than 15 minutes under favorable lighting conditions. This procedure is consequently suitable for rapidly determining the depth characteristics of a collection of Kinects. Moreover, estimation of camera intrinsic parameters and distortion coefficients is integrated into the data collection process, further reducing the calibration time for an individual Kinect. In this particular case, however, a separate depth camera calibration using infrared images of the chessboard rig became necessary to achieve reliable results. Since this method relied upon chessboard identification, however, the ability to resolve the chessboard in the color image constrained the maximum object depth suitable for calibration. When viewed at larger distances, smaller depthboards like the one in Fig. 3.2 posed challenges first in chessboard recognition and second in accurately extracting exterior corner image coordinates. This limit is seen in Fig. 3.4 as roughly 1.6 m, which is far below the Kinect’s maximum depth range (Section 2.1.1). Use of larger depthboards is expected to
Figure 3.5: (top) Plot of residuals for the best fit based upon Burrus's reciprocal calibration form (Eq. 3.4). The residuals have mean µ = −0.22958 mm and standard deviation σ = 5.7170 mm. The range ±5 mm contains 85% of all residuals. (bottom) Plot of residuals for the best fit based upon Magnenat's tangent calibration form (Eq. 3.5). The residuals have mean µ = 0.065544 µm and standard deviation σ = 5.5458 mm. The range ±5 mm contains 85% of all residuals.
increase the maximum calibrated object depth beyond 2 m, with the ultimate limit imposed by the Kinect's maximum ranging depth and the calibration functions' previously identified asymptotes (Section 3.2.4). The two fits from Eq. 3.4 and Eq. 3.5 as well as Burrus's and Magnenat's calibration functions are plotted together in Fig. 3.6. The fitted reciprocal and tangent calibration functions for the Kinect nearly coincide with but differ from the other two known calibration functions (Eq. 3.2 and Eq. 3.3). Specifically, Burrus's and Magnenat's calibration functions appear to differ from the computed fits by constant offsets. (Strictly speaking, the calibration functional forms in Eq. 3.2 and Eq. 3.3 do not admit constant offsets. Nonetheless, the OpenKinect site proposes using a −37 mm offset to "center" Magnenat's calibration with Burrus's calibration at http://openkinect.org/wiki/Imaging_Information#Depth_Camera.) These constant offsets are expected to arise from variations in the choice of origin. The calibration procedure from Section 3.2 uses the focal point of the Kinect's optical system as the origin since that origin is implicitly defined during object pose estimation. Although this origin is not easily referenced to an external marker on the Kinect, this choice of origin will produce correct projection of image coordinates into world coordinates without the need for a depth offset. (Other reasonable choices of origin include the front of the Kinect's optical housing or its base, but these are subject to changes in the Kinect's shape, such as those induced by mounting the additional optics in Chapter 4.) Thus, this procedure satisfied its third design objective in Section 3.1 by reproducing the character of previously proposed calibration functions. The tangent calibration functional form (Eq. 3.5) produced a slightly better fit than the reciprocal calibration functional form (Eq. 3.4) as determined using residual mean and standard deviation. Since both functions produced statistically and visually similar residuals (Fig. 3.5) and 85% of residuals differed from zero by less than 5 mm, these calibration functions are considered roughly equivalent. The tangent calibration function's slightly better performance is attributed to its inclusion of a third parameter beyond the two of the reciprocal calibration function. From a computational complexity standpoint, however, the reciprocal calibration function may yield a better execution time than the tangent calibration function.
Figure 3.6: Plot of Burrus’s (Eq. 3.2) and Magnenat’s (Eq. 3.3) depth calibrations and fitted depth calibrations using the original Kinect optics (Eq. 3.4 and Eq. 3.5). Both the reciprocal and tangent calibration functional forms yield very similar fits as expected given the highly constrained curve in Fig. 3.4. Qualitatively, both fitted functions appear to differ from Burrus’s and Magnenat’s functions by a constant offset.
CHAPTER 4 MODIFYING THE KINECT OPTICS
Out of the box, the Kinect provides depth imaging appropriate for room-sized scenes as is consistent with its intended use. Other applications, however, may require depth imaging of smaller scenes. Potential medical applications, in particular, would require the capability to image small tissues or the ability to image through a laparoscope (Section 1.3.3). Given its comparatively low cost and extensibility (Section 2.3), the Kinect presents itself as a promising platform with which to investigate low-cost small-scene depth imaging. Consequently, Kinect modifications to improve suitability for small-scene imaging must remain low in cost and flexible in nature; otherwise, such modifications would counteract the Kinect's attractiveness as a depth imaging platform. This chapter discusses efforts to adapt the Kinect for depth imaging of smaller scenes while retaining its cost and flexibility advantages.
4.1 Objectives
Depth imaging of smaller scenes with appreciable quality using the Kinect requires reducing the Kinect’s range without sacrificing depth resolution. The Kinect’s large operable and minimum ranges (Section 2.3.2) and low cost led to the formulation of these objectives for modifying the Kinect: 1. Reduce the Kinect’s minimum range to enable imaging of nearer, and hence smaller, objects;
2. Compress the Kinect’s operable range to improve resolution over that range; 3. Achieve these effects with little or no modification to the Kinect’s out-of-the-box hardware; and 4. Achieve these effects at a cost less than that of the Kinect. Satisfying these objectives would improve the Kinect’s suitability for small-scene depth imaging without great expense or loss of flexibility. Changing the Kinect’s lenses is perhaps the simplest method of achieving these objectives. Mounting lenses to the Kinect’s housing requires neither internal hardware changes nor great expense yet can dramatically affect the Kinect’s imaging capabilities. This idea is not new, however, as Nyko, a manufacturer of console game accessories, has already commercially marketed a simple and inexpensive lens adapter for the Kinect.
4.2 Nyko's Lens Adapter
The Nyko Zoom [22] is one commercially-available modified optical system for the Kinect. After snapping into place, the adapter fits snugly around the Kinect’s external housing and provides a lens for each of the Kinect’s three apertures (Fig. 4.1). Although intended for multiplayer gaming, the Zoom’s wide-angle lenses have the potential to enable higher-resolution depth imaging of small objects that would otherwise require entering the minimum range of an unmodified Kinect to adequately capture. Moreover, the adapter’s low cost and apparent ease of modification render it an attractive option for mounting modified optics onto the Kinect. The Zoom adapter thus satisfies objectives three and four for modifying the Kinect. Depth calibration of the Kinect as modified with the Zoom adapter was investigated due to its promising characteristics and since calibration techniques for the Zoom optics will apply to other lens systems as well. For the remainder of this chapter, the Kinect with the Zoom adapter is referred to as the “modified Kinect”.
4.2.1 Suitability
A cursory characterization of the modified Kinect indicates that it has a raw depth value range of approximately 400 for near objects to 1100 for far objects (Fig. 4.2). (For the purposes of characterizing the modified Kinect optics, work continued using the OpenKinect driver that was used in Chapter 3.)
Figure 4.1: Picture of Nyko’s Zoom adapter. The adapter snaps onto the Kinect’s plastic housing and covers each aperture with a lens. The outer two lenses have a yellow tint when viewed from the front whereas the middle lens is clear.
Although similar to the original Kinect, the modified Kinect's raw depth value range corresponds roughly to 0.3 m through 4 m, which is a significant reduction in minimum and maximum range (Section 2.3.2). This suggests that the modified Kinect can image nearer objects with resolution comparable to the original Kinect, satisfying the first two objectives for modifying the Kinect. Additionally, the modified Kinect exhibits a larger field of view than the original Kinect. The modified optics produce two unusual effects, however, for which compensation is required. First, the lens causes loss of depth information along the edges of the depth camera's field of view, especially in the corners. This loss is most likely a direct result of lens distortion, which increases with radius from the camera's optical axis, since distortion interferes with the PrimeSense image processor's ability to resolve the structured light pattern (Fig. 4.4). Although recovering the lost depth information is not possible without internally modifying the Kinect, rectification of the depth image will mitigate the effect by "trimming" the image's edges. (Such depth image rectification is demonstrated in Section 4.4.1 and in Fig. 4.8.) Second, the lenses induce a dependency on (x_i, y_i) image coordinates in the mapping between raw depth values and object depth as demonstrated by "barrel distortion" in depth when imaging planar objects. Specifically, planar objects presented parallel to the Kinect's image plane appear as convex surfaces when imaged by the depth camera (Fig. 4.3). These convex surfaces appear to "billow" in depth such that raw depth value increases with pixel distance from the image center rather than remaining uniform across the object. This distortion is least pronounced near the image center, as is expected since a lens's distortion is
minimized along its optical axis, and decreases with object depth. Whereas existing computer vision software can correct such distortion in (x_i, y_i) image coordinates, the modified lens has distorted the pixel values themselves, which requires additional compensation techniques. Neither a depth distortion compensation technique nor a depth calibration function exists for the Kinect modified by the Zoom lens, and development of a calibration method is therefore necessary to use these optics.
Figure 4.2: A sample scene similar to Fig. 2.4 in depth and its raw depth value histogram as viewed by the Kinect using the Nyko Zoom optics. The foreground object extends within the modified Kinect’s minimum depth range and the background exceeds the modified Kinect’s maximum depth range. With the modified optics, the Kinect’s depth range has an approximate lower limit of 430 and an approximate upper limit of 1040.
Figure 4.3: A planar object approximately aligned to the Kinect’s image plane in depth and its raw depth value histogram as viewed by the Kinect using the Nyko Zoom optics. The planar object exhibits significant “barrel distortion” in depth, which is more pronounced horizontally than vertically. Consistent with the distortion, the Kinect assigns the object a broad range of raw depth values from 640 to 870. This suggests that assigned raw depth values depend upon image coordinates.
Figure 4.4: Structured light pattern emitted and captured by the Kinect when using modified optics to image the scene in Fig. 4.3. Comparison with Fig. 2.3 indicates pronounced radial distortion of the pattern, which is likely the source of depth distortion.
4.3 Empirical Depth Undistortion
Correction for depth distortion is ideally implemented by recalibrating the PrimeSense image processor within the Kinect to view Fig. 4.4 rather than Fig. 2.3 as planar. (This approach is discussed without conclusion at http://groups.google.com/group/openni-dev/browse_thread/thread/1a95a18a3a90292 yet appears viable if direct access is gained to the PrimeSense image processor within the Kinect.) The required internal modifications to the Kinect, however, violate the third objective in Section 4.1. Alternatively, detailed reverse-engineering of the depth camera image processing algorithm and characterization of the added lenses could produce a viable correction, yet the low cost and associated low quality of the Zoom lenses do not warrant such a detailed analysis. The most straightforward option is simply to image many scenes of planar objects in depth to develop an empirical model for the distortion process. This approach was pursued based upon its simplicity and applicability to other lens systems.
4.3.1 Theory
Depth distortion is corrected by developing a function
r' = g(r, x_i', y_i'),    (4.1)
where r is the reported raw depth value, r' is the corrected raw depth value, and (x_i', y_i') are the undistorted image coordinates as determined from camera calibration. (Since Eq. 4.1 relies upon undistorted image coordinates, camera calibration is a prerequisite for empirical depth undistortion.) Rather than using a distortion model to analytically derive g(r, x_i', y_i'), it is approximated empirically by capturing many depth images of large planar objects, such as a wall, presented parallel to the depth camera's image plane at many different depths. Each image's pixels form a set of points (r, x_i', y_i') that satisfy Eq. 4.1 for some undistorted raw depth value r', since planar objects imaged in this manner ideally have uniform raw depth values. These sets of points and their corresponding r' taken across all images are samples of g(r, x_i', y_i'), from which an estimate ĝ(r, x_i', y_i') is constructed through interpolation. Raw depth value undistortion is then accomplished by transforming raw depth image values using ĝ(r, x_i', y_i'). The following subsections describe the process of sampling g(r, x_i', y_i').
4.3.2 Capturing Snapshots
Properly sampling the distortion process requires capturing many depth images of planar objects positioned throughout the modified Kinect's operable range. It was decided to capture these images of a corridor wall and to adjust object depth by moving the Kinect relative to the fixed wall (Fig. 4.5). This choice allows imaging of large planar surfaces in depth so as to satisfy Eq. 4.1 but presents a small difficulty: ensuring the planar object aligns with the Kinect's depth camera image plane. The simplest and most easily implemented alignment method relies upon the user's visual estimation of alignment. Before the capture of each depth image, the user either sights the Kinect housing to ensure it squares with the object or orients the Kinect on the basis of some feature viewed through its color camera. Extending this method with a laser sight offers improvement as long as the laser's beam is properly oriented to fully constrain the Kinect's pose with respect to the object.
Figure 4.5: Adaptation of a swivel chair as a mount for the Kinect with modified optics. The bubble level is used to ensure the Kinect remains properly upright since the chair is sufficiently adjustable to allow tilting. If attached, a laser square projects a reference line on the floor that is aligned with the Kinect’s optical axis by visual estimation. Before each image capture, the reference line is aligned to a floor tile boundary, which is approximately perpendicular to the wall, to ensure that the Kinect is properly squared.
4.3.3 Estimating True Depth
Following capture, the correct raw depth value r' is computed for each depth image, which completes the solution to Eq. 4.1. Three different techniques for determining r' were explored as discussed below. Fig. 4.6 compares the performance of these three methods for one particular dataset.
Minimum of Depth Image  This method computes the corrected raw depth value as the image's minimum raw depth value. Since depth distortion produces incorrectly large raw depth values, the minimum raw depth value seen throughout the entire image reasonably estimates the image's corrected raw depth value. These estimates become increasingly inaccurate as r' increases, however, since the image minimum is increasingly influenced by noise at the image edges as convex distortion decreases for larger object depths.
Fitting Hyperboloids  Alternatively, the corrected raw depth value is estimated by fitting an elliptic hyperboloid of two sheets to the planar object's raw depth values since the convex depth image resembles a hyperboloid. The elliptic hyperboloid surface is defined by
r^2 = (r')^2 + b_x (x_i' − x_ic')^2 + b_y (y_i' − y_ic')^2,    (4.2)
where (x_ic', y_ic') is the hyperboloid's center in undistorted image coordinates and b_x and b_y define the hyperboloid's curvature in x_i' and y_i', respectively. Fitting Eq. 4.2 to the depth image's convex surface is readily accomplished through linear regression and enables solving for the corrected raw depth value r'. Fits to less convex images at large object depths, however, become prone to error as the hyperboloid's curvature is poorly constrained for flat images, which can produce very inaccurate estimates of r'.
Minimum of Depth Image Center  An intermediate approach applies the image minimum method to only the center neighborhood of the depth image. This restriction eliminates the effects of noisy image edges, particularly for large object depths. Moreover, the minimum technique can accommodate low-curvature images that otherwise produce poorly constrained fits. This method was ultimately used based upon its low susceptibility to noise.
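The two estimators most relevant here, the image-center minimum and the hyperboloid fit of Eq. 4.2, can both be written in a few lines of NumPy. The sketch below uses assumed window sizes and, for brevity, fits the hyperboloid around an assumed center in raw pixel coordinates rather than the undistorted coordinates called for above.

```python
import numpy as np

def center_minimum(depth, half_win=40):
    """Corrected raw depth estimate: minimum over a central window (size assumed)."""
    h, w = depth.shape
    center = depth[h // 2 - half_win:h // 2 + half_win,
                   w // 2 - half_win:w // 2 + half_win]
    valid = center[center > 0]               # ignore out-of-range pixels
    return int(valid.min())

def hyperboloid_fit(depth, xc, yc):
    """Least-squares fit of Eq. 4.2, r^2 = r'^2 + b_x (x - xc)^2 + b_y (y - yc)^2,
    for a fixed center (xc, yc); returns the corrected raw depth value r'."""
    ys, xs = np.nonzero(depth > 0)
    r = depth[ys, xs].astype(np.float64)
    A = np.column_stack([np.ones_like(r), (xs - xc) ** 2, (ys - yc) ** 2])
    coeffs, *_ = np.linalg.lstsq(A, r ** 2, rcond=None)
    r0_sq, bx, by = coeffs
    return np.sqrt(max(r0_sq, 0.0)), bx, by
```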
Figure 4.6: Comparison of three methods for estimating the true raw depth value in distorted depth images of planar objects for one particular dataset. Since snapshots were taken at progressively larger object depths in this dataset, estimated true raw depth value should increase with snapshot number. The hyperboloid fit and image minimum methods produce noisy estimates, particularly for larger estimated true raw depth values. The image center minimum technique produces the most stable results.
4.3.4 Snapshot Processing
Each depth image is then processed to improve its suitability for use in interpolation. First, each image is cropped to include only the planar object and edited to remove artifacts, such as aberrant clusters of pixels corresponding to objects within the Kinect's minimum range. These operations ensure that the depth images contain only the planar object's surface and are performed manually. Since the undistortion image sets used a corridor wall as the planar object, visibility of the floor and ceiling necessitated cropping of images taken at larger object depths. Second, each image is downsampled on a rectangular grid to reduce the interpolator's memory use and input into an instance of MATLAB's TriScatteredInterp scattered linear interpolator using undistorted image coordinates for each pixel. The use of undistorted image coordinates corrects for lens distortion as characterized during camera calibration and requires scattered interpolation rather than gridded interpolation. To improve performance, the scattered interpolator is evaluated on a rectangular lattice to produce a faster gridded interpolator. (Section D.3.1 discusses high-fidelity image undistortion as implemented in the interpolator, and Section D.3.2 comments on interpolator performance considerations.) This gridded interpolator contains a description of the depth distortion process for many object depths and thus empirically implements Eq. 4.1. This interpolator is the final product of developing an empirical depth undistortion calibration.
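A NumPy/SciPy analogue of this MATLAB workflow is sketched below: LinearNDInterpolator stands in for TriScatteredInterp, and the scattered interpolant is then sampled on a regular lattice to build a faster gridded look-up. The grid densities, image size, and raw-depth range are assumed values, and the function name is a hypothetical helper, not the thesis's code.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator, RegularGridInterpolator

def build_undistortion_interpolator(samples, image_size=(640, 480), r_range=(300, 1100)):
    """Build a gridded approximation of g(r, x', y') from scattered samples.

    samples : (N, 4) array of rows (x_u, y_u, r_distorted, r_true) taken from all
              processed snapshots, in undistorted image coordinates."""
    scattered = LinearNDInterpolator(samples[:, :3], samples[:, 3])

    # Evaluate on a regular lattice to obtain a much faster gridded interpolator
    # (grid densities are assumed choices).
    xs = np.linspace(0, image_size[0] - 1, 65)
    ys = np.linspace(0, image_size[1] - 1, 49)
    rs = np.linspace(r_range[0], r_range[1], 81)
    X, Y, R = np.meshgrid(xs, ys, rs, indexing="ij")
    table = scattered(np.column_stack([X.ravel(), Y.ravel(), R.ravel()])).reshape(X.shape)
    return RegularGridInterpolator((xs, ys, rs), table, bounds_error=False)
```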
4.4 Outcomes
After performing camera calibrations for the modified Kinect's color and depth cameras, 70 depth images of a corridor wall presented parallel to the Kinect's image plane were captured using the modified optics and the visual estimation alignment method. Then, 3079 pairings between raw depth values and object depth in meters were collected from 46 depthboard poses using the modified optics (Fig. 4.9). The following subsections present and discuss the results of approximating the depth undistortion function and applying it in the calibration procedure from Section 3.2.
4.4.1 Results
Camera calibration estimated the color and depth cameras' intrinsic parameters given in Eq. A.3 and Eq. A.4, respectively. Processing the depth images of planar objects yielded the undistortion sample points in Fig. 4.7, from which the undistortion interpolator was constructed to approximate Eq. 4.1. As a simple test, application of the undistortion interpolator to Fig. 4.3's depth image of a planar object yielded the corrected depth image in Fig. 4.8. The original image had standard deviation σ = 43.669 whereas the corrected image had standard deviation σ' = 13.051. Undistorting the paired raw depth values and object depths yielded the corrected depth calibration sample points in Fig. 4.9. Burrus's reciprocal calibration form (Eq. 3.2) produced a best fit of
f_B3(r) = 1 m/(−0.0041952 r + 4.5380)    (4.3)
with residual mean µ = 0.19288 mm and standard deviation σ = 11.845 mm (Fig. 4.10). This fit was obtained through linear regression of the form 1/d = a r + b. Magnenat's tangent calibration form (Eq. 3.3) produced a best fit not significantly different from Eq. 4.3. (Non-linear optimization of Magnenat's calibration form yielded f_M3(r) = (0.19313 mm) tan(r/1234300 + 1.5699), which is very sensitive to numerical precision as the tangent is evaluated near its π/2 asymptote; in fact, conveying the correctly behaving function requires more than 5 digits of precision. The first-order Taylor series expansion of f_M3 is effectively Eq. 4.3, as is clearly seen after converting the tangent into a cotangent. This equivalence renders Burrus's form the preferred calibration due to its lesser computational complexity.)
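For reference, applying such a gridded interpolator to an entire raw depth image might look like the sketch below (continuing the hypothetical build_undistortion_interpolator helper from Section 4.3.4); pixels falling outside the sampled region of support come back as NaN, which matches the trimming visible in Fig. 4.8.

```python
import numpy as np

def undistort_depth_image(depth, interp, xu, yu):
    """Apply a gridded undistortion interpolator to every pixel of a raw depth image.

    depth  : raw depth image
    interp : gridded interpolator over (x_u, y_u, r), e.g. from the hypothetical
             build_undistortion_interpolator helper above
    xu, yu : per-pixel undistorted image coordinates, same shape as depth
             (precomputed once from the depth camera calibration)."""
    queries = np.column_stack([xu.ravel(), yu.ravel(),
                               depth.ravel().astype(np.float64)])
    corrected = interp(queries).reshape(depth.shape)
    return corrected  # NaN where a pixel exceeds the region of support
```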
4.4.2 Discussion
Application of the estimated depth undistortion function from Fig. 4.7 to raw depth values captured using the modified optics yielded significant correction. When applied to the depth image of a planar object from Fig. 4.3, the correction function produced a "flattened" image (Fig. 4.8) with a reduction in standard deviation of 70%. This flattening was also manifested as the compression of the histogram from Fig. 4.3 to Fig. 4.8. The undistortion function did, however, appear to underestimate corrected depth values, particularly in the lower left quadrant of Fig. 4.8, which appears as the histogram's left edge. It is anticipated that this results from slight misalignments of the depth camera image plane and the planar objects during image capture for the undistortion function. Applying the depth undistortion function to the depth calibration correspondences in Fig. 4.9 significantly reduced spread along the curve while discarding no input points as outside its region of support.
Figure 4.7: Surface plot of cropped and downsampled depth images of large planar objects viewed with the modified Kinect. The vertical axis corresponds to raw depth value, and color indicates corrected raw depth value. Each convex surface corresponds to an imaged planar object and has a single color associated with its corrected raw depth value. As expected, the pronounced curvature in both x_i' and y_i' decreases as object depth increases. This dataset is used to build the scattered interpolator, which is then sampled on a rectangular lattice to build a faster gridded interpolator.
42
1200 1100
100
1000
150
900
200
800
250
700 600
300
500
350
400
400
300
450
200
50 100 150 200 250 300 350 400 450 500 550 600
relative frequency
raw depth value
50
0.4 0.3 0.2 0.1 0
0
100
200
300
400
500 600 raw depth value
700
800
900
1000
1100
Figure 4.8: Undistorted depth image from applying Eq. 4.1 as estimated in Fig. 4.7 to the depth image in Fig. 4.3. The undistorted image exhibits a much narrower histogram with raw depth values ranging from 600 to 670 since the planar object’s previously convex shape is correctly flattened. Raw depth values near the original depth image’s borders are lost since they exceed the region of support in the undistortion function’s rectified coordinate system.
Figure 4.9: (top) Plot of measured depth in meters versus raw depth value for 46 depthboard poses. This plot exhibits the same general curvature as seen in Fig. 3.4 but with exaggerated raw depth values as expected from Fig. 4.3. (bottom) Plot of measured depth in meters versus corrected raw depth value using the undistortion function from Fig. 4.7. Application of the undistortion function reduced the spreading of raw depth values along the expected curve but may have underestimated some correct depth values. Undistortion retained all original calibration points since none of the input points exceed the undistortion function’s region of support.
44
6
·10−2
f B3 residual (m)
4 2 0 −2 −4 −6 550
600
650
700 750 800 raw depth value
850
900
Figure 4.10: Plot of residuals for the best fit based upon Burrus's reciprocal calibration form (Eq. 4.3). The residuals have mean µ = 0.19288 mm and standard deviation σ = 11.845 mm. The range ±1 cm contains 83% of all residuals. The greater positive spread of residuals supports the hypothesized underestimation of correct raw depth values in Fig. 4.9.
Qualitatively, depth undistortion decreased the range of raw depth values observed for a given measured depth and therefore made the data in Fig. 4.9 more amenable to curve fitting. Additionally, the depth undistortion operation significantly reduced the region of support in Fig. 4.8. This results from the depth undistortion interpolator's use of undistorted image coordinates (x_i', y_i'), which effectively implements pincushioning to correct the wide-angle lens's barrel distortion. The pincushioning operation necessarily discards the image's highly distorted periphery since the undistorted image coordinate system "moves" pixels outwards (by changing their image coordinates) from the image center to compensate for barreling. This produces the cropping of image edges as seen in Fig. 4.8. Section D.3.1 provides a more detailed discussion of this process. Fitted depth calibrations for the Kinect's original optics (Eq. 3.4 and Eq. 3.5) and modified optics (Eq. 4.3) are plotted together in Fig. 4.11. Use of the modified optics achieved a compression of the Kinect's object depth range of roughly 30% as compared with the original optics. The raw depth values, however, do not exhibit such compression with modified optics since the estimated raw depth value ranges from Fig. 2.4 and Fig. 4.2 are comparable. Thus,
the modified optics improve the Kinect’s depth resolution for close object depths by roughly 30%, which increases suitability for imaging smaller objects rather than rooms. Interpreted alternatively, the modified optics depth calibration (Eq. 4.3) has a smaller slope than either of the original optics depth calibrations (Eq. 3.4 and Eq. 3.5), indicating a smaller change in object depth for a given change in raw depth value and therefore higher depth resolution. This result confirms the satisfaction of Section 4.1’s first and second objectives for modifying the Kinect as predicted in Section 4.2.1.
Figure 4.11: Plot of fitted depth calibrations using the original Kinect optics (Eq. 3.4 and Eq. 3.5) and the modified Kinect optics (Eq. 4.3). The modified Kinect's calibration function demonstrates a reduction of roughly 30% in measured depth for all raw depth values. Furthermore, Eq. 4.3 has a smaller slope than either Eq. 3.4 or Eq. 3.5, which indicates an improvement in resolution using the modified optics.
4.4.3 Improvements
Although the Zoom adapter fits snugly on the Kinect's external housing and mates with the Kinect's recessed apertures, close investigation reveals that the adapter can shift slightly (by less than approximately 1 mm) along the Kinect's front face. This capability likely resulted in
variations in lens positioning when attaching the adapter, which rendered reuse of camera calibrations between lens changes somewhat inaccurate. Future applications of this technique should consider reliability in lens positioning and the ability to assess or compensate for possible lens misalignment. Also, manual cleaning and cropping of depth images as required in Section 4.3.4 is very time-consuming for the user, particularly in the case of large datasets. Annotation of planar calibration objects with markers in depth may possibly enable automatic detection of the surface and relieve the user’s burden. Unfortunately, automatic detection cannot rely upon annotations of the corresponding color image, since the depth image is distorted in a way the color image is not, or detection of planes in depth, since depth distortion prevents proper imaging of planes. Annotating the color image, however, may enable automatic estimation of object depth, which can bypass the techniques in Section 4.3.3 and integrate depth calibration with depth undistortion.
CHAPTER 5 NAVIGATION EXPERIMENTS
Chapter 3 investigated the Kinect’s ability to capture a three-dimensional representation of a room-sized scene, and Chapter 4 demonstrated the use of lenses to improve closerange imaging capabilities. This chapter now considers Kinect-based robot navigation and application of lenses to improve short-range obstacle avoidance, especially avoidance of those obstacles discovered while executing a path plan.
5.1 Objectives
The Kinect’s large maximum range of roughly 10 m allows it to detect obstacles at great range. When a robot navigating using a Kinect approaches an obstacle, occupancy grid mapping methods enable persistence of that obstacle, even if the robot approaches it within the Kinect’s minimum range. When the robot turns a blind corner or operates within a highly dynamic environment, however, it cannot rely upon the Kinect to detect “discovered” or dynamic obstacles within approximately 50 cm of the sensor (Section 2.3.2). Such obstacles appear as segments of out-of-range pixels (or “gaps”) in the corresponding depth image, which provides the robot with no basis upon which to perform occupancy estimation.1 Thus, 1
the capability for close-range depth imaging is desirable. The modified optics examined in Chapter 4 promise to cheaply enable imaging of these immediate obstacles, and it was consequently decided to consider applications of these optics for improving avoidance of discovered obstacles during robot navigation. To explore such applications, a set of experiments was devised according to the following objectives. Experiments should: 1. Represent plausible navigation scenarios (with regards to environment setup and navigation parameters) that a small robot might encounter; 2. Depend upon obstacles’ static placement rather than dynamic introduction to improve repeatability; and 3. Demonstrate the utility of the modified optics in detecting obstacles along trajectories where the unmodified Kinect reports no obstruction. Since these experiments specifically consider discovered obstacles, the environment map was assumed available for path planning and localization purposes.
5.2 Experimental Platform
To compare performance of the Kinect with and without modified optics, a small robotic experimental platform called the “CRIMbot” was developed. The following subsections describe the platform’s hardware and software components.
5.2.1 The CRIMbot
The experimental platform’s hardware component, the CRIMbot, consists of an Asus Eee PC netbook and Kinect mounted on an iRobot Create base (http://www.irobot.com/create/) as shown in Fig. 5.1. This configuration is closely modeled after Willow Garage’s Turtlebot robot (http://www.willowgarage.com/turtlebot) which includes these same major pieces. The CRIMbot’s utilized sensor complement includes a Kinect to simulate a fixed laser scanner, a single axis gyroscope and wheel encoders to constitute a planar inertial measurement unit (IMU), and cliff and bump sensors for robot safety.2 The full system is approximately 2
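As a sketch of that simulation step (the frame convention and function name below are assumptions, not the crimbot stack's actual code), the conversion from the point cloud's middle row to a planar scan is just a Cartesian-to-polar transform:

```python
import numpy as np

def simulate_laser_scan(points_row):
    """Convert the middle row of a Cartesian point cloud into a planar laser scan.

    points_row : (N, 3) array of (x, y, z) camera-frame points, with z forward and
                 x to the right (an assumed convention).
    Returns (angles, ranges) in radians and meters, sorted by bearing."""
    x, z = points_row[:, 0], points_row[:, 2]
    ranges = np.hypot(x, z)
    angles = np.arctan2(x, z)        # bearing relative to the optical axis
    order = np.argsort(angles)
    return angles[order], ranges[order]
```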
Figure 5.1: Picture of the CRIMbot with the modified optics attached. The Kinect is mounted 19 cm off the ground (as measured to its apertures’ midpoints) at the Create’s front, which differs from the Turtlebot’s rearward and more elevated mounting.
The full system is approximately cylindrical with a diameter of 40 cm and a height of 25 cm.
5.2.2 Robot Operating System (ROS)
Willow Garage’s Robot Operating System (http://www.ros.org/) was selected as the software package to control the CRIMbot. ROS is a collection of open-source software modules (or “stacks”) for distributed robotics based upon a networkable message passing architecture and includes support for Willow Garage’s Turtlebot platform [23]. Given the hardware similarity of the Turtlebot and CRIMbot, the ROS interface stack for the Turtlebot works equally well with the CRIMbot after minor configuration modifications as discussed in Section 5.2.3. Moreover, ROS’s robot-independent design allows its navigation stack to operate with the CRIMbot. ROS provides the following modules of specific interest for these experiments: turtlebot Provides a software interface for communicating with the Create base. The stack submodules execute high-level motor commands and maintain the robot’s IMU status. openni_kinect Implements calibrated image capture and point cloud generation for the Kinect using the OpenNI drivers. slam_gmapping Uses the open-source GMapping simultaneous localization and mapping (SLAM) algorithm to produce an occupancy-grid map from laser scan data and robot pose estimates.
navigation: Implements the adaptive Monte-Carlo localization (AMCL) algorithm to localize a robot in a known map using laser scan data and provides global and local costmap-based path planning capabilities.
visualization: Provides three-dimensional visualization of ROS messages, including point clouds, camera frames, occupancy-grid maps, robot poses, and navigation trajectories.
This set of stacks provides a complete mapping and navigation system compatible with the CRIMbot. (A full list of ROS stacks and documentation for those discussed are available at http://www.ros.org/browse/stack_list.php.) Of those listed above, only the turtlebot stack required modification to fully support the CRIMbot due to ROS's modular design.
5.2.3 Software Modifications
Adapting the turtlebot stack to accommodate the CRIMbot hardware resulted in the creation of the crimbot stack. These adaptations included minor alterations to the Turtlebot's robot model to reflect the CRIMbot's positioning of the Kinect and gyroscope as well as more involved changes to support ROS's OpenNI Kinect interface and improve the CRIMbot's performance.
Supporting OpenNI Drivers  Prior work with the Kinect described in Chapter 3 and Chapter 4 had relied upon the OpenKinect driver for the reasons explained in Section 2.3.1. Willow Garage, however, has deprecated use of libfreenect in favor of the OpenNI driver. Thus, integrating Chapter 4's depth undistortion techniques with the CRIMbot's ROS framework involved writing an OpenNI-based Kinect adapter. (Section D.2 presents the implementation details of the Kinect adapter and explains the abstraction layer developed for reusing libfreenect-based software in ROS with little modification.) Furthermore, as noted in Section 2.2.2, the OpenNI driver reports object depth rather than raw depth values, which prevented direct reuse of the depth undistortion interpolator in Fig. 4.7 when using modified optics. This necessitated repeating the procedure from Section 4.3 to develop a depth undistortion interpolator that operated upon object depth as reported by the OpenNI driver. Appendix C presents the details of this calibration.
Performance Issues  Preliminary testing with ROS running on board the CRIMbot indicated that Kinect point cloud capture, laser scan matching, navigation, and uploading visualization data to the supervising workstation computationally overloaded the netbook. To reduce the netbook's workload, all tasks other than controlling the Create base and Kinect image processing were offloaded to the workstation. This offloading more evenly distributed the workload across the available computational power but suffered from latency, especially when the netbook and workstation communicated wirelessly, which subsequently impaired ROS's ability to transform sensor data into the appropriate reference frame. Replacing the wireless link with a point-to-point wired connection between the workstation and the netbook eliminated latency provided that Kinect point cloud processing occurred on board the CRIMbot. Additionally, when using the modified Kinect optics, cropping raw depth images to their central 400 × 100 pixels before depth undistortion greatly increased
the simulated laser scan frame rate and reduced the influence of poorly undistorted image edges.
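The cropping itself is a one-liner on the raw depth array; a hedged sketch (the function name is illustrative) follows.

```python
def central_crop(depth, width=400, height=100):
    """Crop a raw depth image (2-D NumPy array) to its central width x height region."""
    h, w = depth.shape
    y0, x0 = (h - height) // 2, (w - width) // 2
    return depth[y0:y0 + height, x0:x0 + width]
```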
5.3 Test Environment
A self-contained test environment was built to provide a well-controlled area in which to observe the CRIMbot’s navigation behaviors. Design of the environment focused on the three objectives set forth in Section 5.1 and yielded an environment referred to as the “maze”.
5.3.1 Maze Design
As depicted by its design plan in Fig. 5.2, the maze has three main features: a single narrow corridor, a blind turn, and an antechamber. When navigating into and out of the maze (from the antechamber to the corridor's terminus and back), the narrow corridor serves to constrain the CRIMbot's planned path to a single trajectory as well as test its ability to turn around in physically limited situations. This enables performing repeated obstacle discovery tests while keeping the navigation trajectory roughly constant, which is a requirement of Section 5.1's third objective. The blind turn facilitates creation of discoverable obstacles as discussed further in Section 5.3.2 and simulates turning through an open doorway, which is a practical navigation scenario and thus satisfies the first experimental objective. The antechamber provides a location within the controlled area for the CRIMbot to localize itself before commencing path planning and navigation. Once built (Fig. 5.3), the maze environment was mapped with the unmodified Kinect
using ROS’s GMapping SLAM implementation for subsequent use in navigation.5 Fig. 5.4 shows the map as captured at 1 cm/pixel resolution; however, a cleaned and 4 cm/pixel resolution map was derived from it for use in navigation to reduce the computational overhead of AMCL. Trajectories with preset originating positions and navigation goals into and out of the maze were then designed (Fig. 5.5). The trajectory into the maze started in the antechamber and ended at the corridors terminus whereas the trajectory out of the maze did the reverse.
Figure 5.2: Design plan for the maze test environment. Each square drawn within the maze represents a 1 ft × 1 ft floor tile. The design features a narrow corridor (green), a blind turn (red), and an antechamber (blue).
5.3.2 Obstacle Placement
Based upon the defined trajectory into the maze, a position for a static (non-moving) obstacle was chosen around the blind turn as depicted in Fig. 5.6. Since it is not included on the environment map and is not visible until the blind turn is mostly completed, this obstacle is avoidable only if the CRIMbot can discover it before colliding. Avoidance of this obstacle, which simply amounts to suspending path execution given the constrained environment, thus depends upon the Kinect's field of view and minimum range. This configuration therefore satisfies Section 5.1's second and third objectives given the fixed nature of the obstacle and the dependence of obstacle detection upon the Kinect's imaging parameters, respectively. The chosen obstacle for this test was constructed from lightweight cardboard to protect the CRIMbot from damage during collisions and had a paper surface covering readily seen
Figure 5.3: Overhead picture of the maze’s corridor and blind turn. The maze is constructed primarily from posterboard, foamboard, and cardboard and is separable into foldable segments. This fabrication method yielded a very compliant maze, which protected the CRIMbot in the event it collided with a wall.
Figure 5.4: Maze test environment as mapped using SLAM. White, gray, and black pixels correspond to empty, unknown, and occupied cells, respectively, in the occupancy grid. The left image is the original map as captured at 1 cm/pixel resolution whereas the right image is the original map cleaned and downsampled to 4 cm/pixel resolution to reduce localization overhead during navigation. An area is cut out of the antechamber’s left side to accommodate the tethered link to the workstation. Additionally, the width of the blind turn’s left half is captured at approximately 80% of its actual dimension.
Figure 5.5: Intended navigation trajectories into (left) and out of (right) the maze. The dark green arrow and dot indicate the trajectory’s goal pose and originating position, respectively. Both trajectories run the length of the corridor and include the blind turn.
both by the unmodified and modified Kinect at appropriate range (Fig. 5.7). With an approximately 22 cm × 21 cm base and a height exceeding that of the CRIMbot, this obstacle has dimensions consistent with a small trash can or toy such as that possibly encountered when
operating in a household environment. This choice of obstacle thus satisfies Section 5.1’s first objective regarding the plausibility of obstacle-avoidance scenarios.
5.4 Procedure
Four separate test scenarios were considered to establish the CRIMbot’s ability to navigate within the maze and to examine discoverable obstacle avoidance with unmodified and modified Kinect optics. The first pair of scenarios used an empty maze to demonstrate navigability with and without the modified Kinect optics. At the start of these scenarios, the CRIMbot was localized in the antechamber either by rotating it at its starting point or providing it with an accurate initial pose estimate. The CRIMbot was then issued the chosen navigation goal at the corridor’s terminus, and upon reaching the target pose, was issued the chosen navigation goal in the antechamber’s lower right corner according to the designed trajectories in Section 5.3.1. Fig. 5.8 shows visualizations of the CRIMbot’s navigation state for the antechamber start position, corridor terminus, and antechamber end position. The second pair of scenarios included the chosen obstacle placed as indicated in Fig. 5.6. These scenarios were identical to the first pair except that navigation out of the maze was not considered since the CRIMbot either suspended path plan execution after discovering the
Figure 5.6: Placement of the discoverable obstacle around the blind turn. The red square indicates the obstacle’s approximate position and dimensions. The obstacle is visible only after nearly completing the turn along the trajectory into the maze and hence exercises the Kinect’s ability to detect close-range objects.
Figure 5.7: Object placed as an obstacle in the maze. The obstacle has dimensions 22 cm × 21 cm × 37 cm and tapers to a line from bottom to top, which results in a laser scan target smaller than the base at the height of the CRIMbot’s Kinect.
For all tests using the Kinect's original optics, ROS's standard Kinect modules were used, and for all tests using the modified optics, the techniques from Chapter 4 were used to compensate for depth distortion. All tests were repeated three times to verify reproducibility and were recorded using ROS's data logging capabilities for later analysis. Path planning and trajectory execution were implemented with ROS's Turtlebot navigation stack. Section B.2 gives the specific parameters used for path planning and AMCL, such as obstacle inflation radius, goal biases, and sensor uncertainties.
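For illustration, the minimal Python sketch below shows one way a navigation goal like those used in these scenarios can be issued to the Turtlebot stack's move_base action server. The node name and goal coordinates are placeholders, and the experiments' actual goal-issuing mechanism (e.g., RViz or other ROS tooling) is not specified here.

    # Hypothetical example: sending a single navigation goal to the move_base
    # action server used by the Turtlebot navigation stack. The map-frame
    # coordinates and node name are placeholders.
    import rospy
    import actionlib
    from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

    def send_goal(x, y):
        client = actionlib.SimpleActionClient('move_base', MoveBaseAction)
        client.wait_for_server()

        goal = MoveBaseGoal()
        goal.target_pose.header.frame_id = 'map'      # goal expressed in the SLAM map frame
        goal.target_pose.header.stamp = rospy.Time.now()
        goal.target_pose.pose.position.x = x
        goal.target_pose.pose.position.y = y
        goal.target_pose.pose.orientation.w = 1.0     # identity orientation for simplicity

        client.send_goal(goal)
        client.wait_for_result()                      # blocks until the goal is reached or aborted
        return client.get_state()

    if __name__ == '__main__':
        rospy.init_node('maze_goal_sender')
        send_goal(2.0, 0.5)                           # placeholder corridor-terminus coordinates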
Figure 5.8: Visualizations of the CRIMbot’s three navigation positions created using ROS’s RViz. The images correspond to the CRIMbot’s initial start pose (left), waypoint at the corridor’s terminus (middle), and end pose in the antechamber’s corner (right). The blue and orange blocks correspond to inflated and actual obstacles, respectively, in the local occupancy grid costmap whereas green blocks indicate laser scan readings from the Kinect. The map colors are darker than those in Fig. 5.4 (since RViz by default renders the map with partial transparency on a black background) but denote the same occupancy grid cell type.
5.5 Outcomes
Three navigation tests were conducted for each of the four scenarios to yield a total of 12 tests. Navigation into the maze for each test scenario was repeated a fourth time for overhead video recording. The following subsections present and interpret the results for each of the four navigation scenarios.
5.5.1 Results
In all tests without an obstacle, the CRIMbot successfully navigated into and out of the maze largely along the intended trajectories from Fig. 5.5. Table 5.1 presents the average trajectory execution time for these tests with the maze empty. Fig. 5.9 and Fig. 5.10 show representative visualizations and overhead pictures of the CRIMbot’s outcome when navigating into the maze under each test scenario. Fig. 5.11 shows representative visualizations of the CRIMbot’s outcome when navigating out of the maze in the empty maze scenarios. For all tests with an obstacle using the unmodified Kinect, the CRIMbot failed to identify the obstacle and collided with it while executing its planned trajectory. For all tests with an obstacle using the modified Kinect optics from Chapter 4, however, the CRIMbot detected the obstacle and suspended navigation along its planned trajectory. It was additionally discovered that the obstacle occluded the unmodified Kinect’s view of the opposite corridor wall when completing the blind turn while navigating into the maze (Fig. 5.12).
Table 5.1: Average navigation time into and out of the empty maze from start of CRIMbot motion to achievement of the target pose for both Kinect configurations. Timings do not include localization spins before the CRIMbot commenced motion along its trajectory and are rounded to the nearest second. (Localization spins once plan execution had started are included, however.) On average, use of modified optics increased navigation time when compared to the original optics, especially along the trajectory out of the maze. The CRIMbot performed localization spins during path execution out of the maze for each test with the modified optics, which partially accounts for the corresponding 90 s average navigation time.

    Direction    Unmodified optics    Modified optics
    Into         24 s                 30 s
    Out of       35 s                 90 s

5.5.2 Discussion
The CRIMbot successfully navigated into and out of the maze with both the unmodified and modified Kinect (Fig. 5.9 and Fig. 5.11). This indicates that use of modified optics with appropriate depth undistortion does not prevent the Kinect from serving as a navigation sensor.
Figure 5.9: Visualizations of representative outcomes for the four different test scenarios when navigating into the maze. Fig. 5.10 shows the corresponding overhead pictures of the maze. (top left) The CRIMbot navigates into the empty maze without difficulty using the unmodified Kinect. (top right) The CRIMbot navigates into the empty maze again without difficulty but using the modified Kinect optics. (lower left) Using the unmodified Kinect, the CRIMbot incorrectly navigates through the obstacle after the blind turn instead of correctly engaging obstacle avoidance behavior. The highlighted gap in the local occupancy grid costmap along the wall to the CRIMbot’s left results from occlusion by the obstacle (Fig. 5.12). (lower right) Using the modified Kinect optics, the CRIMbot correctly suspends navigation after detecting the obstacle as highlighted.
Figure 5.10: Overhead pictures of representative outcomes for the four different test scenarios when navigating into the maze. These pictures correspond to the CRIMbot’s navigation state visualizations in Fig. 5.9. The lower left picture shows the physical displacement of the obstacle (red) after the CRIMbot collided with it whereas the lower right picture shows suspension of the CRIMbot’s forward trajectory after detecting the obstacle blocking its path.
Figure 5.11: Visualizations of representative outcomes for the two test scenarios that involved navigating out of the maze. The CRIMbot successfully navigated out of the maze with both the unmodified and modified Kinect, demonstrating the ability to turn around in physically constrained situations. Additionally, poor laser scan reconstruction of the antechamber's lower right corner with modified optics suggests incomplete compensation for depth distortion.
Navigation performance, however, decreased with the modified optics, as evidenced by the elevated path execution times in Table 5.1 compared to the Kinect's original optics. It is hypothesized that such performance degradation results from the modified Kinect's inability to image distant features due to its shorter range and from persistent depth distortion (Fig. 5.11) for the reasons noted in Section 4.4.3. Nonetheless, since the CRIMbot reliably failed to detect the obstacle with the original Kinect but reproducibly succeeded in obstacle avoidance with the modified Kinect, use of the modified optics from Chapter 4 demonstrably improved the CRIMbot's ability to detect and avoid obstacles. This improvement results primarily from the modified Kinect's shorter minimum range compared to the unmodified Kinect rather than from its wider field of view, since the modified Kinect's depth image was cropped as noted in Section 5.2.3. That cropping yields comparable horizontal fields of view for the unmodified and modified Kinect is evident from the final laser scan readings for the empty maze scenarios in Fig. 5.9. The Kinect's minimum range with modified optics, however, is reduced by approximately 20 cm, from 50 cm (Section 2.3.2) to 30 cm (Section 4.2.1). Thus, when the CRIMbot turns the maze's blind corner and encounters the obstacle at close range, the modified Kinect with its shorter minimum range properly images the obstacle rather than producing out-of-range pixels as the unmodified Kinect does (Fig. 5.12). This shorter minimum range is the expected reason why the modified optics enhanced obstacle detection at close range, particularly in the blind turn scenario considered here.
5.5.3 Improvements
As suggested by the elevated navigation times in Table 5.1, the CRIMbot adhered less closely to the intended trajectories in Fig. 5.5 when using the modified optics than when using the original Kinect. It is believed that navigation trajectories around the maze's blind turn differed slightly between the unmodified and modified Kinect due to differences in localization. Thus, the demonstrated ability of the modified optics to enhance close-range obstacle detection may have partially relied upon trajectory variations, particularly the possibility of overshooting the blind turn when navigating into the maze. Such an overshoot would give the CRIMbot an advantage in obstacle detection since the obstacle would then become visible at a greater distance than the maze design intended. Future experiments of this nature should consider additional methods to ensure consistency of trajectories.
Figure 5.12: Magnified visualization of the CRIMbot finishing the blind turn into the maze and failing to observe the obstacle with the unmodified Kinect. The annotations show an approximation of the Kinect’s centerline and 57° horizontal field of view (dark green) from Section 2.1.1 and the approximate obstacle position (red) from Fig. 5.6. The undetected obstacle is seen to partially occlude the Kinect’s view of the corridor wall opposite the CRIMbot since the Kinect produces out-of-range pixels for objects within its minimum range. The obstacle’s tapered top likely accounts for the visibility of the wall behind the obstacle’s right edge as viewed by the Kinect.
CHAPTER 6 FUTURE WORK
The work presented thus far is a foundation for further exploration into innovative uses of depth imaging. This chapter concludes by describing potential improvements to and extensions of this foundation.
6.1 Improving Techniques

The methods of Chapter 4 exhibit the greatest opportunity for improvement, such as better optics and depth undistortion than presented in Section 4.2 and Section 4.3, respectively. Specifically, Chapter 4 is amenable to the following improvements:

Analytical Depth Distortion Model. Development of an analytical model for the depth distortion process may enable better performance than achieved with the empirical one described in Section 4.3. One possible analytical approach involves attempting to relate Eq. 1.5 for two different optical setups while simultaneously accounting for changes in feature positions. This may prove difficult, however, especially if the feature projection angles from Fig. 1.1 are unknown or are treated as such.

Using Other Optics. Compared to the lenses from Chapter 4, higher quality optics characterized by lower radial lens distortion may improve the Kinect's close-range imaging capability without introducing as pronounced depth distortion. Such optics may further correct the poor quality of depth image corners seen in Section 4.2.1. Nonetheless, other optics must remain low in cost to satisfy Section 4.1's fourth objective.

Kinect for Windows SDK. The latest version of Microsoft's Kinect SDK (released 1 February 2012) is accompanied by a Kinect specifically designed for applications other than gaming [24]. The new sensor has "near mode" capability for reliable imaging as close as 50 cm, which Microsoft implemented using modified firmware (and possibly hardware changes) [25].¹ Although compatible with Windows only at present, use of near mode in conjunction with modified optics may shorten the Kinect's minimum range by more than the 30% demonstrated in Chapter 4.²
6.2 Multi-scale Mapping
As discussed in Section 5.5.2, enhancements to the Kinect's close-range imaging capability using the techniques of Chapter 4 degrade large-scale navigation performance. Without modified optics, however, the Kinect's larger minimum range prevents capture of fine detail for mapping purposes. Thus, the concept of multi-scale mapping is proposed, wherein a robot is equipped with multiple depth imagers optimized for performance at different scales. For example, a robot equipped with an unmodified ("long-range") and a modified ("short-range") Kinect would use the long-range sensor for localization and coarse mapping and the short-range sensor for adding fine detail to the otherwise coarse map. In this particular scenario, the Kinect's low price enables cost-effective use of multiple sensors.³

An alternative multi-scale mapping implementation would involve mounting to the Kinect an actuated adapter capable of switching between lenses. A Kinect-enabled robot navigating with such a system could toggle between long- and short-range imaging modes since Section 4.3's method for correcting depth distortion relies purely on software techniques. Switching between lenses could thus enable simultaneous long- and short-range mapping capability with a single Kinect, provided that the switching mechanism is sufficiently rapid.

¹ It is believed that this minimum range reported by Microsoft takes into consideration reliable operation of the SDK's skeletal tracking system.
² Availability of near mode does not invalidate the successes of Chapter 4 for this reason.
³ Proper care is necessary to ensure that the two Kinects do not interfere with each other's structured light pattern.
6.3 Laparoscopic Imaging
As introduced in Section 1.3.3, it is believed that depth imaging in laparoscopic settings could dramatically enhance visualization of surgical spaces. If appropriately miniaturized and adapted to image at even shorter ranges than those considered in Chapter 4, the technology behind the Kinect may provide a low-cost solution for laparoscopic depth imaging.
6.3.1 Current Systems
The traditional laparoscope is a narrow, inflexible cylinder containing optics for viewing the surgical field in a minimally invasive procedure [26]. A camera system is typically mounted at the laparoscope's extracorporeal end to capture imagery for local or remote display or for recording and archival purposes. Video laparoscopes, however, use tip-mounted image sensors and electronically relay images to an external video system rather than channeling light from the surgical field through the scope [27]. Furthermore, since such laparoscopes do not contain fixed optics, they may have a flexible shaft that allows turning or bending, which enables greater surgical field visualization without additional articulation at the scope insertion site. More recently, video laparoscopes with dual viewpoints have emerged that provide depth perception through stereopsis [26, 27]. Notable stereo laparoscopes include the vision component of Intuitive Surgical's da Vinci Surgical System [10] and Viking Systems's 3DHD Vision System [28]. As expected, both systems rely upon specialized presentation technologies to display stereo imagery to the surgeon. Although advertised as displaying "3-D images," stereo laparoscopes provide no more of a three-dimensional view than the human visual system does and consequently do not constitute true three-dimensional imagers. Nonetheless, stereo imagery does improve upon a monocular laparoscope by enabling surgeons to perceive depth using the stereopsis binocular depth cue.
6.3.2 The Possibilities
It is believed that laparoscopic depth imaging would enable several promising scenarios. Underlying all of these scenarios are the detailed point clouds of tissue surfaces that laparoscopic depth imaging would provide.
Surgical Robotics. Detailed point clouds of the surgical environment could endow surgical robots with an information-rich sense of situational awareness. Rather than merely replicating the surgeon's hand motions at smaller scale as teleoperative systems do, surgical robots could use depth imagery to track their instruments and surgical space boundaries in three dimensions. One straightforward application of this feedback entails constraining instrument motions to avoid "danger zones" during delicate procedures, which becomes particularly helpful when training new surgeons. More complex uses include establishing the robot's pose with respect to the patient through identification of body or tissue landmarks. This estimated pose would enable transformation of pre-operative imagery (CT, MRI, X-ray, etc.) into the robot's reference frame for augmented reality surgeon displays. As noted in [29], such applications are not too far-fetched.

Medical Diagnostics. The availability of tissue surface point clouds would facilitate collection of diagnostic information as well. Static point cloud analysis would yield basic tissue structural properties, such as volume or shape, in addition to texture if sufficient precision is available. Dynamic analysis of a tissue's deformation when probed with instruments could reveal tissue mechanical properties, such as stiffness, that pertain to identification of certain structures and malignancies. Furthermore, recording of depth imagery during a surgical procedure would enable a three-dimensional post-operative review, giving surgeons the ability to examine their techniques and outcomes from multiple viewpoints.

Education. Laparoscopic procedures captured as point clouds would also enhance the training of medical professionals. Students could view appropriately recorded laparoscopic techniques from any perspective to fully understand the positioning and orientation of instruments with respect to target tissues. This would accelerate classroom learning and improve preparation for actual operating room experience. Medical schools could ultimately compile databases of tissue models derived from point clouds to produce anatomical "textbooks" viewable using augmented reality systems. Moreover, integration of pre-operative imagery with such three-dimensional anatomical models would provide new ways to teach how medical images relate to actual physiology.
REFERENCES

[1] R. Schwarte, G. Häusler, and R. W. Malz, "7 - Three-Dimensional Imaging Techniques," in Computer Vision and Applications, B. Jähne and H. Haußecker, Eds. San Diego: Academic Press, 2000, pp. 177–208. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B978012379777350008X

[2] Hokuyo UXM-30LN Scanning Laser Rangefinder Datasheet, Hokuyo, 2009. [Online]. Available: http://www.hokuyo-aut.jp/02sensor/07scanner/uxm_30ln_p.html [Accessed: 12 Feb 2012]

[3] SR4000 Datasheet Rev. 5.1, Mesa Imaging, 2011. [Online]. Available: http://www.mesa-imaging.ch/dlm.php?fname=pdf/SR4000_Data_Sheet.pdf [Accessed: 11 Feb 2012]

[4] The PrimeSensor Reference Design 1.08, PrimeSense, 2010. [Online]. Available: http://primesense.360.co.il/files/FMF_2.PDF [Accessed: 11 Feb 2012]

[5] W. Snyder and H. Qi, Machine Vision. New York: Cambridge University Press, 2004.

[6] R. Duraiswami. (2000) Camera calibration. Lecture slides for Fundamentals of Computer Vision at the University of Maryland. [Online]. Available: http://www.umiacs.umd.edu/~ramani/cmsc828d/lecture9.pdf [Accessed: 26 Sept 2011]

[7] G. Bradski, "The OpenCV Library," Dr. Dobb's Journal of Software Tools, 2000.

[8] "A - Application Gallery," in Computer Vision and Applications, B. Jähne and H. Haußecker, Eds. San Diego: Academic Press, 2000, pp. 609–665. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123797773500194

[9] "Hardware specs," Willow Garage, 2011, hardware specification for the PR2 robot. [Online]. Available: http://www.willowgarage.com/pages/pr2/specs [Accessed: 12 Feb 2012]

[10] "The da Vinci Surgical System," Intuitive Surgical, 2012, product summary of the da Vinci Surgical System. [Online]. Available: http://www.intuitivesurgical.com/products/davinci_surgical_system/ [Accessed: 12 Feb 2012]

[11] L. Soules, K. Wiens et al. (2010) Microsoft Kinect teardown. iFixit.com. [Online]. Available: http://www.ifixit.com/Teardown/Microsoft-Kinect-Teardown/4066 [Accessed: 26 Sept 2011]

[12] N. Burrus. (2010) Kinect calibration. [Online]. Available: http://nicolas.burrus.name/index.php/Research/KinectCalibration [Accessed: 26 Sept 2011]

[13] K. Konolige and P. Mihelich. (2010) kinect_calibration/technical. ROS.org. Discussion of the Kinect's software workings. [Online]. Available: http://www.ros.org/wiki/kinect_calibration/technical [Accessed: 26 Sept 2011]

[14] Programming Guide, Microsoft Research, 2011, Kinect for Windows SDK, Beta 1 Draft Version 1.1. [Online]. Available: http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/docs/ProgrammingGuide_KinectSDK.pdf [Accessed: 26 Sept 2011]

[15] "Depth mapping using projected patterns," U.S. Patent 12/522,171, 2010. [Online]. Available: http://www.google.com/patents/US20100118123

[16] "Protocol documentation," OpenKinect.org, 2011, reverse-engineered USB protocol for the Kinect. [Online]. Available: http://openkinect.org/wiki/Protocol_Documentation [Accessed: 18 Feb 2012]

[17] "Introducing OpenNI," OpenNI, 2011, website of the OpenNI organization. [Online]. Available: http://www.openni.org/ [Accessed: 18 Feb 2012]

[18] S. Gedikli, P. Mihelich, and R. B. Rusu. (2011) openni_camera. ROS.org. Documentation for ROS's OpenNI-based Kinect implementation. [Online]. Available: http://www.ros.org/wiki/openni_camera [Accessed: 18 Feb 2012]

[19] "Kinect develop overview," Microsoft, 2012, system requirements for the Kinect for Windows SDK. [Online]. Available: http://www.microsoft.com/en-us/kinectforwindows/develop/overview.aspx [Accessed: 18 Feb 2012]

[20] MATLAB, version 7.10.0 (R2010a). Natick, Massachusetts: The MathWorks Inc., 2010.

[21] S. Magnenat. (2010) Questions regarding code & algorithms. OpenKinect mailing list. [Online]. Available: https://groups.google.com/group/openkinect/browse_thread/thread/31351846fd33c78/e98a94ac605b9f21?q=stephane [Accessed: 26 Sept 2011]

[22] "Play range reduction lens for Kinect," Nyko, 2011. [Online]. Available: http://www.nyko.com/products/product-detail/?name=Zoom [Accessed: 26 Sept 2011]

[23] M. Quigley, K. Conley, B. P. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source robot operating system," in ICRA Workshop on Open Source Software, 2009.

[24] "Kinect new features," Microsoft, 2012, Kinect for Windows's new features. [Online]. Available: http://www.microsoft.com/en-us/kinectforwindows/develop/new.aspx [Accessed: 18 Feb 2012]

[25] "Near mode: What it is (and isn't)," Microsoft, 2012, discussion of Kinect for Windows near mode. [Online]. Available: http://blogs.msdn.com/b/kinectforwindows/archive/2012/01/20/near-mode-what-it-is-and-isn-t.aspx [Accessed: 18 Feb 2012]

[26] P. S. Lowry, "Laparoscopic instrumentation," in Essential Urologic Laparoscopy, ser. Current Clinical Urology, S. Y. Nakada and S. P. Hedican, Eds. Humana Press, 2009, pp. 9–24. [Online]. Available: http://dx.doi.org/10.1007/978-1-60327-820-1_2

[27] T. T. Higuchi and M. T. Gettman, "Essential instruments in laparoscopic and robotic surgery," in Retroperitoneal Robotic and Laparoscopic Surgery, J. V. Joseph and H. R. Patel, Eds. Springer London, 2011, pp. 9–22. [Online]. Available: http://dx.doi.org/10.1007/978-0-85729-485-2_2

[28] "3DHD Vision System," Viking Systems, 2012, product summary of the 3DHD Vision System. [Online]. Available: http://www.vikingsystems.com/medical/products/3DHD-Vision-System/default.html [Accessed: 28 Feb 2012]

[29] J. V. Joseph and H. R. H. Patel, "On the horizon," in Retroperitoneal Robotic and Laparoscopic Surgery, J. V. Joseph and H. R. Patel, Eds. Springer London, 2011, pp. 165–169. [Online]. Available: http://dx.doi.org/10.1007/978-0-85729-485-2_15

[30] D. Ascher et al., "Numerical Python," Lawrence Livermore National Laboratory, Tech. Rep. UCRL-MA-128569, 2001. [Online]. Available: http://numpy.scipy.org/

[31] E. Jones, T. Oliphant, P. Peterson et al., "SciPy: Open source scientific tools for Python," 2001–. [Online]. Available: http://www.scipy.org/
APPENDICES
APPENDIX A CAMERA PARAMETERS
This appendix specifies the camera intrinsics and lens distortion coefficients estimated using OpenCV’s chessboard calibration for the Kinect’s original optics and the Nyko Zoom optics at VGA resolution. The depth and infrared camera intrinsics are identical since the Kinect uses infrared camera imagery to produce depth images, and consequently, only infrared camera intrinsics are reported.
A.1 Original Optics
The color and infrared camera intrinsic parameters for the Kinect’s original optics are below in units of pixels. Table A.1 lists the associated lens distortion coefficients for both cameras.
A_color,o = [520.38, 0, 319.25; 0, 520.19, 261.87; 0, 0, 1]    (A.1)

A_infrared,o = [588.92, 0, 315.10; 0, 587.89, 252.12; 0, 0, 1]    (A.2)
Both cameras have similar scaled focal lengths and have principal points very close to the image center. There is a nominal difference between the x and y scaled focal lengths.
Table A.1: Color and infrared camera lens distortion coefficients estimated for the Kinect's original optics. These parameters are suitable for direct use with OpenCV.

    Coefficient    Color         Infrared
    k1             0.23026       −0.089552
    k2             −0.73952      0.47946
    p1             0.00085105    0.0040476
    p2             −0.0015311    0.0045308
    k3             0.81742       −0.80428
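As a usage illustration, the short Python sketch below loads the infrared intrinsics from Eq. A.2 and the infrared coefficients from Table A.1 into OpenCV and undistorts a single image; the file names are hypothetical, and this is only one way these parameters might be consumed.

    # Illustrative only: undistorting an infrared image captured with the
    # Kinect's original optics using the intrinsics of Eq. A.2 and the infrared
    # distortion coefficients of Table A.1. File names are placeholders.
    import cv2
    import numpy as np

    A_infrared_o = np.array([[588.92,   0.0, 315.10],
                             [  0.0, 587.89, 252.12],
                             [  0.0,   0.0,   1.0]])
    dist_infrared_o = np.array([-0.089552, 0.47946, 0.0040476, 0.0045308, -0.80428])

    image = cv2.imread('infrared_frame.png', 0)                   # load as grayscale
    undistorted = cv2.undistort(image, A_infrared_o, dist_infrared_o)
    cv2.imwrite('infrared_frame_rectified.png', undistorted)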
A.2 Modified Optics
The color and infrared camera intrinsic parameters for the Nyko Zoom optics from Chapter 4 are below in units of pixels. Table A.2 lists the associated lens distortion coefficients for both cameras.
A_color,m = [330.21, 0, 315.41; 0, 329.69, 261.00; 0, 0, 1]    (A.3)

A_infrared,m = [373.35, 0, 324.78; 0, 372.83, 237.21; 0, 0, 1]    (A.4)
Both cameras again have similar scaled focal lengths with a nominal difference between the x and y versions. Compared to the original Kinect, these optics produce smaller focal lengths indicative of wide-angle lenses.
Table A.2: Color and infrared camera lens distortion coefficients estimated for the Nyko Zoom optics. These parameters are suitable for direct use with OpenCV.

    Coefficient    Color          Infrared
    k1             −0.12265       −0.24012
    k2             −0.032653      0.10622
    p1             0.00025016     −0.0058166
    p2             −0.0039020     0.0015052
    k3             0.017380       −0.033573
APPENDIX B ROS PARAMETER TUNING
This appendix lists the ROS parameter differences between the turtlebot and crimbot navigation stacks as used in Chapter 5. Parameter tables are taken directly from the relevant launch and YAML files and include only changed or added lines. Deleted parameters are specifically noted.
B.1 Mapping
Table B.1 gives the parameters used to tune GMapping’s SLAM implementation. Parameter tuning primarily increased the map resolution to 1 cm/pixel, increased odometry uncertainty, and promoted faster map updates. This choice of parameters improved map detail and generation rate.
B.2 Navigation
Tuned navigation parameters included those affecting path planning, obstacle avoidance, and localization.
Table B.1: GMapping parameter differences between the Turtlebot and the CRIMbot. Original parameters are found in the Turtlebot's gmapping_turtlebot launch file.

    Parameter        Value
    sigma            2
    linearUpdate     0.05
    delta            0.01
    llsamplerange    0.0001
    llsamplestep     0.0001
    lasamplerange    0.00005
    lasamplestep     0.00005
Table B.2: Path planning and execution parameter differences between the Turtlebot and the CRIMbot. Original parameters are found in the Turtlebot's base_local_planner_params YAML file.

    Parameter             Value
    max_vel_x             0.30
    max_rotational_vel    1.0
    yaw_goal_tolerance    0.1
    xy_goal_tolerance     0.10
    goal_distance_bias    0.6
    path_distance_bias    1.0

B.2.1 Path Planning
Table B.2 gives the parameters used to tune ROS’s TrajectoryPlannerROS. These parameters decreased the tolerance for path deviations and goal pose achievement and slowed the CRIMbot’s velocity along planned trajectories.
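For reference, the overrides in Table B.2 might appear in a base_local_planner_params YAML file roughly as sketched below. The enclosing TrajectoryPlannerROS namespace follows the usual convention for this file and is assumed here rather than taken from the thesis's actual configuration.

    # Illustrative YAML fragment only; values follow Table B.2, and the
    # namespace placement is an assumption.
    TrajectoryPlannerROS:
      max_vel_x: 0.30
      max_rotational_vel: 1.0
      yaw_goal_tolerance: 0.1
      xy_goal_tolerance: 0.10
      goal_distance_bias: 0.6
      path_distance_bias: 1.0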
B.2.2 Obstacle Avoidance
Table B.3 gives the tuned parameters shared by the global and local costmaps. Parameter changes included replacement of the default robot footprint with the less complex robot radius and elimination of the shared obstacle inflation radius. Table B.4 and Table B.5 give the tuned parameters for global and local costmaps, respectively.
Table B.3: Common costmap parameter differences between the Turtlebot and the CRIMbot. The added robot_radius parameter replaced the footprint and footprint_padding parameters, and the inflation_radius parameter was moved to the global and local costmap configuration files (Table B.4 and Table B.5, respectively). Original parameters are found in the Turtlebot's costmap_common_params YAML file.

    Parameter       Value
    robot_radius    0.25
Table B.4: Global costmap parameter differences between the Turtlebot and the CRIMbot. The added inflation_radius parameter allows specification of an obstacle inflation radius distinct from the local costmap. Original parameters are found in the Turtlebot's global_costmap_params YAML file.

    Parameter              Value
    transform_tolerance    1.0
    inflation_radius       0.40
The local obstacle inflation radius was decreased to accurately reflect the robot radius, whereas the global obstacle inflation radius was decreased slightly to improve path planning behavior for the environment in Section 5.3.1. The global costmap transform tolerance was increased to combat latency errors, but the local costmap transform tolerance was reduced to enforce better maintenance of local obstacle positions.
B.2.3 Localization
Table B.6 gives the tuned parameters for AMCL. The parameter changes favor laser scans over odometry for localization, increase the localization update rate, and decrease the transform tolerance for laser scans.
Table B.5: Local costmap parameter differences between the Turtlebot and the CRIMbot. The added inflation_radius parameter allows specification of an obstacle inflation radius distinct from the global costmap. Original parameters are found in the Turtlebot's local_costmap_params YAML file.

    Parameter              Value
    publish_frequency      1.0
    resolution             0.04
    transform_tolerance    0.25
    inflation_radius       0.25
Table B.6: AMCL parameter differences between the Turtlebot and the CRIMbot. Original parameters are found in the Turtlebot's amcl_turtlebot launch file.

    Parameter              Value
    odom_alpha1            0.8
    odom_alpha2            0.8
    odom_alpha3            0.8
    odom_alpha4            0.8
    laser_z_hit            0.95
    laser_z_rand           0.05
    laser_sigma_hit        0.002
    update_min_d           0.15
    transform_tolerance    0.5
    recovery_alpha_slow    0.001
    recovery_alpha_fast    0.1
APPENDIX C OPENNI KINECT CALIBRATION
This appendix presents the outcomes of producing an OpenNI-compatible depth undistortion interpolator to enable use of Chapter 4’s modified optics in Chapter 5’s navigation experiments.
C.1 Outcomes
Following the procedure of Section 4.3, 121 depth images of a corridor wall presented parallel to the Kinect's image plane were captured using the modified optics and the OpenNI Kinect driver. Using the camera intrinsics and distortion coefficients from Section 4.4.1, 2860 pairings between OpenNI's depth values in millimeters and object depth in meters were then collected from 41 depthboard poses using the modified optics (Fig. C.2).¹
C.1.1 Results
Processing the depth images of planar objects yielded the undistortion sample points in Fig. C.1 to approximate Eq. 4.1, from which the OpenNI-compatible undistortion interpolator was constructed. Undistorting the paired raw depth values and object depths yielded the corrected depth calibration sample points in Fig. C.2.

¹ OpenNI's calibration of raw depth values to millimeters is accurate only for the Kinect's original optics, however.
Figure C.1: Surface plot of cropped and downsampled depth images of large planar objects viewed with the modified optics using the OpenNI Kinect driver. The vertical axis corresponds to OpenNI depth value, and color indicates undistorted depth value. This plot is analogous to Fig. 4.7 but uses OpenNI’s calibrated depth values rather than raw depth values.
Linear regression of the form d = a r + b produced a best fit of

f_O3(r) = (0.61793 mm) r + (0.13065 m)    (C.1)

with residual mean µ = −6.3236 µm and standard deviation σ = 27.858 mm (Fig. C.3). This linear form was chosen since the OpenNI driver reports calibrated object depth rather than raw depth values.
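The fit itself is an ordinary least-squares line. A minimal sketch of producing such a fit with numpy is shown below; the arrays are placeholders standing in for the collected depth pairs, and numpy.polyfit is assumed as the fitting routine since the thesis does not reproduce its fitting code.

    # Minimal sketch of the linear fit d = a*r + b behind Eq. C.1. The arrays
    # are placeholders for the collected (depth value, measured depth) pairs;
    # numpy.polyfit performs the least-squares regression.
    import numpy as np

    r = np.array([812.0, 1043.0, 1290.0, 1515.0])   # corrected OpenNI depth values (placeholders)
    d = np.array([0.62, 0.78, 0.93, 1.07])          # measured object depths in meters (placeholders)

    a, b = np.polyfit(r, d, 1)                      # slope and intercept of d = a*r + b
    residuals = d - (a * r + b)
    print(a, b, residuals.mean(), residuals.std())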
C.1.2 Discussion
Despite the larger number of planar object depth images captured, the Fig. C.1 dataset produced poorer undistortion results in Fig. C.2 than the Fig. 4.7 dataset did in Fig. 4.9. It is suspected that the possibility for slight changes in lens adjustment mentioned in Section 4.4.3 may account for the greater depth undistortion error seen with the OpenNI driver than with libfreenect in Section 4.4.1. Even small lens offsets can produce large effects since the depth undistortion interpolator uses rectified pixel coordinates relative to the depth camera's principal point, which changes with lens positioning.
Figure C.2: (top) Plot of measured depth in meters versus raw depth value for 41 depthboard poses. (bottom) Plot of measured depth in meters versus corrected raw depth value using the undistortion function from Fig. C.1. Application of the undistortion function corrected raw depth values for large measured depths but yielded severe underestimation for smaller measured depths. Undistortion retained all but one of the original calibration points since few of the input points exceed the undistortion function’s region of support.
Figure C.3: Plot of residuals for the best fit from Eq. C.1. The residuals have mean µ = −6.3236 µm and standard deviation σ = 27.858 mm. The range ±4 cm contains 84% of all residuals. The residuals' large spread indicates overestimation and underestimation of corrected raw depth values.
Nevertheless, the depth calibration in Eq. C.1 clearly demonstrates improvement in resolution with modified optics. Compared to the original optics, OpenNI's reported depth values under modified optics each represent roughly one-third less actual object depth (units of 0.6 mm versus OpenNI's default 1 mm). The magnitude of this resolution improvement is consistent with that shown in Fig. 4.11.
APPENDIX D SOFTWARE TOOLS
This appendix describes the software tools used and developed to accomplish the work presented in the preceding chapters and appendices. Unless otherwise indicated, all software implementation used the open-source Python 2.7.1 programming language (http://www.python.org/) running under the 32-bit Ubuntu 11.04 Linux operating system (http://www.ubuntu.com/). Python's cross-platform nature should allow portability of these software tools to other platforms without substantial modification.
D.1 Libraries
Development of software tools relied upon the following Python libraries for implementation of image processing algorithms, communication with the Kinect, and the production of graphical user interfaces:

numpy (1.6.1): Performs efficient n-dimensional array manipulation and mathematical operations within Python [30]. Most array processing is handled internally by optimized C and Fortran libraries for performance.

scipy (0.10.0b2): Enables access to a wide range of mathematical, scientific, and optimization tools within Python [31]. Backend processing is handled by numpy as well as specialized C and Fortran libraries.

cv2 (2.3.1): Provides Python bindings for the OpenCV computer vision and image processing toolkit [7]. Internal use of numpy data structures enables smooth integration with scipy.

freenect (0.1.1): Provides Python bindings for libfreenect using numpy data structures (Section 2.2.1).

rospy (electric): Provides Python bindings for interacting with Willow Garage's ROS suite [23].

PyQt4 (4.8.3): Provides Python bindings for Nokia's Qt graphical user interface toolkit (http://qt.nokia.com/).¹

¹ PyQt4 is developed by Riverbank Computing (http://www.riverbankcomputing.co.uk/software/pyqt/).
D.2 Architecture
Software tools were designed using an object-oriented approach to enable Kinect driver independence. This independence is achieved by encapsulating all Kinect functionality into a single object, which creates an abstraction layer between Kinect implementation details and higher-level software (Fig. D.1). When imported at runtime, the kinect module automatically chooses between the libfreenect and OpenNI-based ROS Kinect drivers by detecting the presence of a running Python ROS node. If a ROS node has been constructed, the kinect module loads into itself the contents of the ros_kinect module, which then establishes the appropriate subscriptions to the OpenNI ROS node to receive color and depth images from the Kinect. Otherwise, the kinect module loads into itself the contents of the freenect_kinect module, which instantiates the libfreenect driver and configures the Kinect directly. In either case, the calibration module integrates into the Kinect interface object to transparently provide image rectification, depth undistortion, depth calibration, and point cloud generation capabilities.
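A hypothetical sketch of this runtime selection is shown below. It mirrors the module names in Fig. D.1, but the function name and the use of rospy's node-initialization check are illustrative rather than taken from the actual kinect module.

    # Hypothetical sketch of driver selection at import time. Module names
    # follow Fig. D.1; the internals of ros_kinect and freenect_kinect are
    # not shown here.
    import rospy

    def _select_backend():
        if rospy.core.is_initialized():          # a Python ROS node is already running
            import ros_kinect as backend         # OpenNI/ROS-based implementation
        else:
            import freenect_kinect as backend    # libfreenect-based implementation
        return backend

    kinect_backend = _select_backend()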
D.3 Techniques
Several notable software techniques were developed to efficiently process depth images from the Kinect, as described below.
Figure D.1: Diagram of the Kinect interface's modular architecture. The kinect and calibration Python modules sit above the freenect_kinect and ros_kinect Python modules, which wrap the libfreenect C library and the /camera/... subscriptions to the openni_camera ROS node, respectively. This design abstracts Kinect driver implementation details from higher-level software by consolidating all Kinect functions into a single driver-independent interface.
D.3.1 High-Fidelity Image Undistortion
When correcting distortion in preparation for point cloud generation, OpenCV's undistort function resamples images on the rectified coordinate system, which requires interpolation. For depth images, the interpolation required for resampling produces undesirable results, particularly when using bilinear or bicubic interpolation that blurs edges in depth. Upon conversion to a point cloud, such blurred edges appear as "halos" around objects: points suspended in space without a corresponding physical representation. Rather than resample the image from the distorted coordinate system onto the rectangular lattice of the rectified image, high-fidelity image undistortion uses the distorted pixel coordinates directly when generating the point cloud. This is achieved by inverting the output of OpenCV's initUndistortRectifyMap using scipy's CloughTocher2DInterpolator to determine the rectified coordinates of each distorted pixel and then using these per-pixel coordinates to unproject the depth image. The resulting output is an undistorted point cloud without the loss of fidelity incurred by resampling the image. High-fidelity depth undistortion is possible because the point cloud data structure explicitly specifies the x- and y-coordinates of points instead of implicitly defining them using array indices.
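A rough sketch of building such an inverse mapping is given below. It assumes a camera matrix A and distortion coefficients dist (for example, those in Appendix A) and is only one possible realization of the approach described above, not the thesis's actual implementation.

    # Sketch (one possible realization): initUndistortRectifyMap gives, for
    # each rectified pixel, the distorted coordinates it would sample from.
    # Fitting a scattered-data interpolator to the reversed pairs yields each
    # distorted pixel's rectified coordinates without resampling the image.
    import cv2
    import numpy as np
    from scipy.interpolate import CloughTocher2DInterpolator

    def inverse_rectify_map(A, dist, size=(640, 480)):
        w, h = size
        map_x, map_y = cv2.initUndistortRectifyMap(A, dist, np.eye(3), A,
                                                    (w, h), cv2.CV_32FC1)

        u, v = np.meshgrid(np.arange(w), np.arange(h))
        rectified = np.column_stack((u.ravel(), v.ravel())).astype(np.float64)
        distorted = np.column_stack((map_x.ravel(), map_y.ravel()))

        # Interpolate rectified coordinates at the distorted pixel lattice.
        interp = CloughTocher2DInterpolator(distorted, rectified)
        return interp(rectified).reshape(h, w, 2)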
D.3.2 Performance Optimizations
With careful design of array manipulations, Python can perform "nearly" as fast as C when empowered with numpy and scipy. Three particular optimizations are discussed below.²

Choice of Interpolator. Gridded interpolators, such as scipy's map_coordinates, perform much faster than scattered interpolators, such as scipy's LinearNDInterpolator and CloughTocher2DInterpolator. Especially when processing depth images at as close to 30 fps as possible, map_coordinates can capitalize upon the sample points' regular structure to access an input's neighbors with array indexing, whereas scattered interpolators must rely on more computationally expensive methods.
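For example, sampling a depth image at non-integer pixel locations with the gridded interpolator looks roughly like the following; the array contents and sample locations are placeholders.

    # Gridded interpolation with map_coordinates exploits the regular pixel
    # lattice: each lookup is array indexing plus local weights. The depth
    # array and sample coordinates below are placeholders.
    import numpy as np
    from scipy.ndimage import map_coordinates

    depth = np.random.rand(480, 640)          # stand-in for a depth image
    rows = np.array([10.3, 200.7, 400.1])     # fractional row coordinates
    cols = np.array([15.9, 320.5, 600.2])     # fractional column coordinates

    samples = map_coordinates(depth, np.vstack((rows, cols)), order=1)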
Caching. Pre-processing and caching unchanging data can yield considerable speed improvements. For example, when generating point clouds from depth images using Eq. 1.8, pre-computing and caching −(x_i − c_x)/α_x and −(y_i − c_y)/α_y reduces the number of required arithmetic operations per dimension from three to one.
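A minimal sketch of this caching optimization follows; the intrinsic values are placeholders, and the sign convention matches the expression quoted above.

    # Minimal sketch of the caching optimization: the per-pixel multipliers
    # -(x_i - c_x)/alpha_x and -(y_i - c_y)/alpha_y depend only on the camera
    # intrinsics, so they are computed once and reused for every depth frame.
    import numpy as np

    WIDTH, HEIGHT = 640, 480
    c_x, c_y = 319.25, 252.12               # principal point (placeholder values)
    alpha_x, alpha_y = 588.92, 587.89       # scaled focal lengths (placeholder values)

    x_i, y_i = np.meshgrid(np.arange(WIDTH), np.arange(HEIGHT))
    x_mult = -(x_i - c_x) / alpha_x         # cached once at startup
    y_mult = -(y_i - c_y) / alpha_y

    def unproject(depth):
        # Only one multiplication per dimension remains per frame.
        return np.dstack((x_mult * depth, y_mult * depth, depth))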
Low Overhead Operations. Carefully designing array manipulations can significantly improve numpy's performance. In particular, since numpy cannot optimize composed arithmetic operations at the Python level into a single C loop, each intermediate operation requires the allocation of a temporary array. Avoiding operations that require temporary arrays can consequently yield tremendous speed-ups. For example, the replacement operation a[a==0]=1 requires allocating a temporary array to store a==0, whereas a single C loop could accomplish the task without such memory overhead.³ Inline C functionality is fortunately provided by scipy, wherein fragments of C code are compiled at runtime into libraries with Python bindings; optimizing Python operations is thus possible much like the use of inline assembly in C.

² Further discussion and comparison of several Python optimization paradigms is available at http://www.scipy.org/PerformancePython.
³ In fact, the lookup table operation array([1,0])[a] is much faster than creating the temporary array but requires that a have finitely many integer values. Also, scipy has the capability to automatically generate C loops for some (but not all) simple numpy expressions upon explicit request.
D.4 User Interfaces
Fig. D.2 and Fig. D.3 present two graphical user interfaces developed for quickly measuring point clouds and capturing images in various formats from the Kinect, respectively.
Figure D.2: Graphical user interface for point cloud measurements. The interface allows the user to sample two points in the point cloud and displays the distance and displacement between those two points.
Figure D.3: Graphical user interface for point cloud and image mass capture. The interface allows the user to capture point clouds in the CSV or PCD ASCII formats; depth images as 8-bit or 16-bit PNG images or in the CSV format; and color images as PNG images. The output base name is automatically appended with the frame number upon each save.