Computer Engineering CE-MS-2010-11
Mekelweg 4, 2628 CD Delft, The Netherlands
http://ce.et.tudelft.nl/

M.Sc. Thesis

Design and Implementation of Real-Time High-Definition Stereo Matching SoC on FPGA

Lu Zhang B.Sc.

Abstract

Stereo matching has been widely used in many fields, such as viewpoint interpolation, feature detection systems and free-view TV. However, the long processing time of stereo matching algorithms has been the major bottleneck limiting their real-time application. During the past decades, many implementation platforms and corresponding algorithm adaptations have been proposed to solve the processing time problem. Although notable real-time performance has been achieved, these works rarely satisfy both the real-time processing and the high stereo matching quality requirements. In this thesis, we propose an improved stereo matching algorithm suitable for hardware implementation, based on the VariableCross and the Mini-Census algorithms. Furthermore, we provide a parallel computing hardware design and implementation of the proposed algorithm. The developed stereo matching hardware modules are instantiated in an SoC environment and implemented on a single EP3SL150 FPGA chip. The experimental results suggest that our work achieves high-speed real-time processing with programmable video resolutions, while preserving high stereo matching accuracy. The online benchmarks also show that this work delivers leading matching accuracy among declared real-time implementations, with an averaged benchmark error rate of only 8.2%. We have also achieved 60 frames per second for 1024 × 768 high-definition stereo matching, which is, to the best of our knowledge, the fastest high-definition stereo matching reported.

Faculty of Electrical Engineering, Mathematics and Computer Science

Design and Implementation of Real-Time High-Definition Stereo Matching SoC on FPGA

Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Embedded Systems by Lu Zhang B.Sc.
born in Shandong, China

This work was performed in:
Computer Engineering Group
Department of Microelectronics & Computer Engineering
Faculty of Electrical Engineering, Mathematics and Computer Science
Delft University of Technology

Copyright © 2010 Computer Engineering Group, Delft University of Technology. All rights reserved.

Delft University of Technology
Department of Microelectronics & Computer Engineering

The undersigned hereby certify that they have read and recommend to the Faculty of Electrical Engineering, Mathematics and Computer Science for acceptance a thesis entitled "Design and Implementation of Real-Time High-Definition Stereo Matching SoC on FPGA" by Lu Zhang B.Sc. in partial fulfillment of the requirements for the degree of Master of Science.

Dated: 2010-September-03
Chairman: Prof.dr.ir. Koen Bertels
Advisors: Dr.ir. Gauthier Lafruit, Dr.ir. Georgi Kuzmanov
Committee Members: Dr.ir. Rene van Leuken

Acknowledgments

This thesis work has been performed at IMEC (Leuven, Belgium) with support from the Computer Engineering group of TU Delft (Delft, the Netherlands). I want to take this opportunity to thank everyone who has helped me during this thesis work. First I would like to thank Dr. Gauthier Lafruit, Prof. Diederik Verkest and Prof. Kees Goossens for offering me the opportunity to do my master's thesis project at IMEC. Their rich experience and scientific insight have steered the whole research and development in the right direction: a dedicated hardware implementation of high-performance stereo matching. I thank Ke Zhang for his ingeniously developed VariableCross stereo matching algorithm and for guiding me into the stereo vision world. Ke also inspired me a lot during the hardware architecture and computing flow design. I thank Bart Vanhoof for his advice on selecting and using a suitable FPGA and development tools. Bart also contributed a lot to the camera interface and image rectification implementations on the FPGA. I thank Luc Rynders for his collaboration on viewpoint interpolation analysis and the corresponding software development. I would like to thank Prof. Georgi Kuzmanov for his invaluable suggestions for improvements and all the hours he put into proofreading my thesis.
Along the way, Qiong Yang and Bart Masschelein also spent time teaching me background knowledge and offering insightful advice. I would also like to thank Gauthier Lafruit, Francesco Pessolano and Johan De Geyter for their excellent management. Special thanks go to Prof. Tian Sheuan Chang and Gauthier Lafruit; without their support and contributions this thesis work could not have reached its desired achievements. Prof. Chang proposed the Mini-Census transform, which is applied in our implementation to improve the robustness of stereo matching and to reduce the memory consumption. I am also grateful for his inspiration and encouragement. Gauthier's enthusiasm and ambition have been the driving force of the whole research and development. I owe him cordial appreciation for his constant guidance, trust and support. The past two years of study and research in Delft and Leuven have been a wonderful experience. TU Delft and IMEC gave me the opportunity to interact with outstanding and talented people from all over the world. I want to express my gratefulness to all my friends for their love, support and for sharing all the happy times with me in the two fantastic European towns. I also deeply thank my family for their remote emotional support and encouragement. Finally I thank all my thesis defense committee members, including Prof. Koen Bertels, Prof. Georgi Kuzmanov, Dr. Gauthier Lafruit and Prof. Rene van Leuken.

Lu Zhang B.Sc.
Delft, The Netherlands
2010-September-03

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Solutions and Contributions
  1.4 Overview of Chapters
2 Background and Related Work
  2.1 Epipolar Geometry and Image Rectification
  2.2 Stereo Matching Computation Flow
    2.2.1 Matching Cost Computation
    2.2.2 Area-Based Matching Cost Aggregation
    2.2.3 Disparity Estimation
    2.2.4 Disparity Refinement
  2.3 Related Work
  2.4 Design Considerations and Targets
3 Stereo Matching Algorithm: Software-Hardware Co-Design
  3.1 Overview of the Stereo Matching Pipeline
  3.2 Design of the Pre-Processor Pipeline
    3.2.1 Median Filter
    3.2.2 Census Transform and Adaptive Support Region Construction
    3.2.3 Output Logic
  3.3 Design of the Stereo-Matcher Pipeline
    3.3.1 Raw Cost Scatter and Correlation Region Builder
    3.3.2 Parallelized Cost Aggregations
    3.3.3 Winner-Takes-All and L-R Disparity Maps
    3.3.4 Output Logic
  3.4 Design of the Post-Processor Pipeline
    3.4.1 L-R Consistency Check
    3.4.2 Disparity Voting
    3.4.3 Median Filter and Output
4 Stereo Matching SoC Implementations on FPGA
  4.1 Overview of the Proposed System
    4.1.1 The PC Tasks
    4.1.2 The FPGA Tasks
  4.2 Introduction to the SoC Hardware
  4.3 Introduction to the SoC Software
  4.4 System on a Chip Interconnections
    4.4.1 Avalon Memory Mapped Interconnect
    4.4.2 Avalon Streaming Interconnect
  4.5 Storage System
  4.6 Bandwidth Utilization Estimations
5 Design Evaluation and Experimental Results
  5.1 Evaluation of the Proposed Algorithm
  5.2 Evaluation of the FPGA Implementation
    5.2.1 Introduction to the FPGA Hardware Resource
    5.2.2 FPGA Implementation Report
  5.3 Evaluation of the Real-Time Performance
    5.3.1 Performance Evaluation with Simulation
    5.3.2 Performance Evaluation with FPGA
  5.4 Comparison to Related Work
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Summary of Chapters and Contributions
  6.3 Future Work and Application Developments
Bibliography

List of Figures

1.1 A disparity map example
1.2 The eye gazing angle problem
1.3 The Eye Contact Silhouette system
1.4 IMEC proposed eye gazing solution
1.5 Proposed view interpolation setup with FPGA
2.1 Correspondences and the half occlusion problem
2.2 Rectified images and epipolar lines
2.3 Basic pixel-to-pixel matching method
2.4 Cost aggregation over a fixed window
2.5 Cost aggregation over a shaped window
2.6 Stereo matching pseudo code
3.1 Sequential stereo matching processing flow
3.2 Parallelized stereo matching processing flow
3.3 Stereo matching pipeline and functions
3.4 Pre-Processor internal processing pipeline
3.5 Median filter
3.6 Median filter data stream
3.7 Line buffers and registers in a median filter
3.8 Slide window for a median filter
3.9 Median filter sorting array
3.10 Mini-Census transform
3.11 Local adaptive cross and support regions on an image
3.12 Adaptive support region representation
3.13 Census transform and adaptive support region construction window
3.14 Stereo-Matcher processor internal processing pipeline
3.15 1-dimensional local search area
3.16 Scattering raw matching costs using shift register arrays
3.17 Two orthogonal 1D aggregations
3.18 Integral costs and horizontal cost aggregation
3.19 Horizontal cost aggregation logic
3.20 Aggregation data reuse in vertical aggregation
3.21 Vertical cost aggregation without data reuse
3.22 Vertical cost aggregation with data reuse
3.23 Pipelined data flow
3.24 Winner-Takes-All select tree
3.25 Compute the disparity map of the right image
3.26 Lineup buffers and read patterns
3.27 L-R disparity maps generated by the Stereo-Matcher
3.28 Post-Processor internal processing pipeline
3.29 Synchronized L-R disparity data word
3.30 L-R consistency check logic
3.31 Left disparity map after consistency check
3.32 Disparity histogram and voted disparity
3.33 2 × 1D orthogonal voting method
3.34 Horizontal voting unit for disparity 0
3.35 Vertical voting logic and line buffers
3.36 Voted disparity maps
3.37 Final disparity map
4.1 Proposed SoC with both source and result frame buffer
4.2 Proposed SoC with source frame buffer and display
4.3 Flow chart of the main loop tasks
4.4 The minimum Avalon-ST interconnect
4.5 Avalon-ST interconnect in our system
4.6 Packed stereo frame data storage
5.1 Lmax-Vspan and error rates with Tsukuba image set
5.2 Lmax-Vspan and error rates with Teddy image set
5.3 Support region threshold and error rates with Tsukuba image set
5.4 Support region threshold and error rates with Teddy image set
5.5 Truth disparity maps and our results
5.6 Luminance biased images and the resulted disparity maps
5.7 High-level block diagram of Stratix-III ALM
5.8 Basic Two-Multiplier Adder DSP unit

List of Tables

2.1 Matching cost computations and aggregations
5.1 EP3SL150 on-chip memory features
5.2 EP3SL150 DSP block configurations
5.3 EP3SL150 hardware resource summary
5.4 EP3SL150 hardware resource utilization summary
5.5 Design and implementation scalability
5.6 Stereo matching pipeline latency summary
5.7 Stereo matching processing speed summary
5.8 Theoretical frame rates (FPS) with different pixel clocks
5.9 FPGA frame rates (FPS) with SoC1 and 100MHz Clock
5.10 FPGA frame rates (FPS) with SoC2 and standard pixel clocks
5.11 Stereo matching algorithm error rate benchmark
5.12 Stereo matching algorithm frame rates

1 Introduction

Stereo matching has been, and continues to be, one of the most active research topics in computer vision.
The task of a stereo matching algorithm is to analyze the images taken by a stereo camera pair and to estimate the displacement of corresponding points existing in both images, in order to extract the depth information (inversely proportional to the pixel displacement) of objects in the scene. The displacement is measured in number of pixels and is also called the Disparity; disparity values normally lie within a certain range, the Disparity Range, and the disparities of all the image pixels form the disparity map, which is the output of stereo matching processing. An example with the Teddy benchmark image set is shown in Figure 1.1. In the figure the disparities are visualized as grayscale intensities: the brighter the grayscale, the closer the object is to the stereo cameras. The disparity map therefore encodes the depth information of each pixel, and once we infer the depth information by means of stereo matching, we are able to obtain the 3D information and reconstruct the 3D scene using triangulation.

Figure 1.1: A disparity map example. (a): Image taken by the left camera. (b): Image taken by the right camera. (c): The ground truth disparity map associated with the left image.

Since stereo matching provides depth information, it has great potential uses in 3D reconstruction, stereoscopic TV, navigation systems, virtual reality and so on.

1.1 Motivation

The motivation of this thesis research and design work is to facilitate eye-gazing video conferencing with the support of high-quality, real-time stereo matching. As many readers may have experienced, a common desktop video call through Skype and a webcam often lacks natural eye contact. The reason is obvious: providing natural eye contact requires the communication participants to look directly into the camera, but unfortunately, traditional video conferencing devices with a camera positioned above the screen (as shown in Figure 1.2) fail to satisfy this requirement.
The angular displacement (denoted α in Figure 1.2) between the single camera and the monitor center means that a participant cannot look into the camera and at the remote participant shown on the display simultaneously.

Figure 1.2: The eye gazing angle problem

The lack of natural eye contact leads to awkward and uncomfortable feelings, and sometimes causes distrust between meeting participants. There are several approaches to correct the eye gazing problem; a remarkable one is the Eye Contact Silhouette system provided by Digital Video Enterprises (DVE). As shown in Figure 1.3, the camera is physically positioned behind the display, which means that when the local participants look at the display showing the remote participants, they are also looking directly into the camera, providing true eye contact (with zero gazing angle) to the far-end sites.

Figure 1.3: The Eye Contact Silhouette system

However, this arrangement is bulky and expensive, and more importantly it does not fit well with commodity computer peripherals. A more elegant solution is to provide a computer-synthesized video that looks as if it were captured by a camera directly behind the monitor screen. Our proposed solution is shown in Figure 1.4. It requires two cameras positioned at the top and bottom (or left and right) of the display, which provide the computer with stereo vision. The depth information in the computer's vision (the meeting participants) is therefore obtained by means of stereo matching, and the in-between view from the virtually centered camera position is provided by a view interpolation algorithm.

Figure 1.4: IMEC proposed eye gazing solution

In this thesis we assume the two cameras are horizontally aligned, i.e., left and right. When the cameras are vertically aligned, they can also be treated as left and right after an image rotation step.
The full proposed eye-gazing video pipeline contains image acquisition, camera calibration, image rectification, stereo matching, view interpolation and video display. This thesis work focuses on the stereo matching part, which introduces the major processing bottleneck as discussed in the following section. 1.2 Problem Definition In the processing pipeline, stereo matching is normally the most time consuming part. Due to high computational complexity, solutions that obtain satisfactory synthesizing quality under the constraint of real-time execution are quite rare, e.g. a recent threecamera viewpoint interpolation prototype system [32] proposed by Sharp Corporation only reaches a couple of frames per second for low-resolution images. In order to provide real-time and high-quality interpolation results, the stereo matching bottleneck must be solved first, especially for high frame rate and high definition (e.g. 50 frames per second with 1024 × 768 video size) applications such as business eye-gazing video conference systems. Recent stereo matching design and implementations utilize various processing platforms such as high-performance multi-core CPU, GPU, DSP, FPGA and ASIC. Meanwhile, many stereo matching algorithms and adaptations are also proposed for efficient implementations on these platforms. Nevertheless, these work rarely satisfies both real-time and high matching accuracy requirements. For example, a recent implementation on a 3 GeForce8800 GTX GPU proposed by Zhang et al. [50] preserves high matching quality with an averaged benchmark error rate 7.65%, but only reaches 12 frames per second with 450 × 375 image resolution. In contrast, the FPGA implementation proposed by Jin et al. [23] achieves 230 frames per second stereo matching with VGA (640 × 480) resolution, but the matching accuracy degrades a lot with averaged benchmark error rate 17.24%. 
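To put these speed claims on a common scale, one can compare implementations by pixel throughput (resolution × frame rate) with a back-of-envelope calculation. The figures below are the ones quoted in this section and in the abstract; the comparison ignores differences in disparity range and accuracy.

```python
# Compare stereo matching implementations by pixel throughput
# (width * height * frames-per-second), using the figures quoted above.
def throughput_mpixels(width, height, fps):
    """Megapixels of disparity output produced per second."""
    return width * height * fps / 1e6

gpu_zhang = throughput_mpixels(450, 375, 12)    # Zhang et al. [50], GPU
fpga_jin  = throughput_mpixels(640, 480, 230)   # Jin et al. [23], FPGA
this_work = throughput_mpixels(1024, 768, 60)   # this thesis, EP3SL150

print(f"GPU  [50]: {gpu_zhang:6.2f} Mpixel/s")   # ~2.03 Mpixel/s
print(f"FPGA [23]: {fpga_jin:6.2f} Mpixel/s")    # ~70.66 Mpixel/s
print(f"This work: {this_work:6.2f} Mpixel/s")   # ~47.19 Mpixel/s
```

Raw throughput alone is not the whole story: the GPU implementation is accurate but slow, while the VGA FPGA implementation is fast but inaccurate, which is exactly the gap this thesis targets.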
Furthermore, the frames to match in stereo matching come from two different cameras and hence might exhibit very different luminance or radiometric variations, incurring a severe matching cost bias to which some stereo matching cost functions are very sensitive. Some of the recently proposed algorithms are not robust under this condition.

1.3 Solutions and Contributions

To satisfy both the real-time and the high-accuracy stereo matching requirements, especially for high-definition view interpolation applications, we propose a hardware-friendly, high-accuracy stereo matching algorithm and a dedicated hardware implementation approach. The prototyping and evaluation design is implemented on a single EP3SL150 FPGA, and the implemented algorithm is based on Adaptive Cross-Based Local Stereo Matching (VariableCross) proposed by Zhang et al. [49], with our hardware-friendly adaptations. Compared with the other two implementations (on a CPU and a GeForce8800 GTX GPU) of the same algorithm, this design achieves a significant speedup in processing time (the speedup factor ranges from 41 to 162 compared with the CPU implementation), while preserving high matching accuracy. Our main achievements are twofold: on one side, we improved the state-of-the-art VariableCross [49] algorithm regarding its accuracy and robustness; on the other side, we provided a high-performance and high-quality hardware design of the proposed algorithm. The summarized contributions of this thesis work are listed below.

We proposed an improved algorithm that:
• Preserves high matching accuracy, with an 8.2% averaged benchmark error rate.
• Remains robust to radiometric distortions and bias differences introduced by the stereo cameras. The robustness is obtained with the support of the Mini-Census Transform proposed by Chang et al. [4].
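As background for the robustness claim: a census-style transform encodes each pixel as a bit vector of comparisons against a sparse set of neighbours, so only the ordering of intensities matters, not their absolute values. The sketch below uses an illustrative 6-neighbour pattern; the exact sparse pattern of the Mini-Census Transform is defined in [4], and the offsets here are only an example.

```python
# Illustrative census-style transform: encode a pixel as a bit vector of
# comparisons against a sparse neighbour pattern. The 6-neighbour offsets
# below are an assumption for illustration; the real Mini-Census pattern
# follows Chang et al. [4].
OFFSETS = [(-2, 0), (-1, -1), (-1, 1), (1, -1), (1, 1), (2, 0)]

def census_bits(img, y, x, offsets=OFFSETS):
    """Bit vector for pixel (y, x): bit i is 1 iff neighbour i is darker than the center."""
    center = img[y][x]
    bits = 0
    for dy, dx in offsets:
        bits = (bits << 1) | (1 if img[y + dy][x + dx] < center else 0)
    return bits

def hamming(a, b):
    """Raw matching cost between two transformed pixels."""
    return bin(a ^ b).count("1")
```

Because only intensity orderings are encoded, adding a constant luminance bias to one camera's image leaves the bit vectors, and hence the Hamming costs, unchanged — which is why census-based costs tolerate the camera bias problem described above.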
We implemented the algorithm in hardware such that it:
• Achieves real-time stereo matching at various, run-time programmable resolutions, e.g., 450 × 375 @ 193 FPS, 640 × 480 @ 105 FPS and 1024 × 768 @ 60 FPS, independently of the disparity range.
• Provides a fully pipelined and scalable hardware structure. The hardware scales with the maximum disparity range under consideration, and this scalability offers hardware resource savings for small disparity range applications. On the EP3SL150 FPGA we have evaluated implementations with maximum disparity ranges of 16, 32 and 64.
• Implements an industry-standard on-chip interconnection (the Avalon System Interconnect Fabric). This enables our design to work properly with other Avalon-compatible processing modules, for example Altera's Video and Image Processing (VIP) Suite.
• Provides a fully functional SoC reference design with run-time configurable parameters and disparity map visualization on a standard DVI monitor. The whole design is implemented on the DE3 evaluation board, which contains the EP3SL150 FPGA and rich peripherals, e.g., DDR2 SDRAM, USB and camera interfaces. A DVI extension board is also employed for the disparity map display.

To verify the hardware implementation on the FPGA, the stereo matching hardware processing starts with rectified stereo images from an external frame buffer (DDR2 SDRAM) and finally outputs disparity maps to a result frame buffer. The disparity maps are uploaded to a PC for verification and benchmarking. To visualize the disparity maps, the result frame buffer is removed and the disparity maps are output to a monitor. To enable follow-up work and view interpolation prototyping, the complete setup is also proposed, as shown in Figure 1.5.
Figure 1.5: Proposed view interpolation setup with FPGA

1.4 Overview of Chapters

The content of the thesis is organized as follows. Chapter 2 introduces the background knowledge of the stereo matching computing flow, related work and our design considerations. In Chapter 3, we present the proposed stereo matching algorithm and the corresponding hardware module design in a top-down approach. Chapter 4 provides two reference SoC designs with the stereo matching modules on our selected FPGA and the corresponding hardware/software tasks; in Chapter 4 we also estimate the external memory storage and bandwidth utilizations. Chapter 5 evaluates the whole design and implementation according to our design considerations, and provides a comparison to related work. Finally, Chapter 6 summarizes the thesis and suggests valuable optimization techniques and interesting application developments as future work based on our stereo matching results.

2 Background and Related Work

In the past two decades, various stereo matching algorithms have been proposed; they were summarized and evaluated by Scharstein and Szeliski [35]. In this notable work, the proposed stereo matching algorithms are categorized into two major types: local area-based methods and global optimization-based methods. In local methods, the disparity evaluation at a given pixel is based on a similarity measurement performed in a finite window. The similarity metric is defined by a Matching Cost, and all costs in the local window are often aggregated to provide a more reliable and robust result. On the other hand, global methods define global cost functions and solve an optimization problem. Global algorithms typically do not perform an aggregation step, but rather seek a disparity assignment that minimizes a global cost function.
In this thesis we are particularly interested in local stereo matching methods, which generally have low computational complexity and lower storage requirements, and are therefore suitable for real-time and embedded implementations. In Section 2.1 we introduce the image rectification required by efficient local stereo matching. In Section 2.2 the fundamentals of the stereo matching computation flow and our chosen algorithm are discussed. Related work and comparisons are reviewed in Section 2.3, and we conclude this chapter with our design considerations and corresponding algorithm adaptations in Section 2.4.

2.1 Epipolar Geometry and Image Rectification

For a given pixel in the left image, the stereo matching algorithm seeks the corresponding one in the right image; for example, in Figure 2.1 the object point p appears in both the left and right views as pixels p(x, y) and p′(x′, y′), respectively. The two pixels are defined as a correspondence pair, and the displacement of their locations in the stereo view is the disparity; in this case the disparity is equal to (x − x′). In Figure 2.1, the distance between the stereo cameras' focal axes is defined as the Baseline. To enable efficient stereo matching, it is necessary that the two input images are Rectified, fulfilling Epipolar Geometry. In epipolar geometry each pixel in the left image finds its correspondence in the right image (if it exists) on a specific line, called the Epipolar Line. If the two images are rectified, the epipolar lines coincide with the image rows and run parallel to the baseline, so that the correspondence search only needs to be performed along a 1-dimensional scanline. The concept is shown in Figure 2.2. Image rectification according to epipolar geometry is an important processing step before the stereo matching algorithm, enabling a less complex implementation of the correspondence search.
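For a rectified pair, triangulation reduces to the well-known relation Z = f·B/d, where Z is the depth, f the focal length in pixels, B the baseline and d the disparity — this is the "inversely proportional" relation mentioned in the introduction. A minimal sketch, with hypothetical focal length and baseline values chosen only for illustration:

```python
# Depth from disparity by triangulation on a rectified stereo pair:
# Z = f * B / d, so depth is inversely proportional to disparity.
# The focal length (pixels) and baseline (meters) are hypothetical examples.
def depth_from_disparity(d_pixels, focal_px=1000.0, baseline_m=0.10):
    """Depth in meters for a disparity of d_pixels (d > 0)."""
    if d_pixels <= 0:
        raise ValueError("zero disparity corresponds to a point at infinity")
    return focal_px * baseline_m / d_pixels

# Larger disparity -> closer object, matching the brighter-is-closer
# visualization convention of the disparity maps in this thesis.
print(depth_from_disparity(50))   # 2.0 (m)
print(depth_from_disparity(100))  # 1.0 (m)
```

This relation also explains why the disparity range is determined by the baseline length and the distance of the scene objects: a longer baseline or a closer object produces a larger disparity.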
The detailed rectification algorithm and its implementations are not further discussed in this thesis; we therefore assume in the sequel that the input stereo images are rectified.

Figure 2.1: Correspondences and the half occlusion problem
Figure 2.2: Rectified images and epipolar lines

2.2 Stereo Matching Computation Flow

In general, stereo matching algorithms perform the following steps:
1. Matching cost computation.
2. Cost (support) aggregation.
3. Disparity computation / optimization.
4. Disparity refinement.
The four steps are introduced in the following subsections. In local algorithms, the emphasis is on the matching cost computation and cost aggregation steps.

2.2.1 Matching Cost Computation

Thanks to image rectification, searching for correspondences is reduced to a 1-dimensional problem. In addition, the disparity values lie within a certain disparity range, so the basic matching method is performed pixel-to-pixel on a horizontal line segment, as shown in Figure 2.3. In this example the disparity range is [0 ... dmax], with dmax = 7 in the figure.

Figure 2.3: Basic pixel-to-pixel matching method

The disparity range is determined by the distance between the scene objects and the camera baseline, and by the length of the baseline itself. To determine the matched correspondences in a stereo pair, matching costs are defined to evaluate the probability of a correct match: the smaller the cost, the higher the probability. For pixel-to-pixel matching, the matching cost is commonly defined as the Absolute Difference, Squared Difference, Hamming Distance etc., on both gray and color images. Equation 2.1 defines the matching cost measured by the absolute intensity difference, where I(·) returns the grayscale intensity of a pixel in the left image and I′(·) returns the intensity of a right image pixel. The two pixels under consideration are related by a disparity hypothesis d.
Cost(x, y, d) = |I(x, y) − I′(x − d, y)|    (2.1)

In this simple case, the correspondence is determined by the minimum cost value, and the associated d is the disparity of the pixel p(x, y) in the left image. The Hamming distance cost is normally computed on transformed images. The transform is a pre-processing (often a filter) step before the matching cost computation. In our implementation we adopted the Mini-Census Transform proposed by Chang et al. [4]. Details of this transform algorithm are introduced in Chapter 3. To avoid ambiguity, in the following context we refer to the matching cost between two single pixels as the Raw Matching Cost. Raw matching costs in a restricted area are often aggregated to improve the matching accuracy.

2.2.2 Area-Based Matching Cost Aggregation

Typically, to increase the reliability of cost evaluation and the robustness to image noise, local stereo matching methods choose to aggregate (accumulate) raw matching costs over a Support Region. The most basic support region is a square window, as shown in Figure 2.4. In this example, a fixed 3 × 3 support region is built for a given pixel in the left image and all raw costs in this region are aggregated. The same computation applies to each pixel in the disparity range in the right image.

Figure 2.4: Cost aggregation over a fixed window

The well known challenge for area-based local stereo matching is that a local support region should be large enough to contain sufficient intensity variation for reliable matching, while it should be small enough to avoid disparity discontinuities inside the region. Therefore state-of-the-art stereo matching algorithms often shape the support window in order to include only pixels with the same (unknown) disparity. Veksler [42] found a useful range of window sizes and shapes while evaluating the aggregated cost. Zhang et al.
[49] adaptively shaped a support window with variable crosses in order to bound the pixels near arbitrarily shaped disparity discontinuities. Yoon and Kweon [48] assigned a support weight to each pixel in a support window based on color similarity and geometric proximity. Figure 2.5 illustrates a simple shaped support region example. The support regions are built around the anchor pixel in the left image, and around the pixels in the disparity range in the right image. Normally, for evaluating each hypothetical pair, only the overlapped area of the two support windows is considered, to take both images into account.

Figure 2.5: Cost aggregation over a shaped window. (a): Shaped windows for pixels in the left and right image. (b): Overlapped area of the windows associated with a hypothetical pair.

Some local stereo matching algorithms simply sum up all the raw matching costs in a support region, for example the Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD) and Sum of Hamming Distances. These algorithms are clearly broken down into steps 1 and 2 as mentioned above. On the other hand, some algorithms merge steps 1 and 2 and define a new matching cost based on the support region, such as Normalized Cross-Correlation (NCC). Table 2.1 summarizes the computations involved in these commonly used matching cost aggregations.

    Matching Cost                 | Computation
    ------------------------------|---------------------------------------------------------------------
    Sum of Absolute Differences   | Σx,y |I(x, y) − I′(x − d, y)|
    Sum of Squared Differences    | Σx,y (I(x, y) − I′(x − d, y))²
    Sum of Hamming Distances      | Σx,y Hamming(BitVec(x, y), BitVec′(x − d, y))
    Normalized Cross-Correlation  | Σx,y I(x, y)·I′(x − d, y) / √(Σx,y I(x, y)² · Σx,y I′(x − d, y)²)

    Table 2.1: Matching cost computations and aggregations

In the table, Σx,y(·) denotes aggregation over the (overlapped) support region; I(·) returns the intensity of a pixel and BitVec(·) gives the bit vector associated with a pixel after a certain transform, e.g.
Census. More extensive discussions and evaluations of state-of-the-art matching cost computations and aggregations can be found in [19].

2.2.3 Disparity Estimation

Disparity estimation is to choose, at each pixel in the left image, the disparity d associated with the minimum cost value. Generally, this involves a local "Winner-Takes-All" (WTA) optimization over each anchor pixel p(x, y) in the left image and the pixels in the disparity range in the right image. The basic concept of stereo matching is therefore explained by the pseudo code listed in Figure 2.6.

    FOR (y IN 0 TO ImageHeight - 1)
      FOR (x IN 0 TO ImageWidth - 1)
        FOR (d IN 0 TO dmax)
          IF (Cost(x, y, d) < MinCost(x, y))
            MinCost(x, y) = Cost(x, y, d)
            dp(x, y) = d
          END IF
        END FOR
      END FOR
    END FOR

Figure 2.6: Stereo matching pseudo code

Though straightforward, the embedded computing loops cause the major bottleneck for real-time implementations, and one of the contributions of this thesis work is to solve this problem with parallel and pipelined computing. A limitation of this disparity estimation is that the uniqueness of matches is only guaranteed for one image (the left image), while pixels in the right image might get matched to multiple points. To compensate for this problem, a similar matching process is also performed based on the right image, and the disparity map of the right image is also generated. The uniqueness and consistency tests are performed in the disparity refinement step, combining both disparity maps.

2.2.4 Disparity Refinement

A correspondence may not exist for each pixel in one image due to Half-Occlusion, as shown in Figure 2.1. A Consistency Check (comparing left-to-right and right-to-left disparity maps) is often employed to detect occluded areas and perform uniqueness tests, and detected mismatches can be fixed by surface fitting or by distributing neighboring valid disparity estimates [35].
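A minimal software sketch of such a left-right consistency check is given below; the tolerance, the sentinel value 255 and all names are our own illustrative choices, not the thesis implementation.

```python
import numpy as np

def consistency_check(disp_l, disp_r, tol=1):
    """Cross-check left and right disparity maps.

    A left pixel (x, y) with disparity d should map to right pixel
    (x - d, y), whose disparity should agree within `tol` pixels.
    Inconsistent pixels (typically half-occlusions) are marked invalid.
    """
    h, w = disp_l.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    ys = np.arange(h)[:, None].repeat(w, axis=1)
    # column in the right image that each left pixel maps to
    xr = np.clip(xs - disp_l.astype(np.int32), 0, w - 1)
    consistent = np.abs(disp_l.astype(np.int32) -
                        disp_r[ys, xr].astype(np.int32)) <= tol
    out = disp_l.copy()
    out[~consistent] = 255          # sentinel for "invalid / occluded"
    return out, consistent
```

The invalidated pixels would then be filled by the subsequent refinement steps, e.g., by propagating neighboring valid disparities.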
If the disparity results pass the consistency check, some additional techniques can be used to further improve the disparity reliability and accuracy, for example Sub-Pixel Estimation [37] or local high-confidence Disparity Voting [28] in a support region. After this step, a final median filter is also helpful for removing speckle noise and smoothing the disparity map. Disparity refinement normally contributes considerable improvements to the WTA-estimated disparity map, so in our final implementation we applied the consistency check, local disparity voting and median filtering.

2.3 Related Work

To enable efficient stereo matching computation, especially for real-time applications, various approaches have been developed during the past two decades. In recent years, notable achievements have been made in both matching accuracy and processing speed. We categorize them according to the implementation platforms.

General purpose CPUs are often used as the first algorithm prototyping platform, and many efficient aggregation methods have been proposed to accelerate the CPU processing. Tombari et al. [39] proposed an efficient segmentation-based cost aggregation strategy, which achieves a frame rate of 5 FPS for 384 × 288 images with a 16 disparity range on an Intel Core Duo clocked at 2.14 GHz, but drops to only 1.67 FPS with 450 × 375 images and a 60 disparity range. Later, Salmen et al. [34] optimized a dynamic programming algorithm on a 1.8 GHz PC platform and achieved a frame rate of 5 FPS for 384 × 288 images and a 16 disparity range. Kosov et al. [26] combined a multi-level adaptive technique with a multi-grid approach that allows their variational method to reach 3.5 FPS with 450 × 375 images and a 60 disparity range. Zhang et al. [49] proposed the VariableCross algorithm and the orthogonal integral image technique to accelerate the aggregation over irregularly shaped regions.
This work achieves 7.14 FPS with 384 × 288 images and a 16 disparity range, but only 1.21 FPS with 450 × 375 images and a 60 disparity range. Though their real-time performances are generally poor, CPU implementations often achieve very good accuracy, with less than 10% averaged error rates. The error rates of the four implementations mentioned above are 8.24%, 8.83%, 9.05% and 7.60%, respectively.

Among GPU implementations, a global optimizing real-time algorithm was developed by Yang et al. [46] on a GeForce 7900 GTX GPU. The authors used hierarchical belief propagation and reached 16 FPS for 320 × 240 images with 16 disparities. This work presented high matching accuracy with a 7.69% averaged error rate, but clearly its frame rate and image resolution are limited. Another GPU implementation was introduced by Wang et al. [43]. It is based on an adaptive aggregation method and dynamic programming, and reaches a frame rate of 43 FPS for 320 × 240 images and 16 disparity levels on an ATI Radeon XL1800 graphics card. Compared with the work by Yang et al. [46], it achieves higher frame rates, but its accuracy degrades to 9.82%. Later, the authors of the VariableCross algorithm [49] also implemented the same algorithm on a GeForce GTX 8800 [50], which achieves 100.9 FPS with 384 × 288 images and a 16 disparity range; however, the frame rate drops to 12 FPS with 450 × 375 resolution and 64 disparities. Most recently, Humenberger et al. [20] optimized the Census Transform on a GeForce GTX 280 GPU, reaching 105.4 FPS with 450 × 375 resolution and a 60 disparity range while maintaining high accuracy with a 9.05% averaged error rate. Although notable performances have been achieved with GPU implementations, GPUs are not efficient in power and memory consumption and hence not suitable for highly integrated embedded systems. Furthermore, these works rarely report their performance on high-definition stereo matching. Chang et al.
[3] investigated the performance of using a DSP as the implementation platform. In this work, the authors proposed a 4 × 5 jigsaw matching template and a dual-block parallel processing technique to enhance the stereo matching performance on a VLIW DSP. The DSP implementation achieved 50 FPS with a disparity range of 16 for 384 × 288 stereo matching. Nevertheless, its accuracy degrades considerably, with an above 20% averaged error rate, and its frame rate drops to 9.1 FPS with 450 × 375 images and a 60 disparity range.

Implementations on CPUs, GPUs and DSPs are all software-based approaches. The major problem with software is that the computing logic and data paths are fixed and not configurable; as a result, they are not optimized for the available algorithms. Furthermore, their performance also requires very high device clock frequencies and memory bandwidth. To overcome these software limitations, recent works have also investigated the performance of dedicated hardware designs. Chang et al. [4] proposed a high performance stereo matching algorithm with Mini-Census and Adaptive Support Weight (MCADSW) and also provided its corresponding real-time VLSI architecture. The design is implemented in UMC 90nm ASIC technology, and reaches 42 FPS with 352 × 288 images and a 64 disparity range. It also preserves high matching accuracy (unknown average) on the benchmark image sets. Jin et al. [23] proposed a fully pipelined hardware design using the Census Transform and the Sum of Hamming Distances. This work achieves 230 FPS with 640 × 480 resolution and 64 disparities; however, its averaged error rate is 17.24%. Though the two hardware implementations achieve remarkable performance, their image resolutions are limited by the on-chip memories, due to relatively high storage requirements or less efficient hardware design. In Chapter 5, we provide a comparison between our implementation and all of the above mentioned references.
2.4 Design Considerations and Targets

In order to enable real-time processing speed while preserving high stereo matching accuracy, we considered the following aspects when choosing an algorithm and making adaptations for our implementation.

1. Matching accuracy: Our target implementation should achieve less than a 10% averaged error rate.

2. Robustness: The frames to match in stereo matching come from two different cameras and hence might exhibit very different luminance or radiometric variations, incurring a severe matching cost bias to which some stereo matching cost functions are very sensitive. Our implementation should also be robust, i.e., exhibit no or only slight matching accuracy variation in this situation.

3. Real-time implementation: Our design targets at least 50 frames per second with programmable resolutions, and high-definition stereo matching is desired.

4. Implementation complexity: Our design should avoid complex computations such as divisions and square root calculations.

5. Implementation scalability: The proposed hardware structure should scale well with the maximum allowed resolution and disparity range. Our desired target is HDTV resolution.

Compared with software approaches, the major benefit of using an FPGA or ASIC is the great flexibility in logic and data path design, which offers extreme parallelism in data flow and pipelined processing. They therefore enable highly efficient and optimized designs and implementations. Based on the related work and our investigation, we believe customized hardware design is the most suitable approach for our proposed algorithm. Since FPGAs provide great resources for parallel computing and fast prototyping, we choose an FPGA to implement our algorithm. To achieve fast prototyping and enable future development, we select the Terasic DE3 development board as the development platform. The board contains an Altera EP3SL150 FPGA and rich peripherals such as a JTAG UART port, DDR2 SDRAM, camera interfaces and a DVI extension card.
3 Stereo Matching Algorithm: Software-Hardware Co-Design

In this chapter, we introduce the proposed stereo matching algorithm, which is modified from the Adaptive Cross-Based Local Stereo Matching proposed by Zhang et al. [49]. We modify this algorithm in order to make it more robust and more hardware friendly. Key to the best processing throughput is to fully pipeline all processing steps, i.e., as source image pixels come in progressively in scanline order, a newly arrived pixel gets its disparity at the end of the pipeline after a certain pipeline latency, and valid disparities also come out successively in scanline order, synchronized with the input pixel rate. To enable a fully pipelined implementation, parallelization is also important to provide all required data at the entrance of a pipelined processing module.

Our proposed algorithm and the corresponding hardware pipeline design are introduced together in a top-down approach. Section 3.1 discusses the overview of the whole stereo matching processing and the proposed data-level parallelism. The full pipeline is divided into three major processing blocks, i.e., pre-processing, stereo matching and post-processing. In hardware their functions are performed by the corresponding Pre-Processor, Stereo-Matcher and Post-Processor, which are introduced in three subsections of this chapter.

3.1 Overview of the Stereo Matching Pipeline

An overview of the stereo matching processing flow is presented in Figure 3.1, corresponding to the aforementioned pseudo algorithm code (see Figure 2.6). Clearly the embedded computing loops cause the major problem for efficient implementations. If a design follows this diagram, e.g., using state machines, raw stereo image data has to be read repeatedly into the processing module, which not only prolongs processing time but also exhausts external memory bandwidth. According to our research, the disparity evaluation loop can be unrolled and all following processing modules can be parallelized accordingly.
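Conceptually, the unrolling replaces the innermost loop of Figure 2.6 by one computing path per disparity hypothesis. The following software sketch illustrates this idea, with NumPy slices standing in for the concurrent hardware paths; the pixel-wise absolute-difference cost and all names are illustrative, not the actual hardware design.

```python
import numpy as np

def parallel_disparity(left, right, d_max):
    """Sketch of the unrolled disparity loop.

    Each disparity hypothesis d gets its own computing path (here: one
    slice of a cost volume); the paths run mutually independently and a
    single WTA stage picks the minimum-cost hypothesis at the end.
    """
    h, w = left.shape
    volume = np.full((d_max + 1, h, w), 255, dtype=np.int32)
    for d in range(d_max + 1):          # unrolled into d_max + 1 paths
        volume[d, :, d:] = np.abs(left[:, d:].astype(np.int32)
                                  - right[:, :w - d].astype(np.int32))
    # Winner-Takes-All across the unrolled disparity axis
    return np.argmin(volume, axis=0).astype(np.uint8)
```

In hardware, the cost volume never exists as a whole: each slice corresponds to one physical computing path, and the WTA stage is a tree of comparators fed by all paths in the same cycle.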
The parallelized data flow is illustrated in Figure 3.2. In the parallelized computing, the matching costs of different hypothetical disparities are computed and aggregated mutually independently. As shown in the figure, multiple concurrent computing paths are employed to enable the parallelization. In principle any number of parallel paths is supported by the algorithm, but in practice this number is usually limited by the available hardware resources. Our proposed hardware design follows the parallelized data flow, with pipelined processing at each algorithmic module.

Figure 3.1: Sequential stereo matching processing flow (modified functions by this thesis work are shaded)

Figure 3.2: Parallelized stereo matching processing flow (modified functions by this thesis work are shaded)

Although the cameras and image rectifications are shown in the figures, they are not implemented in this thesis work. The purpose of their presence is to show that the left and right images are processed independently before the actual stereo matching starts, and that their pixel coordinates must be synchronized to ensure valid stereo matching processing.
Processing modules shown in Figure 3.2 are categorized into three higher level modules, i.e., the Pre-Processor, Stereo-Matcher and Post-Processor. Their functions and inputs/outputs are shown in Figure 3.3. DMAs and Avalon interfaces are implemented for SoC integration, which will be introduced in Chapter 4.

Figure 3.3: Stereo matching pipeline and functions

The hardware pipeline processing sequence is explained below.

1. The DMA transfers a pair of stereo frames to the Pre-Processor, with only luminance intensities. Synchronized luminance pixels from the left and right image are packed into one 16-bit data word.

2. The Pre-Processor performs the median filter, census transform and support region construction on the left and right frame independently and sends its transformed images to the Stereo-Matcher.

3. The Stereo-Matcher first computes the raw matching cost and the overlapped support region at each pixel according to the different hypothesis disparities. Then all the costs in an overlapped support region are aggregated; costs with different hypothesis disparities are processed independently in parallel.

4.
Aggregated costs of all disparity hypotheses are presented to the WTA module by the cost aggregation module, and the best disparity is selected by a tree-based minimum value searching structure.

5. In the Stereo-Matcher, two disparity maps are generated for the left and right image respectively, denoted as image A and B in Figure 3.3.

6. The Post-Processor receives the two disparity maps and support region data from the Stereo-Matcher, and performs the L-R consistency check, local disparity voting and median filtering to detect occluded regions and improve the disparity map quality. The Post-Processor outputs the final disparity maps.

Each processor also contains several control registers for run-time parameter configuration; these parameters include the image resolution, disparity range and some threshold values. In the following sections, we present the proposed algorithm functions together with the corresponding hardware design. We also provide the latency calculations based on a 100 MHz operation clock with the selected EP3SL150 FPGA. If some functions become critical paths with another device technology, it is required to implement more pipeline stages.

3.2 Design of the Pre-Processor Pipeline

Figure 3.4: Pre-Processor internal processing pipeline

The internal processing pipeline of the Pre-Processor is shown in Figure 3.4. Each data word in the pipeline is composed of several fields.
Since most processing modules require data from the left and right video frames simultaneously for parallel computing, the corresponding data word also contains both. In the Pre-Processor, the left and right frames are processed independently, and the corresponding data words are marked with 'L' or 'R' subscripts.

3.2.1 Median Filter

We have applied a 3 × 3 median filter to each raw frame pixel to remove impulsive noise. The function of a median filter is shown in Figure 3.5. The input (DataA) to the median filter module is a systolic stream carrying a combined data word with both left and right luminance pixels in progressive scanline order (see Figure 3.6). Computations on the left and right frames (DataB) are performed independently in parallel, and the output from the median filter module is the filtered data stream (DataC), in synchronization with the input pixel rate.

Figure 3.5: Median filter

Figure 3.6: Median filter data stream

To get the median value from a 3 × 3 array, we first prepare all 9 pixels in a window buffer; afterwards they are captured simultaneously by a follow-up sorting module, which delivers the median value in a few pipeline cycles. The pixel buffer window slides over the whole frame and provides a new 3 × 3 pixel array in every valid pipeline cycle; therefore, after the initial latency for preparing the first window content and sorting the 9 values, we get continuous filtered pixels synchronized with the input pixel rate. Because the pixel stream comes in scanline order, and a 3 × 3 filter window is required for each pixel, 2 horizontal line buffers and 3 shift register arrays of length 3 are used to form a valid filter window.
The structure and interconnections of the line buffers and shift registers are illustrated in Figure 3.7. In the implementation, each line buffer stores one scanline of 8-bit luminance pixels, and each WinReg contains one 8-bit luminance pixel. The Row Index Counter cyclically counts from 0 to (frame width - 1) and provides read/write addresses for the line buffer memory blocks. Because the line buffer memory has a 2-cycle read latency, its write address and write data are also delayed by 2 cycles to match the processing sequence. This module prepares all the required pixels for computing the median for the center pixel p(x, y), buffered in WinReg4.

Figure 3.7: Line buffers and registers in a median filter

The depth of a line buffer is determined by the maximum allowed frame width. A 1024 × 8 SRAM block supports any frame width no larger than 1024. An input pixel p(x, y) from the DataB stream has to wait for 2 (read latency) + frame width + 2 (WinReg3 and WinReg4) cycles to arrive in WinReg4. As pixels come in continuously, the 3 × 3 pixel window also slides over the whole frame in scanline order (see Figure 3.8; the four boundary pixel rows/columns are not filtered and by-pass the sorting module) and provides all neighboring pixels of each center pixel simultaneously for the following sorting module. The 9 luminance values in a filter window are captured by a sorting module to get their median. We have adopted the median sorting structure proposed by Vega-Rodríguez et al. [41] and implemented suitable pipeline stages. The implemented median sorting module is shown in Figure 3.9.
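A behavioral software model of this line-buffer arrangement is sketched below. It ignores the 2-cycle SRAM read latency and the sorting pipeline, and all names are illustrative; the point is that two scanline delays plus a 3 × 3 register window yield one median per incoming pixel, with boundary pixels passed through unfiltered.

```python
from collections import deque

def median3x3_stream(pixels, width, height):
    """Stream a frame through two line buffers and a 3x3 register window.

    `pixels` arrive one per cycle in scanline order, as in the DataB
    stream. line1/line2 delay the stream by one and two scanlines, so
    the three taps feeding the window columns always hold three
    vertically adjacent pixels.
    """
    line1 = deque([0] * width)             # one-scanline delay
    line2 = deque([0] * width)             # two-scanline delay
    win = [[0] * 3 for _ in range(3)]      # the 3x3 WinReg array
    out = list(pixels)                     # borders pass through unfiltered
    for i, p in enumerate(pixels):
        r, c = divmod(i, width)
        top = line2.popleft()              # pixel from row r - 2
        mid = line1.popleft()              # pixel from row r - 1
        line2.append(mid)
        line1.append(p)
        for row, tap in zip(win, (top, mid, p)):
            row.pop(0)                     # shift the window one column
            row.append(tap)
        if r >= 2 and c >= 2:              # window full: center is (r-1, c-1)
            out[(r - 1) * width + (c - 1)] = sorted(
                v for row in win for v in row)[4]
    return out
```

The `r >= 2 and c >= 2` condition mirrors the initial latency derived above: the first valid median appears only after two full scanlines plus a few pixels have been buffered.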
Each basic building block in this systolic array is a combination of a comparator and two multiplexers, which performs a basic sorting function on two unsigned values. The full array is divided into 3 pipeline stages, so the desired median is available after 3 valid pipeline cycles. Median filtering is performed on the left and right frames independently in parallel, and the output streams of this module are the filtered stereo frames (DataC in Figure 3.4). The pipeline latency of the median filter module is given by Equation 3.1.

Latency_mf = frame_width + C_mf    (3.1)

In the equation, C_mf is a small constant value depending on the above mentioned read latency, shift register cycles and the actual implemented pipeline stages. Quantitative reports of the latencies are presented in Chapter 5.

Figure 3.8: Sliding window for a median filter

Figure 3.9: Median filter sorting array

3.2.2 Census Transform and Adaptive Support Region Construction

3.2.2.1 Census Transform

The census transform is a process extracting relative information for the center pixel from its neighbors, and its output is a bit vector encoding this relative information. The census transform has a few variants depending on different patterns of selected neighbor pixels; the one we have adopted is the Mini-Census algorithm proposed by Chang et al. [4], and the transform is illustrated in Figure 3.10.
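Functionally, a census-style transform and its Hamming-distance matching cost can be sketched as follows. The actual 6-pixel mini-census neighbor pattern follows Figure 3.10 and Chang et al. [4], so the offsets are left as a caller-supplied parameter here; all names are illustrative.

```python
def mini_census(img, x, y, offsets):
    """Census transform sketch for the pixel at (x, y).

    `offsets` is the list of (dx, dy) neighbor positions of the census
    pattern. A bit is '0' when the neighbor is brighter than the center,
    '1' otherwise, matching the rule used in this design.
    """
    center = img[y][x]
    bits = 0
    for dx, dy in offsets:
        bits = (bits << 1) | (0 if img[y + dy][x + dx] > center else 1)
    return bits

def hamming(a, b):
    """Raw matching cost: number of differing bits in two census vectors."""
    return bin(a ^ b).count("1")
```

For the example vectors of Figure 3.10, `hamming(0b111000, 0b111011)` gives the raw matching cost 2.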
Figure 3.10: Mini-Census transform

In the transform, each square in the census window represents a (median filtered) 8-bit luminance pixel, but only the pixel locations marked with values are considered for the two example center pixels. If the luminance of a neighbor pixel under consideration is larger than that of the center pixel, the corresponding bit is set to '0', otherwise to '1'. All of the comparison results form a 6-bit vector as the output of the mini-census transform. In the later Raw Cost Scatter module, the cost function computes the Hamming distance between the mini-census vectors of two hypothesis pixels.

Using the census transform and Hamming distance as the cost function has been proven to be very robust to radiometric distortions [19], and much more hardware-friendly than other methods, e.g., the Rank filter and Normalized Cross-Correlation (NCC). Another important advantage of the census transform is that it saves two valuable bits by converting an 8-bit luminance pixel to a 6-bit census vector. In SAD based cost computation, the maximum raw cost of two pixels is 255 (8-bit storage), while in Hamming distance calculations the maximum raw cost is only 6 (3-bit storage). If we also consider the cost aggregation over a support region, the mini-census transform significantly reduces the memory requirements.

3.2.2.2 Adaptive Support Region Construction

Aggregating the matching cost in an adaptive support region is the key idea of the VariableCross algorithm. The approach of adaptive support region construction is to decide an upright cross for each pixel p(xp, yp) in the input frame. As shown in Figure 3.11, this pixel-wise adaptive cross consists of two orthogonal line segments, intersecting at the anchor pixel p.
The cross of each pixel is defined by a quadruple {hp−, hp+, vp−, vp+} that denotes the left, right, up, and bottom arm length, respectively (see Figure 3.12). Since the cross-based support region is a general concept, there are a variety of approaches to decide the four adaptive arm lengths.

Figure 3.11: Local adaptive cross and support regions on an image. (a) A pixel-wise adaptive cross defines a local support skeleton for the anchor pixel, e.g., p. (b) A shape adaptive full support region U(p) is dynamically constructed for the pixel p, integrating multiple horizontal line segments H(q) of neighboring crosses. (c) Sample shape adaptive local support regions, approximating local image structures appropriately.

Figure 3.12: Adaptive support region representation. It shows the configuration of a local upright cross H(p) ∪ V(p) for the anchor pixel p, and the constructed full support region U(p). Pixel q is on the vertical segment V(p).

Here we implement an efficient approach based on luminance similarity: for each anchor pixel p(xp, yp) in the frame, luminance difference evaluations are performed in the four directions on its consecutive neighboring pixels in order to find the largest span r* in each direction. The computation of r* is formulated in Equation 3.2.

r* = max_{r ∈ [1, L]} ( r · ∏_{i ∈ [1, r]} δ(p, p_i) )    (3.2)

In the equation, p_i = (xp − i, yp) and L is a preset maximum allowed directional arm length; δ(p1, p2) is an indicator function evaluating the luminance difference between the pixels p1 and p2 based on a given threshold value τ.

δ(p1, p2) = 1 if |Y(p1) − Y(p2)| ≤ τ, and 0 otherwise    (3.3)

The value of τ is a run-time configurable parameter that controls the confidence level of luminance similarity. It is set empirically by the end user according to different applications.
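Equations 3.2 and 3.3 for a single direction (here the left arm, where p_i = (xp − i, yp)) translate directly into the following sketch; function and parameter names are illustrative.

```python
def left_arm_length(row, xp, tau, L):
    """Largest left span r* for the anchor pixel at column xp (Eq. 3.2/3.3).

    `row` holds the luminance values Y of one scanline. The span grows
    while every consecutive neighbor stays within the luminance
    threshold tau of the anchor; delta() is the indicator of Eq. 3.3.
    """
    def delta(a, b):
        return 1 if abs(a - b) <= tau else 0

    r_star = 0
    for r in range(1, min(L, xp) + 1):   # p_i = (xp - i, yp) must exist
        if delta(row[xp], row[xp - r]) == 0:
            break                        # the product of deltas becomes 0
        r_star = r
    return r_star
```

The other three arms are computed identically along their respective directions; in the hardware, all four directions are evaluated concurrently by priority encoders, as described in Section 3.2.2.3.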
Its effect on the benchmark images is discussed in Chapter 5.

Based on the four arms {hp−, hp+, vp−, vp+}, the adaptive cross of pixel p is given by Equation 3.4.

H(p) = {(x, y) | x ∈ [xp − hp−, xp + hp+], y = yp}
V(p) = {(x, y) | x = xp, y ∈ [yp − vp−, yp + vp+]}    (3.4)

The shape adaptive full support region U(p) of each frame pixel is then built by integrating all the horizontal segments H(q) of the pixels q residing on the vertical segment V(p). The output of the support region construction module is a data word containing the four arm lengths {hp−, hp+, vp−, vp+}, which are used in the Cost Aggregation and Disparity Voting modules.

3.2.2.3 Combined Hardware Processing

Because both the mini-census transform and the adaptive support region construction are window-based processing steps, and they operate on the same data stream (the filtered stereo frames), they are designed to share the same line and window buffers. The left and right frames are again processed independently in parallel. In order to compute all four arm lengths simultaneously, 2 × Lmax lines of the filtered frame are stored in line buffers, and a shift register matrix of (2·Lmax + 1) × (2·Lmax + 1) is used to provide all the neighboring pixels of the centered anchor pixel. The configuration and interconnections of the line buffer memory blocks are similar to those in the median filter module (see Figure 3.7), with a different number of blocks. One important hardware parameter in this module is the maximum allowed arm length Lmax, which is set to 15 in our implementation. Its effect on the matching accuracy is discussed in Chapter 5. To keep the illustrations clear, Lmax is set to 7 in Figure 3.13. In Figure 3.13, the centered pixel's mini-census transform and support region are ready to be computed.
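A software model of Equation 3.4 and the integration of the horizontal segments into U(p) is sketched below; the data layout and names are illustrative, not the hardware representation.

```python
def support_region(arms, xp, yp):
    """Enumerate the full support region U(p) of Equation 3.4.

    `arms[y][x]` holds the quadruple (h_minus, h_plus, v_minus, v_plus)
    of pixel (x, y). U(p) integrates the horizontal segments H(q) of all
    pixels q on the vertical segment V(p).
    """
    _, _, v_minus, v_plus = arms[yp][xp]
    region = []
    for y in range(yp - v_minus, yp + v_plus + 1):   # q runs over V(p)
        h_minus, h_plus, _, _ = arms[y][xp]          # arms of q = (xp, y)
        region.extend((x, y) for x in range(xp - h_minus, xp + h_plus + 1))
    return region
```

Note that the hardware never enumerates U(p) explicitly; only the arm quadruples are stored, and the aggregation over U(p) is performed with orthogonal horizontal and vertical summation passes.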
Computing the census transform is straightforward: the luminance values of the shaded pixels are compared with the center pixel's luminance concurrently, and the census vector is obtained by packing the result bits in a certain order. The four arm lengths of the center pixel's adaptive cross are also computed in parallel. Take h+p for example: first, all 7 pixels in the anchor pixel's right direction are evaluated concurrently and the corresponding δ(p, pi) values are obtained; then the packed δ(p, pi) values are sent to a priority encoder that performs the function of Equation 3.2. Because the input data stream to this module carries consecutive video frames, the pixels in a window do not necessarily come from the same frame. To solve this problem, pixel row and column counters are also implemented in this module, and the outlier pixels in the window are masked before computing the census vector and support regions.

Figure 3.13: Census transform and adaptive support region construction window

The pipeline latency of the census transform and adaptive support region construction module is given by Equation 3.5.

Latency_ctsr = 15 × frame_width + C_ctsr        (3.5)

3.2.3 Output Logic

The output logic packs all the data words produced by the Pre-Processor and prepares the additional data fields required by the following Stereo-Matcher processor. The content of the 40-bit output data word is listed below, from most significant bit to least. A data field followed by 'L' indicates a value from the left frame, and 'R' indicates a value from the right frame, with the same pixel coordinates.
• 4-bit all '0'
• 6-bit mini-census vector L
• 8-bit horizontal arms (h−p and h+p) L
• 6-bit mini-census vector R
• 8-bit horizontal arms (h−p′ and h+p′) R
• 8-bit vertical support region segment (v−p and v+p) L

Because we set the maximum allowed arm length to 15, a 4-bit word is sufficient to represent any value in the quadruple {h−p, h+p, v−p, v+p}. The 4-bit all-'0' field is simply used to pad the data word width to a multiple of 8, as required by the Avalon bus.

3.3 Design of the Stereo-Matcher Pipeline

Figure 3.14: Stereo-Matcher processor internal processing pipeline

An overview of the internal processing pipeline of the Stereo-Matcher processor is shown in Figure 3.14. The Stereo-Matcher processor operates on the data stream coming from the Pre-Processor (DataE). The data from the Pre-Processor are always associated with a new stereo frame pair and pass through each pipeline stage in the Stereo-Matcher. Once the final results (L-R disparity maps) are available, the Stereo-Matcher sends them to the Post-Processor.
In the figure, before the vertical aggregation step there are two major data paths, denoted Front and Delay respectively; they implement the data reuse technique applied in the vertical aggregation. After the vertical aggregation step there are again two major data paths, but now they are associated with the left and right disparity maps respectively.

As a local stereo matching algorithm, the key task is to find the best correspondence for a pixel p(x, y) in the left frame within a local area of the right frame. Because the stereo frames are already rectified, this local area shrinks to a 1-dimensional segment of pixels located on the same scanline (see Figure 3.15).

Figure 3.15: 1-dimensional local search area. The number at each pixel location indicates the raw matching cost.

Given a pair of hypothetical correspondences, i.e., p(x, y) in the left frame and p′(x′, y′) in the right frame, we define the following variables before inferring the best correspondence. Here the coordinates of p and p′ are correlated with a hypothetical disparity d ranging from 0 to dmax (the maximum disparity under consideration), i.e., x′ = x − d and y′ = y.

• Raw Matching Cost: the Hamming distance between their Mini-Census vectors, i.e., HammingDist(cvp(x, y), cvp′(x′, y′))
• Correlation Region: the overlapped area of the two pixels' support regions
• Aggregated Matching Cost: the accumulated raw matching costs in their correlation region, i.e., AggCost(x, y, d)
• Aggregated Pixel Count: the total number of pixels in their correlation region, i.e., PixCount(x, y, d)
• Averaged Matching Cost: the aggregated cost divided by the aggregated pixel count in their correlation region, i.e.,

AvgCost(x, y, d) = AggCost(x, y, d) / PixCount(x, y, d)        (3.6)

Therefore the best correspondence is defined by the hypothetical pair that gives the minimum averaged matching cost.
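The raw matching cost defined above can be modeled with a few lines of software: one mini-census vector per pixel, then a Hamming distance between vectors. The 6-neighbor sampling pattern below is an illustrative assumption; the exact comparison order packed by the hardware is a design detail not restated here.

```python
# Hypothetical sparse (dx, dy) offsets for a 6-bit mini-census vector.
OFFSETS = [(-2, 0), (2, 0), (0, -2), (0, 2), (-1, -1), (1, 1)]

def mini_census(luma, x, y):
    """Pack one bit per sampled neighbor: 1 if the neighbor is darker than center."""
    vec = 0
    for dx, dy in OFFSETS:
        bit = 1 if luma[y + dy][x + dx] < luma[y][x] else 0
        vec = (vec << 1) | bit
    return vec

def raw_cost(cv_p, cv_q):
    """Raw matching cost: HammingDist(cv_p, cv_q) on the 6-bit census vectors."""
    return bin(cv_p ^ cv_q).count("1")
```

AvgCost(x, y, d) then follows Equation 3.6 by accumulating `raw_cost` over the correlation region and dividing by its pixel count.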
The hypothetical disparity d that leads to the minimum averaged cost is recorded as the result disparity dp(x, y) for pixel p(x, y). All the computed disparities form the disparity map for the left frame.

For a given pixel p(x, y) in the left frame, cost aggregations in correlation regions with different hypothetical disparities are independent of each other, so it is possible to compute them all in parallel; all the aggregated costs and pixel counts are then available simultaneously as the outputs of the parallel computing threads. Therefore the minimum averaged cost is picked out using the Winner-Takes-All method. To make this parallel computation work, two key problems have to be solved:

1. Raw matching costs must be provided to all computing threads simultaneously.
2. All computing threads must be fully pipelined to match the input pixel data rate.

Solutions to the two problems are explained in Section 3.3.1 and Section 3.3.2 respectively. The Stereo-Matcher processor provides disparity maps for both left and right stereo frames, which are sent to the Post-Processor for the L-R Consistency Check that detects mismatched correspondences caused by occlusion. The modules for generating both disparity maps are introduced in Section 3.3.3. We also present a method with an aggregation data reuse technique in Section 3.3.2, which significantly reduces the required line buffer memories.

3.3.1 Raw Cost Scatter and Correlation Region Builder

The task of this module is to provide raw matching costs for all hypothetical disparities simultaneously. The input data stream to this module carries the census vectors cvp(x, y) and cvp′(x′, y′) for pixel p(x, y) in the left frame and pixel p′(x′, y′) in the right frame, respectively. Because the x′ coordinate of pixel p′ ranges from x to x − dmax over all hypothetical disparities, an array of shift registers is used to buffer the full disparity range for pixel p.
Assuming dmax = 3, as shown in Figure 3.16, in a certain pipeline cycle t the census vector cvp(x, y) arrives in RegL while in the same cycle the census vector cvp′(x, y) arrives in RegR0; meanwhile all the census vectors stored in the RegR array shift one position to the right. Therefore the desired Hamming distances HammingDist(cvp(x, y), cvp′(x − d, y)), with d ranging from 0 to 3, are available simultaneously by comparing the content of every register in the RegR array with the content of RegL.

Figure 3.16: Scattering raw matching costs using shift register arrays

Similarly, the horizontal arms (h−p, h+p, h−p′, h+p′) for both p and p′ are buffered using the same technique. Since the correlation support region is defined as the overlapped area of the two pixels' support regions, the horizontal arms of each correlation support region (h−pc, h+pc) are obtained by selecting the shorter arm:

h−pc = h−p if h−p ≤ h−p′, otherwise h−p′
h+pc = h+p if h+p ≤ h+p′, otherwise h+p′

As a result, this module provides 4 parallel streams of data, carrying the raw matching costs and horizontal arms in scanline order for the 4 hypothetical disparities. The 4 streams are scattered into 4 parallel data paths, which lead to the follow-up cost aggregation modules. In the next cost aggregation step we use fixed vertical arms for the correlation region, i.e., v−pc = v+pc. The adaptive vertical arms v−p and v+p are used in the later disparity voting step; they are put into a FIFO to bypass the cost aggregation modules. The latency of this module is a small constant, depending on the implemented pipeline stages.
Latency_rcs = C_rcs        (3.7)

3.3.2 Parallelized Cost Aggregations

After the raw cost scatter module, cost aggregations can be performed on all raw cost streams in parallel, and mutually independently, by (dmax + 1) parallel computing threads. For each thread, aggregating all the raw costs in a correlation region would be achievable through the line buffer plus register matrix configuration used in the Pre-Processor, but the arbitrarily shaped 2D support region makes this solution not hardware friendly. We solve this problem by decomposing the 2D aggregation into two orthogonal 1D aggregations, i.e., a horizontal step followed by a vertical one (see Figure 3.17).

Figure 3.17: Two orthogonal 1D aggregations

Figure 3.17 (a) shows the correlation region of pixel p (shaded region), and q indicates a pixel lying on the vertical segment of p. First we aggregate all the raw costs in the horizontal segment of pixel q. As pixel q slides along the vertical segment of pixel p, we obtain the aggregated cost in the horizontal segment of each q. Next we aggregate all the horizontally aggregated costs along the vertical segment of pixel p, as shown in Figure 3.17 (b - c). The cost aggregated by this two orthogonal 1D process is exactly identical to the aggregation result in the 2D region, and meanwhile this technique also simplifies the hardware implementation. Accordingly we divide each cost aggregation module into two sub-modules, namely Horizontal Cost Aggregation and Vertical Cost Aggregation; they are introduced in the following two sub-sections.

3.3.2.1 Horizontal Cost Aggregations

The input data stream to a horizontal aggregation module carries the raw matching cost and horizontal arms (h−pc, h+pc); the output of this module is the partially aggregated cost in the horizontal segment of each left frame pixel.
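The equivalence of the 2D aggregation and the two orthogonal 1D passes can be checked with a small software model; the arm values used below are arbitrary test data, not taken from the design.

```python
def aggregate_2d(cost, region):
    """Direct 2D aggregation: sum the cost over every (x, y) in the region."""
    return sum(cost[y][x] for (x, y) in region)

def aggregate_1d_then_1d(cost, h_arms, x, y, v_minus, v_plus):
    """Horizontal pass for each pixel q on the vertical segment of p(x, y),
    then a vertical pass over the horizontal sums (Figure 3.17)."""
    total = 0
    for yq in range(y - v_minus, y + v_plus + 1):
        h_minus, h_plus = h_arms[yq]
        total += sum(cost[yq][xq] for xq in range(x - h_minus, x + h_plus + 1))
    return total
```

Because every region pixel is counted exactly once in exactly one horizontal segment, the two functions return the same value for any cross-based region.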
The horizontal partial aggregation for each pixel is performed using integral costs.

Figure 3.18: Integral costs and horizontal cost aggregation

As shown in Figure 3.18, assuming pixel p has horizontal arms h−pc = 3 and h+pc = 5, we first calculate the integral for each pixel in a scanline; the horizontally aggregated cost of pixel p then equals the difference of the two integrals indicated by h−pc and h+pc, i.e., (a, b) = (0, b) − (0, c). Since the input raw costs come in scanline order, the integral computing unit is implemented with an accumulator, which sends its output to an array of shift registers for concurrently accessing the two integrals randomly chosen by h−pc and h+pc. The hardware logic diagram is shown in Figure 3.19, assuming the maximum arm length Lmax is 15. A shift register array of length (2 × Lmax + 1) is needed to buffer all possible integrals for determining the horizontally aggregated cost. Once the raw cost of pixel p arrives in Reg14, its horizontal arms h−pc and h+pc also come out of the delay line, so the two boundary integrals indicated by the two arms are selected by the two (L + 1)-to-1 multiplexers. The difference of the two multiplexer outputs is the desired horizontally aggregated cost, and the pixel count in the horizontal segment is obtained as h−pc + h+pc + 1. The latency of this module is mainly determined by the maximum arm length Lmax, because an incoming raw cost has to arrive in Reg14 before its associated aggregation can be computed.

Latency_hagg = Lmax + C_hagg        (3.8)
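A software model of the integral-cost trick of Figure 3.18: one prefix-sum pass per scanline, then each pixel's horizontal aggregation is the difference of two integrals selected by its arms. Names are illustrative, and the sketch assumes the arms stay inside the scanline.

```python
def horizontal_aggregate(row_costs, arms):
    """row_costs: raw costs of one scanline; arms[x] = (h_minus, h_plus).
    Returns (aggregated cost, pixel count) for every pixel x."""
    integral = [0]                         # integral[i] = sum of row_costs[0:i]
    for c in row_costs:
        integral.append(integral[-1] + c)
    out = []
    for x, (hm, hp) in enumerate(arms):
        agg = integral[x + hp + 1] - integral[x - hm]   # (a, b) = (0, b) - (0, c)
        out.append((agg, hm + hp + 1))                  # cost and pixel count
    return out
```

The prefix sum plays the role of the accumulator, and the two indexed lookups stand in for the (L + 1)-to-1 multiplexers on the register array.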
Figure 3.19: Horizontal cost aggregation logic

3.3.2.2 Vertical Cost Aggregations

The input data stream to a vertical aggregation module carries the horizontally aggregated costs and pixel counts associated with a certain hypothetical disparity; the output of this module is the fully aggregated cost and pixel count in the correlation region of each left frame pixel. Because the input data still come in scanline order, and the vertical aggregation requires a vertical data vector for each pixel, line buffers are employed to convert the input data stream into a vertical vector stream. In contrast with the horizontal aggregation, vertical aggregation modules use a fixed aggregation span for each pixel instead of the adaptive vertical arms. This approach significantly reduces the line buffers needed for each vertical aggregation module, with negligible influence on the disparity map quality. Based on our experiments, a vertical aggregation span (Vspan) of 5 (see Figure 3.17 (b)) gives the best trade-off between matching accuracy and hardware complexity. The vertical aggregation span covers the anchor pixel's scanline and the adjacent lines defined by v−pc and v+pc. For a fixed aggregation span, v−pc = v+pc = vpc, and we define Vspan with Equation 3.9.

Vspan = v−pc + v+pc + 1 = 2 × vpc + 1        (3.9)

Vspan is a hardware parameter that determines the depth of the Delay Buffer and the pipeline latency. The effect of Vspan on matching accuracy is introduced in Chapter 5. Moreover, with a fixed vertical aggregation span a more efficient data reuse technique becomes applicable. The concept of this technique is illustrated in Figure 3.20, where the pixels p(x, y), q(x, y + 1) and r(x, y + 2) are located in the same vertical column and their correlation regions are shaded.
Obviously, the aggregated costs of pixels q(x, y + 1) and r(x, y + 2) fulfill Equation 3.10.

Figure 3.20: Aggregation data reuse in vertical aggregation. (a): pixel p(x, y) and its correlation region. (b): pixel q(x, y + 1) and its correlation region. (c): pixel r(x, y + 2) and its correlation region.

AggCost(x, y + 1, d) = AggCost(x, y, d) + AggCosth(x, y + 3, d) − AggCosth(x, y − 2, d)
AggCost(x, y + 2, d) = AggCost(x, y + 1, d) + AggCosth(x, y + 4, d) − AggCosth(x, y − 1, d)        (3.10)

In Equation 3.10, AggCosth(·) returns the horizontally aggregated cost in the bordered region; the vertically aggregated cost is thus reused for vertically adjacent pixels. Without the data reuse method, each vertical aggregation module requires 4 line buffers to provide a vertical vector, and the fully aggregated cost in a correlation region is given by summing up the values in the vertical vector; Figure 3.21 shows the line buffers and aggregation logic configuration without data reuse. With the data reuse method, the line buffer memories need to store only one line of data, and the horizontal aggregations are provided by two data paths accordingly. As shown in Figure 3.14, a Delay Buffer is applied to provide the delayed data flow; in case the vertical aggregation span is 5, the delay buffer should be large enough to store 5 stereo image lines. In contrast with the distributed line buffers in the vertical aggregation modules, the delay buffer is not duplicated; as a result, the overall on-chip memory consumption is significantly reduced by applying the data reuse method. Figure 3.22 shows the line buffer and aggregation logic configuration with data reuse. The expense that comes with the data reuse method is that it requires two blanking lines after each frame in order to generate valid output for the two bottom lines of the frame. Nevertheless, two lines are trivial for common image sizes, especially for high-definition images and videos.
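The running-sum update of Equation 3.10 can be modeled in software as follows; `hagg` holds the horizontally aggregated costs, the names are illustrative, and the sketch assumes the full vertical span fits in the frame (border and blanking-line handling omitted).

```python
def vertical_aggregate_reuse(hagg, x, vspan):
    """Yield (y, AggCost at (x, y)) for one column, reusing the previous row's
    sum: add the entering horizontal aggregate, subtract the leaving one."""
    r = vspan // 2                           # fixed arms: v- = v+ = r
    col = [row[x] for row in hagg]
    agg = sum(col[0:vspan])                  # full sum for the first anchor row
    results = [(r, agg)]
    for y in range(r + 1, len(col) - r):
        agg += col[y + r] - col[y - r - 1]   # Equation 3.10 update
        results.append((y, agg))
    return results
```

Each new anchor row costs one addition and one subtraction instead of a full Vspan-term sum, which is what lets the hardware keep only one line of data per module plus the shared delay buffer.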
The latency of this module is given by Equation 3.11.

Latency_vagg = 2 × frame_width + C_vagg        (3.11)

Figure 3.21: Vertical cost aggregation without data reuse

Figure 3.22: Vertical cost aggregation with data reuse

After the initial pipeline latency, in each follow-up pipeline cycle an aggregation module delivers the aggregated cost and pixel count in a correlation region associated with a certain disparity. The aggregation modules are fully pipelined, so in each cycle the result is also associated with a new pixel, in scanline order. The data flow is illustrated in Figure 3.23, assuming dmax = 3 and 4 parallel computing threads are implemented.
Figure 3.23: Pipelined data flow. cv(x, y): census vector for a pixel (x, y); rc(x, y): raw matching cost for a left frame pixel (x, y), associated with a hypothetical disparity; aggc(x, y): aggregated cost for a left frame pixel (x, y), associated with a hypothetical disparity. The horizontal arms, horizontally aggregated pixel counts and total pixel counts in a correlation region are synchronized with the corresponding cost values but not shown in the flow.

3.3.3 Winner-Takes-All and L-R Disparity Maps

The averaged cost AvgCost(x, y, d) associated with each hypothetical d is ready to be computed from the input streams to this module. A straightforward approach to obtain an averaged value is to implement a divider for each thread. However, dividers are quite expensive on an FPGA in terms of occupied area and hardware resources; moreover, a divider also causes result precision problems. So instead of using dividers in each parallel thread, we propose using multiply-subtract functions to determine the minimum averaged cost. The idea is based on the following equivalence:

AggCost(x, y, d0) / PixCount(x, y, d0) < AggCost(x, y, d1) / PixCount(x, y, d1)
⇔ AggCost(x, y, d0) × PixCount(x, y, d1) < AggCost(x, y, d1) × PixCount(x, y, d0)

So given the (dmax + 1) pairs of AggCost(x, y, d) and PixCount(x, y, d), a tree structure with multiply-subtract functions and multiplexers selects the minimum AvgCost(x, y, d), as shown in Figure 3.24, assuming dmax = 3.
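The division-free winner selection can be sketched as a linear scan using the same cross-multiplication test that each multiply-subtract select unit performs; the hardware select tree evaluates these comparisons pairwise in parallel instead.

```python
def winner_takes_all(agg_cost, pix_count):
    """Index d minimizing agg_cost[d] / pix_count[d], computed without division:
    a/b < c/d  <=>  a*d - c*b < 0, for positive pixel counts."""
    win = 0
    for d in range(1, len(agg_cost)):
        # sign bit of the multiply-subtract result decides the comparison
        if agg_cost[d] * pix_count[win] - agg_cost[win] * pix_count[d] < 0:
            win = d
    return win
```

Only the sign of the multiply-subtract result is needed, which is why the comparison maps onto a single DSP block with no precision loss.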
Figure 3.24: Winner-Takes-All select tree

Each select unit is comprised of a multiply-subtract unit and three multiplexers that track the winning AggCost, PixCount and associated d respectively. The multiply-subtract unit maps perfectly to the dedicated DSP block in our FPGA, with a pipeline latency of only 1 cycle.

In a certain pipeline cycle t, the computing threads give (dmax + 1) aggregated costs and pixel counts based on pixel p(x, y) in the left frame and the set of pixels in the right frame p′(x − dmax, y) ... p′(x, y). In the next cycle t + 1, aggregated costs are given by pixels p(x + 1, y) and p′(x − dmax + 1, y) ... p′(x + 1, y). As a result, the aggregated costs based on pixel p′(x, y) in the right frame and the set of pixels in the left frame p(x, y) ... p(x + dmax, y) are available from cycle t to cycle t + dmax.

Figure 3.25: Compute the disparity map of the right image

This reversed search (see Figure 3.25) provides the result disparity dp′(x, y) for pixel p′(x, y) in the right frame: dp′(x, y) is selected by the minimum averaged cost associated with pixel p′(x, y) and the set of pixels p(x, y) ... p(x + dmax, y). The right disparity map is generated for an L-R Consistency Check, i.e., if pixel p(x, y) in the left frame finds its best correspondence p′(x − dp(x, y), y) in the right frame, then p′(x − dp(x, y), y) should also map back to pixel p(x, y) with its disparity dp′(x − dp(x, y), y). This relationship is formulated by Equation 3.12.
dp′(x − dp(x, y), y) = dp(x, y)        (3.12)

The consistency check is performed in the Post-Processor, so here it is important to generate the right disparity map. Selecting the disparity for pixel p′(x, y) is similar to the process of selecting the disparity for pixel p(x, y), only with a different set of aggregated costs and pixel counts. As discussed above, the required aggregated costs and pixel counts for pixel p′(x, y) arrive consecutively over several pipeline cycles, so we place a Lineup Buffer before the right winner-takes-all module to store them and provide all the required ones simultaneously in a certain cycle. To keep the result disparities for p(x, y) and p′(x, y) synchronized, we also use a similar lineup buffer for the left frame data. The configuration of the lineup buffers is shown in Figure 3.26, with dmax = 3 and 4 parallel threads. In the configuration, each column is a small dual-port memory block that buffers the aggregated costs and pixel counts generated in 4 consecutive cycles. The aggregation results are written into them with a circular addressing pattern. After a 4-cycle latency, all the required data for pixel p′(x, y) are available in the right lineup buffer, with an incremental address pattern; meanwhile, all data for pixel p(x, y) are also available in the left lineup buffer, but with the same read address. Read address patterns for the left and right lineup buffers are shaded in Figure 3.26.
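The reversed search of Figure 3.25 can be modeled in software by indexing the same per-scanline cost volume from the right view's perspective; `avg_cost` below is a plain 2D array standing in for the synchronized thread outputs, and the names are illustrative.

```python
def right_disparities(avg_cost, d_max, width):
    """avg_cost[x][d]: averaged cost of the left pixel at column x for
    disparity d. The right pixel p'(x, .) pairs with left pixel p(x + d, .),
    so its winner is the d minimizing avg_cost[x + d][d]."""
    out = []
    for x in range(width):
        best_d, best_c = 0, float("inf")
        for d in range(d_max + 1):
            if x + d < width and avg_cost[x + d][d] < best_c:
                best_c, best_d = avg_cost[x + d][d], d
        out.append(best_d)
    return out
```

In hardware the diagonal `avg_cost[x + d][d]` accesses are exactly what the lineup buffers line up across (dmax + 1) consecutive pipeline cycles.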
Figure 3.26: Lineup buffers and read patterns

Readout data from the left and right lineup buffers are sent to the left and right winner-takes-all modules respectively, and the computed winner left-right disparities are synchronized at the output ports of the two modules. The latency of this winner-takes-all and L-R disparity map generator is a small constant, depending on the implemented pipeline stages.

Latency_wta = C_wta        (3.13)

3.3.4 Output Logic

The final output of the Stereo-Matcher to the Post-Processor is a 32-bit data word, including the following fields:

• 8-bit horizontal arms (h−p and h+p) L
• 8-bit vertical arms (v−p and v+p) L
• 8-bit disparity map result L, i.e., dp(x, y)
• 8-bit disparity map result R, i.e., dp′(x, y)

The right disparity map is used for checking mismatched correspondences in the L-R Consistency Check module, and the support regions of the left frame pixels are used in the Disparity Voting module to improve the disparity map quality. The L-R disparity maps generated so far are shown in Figure 3.27.
Figure 3.27: L-R disparity maps generated by the Stereo-Matcher (left error rate = 10.8%)

3.4 Design of the Post-Processor Pipeline

Figure 3.28: Post-Processor internal processing pipeline

The internal processing pipeline of the Post-Processor is shown in Figure 3.28. It takes the L-R disparity maps and the support regions of the left frame pixels defined by the quadruple {h−p, h+p, v−p, v+p}. The right disparity map is used to check disparity consistency and correct mismatched correspondences in the left disparity map. The double-checked left disparity map is sent to the following disparity voting modules to further refine its quality. Finally the refined left disparity map passes through a median filter to remove speckles. The median filter gives the final disparity map associated with the left frame, which is used for higher level applications such as view interpolation or feature detection.

3.4.1 L-R Consistency Check

The task of the L-R consistency check module is to verify whether the two disparity maps satisfy Equation 3.12, i.e.,

dp′(x − dp(x, y), y) = dp(x, y)

If dp(x, y) and the corresponding dp′(x − dp(x, y), y) satisfy this equation, dp(x, y) is considered a valid disparity; otherwise dp(x, y) is regarded as a mismatch caused by occlusion. In the latter case dp(x, y) is replaced by the closest valid disparity. The input data stream to this module contains synchronized disparities of the left and right frames respectively, as shown in Figure 3.29.
Figure 3.29: Synchronized L-R disparity data

To check Equation 3.12 for each word dp(x, y) in the left disparity map, we first buffer a series of both left and right disparities, as shown in Figure 3.30. The depth of each disparity buffer should be no less than (dmax + 1). When (dmax + 1) disparities are buffered, the Read Address Generator provides the read address for dp′(x − dp(x, y), y), according to dp(x, y). The Consistency Check block checks whether the two disparities are identical; if so, dp(x, y) is considered valid and registered as the latest valid disparity; otherwise, dp(x, y) is replaced by the last registered value. The Delay Line is used to compensate for the address generator and buffer read latency.

Figure 3.30: L-R consistency check logic

The latency of this L-R consistency check module is determined by the maximum allowed disparity range and a few constant pipeline stages.

Latency_lrcheck = D + C_lrcheck        (3.14)

The output of this L-R consistency check module is the corrected left disparity map dp,cc(x, y); the right disparity map is discarded. The corrected left disparity map is shown in Figure 3.31.

3.4.2 Disparity Voting

Disparity voting searches for the most frequent disparity in a pixel's support region as the statistically optimized disparity for that pixel. Precisely, this corresponds to building a histogram of all disparities dp,cc(x, y) in the support region of pixel p(x, y) (see Figure 3.32), and picking out the disparity associated with the peak histogram bin.
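The histogram vote just described can be sketched directly in software; the hardware builds the bins with population counters rather than a `Counter`, and the region argument here is any iterable of support-region coordinates.

```python
from collections import Counter

def vote_region(disp_cc, region):
    """Most frequent consistency-checked disparity (peak histogram bin) over
    an iterable of (x, y) support-region pixels."""
    hist = Counter(disp_cc[y][x] for (x, y) in region)
    return hist.most_common(1)[0][0]       # disparity of the peak bin
```

For the 2D region of Figure 3.32, `region` would enumerate every pixel covered by the anchor's adaptive cross support.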
Figure 3.31: Left disparity map after consistency check (error rate = 9.53%)

Figure 3.32: Disparity histogram and voted disparity

The philosophy behind this computation is that pixels contained in a local support region mostly originate from the same scene patch, and hence they very likely share similar disparities. Experiments show that this processing considerably improves the disparity map quality. Like the cost aggregation, the disparity voting also operates on an arbitrarily shaped 2D support region, which complicates the hardware design. To simplify this problem, here we again decompose the voting in a 2D region into 2 × 1D voting processes (see Figure 3.33); i.e., the Horizontal Disparity Voting module first searches for the most frequent disparity in each horizontal segment of the support region of pixel p(x, y), then the Vertical Disparity Voting picks out the most frequent one in the vertical segment. This decomposition does not necessarily provide the same result as voting over the 2D region, but it does in most cases; experiments show that it only produces a slightly different disparity map compared with the 2D voting result.

Figure 3.33: 2 × 1D orthogonal voting method

3.4.2.1 Horizontal Disparity Voting

The horizontal disparity voting module is comprised of (dmax + 1) horizontal voting units. Each of them is responsible for building the histogram bin for a certain disparity. The voting unit for disparity 0 is shown in Figure 3.34, assuming the maximum arm length Lmax = 15.
Figure 3.34: Horizontal voting unit for disparity 0

The input to such a unit is a stream containing the consistency-checked disparity dp,cc(x, y) and the corresponding horizontal arms (h−p, h+p). The disparities are buffered in a shift register array with the capacity to store all the possible disparities in the horizontal segment specified by h−p and h+p, which are accessible simultaneously when dp,cc(x, y) arrives in Reg15. In the voting unit for disparity 0, all shift register values are compared with 0, and the related result bit is set to '1' if equal, otherwise '0'. All outputs of the comparators form a population vector that has as many '1' bits as there are occurrences of disparity 0. In the next Support Region Mask stage, the '1' bits that do not belong to the h−p ↔ h+p segment are cleared by a mask vector derived from h−p and h+p. As a result, the mask unit outputs the population vector for disparity 0 in the horizontal segment of pixel p(x, y); the actual number is obtained by the follow-up Population Counter that counts the number of '1' bits in the population vector. Similarly, each of the (dmax + 1) horizontal voting units gives the histogram bin value of the corresponding disparity, and the disparity associated with the histogram peak is obtained using a comparator tree. The latency of this module is determined by the maximum arm length Lmax and a few constant pipeline stages.
Latency_hvote = Lmax + C_hvote (3.15)

The output of the horizontal disparity voting module is a stream that carries the voted disparity in the horizontal segment of each pixel p(x, y).

3.4.2.2 Vertical Disparity Voting

The vertical disparity voting module employs the same voting units as the horizontal disparity voting module, and each unit counts the occurrences of a certain disparity in the vertical segment of pixel p(x, y). Because the incoming data arrive in scanline order, disparities in the vertical segment of pixel p(x, y) are provided by line buffers instead of a shift register array. The line buffer configuration and voting logic are shown in Figure 3.35, assuming the maximum arm length Lmax = 15.

Figure 3.35: Vertical voting logic and line buffers

The latency of the vertical voting module is given by Equation 3.16.

Latency_vvote = 15 × frame width + C_vvote (3.16)

The vertical disparity voting module gives the result of the 2 × 1D voting method, and the disparity map after this stage is considerably improved. The voted disparity map is shown in Figure 3.36; for comparison, the 2D voted result is also presented.

2 × 1D Voting Error Rate = 7.46%    2D Voting Error Rate = 7.54%
Figure 3.36: Voted disparity maps

3.4.3 Median Filter and Output

The final median filter, though not critical, is employed to increase the reliability and accuracy by removing spike points in the disparity map. The median filter concludes the pipeline of the Post-Processor, and the final disparity map associated with the left frame is output (see Figure 3.37).
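The spike-removal step can be modeled by a straightforward median filter. The sketch below assumes a 3 × 3 window, which the text does not state explicitly; the hardware would realize the same operation with a fixed compare network rather than a software sort.

```python
def median3x3(img):
    """Remove spike points with a 3x3 median filter (borders left unchanged)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[yy][xx]
                            for yy in range(y - 1, y + 2)
                            for xx in range(x - 1, x + 2))
            out[y][x] = window[4]   # middle of the 9 sorted values
    return out
```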
Error Rate = 7.17%
Figure 3.37: Final disparity map

4 Stereo Matching SoC Implementations on FPGA

We propose two SoC implementations to verify and evaluate the stereo matching pipeline design. Their major difference is whether or not a result frame buffer is used to store the final disparity maps in the external memory. Figure 4.1 illustrates the implementation with a result frame buffer and two Scatter-Gather DMAs (SG-DMAs) that transfer data between the FPGA and the external memory.

Figure 4.1: Proposed SoC with both source and result frame buffer
Figure 4.2: Proposed SoC with source frame buffer and display

The second implementation, without the result frame buffer but with a DVI-display interface, is shown in Figure 4.2. Their functional differences are introduced in Section 4.2 and Section 4.3. The proposed system is divided over two computing platforms, the PC and the FPGA board, which are introduced in Section 4.1. We also discuss the system interconnections, memory and bandwidth usage estimations in this chapter.

4.1 Overview of the Proposed System

The communication between the PC and the FPGA is based on a JTAG UART link. JTAG UART is a simplified version of the standard UART interface, implemented over a USB connection. As a simplified UART, the JTAG UART does not provide adjustable baud rates and its data rate is limited. As a result, this link is not suitable for real-time image and video communication; but because of its simplicity, we choose it as a transceiver for non-real-time validation.

4.1.1 The PC Tasks

The desktop PC works as the host development and evaluation platform. In the development phase, the EDA software, e.g., Quartus-II and the Nios-II IDE, running on the PC configures and debugs the FPGA hardware and software respectively. In the evaluation phase, the main tasks of the PC are shown in the top part of Figure 4.1 and Figure 4.2. The first task of the PC is to calibrate the stereo cameras. A PC program receives a sequence of stereo image pairs captured by the cameras through the JTAG UART link, analyzes the images and extracts the required rectification data.
The essential output of the calibration program is the rectification matrices, which are used to rectify real-time stereo videos so that they satisfy epipolar geometry. The rectification matrices are downloaded to the DDR2-SDRAM on the FPGA board, again through the JTAG UART link. Since the focus of this work is stereo matching, the image acquisition and rectification modules are not implemented. Another task of the PC is to verify and evaluate the FPGA processing results. A PC program downloads test image pairs to the FPGA board and retrieves the resulting disparity maps, which are compared with the results produced by the PC software algorithm to verify the FPGA hardware development.

4.1.2 The FPGA Tasks

The FPGA chip contains the core subject of this design and implementation, which is categorized into two parts: digital hardware and embedded software. The whole design on the FPGA is built into an Altera SOPC (System on a Programmable Chip) environment, i.e., a Nios-II soft processor integrated with memories, peripherals and system interconnections. Altera provides an abundance of SOPC component cores, e.g., block memory, JTAG UART, DMA and SDRAM controllers, together with its SOPC Builder tool. The SOPC components are connected with each other through the Avalon System Interconnect Fabric [7]. Our proposed stereo matching hardware is also developed as SOPC Builder compatible components, connected with the other SOPC peripherals as well as the Nios-II processor. Most hardware IPs, including our stereo matching cores, provide a memory-mapped interface that is accessible by software running on the Nios-II processor. Several internal registers are employed to control each core and monitor its working status. The FPGA is responsible for all the real-time processing, including stereo matching and result display.
4.2 Introduction to the SoC Hardware

As shown in Figure 4.1 and Figure 4.2, the FPGA hardware is split into two clock domains: the CPU clock domain and the peripheral clock domain. In our implementation the CPU clock domain runs at 100MHz and contains more time-critical components than the 50MHz peripheral clock domain. This split allows Quartus-II to optimize the hardware placement and routing. Components in the two clock domains communicate with each other through a clock crossing bridge. In the peripheral clock domain, the push-button and LED interfaces are implemented using the Altera Parallel IO Core; they are responsible for receiving configuration signals from the push-buttons and showing the system status on the LEDs. The JTAG UART core is connected with the PC through a USB cable and exchanges non-real-time data with the PC-side program. A timer is also implemented in this domain as a performance evaluation tool. In future developments, the camera interfaces and rectification module will also be implemented in this domain; the rectified stereo frames are written to the DDR2-SDRAM for the follow-up stereo matching processing.

The CPU clock domain contains the Nios-II soft processor that regulates the whole system behavior. It is equipped with a dedicated on-chip memory block that contains the complete embedded software instructions and run-time program data. The DDR2-SDRAM interface is implemented using the Altera DDR2-SDRAM High-Performance Controller Core, which is a memory-mapped slave component in the SOPC and provides access for multiple masters. In addition, a PLL is implemented in this controller by SOPC Builder; the PLL works as the clock source for the CPU clock, the peripheral clock and the DDR2 memory interface clock. Our proposed stereo matching hardware pipeline lies in the CPU clock domain and is comprised of the aforementioned three coprocessors: Pre-Processor, Stereo-Matcher and Post-Processor.
Each of them has one memory-mapped slave port and several unidirectional streaming ports. The memory-mapped ports allow the Nios-II CPU to configure computing parameters, e.g., frame resolution and thresholding values, at run-time. The streaming ports work as a data highway for transporting image/video data and processing results between these blocks. The three coprocessors are not granted direct access to the DDR2-SDRAM controller; instead, access is provided by the SG-DMAs. Since the stereo matching cores mostly require continuous pixel data reads/writes, implementing a random-addressing master interface for them to access the external memory is not necessary; it would complicate the stereo matching hardware design and make it more error-prone. Compared with traditional DMA cores, which only queue one transfer command from the CPU at a time, the SG-DMA is able to read a series of descriptors stored in the Descriptor Memory and execute data transactions accordingly, without additional intervention from the CPU. The descriptor series is organized as a linked list and created by the CPU at the beginning of an automatic transaction sequence; each descriptor specifies the source, the destination and how much data to transfer. Besides traditional memory-to-memory transfers, an SG-DMA also supports memory-to-stream and stream-to-memory transfers, which perfectly match the proposed continuous stereo matching data flow. In Figure 4.1, two SG-DMAs are implemented to transfer the stereo images and the disparity maps respectively. This implementation is used to measure the processing speed with both source and result frame buffers. Since the stereo matching coprocessors are all fully pipelined, they have no internal processing bottlenecks; their actual performance depends on the availability of the data source and sink, which is determined by the efficiency of the SG-DMAs and the DDR2-SDRAM controller, and by possible contention for the external memory.
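The descriptor linked list can be pictured with a small model. Field names such as `read_addr` and `write_addr` are illustrative only, not the exact Altera descriptor layout; the point is that the CPU builds the whole chain once and the SG-DMA then walks it autonomously.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Descriptor:
    """One SG-DMA transfer descriptor (field names are hypothetical)."""
    read_addr: Optional[int]        # source address, or None for a stream source
    write_addr: Optional[int]       # destination address, or None for a stream sink
    length: int                     # bytes to transfer
    next: Optional["Descriptor"] = None

def build_frame_descriptors(base, frame_bytes, n_frames):
    """Linked list that streams n_frames consecutive frames out of memory."""
    head = None
    for i in reversed(range(n_frames)):
        head = Descriptor(read_addr=base + i * frame_bytes,
                          write_addr=None,      # memory-to-stream transfer
                          length=frame_bytes,
                          next=head)
    return head
```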
Once the data source or sink is not ready to provide/accept new data, the full stereo matching pipeline has to stall and wait. The system shown in Figure 4.2 aims at showing the disparity maps directly on a standard desktop monitor. Off-the-shelf DVI monitors support several standard refresh rates, e.g., 640 × 480 @ 75Hz and 1024 × 768 @ 60Hz, with certain pixel clocks. Without a disparity map frame buffer, the stereo matching processing must be synchronized with the corresponding pixel clock. In practice the stereo matching cores still work at 100MHz, and the Clocked Video Output core performs the synchronization with a clock-domain crossing FIFO; it back-pressures the stereo matching pipeline when the FIFO is full. This implementation removes the result frame buffer and as a result reduces contention for the external memory. Potentially this implementation offers better processing throughput than the system of Figure 4.1, but if the display refresh rate is lower than the pipeline frame rate, the pipeline has to stall.

4.3 Introduction to the SoC Software

The embedded software runs on the Nios-II CPU, a general-purpose RISC processor designed by Altera for FPGA implementations. The software in our proposed system is responsible for the following tasks:

• Communicate with the PC program. Communications with the PC program exchange non-real-time data between the PC and the FPGA through the JTAG UART interface.

• Control and monitor the SOPC peripherals. This is achieved by accessing peripheral registers through memory-mapped interfaces.

• Generate and update descriptor memories for the corresponding SG-DMAs. A Scatter-Gather DMA performs multiple transactions as specified by the descriptors stored in its descriptor memory.
After one descriptor, or a linked list of descriptors, has been processed, the SG-DMA core generates an interrupt request to the Nios-II CPU, and the CPU calls the related Interrupt Service Routine (ISR) to update the descriptor memory if necessary.

• Configure stereo matching parameters. Stereo matching parameters, e.g., frame resolution and thresholding values, are configurable at run-time. Control signals are specified by the user through push-buttons and handled by the push-button interrupt service routine.

• Evaluate the system performance. System performance, e.g., the stereo matching frame rate, is measured by the timer and analyzed by the embedded software.

The software footprint is small enough to be implemented completely on-chip, and a dedicated memory block is employed for all instruction and data storage. The embedded software is implemented with a round-robin with interrupts architecture [36]. In this architecture, interrupt service routines deal with the most urgent requests of the hardware and set flags; the main loop polls the flags and does any follow-up processing indicated by them. In our system, the interrupt service routines mainly serve the SG-DMAs and the push-buttons. Since the SG-DMAs have to deal with the most urgent stereo matching data, their interrupts are assigned the highest priority. The two SG-DMAs have the same priority and are served on a first-come, first-served basis. All tasks in the main loop have the same priority, which is lower than all interrupts; different tasks are handled in a round-robin fashion. The flow chart of the task code in the main loop is shown in Figure 4.3.
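The round-robin-with-interrupts structure can be illustrated with a small model. Task and flag names here are made up for illustration; the real ISRs and tasks are the ones listed above. The ISR does only the urgent work and sets a flag, while the main loop polls all flags in a fixed order without priorities.

```python
# Flags set by (simulated) interrupt service routines
flags = {"sgdma0_done": False, "sgdma1_done": False, "button": False}

def isr_sgdma0():
    flags["sgdma0_done"] = True     # ISR: urgent work done in hardware, just flag it

def main_loop(poll_rounds=10):
    """Round-robin polling: each task checks its flag in turn, no priorities."""
    handled = []
    for _ in range(poll_rounds):
        if flags["sgdma0_done"]:
            flags["sgdma0_done"] = False
            handled.append("update descriptor memory 0")
        if flags["sgdma1_done"]:
            flags["sgdma1_done"] = False
            handled.append("update descriptor memory 1")
        if flags["button"]:
            flags["button"] = False
            handled.append("reconfigure matching parameters")
    return handled
```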
Figure 4.3: Flow chart of the main loop tasks

4.4 System on a Chip Interconnections

We have chosen the Avalon System Interconnect Fabric as the SoC interconnect. It is a collection of interconnect and logic resources defined by Altera, which connects the components in an SOPC Builder system. The interconnect fabric is comprised of two standards [7]: Avalon memory-mapped (Avalon-MM) and Avalon streaming (Avalon-ST). We have implemented both, for different data transfer requirements. Their properties are introduced in the following subsections.

4.4.1 Avalon Memory Mapped Interconnect

The Avalon-MM interconnect is a partial crossbar switch connecting memory-mapped slaves and masters. Unlike a full crossbar switch, which allows any number of masters to connect to any number of slaves, the partial crossbar only makes connections between masters and a selected number of slaves. Non-active connections of a full crossbar are removed to reduce the usage of FPGA logic and routing resources. In contrast with a shared bus structure, a partial crossbar allows concurrent transactions issued by different masters, as long as they do not access the same slave. For masters accessing the same slave simultaneously, the SOPC Builder generated arbiter logic performs slave-side arbitration. By default, the arbitration algorithm provides equal fairness for all masters accessing a particular slave, and accesses are granted in a round-robin fashion. Prioritized arbitration is also supported, by assigning more arbitration shares to a particular master in SOPC Builder. The Avalon-MM interconnect employs Dynamic Bus Sizing [7] to resolve data width differences between masters and slaves. In our system, for example, the CPU Memory is connected with the Nios-II processor through a dedicated Avalon-MM link, which is always accessible for the processor regardless of other ongoing transactions.
During the stereo matching processing, the CPU is thus still able to perform some tasks, e.g., updating a descriptor memory, without interfering with the stereo matching flow. In contrast, the DDR2-SDRAM Controller is accessed by many masters, sometimes concurrently; therefore access has to be granted according to the masters' priorities. In our current implementation, equal fairness is applied for all masters. The Avalon-MM interconnections in our system mainly take the following responsibilities:

• Work as data and instruction buses between the Nios-II CPU and the memories connected to it.

• Allow Nios-II to access its peripherals and transfer control commands and status information between them.

• Transfer non-real-time data, e.g., data from the JTAG UART link to the DDR2-SDRAM.

4.4.2 Avalon Streaming Interconnect

An Avalon-ST link creates a point-to-point connection between a streaming source port and a streaming sink port to accommodate the requirements of the high-bandwidth, low-latency, unidirectional traffic associated with networking, video and DSP applications. Since an Avalon-ST link carries continuous data streams, it does not require an address signal on the sink port. Also, because its data flow is always point-to-point and unidirectional, there is no need to distinguish read and write requests. Minimally, an Avalon-ST interconnect only requires two signals, data and valid, as shown in Figure 4.4; the sink port only samples data from the source port when both the data and valid signals are asserted.

Figure 4.4: The minimum Avalon-ST interconnect

In our system, the ready signal is also implemented in the Avalon-ST links to provide back pressure. The ready signal is asserted by a sink port when it is ready to accept data; otherwise ready is de-asserted to pause the data sending from its source port. Each Avalon-ST link in our system is thus comprised of data, valid and ready signals (see Figure 4.5).
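The data/valid/ready handshake behaves as in the following cycle-level sketch (a behavioral model, not RTL): a beat is transferred only in cycles where the source asserts valid and the sink asserts ready, and the source holds its data while the sink back-pressures.

```python
def transfer_happens(valid, ready):
    """A word moves across an Avalon-ST link only when both sides agree."""
    return valid and ready

def run_link(words, sink_ready_pattern):
    """Cycle-by-cycle model: the source holds data while the sink back-pressures."""
    accepted, i = [], 0
    for ready in sink_ready_pattern:
        if i >= len(words):
            break
        if transfer_happens(True, ready):   # source always has data in this model
            accepted.append(words[i])
            i += 1                          # advance only on an accepted beat
    return accepted
```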
Because our stereo matching hardware is fully pipelined, it never stops processing internally as long as its source is valid and its sink is ready.

Figure 4.5: Avalon-ST interconnect in our system

The Avalon-ST interconnects are responsible for transporting stereo frames, intermediate stereo matching results and final disparity map frames between the processing cores and the DDR2-SDRAM. All Avalon-ST streams are transferred in progressive scanline order.

4.5 Storage System

Hierarchically, the storage system contains three levels: registers, on-chip SRAM and off-chip SDRAM. Registers guarantee 0-cycle access latency for all operations, so they are responsible for holding the most time-critical data. In addition, since a group of registers is accessible concurrently, registers are also used for data to be processed in parallel. The on-chip SRAMs are implemented as memory blocks (also called block RAMs) in our Stratix-III FPGA and used for the following purposes:

• Cache, program and data memory for the Nios-II processor
• Descriptor memories for the SG-DMAs
• Scanline data buffers in the stereo matching coprocessors
• FIFOs for peripherals where necessary, e.g., in the Clocked Video Output

The external DDR2-SDRAM is a 1GB DDR2-667 dual-rank SO-DIMM module, with each rank organized as 64M × 64bit. The SDRAM core is clocked at the same frequency as the CPU clock (100MHz); therefore the data bus between the FPGA and the DDR2-SDRAM works at 200MHz. Because DDR2-SDRAM transfers data on both the rising and the falling clock edge, the local data width provided by the DDR2-SDRAM controller is 256 bits (2 × 2 × 64). However, because several masters in the system access the single SDRAM, and because the SDRAM has internal access latencies, the DDR2-SDRAM is still not as efficient as the on-chip block RAMs. Higher efficiency is achieved when long read bursts or long write bursts are issued to the same SDRAM row, which normally implies a continuous addressing space.
To use the SDRAM more efficiently, we pack data words to be processed concurrently into one larger data word, and read/write them continuously as required by the scanline processing order. For example, the 16-bit data word read by Scatter-Gather DMA-0 (see Figure 4.1) is comprised of an 8-bit luminance pixel from the left video frame and the 8-bit luminance pixel with the same coordinates from the right frame, as illustrated in Figure 4.6.

Figure 4.6: Packed stereo frame data storage

Utilization of the DDR2-SDRAM storage is listed below.

• Stereo frame buffer
Minimally this storage only requires space for 2 stereo frame pairs, to provide alternating accesses for the stereo cameras and the stereo matching cores. The required memory space is:

Stereo frame buffer size = 2 × frame width × frame height × (2 × 8bits) (4.1)

For a VGA resolution (640 × 480) video this amounts to 1.17MB. In case the raw stereo frames do not need to be preserved, the stereo cameras directly feed packed stereo frame data to the stereo matching blocks and this piece of storage becomes zero. On the other hand, if a sequence of continuous frame pairs is required for evaluating the stereo matching performance, this storage scales linearly:

Stereo frame buffer size = N × frame width × frame height × (2 × 8bits) (4.2)

where N indicates the desired number of frame pairs.

• Disparity map buffer
This storage, if present, is similar to the stereo video source part, except that it only requires 1 frame of 8-bit disparities for each processed stereo frame pair. The required memory space is:

Disparity map buffer size = N × frame width × frame height × 8bits (4.3)

where N indicates the number of disparity map frames, which equals the number of processed stereo frame pairs.
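Equations 4.1–4.3 translate directly into a small size calculator (byte units), which also reproduces the 1.17MB figure quoted for VGA double buffering:

```python
def stereo_frame_buffer_bytes(width, height, n_pairs=2):
    """Equations 4.1/4.2: n_pairs stereo frames, 2 x 8-bit luminance per pixel pair."""
    return n_pairs * width * height * 2      # 2 x 8 bits = 2 bytes per pixel pair

def disparity_map_buffer_bytes(width, height, n_frames):
    """Equation 4.3: one 8-bit disparity per pixel."""
    return n_frames * width * height

# VGA double buffering: 2 x 640 x 480 x 2 bytes = 1,228,800 bytes, about 1.17 MB
vga_bytes = stereo_frame_buffer_bytes(640, 480)
```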
4.6 Bandwidth Utilization Estimations

The bandwidth of a DDR2 SDRAM module is given by Equation 4.4:

Bandwidth = 2 (double data rate) × data bus width × data bus frequency (4.4)

In our implementation, the DDR2-SDRAM data bus width is 64 bits and the data bus frequency is 200MHz. Therefore the bandwidth of the DDR2-SDRAM is 2 × 64 × 200MHz = 25.6Gbit/s. To avoid ambiguity, in the context of this thesis Bandwidth refers to the maximum or theoretical bit rate of a communication channel, and Throughput (or Bandwidth Utilization) indicates the average bit rate of the data actually being transferred. The throughput of the DDR2-SDRAM is calculated by Equation 4.5:

Throughput = memory bandwidth × efficiency (4.5)

The efficiency is mainly determined by the DDR2-SDRAM controller and the accessing pattern of the application. The worst-case efficiency occurs when the application frequently switches between short reads and short writes and the memory addressing forces a row to be opened and closed on every transaction [6]. In contrast, the best case arises with long read bursts and long write bursts, which implies a continuous address space being accessed. The following techniques are applied to improve the memory accessing efficiency.

• Data words to be processed in parallel are packed into a larger word (e.g., Figure 4.6) to form a continuous addressing space.

• All data words are stored and accessed in continuous scanline order to minimize possible memory row breaks.

• The DDR2-SDRAM controller is implemented using the Altera DDR2-SDRAM High-Performance Controller Core, and some parameters are configured to improve its efficiency, e.g., the maximum burst length and the address bus ordering (chip select, bank, row and column).

As shown in Figure 4.1, only the Nios-II processor and the SG-DMAs have access to the DDR2-SDRAM controller.
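Equation 4.4 and Equation 4.5 reduce to the following helpers. The 25.6Gbit/s figure from the text corresponds to the 64-bit bus at 200MHz; the efficiency factor is whatever the controller and access pattern achieve, so the value passed below is purely illustrative.

```python
def ddr2_bandwidth_bps(bus_width_bits, bus_freq_hz):
    """Equation 4.4: double data rate doubles the per-clock transfer."""
    return 2 * bus_width_bits * bus_freq_hz

def throughput_bps(bandwidth_bps, efficiency):
    """Equation 4.5: achieved throughput is bandwidth scaled by efficiency."""
    return bandwidth_bps * efficiency
```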
The Nios-II only uses the external memory for non-real-time data, e.g., preparing the stereo frame buffer content at the beginning and uploading the disparity maps to the PC at the end; its influence on the real-time throughput is therefore negligible. In contrast, the stereo matching cores have to access the external memory for real-time processing, and the throughputs required by the different coprocessors are listed below:

• Pre-Processor
SG-DMA-0 reads stereo frame pairs from the external memory. Its throughput requirement is given by:

Stereo throughput = frame rate × frame width × frame height × 16bits (4.6)

In future developments, if the stereo video comes directly from the cameras, this requirement on the external memory becomes zero.

• Post-Processor
SG-DMA-1 writes disparity maps to the external memory. Its throughput requirement is:

Disparity throughput = frame rate × frame width × frame height × 8bits (4.7)

If the disparity maps are directly output to display devices, e.g., as in Figure 4.2, this requirement on the external memory also becomes zero.

• Stereo-Matcher
The Stereo-Matcher does not require the external memory as data storage.

The overall bandwidth utilization is therefore estimated by Equation 4.8 and Equation 4.9 for the two SoC implementations respectively.

Estimated throughput1 = frame rate × frame width × frame height × 24bits (4.8)

Estimated throughput2 = frame rate × frame width × frame height × 16bits (4.9)

5 Design Evaluation and Experimental Results

Corresponding to the design considerations discussed in Section 2.4, in this chapter we evaluate our design in the following aspects.

• The proposed stereo matching algorithm
• FPGA implementation and scalability
• Real-time performance

For the proposed algorithm, we evaluate its accuracy under several hardware and software design parameters, as well as its robustness. We also evaluate our FPGA implementation and the design scalability in this chapter.
The real-time performance of our design is evaluated based on both ModelSim simulation and the FPGA implementation. Finally, a comparison between our implementation and related work is provided and discussed.

5.1 Evaluation of the Proposed Algorithm

As introduced in Section 3.2.2.3 and Section 3.3.2.2, there are two important hardware parameters in our design, i.e., the maximum support region arm length Lmax and the vertical aggregation span Vspan. The two parameters have impact on the logic resource and memory usage, and must be determined at design time. One software parameter that is programmable at run time is the support region threshold τ, which is configured by the Nios-II software. Figure 5.1 and Figure 5.2 show their influence on the stereo matching accuracy. In the two figures the support region threshold is set to 17. The curve marked with Vspan = A indicates vertical aggregation with a fully adaptive support region. The figures show that the stereo matching algorithm produces high accuracy with Lmax ranging from 15 to 20, and Vspan from 5 to fully adaptive; the error rate only varies a little. Because larger Lmax and Vspan require more logic and memory resources, we set the two hardware parameters to 15 and 5 respectively. With Lmax and Vspan fixed, the software parameter τ is still programmable at run time; its value is configured by the user with the push-buttons on the FPGA board or control commands sent from a host PC. Its effect on the stereo matching accuracy is shown by Figure 5.3 and Figure 5.4. In the figures, we care most about the NonOcc error rates, which should be less than 10% in our target implementation. From the figures we notice that the matching accuracy varies with this software parameter and reaches the minimum error rates when 15 ≤ τ ≤ 20. By default we set τ to 17.
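The error rates discussed here follow the usual bad-pixel convention: the percentage of evaluated pixels whose disparity differs from the ground truth by more than a tolerance. The sketch below assumes a one-pixel tolerance and a validity mask (the NonOcc variant excludes occluded pixels); the thesis text does not spell out the exact threshold, so treat `delta=1` as an assumption.

```python
def error_rate(disp, truth, valid_mask, delta=1):
    """Percentage of evaluated pixels whose disparity is off by more than delta.

    disp, truth: flat sequences of disparities; valid_mask selects the
    evaluated pixels (e.g., the non-occluded ones for a NonOcc rate).
    """
    bad = total = 0
    for d, t, ok in zip(disp, truth, valid_mask):
        if not ok:
            continue                 # skip pixels outside the evaluation mask
        total += 1
        if abs(d - t) > delta:
            bad += 1
    return 100.0 * bad / total
```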
Figure 5.1: Lmax-Vspan and error rates with Tsukuba image set

Figure 5.2: Lmax-Vspan and error rates with Teddy image set

The benchmark results given in Section 5.4 are computed with the parameter configuration given by Equation 5.1:

Lmax = 15, Vspan = 5, τ = 17. (5.1)

The left benchmark images, the ground truth disparity maps and our resulting disparity maps are shown together in Figure 5.5. Our proposed algorithm is also very robust to luminance bias and radiometric differences.
Figure 5.3: Support region threshold and error rates with Tsukuba image set

Figure 5.4: Support region threshold and error rates with Teddy image set

Figure 5.6(a-b) shows the stereo image set with a luminance bias of +50 on the right camera, and Figure 5.6(c) the resulting disparity map computed by the VariableCross algorithm [49]. The stereo matching result is obviously degraded severely by the bias. In contrast, Figure 5.6(d) is the disparity map computed by our proposed algorithm, which is only slightly affected by the luminance bias. This robustness is obtained by the adopted mini-census transform [4] and the corresponding Hamming distance raw cost function.

Tsukuba: Nonocc Error = 3.84%; Venus: Nonocc Error = 1.20%; Teddy: Nonocc Error = 7.17%; Cones: Nonocc Error = 5.41%
Figure 5.5: Truth disparity maps and our results

Figure 5.6: Luminance biased images and the resulting disparity maps. (a): Image taken by the left camera. (b): Image taken by the right camera with +50 luminance bias. (c): Resulting disparity map by the VariableCross algorithm (Error: 98.4%).
(d): Resulted disparity map by the proposed algorithm.

5.2 Evaluation of the FPGA Implementation

5.2.1 Introduction to the FPGA Hardware Resource

The FPGA EP3SL150 is chosen as our implementation platform; this FPGA belongs to the Altera 65-nm Stratix III L family, which provides balanced logic, memory and multiplier ratios for mainstream applications [11]. Unlike custom ASICs, the hardware resources of an FPGA chip are fixed at the time it leaves the factory. Therefore, knowing the features and the amount of its programmable hardware is essential to any design work. The hardware resources that are essential to our design are introduced below, and their availability is summarized in Table 5.3.

• Programmable logic array blocks (LABs)
The basic programmable logic array block of the EP3SL150 is known as the adaptive logic module (ALM), which supports logic functions, arithmetic functions and registers. The high-level block diagram of an ALM is shown in Figure 5.7.

[Figure 5.7: High-level block diagram of the Stratix-III ALM]

Each ALM is equipped with a fracturable 8-input look-up table (LUT), two dedicated 1-bit adders, two dedicated 1-bit registers and additional logic enhancements. One ALM is equivalent to 2.5 traditional 4-input LUT based Logic Elements (LEs). Utilization of ALMs is performed by the Quartus-II software at compile time, and in most cases users do not have to adjust them manually.

• Memory logic array blocks (MLABs)
The MLAB, which adds SRAM memory to the above-mentioned LAB, is a superset of the LAB and includes all LAB features. Each ALM in an MLAB is also able to work as a 320-bit SRAM block; its possible aspect ratio configurations are listed in Table 5.1 [8]. MLABs are distributed over the whole FPGA chip and are very suitable for implementing small delay lines and shift registers.

• Dedicated memory blocks (BRAMs)
EP3SL150 contains two types of dedicated memory blocks: M9K and M144K.
As their names indicate, one M9K block contains 9K SRAM bits and one M144K block contains 144K bits. Each memory block supports single-port, simple dual-port (one read port and one write port) or true dual-port (two read/write ports) mode and multiple aspect ratio configurations (see Table 5.1); different configurations also affect the achievable performance. BRAMs are suitable for general purpose memory applications, and in our system they are employed for the processor memory, descriptor memory, FIFOs, frame line buffers, etc.

Table 5.1: EP3SL150 on-chip memory features

Feature               MLABs     M9K Blocks   M144K Blocks
Maximum Performance   600 MHz   580 MHz      580 MHz
Aspect Ratios         16 × 8    8K × 1       16K × 8
                      16 × 9    4K × 2       16K × 9
                      16 × 10   2K × 4       8K × 16
                      16 × 16   1K × 8       8K × 18
                      16 × 18   1K × 9       4K × 32
                      16 × 20   512 × 16     4K × 36
                                512 × 18     2K × 64
                                256 × 32     2K × 72
                                256 × 36

• Digital signal processing blocks (DSPs)
EP3SL150 has dedicated high-performance digital signal processing (DSP) blocks optimized for DSP applications. The fundamental building block is a Two-Multiplier Adder comprised of a pair of 18-bit × 18-bit multipliers followed by a 37-bit addition/subtraction unit, as shown in Figure 5.8.

[Figure 5.8: Basic Two-Multiplier Adder DSP unit. The inputs A0[17..0], B0[17..0] and A1[17..0], B1[17..0] feed the two multipliers, whose registered products are combined by the ± unit into the result P[36..0].]

Each Stratix III DSP block contains four Two-Multiplier Adder units and other logic enhancements for different computing configurations. Stratix III DSP blocks support various computation modes including multiplication, multiply-add/subtract, multiply-accumulate and so on. The available resources under different configurations are summarized in Table 5.2.
Among these configurations, we have widely employed the Two-Multiplier Adder mode to compute Equation 5.2, which is applied in the Winner-Takes-All step.

Result[36..0] = A[17..0] × B[17..0] − C[17..0] × D[17..0] (5.2)

Table 5.2: EP3SL150 DSP block configurations

Configuration                                            Availability
DSP Blocks                                               48
Independent input and output multiplication operators:
  9 × 9 Multipliers                                      384
  12 × 12 Multipliers                                    288
  18 × 18 Multipliers                                    192
  36 × 36 Multipliers                                    96
Two-Multiplier Adders (18 × 18 ± 18 × 18)                192

The overall available resources that are important to this design are summarized in Table 5.3.

Table 5.3: EP3SL150 hardware resource summary

Resource                    Availability
Logic array blocks
  ALMs                      57K
  LEs                       142.5K
Memory array blocks
  MLAB Blocks               2850
  MLAB Kbits                891
Dedicated SRAMs
  M9K Blocks                355
  M144K Blocks              16
  Dedicated RAM Kbits       5499
Dedicated DSPs
  DSP Blocks                48
  Two-Multiplier Adders     192

To make our design fit the FPGA and keep it scalable for higher-density devices, one important principle is to properly map our algorithm to the different functional units and balance their consumption ratios. For example, general shift registers are achievable using either normal ALMs or MLABs, but the synthesis tool Quartus-II is not intelligent enough to make the optimal implementation choice all the time. In such cases we have to specify explicitly which kind of resource is to be used. Further resources of the selected FPGA, e.g., phase-locked loops (PLLs), error correction coding (ECC) and high-speed I/O support, are not introduced here; they are either not used or handled automatically by Quartus-II.

5.2.2 FPGA Implementation Report

In the stereo matching pipeline implementation, the Pre-Processor is not aware of any disparity range and the Post-Processor is only slightly affected by the maximum allowed disparity range. In contrast, the hardware resources utilized by the Stereo-Matcher are mainly determined by the maximum allowed disparity range.
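As a sanity check on the bit widths, the following Python sketch (our own illustrative model, not the hardware description) mimics the Two-Multiplier Adder of Equation 5.2 with signed 18-bit operands and verifies that the result always fits in the 37-bit output P[36..0]:

```python
def two_multiplier_adder(a, b, c, d):
    """Model of Equation 5.2: Result[36..0] = A[17..0]*B[17..0] - C[17..0]*D[17..0]."""
    for operand in (a, b, c, d):
        # Each operand must fit in an 18-bit signed word.
        assert -(1 << 17) <= operand < (1 << 17)
    result = a * b - c * d
    # The difference of two 35-bit signed products always fits in 37 signed bits,
    # which matches the P[36..0] output width of the DSP unit.
    assert -(1 << 36) <= result < (1 << 36)
    return result
```

For instance, `two_multiplier_adder(3, 4, 1, 2)` returns 10, and even the extreme operand combinations stay within the 37-bit bound.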
As shown in Figure 3.14, the number of processing modules in the Stereo-Matcher scales with the maximum allowed disparity range. In practice, the disparity range is determined by the distance between the scene objects and the stereo camera baseline, and by the length of the baseline itself, so it varies with the target application and the corresponding camera setup. In our implementation, the Post-Processor is set to deal with a maximum disparity range of 64, and the Stereo-Matcher is tested with maximum disparity ranges of 16, 32 and 64 respectively.

Besides the disparity range, the image size also determines the hardware resource utilization, especially for the data line buffers. As introduced in Section 5.2.1, the data line buffers are implemented with the M9K memory blocks, which are configured with different data widths according to the computing requirements. With the EP3SL150 FPGA we target videos with a frame width no larger than 1024, so the depths of the line buffer memories are also set to 1024 (1K). The line buffer depth only limits the frame width; the frame height is not limited and only affects the processing time.

Table 5.4: EP3SL150 hardware resource utilization summary

Module                   Combinational ALUTs   Memory ALUTs    Registers       DSP Blocks     SRAM Bits
                         (total 113,600)                                       (total 384)    (total 5,630,976)
Pre-Processor            3,310     3%          288      1%     2,075     2%    0       0%     417,792      7%
Post-Processor           12,211    11%         536      1%     10,263    9%    0       0%     393,216      7%
Stereo-Matcher 16        8,874     8%          4,864    9%     18,813    17%   60      16%    589,824      10%
Stereo-Matcher 32        17,136    15%         9,728    17%    37,245    33%   124     32%    884,736      16%
Stereo-Matcher 64        33,639    30%         19,456   34%    74,109    65%   252     66%    1,474,560    26%
Stereo-Matcher 64 SoC1   60,296    53%         20,288   36%    94,891    84%   257     67%    3,771,247    67%
Stereo-Matcher 64 SoC2   60,816    54%         20,288   36%    94,980    84%   257     67%    3,752,121    67%

The FPGA hardware resources utilized by the different processing modules are summarized in Table 5.4. The values following Stereo-Matcher indicate the number of parallel computing threads implemented in the Stereo-Matcher processor, which equals the maximum allowed disparity of the corresponding implementation. The table clearly shows that the hardware utilization scales with the number of parallel computing threads. The design scalability is illustrated in Table 5.5, which charts the utilization percentages for Stereo-Matcher 16, 32 and 64. It shows that the implementation scales nearly linearly with the number of parallel computing threads, but different scaling factors are associated with the different hardware resources. In the current implementation the required registers (flip-flops) and dedicated DSP blocks are the limiting resources for a larger scale implementation. Balancing techniques include replacing some dedicated hardware DSPs with LUTs and reducing the usage of pipeline registers.

[Table 5.5: Design and implementation scalability. A chart of the FPGA resource utilization percentages (combinational ALUTs, memory ALUTs, SRAM bits, registers and DSP blocks) for Stereo-Matcher 16, 32 and 64.]

5.3 Evaluation of the Real-Time Performance

We evaluate the real-time performance with two approaches: ModelSim simulation and FPGA implementation. The simulation verifies the functional correctness of our design model, and the implementation demonstrates its applicability and actual performance in practice.
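The near-linear scaling of the Stereo-Matcher can be checked numerically. The Python sketch below (our own verification aid; the variable names are ours) fits the register and DSP figures of Table 5.4 to a line in the number of disparity threads D and reproduces the reported 64-thread counts exactly:

```python
# Reported Stereo-Matcher utilization from Table 5.4, indexed by the
# number of parallel disparity threads D.
registers = {16: 18813, 32: 37245, 64: 74109}
dsp_blocks = {16: 60, 32: 124, 64: 252}

def linear_fit(points):
    """Fit y = a*D + b through the two smallest-D points and return (a, b)."""
    (d0, y0), (d1, y1) = sorted(points.items())[:2]
    a = (y1 - y0) / (d1 - d0)
    b = y0 - a * d0
    return a, b

a_reg, b_reg = linear_fit(registers)    # about 1152 registers per thread plus a fixed part
a_dsp, b_dsp = linear_fit(dsp_blocks)   # about 4 DSP elements per thread

# Both linear models reproduce the Stereo-Matcher 64 measurement exactly,
# which supports the near-linear scaling observed in Section 5.2.2.
assert a_reg * 64 + b_reg == registers[64]
assert a_dsp * 64 + b_dsp == dsp_blocks[64]
```

The fits come out as roughly 1152 registers and 4 DSP elements per additional disparity thread, which is why these two resources dominate as the thread count grows.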
5.3.1 Performance Evaluation with Simulation

The ModelSim simulation provides the theoretical processing time and latencies. We have already discussed the hardware processing latencies in Chapter 3; they are summarized in Table 5.6. The full pipeline requires a latency of about 34 lines of valid pixel cycles.

Table 5.6: Stereo matching pipeline latency summary. The latency is measured in cycles and targets 100MHz with the EP3SL150 FPGA; W denotes the frame width.

Processing Module                                  Latency        384×288   450×375   640×480   800×600   1024×768
Pre-Processor
  Median Filter                                    W + 6          390       456       646       806       1030
  Census Transform and Support Region Builder      W × 15 + 19    5779      6769      9619      12019     15379
  Output Logic                                     0              0         0         0         0         0
Stereo-Matcher
  Raw Cost Scatter and Correlation Region Builder  4              4         4         4         4         4
  Horizontal Aggregation                           17             17        17        17        17        17
  Vertical Aggregation                             W × 2 + 3      771       903       1283      1603      2051
  Lineup and WTA                                   72             72        72        72        72        72
  Output Logic                                     2              2         2         2         2         2
Post-Processor
  L-R Consistency Check                            69             69        69        69        69        69
  Horizontal Voting                                23             23        23        23        23        23
  Vertical Voting                                  W × 15 + 9     5769      6759      9609      12009     15369
  Median Filter                                    W + 6          390       456       646       806       1030
Total                                              W × 34 + 230   13286     15530     21990     27430     35046

For video processing, after the initial pipeline latency, valid disparity pixels continuously come out of the pipeline in progressive scanline order. Thus the throughput of the full pipeline is determined by the frame resolution and the vertical blanking lines required by the aforementioned vertical aggregation step. The pipeline cycles for processing a full video frame are summarized in Table 5.7.
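The per-module latency formulas can be cross-checked against the tabulated totals. The following Python sketch (a verification aid we provide for illustration, not part of the design) sums the formulas of Table 5.6 for a given frame width:

```python
def pipeline_latency(width):
    """Sum the per-module latencies of Table 5.6 for a frame of the given width."""
    pre = (width + 6) + (width * 15 + 19) + 0           # Pre-Processor stages
    matcher = 4 + 17 + (width * 2 + 3) + 72 + 2         # Stereo-Matcher stages
    post = 69 + 23 + (width * 15 + 9) + (width + 6)     # Post-Processor stages
    total = pre + matcher + post
    # The closed form from the "Total" row of Table 5.6.
    assert total == width * 34 + 230
    return total
```

For example, `pipeline_latency(1024)` yields 35046 cycles, matching the 1024 × 768 column of the table.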
Table 5.7: Stereo matching processing speed summary. The cycle counts target 100MHz with the EP3SL150 FPGA; W and H denote the frame width and height.

Pipeline Cycle Count   Formula       384×288   450×375   640×480   800×600   1024×768
Valid Area             W × H         110592    168750    307200    480000    786432
Vertical Blanking      W × 2         768       900       1280      1600      2048
Total                  W × (H + 2)   111360    169650    308480    481600    788480

Our ModelSim simulation exactly verifies the latency and processing cycle counts. The theoretical processing frame rate is therefore calculated by dividing the pixel clock by the total pipeline processing cycles, as shown in Table 5.8.

Table 5.8: Theoretical frame rates (FPS) with different pixel clocks

Pixel Clock   384×288   450×375   640×480   800×600   1024×768
(cycles)      111360    169650    308480    481600    788480
25M           224       147       81        51        31
50M           448       294       162       103       63
75M           673       442       243       155       95
100M          897       589       324       207       126
150M          1346      884       486       311       190

5.3.2 Performance Evaluation with FPGA

We evaluate the real-time performance of our FPGA implementation with the two proposed SoC architectures presented in Chapter 4. The first SoC (see Figure 4.1, SoC1) accesses the external DDR2-SDRAM as both source and result frame buffer, and the whole stereo matching pipeline and the corresponding DMAs work at 100MHz. To measure the processing time with this implementation, we first download a pair of rectified stereo frames to the DDR2-SDRAM through the JTAG UART link; then the Nios-II CPU triggers the stereo matching processing, which repeatedly computes the same frame pair a certain number of times. The software control is illustrated in Figure 4.3. The computing start and end times are recorded by the on-chip timer, from which the corresponding frame rates are obtained. The measured frame rates with SoC1 and a 100MHz operating clock are summarized in Table 5.9.
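The entries of Tables 5.7 and 5.8 follow directly from the cycle-count formula; a small Python sketch (our own illustration):

```python
def frame_cycles(width, height):
    """Pipeline cycles per frame: valid area plus two vertical blanking lines (Table 5.7)."""
    return width * (height + 2)

def theoretical_fps(pixel_clock_hz, width, height):
    """Theoretical frame rate as in Table 5.8: pixel clock divided by cycles per frame."""
    return pixel_clock_hz // frame_cycles(width, height)
```

For example, `theoretical_fps(100_000_000, 1024, 768)` gives 126 FPS, matching the 100M row of Table 5.8.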
As discussed in Section 4.6, in our implementation the DDR2-SDRAM bandwidth is 25.6Gbits/s; the corresponding throughput and external memory bandwidth utilization are computed by Equation 4.8 and Equation 4.5.

Table 5.9: FPGA frame rates (FPS) with SoC1 and 100MHz clock

                        384×288   450×375   640×480   800×600   1024×768
Frame Sizes (cycles)    111360    169650    308480    481600    788480
Frame Rates             296       193       116       80        47
Throughput (Gbits/s)    0.791     0.786     0.859     0.925     0.889
Bandwidth Utilization   3.09%     3.07%     3.35%     3.61%     3.47%

Comparing the data shown in Table 5.8 and Table 5.9, we notice that the measured frame rates are about one third of the theoretical maximum attainable performance. This is because the external memory access efficiency is not optimized; it is determined by a number of factors, such as the randomness of addresses, the refresh rate, the turnaround times between reads and writes, and the burst lengths. During the stereo matching processing, SG-DMA-0 in SoC1 (see Figure 4.1) continuously reads stereo pixels from the DDR2-SDRAM while SG-DMA-1 continuously writes pixel disparities back to it. Such back-to-back reads and writes are considered the most inefficient way of using the DDR2-SDRAM. There are at least two approaches to solve this problem: one is to increase the allowed burst transfer size of the DMAs, and the other is to add FIFOs before and after the processing pipeline to avoid the frequent switches between reading and writing the DDR2-SDRAM. Since the proposed stereo matching pipeline requires continuous pixel reads and writes, FIFOs can be used to pre-fetch data and buffer results for long burst transfers. We have put the first approach into practice and it contributes to the performance in Table 5.9. The other possible optimizations will be implemented in our future developments.
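The throughput figures in Table 5.9 can be approximately reconstructed. The Python sketch below is our own approximation (the exact forms of Equations 4.8 and 4.5 are given in Chapter 4); it assumes three bytes of external memory traffic per pixel cycle in SoC1, i.e. one left pixel read, one right pixel read and one disparity write:

```python
DDR2_BANDWIDTH_GBITS = 25.6  # available external memory bandwidth (Section 4.6)

def soc1_throughput_gbits(frame_cycles, fps, bytes_per_cycle=3):
    """Assumed SoC1 traffic: left read + right read + disparity write per pixel cycle."""
    return frame_cycles * fps * bytes_per_cycle * 8 / 1e9

def bandwidth_utilization(throughput_gbits):
    """Fraction of the 25.6 Gbits/s DDR2-SDRAM bandwidth that is actually used."""
    return throughput_gbits / DDR2_BANDWIDTH_GBITS

# Reproduces the 640x480 row of Table 5.9: about 0.859 Gbits/s and 3.35% utilization.
t = soc1_throughput_gbits(308480, 116)
assert abs(t - 0.859) < 0.001
assert abs(bandwidth_utilization(t) - 0.0335) < 0.0005
```

Under this assumption the remaining rows of Table 5.9 are reproduced as well, which illustrates how little of the external bandwidth the pipeline itself consumes.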
With SoC2 (see Figure 4.2) and a desktop DVI monitor, we have evaluated our design and implementation with several standard display specifications. In this case, the stereo matching pipeline and SG-DMA-0 still work at 100MHz, but the output disparity map is synchronized with various pixel clocks. This implementation does not need a result frame buffer, so possible external memory access contention is relieved and it has the potential to provide higher frame rates than SoC1. Our reported performance of 1024 × 768 @ 60FPS is based on this implementation. The implementation passes tests with all frame sizes and frame rates listed in Table 5.10; the corresponding throughput and external memory bandwidth utilization are computed by Equation 4.9 and Equation 4.5.

Table 5.10: FPGA frame rates (FPS) with SoC2 and standard pixel clocks

                        640×480   640×480   800×600   800×600   1024×768
Frame Sizes (cycles)    308480    308480    481600    481600    788480
Pixel Clock (MHz)       25.17     31.5      40        49.5      65
Frame Rates             60        75        60        75        60
Throughput (Gbits/s)    0.296     0.370     0.462     0.578     0.757
Bandwidth Utilization   1.15%     1.45%     1.81%     2.26%     2.95%

The performance evaluations show that the proposed implementation is able to deliver real-time high-definition stereo matching with very low external memory bandwidth utilization. It is also flexibly adaptive to various standard pixel clocks. On the one hand, the external memory access efficiency can be improved to increase the achievable frame rates; on the other hand, the reserved bandwidth can also be used for other applications such as image rectification and viewpoint interpolation. To the best of our knowledge, we have achieved the fastest processing speed (60 frames per second) for high-definition (1024 × 768) stereo matching.

5.4 Comparison to Related Work

A well known stereo matching algorithm benchmarking metric is presented on the Middlebury Stereo Evaluation website.
A disparity map produced by a stereo matching algorithm is compared with the ground truth disparity map; if the disparity values are not identical in the two maps, the pixel is regarded as a bad matched pixel. Besides comparing all pixels in the image, two additional metrics are also given in the evaluation results, i.e., nonocc and disc, which denote non-occluded regions and near occluded regions respectively. The near occluded regions were formerly referred to as near discontinuities, and qualify as the most difficult areas for stereo matching algorithms to solve. The benchmarked performance of our implemented algorithm is shown and compared with other work in Table 5.11. Our algorithm is based on VariableCross [49], which is also shown in the table.

Table 5.11: Stereo matching algorithm error rate benchmark. Each cell gives the bad match rates (%) as nonocc / all / disc; the algorithms and corresponding implementations are ordered according to the average bad pixel rate.

Image Set (size, disparity range): Tsukuba (384 × 288, 16), Venus (434 × 383, 20), Teddy (450 × 375, 60), Cones (450 × 375, 60)

Algorithm               Tsukuba                Venus                  Teddy                  Cones                  Average
VariableCross [49]      1.99 / 2.65 / 6.77     0.62 / 0.96 / 3.20     9.75 / 15.1 / 18.2     6.28 / 12.7 / 12.9     7.60
RealtimeBFV [50]        1.71 / 2.22 / 6.74     0.55 / 0.87 / 2.88     9.90 / 15.0 / 19.5     6.66 / 12.3 / 13.4     7.65
RealtimeBP [46]         1.49 / 3.40 / 7.87     0.77 / 1.90 / 9.00     8.72 / 13.2 / 17.2     4.61 / 11.6 / 12.4     7.69
Chang et al. 2010 [4]   N/A  / 2.80 / N/A      N/A  / 0.64 / N/A      N/A  / 13.7 / N/A      N/A  / 10.1 / N/A      N/A
Proposed                3.84 / 4.34 / 14.2     1.20 / 1.68 / 5.62     7.17 / 12.6 / 17.4     5.41 / 11.0 / 13.9     8.20
FastAggreg [39]         1.16 / 2.11 / 6.06     4.03 / 4.75 / 6.43     9.04 / 15.2 / 20.2     5.37 / 12.6 / 11.9     8.24
OptimizedDP [34]        1.97 / 3.78 / 9.80     3.33 / 4.74 / 13.0     6.53 / 13.9 / 16.6     5.17 / 13.7 / 13.4     8.83
RealtimeVar [26]        3.33 / 5.48 / 16.8     1.15 / 2.35 / 12.8     6.18 / 13.1 / 17.3     4.66 / 11.7 / 13.7     9.05
RTCensus [20]           5.08 / 6.25 / 19.2     1.58 / 2.42 / 14.2     7.96 / 13.8 / 20.3     4.10 / 9.54 / 12.2     9.73
RealTimeGPU [43]        2.05 / 4.22 / 10.6     1.92 / 2.98 / 20.3     7.23 / 14.4 / 17.6     6.41 / 13.7 / 16.5     9.82
Jin et al. 2010 [23]    9.79 / 11.56 / 20.29   3.59 / 5.27 / 36.82    12.5 / 21.5 / 30.57    7.34 / 17.58 / 21.01   17.24
Chang et al. 2007 [3]   21.5 / 21.7 / 48.7     16.5 / 17.8 / 29.9     26.3 / 33.6 / 35.1     24.2 / 32.4 / 31.0     N/A

Although the benchmark provides a fair comparison among the error rates of different algorithms, it still has some insufficiencies: e.g., it does not evaluate an algorithm's robustness to luminance bias and radiometric differences, and, more importantly, the processing speed is not evaluated in this benchmark. To improve the robustness of the VariableCross algorithm and enable efficient hardware implementations, we make some trade-offs, so our result is reasonably a bit worse than VariableCross. Nevertheless, the modified algorithm still demonstrates better performance than VariableCross on the Teddy and Cones image sets, which feature a higher resolution and disparity range. On the other hand, VariableCross is implemented on a CPU and achieves very limited frame rates, while the proposed algorithm with its FPGA implementation brings a significant improvement in processing speed.
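The last column of Table 5.11 is the mean of the twelve per-set error rates; a quick Python check (our own script, with the values copied from the table):

```python
# Error rates of the proposed algorithm from Table 5.11,
# as (nonocc, all, disc) for Tsukuba, Venus, Teddy and Cones.
proposed = [3.84, 4.34, 14.2,   # Tsukuba
            1.20, 1.68, 5.62,   # Venus
            7.17, 12.6, 17.4,   # Teddy
            5.41, 11.0, 13.9]   # Cones

average_bad_pixel_rate = sum(proposed) / len(proposed)
# Matches the 8.20% reported in the Average column.
assert round(average_bad_pixel_rate, 2) == 8.20
```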
Table 5.12 shows the processing speed comparisons between our implementation and the other reported real-time implementations. The processing speed of the different systems is given in frames per second (FPS) and, more meaningfully, in million disparity evaluations per second (MDE/s):

MDE/s = Image Width × Image Height × Disparity Range × FPS (5.3)

Table 5.12: Stereo matching algorithm frame rates. The algorithms and corresponding implementations are ordered according to MDE/s.

Implementation          Platform                         Disparity Range   Frame Rate          MDE/s
Jin et al. 2010 [23]    1 x FPGA Virtex-4 XC4VLX200-10   64                230 @ 640 x 480     4521
Proposed                1 x FPGA EP3SL150                64                60 @ 1024 x 768     3019
RTCensus [20]           GPU GeForce GTX 280              60                105.4 @ 450 x 375   1067
Chang et al. 2010 [4]   ASIC UMC 90nm                    64                42 @ 352 x 288      272
RealtimeBFV [50]        GPU GeForce GTX 8800             64                12 @ 450 x 375      129
Chang et al. 2007 [3]   DSP TMS320C64x                   60                9.1 @ 450 x 375     92
RealTimeGPU [43]        GPU Radeon XL1800                16                43 @ 320 x 240      53
RealtimeVar [26]        CPU Pentium 2.83GHz              60                3.5 @ 450 x 375     35
RealtimeBP [46]         GPU GeForce GTX 7900             16                16 @ 320 x 240      20
FastAggreg [39]         CPU Core Duo 2.14GHz             60                1.67 @ 450 x 375    17
OptimizedDP [34]        PC 1.8GHz                        60                1.25 @ 450 x 375    13
VariableCross [49]      CPU Pentium IV 3.0GHz            60                1.21 @ 450 x 375    13

Clearly, our implementation demonstrates very high frame rates compared to the other implementations. Our implementation and the one proposed by Jin et al. [23] are both fully pipelined designs, and their frame rates are not limited by the computing pipeline itself; the frame rate difference between the two implementations is caused by different measurement methods. In addition, our implementation achieves real-time high-definition performance and higher matching accuracy compared with Jin's work. For FPGA implementations, the achievable resolution is usually limited by the available on-chip memories, used as line buffers etc.
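Equation 5.3 can be evaluated directly. The Python sketch below is our own; we assume the table truncates the result to the integer below, which is consistent with the top entries of Table 5.12:

```python
def mde_per_second(width, height, disparity_range, fps):
    """Million disparity evaluations per second (Equation 5.3), truncated to an integer."""
    return width * height * disparity_range * fps // 10**6

# The two fastest entries of Table 5.12:
assert mde_per_second(640, 480, 64, 230) == 4521    # Jin et al. 2010 [23]
assert mde_per_second(1024, 768, 64, 60) == 3019    # Proposed
```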
Thanks to the mini-census transform, the required on-chip matching cost storage is greatly reduced. Moreover, we have also applied a data-reuse technique in the cost aggregation step to further reduce the memory consumption. On the other hand, CPU and GPU based implementations do not have resolution limitations; however, they hardly achieve real-time performance with high-definition images. To the best of our knowledge, our implementation has achieved the best processing speed so far on high-definition images.

6 Conclusions and Future Work

This chapter summarizes the contributions of this thesis and concludes by suggesting valuable optimizations and future research and development directions.

6.1 Conclusions

This thesis has proposed an improved stereo matching algorithm suitable for hardware implementation, based on the VariableCross and MiniCensus algorithms. Furthermore, we have provided a parallel computing hardware design and implementation of the proposed algorithm. The experimental results have proved that our work achieves high speed real-time processing with programmable video resolutions, while preserving high stereo matching accuracy. The online benchmarks also suggest that this work has achieved leading matching accuracy among declared real-time implementations. To the best of our knowledge, we have achieved the fastest processing speed (60 frames per second) for high-definition (1024 × 768) stereo matching. The implementations have satisfied our design targets discussed in Section 2.4.

Regarding the algorithm, the experimental results suggest that the combination of the Mini-Census transform and the VariableCross stereo matching algorithm not only delivers very high matching accuracy, but also remains robust to radiometric and luminance differences. The hardware design proves that the cost aggregations with different disparity hypotheses are suitable for parallelization and pipelined systolic array processing.
We have also proposed various modifications that are more hardware efficient than the original VariableCross algorithm, without degrading the matching accuracy. These modifications simplify the hardware implementation so that more parallel computing threads and high-definition processing are supported on the chosen FPGA chip.

6.2 Summary of Chapters and Contributions

The proposed stereo matching algorithm and real-time implementation are motivated by a viewpoint interpolation application, which provides a computer synthesized viewpoint for eye-gazing video conferences. The concept and proposed setup for the eye-gazing viewpoint interpolation are presented in Chapter 1. The focus and major contribution of this thesis is that we have solved the major computing bottleneck, stereo matching, in the viewpoint interpolation processing pipeline.

Chapter 2 has introduced state-of-the-art stereo matching algorithms and the various computing platforms used for efficient implementations, including high performance CPUs, GPUs, DSPs, FPGAs and ASICs. Nevertheless, these related works either suffer from low stereo matching accuracy or only partially meet real-time requirements, especially for high definition image/video processing. To achieve both high speed and high quality stereo matching, we have presented our design considerations and proposed solutions for a more advanced implementation. Finally, the Altera EP3SL150 FPGA and the Terasic DE3 development board were selected as our implementation platform, and we have successfully achieved the desired matching accuracy and real-time performance. Comparisons with previous related work are presented in Section 5.4.

Chapter 3 has elaborated the algorithm-hardware co-design development scheme used in this thesis. We first propose the data-level (also known as loop-level) parallel computing architecture for concurrently evaluating hypothetical disparities in the disparity range.
Then we present the proposed algorithm and the corresponding hardware pipeline design in a top-down approach. The full stereo matching pipeline is divided into three major processing steps, i.e., pre-processing, stereo matching and post-processing. In hardware, their functions are performed by three co-processors, i.e., the Pre-Processor, the Stereo-Matcher and the Post-Processor, respectively. Massive data-level parallelization is mainly implemented in the Stereo-Matcher processor, corresponding to the maximum allowed disparity range. In the other two co-processors data and bit level parallelism are also applied; for example, the left and right images are processed in parallel whenever possible. Besides parallel computing, all co-processors are also fully pipelined in each processing step, which is the key to enabling the highest throughput.

In Chapter 4 we have provided two reference SoC designs utilizing our stereo matching co-processors, a soft processor (Nios-II) and various on-chip peripherals including DMAs, a DDR2-SDRAM controller and block memories. The SoC1 design accesses the external DDR2-SDRAM as both source and result frame buffer, which is a practical use case with other higher level applications. The resulting disparity maps that SoC1 stores in the DDR2-SDRAM are also uploaded to the PC to verify the hardware processing results. In contrast, the SoC2 design avoids the result frame buffer and directly outputs the resulting disparity maps on a standard desktop DVI monitor. Because of access contention, refresh time and read/write turnaround problems, the external memory bandwidth is hardly ever used very efficiently, so we use the SoC2 implementation to achieve better real-time performance than SoC1.
The SoC2 design and implementation also prove that our stereo matching IPs are fully compatible with the Avalon system interconnect standards and work properly with other Avalon-compatible components, such as the IPs provided by Altera's Video and Image Processing (VIP) Suite. In this chapter we have also estimated the memory and bandwidth utilizations with several derived formulas, which provide a reference for future optimizations and application developments.

Chapter 5 evaluates the overall design and implementation in various aspects corresponding to the design considerations proposed in Chapter 2. The evaluation results show that this work has achieved our desired matching accuracy, robustness, real-time performance and design scalability. It also reveals several potential problems with the current implementation, which will be solved or optimized in our future research and developments.

6.3 Future Work and Application Developments

Regarding the current FPGA implementation, there are at least two optimization plans to further improve its real-time performance.

1. Add FIFOs between the DMAs and the DDR2-SDRAM controller to avoid frequent switches between reads and writes. As shown in Section 5.3.2, the DDR2-SDRAM bandwidth utilization is very low. The access efficiency can be improved to increase the achievable frame rates; alternatively, the reserved bandwidth can also be used for other applications such as image rectification and viewpoint interpolation.

2. Balance the hardware resource utilization. Although the proposed architecture is scalable, the FPGA implementation reports in Section 5.2 expose that the consumed FPGA resources have different scaling factors according to the implemented number of parallel computing threads. Since the chosen FPGA has fixed hardware resources, a balanced resource utilization is achievable by replacing some dedicated hardware DSPs with LUTs and reducing the usage of pipeline registers.
FPGA vendors also provide products with various resource allocations, targeting different application types, so it is also possible to choose a more suitable FPGA chip for this implementation, or to use an ASIC design to circumvent the hardware resource limitations.

To enable a more complete function set in a single chip, the image acquisition and camera rectification modules are also suggested to be implemented together with the stereo matching pipeline. If the camera and rectification modules are also Avalon-compatible, we believe the source frame buffer (in DDR2-SDRAM) can also be removed, and the processing pipeline comprising image acquisition, rectification and stereo matching would not require any external memory access.

With real-time stereo matching, various applications become feasible with the stereo matching results, the disparity maps. The eye-gazing viewpoint interpolation is one potential application, which also motivated this thesis research. Other possible applications include free-view TV, object tracking, and gesture controlled devices and video games.

Bibliography

[1] K. Ambrosch, M. Humenberger, W. Kubinger, and A. Steininger. SAD-based stereo matching using FPGAs. Embedded Computer Vision, pages 121-138, 2009.
[2] M.Z. Brown, D. Burschka, and G.D. Hager. Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 993-1008, 2003.
[3] N. Chang, T.M. Lin, T.H. Tsai, Y.C. Tseng, and T.S. Chang. Real-time DSP implementation on local stereo matching.
[4] N.Y.C. Chang, T.H. Tsai, B.H. Hsu, Y.C. Chen, and T.S. Chang. Algorithm and architecture of disparity estimation with mini-census adaptive support weight. IEEE Transactions on Circuits and Systems for Video Technology, 20(6):792-805, 2010.
[5] P.P. Chu. RTL Hardware Design Using VHDL: Coding for Efficiency, Portability, and Scalability. Wiley-IEEE Press, 2006.
[6] Altera Corporation. The Efficiency of the DDR & DDR2 SDRAM Controller Compiler, December 2004.
http://www.altera.com/literature/wp/wp_ddr_sdram_efficiency.pdf.
[7] Altera Corporation. Avalon Interface Specifications, April 2009. http://www.altera.com/literature/manual/mnl_avalon_spec.pdf.
[8] Altera Corporation. TriMatrix Embedded Memory Blocks in Stratix III Devices, May 2009. http://www.altera.com/literature/hb/stx3/stx3_siii51004.pdf.
[9] Altera Corporation. Embedded Peripherals IP User Guide, July 2010. http://www.altera.com/literature/ug/ug_embedded_ip.pdf.
[10] Altera Corporation. Nios II Processor Reference Handbook, July 2010. http://www.altera.com/literature/hb/nios2/n2cpu_nii5v1.pdf.
[11] Altera Corporation. Stratix III Device Handbook, July 2010. http://www.altera.com/literature/hb/stx3/stratix3_handbook.pdf.
[12] Altera Corporation. Video and Image Processing Suite User Guide, July 2010. http://www.altera.com/literature/ug/ug_vip.pdf.
[13] A. Darabiha, J. Rose, and W.J. MacLean. Video-rate stereo depth measurement on programmable hardware. 2003.
[14] A. Fusiello, V. Roberto, and E. Trucco. Efficient stereo with multiple windowing. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 858-863. Citeseer, 1997.
[15] M. Gong, R. Yang, L. Wang, and M. Gong. A performance study on different cost aggregation approaches used in real-time stereo matching. International Journal of Computer Vision, 75(2):283-296, 2007.
[16] M. Hariyama, T. Takeuchi, and M. Kameyama. VLSI processor for reliable stereo matching based on adaptive window-size selection. In Proc. IEEE Int. Conf. on Robotics and Automation, volume 2, pages 1168-1173, 2001.
[17] H. Hirschmueller, P.R. Innocent, and J. Garibaldi. Real-time correlation-based stereo vision with reduced border errors. International Journal of Computer Vision, 47(1):229-246, 2002.
[18] H. Hirschmueller and D. Scharstein. Evaluation of cost functions for stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR'07, pages 1-8, 2007.
[19] H. Hirschmueller and D.
Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE transactions on pattern analysis and machine intelligence, pages 1582–1599, 2008. [20] M. Humenberger, C. Zinner, M. Weber, W. Kubinger, and M. Vincze. A fast stereo matching algorithm suitable for embedded real-time systems. Computer Vision and Image Understanding, 2010. [21] H. Jeong and S.C. Park. Trellis-based systolic multi-layer stereo matching. In IEEE Workshop on Signal Processing Systems, 2003. SIPS 2003, pages 257–262, 2003. [22] Y. Jia, X. Zhang, M. Li, and L. An. A miniature stereo vision machine (MSVM-III) for dense disparity mapping. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, volume 1, 2004. [23] S. Jin, J. Cho, XD Pham, KM Lee, S.K. Park, M. Kim, and JW Jeon. FPGA Design and Implementation of a Real-time Stereo Vision System. IEEE Transactions on Circuits and Systems for Video Technology, 2010. [24] T. Kanade, A. Yoshida, K. Oda, H. Kano, and M. Tanaka. A stereo machine for video-rate dense depth mapping and its new applications. In cvpr, page 196. Published by the IEEE Computer Society, 1996. [25] P. Kauff, N. Brandenburg, M. Karl, and O. Schreer. Fast hybrid block-and pixelrecursive disparity analysis for real-time applications in immersive tele-conference scenarios. In Proc. of WSCG, volume 9. Citeseer, 2000. [26] S. Kosov, T. Thormahlen, and H.P. Seidel. Accurate real-time disparity estimation with variational methods. Advances in Visual Computing, pages 796–807, 2009. [27] M. Kuhn, S. Moser, O. Isler, F.K. Gurkaynak, A. Burg, N. Felber, H. Kaeslin, and W. Fichtner. Efficient ASIC implementation of a real-time depth mapping stereo vision system. In Proceedings of the of the 46th IEEE Midwest International Symposium on Circuits and Systems (MWSCAS03), pages 1478–1481. Citeseer. 78 [28] J. Lu, G. Lafruit, and F. Catthoor. Anisotropic local high-confidence voting for accurate stereo correspondence. 
In Proceedings of SPIE, volume 6812, page 68120J, 2008. [29] J. Lu, S. Rogmans, G. Lafruit, and F. Catthoor. High-speed stream-centric dense stereo and view synthesis on graphics hardware. In IEEE 9th Workshop on Multimedia Signal Processing, 2007. MMSP 2007, pages 243–246, 2007. [30] D. Masrani and W. MacLean. A real-time large disparity range stereo-system using FPGAs. Computer Vision–ACCV 2006, pages 42–51, 2006. [31] Y. Miyajima and T. Maruyama. A real-time stereo vision system with FPGA. In Field-Programmable Logic and Applications, pages 448–457. Springer, 2003. [32] S. Mukai, D. Murayama, K. Kimura, T. Hosaka, T. Hamamoto, N. Shibuhisa, S. Tanaka, S. Sato, and S. Saito. Arbitrary view generation for eye-contact communication using projective transformations. In Proceedings of the 8th International Conference on Virtual Reality Continuum and its Applications in Industry, pages 305–306. ACM, 2009. [33] Sungchan Park and Hong Jeong. Real-time Stereo Vision FPGA Chip with Low Error Rate. 2007. [34] J. Salmen, M. Schlipsing, J. Edelbrunner, S. Hegemann, and S. Luke. Real-Time Stereo Vision: Making More Out of Dynamic Programming. In Computer Analysis of Images and Patterns, pages 1096–1103. Springer, 2009. [35] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision, 47(1):7–42, 2002. [36] D.E. Simon. An embedded software primer. Addison-Wesley, 1999. [37] Q. Tian and M.N. Huhns. Algorithms for subpixel registration. Computer Vision, Graphics, and Image Processing, 35(2):220–233, 1986. [38] F. Tombari, S. Mattoccia, L. Di Stefano, and E. Addimanda. Classification and evaluation of cost aggregation methods for stereo correspondence. [39] F. Tombari, S. Mattoccia, L. Di Stefano, and E. Addimanda. Near real-time stereo based on effective cost aggregation. In 19th International Conference on Pattern Recognition, 2008. ICPR 2008, pages 1–4, 2008. [40] Middlebury University. 
Middlebury Stereo Evaluation - Version 2, 2010. http: //vision.middlebury.edu/stereo/eval/. [41] M.A. Vega-Rodr´ıguez, J.M. S´anchez-P´erez, and J.A. G´omez-Pulido. An FPGAbased implementation for median filter meeting the real-time requirements of automated visual inspection systems. In Proceedings of the 10th Mediterranean Conference on Control and Automation. Citeseer, 2007. 79 [42] O. Veksler. Fast variable window for stereo correspondence using integral images. 2003. [43] L. Wang, M. Liao, M. Gong, R. Yang, and D. Nister. High-quality real-time stereo using adaptive cost aggregation and dynamic programming. 2006. [44] J. Woodfill and B. Von Herzen. Real-time stereo vision on the PARTS reconfigurable computer. In IEEE Symposium on FPGAs for Custom Computing Machines, volume 4. Citeseer, 1997. [45] J.I. Woodfill, G. Gordon, and R. Buck. Tyzx deepsea high speed stereo vision system. 2004. [46] Q. Yang, L. Wang, R. Yang, S. Wang, M. Liao, and D. Nister. Real-time global stereo matching using hierarchical belief propagation. In The British Machine Vision Conference, pages 989–998, 2006. [47] R. Yang, M. Pollefeys, and S. Li. Improved real-time stereo on commodity graphics hardware. In Conference on Computer Vision and Pattern Recognition Workshop, 2004. CVPRW’04, pages 36–36, 2004. [48] K.J. Yoon and I.S. Kweon. Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):650–656, 2006. [49] K. Zhang, J. Lu, and G. Lafruit. Cross-based local stereo matching using orthogonal integral images. IEEE Transactions on Circuits and Systems for Video Technology, 19(7):1073–1079, 2009. [50] K. Zhang, J. Lu, G. Lafruit, R. Lauwereins, and L. Van Gool. Real-time accurate stereo with bitwise fast voting on CUDA. 5th IEEE workshop on embedded computer vision, held in conjunction with ICCV 2009. 80