Fast Scale Invariant Feature Detection and Matching on Programmable Graphics Hardware

Nico Cornelis (K.U.Leuven, Leuven, Belgium) [email protected]
Luc Van Gool (K.U.Leuven / ETH Zurich, Leuven, Belgium / Zurich, Switzerland) [email protected]

Abstract

Ever since the introduction of freely programmable hardware components into modern graphics hardware, graphics processing units (GPUs) have become increasingly popular for general purpose computations. Especially when applied to computer vision algorithms where a Single set of Instructions has to be executed on Multiple Data (SIMD), GPU-based algorithms can provide a major increase in processing speed compared to their CPU counterparts. This paper presents methods that take full advantage of modern graphics card hardware for real-time scale invariant feature detection and matching. The focus lies on the extraction of feature locations and the generation of feature descriptors from natural images. The generation of these feature vectors is based on the Speeded Up Robust Features (SURF) method [1] due to its high stability against rotation, scale and changes in lighting conditions of the processed images. With the presented methods, feature detection and matching can be performed at framerates exceeding 100 frames per second for 640 × 480 images. The remaining time can then be spent on fast matching against large feature databases on the GPU while the CPU can be used for other tasks.

Keywords: GPU, SURF, feature extraction, feature matching

1. Introduction

Since the arrival of dedicated graphics hardware, its development has mainly been driven by the game industry. Throughout the years, however, these GPUs transformed from a piece of hardware with fixed function pipelines into a compact-size programmable mini computer with multiple processor cores and a large amount of dedicated memory. With an instruction set comparable to that of a CPU, modern-day graphics boards provide a new platform for developers, allowing them to write new algorithms and also port existing CPU algorithms to the GPU, which are then able to run at very high speeds. Algorithms developed using the GPU as a programming platform are commonly referred to as GPGPU (General Purpose Computations on GPU) algorithms [8]. Recent developments in graphics hardware also provide GPGPU programmers with a C-like programming environment for GPUs, enabling them to write even more efficient code, provided that a small number of guidelines are followed. One such programming language is called CUDA, which stands for Compute Unified Device Architecture [5].

This paper shows how scale invariant feature detection and matching can be performed using the GPU as development platform. More specifically, we show how carefully chosen data layouts can lead to a fast calculation of SURF feature locations and descriptors, even when compared to existing GPU implementations [11, 3]. Additionally, fast feature matching is performed by taking advantage of the CUBLAS library, which provides fast GPU implementations of commonly used basic linear algebra routines on the latest generation of Geforce graphics cards. This approach has been developed for Geforce8 series graphics boards and above.

2. Related work

One commonly addressed problem in computer vision is the extraction of stable feature points from natural images. These feature points can subsequently be used for correspondence matching in applications such as wide-baseline stereo and object recognition. Because there can be a significant change in viewpoint position and orientation between images taken from a common scene, the extracted feature points need to be scale and rotation invariant, and preferably also invariant to changes in lighting conditions, in order to be considered stable features. The most popular algorithm developed towards this goal, called Scale Invariant Feature Transform (SIFT) as proposed by Lowe [10], has already been applied successfully in a variety of applications. Shortly after the introduction of SIFT, a different kind of feature extraction and descriptor calculation was proposed, called SURF. While the feature localization stage of SURF has been tuned towards performance and parallelism by using a fast approximation of the Hessian, its feature descriptor stage has been tuned towards high recognition rates and proved to be a valid alternative to the SIFT descriptor. Therefore, the method proposed in this paper uses the SURF descriptor as our preferred choice of feature vector.

3. SURF overview

The SURF method consists of multiple stages to obtain relevant feature points. The individual SURF stages are:

1. Construction of an integral image, introduced by Viola and Jones [12], for fast box filtering with very few memory accesses.
2. Search for candidate feature points by the creation of a Hessian-based scale-space pyramid (SURF detector). Fast filtering is performed by approximating the Hessian as a combination of box filters.
3. Further filtering and reduction of the obtained candidate points by applying a non-maximum suppression stage in order to extract stable points with high contrast. To each remaining point, its position and scale are assigned.
4. Assignment of an orientation to each feature by finding a characteristic direction.
5. Feature vector calculation (SURF descriptor) based on the characteristic direction to provide rotation invariance.
6. Feature vector normalization for invariance to changes in lighting conditions.

For feature localization on a single scale, a variety of feature detectors can be used, such as the Laplacian of Gaussians (LoG), the Difference of Gaussians (DoG) as used in SIFT, and the determinant of the Hessian used in SURF. In order to provide scale invariance, however, these filters have to be applied at different scales, corresponding to different values of sigma for the Gaussian filter kernels. As the filter sizes become large, there is little difference in output values between neighbouring pixels. Therefore, the computational complexity can be reduced by the creation of a scale-space pyramid where each level of the pyramid, referred to as an octave, consists of N different scales. The resolution of each octave is half that of the previous octave in both the x and y directions. In order to localize interest points in the image and over scales, non-maximum suppression (NMS) in a 3 × 3 × 3 neighbourhood is applied to the scale-space pyramid. The extrema of the determinant of the Hessian matrix are then interpolated in scale and image space with the method proposed by Brown and Lowe [2].

The next step consists of fixing a reproducible orientation based on information from circular regions around the interest points. For that purpose, the Haar-wavelet responses in x and y direction are first calculated in a circular neighbourhood around the interest point. The size of the circular region and the sampling steps are chosen proportional to the scale at which the feature point was detected.
Once the wavelet responses are calculated and weighted with a Gaussian centered at the interest point, the responses are represented as vectors in a space with the horizontal response strength along the abscissa and the vertical response strength along the ordinate. The dominant orientation is subsequently estimated by calculating the sum of all responses within a sliding orientation window covering an angle of π/3. The horizontal and vertical responses within this window are summed. The two summed responses then yield a new vector. The longest such vector lends its orientation to the interest point.

Finally, a descriptor is generated for each feature. This step consists of constructing a square region centered around the interest point and oriented along the orientation selected in the previous step. This region is split up regularly into 4 × 4 square sub-regions. For each of these sub-regions, 4 characteristic values are computed at 5 × 5 regularly spaced sample points. The wavelet responses dx and dy computed at these sample points are first weighted with a Gaussian centered at the interest point and summed up over each sub-region to form a first set of entries to the feature vector. In order to bring in information about the polarity of the intensity changes, the sums of the absolute values of the responses, |dx| and |dy|, are also extracted. Hence, each sub-region has a four-dimensional descriptor vector v = (Σdx, Σdy, Σ|dx|, Σ|dy|) for its underlying intensity structure, resulting in a descriptor vector of length 64 for all 4 × 4 sub-regions. The wavelet responses are invariant to a bias in illumination (offset). Invariance to contrast (a scale factor) is achieved by turning the descriptor into a unit vector.

4. GPU-Implementation

Writing GPGPU applications is by no means a trivial task, even in the presence of a reference CPU implementation. One has to make careful choices regarding the data layout in graphics card memory to ensure that the application can take full advantage of the graphics card capabilities, especially in terms of parallelism. The following sections show how the data layout proposed in this paper allows for parallelism at both image level and scale level at various stages of the algorithm.

4.1. Data Layout

The methods proposed in this paper make extensive use of texture mipmaps. A texture mipmap differs from a regular texture in the sense that it also provides additional storage for subsampled versions of the full-resolution (base) texture of size w × h. Apart from the base level, the higher mipmap levels are generated by applying a 2 × 2 boxfilter to the previous mipmap level. Texture mipmaps generated this way are used to reduce aliasing artifacts when displaying high-resolution textures in low-resolution viewports. Because the GPU is able to write into textures rather than treating them as a read-only part of memory, and because the data in the mipmap levels is not constrained to be subsampled versions of the base texture, these texture mipmaps can provide us with the data storage we need. Similar to the storage requirements for the Hessian calculations, the mipmap size at level i corresponds to max(1, ⌊w/2^i⌋) × max(1, ⌊h/2^i⌋). As such, a texture mipmap proves to be a suitable choice for the scale-space pyramid where each octave corresponds to a mipmap level.
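As a concrete illustration of this layout, the following minimal C++ sketch (illustrative only; the helper name and the printed example are ours, not the paper's) computes the dimensions of each mipmap level, i.e. of each octave, for a given base texture size:

    // Illustrative sketch (assumption: not from the paper): dimensions of mipmap
    // level i for a base texture of size w x h, following
    // max(1, floor(w/2^i)) x max(1, floor(h/2^i)).
    #include <algorithm>
    #include <cstdio>

    struct LevelSize { int width; int height; };

    LevelSize mipmapLevelSize(int w, int h, int level) {
        // Each level halves the resolution, clamped to a minimum of one texel.
        return { std::max(1, w >> level), std::max(1, h >> level) };
    }

    int main() {
        // Example: a 640 x 480 base texture, one mipmap level per octave.
        for (int octave = 0; octave < 5; ++octave) {
            LevelSize s = mipmapLevelSize(640, 480, octave);
            std::printf("octave %d: %d x %d\n", octave, s.width, s.height);
        }
        return 0;
    }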
4.2. Scale-Space Pyramid Creation

Although the pyramid creation proposed for SURF in [1] allows for fast filtering and parallelism at scale level, it is not suited for an efficient GPU implementation. This is caused by the fact that, although only a small number of memory accesses have to be performed, there is a large stride between the memory locations of those accesses at large scales, leading to poor performance of the texture caches. Furthermore, the calculation of this fast Hessian at different scales requires texture accesses at different positions within the integral image. Because texture caching works efficiently on localized data, we prefer a computation of the exact Hessian as a combination of separable filters to an approximation using integral images. Similar to the first stages of SIFT, a two-pass approach is used where Gaussian-filtered versions of the base texture are created in the first pass. The second pass consists of a pixel-wise calculation of the Hessian matrix.

4.2.1 Gaussian Pyramid Creation

As noted in [3], GPUs do not only offer parallel processing on a per-pixel basis, parallelized over the number of fragment processors on the GPU; there is also parallelism in the computational stages of the GPU by calculating the four color channels simultaneously as a vector. To take advantage of these vector abilities of GPUs, they modified the gray-level input image in a preprocessing step where the gray image data was rearranged into a four-channel RGBA image. Each pixel in the RGBA image represented a 2 × 2 pixel block of the gray-level image, thus reducing the image area by a factor of 4. Although this approach clearly allows for more efficient parallelization at image-space level, this paper proposes an alternative data layout which also provides parallelism at scale-space level.

For each octave, N Gaussian-blurred versions G of the base texture I need to be created according to the following separable filter implementations:

    G'_s(x, y) = \sum_{t=-n_s}^{n_s} I(x - t, y) \, g_s(t)    (1)

    G_s(x, y) = \sum_{t=-n_s}^{n_s} G'_s(x, y - t) \, g_s(t)    (2)

where g_s represents the discretized Gaussian kernel of size 2n_s + 1 at scale s ∈ [0, N − 1]. By defining

    g(t) = (g_0(t), g_1(t), ..., g_{N-1}(t))^T    (3)

the above equations can be rewritten as

    G'(x, y) = \sum_{t=-n}^{n} I(x - t, y) * g(t)    (4)

    G(x, y) = \sum_{t=-n}^{n} G'(x, y - t) * g(t)    (5)

where * represents an elementwise multiplication operator and n = max(n_0, n_1, ..., n_{N-1}).

Figure 1. Top left: Gaussian-blurred images, stored in multiple textures. Top right: all octaves and scales mapped into a single texture mipmap. Bottom: Gaussian filter kernels g(t) for N = 4 (channels red, green, blue and alpha correspond to scales 0 to 3).

Figure 1 illustrates the data layout used for N = 4 and n = 5. In practice, n = 9. The separable Gaussian filters are implemented as a two-pass operation. The first pass performs Gaussian filtering in the x-direction by fetching scalar values from the gray-level base texture I and multiplying them with the four-component Gaussian kernel values g. As the values of g are independent of the pixel position, they can be hard-coded into the fragment shader. The results of the first pass are written to a four-channel intermediate texture G'. Similarly, the second pass reads from G' and writes the result of the 19 × 19 Gaussian kernel to the four-component texture G. As such, the four Gaussian-filtered versions are calculated simultaneously. A sketch of this first pass is given below.
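The paper implements this pass as an OpenGL fragment shader; purely for illustration, the same idea is sketched here as a CUDA kernel (kernel name, constants and the border-clamping choice are our own assumptions, not the paper's code):

    // Hedged sketch: horizontal pass of the separable Gaussian filter, applying
    // four kernels (one per scale / color channel) in a single sweep over I.
    #define N_SCALES 4      // one scale per color channel
    #define RADIUS   9      // n = 9 in the paper, i.e. a 19-tap kernel

    // Four Gaussian kernels, zero-padded to the largest radius, mirroring
    // g(t) = (g_0(t), ..., g_3(t)); position-independent, hence constant memory.
    __constant__ float g[2 * RADIUS + 1][N_SCALES];

    __global__ void gaussianPassX(const float* I, float4* Gp, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);
        for (int t = -RADIUS; t <= RADIUS; ++t) {
            int xs = min(max(x - t, 0), w - 1);       // clamp at the image border
            float v = I[y * w + xs];                  // scalar fetch from I
            const float* k = g[t + RADIUS];
            acc.x += v * k[0];                        // scale 0
            acc.y += v * k[1];                        // scale 1
            acc.z += v * k[2];                        // scale 2
            acc.w += v * k[3];                        // scale 3
        }
        Gp[y * w + x] = acc;                          // intermediate texture G'
    }

The vertical pass would read the four-component values of G' and multiply them elementwise with the same kernel vector, producing G.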
Note that, compared to the method proposed in [3], superfluous computations are performed at lower scales, caused by the zero-valued entries in the Gaussian kernels. This disadvantage, however, is overcompensated by a reduced number of context and fragment shader switches, as well as the elimination of a preprocessing step. Apart from the first octave, where the first pass takes samples from the base texture I, the first pass for the remaining octaves samples the alpha component of G at the previous octave in order to reduce aliasing effects.

4.2.2 Hessian Pyramid Creation

Once the four Gaussian-filtered images have been calculated for an octave, with one scale per color channel, they can be used for calculating the determinant of the Hessian:

    \det H(x, y) = \begin{vmatrix} \frac{\partial^2 G(x,y)}{\partial x^2} & \frac{\partial^2 G(x,y)}{\partial x \partial y} \\ \frac{\partial^2 G(x,y)}{\partial x \partial y} & \frac{\partial^2 G(x,y)}{\partial y^2} \end{vmatrix}    (6)

Because the discretized second-order derivative kernels are independent of the scale, the Hessian values are once again calculated for the four scales simultaneously. Note that the methods mentioned above are also applicable to a GPU implementation of the SIFT algorithm.

4.3. Keypoint Filtering

Once the Hessian values have been computed, non-maximum suppression is applied in a 3 × 3 × 3 neighbourhood for all scales. First of all, in order to reduce the amount of data that has to be passed on to the next stage, we exploit the fact that no more than one extremum can be found in a 2 × 2 neighbourhood in image space. Therefore, the results of the NMS stage can be written out to a texture mipmap of half the size of H in both x and y directions. Because the first and last scale lack one border scale to perform 3 × 3 × 3 NMS filtering, the two last scales of the previous octave are resampled so that there are once again 4 detectable scales per octave. Without going into further detail, we would like to point out that the data layout, which encodes 4 scales in a single four-component pixel, once more proves to be a good choice in terms of efficiency. The extremum in a 3 × 3 neighbourhood in image space can be computed for multiple scales simultaneously by performing the comparisons on vectors rather than scalars. For each pixel, the result is encoded in a 4-byte value, for example:

    X-offset: 00001100   Y-offset: 00000101   Extremum: 00001101   Octave: 00000011

where the last byte encodes the octave that is currently being processed (up to 256 possible octaves). The third byte encodes whether an extremum has been detected, with the least significant bit corresponding to scale 0. As such, this encoding allows for up to 8 detectable scales. The first two bytes encode the offset in x and y direction, needed because 2 × 2 pixels are processed simultaneously. Note that, although a 16-bit encoding would suffice for N = 4, we have chosen a 32-bit encoding for further efficient processing in CUDA.

Once the pixels have been encoded, we need to extract the extrema locations from the encoded images. Usually, this is done by transferring the data from the GPU to system memory, followed by scanning the downloaded data on the CPU for extrema locations. Downloading large amounts of data to system memory, however, is slow as it has to be transferred over the PCI bus. Therefore, we present a method that only downloads the detected extrema locations to system memory.
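Before turning to the extraction itself, the 4-byte encoding described above can be sketched as follows (a hedged illustration; the helper names and the exact bit semantics of the offset bytes, one bit per scale, are our own interpretation of the text):

    // Hedged sketch of the 4-byte NMS result encoding described above.
    __host__ __device__ inline unsigned int encodeNMS(unsigned int xOffsetBits,
                                                      unsigned int yOffsetBits,
                                                      unsigned int extremumBits,  // bit s set if scale s holds an extremum
                                                      unsigned int octave)
    {
        return  (xOffsetBits  & 0xFFu)
             | ((yOffsetBits  & 0xFFu) << 8)
             | ((extremumBits & 0xFFu) << 16)
             | ((octave       & 0xFFu) << 24);
    }

    // Number of extrema N_p encoded in one pixel: count the set bits in the third byte.
    __host__ __device__ inline int numExtrema(unsigned int encoded)
    {
        unsigned int bits = (encoded >> 16) & 0xFFu;
        int count = 0;
        for (; bits; bits &= bits - 1) ++count;   // clear lowest set bit
        return count;
    }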
In 3D application programming interfaces such as OpenGL and DirectX, a fragment produced by a single execution of a fragment program is constrained to the pixel location where it is to be drawn on screen, even when using multiple render targets. The Nvidia CUDA language, however, allows data to be scattered at arbitrary locations in memory. One only needs to ensure that different threads do not write to the same location in memory to avoid undefined results. Therefore, the extrema extraction stage is implemented as a two-pass algorithm where the first pass assigns a unique index to each extremum, followed by a second pass which scatters the extrema locations into a continuous array.

Figure 2. Top: fast extraction of sparse feature locations from large 2D arrays on the GPU. Bottom: expanding the algorithm to extract feature data at multiple scales from a matrix of encoded 32-bit values.

For clarity, we first show how the CUDA language can be used to extract pixel locations from a sparse single-component m × n matrix. This process is illustrated in the top half of Figure 2. In the first pass, n simultaneously running threads process a single column each and have an integer-valued counter initialized at 0. While iterating over the m rows, the counter value is incremented by one whenever an extremum is detected. This results in a 1 × n array where each value corresponds to the number of detected extrema in its corresponding column. Subsequently, this array is converted into another array Ne of equal size where each value is the sum of all previous values, with the first value initialized at 0. This conversion step can also be performed efficiently on the GPU [4]. In the second pass, the n threads once again process a single column each, but now their counter value is initialized at its corresponding value in Ne. This way, whenever an extremum is encountered in this second pass, the counter value provides a unique index for it and its location can be scattered into a continuous output array. Afterwards, this output array is downloaded to system memory to inform the CPU of the number of features and their respective locations. Note that this approach avoids the need to download entire images to system memory, as well as scanning these downloaded images for extrema locations on the CPU, while the large number of simultaneously running threads compensates for the fact that the matrix needs to be traversed a second time.

Expanding this algorithm to extract feature data at multiple scales from a 2D matrix of encoded 32-bit values is straightforward. Instead of incrementing the counter value by one, it is incremented by the number of extrema Np in the encoded pixel p at position (x, y), where Np corresponds to the number of ones in the third byte. In the second pass, the feature data of the detected extrema is scattered to Np contiguous 4-tuples in the output array, starting at the index stored in the counter, as shown in the bottom half of Figure 2. Note that, besides the 2D feature location in pixel units of the original image, we also write out the scale and octave for each feature. This additional information facilitates selecting the necessary mipmap levels and color channels to perform feature location interpolation. A sketch of the two-pass extraction is given below.
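The following CUDA sketch covers the simplified single-component variant of the two-pass extraction (a hedged illustration under our own naming; the exclusive prefix sum producing Ne is assumed to be computed with a separate scan step, as in [4]):

    // Pass 1: one thread per column counts the extrema in that column.
    __global__ void countPerColumn(const float* H, int m, int n, int* counts)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= n) return;
        int count = 0;
        for (int row = 0; row < m; ++row)
            if (H[row * n + col] != 0.0f)     // non-zero entries mark extrema here
                ++count;
        counts[col] = count;
    }

    // Pass 2: one thread per column, counter initialized from the exclusive
    // prefix sum Ne, scatters (x, y) locations into a contiguous output array.
    __global__ void scatterLocations(const float* H, int m, int n,
                                     const int* Ne, int2* locations)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= n) return;
        int index = Ne[col];
        for (int row = 0; row < m; ++row)
            if (H[row * n + col] != 0.0f)
                locations[index++] = make_int2(col, row);   // unique slot per extremum
    }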
4.4. Feature Location Interpolation

For features detected at higher octaves, feature location interpolation is needed in order to obtain more accurate localizations. Possible interpolation schemes range from a simple parabolic interpolation to the interpolation scheme proposed by Brown and Lowe [2]. Because they require a very limited number of numerical operations and only have to be performed on a small amount of data, the choice of interpolation scheme does not have a significant impact on global performance. Since all values are stored in a texture mipmap, where each mipmap level corresponds to an octave and each color channel corresponds to a scale, all information needed to perform interpolation is present within a single texture mipmap. This means that interpolation for all features can be performed in a single pass while only a single texture mipmap needs to be bound to the fragment program. The ability to process all features in a single pass eliminates many texture and framebuffer switches.

4.5. Feature Descriptors

In order to avoid loss of speed through reallocation of data containers, the data structures should be initialized once at initialization time so that they can be reused. As the number of extracted features per image can vary, the data containers used to store feature information are initialized at a fixed maximum size. This section illustrates how orientation assignment and feature descriptor calculation are performed for a maximum allowable number of 4096 features.

4.5.1 Orientation Assignment

Gradient information in the neighbourhood of the feature locations is stored in 16 × 16 pixel blocks which are tiled into a texture of size (64 × 16) × (64 × 16), so that each feature corresponds to one of the 4096 pixel blocks, as shown in Figure 3. The neighbourhood area covered in the original image is chosen proportionally to the scale at which the feature has been detected. As the original image is loaded onto the graphics card in a texture mipmap, we can let OpenGL perform linear interpolation in both image space and mipmap space automatically to avoid aliasing for features detected at large scales. The actual gradient computation for a feature neighbourhood stores a 64-bit value for each pixel in the 16 × 16 pixel block, corresponding to two 32-bit floating point values representing the gradients in x- and y-direction. These gradient values are also weighted by a Gaussian centered at the feature location.

Figure 3. Tiling of square neighbourhood regions at extracted feature locations.

Once these Gaussian-weighted gradients have been computed, the results are transferred into a pixel buffer which serves as the data link between OpenGL and the CUDA context in which further operations take place. Within CUDA, each pixel block is processed by a multiprocessor running 256 threads in parallel, where each thread transfers a pixel from the 16 × 16 pixel block into fast shared memory. In order to ensure efficient memory transfers, one has to take the warp size of the multiprocessors into account. The warp size is the number of threads that can be combined into a single SIMD instruction (32 for Geforce8). As the width of a pixel block corresponds to half the warp size, the requirement imposed by CUDA to coalesce the memory accesses into a single contiguous aligned memory access is met. Once this transfer has taken place, the transferred gradient array is sorted according to its phase values using a parallel bitonic sort algorithm, and the dominant orientation is estimated by calculating the sum of all responses within a sliding orientation window covering an angle of π/3. The horizontal and vertical responses within this window are summed to yield a new vector. The longest such vector lends its orientation to the interest point. A sketch of this sliding-window estimation is given below.
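The sliding-window estimate itself can be sketched as follows (a hedged, host-side C++ illustration under our own naming; for clarity this brute-force version does not exploit the sorted order, whereas the paper runs a linear sliding window per feature inside a CUDA block after the parallel bitonic sort):

    #include <cmath>
    #include <vector>

    struct Gradient { float dx, dy, angle; };   // angle = atan2f(dy, dx)

    // Dominant orientation: for each candidate window start, sum all responses
    // whose angle lies within pi/3 of it (wrapping around the circle) and keep
    // the orientation of the longest summed vector.
    float dominantOrientation(const std::vector<Gradient>& samples)
    {
        const float PI = 3.14159265f;
        const float WINDOW = PI / 3.0f;
        float bestLen2 = -1.0f, bestAngle = 0.0f;
        for (size_t i = 0; i < samples.size(); ++i) {
            float sumX = 0.0f, sumY = 0.0f;
            for (size_t j = 0; j < samples.size(); ++j) {
                float d = samples[j].angle - samples[i].angle;
                if (d < 0.0f) d += 2.0f * PI;          // wrap the angular difference
                if (d < WINDOW) { sumX += samples[j].dx; sumY += samples[j].dy; }
            }
            float len2 = sumX * sumX + sumY * sumY;
            if (len2 > bestLen2) { bestLen2 = len2; bestAngle = atan2f(sumY, sumX); }
        }
        return bestAngle;   // orientation of the longest summed vector
    }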
4.5.2 Aligned Feature Descriptors

The feature descriptor generation stage is similar to the initial step of the orientation assignment. The feature neighbourhoods are tiled and multiplied with a Gaussian in the same way, but are now aligned to their assigned orientation. As required by the SURF descriptor, the gradient values v = (dx, dy, |dx|, |dy|) are now stored in 128-bit pixels. In this step, the results are written to a texture mipmap with a base size of (64 × 16) × (64 × 16). By making use of hardware-accelerated mipmapping, the values v' = (Σdx, Σdy, Σ|dx|, Σ|dy|) for mipmap level 2 are automatically generated as a 4 × 4 boxfiltering of the base level. As shown in Figure 4, this mipmap level is of size (64 × 4) × (64 × 4) and each feature now corresponds to a 4 × 4 pixel block, resulting in a descriptor of size 64.

Figure 4. Feature vector generation and normalization.

Although the resulting descriptor incorporates spatial information about the feature, it still needs to be normalized to provide invariance against changing lighting conditions. In order to avoid numerical inaccuracies when normalizing a vector of 64 floating point values, a multi-phase algorithm is needed. To this end, a texture mipmap with a base size of (64 × 4) × (64 × 4) is allocated and used to draw the squared values of v'. Once again, hardware-accelerated mipmapping is used to generate mipmap level 2. This mipmap level is of size 64 × 64 and the sum of the four values in each pixel now corresponds to the squared norm of its corresponding feature vector. In a final step, the feature vectors are normalized according to these values. Because hardware-accelerated mipmapping can be seen as a repetitive 2 × 2 boxfiltering, this is a multi-phase approach where each hierarchical level calculates the sum of four lower-level values, thereby reducing numerical inaccuracies.

5. Matching

Feature descriptor matching can be performed using different criteria, such as the Sum of Squared Differences (SSD) or the dot product (DOT). Finding the best match for a feature vector f in a database of feature vectors D is done by locating the feature d in D for which the SSD is minimal. Because the feature vectors have been normalized, this corresponds to finding the feature vector d for which the dot product with f is maximal:

    SSD = \sum_{i=0}^{63} (f_i - d_i)^2    (7)
        = \sum_{i=0}^{63} f_i^2 + \sum_{i=0}^{63} d_i^2 - 2 \sum_{i=0}^{63} f_i d_i    (8)
        = 2 - 2 \cdot \mathrm{DOT}    (9)

If the feature descriptors are to remain on the graphics card, the tiled feature descriptor array can be converted to a 2D matrix where each tile is mapped onto a row of the 2D matrix. Similar to the transformation in the orientation assignment pass, this step can be performed efficiently using CUDA. Next, a correlation matrix C is computed as C = D · F^T, where each entry C_{i,j} contains the dot product of the feature vectors d_i and f_j. Because the correlation matrix C is calculated as a basic matrix multiplication, it can be computed efficiently on the GPU by making use of the CUBLAS library, which provides optimized versions of basic linear algebra operations on the GPU. Finally, in order to extract the best match for each of the features in F, a CUDA algorithm is used where each thread processes a column of C and returns the row index of the best match. When matching m features against a database of n features using descriptors of size d, calculating the correlation matrix C has a computational complexity of O(m × n × d), whereas the extraction of the best matches has a computational complexity of O(m × n). A sketch of this matching stage is given below.
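The matching stage can be sketched as follows (a hedged illustration; the variable names, storage layout and the use of the CUBLAS v2 API are our assumptions, since the paper does not list its exact calls):

    #include <cublas_v2.h>

    // One thread per column of C (i.e. per query feature): return the database
    // row with the largest dot product, which minimizes the SSD for unit vectors.
    __global__ void bestMatchPerColumn(const float* C, int nDb, int mQuery, int* best)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= mQuery) return;
        const float* col = C + (size_t)j * nDb;   // column-major storage, ldc = nDb
        int bestIdx = 0;
        float bestVal = col[0];
        for (int i = 1; i < nDb; ++i)
            if (col[i] > bestVal) { bestVal = col[i]; bestIdx = i; }
        best[j] = bestIdx;
    }

    // d_D: 64 x nDb database descriptors, d_F: 64 x mQuery query descriptors
    // (both column-major, one 64-float descriptor per column); d_C: nDb x mQuery.
    void matchDescriptors(cublasHandle_t handle, const float* d_D, const float* d_F,
                          float* d_C, int* d_best, int nDb, int mQuery)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // Correlation matrix: an (nDb x 64) times (64 x mQuery) product, so that
        // C(i, j) = dot(d_i, f_j).
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    nDb, mQuery, 64,
                    &alpha, d_D, 64, d_F, 64, &beta, d_C, nDb);
        int threads = 256;
        int blocks = (mQuery + threads - 1) / threads;
        bestMatchPerColumn<<<blocks, threads>>>(d_C, nDb, mQuery, d_best);
    }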
6. Results

The proposed method has been tested using the datasets and the evaluation tool provided by Mikolajczyk [9]. The matching results for two image sequences, along with the repeatability scores for the four most challenging image sequences, are shown in Figure 6. As expected, the repeatability scores of the GPU implementation and the original SURF implementation are very similar. Timing results of the proposed method and the ones developed by Sinha et al. [11] and Heymann et al. [3] are listed in Table 1. The tests were performed on images with a resolution of 640 × 480. For the matching test, a feature database of 3000 features is matched against another database of 3000 features, similar to [3]. More detailed timing results for the proposed algorithm are shown in Figure 5. For an image of size 640 × 480, the algorithm is able to extract 805 feature locations and descriptors at 103 frames per second on a desktop PC and 35 frames per second on a mid-end workstation laptop.

As previously developed GPU implementations already achieved near real-time framerates, one might fail to see the practical value of an even faster implementation. Not only does the method proposed in this paper allow for the additional feature matching required by real-time applications such as object recognition, it can also be used to speed up offline applications that need to process vast amounts of data. For example, given an image of an object, one might want to retrieve alternate images containing this object by matching against a collection of images selected from large image databases [7, 6] based on image tag information. In both cases, the proposed methods can be used for fast feature extraction and matching against large feature databases. Additionally, in order to achieve higher recognition rates, the increase in processing speed can be used to extend the feature descriptors by using more and/or larger subregions.

Figure 5. Detailed timing results for a desktop PC (GF8800GTX) and a mid-end workstation laptop (QFX1600M).

                GPU model   Octaves   Scales   Extraction time   Matching time
    Sinha       7900GTX     4         3        100 ms            N/A
    Heymann     QFX3400     4         3        58 ms             500 ms
    This paper  8800GTX     5         4        9.7 ms            32 ms

Table 1. Performance comparison.

7. Conclusion

In this paper, we have shown how the SURF algorithm can be accelerated significantly using programmable graphics hardware, even when compared to existing GPU implementations. By making extensive use of texture mipmapping and the CUDA programming language, feature extraction and descriptor generation can be performed at very high rates. As a result, the SURF algorithm can be applied to image sequences with 640 × 480 pixels at framerates exceeding 100 frames per second. Furthermore, utilizing the CUBLAS library also resulted in fast feature matching against large databases.

References

[1] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. European Conference on Computer Vision, pages 404–417, 2006.
[2] M. Brown and D. Lowe. Invariant features from interest point groups. British Machine Vision Conference, Cardiff, Wales, pages 656–665, 2002.
[3] S. Heymann, K. Müller, A. Smolic, B. Fröhlich, and T. Wiegand. SIFT Implementation and Optimization for General-Purpose GPU, 2007.
[4] http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/scan/doc/scan.pdf
[5] http://developer.nvidia.com/object/cuda.html
[6] http://images.google.com/
[7] http://www.flickr.com
[8] http://www.gpgpu.org
[9] http://www.robots.ox.ac.uk/~vgg/research/affine/
[10] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[11] S. Sinha, J. Frahm, and M. Pollefeys. GPU-based Video Feature Tracking and Matching. EDGE 2006, Workshop on Edge Computing Using New Commodity Architectures, 2006.
[12] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. Proc. CVPR, 1:511–518, 2001.

Figure 6. Results indicating invariance to rotation and scaling (Bark, Boat) and affine transformations (Graffiti, Wall). Top: example of matching results. Middle: number of extracted features for each data set. Bottom: repeatability scores.