Proceedings of Seminar and Project 2D Image Processing and Augmented Reality
Winter Semester 2016-17
Oliver Wasenmüller and Prof. Didier Stricker
Department Augmented Vision, University of Kaiserslautern and DFKI GmbH

Introduction

The seminar and project 2D Image Processing and Augmented Reality (INF-73-72-S-7, INF-73-82-L-7) are continuative courses based on and applying the knowledge taught in the lectures 3D Computer Vision (INF-73-51-V-7) and Computer Vision: Object and People Tracking (INF-73-52-V-7). The goal of the project is to research, design, implement and evaluate algorithms and methods for tackling computer vision problems. The seminar is more theoretical; its educational objective is to train the ability to become acquainted with a specific research topic, review scientific articles and give a comprehensive presentation supported by media. In this semester, eight projects and seven seminars were performed. The results are documented in these proceedings.

Organisers and supervisors

The courses are organised by the Department Augmented Vision (http://ags.cs.uni-kl.de), more specifically by Oliver Wasenmüller and Prof. Dr. Didier Stricker. In this semester, the projects and seminars were supervised by the following department members: Ahmed Elhayek, Tewodros A. Habtegebrial, Stephan Krauß, Christiano Gava, Yuriy Anisimov, Jason Rambach, Kiran Varanasi, Pramod Murthy, Oliver Wasenmüller.

March 2017

Human Motion Capture using Embedded Deep Learning
Chengbiao Deng1, Pramod Murthy2, and Ahmed Elhayek3
1 [email protected]  2 [email protected]  3 [email protected]

Abstract. Deep convolutional neural networks are a very popular research field; they have been proven to reach human-level performance in object recognition and localization. Human motion recognition is one of the interesting topics, and researchers have trained very successful CNN models that achieve highly accurate human pose estimation. It is very promising to apply such models in real time on embedded systems, which have vast application scenarios such as human-computer interaction and surveillance. In this report, some state-of-the-art technologies are explored to achieve this goal.

Keywords: human motion, deep learning, embedded

1 Introduction

Over the past years, it has been demonstrated that deep CNNs are able to outperform traditional algorithms in object detection and classification. Recent work on object detection and classification using YOLO has reached human-level accuracy and can run in real time on desktops [9]. Human motion detection has been a research hot spot for a long time. T. Pfister et al. [8] trained a deep CNN with a large receptive field and optical flow to implicitly capture global spatial dependencies. V. Belagiannis et al. [1] added a recurrent module to the feed-forward module to replace the complexity of a graphical model stage. Zhe Cao et al. [2] further achieve real-time multi-person pose estimation using part affinity fields. However, these CNN models often have millions of parameters; it takes days to weeks to train them on high-end hardware, and machines with high-performance GPUs are required to run them in real time. Transferring such a model to embedded platforms with limited computing resources is a challenging task. On the software side, it is necessary to compress the model size. On the hardware side, new embedded architectures need to be designed to leverage the characteristics of compressed CNN models.

Software

There are mainly two levels of compression.
Firstly, at the architecture level, Iandola et al. [5] proposed SqueezeNet, which replaces convolutional layers with fire modules; it borrows the idea of using convolutional filters of different sizes from GoogLeNet. Secondly, at the weights level, researchers focus on pruning unnecessary connections and quantizing the weights with minimal sacrifice of accuracy. Han et al. [4] managed to achieve a 35x compression rate on AlexNet without loss of accuracy, which shows the great potential of weights compression.

Hardware

Theoretically, there is great potential to accelerate DNN inference with pruning and weight sharing; however, current hardware is not able to operate directly on compressed networks. Han et al. propose an efficient inference engine (EIE) [3], which exploits static weight sparsity and dynamic activation sparsity, and operates directly on compressed weights that fit in SRAM. Compared with a CPU (Intel i7-5930k), EIE achieves a 189x acceleration while using 24,000x less energy on AlexNet. This work shows that the way towards real-time DNN inference on embedded boards is software and hardware co-design.

2 My approach

In this report, different methods to achieve efficient image inference on embedded platforms are explored and tested quantitatively. Most of the conducted experiments are based on the model trained by T. Pfister.

Fig. 1: T. Pfister's model, using a spatial fusion layer to learn an implicit spatial model.

To achieve fast and accurate inference, the methods are classified into three groups.

2.1 Different deep learning frameworks

Different deep learning frameworks are characterized by their high efficiency in specific task domains; Caffe and TinyDNN are chosen to be tested with the same CNN model. Caffe is one of the most popular frameworks because of its speed and expressive power, while TinyDNN is a header-only, dependency-free framework built for embedded systems. The heat-map CNN model is applied to both frameworks, and the speeds are compared and analyzed. The tested model is trained in Caffe and can be converted to TinyDNN weights. However, the TinyDNN-Caffe converter supports only single-input/single-output networks without branches, so only the simplified Caffe model which omits the fusion layers is converted and benchmarked.

2.2 Model level optimization

At the architecture level, I have tried adapting the heat-map model with the idea of SqueezeNet [5] and fine-tuning the trained weights on the new data set MPII. The idea is drawn from the fact that many feature channels are either not activated or have similar activation patterns, and that it is beneficial to reduce the number of weights by using kernels of different sizes. Additionally, batch normalization proves to be able to increase accuracy by a small margin.

2.3 Weights pruning and quantization

Pruning. The number of connections is defined with the model before training; the purpose of training is to assign correct weights to each connection. Some connections grow to be more prominent, while others vanish to 0. Heuristically, it makes sense to set all those small weights to 0; it should not affect the accuracy. This also makes it easier for hardware developers to invent new designs that leverage the sparse-matrix property.

Quantization. Linear and k-means quantization methods are explored and compared. In the linear method, we first find the minimum and maximum element of each layer's parameters and then assign each float value to the nearest bin. The drawback of this method is that a lot of information is lost, since the weight distribution is not uniform.
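As an illustration of the linear scheme, the following NumPy sketch quantizes one layer's weights to 8-bit indices into an evenly spaced look-up table; the function names and formulation are my own and not part of the project code.

```python
import numpy as np

def linear_quantize(weights, n_bits=8):
    """Uniformly quantize a layer's float weights to 2**n_bits levels.

    Returns 8-bit bin indices plus the look-up table (evenly spaced bin
    centres between the layer's minimum and maximum weight)."""
    lo, hi = float(weights.min()), float(weights.max())
    table = np.linspace(lo, hi, 2 ** n_bits)
    # index of the nearest bin for every weight
    idx = np.abs(weights[..., None] - table).argmin(axis=-1)
    return idx.astype(np.uint8), table

def dequantize(idx, table):
    """Recover approximate float weights from indices and the look-up table."""
    return table[idx]
```

Storing the uint8 indices plus the 256-entry table per layer is what later gives the roughly 4x size reduction mentioned for 8-bit quantization.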
K-means is better at representing the nonlinear distribution of the weights. The initial centroids are randomly selected within the range of the weights; we then iterate to find the optimal centroids and assign each weight to its nearest centroid. Both methods are implemented and compared in terms of accuracy and speed. The compressed models are stored in binary files in order to compare their final sizes.

2.4 Benchmark

To measure the accuracy of all kinds of model weights, a benchmark standard needs to be settled. In this report, we use the head length, which is the distance from the head to the neck, as a reference. All predictions that are within the tolerance range of the ground truth are marked as correct predictions:

\mathrm{default\ tolerance\ range} = \left\| \mathrm{head} - \frac{\mathrm{left\_shoulder} + \mathrm{right\_shoulder}}{2} \right\|

3 Experiments

3.1 Visualization

A deep CNN is very hard to understand, so the activation of each layer is visualized to get an idea of how the raw input image is gradually regressed to joint positions.

Fig. 2: Input data: Morris at DFKI.

Figure 2 is the input data. Figure 3a reveals edge and color blob information of the input image. Figures 3b and 3c find the outline of the human shape, using the edge information from previous layers. Figure 3d shows that the conv5 layer seems able to separate the human body from the environment. The later layers tend to be more task specific; the activations of the feature maps become very sparse, and the relevant information stands out prominently. Figure 3f shows the heat-map outputs. Overall, the deeper the network, the more abstract spatial information it can leverage.

By studying the activations of each layer, I found that many feature maps are either not activated or share the same activation patterns. The first few layers have many similar activation feature maps, while in the later layers many feature maps are just black. This observation leads to two important ideas for making DNNs run faster. Firstly, we can reduce the number of feature maps that are fed to the next layer; this is partly the reason why I adopt the idea of SqueezeNet later on. Secondly, most of each feature map is black, but different input images may activate different feature maps and different regions of each feature map, so special hardware needs to be designed to leverage this kind of dynamic sparsity and save unnecessary MACs (multiply-accumulate operations). This is a hot research topic in the community; MIT is conducting the Eyeriss project to develop an energy-efficient reconfigurable accelerator for DNNs [7], but this is out of the scope of my project.

Fig. 3: Activations at different layers: (a) conv1, (b) conv2, (c) conv3, (d) conv5, (e) conv7, (f) conv8.

3.2 Accuracy

From the section above, we get an idea of how the DNN model processes the input image into human joint positions. In order to further improve the accuracy, we first need to measure the prediction accuracy quantitatively. Using the normalized head-length benchmark, the accuracy of the trained heat-map model is measured. Figure 4 shows that the shoulders are relatively easy to recognize, but the head has the highest accuracy when the tolerance range is more than 0.25 times the reference distance. For the following experiments, 0.5 times the reference distance is used as the correct-prediction range.

Fig. 4: Prediction accuracy with different tolerance ranges.
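A sketch of the benchmark from Section 2.4, assuming 2D joint arrays in which index 0 is the head and indices 1 and 2 are the left and right shoulders (the exact joint ordering is an assumption, not taken from the project code):

```python
import numpy as np

def correct_prediction_rate(pred, gt, tolerance=0.5, head=0, l_sho=1, r_sho=2):
    """Fraction of predicted joints within tolerance * reference distance of the ground truth.

    pred, gt: arrays of shape (num_images, num_joints, 2) with 2D joint positions.
    The reference distance is the head-to-neck length, approximated here by the
    distance from the head to the midpoint of the two shoulders."""
    ref = np.linalg.norm(gt[:, head] - 0.5 * (gt[:, l_sho] + gt[:, r_sho]), axis=-1)
    err = np.linalg.norm(pred - gt, axis=-1)          # per-image, per-joint error
    return float((err <= tolerance * ref[:, None]).mean())
```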
The first method that I tried in order to improve accuracy was fine-tuning the existing model weights on the MPII data set. The MPII dataset provides ground-truth positions of 14 joints; since the original model only predicts 7 joints, I use only part of the ground-truth information. However, the fine-tuning process was not successful: the heat-map loss drops to 0 soon after training starts, which should indicate that all predictions are 100% accurate if the Caffe framework is working correctly. Yet when I applied the model to test images, all predicted joints fell on the top-left corner of the image, which is clearly wrong.

The other idea for improving the accuracy was inspired while I was observing the training images. Figure 5 shows that the color and brightness of the training images vary, so the statistics of the activations of each layer vary from batch to batch. Adding a batch normalization layer between each convolution layer and the consecutive ReLU layer may speed up convergence and improve accuracy, as studied in the paper by Sergey et al. [6]. However, at the time T. Pfister developed the heat-map model, the batch normalization layer did not exist in the Caffe framework. So if we want to use the new layers, we need to compile a new Caffe with the heat-map data layer and heat-map loss layer. Since Caffe has evolved a lot over the last two years, the work to merge the new Caffe with the customized heat-map layers is still ongoing with help from my supervisor.

3.3 Deployment

I have tried two DNN frameworks on two platforms. Caffe can be compiled successfully on the Odroid XU4, but since the Odroid XU4 does not have an NVIDIA GPU, it only supports the CPU mode of Caffe. TinyDNN is a C++ implementation of deep learning and does not depend on any external library, so it can be compiled on the board without any trouble. TinyDNN comes with a tool to convert sequential Caffe models to TinyDNN models; the original heat-map model has a spatial fusion branch, so my solution was to cut off the branch and use the new Caffe model as input. In this way I successfully converted the Caffe model to a TinyDNN model. The latency for a single image inference is more than 200 s; if TinyDNN is compiled with the -O3 flag, the latency can be reduced to 144 s.

Fig. 5: Eight sample training images.

Then I tried deploying Caffe on the board. The memory consumption for the simplified heat-map model (without spatial fusion) is around 1.9 GB, so I use a USB drive as swap space to run the complete heat-map model. The latency for the different configurations is summarized in Figure 6.

Fig. 6: Speed measurement with different combinations.

On the desktop side, the inference speed with the complete heat-map model reaches 16 fps in GPU mode, which is 100x faster than running in CPU mode. In addition, the speed test shows that TinyDNN is 10x slower than Caffe at computing the deep CNN model. On the embedded side, owing to relatively slower memory access, lower clock frequency, smaller DRAM and other limiting factors, the Odroid is about 10x slower than the desktop at computing the same CNN in CPU mode.

3.4 Compression

Now we need to analyze why the forward inference is so slow. The weights are 70 MB in size, which is too large to fit in the memory cache. Each time the model infers one frame, the weights need to be transferred between DRAM and CPU, back and forth. Most of the time and energy is wasted on memory traffic.

Fig. 7: Comparison of the two hardware platforms.

Fig. 8: Prediction accuracy with different pruning levels.

Fig. 9: Number of pruned weights; the last bin is the comparison of the overall number of weights.
It is very important to compress the weights to make them small enough to fit in the cache. Two widely researched methods are pruning and quantization.

As we know, the connections of each convolutional kernel have different weights; some weights are so small that they have almost no effect on the output value compared with the strong connections. It is safe to prune those connections by setting their weights to 0. Then we can use a sparse matrix multiplication library to accelerate the computation. My work is to test how pruning affects the prediction accuracy. Different pruning intensities are tested; the accuracies are shown in Figure 8, which makes clear that pruning at 0.3 σ preserves the basic accuracy even without fine-tuning. Figure 9 shows the number of pruned weights compared with the original weights. Overall, 31% of the weights can be pruned without affecting the accuracy too much.

The other method is to quantize the weights, i.e. to represent them with fewer bits rather than in full precision. Figure 10 [10] shows that 8 bits are enough to represent the weights and activations of AlexNet; other methods like binary quantization harm the accuracy too much. So 8-bit quantization is adopted for my experiments. 8 bits can represent 256 numbers, but how to choose these 256 numbers so as to maintain the accuracy as well as possible is a problem. Here a linear method and a non-linear method (k-means) are tested. In both cases, we store a look-up table and the group index of each floating point number; thus we can reduce the weights size by around 4x.

Fig. 10: Accuracy of different quantization methods.

Figure 11 demonstrates the effect of the different quantization methods by looking into the conv5 layer. Linear quantization has the same value density everywhere, even where no weights are located. K-means is better: it places more centroids in regions that are dense with weights, and thus loses the least information from the original weights, and it is guaranteed to have a 0 centroid, which can represent 0-valued weights.

Fig. 11: Two quantization methods: (a) linear quantization, (b) k-means quantization.

A technical problem I met when implementing the k-means method: at first, I wanted to use one big matrix to hold the distance values from the weights to the K (K=256) cluster centroids, and use the index of the minimum of each row as the label of the corresponding weight. But for some layers, such as the conv5 layer, there are 10 million weights; the number of clusters multiplied by the number of weights results in a huge matrix, and the program crashes because it runs out of memory. My solution is to divide the weights of each layer into blocks, where each block contains around 4k floating point numbers. In this way, the k-means quantization converges after around 3 minutes with a maximum centroid drift of less than 0.001.

Fig. 12: Demo of the huge distance matrix.

Figure 13 shows the variation of accuracy with the different compression methods. 8-bit quantization of the weights maintains the prediction accuracy even without fine-tuning. All compressed weights are also tested for their speed. As predicted, the compressed weights have the same average inference speed as the original weights: even though the compressed 8-bit weights can be 4x smaller, the forward inference is still computed in full precision by the Caffe kernel, and the large portion of 0s in the weights does not make the inference faster, because neither Caffe nor TinyDNN uses a sparse BLAS library in its kernel.
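The two compression steps can be sketched as follows. The threshold at a multiple of the layer's standard deviation, the pinned zero centroid and the block size of roughly 4k values follow the description above, while the loop structure and default values are my own choices.

```python
import numpy as np

def prune(weights, ratio=0.3):
    """Set to zero all weights whose magnitude is below ratio * sigma of the layer."""
    pruned = weights.copy()
    pruned[np.abs(pruned) < ratio * pruned.std()] = 0.0
    return pruned

def kmeans_quantize(weights, n_clusters=256, block=4096, iters=20):
    """Block-wise k-means quantization of one layer to 8-bit labels plus a codebook.

    Labelling is done block by block (about 4k floats at a time) so that the
    weights-by-centroids distance matrix never becomes too large for memory."""
    flat = weights.ravel()
    centroids = np.random.uniform(flat.min(), flat.max(), n_clusters)
    centroids[0] = 0.0                       # keep an exact zero centroid for pruned weights
    labels = np.empty(flat.size, dtype=np.uint8)
    for _ in range(iters):
        for start in range(0, flat.size, block):
            chunk = flat[start:start + block]
            labels[start:start + block] = np.abs(chunk[:, None] - centroids).argmin(axis=1)
        for c in range(1, n_clusters):       # centroid 0 stays pinned to zero
            members = flat[labels == c]
            if members.size:
                centroids[c] = members.mean()
    return labels.reshape(weights.shape), centroids
```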
Fig. 13: Accuracy variation with different compression methods.

4 Conclusion

In this report, methods to achieve accurate real-time human motion capture on embedded platforms are explored systematically. Accessing DRAM consumes two orders of magnitude more energy and is 10x slower than SRAM, which makes pruning and weight sharing by quantization essential for embedded platforms. The heat-map model is deployed on a desktop and on an Odroid XU4; different combinations of pruning and quantization methods are tested and show great potential for achieving real-time human joint capture with customized hardware.

References
1. Vasileios Belagiannis and Andrew Zisserman. Recurrent human pose estimation. In International Conference on Automatic Face and Gesture Recognition. IEEE, 2017.
2. Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
3. Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient inference engine on compressed deep neural network. International Conference on Computer Architecture (ISCA), 2016.
4. Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations (ICLR), 2016.
5. Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
6. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
7. MIT. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, 2016. [Online; accessed 4-May-2017].
8. T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In IEEE International Conference on Computer Vision, 2015.
9. Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
10. Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient processing of deep neural networks: A tutorial and survey. CoRR, abs/1703.09039, 2017.

Human Pose Estimation in Spatio-Temporal 3D Convolutional Neural Network
Sidney Pontes-Filho1, Ahmed Elhayek2, and Pramod Murthy3
1 s [email protected]  2 [email protected]  3 [email protected]

Abstract. Nowadays, estimating human pose is necessary for several applications. Our goal is to develop a method that accurately infers 3D joint locations from monocular RGB videos. The chosen method is a 3D Convolutional Neural Network, which learns the desired output from the input without any manual feature extraction and which takes advantage of one additional dimension, namely time, encoded in the sequence of video frames. Human pose estimation is a challenging task due to depth ambiguity, difficult backgrounds, and occlusion. We use Human3.6M, a dataset commonly used in works on 3D human pose estimation. The architecture of our network and the details of our method are described, and the results are presented and analysed in a qualitative and quantitative manner.
Keywords: Human pose estimation, 3D Convolutional Neural Network, Motion Capture

1 Introduction

In recent years we have seen the rise of several applications connecting the digital and the real world, such as Augmented Reality, Virtual Reality, pose-based games, Human-Computer Interaction, and so on. These applications can be improved by methods of Human Pose Estimation. There are many accurate methods for estimating the human pose, but they require additional hardware devices and costs. Our goal is to estimate the 3D position of human body parts using a cheap and abundant device, a monocular RGB camera. For that we use a Convolutional Neural Network (CNN), a special version of it that uses 3D convolution instead of the standard 2D one. In Computer Vision, CNNs are strongly researched and show excellent performance, even though they do not use predefined feature extraction methods, because they learn the important features of the data by themselves. Human Pose Estimation faces difficult challenges when using only a monocular RGB camera, namely separating the person from the scene, missing depth information, and occlusion by objects, other people or even the target person itself. To attenuate those problems, we use the information of the time dimension taken from consecutive video frames; in this way, we add motion information as well. Since we do not use a single static 2D image but a video, one more dimension (time) is added, which is the reason for using a 3D CNN.

This work is organized in the following manner. Section 2 reviews some publications related to ours. Section 3 explains the dataset used. Section 4 consists of the explanation and details of our method. Section 5 shows the results obtained and their evaluation. Section 6 presents an analysis of this work and future ones.

2 Related Work

There are several scientific publications on Human Pose Estimation using many different methods. We summarize here the part of this literature whose method is a 2D or 3D Convolutional Neural Network.

Li and Chan [7] make use of a 2D CNN for 3D pose estimation. They train their network in two phases: first it is trained for human part detection, then the fully connected layers are replaced by untrained ones and the CNN is trained for 3D pose regression. The approach of Li et al. [8] also uses a 2D CNN for 3D pose estimation; their technique has an additional goal of an image-pose pair matching score which supports the estimation of the 3D pose.

The use of 3D CNNs is not that recent. In 2010, Ji et al. [6] proposed a 3D CNN model for action recognition from monocular RGB videos. The action recognition task does not estimate the human pose directly, but knowledge of the pose is needed to infer the action class; therefore this approach also estimates the human pose through time indirectly. The method that we base our work on was done by Grinciunaite et al. [3]. They use a 3D CNN model for 3D pose estimation from monocular RGB videos; this method performs better than the 2D CNN methods of [7] and [8]. We can compare them because they use the same dataset, Human3.6M [2, 5], which we also use to evaluate our 3D CNN model.

3 Dataset

After analysing the literature, Human3.6M [5, 2, 7, 8, 3] is a suitable dataset for training and evaluating our proposed CNN. To the best of our knowledge, Human3.6M is the largest motion capture dataset with a license granting free use for academic purposes.
There are 3.6 million human poses, one for each corresponding high-resolution video frame of 1000 x 1000 pixels recorded at a 50 Hz frame rate. 11 professional actors were recorded in an indoor environment using 4 calibrated cameras at different viewpoints, performing 15 distinct activities such as walking, taking photos, and so on. The human pose data consist of accurate 32 joint locations in 2D and 3D captured by a high-speed motion capture system. Some sample images of the Human3.6M dataset are shown in Figure 1.

Fig. 1. Sample images from the Human3.6M dataset, showing the variation of subjects, poses and viewpoints. Image taken from [5].

Fig. 2. Sample of the cropping procedure.

4 Method

Our method is based on the work of Grinciunaite et al. [3]. In this section we describe how the input data and ground truth are preprocessed; the details of the 3D CNN architecture; the procedure for training, validation and testing; and the processing of the network's output.

4.1 Preprocessing

The original 1000 x 1000 pixel video frames of Human3.6M are cropped with the pelvic bone location as the centre of the image; the size of the crop is calculated by Equation 1, which gives the distance from the image centre to the edge of the crop:

distanceFromCentre = max(widthBoundingBox, heightBoundingBox)   (1)

distanceFromCentre is the distance from the centre of the image to the border. The bounding box around the human subject is defined using all 2D joint locations, so a rectangular region is obtained; widthBoundingBox and heightBoundingBox are, respectively, the width and height of that bounding box. This way of defining the cropping region may seem exaggerated: if a person is standing, the person's height is added above and below the pelvic bone joint location, so double the person's height becomes the height and width of the cropping region. However, this is necessary because several variations of the subject's pose could otherwise cut off the hands or feet when cropping. After cropping the video frame using the procedure explained above, the image is resized to a resolution of 256 x 256 pixels. The results of cropping and resizing the video frames are presented in Figure 2.

The coordinate system of the 3D joint locations used as ground truth is translated to the pelvic bone location; that is, the origin of this coordinate system is now the pelvic bone location, which is the first joint in the ground truth annotation. The 32 joint locations of the original 3D human pose are filtered by selecting 17 core joints.

Fig. 3. Network architecture. C means convolutional layer, P is a pooling layer and F is the flattened layer. 'full' means the fully connected layer.

4.2 3D Convolutional Neural Network

Our 3D Convolutional Neural Network consists of five 3D convolutional layers for spatio-temporal feature extraction followed by one fully connected layer for mapping the extracted features to the desired output (3D human pose). The input is 5 consecutive video frames from a single camera, where we skip 2 frames between inputs to increase the temporal window. The output of the network is 5 3D human poses corresponding to the input frames. The mathematical description of a 3D convolutional layer is given by Equation 2 [3]:

(K * X)_{i,j,k} = \sum_m \sum_n \sum_l X_{i-m,\, j-n,\, k-l}\, K_{m,n,l}   (2)

where * denotes discrete convolution, X is the three-dimensional data of dimensions m x n x l, and K is the three-dimensional flipped kernel.
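A minimal numeric illustration of Equation (2) using SciPy (not the project's Torch code); scipy.signal.convolve flips the kernel, which matches the definition above:

```python
import numpy as np
from scipy.signal import convolve

X = np.random.rand(5, 8, 8)        # 5 stacked frames of an 8 x 8 patch (time, height, width)
K = np.random.rand(3, 5, 5)        # a 3 x 5 x 5 kernel, as in the first convolutional layer
Y = convolve(X, K, mode='valid')   # discrete 3D convolution with the flipped kernel
print(Y.shape)                     # (3, 4, 4)
```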
The network architecture and hyper-parameters are as follows. The network's input is 5 consecutive video frames. There are 5 convolutional layers with this sequence of kernel sizes: 3 x 5 x 5, 2 x 5 x 5, 1 x 5 x 5, 1 x 3 x 3 and 1 x 3 x 3. Each convolutional layer is followed by the activation function PReLU with p set to 0.01 [3, 4]. There are 4 max pooling layers, applied in the image space after the first, second, third and fifth convolutional layers, with a kernel size of 2 x 2. The layers described so far are responsible for feature extraction; their output is flattened to a one-dimensional vector of size 11,520 and fed to the fully connected layer, which produces a vector corresponding to the human poses for each input video frame. The size of the output vector is, in our case, 255, which means 5 frames x 17 joints x 3 dimensions. The 3D CNN architecture is illustrated in Figure 3.

The configuration for training is a batch size of 10 for stochastic gradient descent with a learning rate of 10^-7 and Nesterov momentum of 0.9. The number of batches per epoch is 20,000 for training (subjects S5, S6, S7, S8) and 2,000 for validation (subject S9). The batches are all randomly selected. After finishing all 20 epochs, the model with the best validation result is tested on the whole test dataset (subject S11). The cost function used for training is the mean squared error (MSE). We developed our 3D CNN using the scientific computing framework Torch for the programming language LuaJIT. Our code is based on the GitHub repository of an ImageNet example in Torch [1].

4.3 Postprocessing

The output of the network is the human poses corresponding to the input video frames. Since the input is 5 consecutive video frames, the output is 5 human poses. For robustness, all output human poses are averaged, resulting in just one human pose, which is compared to the true human pose of the middle frame.

Fig. 4. Network output sample with an MPJPE of 131 mm. Input is the first frame of S11 performing the action 'Walking' in the first recording of camera 2. (a) Input middle frame. (b) Ground truth of middle frame. (c) Averaged network output.

5 Results

The averaged human pose is compared to the ground truth of the middle frame. The comparison is done by the mean per joint position error (MPJPE), which is used to track the training and to evaluate the network during the test process, and which is reported in this section. The MPJPE is given by Equation 3:

MPJPE = \frac{1}{N_J} \sum_{i=1}^{N_J} \left\| J_i - \hat{J}_i \right\|_2   (3)

The MPJPE is the mean Euclidean distance between the ground-truth and estimated joint positions. N_J is the number of joints, J_i is the estimated location of joint i and \hat{J}_i is the ground-truth location of joint i.

A sample of the input frame, the ground truth and the postprocessed human pose output is shown in Figure 4; it also shows the 17 core joints that our network estimates. The MPJPE of that output is 131 millimetres. Another sample, illustrating the visual quality of a good estimation by our method, is shown in Figure 5; the MPJPE of this sample is 105 millimetres. Although Figure 5 shows a good estimation of our CNN, there are some deviations, especially at the ankle joints. Because of this issue, we also examine the MPJPE for each joint, visualized in Figure 6.

Fig. 5. Network output sample with a low MPJPE of 105 mm. Input is frame 1292 of S11 performing the action 'Walking' in the first recording of camera 2. (a) Input middle frame. (b) Ground truth of middle frame. (c) Averaged network output.
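For reference, the MPJPE of Equation (3) can be computed in a few lines of NumPy (a sketch, not the project's Torch evaluation code):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance between the
    estimated and ground-truth 3D joint locations (same unit as the inputs, here mm).

    pred, gt: arrays of shape (num_joints, 3)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```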
After analysing Figure 6, we can say that the more distant a joint is from the root (the pelvic bone joint), the more it deviates from the ground truth. This shows the challenge of regressing 3D joint locations, especially for the head, wrists and ankles, which are the end joints of our defined skeleton.

Fig. 6. Percentage of estimated joints below an MPJPE threshold. (a) Joints of the central part of the human body. (b) Joints of arms and legs.

For a quantitative investigation using the MPJPE results, we compare our method with the state-of-the-art approaches mentioned in Section 2 which use Human3.6M for 3D pose estimation. It is important to mention that our results are not obtained on the hidden official test set as the others are: our test set is subject S11, which is part of the train and validation sets of the compared methods. The comparison is done with the MPJPE in millimetres (mm), separated by the 15 actions. This comparison is not fair because of the difference in the test sets used; even so, the results are shown in Table 1.

Table 1. MPJPE results. Values are in millimetres (less is better). Note that our method was tested on a different test set (subject S11) while the others used the hidden official test set of Human3.6M.
Actions: Directions, Discussion, Eating, Greeting, Phoning, Posing, Purchases, Sitting, SittingDown, Smoking, TakingPhoto, Waiting, Walking, WalkingDog, WalkingTogether, Avg.
Compared methods: LinKDE(BS) [5], DconvMP-HML [7], StructNet-Avg(500) [8], 3DCNN [3]: 124 91 117 148 92 89 93 104 76 94 138 127 98 102 111 92 105 145 106 99 136 140 112 139 135 151 203 260 239 118 98 109 197 189 170 151 146 105 106 115 77 99 101 166 146 138 141 153 109 106 140 131 122 119.
3DCNN (ours): 125 117 152 162 161 123 146 191 227 166 211 137 132 238 142 162.

Table 1 shows that our approach performed poorly on average compared to the other state-of-the-art methods; however, it performed similarly to the others on some actions. The reasons for the poor performance of our approach may be the smaller training set and the different test set.

6 Conclusion

Adding more input information to a Convolutional Neural Network helps to improve performance. What we added in our method is the time information from consecutive video frames, which provides information about the movement; hence adequate input data need to contain movement information within the temporal window. The application of a 3D CNN was done successfully by Grinciunaite et al. [3]. Unfortunately we did not manage to reach their results, due to the smaller training set and because we did not test our trained model on the official test set of Human3.6M. It is very likely that our model, trained as in the method of Grinciunaite et al. [3], would show similar or better performance, because our input frames have a higher resolution and therefore contain more information for a more precise estimation of the 3D joint locations.

After developing this 3D CNN model, we have some ideas for future work: estimating different 3D human pose representations, such as joint angles and exponential maps; adjusting the architecture and hyper-parameters; increasing the number of input video frames; combining the model with a recurrent neural network, which also handles spatio-temporal data; and translating and rotating the coordinate system to correspond to the camera position and orientation, thus making depth information one of the three dimensions and avoiding the centering procedure on the 3D joint locations.

References
1. imagenet-multiGPU.torch. https://github.com/soumith/imagenet-multiGPU.torch. Accessed: 2017-05-03.
2. Catalin Ionescu, Fuxin Li, and Cristian Sminchisescu. Latent Structured Models for Human Pose Estimation. In International Conference on Computer Vision, 2011.
3. Agne Grinciunaite, Amogh Gudi, Emrah Tasli, and Marten den Uyl. Human Pose Estimation in Space and Time using 3D CNN, October 2016.
4. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. CoRR, abs/1502.01852, 2015.
5. Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, July 2014.
6. Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D Convolutional Neural Networks for Human Action Recognition. In Johannes Fürnkranz and Thorsten Joachims, editors, ICML, pages 495–502. Omnipress, 2010.
7. Sijin Li and Antoni B. Chan. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Asian Conference on Computer Vision (ACCV), 2014.
8. Sijin Li, Weichen Zhang, and Antoni B. Chan. Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation. CoRR, abs/1508.06708, 2015.

Deep Learning Camera Pose Estimation
Computer Vision Project WS 16/17
Jan Holub1 and Jason Raphael Rambach2
1 j [email protected]  2 Jason [email protected]

Abstract. Camera pose estimation is crucial for Augmented Reality systems. A lot of work has been done in the past with approaches like the Extended Kalman Filter. We use a recurrent neural network, an LSTM, together with IMU measurements to estimate the camera pose. We show that this works quite well with IMU data and the past pose, and that it can be improved considerably by estimating the camera's displacement instead of its absolute future position. We also show that using two separate networks for the orientation and the displacement is beneficial and that using multiple LSTMs can further increase the performance.

1 Introduction

Camera pose estimation is a very important task in the field of computer vision, especially for Augmented Reality. In order to render a stable image of a 3D model in the augmented reality environment, it is crucial to know the camera's position; it is even better to know the future location of the camera, and the more precisely the better. Augmented Reality is still a minor area from a consumer's perspective: it is no exaggeration to say that the Google Glass failed, and any hardware that is on the market is still very expensive. Microsoft recently released the HoloLens, which is said to have extremely good camera tracking; this pushes the world of AR to a new level.

In my project I looked at a neural network approach proposed by [3], where an LSTM [1] is used to estimate the camera pose in the next step. By taking the past into account (due to its RNN nature), it tries to reliably estimate the future position and orientation. This approach provides various advantages over a traditional approach using, for example, an Extended Kalman Filter, which poses various challenges.

1.1 Motivation

Traditional approaches using an Extended Kalman Filter have three big problems, namely synchronisation, hand-eye coordination, and the fact that they are difficult to model.

Synchronization Problem. To get the acceleration and angular velocity of the camera, usually IMUs are used.
These sensors output their measurements at a certain frequency. In a similar fashion, a camera has a frame rate, usually something like 30, 60 or 120 Hertz. The IMU measurements have to be synchronized with the images, which poses quite a challenge. By using neural networks this synchronization problem is solved automatically, because the network learns it from the training data it receives: the network adapts to the situation.

Hand-Eye Coordination. The IMU will usually be located somewhere else on the camera system than the lens. When the camera is moved or turned, the IMU records the accelerations and angular velocities, but of course from the location of the IMU. Especially with rotations, the forces differ strongly from those at the location of the camera. This introduces an unknown translation between the IMU and the camera lens which has to be modeled. Again, the network adapts to this: by feeding the correct training data with the right ground truth information, the network learns this translation and adapts to the situation. This is much easier than trying to model the translation by hand.

Difficult to Model. The Extended Kalman Filter always has an underlying probabilistic model which has to be chosen correctly according to the problem. It is often assumed to be Gaussian, but in reality it often is not. The performance of the EKF depends highly on the probabilistic model chosen; especially when what you want to model is nonlinear, it becomes really difficult. Here, once again, the network simply learns these things and adapts accordingly. Similar to how convolutional neural networks learn image features much better than handcrafted features work, these problems are addressed by simply letting a network learn them. This shows how powerful neural networks can be at overcoming challenges which are difficult to model or to abstract.

2 My Approach

2.1 Network

The proposed network includes an LSTM [1] to process information from the past in order to estimate the future, and 3 fully connected layers for the regression task. The input consists of 10 * 6 IMU measurement inputs and 7 inputs for the previous pose, that is, 3 for the position (x, y, z) and 4 for the orientation in quaternion representation (q0, q1, q2, q3). The output is the estimate of the future pose, again in (x, y, z, q0, q1, q2, q3) representation. That is the basic idea proposed in [3], and it was the first version to be implemented.

Going from Position to Displacement. When we discussed how to improve the approach, one of the first ideas was to not estimate the absolute future position, but only the displacement of the camera. Since the current location is known anyway, the future position can easily be calculated by adding the displacement to the past location. The displacement is a relative value, in contrast to the absolute position, which leaves the network with far fewer possible values to predict. The input now consists of 10 * 6 IMU measurement inputs and 4 inputs representing the orientation, and the output looks like (Δx, Δy, Δz, q0, q1, q2, q3). The output layer remains a 7-neuron layer, but the input layer shrinks from 67 to 64 values. A sketch of this variant is shown below.
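The following is a minimal sketch of the displacement variant, written with tf.keras for brevity; the original project built the layers by hand in TensorFlow 1.x, and the hidden layer sizes and the optimizer here are assumptions.

```python
import tensorflow as tf

imu = tf.keras.Input(shape=(10, 6), name="imu_window")           # 10 IMU samples x 6 values
quat = tf.keras.Input(shape=(4,), name="previous_orientation")   # previous orientation quaternion

h = tf.keras.layers.LSTM(64)(imu)                                 # 64 units is an assumption
h = tf.keras.layers.Concatenate()([h, quat])
for units in (128, 64, 32):                                       # three fully connected layers
    h = tf.keras.layers.Dense(units, activation="relu")(h)
out = tf.keras.layers.Dense(7, name="displacement_and_quat")(h)   # (dx, dy, dz, q0, q1, q2, q3)

model = tf.keras.Model(inputs=[imu, quat], outputs=out)
model.compile(optimizer="adam", loss="mse")                       # MSE loss as in Section 3
```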
Splitting the Network. So far the network has always predicted two different things: the position / displacement and the orientation. The next idea was to split the network into two networks, one for the displacement and one for the orientation. By doing so, each network is able to focus on a single purpose and hopefully do it better (as shown later in the Results section, it did). The input remains the same as before, 10 * 6 IMU measurement inputs and 4 for the orientation. The output is split as well: one network outputs the displacement, the other the orientation.

2.2 Implementation

The implementation was done in Python using the TensorFlow framework [4] from Google. The beginning was pretty hard: compared to other frameworks, TensorFlow [4] is rather difficult to learn. Even single fully connected layers have to be written manually by hand, including the multiplications of the weight matrices. For a beginner to both TensorFlow [4] and neural networks this poses quite a challenge. Furthermore, the tutorials online are not as easy as their authors think they are, and it took quite some time to actually get the hang of it. Especially how to feed your own data was very unclear to me at first: most tutorials use some library-internal module which automatically downloads the MNIST [2] dataset and processes it in a way TensorFlow [4] can handle, so the process of feeding data was hidden away in those library functions. Eventually I understood it and wrote classes to handle the data for all the different networks explained in the network section.

One problem I ran into was that I initialized the network weights with 0, which is a very bad idea; that is why the network did not train well at first, and it took a long time to find that mistake. In the end I went with the following formula to initialize the weights, where n represents the number of neurons of that layer:

rand(0, 1) / sqrt(n)

Using this, the results started to get really good.

Data Handler. I wrote data handler classes to process the input files containing both the IMU measurements and the ground truth data. The data is processed in a way that makes it easily accessible from outside: when creating the class you pass the input files and a batch size; once the data is processed you can call a nextBatch method to receive the next batch of the given size to feed to the network, and a data-available method makes it possible to check whether there is still data left. I wrote four data handler classes for the four different networks: one that provides the absolute position as ground truth, and others for the displacement and the orientation. The data handler splits the read data into a 70% training set and a 30% validation set and gives access to both for validation purposes.

Networks. The networks are fully implemented using the TensorFlow [4] framework. To realize the LSTM [1] I used the module provided by TensorFlow [4]. After some getting used to, it was not too hard to handle, although the documentation was very sparse. Again, all four networks are implemented.

Interface. The program is a command line interface with 2 required arguments: the IMU input and the ground truth data. Furthermore, various optional arguments can be provided:

Table 1: Optional command line arguments.
Argument        Effect                                                Default
--batch-size    Sets the batch size                                   20
--display-step  How often the intermediate results should be output   10
--epochs        How many epochs to train                              100
--model-output  Where the trained model should be saved               none
--output        Where to output the results                           none

With this command line interface the networks should be easy to train and use.
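For completeness, the weight initialization described in the Implementation subsection can be written as a small NumPy helper; interpreting n as the number of input neurons of the layer is my assumption.

```python
import numpy as np

def init_weights(n_in, n_out):
    """Uniform random weights in [0, 1) scaled by 1/sqrt(n), replacing the all-zero
    initialization that prevented the network from training."""
    return np.random.rand(n_in, n_out) / np.sqrt(n_in)
```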
3 Experiments

The networks were always trained with a batch size of 100 and over 500 epochs. To calculate the error, the mean squared error was used. It is defined as

\frac{\sum (\mathrm{output} - \mathrm{estimate})^2}{2\,|\mathrm{trainingSet}|}

where both output and estimate are vectors. As data we used a set of 1921 measurements in total; 70% were used for training and 30% for validation.

3.1 Original Idea

Fig. 1: Original idea. Best error: training set 0.000266, validation set 0.000933.

As you can see, the result is already quite good, giving a training error as low as 0.000266.

3.2 Displacement

Fig. 2: Displacement. Best error: training set 0.000253, validation set 0.002246.

This already shows a slight improvement over the original idea, but the validation error is a lot higher than before.

3.3 Orientation Only

Fig. 3: Orientation only. Best error: training set 0.000176, validation set 0.002177.

This also shows an improvement: the training error drops to about two thirds of that of the original idea.

3.4 Displacement Only

Fig. 4: Displacement only. Best error: training set 0.000052, validation set 0.000093.

This is an enormous improvement over the previous approaches, with the training and validation errors being as low as 0.000052 and 0.000093 respectively. The error of the displacement is down to 7 mm.

4 Conclusion

Deep learning works very well for camera pose estimation. It should be beneficial to go further down that road and put more effort into increasing the performance. A trained network is quite fast at providing the estimate, so using such a network on an embedded system or smartphone should be better than using, for example, an Extended Kalman Filter. In a world of mobile smart devices this is an important benefit. The first improvement was going from the absolute position to the relative displacement: the network is better at estimating the displacement, and it lowered the error by half. This is something to keep in mind for further research. The biggest gain came from using single-purpose networks, one that only estimates the displacement and one that only estimates the orientation. Especially on the displacement part the improvement was tremendous; the training error was as low as 7 mm. I did a last small test using 3 LSTM layers, which gave a small improvement. For further research I suggest experimenting with the network architecture to improve the performance.

References
[1] Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-term Memory". In: Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. ISSN: 0899-7667. DOI: 10.1162/neco.1997.9.8.1735. URL: http://dx.doi.org/10.1162/neco.1997.9.8.1735.
[2] MNIST Dataset for Handwriting Recognition. URL: http://yann.lecun.com/exdb/mnist/.
[3] Jason Raphael Rambach et al. "Learning to Fuse: A Deep Learning Approach to Visual-Inertial Camera Pose Estimation". In: Proceedings of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR-2016), September 19-23, Merida, Mexico. IEEE, Sept. 2016.
[4] TensorFlow. URL: https://www.tensorflow.org.

Depth Map Estimation from Light Field Images
Alwi Husada1 and Yuriy Anisimov2
1 a [email protected]  2 [email protected]

Abstract. This report summarizes a framework that estimates depth from a light field camera, based on the paper "Accurate Depth Map Estimation from a Lenslet Light Field Camera" by Jeon et al. [9]. As explained in the original paper, the framework consists of four steps: (1) Constructing the cost volume. In this step, a phase shift method is used, since it enables sub-pixel displacements for the narrow-baseline stereo correspondence.
(2) Cost aggregation is then applied to the computed cost volume to reduce noise. (3) Multi-label optimization is used to propagate accurate disparity labels to the weak regions, using the graph cut algorithm and a weighted median filter. (4) To enhance the computed disparity map, iterative refinement is applied, using quadratic polynomial interpolation on the depth computed in the previous steps. In this project, we implement the framework in C++.

Keywords: depth map, light field

1 Introduction

Depth estimation is a set of techniques aiming to obtain a representation of the spatial structure of a scene; in other words, techniques to obtain the depth information of a scene from 2D images. Extensive work has been done in this field, and many methods and setups with excellent results have been proposed, especially in stereo vision [3]. However, some of those techniques are not suitable for light field images.

Light fields can be described as the set of light rays travelling in every direction from every point in space [11, 6]. They can be parametrized by 3D ray positions as well as 2D ray directions. The parameters can be reduced to a 4D space, because the radiance along a ray remains constant and the full parametrization is therefore redundant. This 4D light field is currently used in much research and many applications. A light field camera captures not only a two-dimensional representation of the scene but also directional light information. This enables post-capture adjustment of parameters such as aperture size and focus. There are many ways to capture light field images, such as camera arrays, micro-lens arrays, coded masks, objective lens arrays and gantry-based camera systems [6]. In this project we focus on micro-lens arrays, as implemented in commercial light field cameras (i.e. Lytro [2] and Raytrix [5]).

According to Alam et al. [6], micro-lens based cameras suffer from low spatial resolution. This is caused by sharing the imaging sensor to capture both angular and spatial information; hence there is a trade-off between spatial and angular resolution. Another limitation of micro-lens based cameras is the narrow baseline. Because of this, existing stereo matching algorithms cannot produce satisfying results for light field images, and an algorithm for stereo matching with a narrow baseline is required. Jeon et al. [9] proposed such an algorithm for estimating depth from light field images; it is used as the main source for this project. The objective of this project is to implement the proposed algorithm in C/C++.

The main method of the proposed algorithm by Jeon et al. is the phase shift theorem. It is used to estimate sub-pixel shifts of the sub-aperture images, in order to obtain sub-pixel accuracy with a narrow baseline. The cost volume is computed by shifting the sub-aperture images (except the centre view) to different sub-pixel locations and then computing a similarity measure between the shifted sub-aperture images and the centre view. To reduce noise, an edge-preserving filter is then applied to each cost volume slice. In this project, we experiment with three different edge-preserving filters: the guided filter (as in the original paper), the domain transform filter and the bilateral filter. Multi-label optimization is then applied to the estimated depth to decrease over-smoothing in edge regions. In the final step, quadratic polynomial interpolation is used to enhance the estimated depth map.
Detailed methods are explained in the next section.

2 Framework

In this section, the proposed depth estimation algorithm is described. As mentioned earlier, the method used in this report is originally based on cost volumes for stereo cameras. However, in order to accommodate the small baseline between sub-aperture images in a light field, three modifications have been made. First, a phase shift algorithm is applied directly to the sub-aperture images, which enables point correspondences at sub-pixel accuracy despite the narrow baseline. Second, weight terms based on the vertical or horizontal deviation between sub-aperture image pairs are defined; they are used to effectively aggregate the gradient costs. Last, confident matching correspondences are included in the label optimization.

2.1 Cost Volume Construction

As previously mentioned, the phase shift algorithm is used for sub-pixel displacement. According to the phase shift theorem, shifting an image I by \Delta x \in \mathbb{R}^2 can be represented in the 2D Fourier domain as

\mathcal{F}\{I(x + \Delta x)\} = \mathcal{F}\{I(x)\}\, \exp^{2\pi i \Delta x},   (1)

where \mathcal{F}\{\cdot\} is the discrete Fourier transform. Thus, the sub-pixel shifted image I'(x) can be obtained using the inverse Fourier transform:

I'(x) = I(x + \Delta x) = \mathcal{F}^{-1}\{\mathcal{F}\{I(x)\}\, \exp^{2\pi i \Delta x}\},   (2)

where \mathcal{F}^{-1}\{\cdot\} is the inverse discrete Fourier transform. This algorithm shifts the entire sub-aperture image instead of local patches. This is intended to confine the artefact caused by periodicity at the boundary of the image to a width of less than two pixels, which is insignificant for the final depth computation.

Matching of sub-aperture images is done using two complementary costs, the sum of absolute differences (SAD) and the sum of gradient differences (GRAD). The sum of absolute differences C_A is the difference between the shifted sub-aperture images and the centre view, truncated by a robust function with threshold \tau_1. It is defined as a function of x and label l:

C_A(x, l) = \sum_{s \in V} \sum_{x \in R_x} \min\big( |I(s_c, x) - I(s, x + \Delta x(s, l))|, \tau_1 \big),   (3)

where R_x is a rectangular region centred at x, \tau_1 is the truncation value of the robust function, V contains the pixel coordinates s except for the centre image s_c, and \Delta x is the 2D shift vector

\Delta x(s, l) = k\, l\, (s - s_c),   (4)

where l is the label and k is the pixel unit of the label. \Delta x increases linearly as the angular deviation between a sub-aperture image and the centre image increases. The other cost is the sum of gradient differences (GRAD), defined as

C_G(x, l) = \sum_{s \in V} \sum_{x \in R_x} \beta(s)\, \min\big( |I_x(s_c, x) - I_x(s, x + \Delta x(s, l))|, \tau_2 \big) + (1 - \beta(s))\, \min\big( |I_y(s_c, x) - I_y(s, x + \Delta x(s, l))|, \tau_2 \big),   (5)

where |I_x(s_c, x) - I_x(s, x + \Delta x(s, l))| is the difference between the x-directional gradients of the sub-aperture images and |I_y(s_c, x) - I_y(s, x + \Delta x(s, l))| denotes the difference in the y-directional gradients. \beta(s) is calculated from the sub-aperture image coordinates and determines the relative importance of the two gradients:

\beta(s) = \frac{|s - s_c|}{|s - s_c| + |t - t_c|}.   (6)

Accordingly, the cost volume C is defined as

C(x, l) = \alpha\, C_A(x, l) + (1 - \alpha)\, C_G(x, l),   (7)

where \alpha \in [0, 1] tunes the relative importance of the sum of absolute differences and the sum of gradient differences.
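The two building blocks of this section, the Fourier-domain sub-pixel shift of Equations (1)-(2) and the truncated SAD cost of Equation (3), can be sketched in NumPy as follows. The project itself is written in C++, the parameter defaults are illustrative only, and the per-pixel neighbourhood sum over R_x is left to the aggregation step for brevity.

```python
import numpy as np

def subpixel_shift(img, dx, dy):
    """Shift a 2D grayscale image by a sub-pixel offset (dx, dy) via the Fourier shift theorem."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]     # vertical frequencies
    fx = np.fft.fftfreq(img.shape[1])[None, :]     # horizontal frequencies
    phase = np.exp(2j * np.pi * (fx * dx + fy * dy))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * phase))

def sad_cost_volume(center, views, offsets, labels, k=0.02, tau1=0.5):
    """Truncated SAD data cost: for each label l, shift every sub-aperture view by
    k * l * (s - s_c) and accumulate the truncated absolute difference to the centre view.

    center: (H, W) centre view; views: list of (H, W) sub-aperture images;
    offsets: matching list of (s - s_c) angular offsets per view."""
    cost = np.zeros((len(labels),) + center.shape)
    for i, l in enumerate(labels):
        for img, (du, dv) in zip(views, offsets):
            shifted = subpixel_shift(img, k * l * du, k * l * dv)
            cost[i] += np.minimum(np.abs(center - shifted), tau1)
    return cost
```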
2.2 Cost Aggregation

To remove noise from the cost volume, cost aggregation is applied. The main idea of cost aggregation is to filter each slice of the cost volume with an edge-preserving filter to remove unreliable matches. Several edge-preserving filters are available, such as the bilateral filter, the domain transform and the guided filter. In the original implementation, the guided filter by He et al. [8] is used, with the central sub-aperture view as the guide image. From the filtered cost volume C', a disparity map l_a can then be determined using a winner-takes-all strategy:

l_a = \arg\min_l C'(x, l)   (8)

As mentioned earlier, in this project we experiment with three different edge-preserving filters: in addition to the guided filter, we also implement the domain transform filter and the bilateral filter. The bilateral filter combines domain and range filters to achieve smoothness in weakly textured regions, as in traditional domain filters, while preserving highly textured structures such as edges, lines or corners; it works by weight-averaging the neighbouring pixel values based on their distance in both domain and range [13]. The main idea of the domain transform filter is that if the 2D RGB image manifold in 5D space (x, y, r, g, b) can be transformed to a lower dimension while still preserving the distances among the pixels, then many spatially invariant filters in this new space are edge preserving [7]. In this project we only use the recursive filter (RF) variant of the domain transform. The comparison of the results obtained with these different filters can be seen in Fig. 3.
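A sketch of the aggregation and winner-takes-all steps; a simple box filter stands in for the guided, domain transform or bilateral filters used in the actual implementation, so this is illustrative only.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def aggregate_and_select(cost_volume, window=9):
    """Filter every cost slice with a box filter (a stand-in for the edge-preserving
    filters discussed above), then pick the label with the minimum aggregated cost
    per pixel (winner takes all, Eq. 8)."""
    filtered = np.stack([uniform_filter(slice_, size=window) for slice_ in cost_volume])
    return filtered.argmin(axis=0)   # disparity label map l_a
```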
The square different is selected for constructing new cost volume since quadratic polynomial interpolation is used for sub-pixel estimation in later step.The construction of new cost volume can be formulated as follow: ˆ lr ) = min(η ⇤ L, (d − la (x, y))2 ) C(x, (10) where d is potential depth candidate, la is computed depth from previous steps, L is search range and η is constant parameter. This cost function can help to preserve sub-pixel accuracy from previous computed depth. In the original paper [14], bilateral filtering is then applied to each slice of the cost volume. However in this implementation we use weighted median filter as suggested by Jeon et al.[9]. Finally, to reduce discontinuities, sub-pixel estimation algorithm is used.This algorithm is based on quadratic polynomial interpolation. It approximate the cost function between three discrete depth candidate: C(lr ), C(l+ ), C(l− ). Then, a non-discrete disparity l⇤ can be obtained via: l ⇤ = lr − C(l+ ) − C(l− ) 2(C(l+ ) + C(l− ) − 2C(lr )) (11) where l+ (cost slice plus 1), and l− (cost slice minus 1) are adjacent cost slices of lr . For better result this procedure is applied iteratively. As per mentioned here[9], four iteration is enough to get satisfying result. 3 Results and Implementation In this project we implement the proposed algorithm by Jeon et. al. [9] in C++. We utilize armadillo [1] and OpenCV [4] as libraries. To evaluate our implementation we use Lytro [2] datasets provided by Jeon et. al. together with MATLAB implementation from this website 3 . A machine with Intel i7 2 GHz CPU and 8 GB RAM was used for running the computation. The original code in MATLAB required 15 minutes for the Lytro datasets. However, our implementation in C++ requires longer to run: 19 minutes. The cost volume construction step required longest time to run compare to the other steps. Table below summarize running time at each step: 3 https://sites.google.com/site/hgjeoncv/home/depthfromlf_cvpr15 5 Steps Matlab C++ Cost Volume 404.45 seconds 1070.87 seconds Cost Aggregation 239.23 seconds 5.62 seconds Graph Cut 30.76 seconds 64.8189 seconds WMF 44.77 seconds 5.22 seconds Iterative Refinement 184.05 seconds 18.41 seconds TOTAL running time 903.26 seconds 1164.94 seconds For the evaluation, the parameter values are selected equal to parameters that are stated in the paper. They might be varied depends on the datasets. The selected parameters values can be seen in Fig.(1). The comparison between C++ and MATLAB results at different steps can be seen in Fig.(2). We compute the means square different among them to evaluate quantitatively. As it was mentioned in section 2.2, we applied three different edge-preserving filter at cost aggregation step namely guided filter, domain transform filter and bilateral filter. The comparison among them is shown in Fig.(3). There is not much difference among them, however based on the Fig.(3) domain filter shows slightly better compare the other two. 4 Conclusion A method for estimating disparity for light field images was proposed by Jeon et. al. [9]. Based on the results that mentioned in the original paper, this method out performed three existing methods. It show that sub-pixel shift in frequency domain by using phase shift theorem is effective for depth estimation in narrow baseline. Furthermore, the adaptive aggregation of the gradient costs and confidence cost by matching correspondence enhanced the depth map accuracy. 
The main drawback of this algorithm is the running time especially in cost volume computation. But it is expected that the speed can be significantly increased by parallelizing using GPU. References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. Armadillo. http://arma.sourceforge.net. Accessed: 2017-03-12. Lytro. https://www.lytro.com/imaging. Accessed: 2017-03-12. Middlebury stereo benchmark. http://vision.middlebury.edu/stereo/eval3/. Accessed: 2017-03-11. Open cv. http://opencv.org. Accessed: 2017-03-12. Raytrix. https://www.raytrix.de. Accessed: 2017-03-12. M. Zeshan Alam and Bahadir K. Gunturk. Hybrid light field imaging for improved spatial resolution and depth range. CoRR, abs/1611.05008, 2016. Eduardo SL Gastal and Manuel M Oliveira. Domain transform for edge-aware image and video processing. In ACM Transactions on Graphics (ToG), volume 30, page 69. ACM, 2011. K. He, J. Sun, and X. Tang. Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1397–1409, June 2013. H. G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. W. Tai, and I. S. Kweon. Accurate depth map estimation from a lenslet light field camera. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1547–1555, June 2015. Vladimir Kolmogorov and Ramin Zabih. Multi-camera scene reconstruction via graph cuts. In European conference on computer vision, pages 82–96. Springer, 2002. Marc Levoy. Light fields and computational imaging. Computer, 39(8):46–55, 2006. Z. Ma, K. He, Y. Wei, J. Sun, and E. Wu. Constant time weighted median filtering for stereo matching and beyond. In 2013 IEEE International Conference on Computer Vision, pages 49–56, Dec 2013. C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271), pages 839–846, Jan 1998. Q. Yang, R. Yang, J. Davis, and D. Nister. Spatial-depth super resolution for range images. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2007. 6 Fig. 1. Selected parameter values at different step of the framework (a) (b) (c) (d) (e) (f) (g) (h) Fig. 2. Estimated disparity map at different step of the framework and also comparison of original MATLAB with C++ results. in the first row (upper (a-d)) show C++ result. Second row (lower (e-h)) show MATLAB result. (a and e) based on the initial cost volume -MSD= 0.026507, (b and f) after weighted median filter refinement -MSD= 0.001259, (c and g) after multi-label optimization -MSD= 0.000885 and (d and h) final results, after iterative refinement -MSD= 0.001566. (a) (b) (c) Fig. 3. comparison of results from three different edge-preserving filter in cost aggregation step. (a) Guided filter, (b) Domain transform (RF) and (c) Bilateral filter. There is not much difference among them. But (b) shows little bit better since it removes holes produce in (a) Augmented Reality Application for Mobile Devices Joshua B. Knobloch1 and Jason Rambach2 1 2 [email protected] Jason [email protected] Abstract. As first step in this project report we will swiftly introduce a physics experiment and set the goals for the project. After that we describe in general how we can reach this goals before we start with the implementation of a colour tracking and a marker tracking algorithm. The final section is about migrating the approach into Unity and creating a mobile application of it. 
Keywords: augmented reality, colour tracking, marker tracking, OpenCV, Unity, Windows application 1 Introduction The focus of this project is on a physics experiment, developed by the physics department of the TU Kaiserslautern. Fig. 1(a) shows the set-up of the experiment, with an air table on which several mechanical pucks float on an air layer. In a first step these pucks get pushed against each other while a permanently fixed camera above the table records the movement of every puck and their behaviour during the collisions. After the execution of the experiment, the video gets analysed with the aim of gaining knowledge about specific physical variables like velocity, angular velocity or kinematic energy. The goal of this project is to improve the current experiment in terms of availability of the physical variables by providing the measurements in real time as an augmentation overlay. This means first to develop an application which calculates the required information during the recording of the video in real-time and second to provide the information to mobile devices where the experiment and its values can be monitored. The first thing we need to think of is to find a general way of how to extract the information about the physical variables from a video or a live camera stream, which gets described in section 2. After we know how to calculate the required values we start implementing a first colour tracking approach in C++ using OpenCV libraries. While the colour tracking approach is sufficient for many tasks it still has some drawbacks, leading to the use of a marker tracking approach in section 4. The final step in section 5 is to develop an Unity application for mobile devices which is capable of tracking the pucks and display their properties. The following list is summing up the main tasks of this project: – Think of general ways how to get the values of the required physical variables. – Extract the position and orientation of each puck in every frame provided by a video or live camera stream. – Calculate velocity, angular velocity and energy out of that information. – Develop an Unity interface for tracking the pucks and displaying their physical variables 2 Physical variable calculation In this section we describe how we can extract the required physical variables solely by editing and comparing consecutive frames of a video or live video stream. We will start with the computation of velocity, followed by angular velocity and finally kinematic energy. 2 2.1 Velocity To get the velocity v of an arbitrary object we need to know the distance this object has covered and the time that has passed during its movement which is given as p1 − p 0 . (1) v= ∆t For the experiment, it means that we have to find the exact location of a puck in two consecutive frames (p0 and p1 ) and calculate the difference of their position in the frames. That gives us the distance this puck has covered during the recording of the two frames. The passed time depends on the properties of the used camera and can be deviated from the used FPS-rate. A frame rate of e.g. 120 fps would mean that we get a new frame every 8.333 ms. Another important requirement for a correct velocity estimation in our current set-up is that the camera is permanently fixed on one location. Since we would not know the relative movement between puck and camera we would get poor values for our estimation. 
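As a small illustration of Eq. (1), the velocity estimate only needs the puck position in two consecutive frames and the camera frame rate. The following is a minimal sketch; the names are illustrative, and a pixel-to-metre scale factor would still have to be applied for metric values.

```cpp
// Velocity from two consecutive puck positions and the camera frame rate
// (Eq. 1). With 120 fps, dt is about 8.333 ms.
#include <opencv2/opencv.hpp>

cv::Point2f estimateVelocity(const cv::Point2f& p0,   // position in previous frame (pixels)
                             const cv::Point2f& p1,   // position in current frame (pixels)
                             double fps)              // camera frame rate, e.g. 120
{
    const double dt = 1.0 / fps;                      // time between the two frames
    const cv::Point2f d = p1 - p0;                    // distance covered in pixels
    return cv::Point2f(static_cast<float>(d.x / dt),  // pixels per second
                       static_cast<float>(d.y / dt));
}
```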
2.2 Angular Velocity The angular velocity va of an arbitrary object is computed quite similar to the velocity, except that we need to know the orientation of the object and not its position which is given as o1 − o0 . (2) va = ∆t Since the pucks always float over an even plain we know that a rotation will always be around the Z-axis, which reduces the task to a 2D problem. This allows us to compare the orientation of the X-axis of two consecutive frames (o0 ando1 ) against each other and divide the difference of the orientation by the passed time between the recording of the frames. 2.3 Kinematic Energy The formula for the kinetic energy Ekin requires the current velocity and the mass of an object and is given as 1 Ekin = mv 2 . (3) 2 We get the velocity v as described above but the mass m cannot be calculated using solely images. But since the mass of a puck doesn’t change during the experiment we can set it before the experiment starts. Similar to the mass, there are other properties we need for our calculation but which don’t change during the experiment like resolution of the camera or FPS-rate. That information needs to be set once at the beginning of every experiment and can be used during the whole recording phase. 3 Colour Tracking Approach In this section, we describe how the required physical variables can be computed by a colour tracking algorithm, followed by an overview of advantages and disadvantages of this method. The first thing we need to do before we can calculate any value is finding the pucks on the table. Since the core of the pucks are the only red areas in the set-up of the experiment (see Fig. 1(a)), we decided to track red areas of a specific size and draw a blue rectangle around them. The OpenCV libraries, summarized in the opencv world310.lib file, provide all methods necessary to reach this goal, so that we only had to specify the range of the colour and the range of the area size. Fig. 1(b) shows the result of that approach. 3 (a) The raw camera input during the experiment (b) The output of the color tracking algorithm Fig. 1. The camera image before and after the color tracking algorithm is applied on it (a) The inner circle encodes 4 bits while the outer one encodes 12 bits (b) The modified set-up of the experiment with a marker attached to each puck Fig. 2. A marker example and the new experiment set-up 3.1 Evauation The most important advantages of this approach are its simplicity and its good runtime behaviour, both coming from the well-defined OpenCV libraries. The red core areas of the pucks are very reliably tracked while other reddish areas are ignored. The runtime could even be improved by including previous positions of the pucks into the calculations of the current one. However, there are several issues making this approach not suitable for our overall goal. One reason is that the tracked pucks have no fixed order since they cannot be identified uniquely. The order could change due to the movement of the pucks or because the algorithm lost the track of a puck in one frame (e.g. while the markers get pushed) and changed the order when it gets the track back. Both scenarios would lead to false velocity values and are not acceptable. A simple but probably quite time consuming solution for that issue would be to use different colours for different pucks and run the algorithm several times per frame. 
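For reference, the colour segmentation described at the beginning of this section can be sketched with standard OpenCV calls. The HSV thresholds and the minimum area used here are illustrative values, not the ones used in the project.

```cpp
// Track red puck cores: threshold in HSV (red wraps around the hue axis),
// keep contours of sufficient size and draw a blue rectangle around them.
#include <opencv2/opencv.hpp>
#include <vector>

void trackRedPucks(cv::Mat& frame)
{
    cv::Mat hsv, maskLow, maskHigh, mask;
    cv::cvtColor(frame, hsv, cv::COLOR_BGR2HSV);
    cv::inRange(hsv, cv::Scalar(0, 100, 100),   cv::Scalar(10, 255, 255),  maskLow);
    cv::inRange(hsv, cv::Scalar(170, 100, 100), cv::Scalar(180, 255, 255), maskHigh);
    mask = maskLow | maskHigh;

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    for (const auto& c : contours) {
        if (cv::contourArea(c) < 300.0)   // ignore small reddish noise
            continue;
        cv::rectangle(frame, cv::boundingRect(c), cv::Scalar(255, 0, 0), 2);  // blue box
    }
}
```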
The main reason for the ineptitude of the colour tracking approach is the inability to get the orientation of the pucks, which is implicitly necessary to calculate the angular velocity. A solution for both of the main disadvantages is given by the marker tracking approach as described in the next section. 4 (a) Current orientation of the markers with Z-axis blue, Y-axis green and X-axis red. (b) Output image after all values are calculated and displayed. Fig. 3. The orientation of each marker and the output image. 4 Marker Tracking Approach This section describes how to calculate the values of all required physical variables using a marker tracker algorithm implemented in C++ using the OpenCV libraries. To resolve the problem of unique identification of a puck and to get its orientation, we implemented our application around a marker tracking library developed by the Augmented Vision group. The framework consists of C++ classes using OpenCV libraries and was developed in 2010. Before we could start with the modification of the framework we had to change the experiment set-up and attach markers to the pucks. Fig. 2(a) gives an example of how a marker looks in detail while Fig. 2(b) shows the new set-up of the experiment. The markers consist of an inner circle, divided into four equal parts encoding 4 bits, and an outer circle, divided into twelve equal parts encoding 12 bits. This means that we can encode 216 = 65536 (minus the symmetric ones) markers which are uniquely identifiable by the framework. Since we only use dissymmetrical markers, every marker has an explicit initial orientation which is stored in the framework. While the markers are rotation during the execution of the experiment, the framework checks the current orientation of each marker against its initial one to get the rotation of each puck. Fig 3(a) illustrates the current orientation of two markers with a coordinate system whose source is at the centre of each tracked marker. Due to the unique encoding of each marker, we can clearly identify every puck even if the tracking fails in some frames or the tracking order changes during the execution of the experiment. With that information about location and orientation of each puck we are now able to compute the values of the required physical variables as described in section 2. In a final step, we now have to edit the camera image to display the computed values we calculated. OpenCV provides some useful methods for writing text onto an image with the result shown in Fig. 4(b). 4.1 Runtime Evaluation The marker tracker framework satisfies all requirements we need to calculate the physical variables but we also had to make some compromises. To track a marker and to calculate its current position and orientation takes some time and the more marker are part of the experiment the longer it takes. We discovered that the tracking of two markers takes 23 ms in average but with an initial frame rate of 120 fps the framework only has 8 ms for the calculations before the next frame needs to be managed. To avoid the constant overburdens of the system we decided to lower the frame rate to 40 fps 5 (a) Example screen of the application running on a mobile device with already tracked markers but before the ”Server” or ”Client” button was selected. (b) The architecture of the final Unity approach. Fig. 4. Realization and architecture of the final Unity approach. which means that the framework now has 25 ms before a new frame comes in. 
Another compromise we had to take was in the calculation of the angular velocity. Let’s say we have two frames and we take the orientation of the marker of the first frame as initial orientation. On the second frame, we can see that the orientation has changed a little but we don’t know the direction of the spin. It could be that it was a slow turn in the one direction or a fast turn in the other way. In the worst case, the marker could even take several turns between the recording of the two frames but since we only see the frames we would not recognize it. By testing we observed that in most of the time the rotation is quite slow so that we decided to always take the smaller angle to calculate the angular velocity. We are now able to calculate all the required physical variables for a set-up as described in section 1. The next and final step is to adapt our solution for a mobile device, which is done in the next section. 5 Unity Application The last step of this project is to develop an Unity application, capable of tracking each puck and displaying their current physical variable values. As the name says are mobile devices usually mobile and moving around but in section 2 we already explained that we only get satisfying results for the velocity estimation if the recording camera is fixed on a permanent position. This means we cannot just copy our code to a mobile device but have to think of a suitable architecture solving this issue. 5.1 Architecture To solve the problem of a moving camera and moving markers we decided to separate the tracking of a marker and the calculation of the values in a client-server architecture. We use one stationary device which is connected to the permanently fixed camera on which the markers are tracked and all the values are computed. Using a wireless router as a switch we can connect several mobile devices as clients to that server. The mobile devices are capable of tracking the markers but receive the physical variable values from the server, which broadcasts these values frequently to all connected 6 clients. If the clients lose the connection to the server, they display the last received value or zero, if there was never a connection established before. 5.2 Application After we developed a proper architecture for our application we could start adapting our current implementation to Unity. Unity is a widespread tool mostly used by game designers to create single scenes or even complete computer and mobile games. We decided to use it because it’s easy to build mobile applications with it but there were several issues we had to solve as well. The first problem was that Unity provides no method to replay a video with getting access on the single frames, which is a huge problem since we calculate the physical variable values by comparing two consecutive frames. We worked around this problem by storing each frame of a video as an image in a folder before we start the application and load a new image every 25 ms. Another issue we had to solve was the difference in programming languages, since we implemented our current solution in C++ but Unity uses C# for its scripts. To solve this issue, we modified an already existing C++ to C# wrapper of the marker tracker library which allowed us to call the C++ functions within the C# scripts of Unity. After we successfully transformed our current implementation into Unity we had to add the clientserver architecture to it. 
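Returning briefly to the angular-velocity compromise described above, always taking the smaller angle amounts to wrapping the orientation difference into (-180, 180] degrees before dividing by the frame interval. A minimal sketch with illustrative names:

```cpp
// Angular velocity from two orientations (Eq. 2), always using the smaller
// of the two possible rotation angles between consecutive frames.
#include <cmath>

double angularVelocityDeg(double o0Deg, double o1Deg, double fps)
{
    double diff = std::fmod(o1Deg - o0Deg, 360.0);   // raw orientation difference
    if (diff > 180.0)   diff -= 360.0;               // take the smaller rotation
    if (diff <= -180.0) diff += 360.0;
    return diff * fps;                               // degrees per second
}
```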
We decided to not split our application into a server and client solution but add buttons to decide during runtime whether the device the application is currently running on should be the server or a client. Fig. 4 shows the initial state of our application before the server or client button was selected. 6 Conclusion We started this project report by describing the initial situation of the physics experiment and defining our goals during the project. In the second section, we gave thoughts to how we can compute the required physical variables independent from programming language, libraries and other tools. For the actual implementation of our algorithm we decided to use C++ and the OpenCV libraries, at first with a colour tracking approach and later with a marker tracking approach. To complete the project, we migrated the C++ code into a c# Unity script and added a client-server architecture to provide several mobile devices with suitable velocity, angular velocity and kinetic energy estimations calculated from the server. Building a Map of Traffic Signs Kishan Nagendra1 , Oliver Wasenm¨ uller2 , and Stephan Krauß3 2 1 [email protected] [email protected] 3 [email protected] Abstract. The project report aims at creating a informative map of traffic signs for a specific region based on ease of automation of image data collection of roads in that region and implementation of traffic sign recognition on the collected images. Keywords: Maps, Traffic sign recognition 1 Introduction Driver assistance systems and autonomous driving cars rely on multiple modules for arriving at decisions about routing. This project would be a contributing module to this process and assist in increasing decision sense of the overall system. The whole project aims at formulating a generalized methodology which can be scaled. Further, implementation of the overall method of data collection and recognition for an area on the map. The goal of the project can be divided into two parts:Firstly, Creation of Dataset containing information specific to roads with the help of generalised map API provided by Google. Basically, this involves the creation of a workspace to hold the information pertaining to the roads and the traffic signs present in a particular city / region. Secondly, Implementation of traffic sign board recognition on the collected dataset in order to create a map containing traffic signs associated with latitude and longitude data. This involves the construction of a traffic sign board map associated with latitude and longitude data based on the workspace containing the collected data. 2 2.1 DataCollection Coordinate Collection A particular region in the map was chosen and the range of latitude and longitude was set in the form of a grid, such that for each cell of the grid, the nearest road was queried. One issue was that two different cells might return the same road reference. When such a situation is encountered, a mechanism was put in place to retain only one reference. This algorithm was run on the entire grid and all road references were gathered. In the initial approach, when the code incorporated a uniform looping mechanism, a sparse 4000 references were gathered because of response errors. Later, with code iteration modification, a dense 440615 references were received with no errors limiting the coordinate collection(Figure 1). Also, it was necessary to be mindful of the fact that the allowed maximum limit was 2500 queries per day per API key. 
At the end, the road coordinates for the entire city of Luxembourg were obtained. 2 Fig. 1. The pattern of dense road Coordinates obtained shown for a small area on the map 2.2 Image Collection It was necessary to collect four images based on the cameras orientation angle for each coordinate, which sums up to about 1.8 million images pertaining to the roads of Luxembourg. Concurrently, while the street view data was being downloaded, a CSV file containing features like the unique identification series of each image along with other important information that includes, but was not limited to, the latitude, the longitude, the location and the orientation of the image with reference to a coordinate point was created. An issue encountered at this stage was that the download limit that was allowed for images was only 25000 images per day per API key which would have resulted in the entire process taking about 75 days for completion. This issue was side stepped by creating multiple API keys and parallelising the entire download process that allowed for it to get completed in a couple of weeks time. Figure 2 shows four 90 degree width images gathered for a specific coordinate from street view image API. Fig. 2. Four 90 degree width image of a road coordinate in the city of Luxembourg 2.3 Image Duplicate Elimination Due to the proximity of the coordinates on the roads, some of the images in the dataset that were returned were the same, but, for different coordinates. In order to overcome the issue of duplicacy, a process was implemented which used size comparison and md5 hashing[4] to identify duplicate images in the dataset and eliminate their entries from the corresponding CSV file. The resultant CSV file obtained contained unique image information only. Further, the CSV file was parsed to retain the unique image references and remove the rest of the duplicate images from the dataset. 3 For further verification, if deemed necessary, the coordinates could be plotted on a map to confirm the uniformity and accuracy of the information gathered. Figure 3 shows the uniform distribution of the image coordinates obtained after elimination of duplicates. Fig. 3. The coordinates distribution after duplicate image elimination 3 3.1 Traffic Sign Recognition Method Selection for Recognition Task Initially, the Sliding Window approach with back-propagation was implemented on the dataset. However, the speed of execution was way too slow and processing intensive to be considered as a suitable approach. As a result, the primary requisites for the selection of the method for the task of recognition was the following: 1. Must work on unlabeled data (as large volumes of unlabeled data was procured in the previous steps) 2. Must execute rapidly due to the presence of a large dataset. 3. Low False Negative Rate. Based on the above mentioned conditions, the approach that was well suited and performed the best was the HAAR cascade classifier (commonly used for face recognition)[5] as it was stage-based and fast once a model was trained. Figure 4 shows certain type of features that are identified by the HAAR cascade. 4 Fig. 4. Features identified by HAAR cascade[5] 3.2 Collection of Negative Samples and Generation of Positive Samples The dataset that is generated is completely unlabelled and our motive is to implement a mechanism for automation of categorization between positive and negative samples. The samples were collected by separating images with no sign boards and categorizing them as negative images. 
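As a side note on the duplicate elimination of Section 2.3, grouping files by size and hashing their contents can be sketched as follows. For illustration, std::hash over the raw bytes stands in for the MD5 hash used in the project; the function name is hypothetical.

```cpp
// Keep only the first file per (size, content hash) pair; duplicates that
// were downloaded for neighbouring coordinates are skipped.
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

std::vector<fs::path> uniqueImages(const fs::path& folder)
{
    std::unordered_map<std::string, fs::path> seen;   // key: "<size>:<hash>"
    std::vector<fs::path> unique;

    for (const auto& entry : fs::directory_iterator(folder)) {
        if (!entry.is_regular_file()) continue;

        std::ifstream in(entry.path(), std::ios::binary);
        std::string bytes((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
        const std::string key = std::to_string(bytes.size()) + ":" +
                                std::to_string(std::hash<std::string>{}(bytes));

        if (seen.emplace(key, entry.path()).second)    // first occurrence only
            unique.push_back(entry.path());
    }
    return unique;
}
```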
In such a manner, around 4000 images were manually gathered. It was necessary to collect object images, perform transformations on them in three different dimensions along with accounting for variation of scale and then superimpose the image on to the negative samples collected in order to generate positive samples. Fig. 5. Different directions on which rotation transformation can be performed 5 Fig. 6. The Pedestrian sign superimposed on the background image and transformation performed to generate positive samples Fig. 7. The no entry sign superimposed on the background image and transformation performed to generate positive samples Fig. 8. The no parking sign superimposed on the background image and transformation performed to generate positive samples In order to have a good classifier, instead of taking one object image to create positive samples, multiple object images were considered. For example, while generating the positive samples for a No Parking sign, for each subset of 200 negative images, a different No Parking sign was chosen for the superimposition. Thus, if there were 2000 negative images, a total of 10 No Parking signs were used instead of just using one. This would help the classifier grab the most robust features even with certain variations which was inevitable. Figure 6 shows the different object images chosen for the No Parking Sign. 6 Fig. 9. The yield sign superimposed on the background image and transformation performed to generate positive samples Fig. 10. Different object images chosen for training in place of choosing just one object image One important thing that was necessary to be considered while creating the positive sample was the ratio of height to the width of the chosen object image. The variation in size was directly proportional to the training time. For instance, a width and height of 20x20 would take about an hour. However, increasing the width and height to about 40x40 would take more than a day in order to train a well-staged cascade. Ideally, it was suggestive that the algorithm performed best when the width and height were chosen as 20x20 respectively. 3.3 Training cascade The cascade trainer was stage-based and features would be gathered at each stage along with an acceptance ratio value. The cascade gatheres features based on the auto generated object placement information in the positive samples. These positive samples along with randomized background images are passed as parameters to the cascade training process for feature gathering. As the stages are higher, more and more distinctive and robust features are gathered for the object. The acceptance ratio value at the last stages should be 0.004 or less. In case this acceptance ratio was very high, i.e., in the order of 7.02e-07, the cascade would be considered over-trained. After creating enough number of positive samples and saving the positive data in a vector file, the amount of positive images versus negative images for training needs to be suitably considered. Generally, the number of positive samples needs to be more than the negative samples. Further, based on the availability of data, it was mandatory to set the number of stages that the algorithm needs to run for. An 7 additional parameter that was called for was the ratio of width to height. It was required to be set to the same value as it was during the creation of positive samples. In OpenCV, the output of the training was in the form of an xml file that can be used to identify the area of interest. 
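A minimal sketch of using such a trained cascade file for detection is given below; the file name and the detection parameters are illustrative assumptions, not the values used in the project.

```cpp
// Detect candidate traffic-sign regions with a cascade trained by
// opencv_traincascade; the returned rectangles are the regions of interest
// that are later verified by the additional image-processing layer.
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> detectSignCandidates(const cv::Mat& image)
{
    static cv::CascadeClassifier cascade("no_parking_cascade.xml");  // hypothetical file name

    cv::Mat gray;
    cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    std::vector<cv::Rect> regions;
    cascade.detectMultiScale(gray, regions,
                             /*scaleFactor=*/1.1, /*minNeighbors=*/4,
                             /*flags=*/0, /*minSize=*/cv::Size(20, 20));
    return regions;
}
```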
The same positive data creation and the training process had to be followed for all the required traffic signs in order to generate cascade.xml file for each traffic sign. Figure 11 shows the region of interest of different signs identified by the cascade. Fig. 11. Region of interests identified for images containing different traffic signs 8 One of the hurdles that were faced with the cascade classifier was the high number of false positives. It was obligatory to have another layer of image processing on top of the current xml in order to further analyse the region of interest and eliminate false positives. Figure 13 shows the cropped images that are gathered for further analysis. Fig. 12. Cropped images from the detected region of interest gathered for further analysis Fig. 13. Cropped images of different traffic signs gathered from different images respectively Further, the image processing layer should have a mechanism to cumulatively work with cascade.xml of different signs and classify them into multiple classes based on specific criteria. 3.4 Histogram Approach The region of interest gathered for each of the signs using the cascade classifier involves many false positives, and also a mechanism is required to segregate the images into multiple classes due to which there is a requirement for another image processing layer. Histogram comparison was the initial approach opted for wherein the 8-bin 3D histograms were computed for the query image and the training image of specific traffic signs. Then, normalization of the images and calculation of the minimum distance (preferably Euclidean or the chi-square distance) between the two images was attained. One major drawback of the histogram comparison method was that the threshold could 9 not be set in such a way so as to be able to differentiate between certain true positives and false positives, and further, clear separation between multiple classes could not be achieved. The Figure 14 shows that the cropped images arranged in the ascending order of Chi-square distance for the pedestrian crossing sign, notice that the false positives are not well separated. Fig. 14. Histogram comparison for pedestrian sign arranged in ascending order of distance. 3.5 SIFT Approach With many false positives and failure of the histogram comparison. it is important for us to implement a technique that is robust and must cater two problems, one is regarding clear separation between true positives and false positives, second is regarding multiclass classification. SIFT(Scale Invariant Feature Transform)[2] based image matching technique was used for recognition and multiclass classification. An image of the traffic sign which appears distinctive was chosen for comparison. If there were five different classes, five different traffic sign images were chosen which were distinctive. SIFT based keypoint descriptors were gathered for these images. The regions of interest obtained by the cascade were cropped and collected for each of the images. These regions of interest contained both true positives and false positives. For each cropped image the SIFT based keypoint descriptors were gathered. The feature descriptor of the query image(cropped patches of each image) was matched with the feature descriptors of selected object images of all the different traffic signs. The mechanism that was followed in order to match these features was k-nearest neighbor to get the k best matches. 
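A minimal sketch of this matching step follows; it assumes an OpenCV build in which SIFT is part of the main features2d module (in older builds it lives in xfeatures2d), and the 0.75 ratio corresponds to the goodness criterion explained next.

```cpp
// SIFT descriptors are matched with a 2-nearest-neighbour search; a match is
// counted as "good" when its distance is below 0.75 times that of the
// second-best match, and the count is the per-sign goodness value.
#include <opencv2/opencv.hpp>
#include <vector>

int countGoodMatches(const cv::Mat& signTemplate, const cv::Mat& croppedPatch)
{
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();

    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    sift->detectAndCompute(signTemplate, cv::noArray(), kp1, desc1);
    sift->detectAndCompute(croppedPatch, cv::noArray(), kp2, desc2);
    if (desc1.empty() || desc2.empty()) return 0;

    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);

    int good = 0;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            ++good;
    return good;   // compared against a sign-specific threshold
}
```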
Once these matches were obtained, computation of the distance between these matched points was arrived at by considering the ratio method wherein if the matches were in the format of (m,n) and the distance between the point m was lesser than 0.75 times the point n, then it was considered to be a good match and collected as goodness value. Figure 15 shows the SIFT based feature matching between different signs and the cropped images. The goodness value threshold was specific to each traffic sign and was decided based on the clear delineated separation between the matched goodness values. On a graph, representing the number of good point matches on the y axis and the signs on the x axis would help decide on choosing the threshold points for each of the signs based on maximum separation between the chosen traffic signs. The chosen threshold points for the signs Pedestrian Crossing, No Parking, yield, No Entry and Bus Stop was chosen to be 5,4,3,2 ad 3 respectively. An additional important factor needed to be kept in mind is that, as the system being designed is assistance based, the precision must be kept high which comes at the cost of false negatives. 10 Fig. 15. SIFT feature matching 4 Accuracy Evaluation In order to evaluate accuracy of a system, it will be vital to manually accumulate the ground truth data about the signs present in a subset of images and necessary to run the prediction algorithm to get an evaluation of how many were classified right. Individually, positive and negative images for each sign board were collected and the ground truth value for this data was maintained in a csv file. The prediction algorithm was then run on the dataset in order to get the evaluations of true positive rate(measures the proportion of positives that are correctly identified), false positive rate(measures the proportion of negatives that are incorrectly identified), true negative rate(measures the proportion of negatives that are correctly identified), false negative rate(measures the proportion of postives that are not identified), accuracy(the proportion of true results,inclusive of both true positives and true negatives, among the total number of cases that were examined), precision(fraction of retrieved instances that are relevant), recall(fraction of relevant instances that are retrieved) and F-measure(It is the measure of test accuracy and is defined as the weighted harmonic mean of precision and recall that are calculated). The appendix A, B C, D and E show the evaluation for the signs Pedestrian crossing,No Parking,Yield, Bus Stop and No Entry respectively. 11 The confusion matrix generated when the test cases are combined and run was as shown below:- On one side of the confusion matrix we take the ground truth and evaluate this again the predicted output. When the threshold is rightfully selected based on optimum accuracy value, the prediction algorithm is ready to be run on the large dataset. The algorithm categorises the predicted signs and generates a separate text file for the different categories of traffic signs to which the latitude and longitude of the predictions are written. 5 Conclusion The automation of road images collection for generating a dataset and maintaining the image association with latitude and longitude of the road points was successfully achieved. 
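For completeness, the metrics reported in Section 4 and the appendices follow directly from the confusion-matrix counts; a small sketch with illustrative names:

```cpp
// Evaluation metrics computed from true/false positive and negative counts.
struct Metrics { double tpr, tnr, fpr, fnr, accuracy, precision, recall, f1; };

Metrics computeMetrics(double tp, double fp, double tn, double fn)
{
    Metrics m;
    m.tpr = tp / (tp + fn);                            // true positive rate (recall)
    m.tnr = tn / (tn + fp);                            // true negative rate
    m.fpr = fp / (fp + tn);                            // false positive rate
    m.fnr = fn / (fn + tp);                            // false negative rate
    m.accuracy  = (tp + tn) / (tp + tn + fp + fn);
    m.precision = tp / (tp + fp);
    m.recall    = m.tpr;
    m.f1 = 2.0 * m.precision * m.recall / (m.precision + m.recall);  // harmonic mean
    return m;
}
```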
The combination of a HAAR cascade to identify the regions of interest in an image and SIFT to recognise the traffic signs within them yields an approach that can be applied to a larger dataset with ease. Clarity and illumination of the collected images play a crucial role in the accuracy of the algorithm: if the images are clear, the region of interest is not lost and can be propagated further to the SIFT layer. One hurdle with the images collected from the Google Street View image API was the non-uniform illumination caused by factors such as weather changes and the time of day at which the images were captured. Using the generated files containing the coordinate information for each specific sign, we can plot the coordinates with sign-specific markers on Google Maps using the Maps API. This gives us a visual representation of the points on the map where the sign boards are present. The figure below shows multiple screenshots of the sign boards mapped onto the map.
6 References
1. Atallah, M. J., Szpankowski, W., and Genin, Y. (1996, September). A pattern matching approach to image compression. In Image Processing, 1996. Proceedings., International Conference on (Vol. 2, pp. 349-352). IEEE.
2. Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on (Vol. 2, pp. 1150-1157). IEEE.
3. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91-110.
4. Venkatesan, R., Koon, S. M., Jakubowski, M. H., and Moulin, P. (2000). Robust image hashing. In Proceedings 2000 International Conference on Image Processing (Cat. No.00CH37101), Vancouver, BC, pp. 664-666, vol. 3. doi: 10.1109/ICIP.2000.899541.
5. Jones, M. J., and Viola, P. (2003). Face recognition using boosted local features.
6. Corvee, E., Bremond, F., and Thonnat, M. (2010). Person re-identification using haar-based and dcd-based signature. In Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on. IEEE.
13 A Pedestrian Crossing Evaluation The evaluation for the Pedestrian Crossing sign was as follows:Total Images: 400 Total Pedestrian Images: 200 Total Non-Pedestrian Images: 200 True Positive Rate: 0.455 True Negative Rate: 0.92 False Positive Rate: 0.08 False Negative Rate: 0.55 Accuracy: .69 Precision: 0.85 Recall: 0.455 F1 Measure: 0.6 B No Parking Evaluation 14 The evaluation for the No Parking sign was as follows:Total Images: 400 Total No Parking Images: 200 Total Non-No Parking Images: 200 True Positive Rate: 0.56 True Negative Rate: 0.89 False Positive Rate: 0.11 False Negative Rate: 0.46 Accuracy: .715 Precision: 0.83 Recall: 0.56 F1 Measure: 0.67 C Yield Evaluation 15 The evaluation for the Yield sign was as follows:Total Images: 100 Total Yield Images: 50 Total Non-Yield Images: 50 True Positive Rate: 0.56 True Negative Rate: 0.96 False Positive Rate: 0.04 False Negative Rate: 0.45 Accuracy: .76 Precision: 0.93 Recall: 0.56 F1 Measure: 0.7 D Bus Stop Evaluation The evaluation for the Bus Stop Sign was as follows:Total Images: 80 Total Bus Stop Images: 30 Total Non-Bus Stop Images: 50 True Positive Rate: 0.37 True Negative Rate: 0.86 False Positive Rate: 0.14 False Negative Rate: 0.63 Accuracy: .69 Precision: 0.61 Recall: 0.37 F1 Measure: 0.46 E No Entry Evaluation 16 The evaluation for the No Entry sign was as follows:Total Images: 80 Total No Entry Images: 30 Total Non-No Entry Images: 50 True Positive Rate: 0.43 True Negative Rate: 0.84 False Positive Rate: 0.16 False Negative Rate: 0.57 Accuracy: .69 Precision: 0.62 Recall: 0.43 F1 Measure: 0.50 Evaluation of ORB-SLAM2 in Urban Areas Queens Maria Thomas1 and Oliver Wasenm¨ uller2 1 2 [email protected] [email protected] Abstract. This project presents an implementation of ORB-SLAM2 using ZED stereo camera and evaluation for differet frame rates and number of features. ORB-SLAM2, built on monocular feature-based ORB-SLAM stands out with the unique map reuse, loop closing and relocalization capabilities. ORB-SLAM operates in real time, in small and large, indoor and outdoor environments. This algorithm is built on excellent algorithms and uses the same features for tracking, mapping, relocalization and loop closing. ORB-SLAM2 shows good performance in tracking and mapping with the ZED camera dataset, however real time is still a problem. The dataset was created by recording indoor as well as outdoor scenarios of DFKI and Kaiserslautern town respectively. The evaluation was done on more than 10 sequences and results before and after optimization were compared with respect to frame rate and number of features. Keywords: ORB-SLAM2, State-of-the-art, ZED stereo camera 1 Introduction Odometry plays an essential role in autonomous driving where robust and precise self localization of vehicles is a main challenge. Cameras are cheap and accurate among the different sensor modalities if the scene is rich with texture and this process is called Visual Odometry (VO). VO is a particular case of structure from motion (SfM) as described in [10]. As opposed to monocular cameras, stereo cameras and ToF cameras [11] give depth information as well. Stereo odometry, having two views from a calibrated stereo camera, has the advantage of observable depth and is beneficial in exploiting the rotation and translation motion. Most modern cars are equipped with a stereo camera mounted infront of the car and images captured can be exploited to estimate the trajectory. 
The key idea behind odometry is motion integration through time to obtain the current position and it can be prone to drift. Simultaneous Localization and Mapping (SLAM) and Bundle Adjustment (BA) are two commonly used drift reduction methods. SLAM builds a map of the environment and localizes the sensor itself within the map. This becomes computationally complex since localization needs a consistent map and for acquiring the map, a good estimate of the location is required. Visual odometry approaches uses the ego-motion estimation which can be split into two parts, a rotation motion estimation and translation estimation to exploit the fundamental differences and thus boosting the performance. The idea of this project is to build a live demonstrator of ORB-SLAM2 using ZED stereo camera. The rectified and undistroted images captured by the camera is fed into the algorithm and the features extracted from the images are used to estimate the trajectory of the camera motion. ORBSLAM [5], is a feature-based approach which utilizes Bundle Adjustment and a key place recognition module based on bag-of-words[2]. As the camera explores the scene, the software builds an extensive map of features that is used for tracking and loop closure by recognizing the areas previously visited using the map. SLAM systems are more precise and drift less than odometry approaches but are computationally complex. 2 2 ORB-SLAM2 Overview ORB-SLAM2 [7] for stereo and RGB-D cameras is built on top of ORB-SLAM [5], a feature based monocular SLAM. ORB-SLAM2 runs in real time with no GPU support. Map points are key points projected to 3D space and has associated viewing direction with it. The main components are three threads running in parallel: tacking, local mapping and loop closing as visualized in Fig. 1. Tracking thread is responsible for localizing the camera with each frame and deciding when to insert the new key frame. Local mapping thread manages the local map and optimizes it by performing local bundle adjustment. Loop closing thread is responsible for identifying loop candidates and correcting the loop by performing pose-graph optimization. A full global bundle adjustment is performed at the end to compute the optimal structure and motion. Fig. 1: An overview of ORB-SLAM system. (Image Fig. 2: Input pre-processing. (Image taken from [7]) taken from [7]) The system also has a place recognition module based on bag of words (DBoW) [2] with ORB. ORB are binary features invariant to rotation and scale resulting in a very fast recognizer with good invariance to viewpoint. Monocular SLAM initilization was independent of scenes with two threads computing a homography and fundamental matrix in parallel and uses respectively in case of planar or general scenes. With RGB-D and stereo images specific structure from motion initializations are not neccessary because depth information is already available. The initial frame is set as the first keyframe and its pose is set to origin. A covisibility graph,an essential graph and a spanning tree are maintained by the system to track the map. Covisbility graph contains all the keyframes and an edge is added if two keyframes share match points. This covisibility information is useful in determining the local map independent of global map size and ensures real time operation in large environments. Essential graph is a subgraph of covisibility graph but less number of edges. An edge is inserted if more than a particular number of fetures are shared between two keyframes. 
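One way to picture the covisibility and essential graphs is as weighted adjacency maps over keyframes, where the essential graph keeps only the strong edges. The following is a schematic sketch; the data-structure names and the threshold value are illustrative, not taken from ORB-SLAM2.

```cpp
// Keyframes are nodes; the covisibility edge weight counts shared map points,
// and an essential-graph edge is kept once the weight exceeds a threshold.
#include <algorithm>
#include <map>
#include <utility>

struct KeyframeGraphs {
    std::map<std::pair<int, int>, int> covisibility;   // (kfA, kfB) -> shared map points
    std::map<std::pair<int, int>, int> essential;      // strong edges only
    int minSharedForEssential = 100;                   // illustrative threshold

    void addSharedObservation(int kfA, int kfB)
    {
        if (kfA == kfB) return;
        const std::pair<int, int> key = std::minmax(kfA, kfB);
        const int count = ++covisibility[key];
        if (count >= minSharedForEssential)
            essential[key] = count;
    }
};
```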
A spanning tree has all the keyframes but edges only between keyframes sharing the maximum number of features. A visualization is given in Fig. 3 2.1 Feature Choice and Extraction This algorithm effectively uses the same features used by mapping and tracking for place recognition and loop detection and thereby eliminating the overhead of depth interpolation required by the previous algorithms. The need to extract features extremely fast (∼33ms) eliminates popular SIFT (∼300ms), SURF (∼300ms), A-KAZE (∼100ms) and supports ORB [6], Oriented FAST and Rotated Brief which are easy to match and fast to compute. They have good invariance to rotation as well and 3 Fig. 3: Left : Covisibility Graph, Middle : Essential Graph, Right : Spanning Tree. Image taken from [7] thus improves the place recognition accuracy. Traditional approaches like BRIEF [1] and LDB do not support viewpoint invariance. ORB features are extracted at 8 scale levels in the scale pyramid with a scaling factor of 1.2 ensuring homogeneous distribution as detailed in [7]. For stereo cameras, ORB features are extracted from both left and right images and images are discarded after this. For every ORB in left image, a match is searched for in the right image and stereo keypoints are generated [7]. This is visualized in Fig. 2. The key points are classified as close and far points depending on the depth. The computed ORB descriptor is later used in all feature matching. 2.2 Tracking This section explains the steps performed by tracking thread with every frame from the camera. ORB features are exracted depending on the scale factor and the number of features. ORB stands for Oriented Fast and Rotated BRIEF. Initial pose estimation is done either from previous frame or via global relocalization. If tracking was successful for last frame, a constant velocity motion model is used to predict the camera pose. If tracking is lost, pose is estimated via relocalization. Here the frame is converted into bag of words and query the recognition database for candidates for global relocalization. When a match is found, ORB correspondences are computed and PnP algorithm is used along with RANSAC to estimate pose. Once there is an initial estimation of camera pose and features, a local map is retrieved and is projected to the current frame to search for more matches. Only a local map is projected to manage the complexity in large maps. The local map contains the set of keyframes K1 that share common map points with the current frame and K2, the set of neighboring keyframes of K2. Each map point seen in K1 and K2 are searched for in the current keyframe and camera pose is optimized with all map points found in the frame. The next step is to decide if the current frame is a new keyframe. If a specified number of frames are passed since last keyframe insertion and if there is a significant change in the visual scene, the current frame can be added as a new keyframe. 2.3 Local Mapping When a new keyframe is inserted, the covisibility graph is updated by adding a new node for keyframe and edges are updated with other keyframes. Spanning tree and essential graphs are also updated similarly. Map points are removed from time to time if its not visible from more than three keyframes. This makes the map contain very few outliers. New map points are created by triangulating ORB from keyframes in covisibility graph. Local bundle adjustment is performed with each incoming keyframe by optimizing the connected keyframes and map points. 
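As a side note on the feature extraction of Section 2.1, an extractor with the same pyramid settings (8 levels, scale factor 1.2) can be configured with OpenCV's stock ORB as sketched below. Note that ORB-SLAM2 ships its own extractor that additionally enforces a homogeneous keypoint distribution, so this is only an approximation; the feature count is the parameter varied later in the evaluation.

```cpp
// OpenCV ORB extractor with the pyramid settings described in Section 2.1:
// 8 scale levels and a scale factor of 1.2; nFeatures is varied (1000-2500).
#include <opencv2/opencv.hpp>
#include <vector>

void extractOrb(const cv::Mat& grayImage, int nFeatures,
                std::vector<cv::KeyPoint>& keypoints, cv::Mat& descriptors)
{
    cv::Ptr<cv::ORB> orb = cv::ORB::create(nFeatures,
                                           /*scaleFactor=*/1.2f,
                                           /*nlevels=*/8);
    orb->detectAndCompute(grayImage, cv::noArray(), keypoints, descriptors);
}
```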
All the keyframes 4 whose majority of the map points are seen by other keyframes as well are also removed by local mapping process. 2.4 Loop Closing With each incoming keyframe, the bag of words vector is computed. Loop closing thread computes the similarity between the current keyframe processed by the local mapping and detects if it is a loop candidate. There can be several loop candidates if multiple keyframes are matching the current keyframe. The similarity transformation between current keyframe and loop candidate keyframes informs about the drift accumulated in the loop. The duplicate map points are fused and new edges are inserted in the covisibiltiy graph and essential graph to ensure loop closing. Then the error computed is distributed along the loop to compensate the drift accumulated. A pose graph optimization over the essential graph ensures the uniform distribution of error. 3 New ZED Data Set A total of 12 indoor and outdoor sequences of data were recorded for evaluation using ZED stereo camera [4] with baseline 120 mm. Since ZED SDK uses CUDA, a NVIDIA GPU is required to be able to use the SDK. It outputs rectified and undistorted stereo images which can be passed into the algorithm directly without any further processing. Latest USB 3.0 drivers and NVIDIA drivers enable the camera to output a high resolution synchronized left and right video side-by-side. The need of a powerful GPU limited the initial goal to build a live demonstrator. Hence a new sequence of data were recorded for convenience and reuse. The camera has four different configurations: VGA (640×480) operates at maximum fps 100, HD720 (1280×720) operates at maximum fps 60, HD1080 (1920×1080) operates at maximum fps 30 and HD2K (2208×1242) operates at maximum fps 15. The sequences were captured in HD720 at 60 fps because of its similarity to the KITTI [3] dataset resolution 1241×376. The captured indoor sequences of DFKI included corridor sequences as well as open space sequences covering multiple loops, no loops and tracking in opposite direction scenarios. The outdoor sequences were captured around the university and living town of Kaiserslautern with the camera mounted on top of the car. Previous recordings shown that mounting the camera inside the car near the wind shield induces significant distortion. The dataset is challenging involving different scenarios of strong rotation, initial rotation, mirror reflections, low texture scenes, occlusion by sun light, quick motion etc. The largest sequences covers 2.5 km and the smallest covers 500 m. Ground truth is not available for this dataset because of the expensive GPS integration required. A sample set of images are shown in Fig. 4 and Fig. 5. Fig. 4: Sample images from indoor sequence captured at DFKI Kaiserslautern 5 Fig. 5: Sample images from outdoor sequence captured around the university, living town etc. in Kaiserslautern 4 Changes and Evaluation The implementation of ORB-SLAM2 is available as open-source [9] and reused here for the evaluation with minor changes. The changes are explained below. The original ORB-SLAM2 uses vocabulary data stored in a text file. As explained earlier a vocabulary tree is built by discretizing the image descriptors and is used to speed up correspondences for geometric verification. This vocabulary is created offline with the ORB descriptors extracted from a large setof images. The vocabulary for original ORB-SLAM2 is reused here since the images are general enough. 
Converting the text file into a binary file [8] speeds up the initial start up time from 20 seconds to 0.5 seconds. Global Bundle Adjustment (BA) is usually performed after a loop detection and the computed drift in estimation is distributed along the loop. However, global BA is not launched if the trajectory does not involve a loop. This is tackled by launching the global BA after every 100 keyframes. This also helps to reducce the accumulated drift at the end because the pose is optimized in between, but the probability of missing keyframes is very high. This is because local mapping is stopped temporarily stopped when global bundle adjustment is performed to ensure that no new frames are inserted while error is computed. The configuration data is stored in a .yaml file. configuration data involves the intrinsic parameters of the camera along with ORB-SLAM2 parameters. This includes the frame rate and number of features to be extracted. ORB-SLAM2 algorithm is evaluated for different frame rate and number of features to compute the drift before optimization. After optimization, error is distributed uniformly across the loop and hence the trajectory looks the same at different frame rates and number of features. The evaluated frame rates are 10, 15, 20, 30, 40, 50 and 60 by extracting 1000, 1500, 2000 and 2500 features at each frame rates. The aim was to find out the optimal fps and number of features to be extracted for an accurate trajectory estimation and real time trajectory estimation for a live demonstrator. The evaluation considered various challenging scenarios in the dataset for both indoor and outdoor datsets. Estimated trajectory is assumed to be less prone to drift if the path comes closer to the starting point of the sequence. 5 Results The estimated trajectory before optimization is evaluated for both indoor and outdoor datasets but the lack of ground truth makes the evaluation difficult for indoor dataset. But the estimated indoor trajectroy looks similar to the path traversed and the error accumulated is only in the range of centimeters. Estimated outdoor trajectory after optimization is compared with the path from google maps. The results are illustrated in Fig. 6. Reconstruction of two scenes are visualized in Fig. 7. The red points indicate the local map that is being projected on to the current frame to search for matches. Blue line indicate the key frames. As we can see, a lot of wrong map points are seen in the reconstructed images. 6 Fig. 6: Left: Path from Google Maps, Right: Estimated Trajectory. Sequnces captured using ZED camera around the University and Living Town Kaiserslatern Fig. 7: Sample Reconstructions of the sequences captured using ZED camera in the car parking of LIDL (left) and Living Town Kaiserslautern (right) 7 The datsets are evaluated for accuracy of the estimated trajectory for different frame rates and results before optimization and after optimization are compared. Blue line shows the trajectory after optimization. The accumulated drift is less when the algorithm consumes 30 frames per second which is indicated by red line. Red line comes closer to the beginning of the trajectory and hence less accumulated drift. The results are shown is Fig. 8. Fig. 8: Comparison of estimated trajectory before optimization and after optimization for different frame rates. 
The sequences are captured outdoors using the ZED camera around the university (left) and in the car parking of LIDL (right) in Kaiserslautern Accuracy of the estimated trajectory also depends on the number of features extracted from the images. Better accuracy is achieved when 1500 features are extracted. In Fig. 9, the blue line shows the trajectory after optimization. The red line indicates the trajectory estimated with 1500 features, which is close to the optimized trajectory.

No. of Features | Processing time per frame | Frames processed per second
1000 | ~0.067 s - ~0.08 s | 14.925
1500 | ~0.085 s - ~0.09 s | 11.764
2000 | ~0.0962 s - ~0.12 s | 10.395
2500 | ~0.125 s - ~0.15 s | 8
Table 1: Processing time per frame and resulting frame rate for different numbers of extracted features.

Two scenarios can be assumed. First: for real-time trajectory estimation, it is a good idea to reduce the number of frames consumed and to extract fewer features from each image. From Table 1 it can be observed that extracting 1000 features enables the system to process around 15 frames per second. However, this is challenging, since the probability of losing track is inversely proportional to the number of features extracted, especially during quick motions and strong rotations. A new keyframe is inserted only after 15 frames have passed since the previous keyframe insertion, which leaves very few features to be matched during quick motions and strong rotations. Super real-time operation is possible by extracting 1000 features per frame at 10 fps if no challenging movements are involved, but this is not a realistic scenario. Second: for accurate trajectory estimation, the suitable configuration is 30 fps with 1500 extracted features. Launching the global BA after every 100 or 250 keyframes is a good idea, but the probability of losing keyframes has to be taken into account. Fig. 9: Comparison of estimated trajectory before optimization and after optimization for varying numbers of features. The sequences are captured outdoors using the ZED camera around the university (left) and in the car parking of LIDL (right) in Kaiserslautern If a better sparse reconstruction is of interest, it is good practice to increase the number of features to be extracted at the cost of time. 6 Conclusion As discussed, for real-time trajectory estimation, feeding the system with 15 frames per second and extracting 1000 features is a good practice. However, it is worth mentioning that this configuration is prone to losing track in challenging scenarios. The various challenging scenarios in which track is lost are: strong rotation at high fps, occlusions by sunlight, quick motions, pure rotation, etc. For accurate trajectory estimation, the ideal configuration is 30 fps and 1500 features, but tracking cannot be performed in real time because it takes around 2.7 seconds to process all 30 frames. ORB-SLAM2 shows close to real-time performance while building globally consistent maps in a wide range of environments. Large urban scenarios are handled by projecting only a local map to search for matches instead of projecting the entire map. The performance is further boosted by bundle adjustment and place recognition. ORB-SLAM2 successfully relocalizes when a previously visited place is encountered again after the track is lost at some point. The algorithm also exhibits good loop-closing properties. The algorithm shows drift in the range of centimeters in indoor scenarios and a few meters in outdoor environments. For a real-time demonstration, extracting 1000 features at 15 frames per second is ideal. However, strong rotations and quick motions should be avoided.
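The real-time feasibility argument above is simple arithmetic on the per-frame timings; the following small Python sketch (using the lower-bound timings from Table 1; all names are ours) reproduces the frame rates in the table and checks whether a given camera frame rate can be consumed in real time.

```python
# Lower-bound processing times per frame from Table 1 (seconds), keyed by feature count.
processing_time = {1000: 0.067, 1500: 0.085, 2000: 0.0962, 2500: 0.125}

def achievable_fps(n_features):
    """Frames the tracker can process per second for a given feature budget."""
    return 1.0 / processing_time[n_features]

def is_real_time(n_features, camera_fps):
    """True if every incoming frame can be processed before the next one arrives."""
    return achievable_fps(n_features) >= camera_fps

print(round(achievable_fps(1000), 3))  # 14.925 -> matches Table 1
print(is_real_time(1000, 10))          # True: comfortably real time at 10 fps
print(is_real_time(1500, 30))          # False: 30 frames need roughly 2.6-2.7 s
```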
The system fails to initialize in case of rotating motion in the beginning of sequence. For robust and real time estimation, faster matching of correct map points is very important. Wrong map points or outliers can be removed for faster and accurate sparse reconstruction and estimation through correspondence chaining [12] to minimize redundant and imprecisely reconstructed 3D points where each 3D point is estimated from multiple images. 9 References 1. Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary Robust Independent Elementary Features, pages 778–792. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. 2. D. Galvez-Lpez and J. D. Tardos. Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197, Oct 2012. 3. Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 4. Stereo Labs. Zed stereo camera. 5. Raul Mur-Artal, JMM Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015. 6. Ra´ ul Mur-Artal and Juan D Tard´ os. Orb-slam: Tracking and mapping recognizable features. In Workshop on Multi VIew Geometry in RObotics (MVIGRO) - RSS 2014, 2014. 7. Raul Mur-Artal and Juan D Tardos. Orb-slam2: an open-source slam system for monocular, stereo and rgb-d cameras. arXiv preprint arXiv:1610.06475, 2016. 8. Poine. Binary version o vocabulary file. 9. Raulmur. Binary version o vocabulary file. 10. D. Scaramuzza and F. Fraundorfer. Visual odometry [tutorial]. IEEE Robotics Automation Magazine, 18(4):80–92, Dec 2011. 11. Oliver Wasenm¨ uller, Mohammad Dawud Ansari, and Didier Stricker. Dna-slam: Dense noise aware slam for tof rgb-d cameras. In Asian Conference on Computer Vision Workshop (ACCV workshop). Springer, 2016. 12. Oliver Wasenm¨ uller, Bernd Krolla, Francesco Michielin, and Didier Stricker. Correspondence chaining for enhanced dense 3d reconstruction. In Communication Papers Proceedings of the International Conference on Computer Graphics, Visualization and Computer Vision (WSCG). o.A. Recurrent neural network based human pose estimation Saurabh Varshneya1 , Ahmed Elhayek2 , Pramod Murthy3 1 [email protected] 2 [email protected] 3 [email protected] Abstract. We propose here bi-sequencer LSTM network for prediction and estimation of human body pose in video sequence of data containing 2d joint positions. We have demonstrated by experiments that results of human motion prediction varies with different formats in which data is presented to the same recurrent network. Our major contribution is by training the 2d joint positions by offsets which gave the best results for a future prediction of 500ms. We have also proposed other techniques on how 2d joint positions can be trained on a recurrent network and compared them to get an overall view of how data can be presented to same network to improve the results. Keywords: human pose estimation, human 3.6M dataset, RNN, LSTM, bisequencer LSTM 1 Introduction Human joint prediction problem have been one of the core problems in computer vision for many years where many machine learning algorithms and statistical models like HMM have been applied in the past. Currently, however, deep learning methods have shown promising results in this area.[2] [5]. Human joint prediction involves localizing human body parts in the near future. 
The motivation behind the prediction in future is that the human motion is smooth over time and there are very few abrupt changes. For eg. if a person is performing an activity while sitting it is very probable that he will be performing the same activity in the near future. However, there are various challenges considering the flexible human body with many degree of freedoms of each joint. There could be huge number of possibilities as a combination of these degrees of freedom and it becomes extremely difficult to consider all possibilities while considering human motion. Localizing human body parts in the near future have many applications in the field of robotics where a robot can imitate a human motion very accurately and in real-time if it has some information of the future. Also many researches have shown results using convolution neural networks [4] and few on recurrent neural networks. There have also been implementation on the combination of RNN and CNN [8]. 2 Related Work The task of localizing human body parts has also been previously addressed in the context of computer vision and machine learning. Prior work has considered machine learning models for sequence prediction like using Hidden Markov Models, linear dynamical systems[6] and models based on Restricted Boltzmann Machines(RBMs)[7].These models, however, give valid results only for a short temporal horizon. In the recent past there have been efforts to solve this problem in the domain of deep learning, one is using the encoder-decoder network by Fragkiadaki et al.,2015[2] where a encoder-decoder network is used for human motion modelling and prediction. An encoder network is a CNN which encodes the frame of a video sequence through convolution layers which gives output in the form of joint positions 2 of the human body in the frame, this output is then fed to a decoder network in the form of LSTM. Output of sequential frames from CNN is fed to LSTM and then human motion forecasting is done for future frames. There has also been efforts to solve this problem as spatio-temporal problem and solve it by training a graph based model[5] which have trained Structural-RNN model for human motion forecasting. This structural RNN is a recurrent network based approach where they have used node-RNNs and edge-RNNs. For human motion forecasting Structural-RNN has node-RNNs as different body parts(leg, arm, spine) and edge RNNs for temporal connections between the body parts. This part based RNNs are then fused together and trained jointly for the task of human motion prediction. Our motivation and approaches are based on the last two related works based on deep learning. 3 Proposed Approach In our proposed approach we have divided the approach into modules. First module deal with the Data Preprocessing where we have take human 3.6m dataset and extracted frames and joint positions from it. Detailed description of this module is presented in section 3.1 .The second module deals with defining a recurrent neural network in the form of a bi-sequencer LSTM. Detailed description and network architecture is defined in section 3.2 . The third module deals with training data on network and fine-tuning the network parameters, detailed description of which is described in section 3.3 .Various experiments performed by training the same network but showing data to the network in different forms, detailed description and results of the experiments are described in Section 4. 
In the last module we do a quantitative evaluation and comparison of the approaches proposed in this paper, which is described in Section 5. 3.1 Data Preprocessing We collect the required 2D annotations from the Human 3.6M dataset. It contains video sequences of various activities performed by seven subjects. We consider S11 as the validation subject, S5 as the test subject, and the remaining five subjects are used for training. While training the network, all sequences are mean-subtracted and divided by the standard deviation. For collecting the 2D annotations we projected the 3D points onto one of the given cameras. 3.2 Network Structure We started with a simple RNN structure and then found empirically that a bi-sequencer LSTM performs best on such sequential temporal data. The final architecture is shown in Fig. 1. The bi-sequencer LSTM used is described in the form of a forward and a backward module in Fig. 2. 3.3 Training Our aim is to model the future 2D joint positions as a function of the 2D joint positions in the current and past frames. At each time step we have a vector of 2D joint positions x_t = (x_1, ..., x_N), where N is the number of joints to be modelled. We feed the data of 50 time steps simultaneously to the network and estimate the future joint positions as a vector y_t = (y_1, ..., y_N) with a mean squared loss as described below:

\ell = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i)^2

We started by training a large network while analyzing the loss functions and then increased the dropout to obtain the best validation error, which in turn gives the optimal size of the network. Fig. 1. Network Architecture Fig. 2. Bi-sequencer LSTM Ideally, the network should neither be so large that it over-fits the data nor so small that it under-fits; a very large network is also difficult to train because of the large number of parameters. For this purpose we started with the simple network and then monitored the validation loss and the training loss. We compare both losses at regular intervals called checkpoints. If the training loss becomes much lower than the validation loss, the network might be over-fitting; the solution is to decrease the network size or to add dropout. If the training loss and validation loss are about equal, the model is under-fitting; in this case we increase the network size. Another tuning parameter is the learning rate: we started with the common value 0.001 and decreased the rate by a factor of 0.1 whenever the training loss remained constant or increased over 5000 iterations. In this way we obtained an optimal value of the learning rate each time. 4 Experiments 4.1 Training with Joints In this approach we estimate the joints as a function of the joints in the previous frames. We consider every 5th frame and every 3rd frame for the comparison. Data from every single frame is not considered because the overall movement between consecutive frames is very small. The videos are recorded at around 25 frames per second. We try to predict body parts for the next 1000 ms. So, if every 5th frame is considered, we predict the next five frames, and if every 3rd frame is considered, we predict eight frames in total. Results

Prediction: 200 ms | 400 ms | 600 ms | 800 ms | 1000 ms
Avg. Euclidean Loss: 20.24 | 20.83 | 22.34 | 24.58 | 25.16

Fig. 3. Frame predicted while training with only joints data for activity walking.
The red squares are predictions made by the network whereas the green ones are ground truth.Here, prediction is done for every 200 ms 5 We also tried to estimate the same results by considering every 3rd frame wherein estimating next 8 frames for the result. Prediction 120ms 240ms 360ms 480ms 600ms 720ms 840ms 960ms Avg. Euclidean Loss 17.88 18.11 19.13 20.71 20.40 21.21 20.78 23.35 Fig. 4. Frame predicted while training with only joints data for activity walking. The red squares are predictions made by the network whereas the green ones are ground truth.Here, prediction is done for every 120 ms 4.2 Training with Offsets The next approach is to feed offsets to the network. The idea was that if we know the current position with CNN we can predict the body parts by just knowing the offsets. Also, the future scope could be to merge the result of RNN with that of CNN. Training sequences like these using offsets have shown good results in other areas like handwriting production [3] and tracking pedestrians[1]. Again we considered every fifth frame and every 3rd frame and calculated offsets by subtracting data from the next frame.Below is the result we got after considering the offset every 5th frame. Results Prediction 200ms 400ms 600ms 800ms 1000ms Avg. Euclidean Loss 3.94 8.94 13.91 17.50 20.07 Here again we compared the result by training offsets calculated with every third frame. Prediction 120ms 240ms 360ms 480ms 600ms 720ms 840ms 960ms Avg. Euclidean Loss 1.69 3.12 4.51 6.41 9.12 11.97 14.74 17.11 6 Fig. 5. Frame predicted while training with offsets data for activity walking. The red squares are predictions made by the network whereas the green ones are ground truth.Here, prediction is done for every 200 ms Fig. 6. Frame predicted while training with offsets data for activity walking. The red squares are predictions made by the network whereas the green ones are ground truth. Here, prediction is done for every 120 ms 4.3 Training with Head Normalized Joints Here we train with the joints normalized with respect to the head. So at each frame the CNN output of head position will be considered as the origin and all other body parts are defined w.r.t. to the head co-ordinates. The idea is that mostly CNN predict the head position very accurately and rest body parts can be occluded. Our main effort is to locate those body parts knowing the head positions. We consider head-normalized body part positions every third frame and obtain following result: 7 Results Prediction 120ms 240ms 360ms 480ms 600ms 720ms 840ms 960ms Avg. Euclidean Loss 6.72 8.22 9.39 10.14 10.93 11.67 12.55 13.63 Fig. 7. Comparing various approaches in terms of average euclidean distance between the predicted joints and the ground truth 4.4 Part Based Training While training with the head normalized joints the results in the long-term improved. However, there was still a scope where motion of the body parts could be learned locally. To further improve the results we applied an approach of part based learning. The idea of this approach is to further improve the results by training on small RNNs and then fuse them together to obtain better result, small RNNs will learn the local motion and then fused RNN above them will be able to learn the global motion. With this idea we used four small RNNs one each for left leg, right leg, left arm and right arm and then fused them together to obtain the positions of all body parts. 
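To illustrate what such a part-based model can look like in code, the sketch below (PyTorch; the layer sizes, the grouping of joints into four limbs and all names are our own assumptions rather than the exact architecture trained here) runs one small LSTM per limb and fuses their features with a second LSTM that predicts the next frame for all joints.

```python
import torch
import torch.nn as nn

class PartBasedPredictor(nn.Module):
    """Illustrative part-based model: one small LSTM per limb, fused by a second
    LSTM that predicts the 2D joint values of the next frame for all limbs."""
    def __init__(self, part_dims, hidden=64, fused_hidden=128):
        super().__init__()
        self.part_rnns = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True) for d in part_dims])
        self.fusion = nn.LSTM(hidden * len(part_dims), fused_hidden, batch_first=True)
        self.head = nn.Linear(fused_hidden, sum(part_dims))

    def forward(self, parts):                      # parts: list of (B, T, d_i) tensors
        feats = [rnn(p)[0] for rnn, p in zip(self.part_rnns, parts)]   # local motion
        fused, _ = self.fusion(torch.cat(feats, dim=-1))               # global motion
        return self.head(fused[:, -1])             # prediction for the next time step

# Toy usage: four limbs, three joints each (x, y per joint -> 6 values), 50 time steps.
model = PartBasedPredictor(part_dims=[6, 6, 6, 6])
sequences = [torch.randn(2, 50, 6) for _ in range(4)]
prediction = model(sequences)                      # shape (2, 24)
loss = nn.functional.mse_loss(prediction, torch.randn(2, 24))
```

As in the experiments above, the per-limb recurrences and the fusion stage would be trained jointly, with the fused prediction supervised by a mean squared loss.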
Results We considered every third frame to train our model, and each predicted frame corresponds to 120 ms. Again, prediction was done for 1000 ms in total, with the following result:

Prediction: 120 ms | 240 ms | 360 ms | 480 ms | 600 ms | 720 ms | 840 ms | 960 ms
Avg. Euclidean Loss: 7.07 | 8.45 | 9.43 | 9.86 | 10.27 | 10.48 | 11.23 | 12.13

Fig. 8. Frame predicted while training with head normalized joints data for activity walking. The red squares are predictions made by the network whereas the green ones are ground truth. Here, prediction is done for every 120 ms 5 Comparison of approaches A quantitative comparison among the above approaches can be seen in Table 1 below, where the Euclidean distance of each approach is compared over 1000 ms of prediction. Fig. 7 shows the graphical comparison. From the graph, one can see that training with offsets performs best for short prediction horizons, while the approaches using joint positions normalized w.r.t. the head perform much better in the long term than the offset approach.

Prediction horizon: 120 ms | 240 ms | 360 ms | 480 ms | 600 ms | 720 ms | 840 ms | 960 ms
Train Joints Positions: 17.88 | 18.11 | 19.13 | 20.71 | 20.4 | 21.21 | 20.78 | 23.35
Train Offsets: 1.69 | 3.12 | 4.51 | 6.41 | 9.12 | 11.97 | 14.74 | 17.11
Train Head Normalized Joints: 6.72 | 8.22 | 9.39 | 10.14 | 10.93 | 11.67 | 12.55 | 13.63
Train Part based: 7.07 | 8.45 | 9.43 | 9.86 | 10.27 | 10.48 | 11.23 | 12.13
Table 1. Compares the various approaches discussed in this paper

6 Conclusion We proposed four approaches for presenting 2D joint position data of the human body to a recurrent neural network for training. We also propose the bi-sequencer LSTM architecture for human pose estimation in the 2D domain. In addition, we carried out an exhaustive comparison of the given approaches and explained the benefits and drawbacks of each. These approaches use RNN-LSTM based models and can readily be combined with any CNN model to improve its results. After implementing the RNN-based network and training it with 2D human joint positions, our future idea is to train the network with 3D joint positions in the form of Euler angles. Recurrent networks should give better results in 3D, since some skeleton-based information is available and the movement in 3D is linear. In the future, we would also like to combine it with a CNN model trained on 3D joints and compare the results. References [1] A. Alahi et al. "Social LSTM: Human Trajectory Prediction in Crowded Spaces". In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2016, pp. 961–971. doi: 10.1109/CVPR.2016.110. [2] K. Fragkiadaki et al. "Recurrent Network Models for Human Dynamics". In: ArXiv e-prints (Aug. 2015). arXiv: 1508.00271 [cs.CV]. [3] A. Graves. "Generating Sequences With Recurrent Neural Networks". In: ArXiv e-prints (Aug. 2013). arXiv: 1308.0850. [4] A. Jain et al. "Learning Human Pose Estimation Features with Convolutional Networks". In: ArXiv e-prints (Dec. 2013). arXiv: 1312.7302 [cs.CV]. [5] A. Jain et al. "Structural-RNN: Deep Learning on Spatio-Temporal Graphs". In: ArXiv e-prints (Nov. 2015). arXiv: 1511.05298 [cs.CV]. [6] Vladimir Pavlovic, James M Rehg, and John MacCormick. "Learning switching linear models of human motion". In: NIPS. Vol. 2. 2000, p. 4. [7] Graham W. Taylor and Geoffrey E. Hinton. "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style". In: Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09. Montreal, Quebec, Canada: ACM, 2009, pp. 1025–1032. isbn: 978-1-60558-516-1.
doi: 10.1145/1553374.1553505. url: http://doi.acm.org/10.1145/ 1553374.1553505. J. Tompson et al. “Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation”. In: ArXiv e-prints (June 2014). arXiv: 1406.2984 [cs.CV]. CNN Based Depth Estimation from Single Images Rouven Bauer1 and Tewodros Amberbir Habtegebrial2 2 1 [email protected] tewodros [email protected] Abstract. The research area of computer vision yields tasks that turn out to be too complex to obtain reasonable results with traditional mathematical or computational models. In recent years, deep neural networks turned out to be a very successful choice for many of those tasks. In this paper we will survey approaches utilizing deep convolutional neural networks (CNNs) to tackle the task of depth estimation from monocular images. After presenting several approaches we will compare them using quantitative measurements on two data sets. Keywords: CNN, Depth Estimation, Monocular, Survey 1 Introduction Using epipolar geometry, one can compute a depth map from a binocular (i. e. stereo) image [9] if the relative camera poses are known. Another option is to use a set of locally close monocular images, to estimate the camera pose together with depth information. This approach is called Structure from Motion (SfM) [21]. A related approach is Simultaneous Localization and Mapping (SLAM) where a continuous stream of locally and temporal close images, e. g. a video camera on a robot, is used to simultaneously estimate the camera pose and 3D information of the environment [2]. But what if one wants to compute that information from a single 2D image? For humans it appears to be an easy task: given a single 2D image of an arbitrary scene the human brain is usually capable of reconstructing a rich 3D estimation of the photographed scene. For computer vision, however, this task has been a hard nut to crack for a long time as explained in Section 2. But the recent development of computing power and especially parallel processing capabilities gave rise to the recent flourishing of the research on artificial neural networks (ANNs). With the new power the task of deep learning (i. e. using ANNs with many hidden layers) becomes a manageable one [17]. Applications such as classification, content generation, and many more including depth estimation from monocular images received a qualitative push forward. 2 Previous work Previous approaches in this field focus on geometric assumptions like box models [10]. These models and their necessary assumptions lead to very specific and thus not generally robust systems. A major break through in the field was proposed in 2009 by Saxena et al. [15]. They called their system Make3D. It is a trained Markov random field (MRF) that infers depth from 2D cues. Make3D only assumes that the real 3D model is composed of planes, which is a not too restricting assumption, given the planes can be chosen small enough. However, the results could still be improved and were clearly distinguishable from what human abilities. Results were improved by using data driven approaches like depth transfer [11] or Fouhey’s learning 3D primitives [5]. The disadvantage is that the online search strikes out the possibility to use such systems in real-time applications. Furthermore, they have been shown to perform worse than the best systems in this survey [13,3]. 2 3 Training CNNs with ground-truth depth maps In 2014, Eigen et al. [4] proposed a new approach for the problem of depth estimation. 
They use two CNNs as can be seen in Figure 1. One calculates a coarse depth map with very low resolution. The second one calculates a more fine-grained depth map from the actual input combined with the previously generated coarse depth estimation. This structure of two CNNs was chosen because it yields better results, probably due to the fact that the coarse network has a more global view of the input, while a fine-grained network alone would only be able to predict depth from spatially local information. Figure 1: CNN architecture of Eigen et al. 2014 [4]. Figure copied from their work. Another important idea of Eigen et al. is the scale-invariant training loss they used. By using the ground-truth to set the mean log depth of the predictions to the correct one, the predictions could be relatively improved by 20%. This shows that a major part of the prediction error arises from a wrong scale guess by the network. To correct for this problem, the CNNs were trained with a scale-invariant mean squared error in log-space

D(y, g) = \frac{1}{2n} \sum_{i=1}^{n} \left[ \log y_i - \log g_i + \alpha(y, g) \right]^2   (1)

where y is the prediction, g the ground-truth, n the number of pixels, and \alpha is the mean scale error of the prediction y: \alpha(y, g) = \frac{1}{n} \sum_i (\log g_i - \log y_i). Or, alternatively, let d_i = \log y_i - \log g_i:

D(y, g) = \frac{1}{2n^2} \sum_{i,j=1}^{n} \left[ (\log y_i - \log y_j) - (\log g_i - \log g_j) \right]^2   (2)
        = \frac{1}{n} \sum_{i=1}^{n} d_i^2 - \frac{1}{n^2} \left( \sum_{i=1}^{n} d_i \right)^2   (3)

The publication of Eigen et al. had a great impact in the community. Inspired by the work, others like Li et al. [12] and Liu et al. [13] proposed refined approaches involving CNNs in 2015. Both use a CNN to predict depth for super-pixels (that are patches segmenting the image into areas with high self-similarity³). These sparse depth values are then interpolated using Conditional Random Fields (CRFs) to obtain a global depth map. The approaches differ in their CRF models (as shown in Table 1). Both require the pixels corresponding to the super-pixels' centroids to have the depth regressed from the CNN. Furthermore, Li et al. incorporate smoothness for the depth of adjacent super-pixels normalized by their difference in CIELUV color space [16], i.e. the more similar the super-pixels, the smoother the depth map locally. Additionally, they use a summand for auto-regression, i.e. the depth map's local self-similarity corresponds to the color image's local intensity self-similarity. Liu et al., on the other hand, take another approach by extracting three features of adjacent super-pixels: color difference, color histogram difference, and texture disparity in the form of local binary patterns (LBP) [14]. These features are then fed into a neural network with one fully connected, hidden layer. The network's output is used to weight the smoothness summand for adjacent super-pixels. A direct comparison of the chosen loss functions can be found in Table 1. However, both approaches yield results similar in quality as shown in Section 7.

Total loss function | Li et al.: E(y) = \sum_{i \in S} E_i(y_i) + \sum_{(i,j) \in E_S} E_{ij}(y_i, y_j) + \sum_{C \in P} E_C(y_C) | Liu et al.: E(y) = \sum_{i \in S} E_i(y_i) + \sum_{(i,j) \in E_S} E_{ij}(y_i, y_j)
Unary loss: enforce copying the depth values of the CNN at the super-pixel centroids | Both: E_i(y_i) = (y_i - \bar{y}_i)^2, where \bar{y}_i is the output of the CNN for super-pixel i
Smoothness at super-pixel level | Li et al.: E_{ij}(y_i, y_j) = w_1 \left( \frac{y_i - y_j}{\lambda_{ij}} \right)^2, where w_1 is some learned weight and \lambda_{ij} is the centroids' distance in the CIELUV color space | Liu et al.: E_{ij}(y_i, y_j) = \frac{1}{2} R_{ij} (y_i - y_j)^2, where R_{ij} is calculated by a neural network from the super-pixels' color distance, color-histogram distance, and LBP
Super-pixel patch intrinsic smoothness at pixel level | Li et al.: E_C(y_C) = w_2 \left( y_u - \sum_{r \in C \setminus u} \alpha_{ur} y_r \right)^2, where w_2 is some learned weight and C \setminus u is u's neighborhood; \alpha_{ur} \propto \exp(-(g_u - g_r)^2 / 2\sigma_u^2) with \sum_r \alpha_{ur} = 1 is the local auto-regression predictor, where g is the ground-truth depth and \sigma_u is the intensity variance around u; sadly, the paper is missing how u is chosen | Liu et al.: none
Table 1: In this table we compare Li et al.'s and Liu et al.'s loss functions used to train the CRF. We use y for the predicted depth map, y_i for the i-th pixel, S for all super-pixels, E_S for the set of all adjacent super-pixel pairs, and P for all super-pixel patches, i.e. the sets of pixels where each set contains all pixels belonging to one super-pixel.
³ In this survey we will not focus on how these super-pixels are selected.

In 2015, Eigen et al. published an advanced version of their system with drastically improved results [3]. The network structure used, as seen in Figure 2, is not only capable of learning depth prediction, but also normal and label prediction, both of which will not be discussed here as they are off-topic. The key point that changed compared to their previous work is the new network structure, which incorporates three instead of only two scales on which the network predicts. Furthermore, the loss function as shown in Equation 3 is altered to

E(y, g) = \frac{1}{n} \sum_{i=1}^{n} d_i^2 - \frac{1}{2n^2} \left( \sum_{i=1}^{n} d_i \right)^2 + \frac{1}{n} \sum_{i=1}^{n} \left[ (\nabla_x d_i)^2 + (\nabla_y d_i)^2 \right]   (4)

where d = y - g instead of \log y - \log g as in [4]. The only other change is the new third summand. It encourages predictions where the depth gradient is similar to the ground-truth's depth gradient. Figure 2: CNN architecture of Eigen et al. 2015 [3]. Figure copied from their work. 4 Utilizing labeled Sections Inspired by the work of Eigen et al. from 2014 [4], Wang et al. proposed an improved method in 2015 [20]. They join the task of semantic image labeling with depth prediction. To do so, two CNNs are used: one to jointly predict depth and semantic labels on a global scale at low resolution, while a second CNN is used to predict depth and a label segment-wise. Similarly to the previously presented approaches by Li et al. [12] and Liu et al. [13] (see Section 3), Wang et al. combine these local and global predictions using a CRF to infer the final depth and label predictions. As we will show in Section 7, their results are in fact superior to their inspiration, but inferior to the results of Eigen et al. 2015. Because of that, and the fact that their publication is missing some details about their system, we will not go further into detail. 5 Leveraging epipolar geometry The previously explained methods have in common that they need ground-truth depth data. It is hard to accurately collect such data suiting the use case. Therefore, approaches using only stereoscopic images were proposed. One example of such work was published by Garg et al. in 2016 [6]. The training data consists of image pairs that were taken simultaneously with a pair of stereo cameras. One major aspect of the approach by Garg et al. is the training loss they use. By using an image reconstruction loss, they obtain good results as shown in Section 7. The image reconstruction loss is visualized in Figure 3. During training, the depth map predicted by the CNN is used to calculate the disparity map⁴ to warp the right image (ground-truth) back, see Equation 5.
The training loss is then the difference between this reconstruction and the input image plus a weighted smoothness summand to overcome the lack of reconstruction information in homogenous regions of the image, see Equation 7. n Erecons = n 1X 1X ||Iw (i) − I1 (i)||2 = ||I2 (x + D(i) ) − I1 (i)||2 |{z} n i=0 n i=0 (5) f B/d(i) where Iw is the reconstruction of the input image I1 and I2 is the target image (ground-truth i. e. second view to be generated). D(i) = f B/d(i) is the disparity of pixel i from I2 to I1 with respect to the CNN’s depth prediction d(i) where f is the focal length of the used camera pair and B is their vertical distance. n Esmooth = 1X ||∇D(i)||2 n i=0 (6) where ∇D is the disparity gradient at pixel i. Together the partial error functions form the error function used by Garg et al.: E = Erecons + γEsmooth (7) Figure 3: Reconstruction loss as used by Garg et al. [6]. f stands for the focal length of the cameras while B is the horizontal distance between them and d(x) is the CNN’s depth prediction of pixel x. Figure copied from their work. 4 The disparity map stores the amount each pixel has to be shifted to obtain the image from another view. 6 One major disadvantage of the chosen loss function is the fact, that it’s not fully differentiable. Godard et al. [8] proposed a similar system in the same year. They overcome the need to use an approximation of the loss function’s derivative with a Taylor series expansion by replacing the smoothness error summand with bilinear sampling to obtain a sub-differentiable loss function. The system also differs in that it directly predicts the disparity map from which the corresponding depth map can then be calculated. Anyhow, the key insight is that the results can be improved by enforcing a left-right consistency. That is done by training the CNN to output a left (dl ) and a right (dr ) disparity map and to incorporate an l-r consistency loss as Elr-cons = n 1 X l dij − drij+dl ij n i=1 (8) j=1 This penalizes differences between the left disparity map and the right one if shifted according to the left one. 6 Summary After we have seen different approaches, all utilizing CNNs, to estimate depth from monocular images, we will briefly compare them by their main features. The first approach we have introduced (Eigen et al. 2014 [4]) uses two CNNs to directly predict the depth map for the whole image. Both Li et al. [12] and Liu et al. [13] extend this idea by only predicting the depth for super-pixels with CNNs. The interpolation between these super-pixels is done with CRFs. Both approaches are very similar and mainly differ in the loss function used to train the CRFs. In 2015, Eigen et al. published a similar system to the first one [3]. The main difference is that the CNN structure is changed and the CNNs are additionally trained to predict normals and image labels. Simultaneously, Wang et al. [20] published a system that also predicts depth and labels. It differs from Eigen 2015 in that it, similar to Li 2015 and Liu 2015, implements the CNN to only predict on super-pixels while CRFs interpolate the CNN’s sparse prediction to a full reconstruction. Finally, we have seen the approach by Garg et al. from 2016 [6] which starkly differs from the previous surveyed methods. The key idea of Garg et al. is to predict a disparity image (i. e. the missing view of a stereoscopic image) and then compute the depth using epipolar geometry, instead of predicting the depth information directly from the input as all other surveyed approaches before. 
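Since several of the surveyed systems train and report errors in log-depth space, it is worth making the scale-invariant measure of Equations (1)/(3) concrete. The short NumPy sketch below follows the right-hand side of Equation (3); the function name and the test values are our own.

```python
import numpy as np

def scale_invariant_error(pred_depth, gt_depth):
    """Scale-invariant log error (Eq. 3): unaffected by a global scaling of the prediction."""
    d = np.log(pred_depth) - np.log(gt_depth)   # per-pixel log-depth differences d_i
    n = d.size
    return (d ** 2).sum() / n - (d.sum() ** 2) / (n ** 2)

gt = np.random.uniform(1.0, 10.0, size=(4, 4))
print(np.isclose(scale_invariant_error(gt, gt), 0.0))        # perfect prediction -> 0
print(np.isclose(scale_invariant_error(2.0 * gt, gt), 0.0))  # a global scale error is not penalised
```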
7 Comparison of the results Most of the surveyed systems use the NYU Depth[18] or the NYU Depth v2[19] data set to evaluate their results. Both consist of indoor scenes captured with a Microsoft Kinect which is a structured light depth sensor combined with a low-resolution webcam. Furthermore, the images are segmented with class labels. Some other approaches are additionally or exclusively using the KITTI Dataset [7]. Its data was captured utilizing a car with roof-mounted color and monochrome stereo-cameras, as well an inertial measurement unit (IMU), a Global Positioning System (GPS) sensor and a laser scanner. The car drove in a mid-sized city, rural areas, as well highways while recording the data. Due to this diversity in data sets used for training and evaluation, we will split the comparison into two parts according to the data sets. Note that there is also the Make3D Dataset [15] which is used by some of the approaches as well. However, we will ignore it in our comparison as its ground-truth depth data is more noisy than the one of the other data sets. Furthermore, we are able to get a good insight in the performance of the proposed systems even without relying on the Make3D Dataset. Godard et al. [8] are also utilizing the Cityscape Dataset by Cordths et al. [1] for training and 7 Approach higher is better lower is better δ < 1.25 δ < 1.252 δ < 1.253 abs rel sq rel RMSE RMSE log Eigen 2014 coarse[4] Eigen 2014 fine[4] Liu 2015 [13] Li 2015 [12] Wang 2015 [20] Eigen 2015 [3] 0.618 0.611 0.614 0.621 0.605 0.796 0.891 0.887 0.883 0.886 0.890 0.950 0.969 0.228 0.971 0.215 0.971 0.23 0.968 0.232 0.970 0.220 0.988 0.158 0.223 0.871 0.212 0.907 – 0.824 – 0.821 0.210 0.745 0.121 0.641 Eigen 2014 fine[4]◦ Garg 2016 [6] Godard 2016 [8]† 0.692 0.740 0.899 0.899 0.904 0.964 0.967 0.190 1.515 7.156 0.962 0.169 1.08 5.104 0.984 0.098 1.218 5.265 0.283 0.285 – NYU Depth – 0.262 0.214 0.27 0.273 0.175 KITTI Table 2: Comparing all surveyed approaches by quantitative measurements split into two data sets they were compared to. Bold cells mark the best result per data set and column while emerald cells mark the best results per column across data sets. ◦ We don’t list the results of the coarse network as it’s inferior to the fine network. † This system learns using stereo image pairs instead of depth ground-truth. evaluation. Nevertheless, we will also exclude this data set from our comparison as Godard et al. are the only of the reviewed works using it. Before we actually compare the results of the discussed methods, we’ll take a look at the used metrics. Let y be the predicted depth map, g the ground-truth, and n the number of pixels of both— y and g are of same size (n): Threshold δ < x: n1 {i| max{ ygii , ygii } = δ < x, i ∈ {1..n}} ; Absolute 2 Pn Pn i| i) ; Squared relative distance: n1 i=1 (yi −g ; Root mean squared relative distance: n1 i=1 |yig−g i q P q giP n n 1 1 error (RMSE): n i=1 (yi − gi )2 ; Logarithmic RMSE (RMSE log): n i=1 (log yi − log gi )2 . The comparison over these metrics can be found in Table 2. Finally, we want to present some images for the reader to gain a qualitative impression of the results. These result visualizations can be found in Figure 4 and Figure 5. Figure 5: Qualitative results of system as proposed by Godard et al. 2016 [8]. first row: input image, second row: ground-truth, third row: prediction Figure 4: Qualitative results of Eigen et al. 2015 [3]. (a) Figure copied from their work. input image, (b) result of Eigen et al. 
2014 [4], (c) result of Eigen et al. 2015, (d) ground-truth. Figure copied from their work. 8 References 1. Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2. Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping: part i. IEEE robotics & automation magazine, 13(2):99–110, 2006. 3. David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015. 4. David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014. 5. David F Fouhey, Abhinav Gupta, and Martial Hebert. Data-driven 3d primitives for single image understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 3392– 3399, 2013. 6. Ravi Garg and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. arXiv preprint arXiv:1603.04992, 2016. 7. Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and PatternRecognition (CVPR), 2012. 8. Cl´ement Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. arXiv preprint arXiv:1609.03677, 2016. 9. Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision, chapter 9, pages 239–261. Cambridge university press, 2003. 10. Varsha Hedau, Derek Hoiem, and David Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In European Conference on Computer Vision, pages 224–237. Springer, 2010. 11. Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth transfer: Depth extraction from video using nonparametric sampling. IEEE transactions on pattern analysis and machine intelligence, 36(11):2144–2158, 2014. 12. Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical crfs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015. 13. Fayao Liu, Chunhua Shen, and Guosheng Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015. 14. Timo Ojala, Matti Pietikainen, and David Harwood. Performance evaluation of texture measures with classification based on kullback discrimination of distributions. In Pattern Recognition, 1994. Vol. 1Conference A: Computer Vision & Image Processing., Proceedings of the 12th IAPR International Conference on, volume 1, pages 582–585. IEEE, 1994. 15. Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824–840, 2009. 16. J´ anos Schanda. Colorimetry: understanding the CIE system, chapter 3 CIE 1979 (L∗ u∗ v∗ ) Color Space, CIELUV Color Space, page 64. John Wiley & Sons, 2007. 17. J¨ urgen Schmidhuber. Deep learning in neural networks: An overview. 
Neural Networks, 61:85–117, 2015. 18. Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision - Workshop on 3D Representation and Recognition, 2011. 19. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012. 20. Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan Yuille. Towards unified depth and semantic prediction from a single image. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2800–2809. IEEE, 2015. 21. Gao Yi, Luo Jianxin, Qiu Hangping, and Wu Bo. Survey of structure from motion. In Cloud Computing and Internet of Things (CCIOT), 2014 International Conference on, pages 72–76. IEEE, 2014. Survey on Convolutional Network (CNN) Based Human Pose Estimation Anitha Bhoopal1 and Ahmed Elhayek2 1 [email protected] 2 [email protected] Abstract. Human pose estimation is one of the hardest problem in computer vision, this report contains survey of how convolutional neural network has been evolved in solving human pose estimation problem. We come across the evolution of ConvNet architectures which addresses the problem and techniques of learning low-level features and higher level spatial model. And the hybrid architecture which explains the structural domain constraints such as geometric relationships between body joint locations. Finally survey explains which of these two architectures are better and finally a new generative pose estimation framework for fisheye views with a ConvNet-based body-part detector trained on a large new dataset. Keywords: Human-body pose estimation, convolutional network, FLIC dataset, RGB image, motion tracking, deep learning, joints 1 Introduction One of the hardest tasks in computer vision is determining the high degree-of-freedom configuration of a human body with all its limbs, complex self-occlusion, self-similar parts, and large variations due to clothing, body-type, lighting, and many other factors. The best performing pose estimation methods, including those based on deformable part models, typically are based on body part detectors. Such body part detectors commonly consist of multiple stages of processing. There are two ways to solve the problem of Human pose estimation 1. Generative and 2.Discriminative methods. In generative approach a skeleton is manually allined to the image, and track the motion is frames and update the skeleton until it fits the skeleton. Discriminative approach, individual joints are detected from image and joined to form a kinematic chain. The focus of this paper is on discriminative method. The survey combines three different articles on CNN architecture which explains different approaches of Human pose estimation. The article [2],[6] discusses about discriminative methods whereas article [4] discusses discriminative and generative aprproach combined together. 2 Convolution Neural Network Convolutional Neural Networks (ConvNets or CNNs) are a class of Neural Networks that can act directly on the raw inputs. ConvNets have been essential in recognizing confronts, objects and traffic signs separated from controlling vision in robots and self-driving autos. 
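As a concrete (and deliberately tiny) illustration of a ConvNet that acts directly on raw pixels and regresses joint coordinates, consider the following PyTorch sketch; the layer sizes, patch size and joint count are illustrative assumptions and do not correspond to any of the surveyed architectures.

```python
import torch
import torch.nn as nn

class PoseRegressionCNN(nn.Module):
    """Toy ConvNet mapping a raw 64x64 RGB patch to 2D joint coordinates:
    three convolution + ReLU + max-pooling stages followed by a linear regressor."""
    def __init__(self, num_joints=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.regressor = nn.Linear(128 * 8 * 8, 2 * num_joints)

    def forward(self, x):                                        # x: (B, 3, 64, 64)
        features = self.features(x).flatten(1)
        return self.regressor(features).view(x.size(0), -1, 2)   # (B, num_joints, 2)

model = PoseRegressionCNN()
joints = model(torch.randn(1, 3, 64, 64))                        # predicted (x, y) for 14 joints
```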
To perform human pose estimation one of obvious method is using Convoutional network architecture where RGB image is given as input which is then processed by three convolutional layers and subsampling, they use ReLUs(rectified linear unit) and max-pooling [2]. The output is a respective pose co-ordinates i.e 2D or 3D positions of joints. The one disadvantages is the network is pooling, while valuable for improving translation invariance during object recognition, destroys exact spatial information which is important to accurately anticipate pose. Another issue is that the immediate mapping from info space to kinematic body posture coefficients is very non-direct and not balanced. 2 3 Sliding Window Model Fig. 1. The Sliding-Window architecture[2]. Input: 64x64 patch from image. Output: Probability of joint being present in the patch. According to Jain et al[2] the sliding window model is shown in Figure 1 this model can overcome the problem of pooling . The survey found that, instead of using single CNN to learn pose co-ordiantes, training multiple CNN will find independent body-parts. So, convnets are applied as sliding windows to overlapping regions of the input, the output here is heatmap indicates the exact location of body-part. The same architecture is used for train each body-parts as in Figure 1, which starts with a 64x64 pixel RGB input patch which has been local contrast normalized (LCN) to emphasize geometric discontinuities and improve generalization performance. The LCN layer is comprised of a 9x9 pixel local subtractive normalization, followed by a 9x9 local divisive normalization. The input is then processed by three convolution and subsampling layers, which use rectified linear units (ReLUs) and max-pooling. Each of these output stages is composed of a linear matrixvector multiplication with learned bias, followed by a point-wise non-linearity (ReLU). The output layer has a single logistic unit, representing the probability of the body part being present in that patch. 4 Multi-Resolution Sliding-Window + Overlapping receptive fields According to Tompson et al.[6].For our Part-Detector, we combine an efficient sliding window-based architecture with multiresolution and overlapping receptive fields; the subsequent model is shown in Fig 3. Since the large context (low resolution) convolution bank requires a stride of half pixels in the lower resolution image to produce the same dense output as the sliding window model, the bank must process four down-sampled images, each with a half pixel offset, using shared weight convolutions. These four outputs laterally with the high resolution convolutional features, are processed through a 9x9 convolution stage (with 512 output features) using the same weights as the first fully connected stage and then the outputs of the low resolution bank are added and interleaved with the output of high resolution bank. One of the disadvantage of sliding window model is that detector is translation invariant. According to [6] to overcome the disadvantage we combine sliding window-based architecture with multiresolution and overlapping receptive fields for our part-detector. Figure2 shows that upper convolution bank process 64X64 input path whereas lower bank process 128x128 input patch which is then downsampled to 64x64 .The advantage of using overlapping contexts is that it allows the network to see a larger portion of the input image with 3 Fig. 2. 
Multi-resolution Sliding Window Detector with Overlapping Contexts (model used on FLIC dataset) [6] only a moderate increase in the number of weights. In our practical implementation we use 3 resolution banks. The interleaved model is no longer equivalent to the original sliding-window network, since the lower-resolution convolution features are effectively decimated and replicated leading into the fully connected stage; however, we have found empirically that the performance loss is negligible. This model is then trained, and during training we randomly flip and scale the images to increase generalization performance. 5 Spatial Model According to [2], a higher-level spatial model that encodes simple body-pose priors is used to remove strong outliers from the convnet output. The inter-node connectivity of our simple spatial model is displayed in Figure 3. It consists of a linear chain of kinematic 2D nodes for a single side of the human body. Throughout our experiments we used the left shoulder, elbow and wrist; however, we could have used the right-side joints without loss of generality (since detection of the right body parts simply requires a horizontal mirror of the input image). Fig. 3. Spatial Model Connectivity with Spatial Priors [2] For each node in the chain, our convnet detector generates response-map unary distributions P_fac(x), P_sho(x), P_elb(x), P_wri(x) over the dense pixel positions x, for the face, shoulder, elbow and wrist joints respectively, as in figure 4.1. For the remainder of this section, all distributions are assumed to be a function over the pixel position, and so the x notation will be dropped. The output of our spatial model will produce filtered response maps p̂_fac, p̂_sho, p̂_elb, and p̂_wri [2]. The body-part priors for a pair of joints are calculated by creating a histogram of joint a locations over the training set, given that the adjacent joint b is located at the image center (x ≈ 0). The histograms are then smoothed and normalized using a Gaussian distribution. Fig. 4. Part priors for left body parts [2] 5.1 Graphical Model Fig. 5. Message Passing Between the Face and Shoulder Joints [6] The feed-forward network has difficulty in learning an implicit model of the constraints of the body parts for the full range of body poses. We use a higher-level Spatial-Model to constrain joint inter-connectivity and enforce global pose consistency. Like Jain et al. [3], we define the Spatial-Model as an MRF-like model over the distribution of spatial locations for each body part. However, the biggest drawback of their model is that the body-part priors and the graph structure are explicitly hand crafted. In contrast, we learn the prior model and, implicitly, the structure of the spatial model. We begin by connecting each body part to itself and to every other body part in a pair-wise fashion in the spatial model to create a fully connected graph. The pair-wise potentials in the graph are computed using convolutional priors, which model the conditional distribution of the location of one body part given another. The learned pair-wise distributions are purely uniform when a pair-wise edge should be removed from the graph structure. Figure 5 demonstrates a practical example of how the Spatial-Model can remove an anatomically incorrect strong outlier from the face heat map by incorporating the presence of a strong shoulder detection.
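To give a feel for this filtering step, here is a heavily simplified NumPy/SciPy sketch in which the face response map is re-weighted by the shoulder response map convolved with a conditional prior; the prior kernel, map sizes and function name are our own illustrative choices (the actual model of [6] learns these priors and combines all parts in the log domain).

```python
import numpy as np
from scipy.signal import convolve2d

def filter_face_heatmap(face_map, shoulder_map, prior_face_given_shoulder, eps=1e-6):
    """Re-weight the face unary response map by the spatial support implied by the
    shoulder detections (shoulder map convolved with a P(face | shoulder) prior)."""
    support = convolve2d(shoulder_map, prior_face_given_shoulder, mode="same")
    filtered = face_map * (support + eps)      # combine unary and pair-wise evidence
    return filtered / filtered.sum()           # renormalise to a distribution

# Toy example: a spurious face response far away from the shoulder gets suppressed.
H, W = 64, 64
face = np.zeros((H, W)); face[20, 20] = 0.6; face[50, 55] = 0.4      # second peak is an outlier
shoulder = np.zeros((H, W)); shoulder[30, 22] = 1.0
prior = np.zeros((21, 21)); prior[0:6, 8:13] = 1.0                   # face expected above the shoulder
filtered = filter_face_heatmap(face, shoulder, prior)
print(filtered[20, 20] > filtered[50, 55])                           # True: the outlier is suppressed
```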
For simplicity, only the shoulder and face joints are shown; however, this illustration can be extended to incorporate all body-part pairs. 6 Egocentric Full-Body Motion Capture Rhodin et al. [4] propose an egocentric motion-capture approach that estimates full-body pose from a pair of optical cameras carried by lightweight headgear. We designed a mobile egocentric camera setup to enable human motion capture within a virtually unlimited recording volume. We attach two fisheye cameras rigidly to a helmet or VR headset, such that their field of view captures the user's full body, as in Figure 6. The camera design was chosen such that the wide field of view allows observing interactions in front of and beside the user, irrespective of their global motion and head orientation, and without requiring additional sensors. The stereo setup ensures that most actions are observed by at least one camera, despite substantial self-occlusions of arms, torso and legs in such an egocentric setup. Our egocentric setup separates human motion capture into two sub-problems: 1. local skeleton pose estimation with respect to the camera rig, and 2. global rig pose estimation relative to the environment [4]. The latter is estimated with existing structure-from-motion methods, while skeletal pose estimation is formulated as an optimization problem. Fig. 6. (1) Capturing human motions in outdoor environments of virtually unlimited size, (2) capturing motions in space-constrained environments, e.g. during social interactions, and (3) rendering the reconstruction of one's real body in virtual reality for embodied immersion. [4] 6.1 Egocentric Ray-Casting Model In our egocentric camera rig, the cameras move rigidly with the user's head. In contrast to commonly used skeleton configurations, where the hip is taken as the root joint, our skeleton hierarchy is rooted at the head, see Figure 7. To apply the ray-casting formulation described in the previous section to our egocentric motion-capture rig, with its 180° field of view, we replace the original pinhole camera model with the omnidirectional camera model of Scaramuzza et al. [5]. 6.2 Egocentric Body-Part Detection We combine the generative model-based alignment with evidence from the discriminative joint-location detector of Insafutdinov et al. [1] trained on annotated egocentric fisheye images. The discriminative component strongly improves the quality and stability of reconstructed poses, providing
Later by using Scaramuzza et al.s [5] camera distortion model, we then project the 3D joint locations into the fisheye images recorded by our camera rig. Fig. 8. For database annotation, the skeleton estimated from the multi-view motion capture system (left), is converted from global coordinates (center) into each fisheye cameras coordinate system (right) via the checkerboard.[4] . 7 Result We evaluated our architecture on the FLIC dataset, which is comprised of 5003 still RGB images. The FLIC dataset is very challenging for state-of-the-art pose estimation methodologies because the poses are unconstrained. At test time we run our model on images with only one person. As stated the model is run on 6 different input image scales and we then use the joint location 7 with highest confidence across those scales as the final location [3]. Further improve the results, but the most generic higher level spatial model achieved the best results and we evaluate our architecture on the FLIC and extended-LSP datasets. As in figure 9 both of these challenging datasets with a considerable margin shows the predicted joint locations for a variety of inputs in the FLIC and LSP test-sets. Our network produces convincing results on the FLIC dataset (with low joint position error), however, because our simple Spatial-Model is less effective for a number of the highly articulated poses in the LSP dataset, our detector results in incorrect joint predictions for some images. We believe that increasing the size of the training set will improve performance for these difficult cases [6]. The application was implemented based on CNN architecture for body part detector where the paper focuses on an entirely new way of capturing the full egocentric skeletal body pose, that is decoupled from global pose and rotation relative to the environment. Global pose can be inferred separately by structure-from-motion from the fisheye cameras or is provided by HMD tracking in VR applications. Fisheye cameras keep the whole body in view, but cause distortions reducing the image resolution of distant body parts such as the legs. However, the CNN performs one shot estimation and does not suffer from illumination changes [4] Fig. 9. Predicted Joint Positions, Top Row: FLIC Test-Set, Bottom Row: LSP Test-Set[6] 8 Conclusion We have shown how to improve the state-of-the-art on one of the most complex computer vision tasks: unconstrained human pose estimation. Convnets are impressive low-level feature detectors, which when combined with a global position prior is able to outperform much more complex and popular models. Later we have shown novel ConvNet Part-Detector and an MRF inspired Spatial- 8 Model into a single learning framework significantly outperforms existing architectures on the task of human body pose recognition [2]. Finally based on CNN models we presented EgoCap, the first approach for marker-less egocentric full-body motion capture with a head-mounted fisheye stereo rig. Pose optimization approach that jointly employs two components. first is a new generative pose estimation approach based on a ray-casting image formation model, second component is a new ConvNet-based body-part detector for fisheye cameras that was trained on the first automatically annotated real-image training dataset of egocentric fisheye body poses. It enables motion capture of dense and crowded scenes, and reconstruction of large-scale activities that would not fit into the constrained recording volumes of outside-in motion-capture methods [4]. 
References 1. Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016. 2. Arjun Jain, Jonathan Tompson, Mykhaylo Andriluka, Graham W Taylor, and Christoph Bregler. Learning human pose estimation features with convolutional networks. arXiv preprint arXiv:1312.7302, 2013. 3. Arjun Jain, Jonathan Tompson, Yann LeCun, and Christoph Bregler. Modeep: A deep learning framework using motion features for human pose estimation. In Asian Conference on Computer Vision, pages 302– 315. Springer, 2014. 4. Helge Rhodin, Christian Richardt, Dan Casas, Eldar Insafutdinov, Mohammad Shafiei, Hans-Peter Seidel, Bernt Schiele, and Christian Theobalt. Egocap: egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics (TOG), 35(6):162, 2016. 5. Davide Scaramuzza, Agostino Martinelli, and Roland Siegwart. A toolbox for easily calibrating omnidirectional cameras. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 5695–5701. IEEE, 2006. 6. Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in neural information processing systems, pages 1799–1807, 2014. Survey on Human Pose Estimation in Spatio-Temporal 3D Convolutional Neural Network Sidney Pontes-Filho1 , Ahmed Elhayek2 , and Pramod Murthy3 1 s [email protected] 2 [email protected] 3 [email protected] Abstract. There are many real-world applications that Human pose estimation is essential. Estimating human pose from monocular videos or images is a challenging computer vision problem, mainly because of occlusion, clutter, and missing depth information. In this survey, we focus on approaches that estimate 3D human pose from videos on Human3.6M dataset or recognize human action on TRECVID and KTH dataset using 2D or 3D Convolutional Neural Network. Firstly, we summarize various state-of-the-art approaches predicting 3D human pose or human activity recognition. Then we explain the common characteristics, datasets used, and their differences. In the end, we compare and analyze the performance of those approaches. Keywords: Human pose estimation, 3D Convolutional Neural Network, Motion Capture 1 Introduction Human Pose Estimation is actively researched in computer vision because of various real-world applications, such as human-computer interaction, movies, virtual reality, pose-based games, medical analysis, video surveillance, and so on [7]. There are several methods for estimating human pose, from ones using single cameras to others using extra supporting hardware devices which have additional costs. In computer vision, the research efforts are on the improvement of methods that use monocular videos or images which are widely available. There are several problems due to using that simple and abundant resource (a single camera), such as, occlusion caused by objects and pose variations, difficulty to separate human from the scene, and the missing depth information. One of the several methods is Convolutional Neural Network (CNN) that learns to extract features from images or consecutive video frames and learns the pattern of those extracted features to infer the desired output. 
Acquiring a real-world image implies the loss of one spatial dimension (the depth), keeping only the other two (width and height). One way to compensate for that missing information is to use videos, because they contain the time dimension encoded in consecutive frames; in other words, they take advantage of movement information. Using CNNs, the time dimension is added by simply changing the standard 2D convolution into a 3D convolution, resulting in a spatio-temporal 3D CNN.
2 Survey overview In this survey, we analyze four approaches: [5, 6] use a 2D CNN and [2] a 3D CNN for inferring the 3D joint locations of the human body, while [4] uses a 3D CNN for recognizing human actions. The works [2, 4], which use 3D CNNs, are the ones that benefit from the time dimension, i.e. consecutive video frames. Table 1 shows an overview of these articles; the following subsections give a short summary of each article studied in this survey.
Li and Chan [5] - Method: 2D CNN - Input: image - Output: 3D pose
Li et al. [6] - Method: 2D CNN - Input: image and 3D pose - Output: 3D pose and matching score
Grinciunaite et al. [2] - Method: 3D CNN - Input: 5 consecutive video frames - Output: 3D pose
Ji et al. [4] - Method: 3D CNN - Input: 7 or 9 consecutive video frames - Output: human action class
Table 1. Survey overview.
2.1 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network Li and Chan [5] proposed a deep convolutional neural network to estimate 3D joint locations from monocular images. The network is trained in two parts: the first part is pre-training for body-part detection, and the second part is the final training for 3D human pose estimation. The network therefore performs the multi-task of body-part detection and pose regression.
2.2 Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation Li et al. [6] proposed a network architecture for 3D pose estimation and image-pose pair matching; accordingly, the inputs are an image and a 3D pose. Their network assigns a high score when the image-pose pair matches and a low score otherwise.
2.3 Human Pose Estimation in Space and Time using 3D CNN The approach of Grinciunaite et al. [2] is a convolutional neural network for 3D pose estimation from RGB videos; adding the time dimension through consecutive video frames extends the normal 2D convolution to a 3D one.
2.4 3D Convolutional Neural Networks for Human Action Recognition Ji et al. [4] developed a 3D CNN model for action recognition from RGB videos, using the spatial and temporal dimensions to capture the motion information.
3 Dataset This section reviews the datasets, containing videos and ground truth, used for training and evaluation by the chosen articles. The articles [5, 6, 2] target 3D human pose estimation, hence the ground truth consists of 3D joint locations, which Human3.6M provides. The article [4] targets human action recognition, therefore its ground truth is the action performed by each human subject in the recordings; the TRECVID and KTH datasets are used for that.
3.1 Human3.6M The Human3.6M dataset [3, 1, 5, 6, 2] is, to the best of our knowledge, the largest publicly available motion capture dataset. It contains around 3.6 million frames of high-resolution 50 Hz video, recording 11 human subjects performing 15 different actions, captured from 4 calibrated cameras at different viewpoints. The ground-truth pose for each frame was measured by a MoCap system, and 32 joint coordinates are available.
Fig. 1. Sample images from Human3.6M dataset, showing the variability of subjects, poses and viewing angles. Image taken from [3].
3.2 TRECVID The TRECVID dataset is used by Ji et al. [4] for action recognition. 49 hours of video were recorded at London Gatwick Airport from 5 different cameras with a resolution of 720 x 576 pixels at a frame rate of 25 fps. The 3 action classes of this dataset are CellToEar, ObjectPut, and Pointing.
Fig. 2. Sample images from TRECVID dataset when human detection and tracking is performed. Image taken from [4].
3.3 KTH Ji et al. [4] use another video dataset called KTH. It was introduced by [8] and consists of several short videos, four seconds long on average, recorded over a homogeneous background with a static camera at a 25 fps frame rate. Six actions (Walking, Jogging, Running, Boxing, HandWaving, and HandClapping) are performed by 25 human subjects. These videos were recorded in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors.
Fig. 3. Sample images from KTH dataset. Image taken from [8].
4 Preprocessing In this section, we discuss the preprocessing strategies applied to the datasets before they are used for training. In [5, 6], the original Human3.6M video frames of 1000 x 1000 pixels are cropped to 112 x 112 pixels using the bounding box containing the human subject. Grinciunaite et al. [2] also use Human3.6M, but the image size is changed to 128 x 128 pixels and the 3D joint locations used as ground truth are centred on the pelvic bone location. Their last preprocessing step is global contrast normalization (GCN) of each colour channel, which reduces the variability of the input images that the deep neural network has to deal with during training. In these methods, the 3D pose ground truth of Human3.6M is filtered, keeping 17 core joint locations out of 32. Ji et al. [4] use two datasets, TRECVID and KTH. The TRECVID videos show a real-world environment recorded at an airport, therefore each frame contains multiple humans and the dataset does not provide a bounding box for each of them. Hence, human detection and tracking is performed over the video frames for each human performing an action, and the cropped images are resized to 60 x 40 pixels. For the KTH dataset, Ji et al. [4] reduce the size of the input frames to 80 x 60 pixels.
5 Convolutional Neural Networks A standard deep convolutional neural network contains two trainable sub-networks in sequence: the first one performs feature extraction and the second one approximates the mapping function. The building blocks of the feature-extraction sub-network are, in feed-forward direction, the convolutional layer, the activation function, and the pooling layer; the last two are optional. These blocks can be repeated to increase the complexity of the extracted features. The last sub-network consists of a fully connected layer and computes the output from the features extracted by the previous sub-network. The articles [5, 6] use 2D convolutional layers and [2, 4] use 3D convolutional layers in their CNNs. When the input of the network is an image, the standard convolution is 2D and the kernel is planar, like the input, i.e. the image or the previous layer's output. In 3D convolutions, the input of the network can be a video. The third dimension, in this case, is time, and the convolution is applied across consecutive frames. If the frames are stacked, the 3D kernel has a cuboid shape.
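To make the dimensionality difference concrete, the short PyTorch sketch below applies a 2D convolution to a single image and a 3D convolution to a stack of five frames; the channel counts and kernel sizes are arbitrary example values and do not correspond to any of the surveyed architectures.

```python
import torch
import torch.nn as nn

# 2D convolution: a planar kernel slides over one image,
# input shape (batch, channels, height, width).
image = torch.randn(1, 3, 128, 128)                      # one RGB image
conv2d = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)
print(conv2d(image).shape)                               # torch.Size([1, 16, 128, 128])

# 3D convolution: a cuboid kernel also slides along time,
# input shape (batch, channels, time, height, width).
clip = torch.randn(1, 3, 5, 128, 128)                    # five consecutive RGB frames
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 5, 5), padding=(1, 2, 2))
print(conv3d(clip).shape)                                # torch.Size([1, 16, 5, 128, 128])
```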
The mathematical expressions of the 2D and 3D convolutions are given, respectively, by Equations 1 and 2 (adapted from [2]):
(K * X)_{i,j} = \sum_{m} \sum_{n} X_{i-m,\, j-n} \, K_{m,n}    (1)
(K * X)_{i,j,k} = \sum_{m} \sum_{n} \sum_{l} X_{i-m,\, j-n,\, k-l} \, K_{m,n,l}    (2)
The symbol * denotes a discrete convolution, X is the data, and K is the flipped kernel. The difference between the 2D and 3D convolution is that X and K follow the dimensionality of the equation, i.e. X and K are 2D (m x n dimensions) in a 2D convolution, and they are 3D (m x n x l dimensions) in a 3D convolution. Visually, the comparison between 2D and 3D convolution is shown in Figure 4.
Fig. 4. Comparison of 2D (a) and 3D (b) convolutions. In (b) the connections of the same colour share their weights and the size of the convolutional kernel in the temporal dimension is 3. Image taken from [4].
6 Network architecture The network architecture proposed by Li and Chan [5] for 3D pose estimation using 2D convolutions is illustrated in Figure 5. It contains, in total, 9 trainable layers: 3 convolutional layers and 6 fully connected layers, split into two independent networks of 3 layers each, one performing the detection task and the other the regression task. The detection network is trained first, so the convolutional layers learn features for detecting the joint locations; after that, the connection between the last convolutional layer and the detection network is blocked, and the regression network is trained using the convolutional layers pre-trained on the detection task. In this way, the network performs multi-task training for detection and regression.
Fig. 5. Network architecture of Li and Chan [5]. N stands for the number of joints (N = 17 due to preprocessing). Image taken from [5].
Fig. 6. Network architecture of Li et al. [6]. There are max-pooling layers after all convolutional layers; they are not drawn to reduce clutter. Image taken from [6].
Fig. 7. Network architecture of Grinciunaite et al. [2]. C means convolutional layer and P is the pooling layer. Kernel sizes are specified in parentheses. The second row presents the size of the corresponding layer's output. The images show slices of some 3D activation maps per layer. Image taken from [2].
The activation function of conv1, conv2, and the first two fully connected layers, for both regression and detection, is the rectified linear unit (ReLU). The activation function of the last regression layer is tanh. A local response normalization layer is added after conv2 to make the network robust to pixel intensity.
The network of Li et al. [6] for 3D pose estimation and image-pose pair matching contains two sub-networks. The feature-extraction sub-network consists of 2D convolutional layers with the raw image as input. The other sub-network has fully connected layers for image-pose embedding, which map the extracted image features and the 3D pose input into a common embedding space. The dot product of the image and pose embeddings yields the image-pose pair matching score. The image embedding also assists the scoring task in predicting the 3D pose. The network architecture is presented in Figure 6.
Grinciunaite et al. [2] proposed a 3D CNN architecture for 3D pose estimation that is shown in Figure 7. The activation function is PReLU and it is used after all layers. GCN is the global contrast normalization explained in Section 4.
Fig. 8. Network architecture of Ji et al. [4]. C means convolutional layer and P is the pooling layer. Kernel sizes are specified in parentheses. The second row presents the size of the corresponding layer's output. The images show slices of some 3D activation maps per layer. Image taken from [4].
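As a rough, self-contained illustration of a spatio-temporal architecture of this kind, the PyTorch sketch below stacks a few 3D convolution and pooling blocks and regresses 17 x 3 joint coordinates from 5 consecutive frames; the layer counts, channel widths and kernel sizes are invented for the example and do not reproduce the exact architecture of Grinciunaite et al. [2] shown in Figure 7.

```python
import torch
import torch.nn as nn

class Toy3DPoseCNN(nn.Module):
    """Toy spatio-temporal CNN: 5 RGB frames in, 17 x 3 joint coordinates out."""

    def __init__(self, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.PReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # pool spatially only
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.PReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.PReLU(),
            nn.AdaptiveAvgPool3d(1),                      # collapse time and space
        )
        self.regressor = nn.Linear(128, num_joints * 3)   # 3D coordinates per joint

    def forward(self, clip):                              # clip: (batch, 3, 5, 128, 128)
        feat = self.features(clip).flatten(1)
        return self.regressor(feat).view(-1, self.num_joints, 3)

poses = Toy3DPoseCNN()(torch.randn(2, 3, 5, 128, 128))
print(poses.shape)                                        # torch.Size([2, 17, 3])
```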
A 3D CNN model for human action recognition on the TRECVID dataset is presented by Ji et al. [4] and illustrated in Figure 8. A similar model is used on KTH; since the input video frames are bigger there (80 x 60 in KTH versus 60 x 40 in TRECVID), the kernel sizes are modified and the input consists of 9 consecutive video frames. Both models map the input frames into a 128-dimensional feature vector before the fully connected layer. There are 3 action classes in TRECVID and 6 in KTH, so the number of units in the final layer of the fully connected network corresponds, respectively, to the number of classes. The architecture contains 1 hardwired layer, 3 convolutional layers, 2 subsampling layers, and 1 fully connected layer. The hardwired layer is a non-trainable convolutional layer consisting of 5 pre-made kernels. Its result is 33 feature maps: 7 are the gray values of all input frames, 14 correspond to gradient-x and gradient-y applied to each frame, and 12 are the optflow-x and optflow-y fields computed from adjacent input frames.
7 Evaluation The articles [5, 6, 2] have the goal of fitting their estimated 3D joint locations to the 3D pose ground truth. Their authors evaluate the systems using the mean per joint position error (MPJPE):
MPJPE = \frac{1}{N_J} \sum_{i=1}^{N_J} \| J_i - \hat{J}_i \|_2    (3)
Equation 3 calculates the mean distance between the estimated joint locations and the ground truth. N_J is the number of joints, J_i stands for the estimated location of joint i and \hat{J}_i for its ground-truth location. They use the Human3.6M dataset for evaluation. The dataset is split into training, validation and test sets by human subject. Li and Chan [5] use 5 subjects (S1, S5, S6, S7, S8) for training and validation and 2 subjects (S9, S11) for testing, while [6, 2] include all of them for training and validation and test on the (hidden) official test set. The results are reported per action; a further issue with [5] is that it reports only 6 actions. Keeping in mind these issues with [5], and that [6] excluded the action Directions due to video corruption, we present the results of the state of the art on the Human3.6M test set [3] and of the studied articles [5, 6, 2] in Table 2.
Action Directions Discussion Eating Greeting Phoning Posing Purchases Sitting SittingDown Smoking TakingPhoto Waiting Walking WalkingDog WalkingTogether Avg
LinKDE(BS) [3] DconvMP-HML [5] StructNet-Avg(500) [6] 3DCNN [2]
124 91 117 148 92 89 93 104 76 94 138 127 98 102 111 92 105 145 106 99 136 140 112 139 135 151 203 260 239 118 98 109 197 189 170 151 146 105 106 115 77 99 101 166 146 138 141 153 109 106 140 131 122 119
Table 2. MPJPE results of the best method of each article. Values are in millimetres (lower is better).
The actions 'Sitting' and 'SittingDown' were handled poorly by the 3DCNN because there is no significant information over time, i.e. movement. The background does not play an important role in making the separation of the subject from the scene difficult, because the Human3.6M videos were recorded in the same indoor environment. Only one stool is used in the action 'Sitting', hence occlusion caused by objects is (almost) not a challenge in this dataset. The article of Ji et al. [4] is the only one in this survey whose goal is to recognize human actions. They evaluate their system by comparing precision, recall and accuracy.
They reported that their 3D CNN model outperforms other methods on the TRECVID dataset. On the KTH dataset, the model presents similar performance compared to other state-of-the-art approaches. 8 Conclusion The application of 3D CNNs to solve spatio-temporal problems shows a significant gain in performance compared to 2D CNN and other methods cited in the selected publications [5, 6, 2, 4]. For Human Pose Estimation, the best result is, in average, the 3D CNN model proposed by Grinciunaite et al. [2] and with real-time processing speeds. However, time information needs to be significant as discussed in Section 7. A solution for lack of time information is using more frames as input, or even skipping some frames, then giving a wider range in time and increasing the chance of more significant data. In case of Human Action Recognition, the 3D CNN model proposed by Ji et al. [4] achieve an outstanding performance in the real-world environment TRECVID dataset. Thus, 3D CNN are capable of handling well the difficulties of the real-world. For the future work, we will investigate methods to improve the results of MPJPE presented in this survey, especially methods that handle spatio-temporal dimensions, such as 3D CNNs or recurrent neural network. References 1. Fuxin Li Cristian Sminchisescu Catalin Ionescu. Latent Structured Models for Human Pose Estimation. In International Conference on Computer Vision, 2011. 2. Agne Grinciunaite, Amogh Gudi, Emrah Tasli, and Marten den Uyl. Human Pose Estimation in Space and Time using 3D CNN, October 2016. 3. Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, jul 2014. 4. Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D Convolutional Neural Networks for Human Action Recognition. In Johannes F¨ urnkranz and Thorsten Joachims, editors, ICML, pages 495–502. Omnipress, 2010. 5. Sijin Li and Antoni B. Chan. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network. In Asian Conference on Computer Vision (ACCV), 2014. 6. Sijin Li, Weichen Zhang, and Antoni B. Chan. Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation. CoRR, abs/1508.06708, 2015. 7. Zhao Liu, Jianke Zhu, Jiajun Bu, and Chun Chen. A survey of human pose estimation: The body parts parsing based methods. Journal of Visual Communication and Image Representation, 32:10–19, 2015. 8. Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local SVM approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004. Survey on Human Body Tracking Sreekanth Kata1 and Kiran Varanasi2 1 Dept. of Computer Science, University of Kaiserslautern s [email protected] 2 Augmented Vision, DFKI GmbH, Kaiserslautern [email protected] Abstract. In this survey, the historic advancement in the optimization of human body tracking and hand tracking are presented starting from very basic vision-based tracking techniques only using monocular RGB cameras with restricted degrees of freedom(DoF) to more realtime advanced model approaches. The real time joint tracking of hand while manipulation and interaction with an object is also shown. 
At the end some methods to describe how complex scenes can be reconstructed from multiple depth cameras to update the tracking of several real time, irregular topology changes are defined. Each tracking method is compared distinctly in independent sections. Earlier the hand trackers needed high computational GPU bandwidth that proved costly, not enabling them for daily consumer applications. The solution for more robust, efficient and flexible tracking is made possible by the advances of 3D imaging technology, notably depth cameras. Keywords: RGB-D, Pose, Occlusion handling, 3D Mesh deformation 1 Introduction Tracking means following one or multiple objects of interest in the scene, continuously providing their position. The aim is to estimate parameters of the dynamic system, e.g., 3D data points, object position, human joint angles, etc. Virtual reality manipulation with hands is useful for users to interact faster and with greater ease. The gap between the model fit and the noisy captured data Fig. 1. Unskilled designers are able to handle a variety of interactive scenarios like direct physical interaction, pointing, and analog control. Image taken from [12]. can be reduced by the techniques presented here with increased level of accuracy. 3D points are collected from the depth sensor camera and a model fit is produced by non-linear optimization of gradients that solve deformations of a model to match the input depth image. Precise hand tracking is very difficult to achieve due to its invariable structure, fast motions, incorrect finger tips detection and occlusions. Fingers hidden behind each other can cause self- intersection and recovering from tracking errors can be difficult. Hence, certain kinematic constraints 2 are used while hand tracking to reason jointly about disambiguating the fingers locally [12]. After getting a kinematic model of an articulated hand to a point cloud, there are two methods of hand tracking. First, is by efficient articulated iterative closest point algorithm (ICP), which finds the rigid transformation that aligns two point clouds. Second, by correspondences, which involves a search over the entire surface domain [12],[7],[3]. The optimization using Levenberg Marquardt(LM) algorithm is done dynamically over both the model pose and correspondences, unlike only over correspondences and transformations in ICP. Hand tracking while manipulation of an object in 3D space can vary based on temporal sequence, so this variation can be overcomed by the methods stated in section 3 below. For tracking a body the skeletal structure of a human body is initially constructed by defining the kinematic chains of bones connected by the joints. The kinematics equations, namely forward kinematics (FK) and inverse kinematics (IK) of the body define the relationship between the joint angles and its pose. FK finds how the positions of particular parts of a model are calculated at a specified time from the position and orientation, while IK computes the joint angles for a desired pose of the articulated body. 2 Hand Tracking The goal of hand tracking is to develop a pose vector (θ) that accurately describes the detailed hand movement [12],[11],[10],[6],[3]. Two categories of hand tracking approaches used are discriminative and generative hand tracking. 
While the first one directly estimates the pose by extracting image features or by using a multi-layered random forest to learn from a large database of poses, the latter recovers the pose across a temporal sequence of images with an explicit hand model [12],[11]. However, both can suffer either from inaccuracy or from dependency on previous frames for error recovery. Finally, hybrid methods, which combine the discriminative and generative categories together with per-frame re-initialization (Ballan et al. 2012), can recover from errors and increase robustness; combining generative and discriminative tracking thus also enables recovery from some tracking failures [12],[11],[10].
Fig. 2. i) Model-fitting examples converging to an energy minimum, showing the large range of convergence using the depth signal from a single commodity depth camera (left). ii) The detection-guided tracking method uses depth data and a hand model represented as a mixture of 2.5D Gaussians for pose optimization (right). Images are taken from [12] and [10] respectively.
The smooth-model technique of Taylor et al. 2016, using a single commodity RGB-D camera, makes it possible for the Levenberg-Marquardt (LM) algorithm to work with a reduced need for computational power. First, continuous deformation of the 3D meshes with Laplacian editing is performed by manipulating and modifying a surface while preserving its geometric details [7]. The deformation is based on the Laplacian of the mesh, encoding each vertex relative to its neighbourhood, called the region of interest (ROI). Next, the error between the 3D data points acquired from the depth camera and the parameterized surface is minimized by fitting a smooth model with the non-linear LM optimizer. In this paper, a stream of depth images I_t is first acquired from the sensor and then pre-processed into N 3D points X_t^n, 3D normals n_t^n and detected fingertips F_t [12]. The hand-tracking goal is to find pose parameters θ_t at time t such that the 3D surface S(θ_t) is a good explanation of the image data in I_t and previous frames, i.e. the energy function has to be minimized, and the minimizer θ̂_t is output as the system's estimate of the pose at time t:
\hat{\theta}_t = \operatorname{argmin}_{\theta} E_t(\theta)    (1)
The overall energy is defined using a weighted sum over the non-data terms, Terms = {bg, tips, pose, limit, int, temp}, which are listed in Table 1.
Table 1. Overview of the non-data terms (for understanding purposes; taken from [12]).
bg - Model points must be within the background
pose - The pose θ can only be a human hand pose
tips - Each detected fingertip in F_t has a close model fingertip
limit - The pose has to obey joint-angle limits
int - The hand model should not self-intersect
temp - Poses in the temporal sequence have to be close to each other
In the data term E_data(θ), the distance from each data point X_n to the model S(u; θ) at its correspondence u, as well as the difference between the surface orientation S^⊥(u; θ) and the associated data normal n_n, is penalized:
E_{data}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \min_{u \in \Omega} \left( \frac{\| S(u;\theta) - X_n \|^2}{\sigma_x^2} + \frac{\| S^{\perp}(u;\theta) - n_n \|^2}{\sigma_n^2} \right)    (2)
where σ_x^2 and σ_n^2 are estimates of the noise variance on points and normals, Ω is the surface domain, and the λ_τ are weighting parameters. The normal term allows the energy to use the surface orientation S^⊥(u; θ) to select better locations on the model even when they are far from the data [12].
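A minimal numpy sketch of evaluating a data term of this form is given below; a brute-force closest-candidate search over densely sampled model points stands in for the continuous minimization over the surface domain Ω, and the noise estimates are placeholder values rather than those of Taylor et al. 2016.

```python
import numpy as np

def data_term(model_pts, model_normals, data_pts, data_normals,
              sigma_x=5e-3, sigma_n=0.3):
    """Evaluate a point-plus-normal data term in the spirit of Equation 2.

    model_pts, model_normals: dense samples of S(u; theta) and S_perp(u; theta),
                              approximating the continuous surface domain Omega.
    data_pts, data_normals:   the N preprocessed 3D points X_n and normals n_n.
    """
    energy = 0.0
    for X, n in zip(data_pts, data_normals):
        # Residual of every candidate correspondence u on the sampled model surface.
        residuals = (np.sum((model_pts - X) ** 2, axis=1) / sigma_x ** 2 +
                     np.sum((model_normals - n) ** 2, axis=1) / sigma_n ** 2)
        energy += residuals.min()     # min over u in Omega (approximated)
    return energy / len(data_pts)
```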
Finally, to avoid the iterative alternation between ICP-style correspondence updates and minimizing E_data, a lifted energy is derived using the lifting strategy of Taylor et al. 2014:
E'(\theta, u) = E'_{data}(\theta, u) + \sum_{\tau \in Terms} \lambda_{\tau} E_{\tau}(\theta)    (3)
where E'_data(θ, u) is the mean of the per-point residuals ε(u_n, X_n, n_n). The lifted energy E'(θ, u) is easily differentiable, so it can be minimized with gradient-based methods over pose and correspondences jointly. Equations 1, 2 and 3 above are taken from Taylor et al. 2016. Another approach, shown in Fig. 2 and proposed by Sridhar et al., also tracks a user's hand with highly accurate results; nonetheless, it has a high computational cost and a heavyweight multi-camera setup, making it impractical for daily consumer interactive scenarios [12],[10]. In part i) (left), the model with the global minimum in the top row has a wide basin of convergence to the correct pose. The LM algorithm used by Taylor et al. 2016 updates both correspondences and pose simultaneously, unlike ICP, which updates only correspondences; therefore, fewer iterations are enough for accurate tracking, and a correct fit can already be recognized after the first iteration. In part ii) (right), multiple cameras are used by Sridhar et al. 2015: 5 RGB cameras and a TOF sensor. Their approach consists of two parts: a detection step, where pixels are classified into hand parts by a randomized decision forest, and an optimization step, where the detected part labels and a Gaussian mixture representation of the depth are combined in an objective function to estimate the pose that best fits the depth [10].
3 Hand Tracking with Manipulation of an Object Tracking methods for the hand alone and for an object alone are by now well developed. However, these methods only work well when used in isolation, tracking a hand or an object separately. When tracking a hand while it manipulates an object, they fail, as it becomes very cumbersome to distinguish the hand from the object, leading to instability and tracking loss. Disambiguating hand and object faces several challenges due to difficult segmentation, occlusions and higher dimensionality. In real-world scenarios, hand-object manipulation can vary in pressure, temperature and rigidity. Moreover, objects have different weights, resulting in different hand motion types. The uneven motions form irregular poses, with unknown objects exhibiting uncommon spatial variations and differences in rigidity [11],[9],[6].
The procedure of using RGB-D cameras to detect dense depth, color estimate measurements is also proposed [6],[3],[2],[1], which either need expensive steps for segmentation, optimization or has limited range of poses, limited depth range. Multiple 2D color images and model-to-image alignment are used with a 2D error metric for pose optimization. Instead the use of 2.5D formulation based on model alignment to a single depth image proved to provide better results. This is the reason, researchers used 2.5D articulated Gaussian mixture alignment. 5 A more improved solution in this area of research is one where the use of only a single RGB-D Camera is needed. The configuration updates made on previous methods makes use of a fully articulated 3D Gaussian Mixture Model(GMM) and is very effective in providing discontinuous pose estimations in real-time [11],[6], which can be tracked easily. The benefit of this technique is with the 3D articulated GMM method it makes use of regularizes to handle occlusion and enforce handobject contact points [11]. The novel regularization terms used by Sridhar et al. are two objective functions. First energy(Ealign ) measures the alignment with the input, while the second energy (Elabel ) incorporates the classification results. The benefit of using two simultaneously optimized regularizers is later only the best one of the both is used for better pose estimation and can recover from failures. With this approach two Gaussian mixtures can be aligned for successful 3D point registration to track hand and object simultaneously with accurate results. This provides solutions that need less number of iterations and low computations of tracking algorithms by capturing the frame to frame updates of the movement. Description is given in the Fig. 3, showing how the optimization is done using a multi-layer random forest hand part classifier, to improve robustness and error recovery from the tracker [11]. Meaning that a multi classification layer is used here to classify the hand and object distinctly. Later, Similar Iterative closest points (ICP) are clustered together and then the optimization is performed using GMM to produce a stable tracking pose. This method is proven to significantly reduce the strong occlusions, either by the object or by the interacting fingers of the hand. Fast error recovery from tracking loss is possible as the tracking initiates updates at each iteration. The semantic alignment, occlusion handling, and contact points help improve robustness of tracking results and recovery from failures [11]. 4 Body Tracking Human body tracking is being used in many applications such as surveillance, motion capture, and human-computer interface [7]. Yet there are multiple challenges to overcome like high dimensionality of the state space, variable appearance of human forms, and complex dynamics of the body. An articulated human body can be represented as a kinematic system consisting of a set of rigid objects, called bones, connected together by joints. The joints have a single DoF and can only be either rotated or translated. Kinematic joint constraints, joint and point weights constraints can be included by using an IK approach for solving an articulated ICP problem [10]. Current research of body tracking for motion capture is based on vision techniques that uses stereo depth sensor Fig. 4. Sensor head (left), 2D image (middle left), disparity image (middle right), 3D image (right). Image taken from [7]. 
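One common way to quantify how well two Gaussian mixtures are aligned is the integral of their product, which has a closed form for isotropic Gaussians; the sketch below evaluates such an overlap score between a model mixture and a data mixture. It is only a generic illustration of Gaussian-mixture alignment, not the specific alignment energy used by Sridhar et al. [11].

```python
import numpy as np

def gmm_overlap(means_a, sigmas_a, weights_a, means_b, sigmas_b, weights_b):
    """Overlap (integral of the product) of two isotropic 3D Gaussian mixtures.

    A larger value means the two mixtures, e.g. an articulated hand/object model
    and a mixture fitted to the depth input, are better aligned.
    """
    overlap = 0.0
    for mu_a, s_a, w_a in zip(means_a, sigmas_a, weights_a):
        for mu_b, s_b, w_b in zip(means_b, sigmas_b, weights_b):
            var = s_a ** 2 + s_b ** 2
            dist2 = np.sum((np.asarray(mu_a) - np.asarray(mu_b)) ** 2)
            # Closed-form integral of the product of two isotropic Gaussians.
            overlap += w_a * w_b * np.exp(-0.5 * dist2 / var) / (2.0 * np.pi * var) ** 1.5
    return overlap
```

An alignment score of this kind can then be maximized over the articulated pose parameters, which is the general role an alignment energy plays in such tracking pipelines.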
cameras, time of flight (TOF) cameras and kinetic sensors to detect body shapes. Kiran et al. 2012 proposed a marker-less method for full body human performance capture by analyzing shading information from a sequence of multi-view images, which are recorded under irregular lighting conditions 6 and occlusions in arbitrary environments. Finally, the skeletal motion estimates and finely detailed time-varying 3D surface geometry for human performances are reconstructed [13],[7]. This method benefits especially in the use of performance capture in practical applications such as outdoor movie sets or sport stadiums [13]. For 3D tracking of human body movements based on a 3D body model and the ICP algorithm, Knoop et al. 2006 proposed a tracking system called VooDoo[7]. In Fig. 4, The body parts are modelled as cylinders having specific DOF for movement. Then the 2D image is transformed into a depth map via disparity images by stereo reconstruction to generate a full body 3D model by matching the closest corresponding points with methods like ICP in an iterative manner to get the optimal tracking results. The total body amounts to a tree hierarchy structure into total 10 parts from torso as the root part to the head, two for each arm and two for each leg [7]. This is similar to the hand graph in the hand tracking interacting with the object, where the palm is the root and the fingertips acts as the leaves. To avoid noises of the sensor input, the minimized sum of squares between the data points of the model and the sensor data is computed. Another method uses monocular image sequences like color and edges for tracking people. In some differential methods where the image gradient is linearly related to the model movement, dense optical flow is used. Multiple hypothesis frameworks are the best option available for most implementations while tracking humans, because many ambiguities can be encountered in using cluttered monocular image sequences. Presently one of the advanced method is based on Stereo image techniques, though rarely implemented than monocular image methods, it generates less ambiguity. Example is a multi-view technique which uses physical forces that are applied to each rigid part of a kinematic 3D model of the tracked object [4]. These forces guide the minimization of the fitting error between model and data. This approach uses a recursive algorithm to solve the dynamical equations. 5 Capture of Challenging Scenes Complex performance scenes can be captured with multiple depth cameras and reconstructed in real-time to produce a 3D model. This technique is used in several applications in many fields starting from movie modeling, sports broadcasting to aerospace industries. Advanced research methods were studied to tackle fast frame to frame motions, complex topology changes of the scenes and noisy input data [5]. Michael Zollh¨ ofer et al. 2014, initially defined a statistical captured reference model, i.e., template for model fitting. This template is used for the reconstruction of nonrigid scenes in realtime before tracking and it can handle the fast and large deformations [14]. Newcombe et al. 2015 implemented a reference volume model reconstruction. DynamicFusion approach is used by this model which is capable of real-time dense reconstruction in dynamic scenes [8]. But both these two Fig. 5. Real-time results captured of Fusion4D, showing a variety of challenging sequences. Image taken from [5]. 
7 models only use a single depth camera, and are not real-time leading to inconsistency for real world data which has large topological changes or fast motions. In the paper ”Fusion 4D” by Dou Mingsong et al. 2016, a real-time tracking dynamic fusion system of multiview 4D performance capture for reconstructing high quality 3D models using multiple depth sensor cameras (e.g., 8 cameras) is studied. The Fusion 4D method is extension of the reference model used in Newcombe et al. 2015 and is continuously updated, producing robust 3D model reconstruction even under complex topological changes like removing the scarf and rotating it very fast and irregularly, reconstruction of the skirt details precisely, high frame rate motions. As there is no initial prior on captured scene to determine the kinematic or skeletal model of the body, it can be applied to any unknown origin scenes like the example of the dog interacting with the owner in the Fig. 5, The input is real time captured RGB frames and associated segmentation mask from a performance capture rig [11]. By using a learning based technique correspondences are estimated that are used to initialize a non-rigid matching phase that aligns the volumetric model to the input point cloud with embedded deformation [11]. Both the model and data are then fused and blended volumetrically to give the result. The 3D model obtained is consistent with respect to time and space in real-time to get the shape and motion, but small holes can still be visible in middle as the movements between frames are faster to capture. Learning-based RGB-D correspondence matching with a parallelized registration framework is used to tackle the fast motions. The topological changes are handled by the concept of key volumes, which is a type of voxel grid that has a reference model with unexpected changes across key volumes but smooth nonrigid key motions within the key volume sequence [11]. Nonrigid alignment error with estimated correspondence fields can be helpful for the fusion of the model into the data volume. 6 Evaluation Of Tracking The hand tracking gesture of Pinching is compared between two methods. The Leap Motion which produces sliding fingertips that makes Pinching gestures detection difficult from the tracked skeleton. In Contrast, the Sridhar et al. 2015 method reproduces Pinching faithfully [10],[2]. Two other tracking methods are also compared using the DEXTER dataset. Sridhar et al. resulted in lowest average error of 19.6mm using only a single depth camera. The results are much improved, when the same dataset was implemented with the LM algorithm by Taylor et al. [12],[10]. In hand-object manipulation, a new benchmark dataset is introduced by Sridhar et al. 2016 with ground truth for fingertip positions and object pose. This dataset is better than other two datasets DEXTER and ICJV, since they are missing object annotations making it impossible for object pose evaluation. Two sequences Rotate and Grasp2 along with varying object shapes, sizes, colors are compared here. The realtime tracking results of different users, hand dimensions compared even when multiple hands are in view produced good results. In challenging scenes reconstruction, the Fusion 4D method which updates continously supporing alignment errors and large topology changes in real-time is mainly compared with other state-of˙ the-art methods namely Newcombe et al. 
and Zollh¨ofer et al.The first method is a reference volume reconstructed non-rigidly and the second one as stated before uses a defined template before tracking. However when compared with real world data, they both become out-of-date. 7 Conclusion In these research papers, several contributions made by many researches to optimize the tracking of hand, body and hand-object manipulation are presented clearly. Initially a model to estimate 8 the parameters is designed. Then iteratively pose fit methods are applied to handle several tracking problems like occlusion, tracking error recovery failures, response and lagging failures. The multi depth camera capture scenes are reconstructed using only the 3D point cloud for reconstruction of comparatively perfect 3D model with newly proposed methods. State-of-the-art approaches are demonstrated to understand the complexity associated in detail with the hand-object manipulation tracking with the sensor data in real-time. The body tracking methods listed here handle detection and pose estimation with accurate results also under unpredictable lighting conditions. Nonetheless, there is still much further research in this field for making it a daily consumer use. It is a belief that the methods mentioned in this survey can be considered as the basis. References 1. Ishrat Badami, J¨ org St¨ uckler, and Sven Behnke. Depth-enhanced hough forests for object-class detection and continuous pose estimation. In Workshop on Semantic Perception, Mapping and Exploration (SPME), 2013. 2. Michael Buckwald. 2017. Leap Motion. https://www.leapmotion.com, Accessed on: 23.01.2017. 3. Cedric Cagniart, Edmond Boyer, and Slobodan Ilic. Free-form mesh tracking: a patch-based approach. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1339–1346. IEEE, 2010. 4. Edilson De Aguiar, Christian Theobalt, Carsten Stoll, and Hans-Peter Seidel. Marker-less deformable mesh tracking for human shape and motion capture. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007. 5. Mingsong Dou, Sameh Khamis, Yury Degtyarev, Philip Davidson, Sean Ryan Fanello, Adarsh Kowdle, Sergio Orts Escolano, Christoph Rhemann, David Kim, Jonathan Taylor, et al. Fusion4d: Real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG), 35(4):114, 2016. 6. Henning Hamer, Konrad Schindler, Esther Koller-Meier, and Luc Van Gool. Tracking a hand manipulating an object. In Computer Vision, 2009 IEEE 12th International Conference On, pages 1475–1482. IEEE, 2009. 7. Steffen Knoop, Stefan Vacek, and R¨ udiger Dillmann. Sensor fusion for 3d human body tracking with an articulated 3d body model. In Robotics and Automation, 2006. ICRA 2006. Proceedings 2006 IEEE International Conference on, pages 1686–1691. IEEE, 2006. 8. Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 343–352, 2015. 9. Paschalis Panteleris, Nikolaos Kyriazis, and Antonis A Argyros. 3d tracking of human hands in interaction with unknown objects. In BMVC, pages 123–1, 2015. 10. Srinath Sridhar, Franziska Mueller, Antti Oulasvirta, and Christian Theobalt. Fast and robust hand tracking using detection-guided optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2015. 11. 
Srinath Sridhar, Franziska Mueller, Michael Zollh¨ ofer, Dan Casas, Antti Oulasvirta, and Christian Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In European Conference on Computer Vision, pages 294–310. Springer, 2016. 12. Jonathan Taylor, Lucas Bordeaux, Thomas Cashman, Bob Corish, Cem Keskin, Toby Sharp, Eduardo Soto, David Sweeney, Julien Valentin, Benjamin Luff, et al. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics (TOG), 35(4):143, 2016. 13. Chenglei Wu, Kiran Varanasi, and Christian Theobalt. Full body performance capture under uncontrolled and varying illumination: A shading-based approach. In European Conference on Computer Vision, pages 757–770. Springer, 2012. 14. Michael Zollh¨ ofer, Matthias Nießner, Shahram Izadi, Christoph Rehmann, Christopher Zach, Matthew Fisher, Chenglei Wu, Andrew Fitzgibbon, Charles Loop, Christian Theobalt, et al. Real-time non-rigid reconstruction using an rgb-d camera. ACM Transactions on Graphics (TOG), 33(4):156, 2014. Survey on Dense 3D Reconstruction using Spherical Images Ngoc Thy My Nguyen1 and Christiano Gava2 1 2 [email protected] [email protected] Abstract. This paper discusses the state-of-the-art methods for 3D reconstruction focusing mainly on spherical images. We will categorize those methods based on the chosen baseline, the reconstruction techniques and their 3D scene representations. The baseline is divided into short and wide baseline in concern with the distance between two camera centers and the size of the reconstructed structures. The reconstruction techniques addressed here are divided into four categories: Variational approach, Patch-based Multi-view Stereopsis, Optical flow- or Light field-based methods. Based on the final 3D model, we can have one of the four common 3D representations: depth-map, point-cloud, voxel-grid or 3D mesh. We will discuss how these approaches could be applied to spherical images, their advantages and disadvantages. Keywords: 3D reconstruction, spherical images, baseline, variational method, optical flow, epipolar plane image, light field 1 Introduction 3D reconstruction has been widely developed over years and has an important role in many areas or researches such as archeology, film production, and infrastructures maintenance, etc. Due to the popularity of perspective cameras, the reconstruction of scenes or structures based on perspective images has been thoroughly focused and has important achievements. However, out-of-lab reconstruction using perspective images faces many challenges: for large-scale structure a huge number of images need to be acquired because of the limited field of view (FOV) of perspective camera, and lighting condition can produce non-uniform illumination between images of a view. Spherical Images In contrast to perspective image, spherical camera’s FOV is not limited. They can capture the full scene in a 360o horizontal and 180o vertical from a single position (see Fig. 1-(a,b)). That significantly reduces the number of images needed to reconstruct a large scene compared to that of perspective ones. Fig. 1-(a) shows a local spherical camera coordinate system where θ is in the range of [0:2π] corresponding to the horizontal axis in (b), and φ is from 0 to π corresponding to the vertical axis in Fig. 1-(b). This coordinate system can be seen as latitude-longitude geometry. 
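To make the latitude-longitude parameterization concrete, the small sketch below converts an equirectangular pixel into a unit ray on the sphere, with θ in [0, 2π] along the horizontal axis and φ in [0, π] along the vertical axis as in Fig. 1; the image size and the axis convention are assumptions made for the example.

```python
import numpy as np

def pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit direction on the sphere.

    theta in [0, 2*pi) runs along the horizontal image axis,
    phi   in [0, pi]   runs along the vertical image axis (latitude-longitude geometry).
    """
    theta = 2.0 * np.pi * (u + 0.5) / width
    phi = np.pi * (v + 0.5) / height
    # Assumed convention: phi is measured from the +z axis of the camera frame.
    return np.array([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta),
                     np.cos(phi)])

ray = pixel_to_ray(1024, 512, width=2048, height=1024)
print(ray, np.linalg.norm(ray))    # a unit-length bearing vector
```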
With the advantage of its large FOV, spherical image is a seamless representation of a full surrounding scene, which enables measurement of the entire structure at once. Another advantage of using spherical images for 3D reconstruction is that a spherical field has less ambiguities in rotation and translation compared to that of perspective one [2, 12] which is good for estimating the camera poses in Structure from Motion (SFM). Moreover, spherical image has a unique parallax property which allows to make use of global information [16]. Taking these advantages into account, methods have been developed to reconstruct 3D scene using spherical images, most of those are inspired from methods originally devised for perspective images. In this report, we discuss the state-of-the-art methods for 3D reconstructions using spherical images. Our discussion will go in the direction of categorizing those methods according to the most relevant approaches. Starting from the cameras setup, we will categorize them based on the baseline 2 Fig. 1: (a) Local camera coordinate system of spherical image; (b) An example of full panoramic spherical images, also referred to as equirectangular image, (c) Baseline setup with wide (left) and short (right) baseline in section 2. Section 3 will classify these methods according to the reconstruction techniques. Section 4 will focus on the scene representations of the reconstructed results. Next in section 5 we will have a short discussion about the advantages and disadvantages of these methods, and the correlation between the categorized approaches. And finally we conclude our work in section 6. 2 Baseline The baseline setup regards the relative distance between the cameras. In concern with the distance between two camera centers and the size of the reconstructed structures, the baseline can be short or wide and each has their own advantage and disadvantage. As can be seen in Fig. 1-(c), wide baseline has the advantage of lower depth uncertainties compared to short baseline [7]. However, the matching problem becomes hard to solve, therefore the possibility to have a false match is high. Pagani et al. [14] used several spherical images at different locations when reconstructing cultural heritage. They applied SFM with a fixed scale, by setting the distance between two first cameras to the unit distance. The authors proposed a generic error function for the epipolar geometry which could also be applied in perspective cases. After computing the epipolar geometry for the two first cameras, they triangulated all matches (which were computed using PC-SIFT [3]) to get the 3D coordinates. Each new camera will then be added to the set for getting new possible set of 3D coordinates for a sparse reconstruction of the scene. The sparse reconstruction will then serve as the input for dense reconstruction later (see Section 3.2). The same technique has been used in [15]. 2.1 Short baseline setup In short baseline, the matching problem is relatively easy because the correspondences move in the neighborhood. However, depth uncertainties become high. In perspective case, Newcombe et al. [13] used hundreds of images extracted from a video stream to track the camera poses and reconstruct the scene in real-time. Another camera setup could be found in [8] where Changil et al. used different camera locations uniformly placed along a 1D line. 
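For illustration, the sketch below writes the standard algebraic epipolar constraint directly on unit bearing vectors, which is the form that applies unchanged to spherical images; it is a generic residual, not the specific error function proposed by Pagani et al. [14].

```python
import numpy as np

def essential_from_pose(R, t):
    """Essential matrix E = [t]_x R for the relative pose between two spherical cameras."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])
    return tx @ R

def epipolar_residuals(E, bearings1, bearings2):
    """Algebraic epipolar error b2^T E b1 for matched unit bearing vectors.

    bearings1, bearings2: (N, 3) arrays of unit rays on the two spheres
    (e.g. obtained from matched PC-SIFT features).
    """
    return np.einsum('ni,ij,nj->n', bearings2, E, bearings1)
```

Minimizing residuals of this kind over the relative rotation and (unit-norm) translation yields the epipolar geometry, after which matched rays can be triangulated into the sparse 3D points used as input for the dense stage.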
This setup allowed to rectify all captured images, hence all the epipolar lines of a scene point lay in the same horizontal scanline in all images, which is necessary for applying epipolar plane image (EPI) technique in the next stage (see Section 3.4). Kim et al. [10] used video stream to capture dynamic scene, which would then be extracted into image frames for SFM. These camera setups are also promising in spherical image cases. In the same work of Kim and his colleagues [10] for static scene, they captured two spherical images at two different heights to recover depth information of the scene through stereo geometry. Another work from Kim et al. [9] used vertical stereo pairs with a baseline of 60cm. Depth reconstruction was then computed using the knowledge of the baseline and correction of radial distortion in the vertical direction. Pathak et 3 al. [16, 17] reconstructed 3D scene using two spherical images taken near the structure with a small displacement. They then made use of the epipolar geometry and optical flow vectors to estimate depth information. Im et al. [6] also chose to use images extracted from a video clip captured by a spherical panoramic camera. They were inspired from methods which compute depth from small motion applied in perspective case, and proposed a new bundle adjustment approach that tried to minimize the reprojection error directly on the unit sphere. Details from these proposed methods will be discussed in the next sections. 3 3.1 Reconstruction Techniques Variational Methods Variational method is a class of optimization methods. Instead of implementing a heuristic sequence of processing steps, we define beforehand the properties the solution should have. This method is suitable in solving infinite-dimensional problems and spatially continuous representations. This method is good for 3D reconstruction due to some reasons: 1. Formulates 3D reconstruction as an energy minimization problem; 2. They are easy to fuse: energy/cost functions can be simply added; 3. It allows to make statements on the existence and uniqueness of solutions by analyzing the cost functions. Many applications in 3D reconstruction utilize this method to both reconstruction using perspective images or spherical ones. Kim et. al [10] have used variational approach for scene reconstruction in outdoor environment when dealing with static background scene. They used a partial differential equation (PDE)-based method to estimate the angular disparity of spherical image pair in subpixel level. They argued that sub-pixel disparity field could generate smooth surface and minimize depth errors. However, it could not handle the occlusion around depth discontinuity regions. They improved that by adding a unit step function, which replaced the role of the normal balanced diffusion equation by a pure diffusion filtering to smooth disparity field and propagate correct depth information from visible regions to occluded regions. Newcombe et al. [13] implemented DTAM application using low-resolution perspective images extracted from a video frame. The authors designed an energy function which contained a regularization term and a data term in their DTAM application . The regularization term tried to enforce smooth reconstruction in low-textured areas, while the data term tried to minimize the photometric error in high-textured areas. Hansung Kim and Adrian Hilton [9] has constructed 3D scene from multiple spherical stereo pairs. 
Their work was similar to what they have done in [10] in using PDE-based disparity estimation method. The disparity field was estimated by minimizing a designed energy function which involves fidelity and smoothing terms. They stated three problems with this method: 1. Over-segmentation in highly textured regions; 2. Stereo occlusion problem; 3. Local minima and computational complexity. To solve the first problem, they proposed a new diffusion tensor controlled by image and disparity gradients. They argued that the function could deal with the over-segmentation as well as preserving sharp object boundaries. Details can be found in [9]. The second problem was handled by adding the bi-directional matching function as a weighting factor of the fidelity term. For the computational complexity issue and convergence to local minima, a hierarchical structure was used. This structure started from low-resolution images and recursively refined the results at higher level to reduce the computational load and, as they argued, to avoid local minima. Im et al. [6] calculated depth from small motion with spherical panoramic camera with their new bundle adjustment approach that minimizes the cost function (or the re-projection error) directly on the defined unit sphere instead of the image domain. 4 Fig. 2: (a) The PMVS building block; (b) Optical flow vectors are aligned along their epipolar lines shown in red from [16] and (c) an example of EPI from [8] 3.2 PMVS Method PMVS stands for Patch-based Multi-View Stereopsis [4]. It produces dense (or semi-dense) 3D point cloud from a set of accurately calibrated cameras. The method has two stages (see Fig. 2-(a)). In the initialization stage features are extracted and matched along epipolar lines to create patches. After initialization, a sparse reconstruction of the model is created. These sparse 3D points will serve as seeds for the expansion (region growing) algorithm. The filtering enforces the global visibility consistency. The second stage iterates between expansion and filtering until the point cloud is dense enough. Pagani et al. [14] proposed S-PMVS, a spherical version of the original PMVS by Furukawa et al. [4]. They selected a number of regularly distributed anchor points from the prior sparse 3D reconstruction. They created virtual perspective images for dense reconstruction, by projecting the pixels from the sphere to the tangent plane perpendicular to the line through the anchor point and the center of the spherical camera. Each resulted image could be seen as a virtual perspective view. It had a virtual focal length and hence its intrinsic parameters could be computed. Each anchor point had their set of virtual perspective reference images and they would all serve as the input for the expansion and filtering step to generate a dense point cloud in the neighborhood of the anchor point, which is similar to the perspective case. The same technique for dense reconstruction was also applied in [15]. However, in [15], the authors focused more on the SFM with the spherical cameras. 3.3 Optical flow-based Method Optical flow is the apparent brightness pattern change in the image. Optical flow-based is a method to recover image motion at each pixel from spatio-temporal image brightness variations. Optical flow works well in small motion assumption, i.e. correspondences are limited to small displacements. 
With fast motion or large displacements, a coarse-to-fine scheme is applied: an image pyramid is built and optical flow is computed on each layer of the pyramid. Ifueko Igbinedion and Harvey Han [5] applied optical flow to estimate disparity in their work in 2015. However, they did not account for the distortion in their reconstruction algorithm, which can lead to poor results. Pathak et al. [16] used a combination of feature-point matching and dense optical flow in their work in 2016. They applied 8-point RANSAC-based sparse correspondences for an initial estimation to bring the images to the same orientation, which enabled the computation of dense optical flow. They then optimized the epipolar geometry parameters over the dense optical flow to align all flow vectors along their epipolar lines, shown in red (see Fig. 2-(b)). Dense reconstruction was computed using the magnitude of the aligned optical flow vectors. However, this work has a drawback: the optical flow vectors were estimated on the 2D equirectangular image and then projected onto the 3D unit sphere for optimizing the epipolar geometry, which can lead to inaccurate results due to strong distortions. They solved this in [17], where they ensured that the epipolar geometry parameters were optimized in the 2D equirectangular domain, using values computed directly on the 2D image.

3.4 Light field-based Method

The idea of this method is to simplify the matching of features between successive images by capturing them very close together [1]. An epipolar plane image (EPI) is constructed by stacking the corresponding epipolar lines of the image sequence on top of each other in order. As a result, the slope of a line in the EPI is proportional to the depth of the corresponding 3D point. The EPI concentrates this information in a single image. Fig. 2-(c) shows an example of an EPI: the left image shows a 2D slice of the 3D input light field, and the red lines connect corresponding scanlines in the images with their respective positions in the EPI. Changil et al. applied this method in [8]. They created a dense set of perspective images captured along a linear path. For every captured point, there is a corresponding EPI which contains a linear trace of that point; hence the method cannot work without a uniform displacement between consecutive capture positions. If the light field is sampled densely enough, the slope of this linear trace reflects the depth of the point. They proposed a fine-to-coarse algorithm as a hierarchical approach which handles smooth and homogeneous regions. This algorithm starts by estimating depth at the highest resolution level first and then propagates the information to lower resolutions. Krolla et al. [11] adapted this technique to spherical light fields. However, they did not completely consider the distortion of spherical images.

4 Scene Representation Results

4.1 Depth-map Representation

Fig. 3: Scene representation: (a) depth-map, (b) point-cloud, (c) voxel-grid and (d) mesh

This is a simple representation of a 3D model (Fig. 3-(a)). One depth map is usually constructed for each view. The final result is obtained by merging all constructed depth maps. Changil et al. [8] used this representation after obtaining the depth information from the EPIs (see Section 3.4). Kim et al. [9] estimated the disparity field using a variational method. Im et al. [6] also used this representation for their reconstructed model after estimating depth maps from multiple spherical images extracted from a video clip (see Section 3.1). Pathak et al., in both of their works [16, 17], constructed their depth maps from depth information computed using optical flow (see Section 3.3).
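To illustrate how such a spherical depth map encodes the scene, and how it can later be turned into the point clouds of Section 4.2, the following minimal sketch back-projects each pixel of an equirectangular depth map to a 3D point. It is only a hedged illustration: the array layout and the spherical coordinate convention are assumptions, not taken from the cited works.

import numpy as np

def equirect_depth_to_points(depth):
    """Back-project an equirectangular depth map (H x W, in meters) to 3D points."""
    h, w = depth.shape
    # Pixel-center angles: columns map to azimuth, rows to polar angle.
    theta = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi     # azimuth in [-pi, pi)
    phi = (np.arange(h) + 0.5) / h * np.pi                     # polar angle in [0, pi]
    theta, phi = np.meshgrid(theta, phi)
    # Unit ray directions on the sphere (y axis pointing up by this convention).
    dirs = np.stack([np.sin(phi) * np.sin(theta),
                     np.cos(phi),
                     np.sin(phi) * np.cos(theta)], axis=-1)
    return depth[..., None] * dirs                             # H x W x 3 array of points

# Example with a synthetic depth map: every pixel 2 m away from the camera center.
points = equirect_depth_to_points(np.full((512, 1024), 2.0))

Converting the result to a point cloud then amounts to reshaping the H x W x 3 array into a list of 3D points, optionally attaching a normal to each point.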
4.2 Point-cloud Representation

A point cloud represents the surface using 3D points (Fig. 3-(b)). A normal vector is usually associated with each point. A dense point cloud is constructed by applying a region-growing algorithm to the sparse model. This representation may contain noisy data. Pagani et al. [14, 15] represented their reconstructed model as a point cloud, where each point of the cloud was produced during the iteration between the expansion and filtering stages of the PMVS algorithm (see Section 3.2). Pathak et al. [16, 17] did not directly model the scenes with a point cloud; however, they converted the constructed depth maps into point clouds to present their final results.

4.3 Voxel-grid Representation

In this representation, the scene is divided into cubes (voxels), each marked as occupied or empty according to a photometric consistency measure (NCC, SSD, ...). Memory allocation is a major issue with this representation. Newcombe et al. [13] modeled the scene of interest as a cost volume represented by a modified voxel grid. Each voxel in a row (as illustrated in Fig. 3-(c)) stores the average photometric error of the corresponding pixel in the reference frame I_r for a given depth value. The average photometric error is computed by averaging the photometric errors at that pixel over all overlapping images. For each pixel, the (inverse) depth with the smallest average photometric error is assumed to belong to the true surface. However, this representation is not efficient for reconstruction from spherical images because of its large memory consumption for large environments.

4.4 Mesh Representation

The mesh representation models the surface as a connected set of planar facets (Fig. 3-(d)). Triangular meshes are usually used, which are locally good approximations of the surface. This representation is well suited for optimization (mesh refinement), but difficult to handle for complex surfaces. It is a good choice for presenting the model after reconstruction; however, no relevant papers have been found that use this kind of representation to model their scenes yet.
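To make the cost-volume idea of Section 4.3 concrete, the following toy sketch builds a volume of per-pixel photometric errors over candidate disparities for a rectified stereo pair and selects the minimum per pixel. It is only an assumed two-view analogue of the multi-view, regularized formulation in [13]:

import numpy as np

def disparity_from_cost_volume(left, right, max_disp=32):
    """Toy cost volume: photometric error per pixel and candidate disparity (SAD)."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # Shift the right image by d pixels so that pixels at disparity d line up.
        shifted = np.roll(right, d, axis=1)
        diff = np.abs(left - shifted)
        cost[d, :, d:] = diff[:, d:]           # ignore wrapped-around columns
    return np.argmin(cost, axis=0)             # winner-takes-all disparity per pixel

# Depth then follows from Z = f * b / disparity for focal length f and baseline b.

In [13] the same principle is applied per reference pixel over sampled inverse depths, with every overlapping video frame contributing to the average error, and the winner-takes-all result is further regularized by the smoothness term of the energy.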
5 Discussion

In the previous sections, we have reviewed several methods for 3D reconstruction, categorizing them based on the baseline, the reconstruction techniques, and their 3D representations. Starting with the baseline setup, we have divided them into two types: wide baseline and short baseline. The wide baseline setup has the advantage that images can be acquired anywhere. However, it brings more challenging problems, since the cameras can be distributed over a large area. With wide displacements, each camera captures the part of the scene visible to it, but suffers from large occlusions. For the variational methods, it is hard to handle the wide baseline case directly [9, 13, 6] for several reasons: in [13], the photometric error relies on the brightness constancy constraint, which may be violated under wide displacements or significant lighting changes; moreover, variational methods mostly depend on differential functions, which cannot be applied if the values change significantly between views. The same issue appears when applying classic optical flow with a wide baseline: optical flow requires small displacements in order to find the matching point within a local neighborhood of the next image. The light field-based method needs a densely enough sampled ray space, hence it cannot be applied with a wide baseline. The wide baseline case is handled with the PMVS approach in [14, 15]. In their papers, the authors did not use any differential function; instead, they used triangulation based on the epipolar geometry to obtain a sparse reconstruction of the scene, and triangulation works well with a wide baseline setup (see Section 2.1). The light field-based method has the advantage that each EPI can be processed in parallel on a GPU to reduce the computation time. We have also mentioned that one advantage of spherical images over perspective images is their full FOV, and hence spherical images contain more information. This helps to significantly reduce the number of images needed to reconstruct large structures such as bridges, historical buildings, etc. However, because of their high resolution, the computation becomes intensive. Moreover, the acquisition of spherical images is time-consuming with existing hardware. These two problems make it hard to use spherical images for real-time reconstruction. Although each 3D model representation can be flexibly converted into another, it is necessary to consider which representation is convenient when applying the reconstruction techniques. For instance, in the work of Pagani and his colleagues [14, 15], the output of the iteration between the expansion and filtering stages of the PMVS method is a set of 3D points, hence it is convenient to represent the final model as a point cloud. In contrast, the optical flow- and light field-based methods of [16, 17, 5, 8] produce depth information, which is suitable for a depth-map. Some representations work well for reconstruction with perspective images but are problematic in the spherical case. For instance, the voxel-grid is chosen in DTAM [13]; however, applying this representation to spherical images raises memory consumption issues, because spherical images are high-resolution and contain information about the full scene of large structures. Point-cloud and depth-map are good 3D representations for modeling the scene in the computation phase. However, they are often converted to a mesh in the presentation phase for better visualization.

Fig. 4: The relationship between baseline setup, reconstruction techniques and 3D representations.

As a brief summary of this report, Fig. 4 shows the relationship between baseline setup, reconstruction techniques and 3D representations. Because they deal with derivatives, variational methods are strongly connected to the short baseline setup. A short baseline is also required by the optical flow-based method, due to the need for small displacements, and by the light field-based method, which needs dense samples of the ray space. The PMVS method can handle both the wide and the short baseline; the relation between PMVS and short baseline is shown as a dotted line because this application is not discussed in this report. Among the 3D representations, the depth-map is most commonly used due to its computational convenience, followed by the point-cloud. Mesh and point-cloud are preferred for presenting the final result after the computation because of their good visualization. There is another 3D representation, the voxel-grid, which is not connected to any technique for spherical images due to the memory consumption issue discussed above.

6 Conclusion

In this paper, we have discussed state-of-the-art methods for 3D reconstruction using spherical images.
These techniques are mostly inspired by their counterparts in the perspective case. We began with the baseline setup, which can be classified as wide or short baseline according to the camera configuration relative to the size of the reconstructed model. With a wide baseline, the cameras can be positioned anywhere, which is convenient for covering the entire scene. However, it complicates the computation: with wide displacements, each camera captures the part of the scene visible to it and suffers from large occlusions. Pagani et al. [14, 15] approached this setup and used triangulation based on epipolar geometry to obtain a sparse scene reconstruction, which then serves as an initialization for the subsequent dense reconstruction. The short baseline setup is applied more commonly [13, 10, 9, 8, 16, 17, 6], because many methods can be used with it. We also categorized the 3D reconstructions based on their techniques, namely variational methods, PMVS, and optical flow- and light field-based methods. Variational methods are applied in [10, 13, 9, 6]. For PMVS, Pagani et al. [14, 15] proposed S-PMVS, a new version for reconstruction with spherical images. Pathak et al., in their two works from 2016 [16, 17], applied optical flow for their reconstruction. The light field-based method is applied by Changil et al. in [8]. Just as there are various approaches to producing a 3D model from multiple images, the final results also come in various types of representation, which are addressed here: point-cloud [14, 15], depth-map [8, 6, 16, 17, 9] and voxel-grid [13]. Our discussion section also pointed out some advantages and disadvantages, along with the relationships between these categories.

References

1. R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. International Journal of Computer Vision, 1:7–55, 1987.
2. T. Brodsky, C. Fermüller, and Y. Aloimonos. Directions of motion fields are hardly ever ambiguous. International Journal of Computer Vision, 26:5–24, 1998.
3. Y. Cui, A. Pagani, and D. Stricker. SIFT in perception-based color space. In Image Processing (ICIP), 2010 17th IEEE International Conference on, 2010.
4. Y. Furukawa and B. Curless. Towards internet-scale multi-view stereo. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010.
5. I. Igbinedion and H. Han. 3D stereo reconstruction using multiple spherical views. 2015.
6. S. Im, H. Ha, et al. All-around depth from small motion with a spherical panoramic camera. In European Conference on Computer Vision (ECCV 2016), LNCS 9907:156–172, September 2016.
7. T. Kanade and M. Okutomi. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15:353–363, 1993.
8. C. Kim, H. Zimmer, et al. Scene reconstruction from high spatio-angular resolution light fields. ACM Transactions on Graphics (TOG) – SIGGRAPH 2013 Conference Proceedings, 32, July 2013.
9. H. Kim and A. Hilton. 3D scene reconstruction from multiple spherical stereo pairs. International Journal of Computer Vision, 104:94–116, August 2013.
10. H. Kim, M. Sarim, et al. Dynamic 3D scene reconstruction in outdoor environments. 2010.
11. B. Krolla, M. Diebold, B. Goldlücke, and D. Stricker. Spherical light fields. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
12. R. C. Nelson and J. Aloimonos. Finding motion parameters from spherical motion fields (or the advantages of having eyes in the back of your head). Biological Cybernetics, 58:261–273, 1988.
13. R. A. Newcombe, S. J. Lovegrove, et al.
DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, November 2011.
14. A. Pagani, C. Gava, et al. Dense 3D point cloud generation from multiple high-resolution spherical images. In The 12th International Symposium on Virtual Reality, Archaeology and Intelligent Cultural Heritage, 2011.
15. A. Pagani and D. Stricker. Structure from motion using full spherical panoramic cameras. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, November 2011.
16. S. Pathak, A. Moro, et al. 3D reconstruction of structures using spherical cameras with small motion. In International Conference on Control, Automation and Systems 2016, October 2016.
17. S. Pathak, A. Moro, et al. Dense 3D reconstruction from two spherical images via optical flow-based equirectangular epipolar rectification. In Imaging Systems and Techniques (IST), 2016 IEEE International Conference on, October 2016.

Survey of different alternatives for object tracking using deep neural networks

Apurba Roy1 and Stephan Krauss2
1 [email protected]
2 [email protected]

Abstract. In this paper, we discuss the challenges faced in the field of object tracking and compare different approaches to countering these challenges. The main motivation behind this paper is to identify, through a detailed comparison, an approach that can perform object tracking better than its counterparts. The three papers discussed here are highly relevant to the overall topic, because each of them proposes a unique approach to handling the challenges of object tracking. After reading this paper, one will have a clear understanding of the possible ways of performing object tracking more efficiently. In the first approach [3], the main focus is on developing an algorithm that combines convolutional neural networks with Long Short-Term Memory (LSTM) [1]. In approach [2], the main focus is on training convolutional neural networks (CNNs) using data from multiple domains along with domain-specific information, and in the last paper [4], the main task is to understand the CNN well rather than treating it as a black box, and to improve the network's capability to achieve better performance. We mainly give overviews, i.e. a high-level understanding, of [2] and [4]. We then go into the depth of approach [3] and discuss the intricacies of the algorithm proposed there.

1 Introduction

As far as object tracking is concerned, several approaches have evolved over the past few decades. Each approach has its advantages and disadvantages. The most important factor for performing these tasks well is the concept called "feature learning". In the early days of machine learning, people used hand-designed features, which essentially represent the image after a set of oriented filters is explicitly applied to the input frame. The results of this shallow learning are not very satisfactory, the whole process is time-consuming (Fig. 1), and the main disadvantage is that the features are not learned. With hand-crafted features there is no scope for learning, the number of layers in the network stays small, and that is why it is named a "Shallow Architecture".

Fig. 1. Traditional Recognition Approach – "Shallow Architecture"
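Purely for illustration of such a shallow pipeline (the survey does not prescribe any concrete implementation, and the data here is synthetic), hand-crafted oriented-gradient features can be computed with fixed filters and fed to a single trainable linear classifier:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 100 grayscale patches of size 64x64 with binary labels.
rng = np.random.default_rng(0)
patches = rng.random((100, 64, 64))
labels = rng.integers(0, 2, size=100)

# Hand-crafted features: histograms of oriented gradients (fixed filters, nothing learned).
features = np.array([hog(p, orientations=9, pixels_per_cell=(8, 8),
                         cells_per_block=(2, 2)) for p in patches])

# The only trainable stage is a linear classifier on top of the fixed features.
classifier = LinearSVC().fit(features, labels)
print(classifier.score(features, labels))

The limitation pointed out in the text is visible here: only the final classifier is trained, while the feature extraction itself stays fixed.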
Then several years of rigorous research brought the concept of "Deep Learning", which focuses on learning the features dynamically rather than hand-crafting them. Deep learning networks consist of several layers which form a feature hierarchy, and the classifiers are trained together with them. Each layer extracts features from the output of the previous layer, and all layers are trained jointly. Such a network contains many layers and is therefore named a "Deep Architecture" (Fig. 2).

Fig. 2. Deep Learning – "Deep Architecture"

But in object tracking there are several issues that hamper the performance. For example, factors like scale changes, non-uniform illumination, motion blur, target deformations, background clutter, etc. are great challenges to overcome. The chosen papers mentioned above propose different network architectures to overcome these issues, and the main goal of this paper is to analyze them and carry out a comparative study of the suggested algorithms and their achievements in object tracking.

1.1 Overview of Spatially Supervised RCNN for visual tracking

In the first approach [3], the recurrent convolutional neural network uses the history of locations as well as the distinctive features learnt by deep neural networks. It also exploits the regression capability of the LSTM (Long Short-Term Memory) in the temporal domain; regression is applied directly to the prediction of the tracking location both in the convolutional and in the recurrent unit. The most interesting aspect of this approach is that, unlike existing methods, the recurrent network is "doubly deep": it examines the location history as well as the robust visual features of the past frames obtained by the convolutional neural network. The proposed network thus extends the neural network analysis into the spatio-temporal domain for better object tracking. First, the YOLO network collects rich and robust visual features from the raw image sequences, along with initial location inferences; then the LSTM comes into play and performs the sequential processing, taking the robust features as input, as shown in Fig. 3. In more detail, the rich and robust visual features are first extracted by CNNs, then these features are fed to the detectors to find out whether the objects are present in the calculated bounding boxes, using a spatial constraint. The output of the detectors, together with the raw visual features, is fed to the LSTM, which performs the sequential processing by applying temporal constraints. Finally, the output of the LSTM is used to predict the current object location. The YOLO network together with the LSTM is referred to as recurrent YOLO, i.e. ROLO, which takes visual features and regresses them into location predictions.

Fig. 3. A simple overview of the proposed system. Figure is taken from [3]
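As a rough sketch of such a "doubly deep" design (this is not the authors' implementation: the feature extractor below is a small placeholder CNN instead of YOLO, and all layer sizes are invented for illustration), per-frame convolutional features can be concatenated with a preliminary box estimate and fed to an LSTM that regresses the bounding box:

import torch
import torch.nn as nn

class RecurrentBoxTracker(nn.Module):
    """Toy CNN + LSTM tracker: per-frame features plus a preliminary box estimate
    are processed sequentially to regress the current bounding box."""

    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(                      # placeholder for the YOLO backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim + 4, hidden_dim, batch_first=True)
        self.box_head = nn.Linear(hidden_dim, 4)       # regresses (x, y, w, h)

    def forward(self, frames, prelim_boxes):
        # frames: (B, T, 3, H, W); prelim_boxes: (B, T, 4) initial detections
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        seq = torch.cat([feats, prelim_boxes], dim=-1)  # visual + spatial input
        out, _ = self.lstm(seq)
        return self.box_head(out)                       # per-frame box predictions

# Example with random data: 2 clips of 5 frames each.
tracker = RecurrentBoxTracker()
boxes = tracker(torch.rand(2, 5, 3, 64, 64), torch.rand(2, 5, 4))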
1.2 Overview of Multi-domain CNN for object tracking

The main idea behind the corresponding paper [2] is to focus on the characteristics of the training data. A multi-domain approach is used, which means that the training data comes from multiple domains. This data also contains domain-specific information, and the model is given the location data at the start of training, which is a novel and efficient approach. Two dedicated parts are used, named "Shared Layers" and "Domain-Specific Layers". The shared layers contain three convolutional layers (conv1–conv3), which extract robust visual features, and two fully connected layers (fc4, fc5), which are used as hidden neurons to perform the prediction based on those features. The domain-specific part contains several branches of fully connected layers that are used as classifiers with domain-specific information, but only at training time. Once the multi-domain learning is finished, the multiple branches of domain-specific layers are replaced with a single branch (fc6) for new test sequences (Fig. 4). The CNN is trained with a Stochastic Gradient Descent (SGD) algorithm. As mentioned above, the domain-specific layers contain K branches, and the yellow and blue bounding boxes refer to the positive and negative samples in each domain, respectively.

Fig. 4. The Architecture of the Multi-Domain Network. Figure is taken from [2]

This network is also smaller than AlexNet [5] or VGGNet [5] as far as the depth is concerned. It uses a bounding box regression technique: given the first set of samples, a simple linear regression model is trained using the conv3 features, while in the subsequent frames the equation below is used. Here X^* denotes the sample with the maximum positive score and f^+(X^i) denotes the positive score of the i-th candidate:

X^* = \arg\max_{X^i} f^+(X^i)    (1)

1.3 Overview of Fully convolutional network for visual tracking

In the third and final approach [4], instead of considering CNNs as black-box representations, the authors conducted an in-depth study of the properties of CNNs pre-trained offline on massive image data for the classification task on ImageNet. They found that the top layer mainly carries semantic information and acts as a category detector, whereas the lower layers convey more discriminative features which perform really well when similar targets are present. First, the image frame is passed through the VGGNet: the conv4-3 layer carries the rich discriminative features, while the conv5-3 layer conveys more semantic features. Two different networks are then used, named the Specific Net (SNet) and the General Net (GNet): the semantic information from the conv5-3 layer is fed to GNet, and the discriminative feature map from the conv4-3 layer is passed to SNet. In the next stage, SNet and GNet learn the features and produce two separate heat maps, independently of each other. These two heat maps are then passed to a distracter detection module, which outputs the appropriate heat map to be passed on to the next phase (Fig. 5).

Fig. 5. Pipeline of the algorithm: (a) input region of interest (ROI), (b) VGG network, (c) SNet, (d) GNet, (e) tracking results. Figure is taken from [4]

The model performs feature selection by minimizing the squared loss between the predicted foreground heat map M^* and the target heat map M, as given by the formula below:

L_{Sel} = \| M^* - M \|^2    (2)

For the online update, only SNet is updated because of the background noise. Two different rules are followed for updating SNet: the adaptation rule and the discriminative rule. The discriminative rule improves the discriminative power between foreground and background, while the adaptation rule helps SNet adapt to target appearance variations. According to the adaptation rule, SNet is fine-tuned every 20 frames using the most confident tracking result within the intervening frames. According to the discriminative rule, SNet is further updated when distracters (nearby objects that are similar to the target object) are detected.
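To give a flavour of this two-stream design (again only a hedged sketch: the feature maps below are random stand-ins for VGG conv4-3 and conv5-3 activations, and the head sizes are invented rather than taken from [4]), each stream can be a small fully convolutional head that regresses a foreground heat map from its feature map:

import torch
import torch.nn as nn

def heatmap_head(in_channels):
    """A small fully convolutional head mapping a feature map to a 1-channel heat map."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, kernel_size=3, padding=1),
    )

snet = heatmap_head(512)   # fed with lower-layer (conv4-3-like) features: more discriminative
gnet = heatmap_head(512)   # fed with top-layer (conv5-3-like) features: more semantic

# Random stand-ins for VGG feature maps of the region of interest.
feat_low, feat_high = torch.rand(1, 512, 28, 28), torch.rand(1, 512, 28, 28)
heat_s, heat_g = snet(feat_low), gnet(feat_high)

# Training a head amounts to the squared loss of Eq. (2) against a target heat map M.
target = torch.rand(1, 1, 28, 28)
loss = ((heat_s - target) ** 2).sum()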
2 Detailed presentation of Spatially Supervised RCNN for visual tracking

In this section, a detailed view of [3] is presented. This approach is a synergy between a convolutional neural network, which is responsible for extracting the rich, robust features from the video frames, and the LSTM's capability for sequential data processing. There is thus a collaboration between the features and the location inferences. The algorithm uses a regression method for the prediction of the object location both in the convolutional layers and in the recurrent unit (LSTM). In the case of other object trackers, for example those using a Kalman filter, only the location history is used to predict the object location in the next frame. In this approach, however, both the location history and the rich features extracted by the CNN are used (Fig. 1).

2.1 System Overview of Spatially Supervised RCNN for visual tracking

Here the network takes raw video frames and outputs the coordinates of the bounding box representing the location of the object in each frame. Both the rich features and the initial location inferences are used, which increases the efficiency of the tracker manyfold. The mathematical representation of the process is: P (B1 , B2 , ........, BT |X1 , X2 , ......XT ) = T Y P (BT |B