Search | arXiv e-print repository

Continuous Cost Aggregation for Dual-Pixel Disparity Extraction

Authors: Sagi Monin, Sagi Katz, Georgios Evangelidis

Abstract: Recent works have shown that depth information can be obtained from Dual-Pixel (DP) sensors. A DP arrangement provides two views in a single shot, thus resembling a stereo image pair with a tiny baseline. However, the different point spread function (PSF) per view, as well as the small disparity range, makes the use of typical stereo matching algorithms problematic. To address the above shortcomin… ▽ More Recent works have shown that depth information can be obtained from Dual-Pixel (DP) sensors. A DP arrangement provides two views in a single shot, thus resembling a stereo image pair with a tiny baseline. However, the different point spread function (PSF) per view, as well as the small disparity range, makes the use of typical stereo matching algorithms problematic. To address the above shortcomings, we propose a Continuous Cost Aggregation (CCA) scheme within a semi-global matching framework that is able to provide accurate continuous disparities from DP images. The proposed algorithm fits parabolas to matching costs and aggregates parabola coefficients along image paths. The aggregation step is performed subject to a quadratic constraint that not only enforces the disparity smoothness but also maintains the quadratic form of the total costs. This gives rise to an inherently efficient disparity propagation scheme with a pixel-wise minimization in closed-form. Furthermore, the continuous form allows for a robust multi-scale aggregation that better compensates for the varying PSF. Experiments on DP data from both DSLR and phone cameras show that the proposed scheme attains state-of-the-art performance in DP disparity estimation. △ Less

Submitted 13 June, 2023; originally announced June 2023.

arXiv:2212.08059 [pdf, other]

Rethinking Vision Transformers for MobileNet Size and Speed

Authors: Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, Jian Ren

Abstract: With the success of Vision Transformers (ViTs) in computer vision tasks, recent arts try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its varian… ▽ More With the success of Vision Transformers (ViTs) in computer vision tasks, recent arts try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches are proposed to accelerate attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even true for the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constraint hardware. In this work, we investigate a central question, can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose a novel supernet with low latency and high parameter efficiency. We further introduce a novel fine-grained joint search strategy for transformer models that can find efficient architectures by optimizing latency and number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve 3.5% higher top-1 accuracy than MobileNetV2 on ImageNet-1K with similar latency and parameters. This work demonstrate that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed. △ Less

Submitted 4 September, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

Comments: Code is available at: https://github.com/snap-research/EfficientFormer

arXiv:2206.01191 [pdf, other]

EfficientFormer: Vision Transformers at MobileNet Speed

Authors: Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, Jian Ren

Abstract: Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, \textit{e.g.}, attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challen… ▽ More Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, \textit{e.g.}, attention mechanism, ViT-based models are generally times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet block, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves $79.2\%$ top-1 accuracy on ImageNet-1K with only $1.6$ ms inference latency on iPhone 12 (compiled with CoreML), which runs as fast as MobileNetV2$\times 1.4$ ($1.6$ ms, $74.7\%$ top-1), and our largest model, EfficientFormer-L7, obtains $83.3\%$ accuracy with only $7.0$ ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance. △ Less

Submitted 10 October, 2022; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2012.06772 [pdf, other]

doi 10.1007/s00138-016-0784-4

An Overview of Depth Cameras and Range Scanners Based on Time-of-Flight Technologies

Authors: Radu Horaud, Miles Hansard, Georgios Evangelidis, Clement Menier

Abstract: Time-of-flight (TOF) cameras are sensors that can measure the depths of scene-points, by illuminating the scene with a controlled laser or LED source, and then analyzing the reflected light. In this paper, we will first describe the underlying measurement principles of time-of-flight cameras, including: (i) pulsed-light cameras, which measure directly the time taken for a light pulse to travel fro… ▽ More Time-of-flight (TOF) cameras are sensors that can measure the depths of scene-points, by illuminating the scene with a controlled laser or LED source, and then analyzing the reflected light. In this paper, we will first describe the underlying measurement principles of time-of-flight cameras, including: (i) pulsed-light cameras, which measure directly the time taken for a light pulse to travel from the device to the object and back again, and (ii) continuous-wave modulated-light cameras, which measure the phase difference between the emitted and received signals, and hence obtain the travel time indirectly. We review the main existing designs, including prototypes as well as commercially available devices. We also review the relevant camera calibration principles, and how they are applied to TOF devices. Finally, we discuss the benefits and challenges of combined TOF and color camera systems. △ Less

Submitted 12 December, 2020; originally announced December 2020.

Journal ref: Machine Vision and Applications, 27(7), 2016

arXiv:2012.06769 [pdf, other]

doi 10.1109/TPAMI.2015.2400465

Fusion of Range and Stereo Data for High-Resolution Scene-Modeling

Authors: Georgios D. Evangelidis, Miles Hansard, Radu Horaud

Abstract: This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps. In particular, we combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation. Unlike existing schemes that build on MRF optimizers, we infer the disparity map from a series of local energy minimization problems that are solved hierarchica… ▽ More This paper addresses the problem of range-stereo fusion, for the construction of high-resolution depth maps. In particular, we combine low-resolution depth data with high-resolution stereo data, in a maximum a posteriori (MAP) formulation. Unlike existing schemes that build on MRF optimizers, we infer the disparity map from a series of local energy minimization problems that are solved hierarchically, by growing sparse initial disparities obtained from the depth data. The accuracy of the method is not compromised, owing to three properties of the data-term in the energy function. Firstly, it incorporates a new correlation function that is capable of providing refined correlations and disparities, via subpixel correction. Secondly, the correlation scores rely on an adaptive cost aggregation step, based on the depth data. Thirdly, the stereo and depth likelihoods are adaptively fused, based on the scene texture and camera geometry. These properties lead to a more selective growing process which, unlike previous seed-growing methods, avoids the tendency to propagate incorrect disparities. The proposed method gives rise to an intrinsically efficient algorithm, which runs at 3FPS on 2.0MP images on a standard desktop computer. The strong performance of the new method is established both by quantitative comparisons with state-of-the-art methods, and by qualitative comparisons using real depth-stereo data-sets. △ Less

Submitted 12 December, 2020; originally announced December 2020.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11), 2015

arXiv:2010.02153 [pdf, ps, other]

Ego-Motion Alignment from Face Detections for Collaborative Augmented Reality

Authors: Branislav Micusik, Georgios Evangelidis

Abstract: Sharing virtual content among multiple smart glasses wearers is an essential feature of a seamless Collaborative Augmented Reality experience. To enable the sharing, local coordinate systems of the underlying 6D ego-pose trackers, running independently on each set of glasses, have to be spatially and temporally aligned with respect to each other. In this paper, we propose a novel lightweight solut… ▽ More Sharing virtual content among multiple smart glasses wearers is an essential feature of a seamless Collaborative Augmented Reality experience. To enable the sharing, local coordinate systems of the underlying 6D ego-pose trackers, running independently on each set of glasses, have to be spatially and temporally aligned with respect to each other. In this paper, we propose a novel lightweight solution for this problem, which is referred as ego-motion alignment. We show that detecting each other's face or glasses together with tracker ego-poses sufficiently conditions the problem to spatially relate local coordinate systems. Importantly, the detected glasses can serve as reliable anchors to bring sufficient accuracy for the targeted practical use. The proposed idea allows us to abandon the traditional visual localization step with fiducial markers or scene points as anchors. A novel closed form minimal solver which solves a Quadratic Eigenvalue Problem is derived and its refinement with Gaussian Belief Propagation is introduced. Experiments validate the presented approach and show its high practical potential. △ Less

Submitted 5 October, 2020; originally announced October 2020.

arXiv:2008.06399 [pdf, ps, other]

Renormalization for Initialization of Rolling Shutter Visual-Inertial Odometry

Authors: Branislav Micusik, Georgios Evangelidis

Abstract: In this paper we deal with the initialization problem of a visual-inertial odometry system with rolling shutter cameras. Initialization is a prerequisite for using inertial signals and fusing them with visual data. We propose a novel statistical solution to the initialization problem on visual and inertial data simultaneously, by casting it into the renormalization scheme of Kanatani. The renormal… ▽ More In this paper we deal with the initialization problem of a visual-inertial odometry system with rolling shutter cameras. Initialization is a prerequisite for using inertial signals and fusing them with visual data. We propose a novel statistical solution to the initialization problem on visual and inertial data simultaneously, by casting it into the renormalization scheme of Kanatani. The renormalization is an optimization scheme which intends to reduce the inherent statistical bias of common linear systems. We derive and present the necessary steps and methodology specific to the initialization problem. Extensive evaluations on ground truth exhibit superior performance and a gain in accuracy of up to $20\%$ over the originally proposed Least Squares solution. The renormalization performs similarly to the optimal Maximum Likelihood estimate, despite arriving at the solution by different means. With this paper we are adding to the set of Computer Vision problems which can be cast into the renormalization scheme. △ Less

Submitted 24 March, 2021; v1 submitted 14 August, 2020; originally announced August 2020.

arXiv:2006.06017 [pdf, ps, other]

Revisiting visual-inertial structure from motion for odometry and SLAM initialization

Authors: Georgios Evangelidis, Branislav Micusik

Abstract: In this paper, an efficient closed-form solution for the state initialization in visual-inertial odometry (VIO) and simultaneous localization and map** (SLAM) is presented. Unlike the state-of-the-art, we do not derive linear equations from triangulating pairs of point observations. Instead, we build on a direct triangulation of the unknown $3D$ point paired with each of its observations. We sho… ▽ More In this paper, an efficient closed-form solution for the state initialization in visual-inertial odometry (VIO) and simultaneous localization and map** (SLAM) is presented. Unlike the state-of-the-art, we do not derive linear equations from triangulating pairs of point observations. Instead, we build on a direct triangulation of the unknown $3D$ point paired with each of its observations. We show and validate the high impact of such a simple difference. The resulting linear system has a simpler structure and the solution through analytic elimination only requires solving a $6\times 6$ linear system (or $9 \times 9$ when accelerometer bias is included). In addition, all the observations of every scene point are jointly related, thereby leading to a less biased and more robust solution. The proposed formulation attains up to $50$ percent decreased velocity and point reconstruction error compared to the standard closed-form solver, while it is $4\times$ faster for a $7$-frame set. Apart from the inherent efficiency, fewer iterations are needed by any further non-linear refinement thanks to better parameter initialization. In this context, we provide the analytic Jacobians for a non-linear optimizer that optionally refines the initial parameters. The superior performance of the proposed solver is established by quantitative comparisons with the state-of-the-art solver. △ Less

Submitted 28 January, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

arXiv:1710.05193 [pdf, other]

doi 10.1016/j.ins.2019.03.024

K-means clustering for efficient and robust registration of multi-view point sets

Authors: Zutao Jiang, Jihua Zhu, Georgios D. Evangelidis, Changqing Zhang, Shanmin Pang, Yaochen Li

Abstract: Generally, there are three main factors that determine the practical usability of registration, i.e., accuracy, robustness, and efficiency. In real-time applications, efficiency and robustness are more important. To promote these two abilities, we cast the multi-view registration into a clustering task. All the centroids are uniformly sampled from the initially aligned point sets involved in the m… ▽ More Generally, there are three main factors that determine the practical usability of registration, i.e., accuracy, robustness, and efficiency. In real-time applications, efficiency and robustness are more important. To promote these two abilities, we cast the multi-view registration into a clustering task. All the centroids are uniformly sampled from the initially aligned point sets involved in the multi-view registration, which makes it rather efficient and effective for the clustering. Then, each point is assigned to a single cluster and each cluster centroid is updated accordingly. Subsequently, the shape comprised by all cluster centroids is used to sequentially estimate the rigid transformation for each point set. For accuracy and stability, clustering and transformation estimation are alternately and iteratively applied to all point sets. We tested our proposed approach on several benchmark datasets and compared it with state-of-the-art approaches. Experimental results validate its efficiency and robustness for the registration of multi-view point sets. △ Less

Submitted 30 April, 2018; v1 submitted 14 October, 2017; originally announced October 2017.

arXiv:1609.01466 [pdf, other]

doi 10.1109/TPAMI.2017.2717829

Joint Alignment of Multiple Point Sets with Batch and Incremental Expectation-Maximization

Authors: Georgios Evangelidis, Radu Horaud

Abstract: This paper addresses the problem of registering multiple point sets. Solutions to this problem are often approximated by repeatedly solving for pairwise registration, which results in an uneven treatment of the sets forming a pair: a model set and a data set. The main drawback of this strategy is that the model set may contain noise and outliers, which negatively affects the estimation of the regi… ▽ More This paper addresses the problem of registering multiple point sets. Solutions to this problem are often approximated by repeatedly solving for pairwise registration, which results in an uneven treatment of the sets forming a pair: a model set and a data set. The main drawback of this strategy is that the model set may contain noise and outliers, which negatively affects the estimation of the registration parameters. In contrast, the proposed formulation treats all the point sets on an equal footing. Indeed, all the points are drawn from a central Gaussian mixture, hence the registration is cast into a clustering problem. We formally derive batch and incremental EM algorithms that robustly estimate both the GMM parameters and the rotations and translations that optimally align the sets. Moreover, the mixture's means play the role of the registered set of points while the variances provide rich information about the contribution of each component to the alignment. We thoroughly test the proposed algorithms on simulated data and on challenging real data collected with range sensors. We compare them with several state-of-the-art algorithms, and we show their potential for surface reconstruction from depth data. △ Less

Submitted 6 March, 2017; v1 submitted 6 September, 2016; originally announced September 2016.

Comments: 14 pages, 12 figures, 5 tables

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1397 - 1410, 2018

arXiv:1603.09732 [pdf, other]

doi 10.1109/TIP.2017.2654165

Robust Head-Pose Estimation Based on Partially-Latent Mixture of Linear Regressions

Authors: Vincent Drouard, Radu Horaud, Antoine Deleforge, Silèye Ba, Georgios Evangelidis

Abstract: Head-pose estimation has many applications, such as social event analysis, human-robot and human-computer interaction, driving assistance, and so forth. Head-pose estimation is challenging because it must cope with changing illumination conditions, variabilities in face orientation and in appearance, partial occlusions of facial landmarks, as well as bounding-box-to-face alignment errors. We propo… ▽ More Head-pose estimation has many applications, such as social event analysis, human-robot and human-computer interaction, driving assistance, and so forth. Head-pose estimation is challenging because it must cope with changing illumination conditions, variabilities in face orientation and in appearance, partial occlusions of facial landmarks, as well as bounding-box-to-face alignment errors. We propose tu use a mixture of linear regressions with partially-latent output. This regression method learns to map high-dimensional feature vectors (extracted from bounding boxes of faces) onto the joint space of head-pose angles and bounding-box shifts, such that they are robustly predicted in the presence of unobservable phenomena. We describe in detail the map** method that combines the merits of unsupervised manifold learning techniques and of mixtures of regressions. We validate our method with three publicly available datasets and we thoroughly benchmark four variants of the proposed algorithm with several state-of-the-art head-pose estimation methods. △ Less

Submitted 6 March, 2017; v1 submitted 31 March, 2016; originally announced March 2016.

Comments: 12 pages, 5 figures, 3 tables

Journal ref: IEEE Transactions on Image Processing, volume 26, Issue 3, 1428-1440, 2017

arXiv:1406.0288 [pdf, other]

doi 10.1007/s11263-014-0758-9

Continuous Action Recognition Based on Sequence Alignment

Authors: Kaustubh Kulkarni, Georgios Evangelidis, Jan Cech, Radu Horaud

Abstract: Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be simultaneously carried out. We build on the well known dynamic time war** (DTW) framework and devise a novel visual alignment technique, namely dynamic frame war** (DFW), which performs isolated recognition based on per-frame representation of videos, and on aligning a te… ▽ More Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be simultaneously carried out. We build on the well known dynamic time war** (DTW) framework and devise a novel visual alignment technique, namely dynamic frame war** (DFW), which performs isolated recognition based on per-frame representation of videos, and on aligning a test sequence with a model sequence. Moreover, we propose two extensions which enable to perform recognition concomitant with segmentation, namely one-pass DFW and two-pass DFW. These two methods have their roots in the domain of continuous recognition of speech and, to the best of our knowledge, their extension to continuous visual action recognition has been overlooked. We test and illustrate the proposed techniques with a recently released dataset (RAVEL) and with two public-domain datasets widely used in action recognition (Hollywood-1 and Hollywood-2). We also compare the performances of the proposed isolated and continuous recognition algorithms with several recently published methods. △ Less

Submitted 2 June, 2014; originally announced June 2014.

Journal ref: International Journal of Computer Vision 112(1), 90-114, 2015

arXiv:1401.8092 [pdf, other]

doi 10.1016/j.cviu.2014.09.001

Cross-calibration of Time-of-flight and Colour Cameras

Authors: Miles Hansard, Georgios Evangelidis, Quentin Pelorson, Radu Horaud

Abstract: Time-of-flight cameras provide depth information, which is complementary to the photometric appearance of the scene in ordinary images. It is desirable to merge the depth and colour information, in order to obtain a coherent scene representation. However, the individual cameras will have different viewpoints, resolutions and fields of view, which means that they must be mutually calibrated. This p… ▽ More Time-of-flight cameras provide depth information, which is complementary to the photometric appearance of the scene in ordinary images. It is desirable to merge the depth and colour information, in order to obtain a coherent scene representation. However, the individual cameras will have different viewpoints, resolutions and fields of view, which means that they must be mutually calibrated. This paper presents a geometric framework for this multi-view and multi-modal calibration problem. It is shown that three-dimensional projective transformations can be used to align depth and parallax-based representations of the scene, with or without Euclidean reconstruction. A new evaluation procedure is also developed; this allows the reprojection error to be decomposed into calibration and sensor-dependent components. The complete approach is demonstrated on a network of three time-of-flight and six colour cameras. The applications of such a system, to a range of automatic scene-interpretation problems, are discussed. △ Less

Submitted 30 June, 2014; v1 submitted 31 January, 2014; originally announced January 2014.

Comments: 18 pages, 12 figures, 3 tables

Journal ref: Computer Vision and Image Understanding, 134, pp.105-115, 2015

arXiv:1401.6393 [pdf, other]

doi 10.1016/j.cviu.2014.01.007

Automatic Detection of Calibration Grids in Time-of-Flight Images

Authors: Miles Hansard, Radu Horaud, Michel Amat, Georgios Evangelidis

Abstract: It is convenient to calibrate time-of-flight cameras by established methods, using images of a chequerboard pattern. The low resolution of the amplitude image, however, makes it difficult to detect the board reliably. Heuristic detection methods, based on connected image-components, perform very poorly on this data. An alternative, geometrically-principled method is introduced here, based on the H… ▽ More It is convenient to calibrate time-of-flight cameras by established methods, using images of a chequerboard pattern. The low resolution of the amplitude image, however, makes it difficult to detect the board reliably. Heuristic detection methods, based on connected image-components, perform very poorly on this data. An alternative, geometrically-principled method is introduced here, based on the Hough transform. The projection of a chequerboard is represented by two pencils of lines, which are identified as oriented clusters in the gradient-data of the image. A projective Hough transform is applied to each of the two clusters, in axis-aligned coordinates. The range of each transform is properly bounded, because the corresponding gradient vectors are approximately parallel. Each of the two transforms contains a series of collinear peaks; one for every line in the given pencil. This pattern is easily detected, by swee** a dual line through the transform. The proposed Hough-based method is compared to the standard OpenCV detection routine, by application to several hundred time-of-flight images. It is shown that the new method detects significantly more calibration boards, over a greater variety of poses, without any overall loss of accuracy. This conclusion is based on an analysis of both geometric and photometric error. △ Less

Submitted 24 January, 2014; originally announced January 2014.

Comments: 11 pages, 11 figures, 1 table

Journal ref: Computer Vision and Image Understanding, 121, 2014

arXiv:1309.7750 [pdf, ps, other]

An Extensive Experimental Study on the Cluster-based Reference Set Reduction for speeding-up the k-NN Classifier

Authors: Stefanos Ougiaroglou, Georgios Evangelidis, Dimitris A. Dervos

Abstract: The k-Nearest Neighbor (k-NN) classification algorithm is one of the most widely-used lazy classifiers because of its simplicity and ease of implementation. It is considered to be an effective classifier and has many applications. However, its major drawback is that when sequential search is used to find the neighbors, it involves high computational cost. Speeding-up k-NN search is still an active… ▽ More The k-Nearest Neighbor (k-NN) classification algorithm is one of the most widely-used lazy classifiers because of its simplicity and ease of implementation. It is considered to be an effective classifier and has many applications. However, its major drawback is that when sequential search is used to find the neighbors, it involves high computational cost. Speeding-up k-NN search is still an active research field. Hwang and Cho have recently proposed an adaptive cluster-based method for fast Nearest Neighbor searching. The effectiveness of this method is based on the adjustment of three parameters. However, the authors evaluated their method by setting specific parameter values and using only one dataset. In this paper, an extensive experimental study of this method is presented. The results, which are based on five real life datasets, illustrate that if the parameters of the method are carefully defined, one can achieve even better classification performance. △ Less

Submitted 11 February, 2014; v1 submitted 30 September, 2013; originally announced September 2013.

Comments: Proceeding of International Conference on Integrated Information (IC-InInfo 2011), pp. 12-15, Kos island, Greece, 2011

Showing 1–15 of 15 results for author: Evangelidis, G