Search | arXiv e-print repository

arXiv:2010.15261 [pdf, other]

Deep Shells: Unsupervised Shape Correspondence with Optimal Transport

Authors: Marvin Eisenberger, Aysim Toker, Laura Leal-Taixé, Daniel Cremers

Abstract: We propose a novel unsupervised learning approach to 3D shape correspondence that builds a multiscale matching pipeline into a deep neural network. This approach is based on smooth shells, the current state-of-the-art axiomatic correspondence method, which requires an a priori stochastic search over the space of initial poses. Our goal is to replace this costly preprocessing step by directly learn… ▽ More We propose a novel unsupervised learning approach to 3D shape correspondence that builds a multiscale matching pipeline into a deep neural network. This approach is based on smooth shells, the current state-of-the-art axiomatic correspondence method, which requires an a priori stochastic search over the space of initial poses. Our goal is to replace this costly preprocessing step by directly learning good initializations from the input surfaces. To that end, we systematically derive a fully differentiable, hierarchical matching pipeline from entropy regularized optimal transport. This allows us to combine it with a local feature extractor based on smooth, truncated spectral convolution filters. Finally, we show that the proposed unsupervised method significantly improves over the state-of-the-art on multiple datasets, even in comparison to the most recent supervised methods. Moreover, we demonstrate compelling generalization results by applying our learned filters to examples that significantly deviate from the training set. △ Less

Submitted 28 October, 2020; originally announced October 2020.

arXiv:2010.14300 [pdf, other]

Ice Monitoring in Swiss Lakes from Optical Satellites and Webcams using Machine Learning

Authors: Manu Tom, Rajanie Prabha, Tianyu Wu, Emmanuel Baltsavias, Laura Leal-Taixe, Konrad Schindler

Abstract: Continuous observation of climate indicators, such as trends in lake freezing, is important to understand the dynamics of the local and global climate system. Consequently, lake ice has been included among the Essential Climate Variables (ECVs) of the Global Climate Observing System (GCOS), and there is a need to set up operational monitoring capabilities. Multi-temporal satellite images and publi… ▽ More Continuous observation of climate indicators, such as trends in lake freezing, is important to understand the dynamics of the local and global climate system. Consequently, lake ice has been included among the Essential Climate Variables (ECVs) of the Global Climate Observing System (GCOS), and there is a need to set up operational monitoring capabilities. Multi-temporal satellite images and publicly available webcam streams are among the viable data sources to monitor lake ice. In this work we investigate machine learning-based image analysis as a tool to determine the spatio-temporal extent of ice on Swiss Alpine lakes as well as the ice-on and ice-off dates, from both multispectral optical satellite images (VIIRS and MODIS) and RGB webcam images. We model lake ice monitoring as a pixel-wise semantic segmentation problem, i.e., each pixel on the lake surface is classified to obtain a spatially explicit map of ice cover. We show experimentally that the proposed system produces consistently good results when tested on data from multiple winters and lakes. Our satellite-based method obtains mean Intersection-over-Union (mIoU) scores >93%, for both sensors. It also generalises well across lakes and winters with mIoU scores >78% and >80% respectively. On average, our webcam approach achieves mIoU values of 87% (approx.) and generalisation scores of 71% (approx.) and 69% (approx.) across different cameras and winters respectively. Additionally, we put forward a new benchmark dataset of webcam images (Photi-LakeIce) which includes data from two winters and three cameras. △ Less

Submitted 27 October, 2020; originally announced October 2020.

Comments: Accepted for publication in MDPI Remote Sensing Journal

arXiv:2010.07548 [pdf, other]

MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking

Authors: Patrick Dendorfer, Aljoša Ošep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, Laura Leal-Taixé

Abstract: Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched… ▽ More Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data, and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i) MOT15, along with numerous state-of-the-art results that were submitted in the last years, (ii) MOT16, which contains new challenging videos, and (iii) MOT17, that extends MOT16 sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third release not only offers a significant increase in the number of labeled boxes but also provide labels for multiple object classes beside pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light on potential future research directions. △ Less

Submitted 8 December, 2020; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: Accepted at IJCV

arXiv:2010.01114 [pdf, other]

Goal-GAN: Multimodal Trajectory Prediction Based on Goal Position Estimation

Authors: Patrick Dendorfer, Aljoša Ošep, Laura Leal-Taixé

Abstract: In this paper, we present Goal-GAN, an interpretable and end-to-end trainable model for human trajectory prediction. Inspired by human navigation, we model the task of trajectory prediction as an intuitive two-stage process: (i) goal estimation, which predicts the most likely target positions of the agent, followed by a (ii) routing module which estimates a set of plausible trajectories that route… ▽ More In this paper, we present Goal-GAN, an interpretable and end-to-end trainable model for human trajectory prediction. Inspired by human navigation, we model the task of trajectory prediction as an intuitive two-stage process: (i) goal estimation, which predicts the most likely target positions of the agent, followed by a (ii) routing module which estimates a set of plausible trajectories that route towards the estimated goal. We leverage information about the past trajectory and visual context of the scene to estimate a multi-modal probability distribution over the possible goal positions, which is used to sample a potential goal during the inference. The routing is governed by a recurrent neural network that reacts to physical constraints in the nearby surroundings and generates feasible paths that route towards the sampled goal. Our extensive experimental evaluation shows that our method establishes a new state-of-the-art on several benchmarks while being able to generate a realistic and diverse set of trajectories that conform to physical constraints. △ Less

Submitted 2 October, 2020; originally announced October 2020.

Comments: Oral presentation at ACCV 2020

arXiv:2010.00602 [pdf, other]

doi 10.1051/0004-6361/202039574

HOLISMOKES -- IV. Efficient mass modeling of strong lenses through deep learning

Authors: S. Schuldt, S. H. Suyu, T. Meinhardt, L. Leal-Taixé, R. Cañameras, S. Taubenberger, A. Halkola

Abstract: Modelling the mass distributions of strong gravitational lenses is often necessary to use them as astrophysical and cosmological probes. With the high number of lens systems ($>10^5$) expected from upcoming surveys, it is timely to explore efficient modeling approaches beyond traditional MCMC techniques that are time consuming. We train a CNN on images of galaxy-scale lenses to predict the paramet… ▽ More Modelling the mass distributions of strong gravitational lenses is often necessary to use them as astrophysical and cosmological probes. With the high number of lens systems ($>10^5$) expected from upcoming surveys, it is timely to explore efficient modeling approaches beyond traditional MCMC techniques that are time consuming. We train a CNN on images of galaxy-scale lenses to predict the parameters of the SIE mass model ($x,y,e_x,e_y$, and $θ_E$). To train the network, we simulate images based on real observations from the HSC Survey for the lens galaxies and from the HUDF as lensed galaxies. We tested different network architectures, the effect of different data sets, and using different input distributions of $θ_E$. We find that the CNN performs well and obtain with the network trained with a uniform distribution of $θ_E$ $>0.5"$ the following median values with $1σ$ scatter: $Δx=(0.00^{+0.30}_{-0.30})"$, $Δy=(0.00^{+0.30}_{-0.29})" $, $Δθ_E=(0.07^{+0.29}_{-0.12})"$, $Δe_x = -0.01^{+0.08}_{-0.09}$ and $Δe_y = 0.00^{+0.08}_{-0.09}$. The bias in $θ_E$ is driven by systems with small $θ_E$. Therefore, when we further predict the multiple lensed image positions and time delays based on the network output, we apply the network to the sample limited to $θ_E>0.8"$. In this case, the offset between the predicted and input lensed image positions is $(0.00_{-0.29}^{+0.29})"$ and $(0.00_{-0.31}^{+0.32})"$ for $x$ and $y$, respectively. For the fractional difference between the predicted and true time delay, we obtain $0.04_{-0.05}^{+0.27}$. Our CNN is able to predict the SIE parameters in fractions of a second on a single CPU and with the output we can predict the image positions and time delays in an automated way, such that we are able to process efficiently the huge amount of expected lens detections in the near future. △ Less

Submitted 18 December, 2020; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: 17 pages, 14 Figures

arXiv:2009.07736 [pdf, other]

doi 10.1007/s11263-020-01375-2

HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking

Authors: Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixe, Bastian Leibe

Abstract: Multi-Object Tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, HOTA (Higher Order Tracking Accuracy), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing track… ▽ More Multi-Object Tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, HOTA (Higher Order Tracking Accuracy), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics which are able to evaluate each of five basic error types separately, which enables clear analysis of tracking performance. We evaluate the effectiveness of HOTA on the MOTChallenge benchmark, and show that it is able to capture important aspects of MOT performance not previously taken into account by established metrics. Furthermore, we show HOTA scores better align with human visual evaluation of tracking performance. △ Less

Submitted 29 September, 2020; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: Pre-print. Accepted for Publication in the International Journal of Computer Vision, 19 August 2020. Code is available at https://github.com/JonathonLuiten/HOTA-metrics

Journal ref: International Journal of Computer Vision (2020)

arXiv:2008.11516 [pdf, other]

Making a Case for 3D Convolutions for Object Segmentation in Videos

Authors: Sabarinath Mahadevan, Ali Athar, Aljoša Ošep, Sebastian Hennen, Laura Leal-Taixé, Bastian Leibe

Abstract: The task of object segmentation in videos is usually accomplished by processing appearance and motion information separately using standard 2D convolutional networks, followed by a learned fusion of the two sources of information. On the other hand, 3D convolutional networks have been successfully applied for video classification tasks, but have not been leveraged as effectively to problems involv… ▽ More The task of object segmentation in videos is usually accomplished by processing appearance and motion information separately using standard 2D convolutional networks, followed by a learned fusion of the two sources of information. On the other hand, 3D convolutional networks have been successfully applied for video classification tasks, but have not been leveraged as effectively to problems involving dense per-pixel interpretation of videos compared to their 2D convolutional counterparts and lag behind the aforementioned networks in terms of performance. In this work, we show that 3D CNNs can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a simple yet effective encoder-decoder network architecture consisting entirely of 3D convolutions that can be trained end-to-end using a standard cross-entropy loss. To this end, we leverage an efficient 3D encoder, and propose a 3D decoder architecture, that comprises novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms existing state-of-the-arts by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal dataset benchmarks in addition to being faster, thus showing that our architecture can efficiently learn expressive spatio-temporal features and produce high quality video segmentation masks. We have made our code and trained models publicly available at https://github.com/sabarim/3DC-Seg. △ Less

Submitted 1 September, 2023; v1 submitted 26 August, 2020; originally announced August 2020.

Comments: BMVC '20

arXiv:2005.09623 [pdf, other]

Focus on defocus: bridging the synthetic to real domain gap for depth estimation

Authors: Maxim Maximov, Kevin Galim, Laura Leal-Taixé

Abstract: Data-driven depth estimation methods struggle with the generalization outside their training scenes due to the immense variability of the real-world scenes. This problem can be partially addressed by utilising synthetically generated images, but closing the synthetic-real domain gap is far from trivial. In this paper, we tackle this issue by using domain invariant defocus blur as direct supervisio… ▽ More Data-driven depth estimation methods struggle with the generalization outside their training scenes due to the immense variability of the real-world scenes. This problem can be partially addressed by utilising synthetically generated images, but closing the synthetic-real domain gap is far from trivial. In this paper, we tackle this issue by using domain invariant defocus blur as direct supervision. We leverage defocus cues by using a permutation invariant convolutional neural network that encourages the network to learn from the differences between images with a different point of focus. Our proposed network uses the defocus map as an intermediate supervisory signal. We are able to train our model completely on synthetic data and directly apply it to a wide range of real-world images. We evaluate our model on synthetic and real datasets, showing compelling generalization results and state-of-the-art depth prediction. △ Less

Submitted 19 May, 2020; originally announced May 2020.

Comments: CVPR 2020

arXiv:2005.09544 [pdf, other]

doi 10.1109/CVPR42600.2020.00549

CIAGAN: Conditional Identity Anonymization Generative Adversarial Networks

Authors: Maxim Maximov, Ismail Elezi, Laura Leal-Taixé

Abstract: The unprecedented increase in the usage of computer vision technology in society goes hand in hand with an increased concern in data privacy. In many real-world scenarios like people tracking or action recognition, it is important to be able to process the data while taking careful consideration in protecting people's identity. We propose and develop CIAGAN, a model for image and video anonymizati… ▽ More The unprecedented increase in the usage of computer vision technology in society goes hand in hand with an increased concern in data privacy. In many real-world scenarios like people tracking or action recognition, it is important to be able to process the data while taking careful consideration in protecting people's identity. We propose and develop CIAGAN, a model for image and video anonymization based on conditional generative adversarial networks. Our model is able to remove the identifying characteristics of faces and bodies while producing high-quality images and videos that can be used for any computer vision task, such as detection or tracking. Unlike previous methods, we have full control over the de-identification (anonymization) procedure, ensuring both anonymization as well as diversity. We compare our method to several baselines and achieve state-of-the-art results. △ Less

Submitted 30 November, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

Comments: CVPR 2020

arXiv:2005.03770 [pdf, other]

Planning from Images with Deep Latent Gaussian Process Dynamics

Authors: Nathanael Bosch, Jan Achterhold, Laura Leal-Taixé, Jörg Stückler

Abstract: Planning is a powerful approach to control problems with known environment dynamics. In unknown environments the agent needs to learn a model of the system dynamics to make planning applicable. This is particularly challenging when the underlying states are only indirectly observable through images. We propose to learn a deep latent Gaussian process dynamics (DLGPD) model that learns low-dimension… ▽ More Planning is a powerful approach to control problems with known environment dynamics. In unknown environments the agent needs to learn a model of the system dynamics to make planning applicable. This is particularly challenging when the underlying states are only indirectly observable through images. We propose to learn a deep latent Gaussian process dynamics (DLGPD) model that learns low-dimensional system dynamics from environment interactions with visual observations. The method infers latent state representations from observations using neural networks and models the system dynamics in the learned latent space with Gaussian processes. All parts of the model can be trained jointly by optimizing a lower bound on the likelihood of transitions in image space. We evaluate the proposed approach on the pendulum swing-up task while using the learned dynamics model for planning in latent space in order to solve the control problem. We also demonstrate that our method can quickly adapt a trained agent to changes in the system dynamics from just a few rollouts. We compare our approach to a state-of-the-art purely deep learning based method and demonstrate the advantages of combining Gaussian processes with deep learning for data efficiency and transfer learning. △ Less

Submitted 7 May, 2020; originally announced May 2020.

Comments: Accepted for publication at the 2nd Annual Conference on Learning for Dynamics and Control (L4DC) 2020, with supplementary material. First two authors contributed equally

arXiv:2004.13048 [pdf, other]

doi 10.1051/0004-6361/202038219

HOLISMOKES -- II. Identifying galaxy-scale strong gravitational lenses in Pan-STARRS using convolutional neural networks

Authors: R. Canameras, S. Schuldt, S. H. Suyu, S. Taubenberger, T. Meinhardt, L. Leal-Taixe, C. Lemon, K. Rojas, E. Savary

Abstract: We present a systematic search for wide-separation (Einstein radius >1.5"), galaxy-scale strong lenses in the 30 000 sq.deg of the Pan-STARRS 3pi survey on the Northern sky. With long time delays of a few days to weeks, such systems are particularly well suited for catching strongly lensed supernovae with spatially-resolved multiple images and open new perspectives on early-phase supernova spectro… ▽ More We present a systematic search for wide-separation (Einstein radius >1.5"), galaxy-scale strong lenses in the 30 000 sq.deg of the Pan-STARRS 3pi survey on the Northern sky. With long time delays of a few days to weeks, such systems are particularly well suited for catching strongly lensed supernovae with spatially-resolved multiple images and open new perspectives on early-phase supernova spectroscopy and cosmography. We produce a set of realistic simulations by painting lensed COSMOS sources on Pan-STARRS image cutouts of lens luminous red galaxies with known redshift and velocity dispersion from SDSS. First of all, we compute the photometry of mock lenses in gri bands and apply a simple catalog-level neural network to identify a sample of 1050207 galaxies with similar colors and magnitudes as the mocks. Secondly, we train a convolutional neural network (CNN) on Pan-STARRS gri image cutouts to classify this sample and obtain sets of 105760 and 12382 lens candidates with scores pCNN>0.5 and >0.9, respectively. Extensive tests show that CNN performances rely heavily on the design of lens simulations and choice of negative examples for training, but little on the network architecture. Finally, we visually inspect all galaxies with pCNN>0.9 to assemble a final set of 330 high-quality newly-discovered lens candidates while recovering 23 published systems. For a subset, SDSS spectroscopy on the lens central regions proves our method correctly identifies lens LRGs at z~0.1-0.7. Five spectra also show robust signatures of high-redshift background sources and Pan-STARRS imaging confirms one of them as a quadruply-imaged red source at z_s = 1.185 strongly lensed by a foreground LRG at z_d = 0.3155. In the future, we expect that the efficient and automated two-step classification method presented in this paper will be applicable to the deeper gri stacks from the LSST with minor adjustments. △ Less

Submitted 7 April, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

Comments: 18 pages and 11 figures (plus appendix), version published in A&A

Journal ref: A&A 644, A163 (2020)

arXiv:2003.09003 [pdf, other]

MOT20: A benchmark for multi object tracking in crowded scenes

Authors: Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, Laura Leal-Taixé

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of mu… ▽ More Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection has high practical relevance. Since the first release, MOT15, MOT16, and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our MOT20benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark was presented first at the 4thBMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and gives to chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios. △ Less

Submitted 19 March, 2020; originally announced March 2020.

Comments: The sequences of the new MOT20 benchmark were previously presented in the CVPR 2019 tracking challenge ( arXiv:1906.04567 ). The differences between the two challenges are: - New and corrected annotations - New sequences, as we had to crop and transform some old sequences to achieve higher quality in the annotations. - New baselines evaluations and different sets of public detections

arXiv:2003.08429 [pdf, other]

doi 10.1007/978-3-030-58621-8_10

STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos

Authors: Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe

Abstract: Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In… ▽ More Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg. △ Less

Submitted 1 September, 2023; v1 submitted 18 March, 2020; originally announced March 2020.

Comments: ECCV 2020 28 pages, 6 figures

MSC Class: 68T45; 68T10; 62H30 ACM Class: I.2.10; I.4.6; I.4.8; I.5.3

arXiv:2002.07875 [pdf, other]

Lake Ice Monitoring with Webcams and Crowd-Sourced Images

Authors: Rajanie Prabha, Manu Tom, Mathias Rothermel, Emmanuel Baltsavias, Laura Leal-Taixe, Konrad Schindler

Abstract: Lake ice is a strong climate indicator and has been recognised as part of the Essential Climate Variables (ECV) by the Global Climate Observing System (GCOS). The dynamics of freezing and thawing, and possible shifts of freezing patterns over time, can help in understanding the local and global climate systems. One way to acquire the spatio-temporal information about lake ice formation, independen… ▽ More Lake ice is a strong climate indicator and has been recognised as part of the Essential Climate Variables (ECV) by the Global Climate Observing System (GCOS). The dynamics of freezing and thawing, and possible shifts of freezing patterns over time, can help in understanding the local and global climate systems. One way to acquire the spatio-temporal information about lake ice formation, independent of clouds, is to analyse webcam images. This paper intends to move towards a universal model for monitoring lake ice with freely available webcam data. We demonstrate good performance, including the ability to generalise across different winters and different lakes, with a state-of-the-art Convolutional Neural Network (CNN) model for semantic image segmentation, Deeplab v3+. Moreover, we design a variant of that model, termed Deep-U-Lab, which predicts sharper, more correct segmentation boundaries. We have tested the model's ability to generalise with data from multiple camera views and two different winters. On average, it achieves intersection-over-union (IoU) values of ~71% across different cameras and ~69% across different winters, greatly outperforming prior work. Going even further, we show that the model even achieves 60% IoU on arbitrary images scraped from photo-sharing web sites. As part of the work, we introduce a new benchmark dataset of webcam images, Photi-LakeIce, from multiple cameras and two different winters, along with pixel-wise ground truth annotations. △ Less

Submitted 7 May, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: Accepted for ISPRS Congress 2020, Nice, France

arXiv:2001.11845 [pdf, other]

Learn to Predict Sets Using Feed-Forward Neural Networks

Authors: Hamid Rezatofighi, Tianyu Zhu, Roman Kaskman, Farbod T. Motlagh, Qinfeng Shi, Anton Milan, Daniel Cremers, Laura Leal-Taixé, Ian Reid

Abstract: This paper addresses the task of set prediction using deep feed-forward neural networks. A set is a collection of elements which is invariant under permutation and the size of a set is not fixed in advance. Many real-world problems, such as image tagging and object detection, have outputs that are naturally expressed as sets of entities. This creates a challenge for traditional deep neural network… ▽ More This paper addresses the task of set prediction using deep feed-forward neural networks. A set is a collection of elements which is invariant under permutation and the size of a set is not fixed in advance. Many real-world problems, such as image tagging and object detection, have outputs that are naturally expressed as sets of entities. This creates a challenge for traditional deep neural networks which naturally deal with structured outputs such as vectors, matrices or tensors. We present a novel approach for learning to predict sets with unknown permutation and cardinality using deep neural networks. In our formulation we define a likelihood for a set distribution represented by a) two discrete distributions defining the set cardinally and permutation variables, and b) a joint distribution over set elements with a fixed cardinality. Depending on the problem under consideration, we define different training models for set prediction using deep neural networks. We demonstrate the validity of our set formulations on relevant vision problems such as: 1) multi-label image classification where we outperform the other competing methods on the PASCAL VOC and MS COCO datasets, 2) object detection, for which our formulation outperforms popular state-of-the-art detectors, and 3) a complex CAPTCHA test, where we observe that, surprisingly, our set-based network acquired the ability of mimicking arithmetics without any rules being coded. △ Less

Submitted 25 October, 2021; v1 submitted 29 January, 2020; originally announced January 2020.

Comments: Accepted in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2022. arXiv admin note: substantial text overlap with arXiv:1805.00613

arXiv:1912.07515 [pdf, other]

Learning a Neural Solver for Multiple Object Tracking

Authors: Guillem Brasó, Laura Leal-Taixé

Abstract: Graphs offer a natural way to formulate Multiple Object Tracking (MOT) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such \textit{structured domain} is not trivial. As a consequence, most learning-based work has been devoted to learning better features for MOT, and then using these with well-e… ▽ More Graphs offer a natural way to formulate Multiple Object Tracking (MOT) within the tracking-by-detection paradigm. However, they also introduce a major challenge for learning methods, as defining a model that can operate on such \textit{structured domain} is not trivial. As a consequence, most learning-based work has been devoted to learning better features for MOT, and then using these with well-established optimization frameworks. In this work, we exploit the classical network flow formulation of MOT to define a fully differentiable framework based on Message Passing Networks (MPNs). By operating directly on the graph domain, our method can reason globally over an entire set of detections and predict final solutions. Hence, we show that learning in MOT does not need to be restricted to feature extraction, but it can also be applied to the data association step. We show a significant improvement in both MOTA and IDF1 on three publicly available benchmarks. Our code is available at https://bit.ly/motsolv . △ Less

Submitted 18 April, 2020; v1 submitted 16 December, 2019; originally announced December 2019.

Comments: Accepted to CVPR 2020 (oral)

arXiv:1912.05227 [pdf, other]

HistoNet: Predicting size histograms of object instances

Authors: Kishan Sharma, Moritz Gold, Christian Zurbruegg, Laura Leal-Taixé, Jan Dirk Wegner

Abstract: We propose to predict histograms of object sizes in crowded scenes directly without any explicit object instance segmentation. What makes this task challenging is the high density of objects (of the same category), which makes instance identification hard. Instead of explicitly segmenting object instances, we show that directly learning histograms of object sizes improves accuracy while using dras… ▽ More We propose to predict histograms of object sizes in crowded scenes directly without any explicit object instance segmentation. What makes this task challenging is the high density of objects (of the same category), which makes instance identification hard. Instead of explicitly segmenting object instances, we show that directly learning histograms of object sizes improves accuracy while using drastically less parameters. This is very useful for application scenarios where explicit, pixel-accurate instance segmentation is not needed, but there lies interest in the overall distribution of instance sizes. Our core applications are in biology, where we estimate the size distribution of soldier fly larvae, and medicine, where we estimate the size distribution of cancer cells as an intermediate step to calculate the tumor cellularity score. Given an image with hundreds of small object instances, we output the total count and the size histogram. We also provide a new data set for this task, the FlyLarvae data set, which consists of 11,000 larvae instances labeled pixel-wise. Our method results in an overall improvement in the count and size distribution prediction as compared to state-of-the-art instance segmentation method Mask R-CNN. △ Less

Submitted 4 June, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

arXiv:1912.00385 [pdf, other]

The Group Loss for Deep Metric Learning

Authors: Ismail Elezi, Sebastiano Vascon, Alessandro Torcinovich, Marcello Pelillo, Laura Leal-Taixe

Abstract: Deep metric learning has yielded impressive results in tasks such as clustering and image retrieval by leveraging neural networks to obtain highly discriminative feature embeddings, which can be used to group samples into different classes. Much research has been devoted to the design of smart loss functions or data mining strategies for training such networks. Most methods consider only pairs or… ▽ More Deep metric learning has yielded impressive results in tasks such as clustering and image retrieval by leveraging neural networks to obtain highly discriminative feature embeddings, which can be used to group samples into different classes. Much research has been devoted to the design of smart loss functions or data mining strategies for training such networks. Most methods consider only pairs or triplets of samples within a mini-batch to compute the loss function, which is commonly based on the distance between embeddings. We propose Group Loss, a loss function based on a differentiable label-propagation method that enforces embedding similarity across all samples of a group while promoting, at the same time, low-density regions amongst data points belonging to different groups. Guided by the smoothness assumption that "similar objects should belong to the same group", the proposed loss trains the neural network for a classification task, enforcing a consistent labelling amongst samples within a class. We show state-of-the-art results on clustering and image retrieval on several datasets, and show the potential of our method when combined with other techniques such as ensembles △ Less

Submitted 20 July, 2020; v1 submitted 1 December, 2019; originally announced December 2019.

Comments: Accepted to European Conference on Computer Vision (ECCV) 2020, includes non-archival supplementary material

arXiv:1908.01293 [pdf, other]

To Learn or Not to Learn: Visual Localization from Essential Matrices

Authors: Qunjie Zhou, Torsten Sattler, Marc Pollefeys, Laura Leal-Taixe

Abstract: Visual localization is the problem of estimating a camera within a scene and a key component in computer vision applications such as self-driving cars and Mixed Reality. State-of-the-art approaches for accurate visual localization use scene-specific representations, resulting in the overhead of constructing these models when applying the techniques to new scenes. Recently, deep learning-based appr… ▽ More Visual localization is the problem of estimating a camera within a scene and a key component in computer vision applications such as self-driving cars and Mixed Reality. State-of-the-art approaches for accurate visual localization use scene-specific representations, resulting in the overhead of constructing these models when applying the techniques to new scenes. Recently, deep learning-based approaches based on relative pose estimation have been proposed, carrying the promise of easily adapting to new scenes. However, it has been shown such approaches are currently significantly less accurate than state-of-the-art approaches. In this paper, we are interested in analyzing this behavior. To this end, we propose a novel framework for visual localization from relative poses. Using a classical feature-based approach within this framework, we show state-of-the-art performance. Replacing the classical approach with learned alternatives at various levels, we then identify the reasons for why deep learned approaches do not perform well. Based on our analysis, we make recommendations for future work. △ Less

Submitted 9 March, 2020; v1 submitted 4 August, 2019; originally announced August 2019.

Comments: Accepted to ICRA 2020

arXiv:1907.11025 [pdf, other]

Towards Generalizing Sensorimotor Control Across Weather Conditions

Authors: Qadeer Khan, Patrick Wenzel, Daniel Cremers, Laura Leal-Taixé

Abstract: The ability of deep learning models to generalize well across different scenarios depends primarily on the quality and quantity of annotated data. Labeling large amounts of data for all possible scenarios that a model may encounter would not be feasible; if even possible. We propose a framework to deal with limited labeled training data and demonstrate it on the application of vision-based vehicle… ▽ More The ability of deep learning models to generalize well across different scenarios depends primarily on the quality and quantity of annotated data. Labeling large amounts of data for all possible scenarios that a model may encounter would not be feasible; if even possible. We propose a framework to deal with limited labeled training data and demonstrate it on the application of vision-based vehicle control. We show how limited steering angle data available for only one condition can be transferred to multiple different weather scenarios. This is done by leveraging unlabeled images in a teacher-student learning paradigm complemented with an image-to-image translation network. The translation network transfers the images to a new domain, whereas the teacher provides soft supervised targets to train the student on this domain. Furthermore, we demonstrate how utilization of auxiliary networks can reduce the size of a model at inference time, without affecting the accuracy. The experiments show that our approach generalizes well across multiple different weather conditions using only ground truth labels from one domain. △ Less

Submitted 25 July, 2019; originally announced July 2019.

Comments: Accepted for publication in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

arXiv:1906.06618 [pdf, other]

How To Train Your Deep Multi-Object Tracker

Authors: Yihong Xu, Aljosa Osep, Yutong Ban, Radu Horaud, Laura Leal-Taixe, Xavier Alameda-Pineda

Abstract: The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP).… ▽ More The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP). As these measures are not differentiable, the choice of appropriate loss functions for end-to-end training of multi-object tracking methods is still an open research problem. In this paper, we bridge this gap by proposing a differentiable proxy of MOTA and MOTP, which we combine in a loss function suitable for end-to-end training of deep multi-object trackers. As a key ingredient, we propose a Deep Hungarian Net (DHN) module that approximates the Hungarian matching algorithm. DHN allows estimating the correspondence between object tracks and ground truth objects to compute differentiable proxies of MOTA and MOTP, which are in turn used to optimize deep trackers directly. We experimentally demonstrate that the proposed differentiable framework improves the performance of existing multi-object trackers, and we establish a new state of the art on the MOTChallenge benchmark. Our code is publicly available from https://github.com/yihongXU/deepMOT. △ Less

Submitted 23 April, 2020; v1 submitted 15 June, 2019; originally announced June 2019.

Comments: 14 pages, 9 figures, 6 tables

arXiv:1906.04567 [pdf, other]

CVPR19 Tracking and Detection Challenge: How crowded can it get?

Authors: Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, Laura Leal-Taixe

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of… ▽ More Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection has high practical relevance. Since the first release, MOT15, MOT16 and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our CVPR19 benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark will be presented at the 4th BMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and will evaluate the state-of-the-art in multiple object tracking whend handling extremely crowded scenarios. △ Less

Submitted 10 June, 2019; originally announced June 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1603.00831, arXiv:1504.01942

arXiv:1903.07504 [pdf, other]

Understanding the Limitations of CNN-based Absolute Camera Pose Regression

Authors: Torsten Sattler, Qunjie Zhou, Marc Pollefeys, Laura Leal-Taixe

Abstract: Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular.… ▽ More Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular. These methods learn to directly regress the camera pose from an input image. However, they do not achieve the same level of pose accuracy as 3D structure-based methods. To understand this behavior, we develop a theoretical model for camera pose regression. We use our model to predict failure cases for pose regression techniques and verify our predictions through experiments. We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. A key result is that current approaches do not consistently outperform a handcrafted image retrieval baseline. This clearly shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods. △ Less

Submitted 18 March, 2019; originally announced March 2019.

Comments: Initial version of a paper accepted to CVPR 2019

arXiv:1903.05625 [pdf, other]

doi 10.1109/ICCV.2019.00103

Tracking without bells and whistles

Authors: Philipp Bergmann, Tim Meinhardt, Laura Leal-Taixe

Abstract: The problem of tracking multiple objects in a video sequence poses several challenging tasks. For tracking-by-detection, these include object re-identification, motion prediction and dealing with occlusions. We present a tracker (without bells and whistles) that accomplishes tracking without specifically targeting any of these tasks, in particular, we perform no training or optimization on trackin… ▽ More The problem of tracking multiple objects in a video sequence poses several challenging tasks. For tracking-by-detection, these include object re-identification, motion prediction and dealing with occlusions. We present a tracker (without bells and whistles) that accomplishes tracking without specifically targeting any of these tasks, in particular, we perform no training or optimization on tracking data. To this end, we exploit the bounding box regression of an object detector to predict the position of an object in the next frame, thereby converting a detector into a Tracktor. We demonstrate the potential of Tracktor and provide a new state-of-the-art on three multi-object tracking benchmarks by extending it with a straightforward re-identification and camera motion compensation. We then perform an analysis on the performance and failure cases of several state-of-the-art tracking methods in comparison to our Tracktor. Surprisingly, none of the dedicated tracking methods are considerably better in dealing with complex tracking scenarios, namely, small and occluded objects or missing detections. However, our approach tackles most of the easy tracking scenarios. Therefore, we motivate our approach as a new tracking paradigm and point out promising future research directions. Overall, Tracktor yields superior tracking performance than any current tracking method and our analysis exposes remaining and unsolved tracking challenges to inspire future research directions. △ Less

Submitted 17 August, 2019; v1 submitted 13 March, 2019; originally announced March 2019.

arXiv:1811.09393 [pdf, other]

doi 10.1145/3386569.3392457

Learning Temporal Coherence via Self-Supervision for GAN-based Video Generation

Authors: Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixé, Nils Thuerey

Abstract: Our work explores temporal self-supervision for GAN-based video generation tasks. While adversarial training successfully yields generative models for a variety of areas, temporal relationships in the generated data are much less explored. Natural temporal changes are crucial for sequential generation tasks, e.g. video super-resolution and unpaired video translation. For the former, state-of-the-a… ▽ More Our work explores temporal self-supervision for GAN-based video generation tasks. While adversarial training successfully yields generative models for a variety of areas, temporal relationships in the generated data are much less explored. Natural temporal changes are crucial for sequential generation tasks, e.g. video super-resolution and unpaired video translation. For the former, state-of-the-art methods often favor simpler norm losses such as $L^2$ over adversarial training. However, their averaging nature easily leads to temporally smooth results with an undesirable lack of spatial detail. For unpaired video translation, existing approaches modify the generator networks to form spatio-temporal cycle consistencies. In contrast, we focus on improving learning objectives and propose a temporally self-supervised algorithm. For both tasks, we show that temporal adversarial learning is key to achieving temporally coherent solutions without sacrificing spatial detail. We also propose a novel **-Pong loss to improve the long-term temporal consistency. It effectively prevents recurrent networks from accumulating artifacts temporally without depressing detailed features. Additionally, we propose a first set of metrics to quantitatively evaluate the accuracy as well as the perceptual quality of the temporal evolution. A series of user studies confirm the rankings computed with these metrics. Code, data, models, and results are provided at https://github.com/thunil/TecoGAN. The project page https://ge.in.tum.de/publications/2019-tecogan-chu/ contains supplemental materials. △ Less

Submitted 21 May, 2020; v1 submitted 23 November, 2018; originally announced November 2018.

Comments: Project page: https://ge.in.tum.de/publications/2019-tecogan-chu/, code link: https://github.com/thunil/TecoGAN

arXiv:1807.01001 [pdf, other]

Modular Vehicle Control for Transferring Semantic Information Between Weather Conditions Using GANs

Authors: Patrick Wenzel, Qadeer Khan, Daniel Cremers, Laura Leal-Taixé

Abstract: Even though end-to-end supervised learning has shown promising results for sensorimotor control of self-driving cars, its performance is greatly affected by the weather conditions under which it was trained, showing poor generalization to unseen conditions. In this paper, we show how knowledge can be transferred using semantic maps to new weather conditions without the need to obtain new ground tr… ▽ More Even though end-to-end supervised learning has shown promising results for sensorimotor control of self-driving cars, its performance is greatly affected by the weather conditions under which it was trained, showing poor generalization to unseen conditions. In this paper, we show how knowledge can be transferred using semantic maps to new weather conditions without the need to obtain new ground truth data. To this end, we propose to divide the task of vehicle control into two independent modules: a control module which is only trained on one weather condition for which labeled steering data is available, and a perception module which is used as an interface between new weather conditions and the fixed control module. To generate the semantic data needed to train the perception module, we propose to use a generative adversarial network (GAN)-based model to retrieve the semantic information for the new conditions in an unsupervised manner. We introduce a master-servant architecture, where the master model (semantic labels available) trains the servant model (semantic labels not available). We show that our proposed method trained with ground truth data for a single weather condition is capable of achieving similar results on the task of steering angle prediction as an end-to-end model trained with ground truth data of 15 different weather conditions. △ Less

Submitted 1 October, 2018; v1 submitted 3 July, 2018; originally announced July 2018.

Comments: 2nd Conference on Robot Learning (CoRL 2018), Zürich, Switzerland

arXiv:1805.00613 [pdf, other]

Deep Perm-Set Net: Learn to predict sets with unknown permutation and cardinality using deep neural networks

Authors: S. Hamid Rezatofighi, Roman Kaskman, Farbod T. Motlagh, Qinfeng Shi, Daniel Cremers, Laura Leal-Taixé, Ian Reid

Abstract: Many real-world problems, e.g. object detection, have outputs that are naturally expressed as sets of entities. This creates a challenge for traditional deep neural networks which naturally deal with structured outputs such as vectors, matrices or tensors. We present a novel approach for learning to predict sets with unknown permutation and cardinality using deep neural networks. Specifically, in… ▽ More Many real-world problems, e.g. object detection, have outputs that are naturally expressed as sets of entities. This creates a challenge for traditional deep neural networks which naturally deal with structured outputs such as vectors, matrices or tensors. We present a novel approach for learning to predict sets with unknown permutation and cardinality using deep neural networks. Specifically, in our formulation we incorporate the permutation as unobservable variable and estimate its distribution during the learning process using alternating optimization. We demonstrate the validity of this new formulation on two relevant vision problems: object detection, for which our formulation outperforms state-of-the-art detectors such as Faster R-CNN and YOLO, and a complex CAPTCHA test, where we observe that, surprisingly, our set based network acquired the ability of mimicking arithmetics without any rules being coded. △ Less

Submitted 2 October, 2018; v1 submitted 1 May, 2018; originally announced May 2018.

arXiv:1804.00863 [pdf, other]

Deep Appearance Maps

Authors: Maxim Maximov, Laura Leal-Taixé, Mario Fritz, Tobias Ritschel

Abstract: We propose a deep representation of appearance, i. e., the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have useddeep learning to extract classic appearance representationsrelating to reflectance model parameters (e. g., Phong) orillumination (e. g., HDR environment maps). We suggest todirectly represent appearance itself as a network we c… ▽ More We propose a deep representation of appearance, i. e., the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have useddeep learning to extract classic appearance representationsrelating to reflectance model parameters (e. g., Phong) orillumination (e. g., HDR environment maps). We suggest todirectly represent appearance itself as a network we call aDeep Appearance Map (DAM). This is a 4D generalizationover 2D reflectance maps, which held the view direction fixed. First, we show how a DAM can be learned from images or video frames and later be used to synthesize appearance, given new surface orientations and viewer positions. Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn). Finally, we show the example of an appearance estimation-and-segmentation task, map** from an image showingmultiple materials to multiple deep appearance maps. △ Less

Submitted 29 October, 2019; v1 submitted 3 April, 2018; originally announced April 2018.

Journal ref: ICCV 2019

arXiv:1803.08660 [pdf, other]

Lifting Layers: Analysis and Applications

Authors: Peter Ochs, Tim Meinhardt, Laura Leal-Taixe, Michael Moeller

Abstract: The great advances of learning-based approaches in image processing and computer vision are largely based on deeply nested networks that compose linear transfer functions with suitable non-linearities. Interestingly, the most frequently used non-linearities in imaging applications (variants of the rectified linear unit) are uncommon in low dimensional approximation problems. In this paper we propo… ▽ More The great advances of learning-based approaches in image processing and computer vision are largely based on deeply nested networks that compose linear transfer functions with suitable non-linearities. Interestingly, the most frequently used non-linearities in imaging applications (variants of the rectified linear unit) are uncommon in low dimensional approximation problems. In this paper we propose a novel non-linear transfer function, called lifting, which is motivated from a related technique in convex optimization. A lifting layer increases the dimensionality of the input, naturally yields a linear spline when combined with a fully connected layer, and therefore closes the gap between low and high dimensional approximation problems. Moreover, applying the lifting operation to the loss layer of the network allows us to handle non-convex and flat (zero-gradient) cost functions. We analyze the proposed lifting theoretically, exemplify interesting properties in synthetic experiments and demonstrate its effectiveness in deep learning approaches to image classification and denoising. △ Less

Submitted 23 March, 2018; originally announced March 2018.

arXiv:1709.06031 [pdf, other]

Video Object Segmentation Without Temporal Information

Authors: Kevis-Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool

Abstract: Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the result of these methods can deteriorate significantly or they may not even produce a… ▽ More Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the result of these methods can deteriorate significantly or they may not even produce any result at all. This paper explores the orthogonal approach of processing each frame independently, i.e disregarding the temporal information. In particular, it tackles the task of semi-supervised video object segmentation: the separation of an object from the background in a video, given its mask in the first frame. We present Semantic One-Shot Video Object Segmentation (OSVOS-S), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one shot). We show that instance level semantic information, when combined effectively, can dramatically improve the results of our previous method, OSVOS. We perform experiments on two recent video segmentation databases, which show that OSVOS-S is both the fastest and most accurate method in the state of the art. △ Less

Submitted 16 May, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

Comments: Accepted to T-PAMI. Extended version of "One-Shot Video Object Segmentation", CVPR 2017 (arXiv:1611.05198). Project page: http://www.vision.ee.ethz.ch/~cvlsegmentation/osvos/

arXiv:1706.02434 [pdf, other]

doi 10.1016/j.compmedimag.2015.11.002

Automatic tracking of vessel-like structures from a single starting point

Authors: Dario Augusto Borges Oliveira, Laura Leal-Taixe, Raul Queiroz Feitosa, Bodo Rosenhahn

Abstract: The identification of vascular networks is an important topic in the medical image analysis community. While most methods focus on single vessel tracking, the few solutions that exist for tracking complete vascular networks are usually computationally intensive and require a lot of user interaction. In this paper we present a method to track full vascular networks iteratively using a single starti… ▽ More The identification of vascular networks is an important topic in the medical image analysis community. While most methods focus on single vessel tracking, the few solutions that exist for tracking complete vascular networks are usually computationally intensive and require a lot of user interaction. In this paper we present a method to track full vascular networks iteratively using a single starting point. Our approach is based on a cloud of sampling points distributed over concentric spherical layers. We also proposed a vessel model and a metric of how well a sample point fits this model. Then, we implement the network tracking as a min-cost flow problem, and propose a novel optimization scheme to iteratively track the vessel structure by inherently handling bifurcations and paths. The method was tested using both synthetic and real images. On the 9 different data-sets of synthetic blood vessels, we achieved maximum accuracies of more than 98\%. We further use the synthetic data-set to analyse the sensibility of our method to parameter setting, showing the robustness of the proposed algorithm. For real images, we used coronary, carotid and pulmonary data to segment vascular structures and present the visual results. Still for real images, we present numerical and visual results for networks of nerve fibers in the olfactory system. Further visual results also show the potential of our approach for identifying vascular networks topologies. The presented method delivers good results for the several different datasets tested and have potential for segmenting vessel-like structures. Also, the topology information, inherently extracted, can be used for further analysis to computed aided diagnosis and surgical planning. Finally, the method's modular aspect holds potential for problem-oriented adjustments and improvements. △ Less

Submitted 7 June, 2017; originally announced June 2017.

arXiv:1705.08314 [pdf, other]

Fusion of Head and Full-Body Detectors for Multi-Object Tracking

Authors: Roberto Henschel, Laura Leal-Taixé, Daniel Cremers, Bodo Rosenhahn

Abstract: In order to track all persons in a scene, the tracking-by-detection paradigm has proven to be a very effective approach. Yet, relying solely on a single detector is also a major limitation, as useful image information might be ignored. Consequently, this work demonstrates how to fuse two detectors into a tracking system. To obtain the trajectories, we propose to formulate tracking as a weighted gr… ▽ More In order to track all persons in a scene, the tracking-by-detection paradigm has proven to be a very effective approach. Yet, relying solely on a single detector is also a major limitation, as useful image information might be ignored. Consequently, this work demonstrates how to fuse two detectors into a tracking system. To obtain the trajectories, we propose to formulate tracking as a weighted graph labeling problem, resulting in a binary quadratic program. As such problems are NP-hard, the solution can only be approximated. Based on the Frank-Wolfe algorithm, we present a new solver that is crucial to handle such difficult problems. Evaluation on pedestrian tracking is provided for multiple scenarios, showing superior results over single detector tracking and standard QP-solvers. Finally, our tracker ranks 2nd on the MOT16 benchmark and 1st on the new MOT17 benchmark, outperforming over 90 trackers. △ Less

Submitted 24 April, 2018; v1 submitted 23 May, 2017; originally announced May 2017.

Comments: 10 pages, 4 figures; Winner of the MOT17 challenge; CVPRW 2018

arXiv:1705.05020 [pdf, other]

Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs

Authors: Emanuel Laude, Jan-Hendrik Lange, Jonas Schüpfer, Csaba Domokos, Laura Leal-Taixé, Frank R. Schmidt, Bjoern Andres, Daniel Cremers

Abstract: This paper introduces a novel algorithm for transductive inference in higher-order MRFs, where the unary energies are parameterized by a variable classifier. The considered task is posed as a joint optimization problem in the continuous classifier parameters and the discrete label variables. In contrast to prior approaches such as convex relaxations, we propose an advantageous decoupling of the ob… ▽ More This paper introduces a novel algorithm for transductive inference in higher-order MRFs, where the unary energies are parameterized by a variable classifier. The considered task is posed as a joint optimization problem in the continuous classifier parameters and the discrete label variables. In contrast to prior approaches such as convex relaxations, we propose an advantageous decoupling of the objective function into discrete and continuous subproblems and a novel, efficient optimization method related to ADMM. This approach preserves integrality of the discrete label variables and guarantees global convergence to a critical point. We demonstrate the advantages of our approach in several experiments including video object segmentation on the DAVIS data set and interactive image segmentation. △ Less

Submitted 28 April, 2018; v1 submitted 14 May, 2017; originally announced May 2017.

arXiv:1704.02781 [pdf, other]

Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking

Authors: Laura Leal-Taixé, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. We present a benchmark for Multiple Object Tracking launched in the late 2014, with the goal of creating a framework for the stan… ▽ More Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. We present a benchmark for Multiple Object Tracking launched in the late 2014, with the goal of creating a framework for the standardized evaluation of multiple object tracking methods. This paper collects the two releases of the benchmark made so far, and provides an in-depth analysis of almost 50 state-of-the-art trackers that were tested on over 11000 frames. We show the current trends and weaknesses of multiple people tracking methods, and provide pointers of what researchers should be focusing on to push the field forward. △ Less

Submitted 10 April, 2017; originally announced April 2017.

arXiv:1704.01085 [pdf, other]

Deep Depth From Focus

Authors: Caner Hazirbas, Sebastian Georg Soyer, Maximilian Christian Staab, Laura Leal-Taixé, Daniel Cremers

Abstract: Depth from focus (DFF) is one of the classical ill-posed inverse problems in computer vision. Most approaches recover the depth at each pixel based on the focal setting which exhibits maximal sharpness. Yet, it is not obvious how to reliably estimate the sharpness level, particularly in low-textured areas. In this paper, we propose `Deep Depth From Focus (DDFF)' as the first end-to-end learning ap… ▽ More Depth from focus (DFF) is one of the classical ill-posed inverse problems in computer vision. Most approaches recover the depth at each pixel based on the focal setting which exhibits maximal sharpness. Yet, it is not obvious how to reliably estimate the sharpness level, particularly in low-textured areas. In this paper, we propose `Deep Depth From Focus (DDFF)' as the first end-to-end learning approach to this problem. One of the main challenges we face is the hunger for data of deep neural networks. In order to obtain a significant amount of focal stacks with corresponding groundtruth depth, we propose to leverage a light-field camera with a co-calibrated RGB-D sensor. This allows us to digitally create focal stacks of varying sizes. Compared to existing benchmarks our dataset is 25 times larger, enabling the use of machine learning for this inverse problem. We compare our results with state-of-the-art DFF methods and we also analyze the effect of several key deep architectural components. These experiments show that our proposed method `DDFFNet' achieves state-of-the-art performance in all scenes, reducing depth error by more than 75% compared to the classical DFF methods. △ Less

Submitted 28 October, 2018; v1 submitted 4 April, 2017; originally announced April 2017.

Comments: accepted to Asian Conference on Computer Vision (ACCV) 2018

arXiv:1611.07890 [pdf, other]

Image-based localization using LSTMs for structured feature correlation

Authors: Florian Walch, Caner Hazirbas, Laura Leal-Taixé, Torsten Sattler, Sebastian Hilsenbeck, Daniel Cremers

Abstract: In this work we propose a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes. CNNs allow us to learn suitable feature representations for localization that are robust against motion blur and illumination changes. We make use of LSTM units on the CNN output, which play the role of a structured dimensionality reduction on the feature vector, leading to drastic improve… ▽ More In this work we propose a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes. CNNs allow us to learn suitable feature representations for localization that are robust against motion blur and illumination changes. We make use of LSTM units on the CNN output, which play the role of a structured dimensionality reduction on the feature vector, leading to drastic improvements in localization performance. We provide extensive quantitative comparison of CNN-based and SIFT-based localization methods, showing the weaknesses and strengths of each. Furthermore, we present a new large-scale indoor dataset with accurate ground truth from a laser scanner. Experimental results on both indoor and outdoor public datasets show our method outperforms existing deep architectures, and can localize images in hard conditions, e.g., in the presence of mostly textureless surfaces, where classic SIFT-based methods fail. △ Less

Submitted 20 August, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

arXiv:1611.05198 [pdf, other]

One-Shot Video Object Segmentation

Authors: Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, Luc Van Gool

Abstract: This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foregro… ▽ More This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%). △ Less

Submitted 13 April, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

Comments: CVPR 2017 camera ready. Code: http://www.vision.ee.ethz.ch/~cvlsegmentation/osvos/

arXiv:1607.07304 [pdf, other]

Tracking with multi-level features

Authors: Roberto Henschel, Laura Leal-Taixé, Bodo Rosenhahn, Konrad Schindler

Abstract: We present a novel formulation of the multiple object tracking problem which integrates low and mid-level features. In particular, we formulate the tracking problem as a quadratic program coupling detections and dense point trajectories. Due to the computational complexity of the initial QP, we propose an approximation by two auxiliary problems, a temporal and spatial association, where the tempor… ▽ More We present a novel formulation of the multiple object tracking problem which integrates low and mid-level features. In particular, we formulate the tracking problem as a quadratic program coupling detections and dense point trajectories. Due to the computational complexity of the initial QP, we propose an approximation by two auxiliary problems, a temporal and spatial association, where the temporal subproblem can be efficiently solved by a linear program and the spatial association by a clustering algorithm. The objective function of the QP is used in order to find the optimal number of clusters, where each cluster ideally represents one person. Evaluation is provided for multiple scenarios, showing the superiority of our method with respect to classic tracking-by-detection methods and also other methods that greedily integrate low-level features. △ Less

Submitted 25 July, 2016; originally announced July 2016.

Comments: Submitted as an IEEE PAMI short article

arXiv:1604.07866 [pdf, other]

Learning by tracking: Siamese CNN for robust target association

Authors: Laura Leal-Taixé, Cristian Canton Ferrer, Konrad Schindler

Abstract: This paper introduces a novel approach to the task of data association within the context of pedestrian tracking, by introducing a two-stage learning scheme to match pairs of detections. First, a Siamese convolutional neural network (CNN) is trained to learn descriptors encoding local spatio-temporal structures between the two input image patches, aggregating pixel values and optical flow informat… ▽ More This paper introduces a novel approach to the task of data association within the context of pedestrian tracking, by introducing a two-stage learning scheme to match pairs of detections. First, a Siamese convolutional neural network (CNN) is trained to learn descriptors encoding local spatio-temporal structures between the two input image patches, aggregating pixel values and optical flow information. Second, a set of contextual features derived from the position and size of the compared input patches are combined with the CNN output by means of a gradient boosting classifier to generate the final matching probability. This learning approach is validated by using a linear programming based multi-person tracker showing that even a simple and efficient tracker may outperform much more complex models when fed with our learned matching probabilities. Results on publicly available sequences show that our method meets state-of-the-art standards in multiple people tracking. △ Less

Submitted 4 August, 2016; v1 submitted 26 April, 2016; originally announced April 2016.

Journal ref: Computer Vision and Pattern Recognition Conference Workshops (CVPRW). DeepVision: Deep Learning for Computer Vision. 2016

arXiv:1603.00831 [pdf, other]

MOT16: A Benchmark for Multi-Object Tracking

Authors: Anton Milan, Laura Leal-Taixe, Ian Reid, Stefan Roth, Konrad Schindler

Abstract: Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for reseach. Recently, a new benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of collecting existing and new… ▽ More Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for reseach. Recently, a new benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of collecting existing and new data and creating a framework for the standardized evaluation of multiple object tracking methods. The first release of the benchmark focuses on multiple people tracking, since pedestrians are by far the most studied object in the tracking community. This paper accompanies a new release of the MOTChallenge benchmark. Unlike the initial release, all videos of MOT16 have been carefully annotated following a consistent protocol. Moreover, it not only offers a significant increase in the number of labeled boxes, but also provides multiple object classes beside pedestrians and the level of visibility for every single object of interest. △ Less

Submitted 3 May, 2016; v1 submitted 2 March, 2016; originally announced March 2016.

Comments: arXiv admin note: substantial text overlap with arXiv:1504.01942

arXiv:1504.01942 [pdf, other]

MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking

Authors: Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, Konrad Schindler

Abstract: In the recent past, the computer vision community has developed centralized benchmarks for the performance evaluation of a variety of tasks, including generic object and pedestrian detection, 3D reconstruction, optical flow, single-object short-term tracking, and stereo estimation. Despite potential pitfalls of such benchmarks, they have proved to be extremely helpful to advance the state of the a… ▽ More In the recent past, the computer vision community has developed centralized benchmarks for the performance evaluation of a variety of tasks, including generic object and pedestrian detection, 3D reconstruction, optical flow, single-object short-term tracking, and stereo estimation. Despite potential pitfalls of such benchmarks, they have proved to be extremely helpful to advance the state of the art in the respective area. Interestingly, there has been rather limited work on the standardization of quantitative benchmarks for multiple target tracking. One of the few exceptions is the well-known PETS dataset, targeted primarily at surveillance applications. Despite being widely used, it is often applied inconsistently, for example involving using different subsets of the available data, different ways of training the models, or differing evaluation scripts. This paper describes our work toward a novel multiple object tracking benchmark aimed to address such issues. We discuss the challenges of creating such a framework, collecting existing and new data, gathering state-of-the-art methods to be tested on the datasets, and finally creating a unified evaluation system. With MOTChallenge we aim to pave the way toward a unified evaluation framework for a more meaningful quantification of multi-target tracking. △ Less

Submitted 8 April, 2015; originally announced April 2015.

arXiv:1411.7935 [pdf, other]

Multiple object tracking with context awareness

Authors: Laura Leal-Taixé

Abstract: Multiple people tracking is a key problem for many applications such as surveillance, animation or car navigation, and a key input for tasks such as activity recognition. In crowded environments occlusions and false detections are common, and although there have been substantial advances in recent years, tracking is still a challenging task. Tracking is typically divided into two steps: detection,… ▽ More Multiple people tracking is a key problem for many applications such as surveillance, animation or car navigation, and a key input for tasks such as activity recognition. In crowded environments occlusions and false detections are common, and although there have been substantial advances in recent years, tracking is still a challenging task. Tracking is typically divided into two steps: detection, i.e., locating the pedestrians in the image, and data association, i.e., linking detections across frames to form complete trajectories. For the data association task, approaches typically aim at develo** new, more complex formulations, which in turn put the focus on the optimization techniques required to solve them. However, they still utilize very basic information such as distance between detections. In this thesis, I focus on the data association task and argue that there is contextual information that has not been fully exploited yet in the tracking community, mainly social context and spatial context coming from different views. △ Less

Submitted 30 November, 2016; v1 submitted 24 November, 2014; originally announced November 2014.

Comments: PhD thesis, Leibniz University Hannover, Germany

Showing 51–92 of 92 results for author: Leal-Taixe, L