Two-View Geometry Scoring Without Correspondences

Authors: Axel Barroso-Laguna, Eric Brachmann, Victor Adrian Prisacariu, Gabriel J. Brostow, Daniyar Turmukhambetov

Abstract: Camera pose estimation for two-view geometry traditionally relies on RANSAC. Normally, a multitude of image correspondences leads to a pool of proposed hypotheses, which are then scored to find a winning model. The inlier count is generally regarded as a reliable indicator of "consensus". We examine this scoring heuristic, and find that it favors disappointing models under certain circumstances. As a remedy, we propose the Fundamental Scoring Network (FSNet), which infers a score for a pair of overlapping images and any proposed fundamental matrix. It does not rely on sparse correspondences, but rather embodies a two-view geometry model through an epipolar attention mechanism that predicts the pose error of the two images. FSNet can be incorporated into traditional RANSAC loops. We evaluate FSNet on fundamental and essential matrix estimation on indoor and outdoor datasets, and establish that FSNet can successfully identify good poses for pairs of images with few or unreliable correspondences. Besides, we show that naively combining FSNet with MAGSAC++ scoring approach achieves state of the art results.

Submitted 2 June, 2023; originally announced June 2023.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

arXiv:2305.07014 [pdf, other]

Virtual Occlusions Through Implicit Depth

Authors: Jamie Watson, Mohamed Sayed, Zawar Qureshi, Gabriel J. Brostow, Sara Vicente, Oisin Mac Aodha, Michael Firman

Abstract: For augmented reality (AR), it is important that virtual assets appear to `sit among' real world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth-regression as an intermediate step. We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.

Submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted to CVPR 2023

arXiv:2303.10532 [pdf, other]

doi 10.20380/GI2000.09

Automatic Joint Parameter Estimation from Magnetic Motion Capture Data

Authors: James F. O'Brien, Robert E. Bodenheimer, Gabriel J. Brostow, Jessica K. Hodgins

Abstract: This paper describes a technique for using magnetic motion capture data to determine the joint parameters of an articulated hierarchy. This technique makes it possible to determine limb lengths, joint locations, and sensor placement for a human subject without external measurements. Instead, the joint parameters are inferred with high accuracy from the motion data acquired during the capture session. The parameters are computed by performing a linear least squares fit of a rotary joint model to the input data. A hierarchical structure for the articulated model can also be determined in situations where the topology of the model is not known. Once the system topology and joint parameters have been recovered, the resulting model can be used to perform forward and inverse kinematic procedures. We present the results of using the algorithm on human motion capture data, as well as validation results obtained with data from a simulation and a wooden linkage of known dimensions.

Submitted 18 March, 2023; originally announced March 2023.

Comments: 8 pages, 8 figures, 4 tables

ACM Class: I.3.5

Journal ref: In Proceedings of Graphics Interface 2000, pages 53-60, May 2000

arXiv:2008.10634 [pdf, other]

DiverseNet: When One Right Answer is not Enough

Authors: Michael Firman, Neill D. F. Campbell, Lourdes Agapito, Gabriel J. Brostow

Abstract: Many structured prediction tasks in machine vision have a collection of acceptable answers, instead of one definitive ground truth answer. Segmentation of images, for example, is subject to human labeling bias. Similarly, there are multiple possible pixel values that could plausibly complete occluded image regions. State-of-the art supervised learning methods are typically optimized to make a single test-time prediction for each query, failing to find other modes in the output space. Existing methods that allow for sampling often sacrifice speed or accuracy. We introduce a simple method for training a neural network, which enables diverse structured predictions to be made for each test-time query. For a single input, we learn to predict a range of possible answers. We compare favorably to methods that seek diversity through an ensemble of networks. Such stochastic multiple choice learning faces mode collapse, where one or more ensemble members fail to receive any training signal. Our best performing solution can be deployed for various tasks, and just involves small modifications to the existing single-mode architecture, loss function, and training regime. We demonstrate that our method results in quantitative improvements across three challenging tasks: 2D image completion, 3D volume estimation, and flow prediction.

Submitted 24 August, 2020; originally announced August 2020.

Comments: Presented at CVPR 2018

arXiv:2008.06959 [pdf, other]

Image Stylization for Robust Features

Authors: Iaroslav Melekhov, Gabriel J. Brostow, Juho Kannala, Daniyar Turmukhambetov

Abstract: Local features that are robust to both viewpoint and appearance changes are crucial for many computer vision tasks. In this work we investigate if photorealistic image stylization improves robustness of local features to not only day-night, but also weather and season variations. We show that image stylization in addition to color augmentation is a powerful method of learning robust features. We evaluate learned features on visual localization benchmarks, outperforming state of the art baseline models despite training without ground-truth 3D correspondences using synthetic homographies only. We use trained feature networks to compete in Long-Term Visual Localization and Map-based Localization for Autonomous Driving challenges achieving competitive scores.

Submitted 16 August, 2020; originally announced August 2020.

Comments: v1.1

arXiv:2008.05785 [pdf, other]

Predicting Visual Overlap of Images Through Interpretable Non-Metric Box Embeddings

Authors: Anita Rau, Guillermo Garcia-Hernando, Danail Stoyanov, Gabriel J. Brostow, Daniyar Turmukhambetov

Abstract: To what extent are two images picturing the same 3D surfaces? Even when this is a known scene, the answer typically requires an expensive search across scale space, with matching and geometric verification of large sets of local features. This expense is further multiplied when a query image is evaluated against a gallery, e.g. in visual relocalization. While we don't obviate the need for geometric verification, we propose an interpretable image-embedding that cuts the search in scale space to essentially a lookup. Our approach measures the asymmetric relation between two images. The model then learns a scene-specific measure of similarity, from training examples with known 3D visible-surface overlaps. The result is that we can quickly identify, for example, which test image is a close-up version of another, and by what scale factor. Subsequently, local features need only be detected at that scale. We validate our scene-specific model by showing how this embedding yields competitive image-matching results, while being simpler, faster, and also interpretable by humans.

Submitted 13 August, 2020; originally announced August 2020.

Comments: ECCV 2020

arXiv:2008.01484 [pdf, other]

Learning Stereo from Single Images

Authors: Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, Michael Firman

Abstract: Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.

Submitted 20 August, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted as an oral presentation at ECCV 2020

arXiv:2004.06376 [pdf, other]

Footprints and Free Space from a Single Color Image

Authors: Jamie Watson, Michael Firman, Aron Monszpart, Gabriel J. Brostow

Abstract: Understanding the shape of a scene from a single color image is a formidable computer vision task. However, most methods aim to predict the geometry of surfaces that are visible to the camera, which is of limited use when planning paths for robots or augmented reality agents. Such agents can only move when grounded on a traversable surface, which we define as the set of classes which humans can also walk over, such as grass, footpaths and pavement. Models which predict beyond the line of sight often parameterize the scene with voxels or meshes, which can be expensive to use in machine learning frameworks. We introduce a model to predict the geometry of both visible and occluded traversable surfaces, given a single RGB image as input. We learn from stereo video sequences, using camera poses, per-frame depth and semantic segmentation to form training data, which is used to supervise an image-to-image network. We train models from the KITTI driving dataset, the indoor Matterport dataset, and from our own casually captured stereo footage. We find that a surprisingly low bar for spatial coverage of training scenes is required. We validate our algorithm against a range of strong baselines, and include an assessment of our predictions for a path-planning task.

Submitted 14 April, 2020; originally announced April 2020.

Comments: Accepted to CVPR 2020 as an oral presentation

arXiv:1909.09051 [pdf, other]

Self-Supervised Monocular Depth Hints

Authors: Jamie Watson, Michael Firman, Gabriel J. Brostow, Daniyar Turmukhambetov

Abstract: Monocular depth estimators can be trained with various forms of self-supervision from binocular-stereo data to circumvent the need for high-quality laser scans or other ground-truth data. The disadvantage, however, is that the photometric reprojection losses used with self-supervised learning typically have multiple local minima. These plausible-looking alternatives to ground truth can restrict what a regression network learns, causing it to predict depth maps of limited quality. As one prominent example, depth discontinuities around thin structures are often incorrectly estimated by current state-of-the-art methods. Here, we study the problem of ambiguous reprojections in depth prediction from stereo-based self-supervision, and introduce Depth Hints to alleviate their effects. Depth Hints are complementary depth suggestions obtained from simple off-the-shelf stereo algorithms. These hints enhance an existing photometric loss function, and are used to guide a network to learn better weights. They require no additional data, and are assumed to be right only sometimes. We show that using our Depth Hints gives a substantial boost when training several leading self-supervised-from-stereo models, not just our own. Further, combined with other good practices, we produce state-of-the-art depth predictions on the KITTI benchmark.

Submitted 19 September, 2019; originally announced September 2019.

Comments: Accepted to ICCV 2019

arXiv:1711.07476 [pdf, other]

Virtual Adversarial Ladder Networks For Semi-supervised Learning

Authors: Saki Shinoda, Daniel E. Worrall, Gabriel J. Brostow

Abstract: Semi-supervised learning (SSL) partially circumvents the high cost of labeling data by augmenting a small labeled dataset with a large and relatively cheap unlabeled dataset drawn from the same distribution. This paper offers a novel interpretation of two deep learning-based SSL approaches, ladder networks and virtual adversarial training (VAT), as applying distributional smoothing to their respective latent spaces. We propose a class of models that fuse these approaches. We achieve near-supervised accuracy with high consistency on the MNIST dataset using just 5 labels per class: our best model, ladder with layer-wise virtual adversarial noise (LVAN-LW), achieves 1.42% +/- 0.12 average error rate on the MNIST test set, in comparison with 1.62% +/- 0.65 reported for the ladder network. On adversarial examples generated with L2-normalized fast gradient method, LVAN-LW trained with 5 examples per class achieves average error rate 2.4% +/- 0.3 compared to 68.6% +/- 6.5 for the ladder network and 9.9% +/- 7.5 for VAT.

Submitted 12 December, 2017; v1 submitted 20 November, 2017; originally announced November 2017.

Comments: Camera-ready version for NIPS 2017 workshop Learning with Limited Labeled Data

arXiv:1710.07307 [pdf, other]

Interpretable Transformations with Encoder-Decoder Networks

Authors: Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, Gabriel J. Brostow

Abstract: Deep feature spaces have the capacity to encode complex transformations of their input data. However, understanding the relative feature-space relationship between two transformed encoded images is difficult. For instance, what is the relative feature space relationship between two rotated images? What is decoded when we interpolate in feature space? Ideally, we want to disentangle confounding factors, such as pose, appearance, and illumination, from object identity. Disentangling these is difficult because they interact in very nonlinear ways. We propose a simple method to construct a deep feature space, with explicitly disentangled representations of several known transformations. A person or algorithm can then manipulate the disentangled representation, for example, to re-render an image with explicit control over parameterized degrees of freedom. The feature space is constructed using a transforming encoder-decoder network with a custom feature transform layer, acting on the hidden representations. We demonstrate the advantages of explicit disentangling on a variety of datasets and transformations, and as an aid for traditional tasks, such as classification.

Submitted 19 October, 2017; originally announced October 2017.

Comments: Accepted at ICCV 2017

arXiv:1705.07273 [pdf, other]

doi 10.1145/3025453.3025880

Responsive Action-based Video Synthesis

Authors: Corneliu Ilisescu, Halil Aytac Kanaci, Matteo Romagnoli, Neill D. F. Campbell, Gabriel J. Brostow

Abstract: We propose technology to enable a new medium of expression, where video elements can be looped, merged, and triggered, interactively. Like audio, video is easy to sample from the real world but hard to segment into clean reusable elements. Reusing a video clip means non-linear editing and compositing with novel footage. The new context dictates how carefully a clip must be prepared, so our end-to-end approach enables previewing and easy iteration. We convert static-camera videos into loopable sequences, synthesizing them in response to simple end-user requests. This is hard because a) users want essentially semantic-level control over the synthesized video content, and b) automatic loop-finding is brittle and leaves users limited opportunity to work through problems. We propose a human-in-the-loop system where adding effort gives the user progressively more creative control. Artists help us evaluate how our trigger interfaces can be used for authoring of videos and video-performances.

Submitted 20 May, 2017; originally announced May 2017.

Comments: 10 pages, 12 figures, 1 table, accepted and published in Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems

ACM Class: H.5.2

arXiv:1612.04642 [pdf, other]

Harmonic Networks: Deep Translation and Rotation Equivariance

Authors: Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, Gabriel J. Brostow

Abstract: Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360-rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient and low computational complexity representation, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.

Submitted 11 April, 2017; v1 submitted 14 December, 2016; originally announced December 2016.

Comments: Submitted to CVPR 2017

arXiv:1611.03906 [pdf, other]

Help, It Looks Confusing: GUI Task Automation Through Demonstration and Follow-up Questions

Authors: Thanapong Intharah, Daniyar Turmukhambetov, Gabriel J. Brostow

Abstract: Non-programming users should be able to create their own customized scripts to perform computer-based tasks for them, just by demonstrating to the machine how it's done. To that end, we develop a system prototype which learns-by-demonstration called HILC (Help, It Looks Confusing). Users train HILC to synthesize a task script by demonstrating the task, which produces the needed screenshots and their corresponding mouse-keyboard signals. After the demonstration, the user answers follow-up questions. We propose a user-in-the-loop framework that learns to generate scripts of actions performed on visible elements of graphical applications. While pure programming-by-demonstration is still unrealistic, we use quantitative and qualitative experiments to show that non-programming users are willing and effective at answering follow-up queries posed by our system. Our models of events and appearance are surprisingly simple, but are combined effectively to cope with varying amounts of supervision. The best available baseline, Sikuli Slides, struggled with the majority of the tests in our user study experiments. The prototype with our proposed approach successfully helped users accomplish simple linear tasks, complicated tasks (monitoring, looping, and mixed), and tasks that span across multiple executables. Even when both systems could ultimately perform a task, ours was trained and refined by the user in less time.

Submitted 13 January, 2017; v1 submitted 11 November, 2016; originally announced November 2016.

Comments: Camera Ready version. Accepted to be presented at the ACM IUI 2017

arXiv:1609.08080 [pdf, other]

Swipe Mosaics from Video

Authors: Malcolm Reynolds, Tom S. F. Haines, Gabriel J. Brostow

Abstract: A panoramic image mosaic is an attractive visualization for viewing many overlapping photos, but its images must be both captured and processed correctly to produce an acceptable composite. We propose Swipe Mosaics, an interactive visualization that places the individual video frames on a 2D planar map that represents the layout of the physical scene. Compared to traditional panoramic mosaics, our capture is easier because the user can both translate the camera center and film moving subjects. Processing and display degrade gracefully if the footage lacks distinct, overlapping, non-repeating texture. Our proposed visual odometry algorithm produces a distribution over (x,y) translations for image pairs. Inferring a distribution of possible camera motions allows us to better cope with parallax, lack of texture, dynamic scenes, and other phenomena that hurt deterministic reconstruction techniques. Robustness is obtained by training on synthetic scenes with known camera motions. We show that Swipe Mosaics are easy to generate, support a wide range of difficult scenes, and are useful for documenting a scene for closer inspection.

Submitted 26 September, 2016; originally announced September 2016.

arXiv:1609.03677 [pdf, other]

Unsupervised Monocular Depth Estimation with Left-Right Consistency

Authors: Clément Godard, Oisin Mac Aodha, Gabriel J. Brostow

Abstract: Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.

Submitted 12 April, 2017; v1 submitted 13 September, 2016; originally announced September 2016.

Comments: CVPR 2017 oral

arXiv:1504.08219 [pdf, other]

Hierarchical Subquery Evaluation for Active Learning on a Graph

Authors: Oisin Mac Aodha, Neill D. F. Campbell, Jan Kautz, Gabriel J. Brostow

Abstract: To train good supervised and semi-supervised object classifiers, it is critical that we not waste the time of the human experts who are providing the training labels. Existing active learning strategies can have uneven performance, being efficient on some datasets but wasteful on others, or inconsistent just between runs on the same dataset. We propose perplexity based graph construction and a new hierarchical subquery evaluation algorithm to combat this variability, and to release the potential of Expected Error Reduction. Under some specific circumstances, Expected Error Reduction has been one of the strongest-performing informativeness criteria for active learning. Until now, it has also been prohibitively costly to compute for sizeable datasets. We demonstrate our highly practical algorithm, comparing it to other active learning measures on classification datasets that vary in sparsity, dimensionality, and size. Our algorithm is consistent over multiple runs and achieves high accuracy, while querying the human expert for labels at a frequency that matches their desired time budget.

Submitted 30 April, 2015; originally announced April 2015.

Comments: CVPR 2014

arXiv:1504.07575 [pdf, other]

Becoming the Expert - Interactive Multi-Class Machine Teaching

Authors: Edward Johns, Oisin Mac Aodha, Gabriel J. Brostow

Abstract: Compared to machines, humans are extremely good at classifying images into categories, especially when they possess prior knowledge of the categories at hand. If this prior information is not available, supervision in the form of teaching images is required. To learn categories more quickly, people should see important and representative images first, followed by less important images later - or not at all. However, image-importance is individual-specific, i.e. a teaching image is important to a student if it changes their overall ability to discriminate between classes. Further, students keep learning, so while image-importance depends on their current knowledge, it also varies with time. In this work we propose an Interactive Machine Teaching algorithm that enables a computer to teach challenging visual concepts to a human. Our adaptive algorithm chooses, online, which labeled images from a teaching set should be shown to the student as they learn. We show that a teaching strategy that probabilistically models the student's ability and progress, based on their correct and incorrect answers, produces better 'experts'. We present results using real human participants across several varied and challenging real-world datasets.

Submitted 28 April, 2015; originally announced April 2015.

Comments: CVPR 2015

arXiv:1502.04983 [pdf, other]

Context Tricks for Cheap Semantic Segmentation

Authors: Thanapong Intharah, Gabriel J. Brostow

Abstract: Accurate semantic labeling of image pixels is difficult because intra-class variability is often greater than inter-class variability. In turn, fast semantic segmentation is hard because accurate models are usually too complicated to also run quickly at test-time. Our experience with building and running semantic segmentation systems has also shown a reasonably obvious bottleneck on model complexity, imposed by small training datasets. We therefore propose two simple complementary strategies that leverage context to give better semantic segmentation, while scaling up or down to train on different-sized datasets. As easy modifications for existing semantic segmentation algorithms, we introduce Decorrelated Semantic Texton Forests, and the Context Sensitive Image Level Prior. The proposed modifications are tested using a Semantic Texton Forest (STF) system, and the modifications are validated on two standard benchmark datasets, MSRC-21 and PascalVOC-2010. In Python based comparisons, our system is insignificantly slower than STF at test-time, yet produces superior semantic segmentations overall, with just push-button training.

Submitted 17 February, 2015; originally announced February 2015.

Comments: Supplementary material can be found at http://www0.cs.ucl.ac.uk/staff/T.Intharah/research.html

Showing 1–19 of 19 results for author: Brostow, G J