Search | arXiv e-print repository

arXiv:2405.19773 [pdf, other]

VQA Training Sets are Self-play Environments for Generating Few-shot Pools

Authors: Tautvydas Misiunas, Hassan Mansoor, Jasper Uijlings, Oriana Riva, Victor Carbune

Abstract: Large-language models and large-vision models are increasingly capable of solving compositional reasoning tasks, as measured by breakthroughs in visual-question answering benchmarks. However, state-of-the-art solutions often involve careful construction of large pre-training and fine-tuning datasets, which can be expensive. The use of external tools, whether other ML models, search engines, or API… ▽ More Large-language models and large-vision models are increasingly capable of solving compositional reasoning tasks, as measured by breakthroughs in visual-question answering benchmarks. However, state-of-the-art solutions often involve careful construction of large pre-training and fine-tuning datasets, which can be expensive. The use of external tools, whether other ML models, search engines, or APIs, can significantly improve performance by breaking down high-level reasoning questions into sub-questions that are answerable by individual tools, but this approach has similar dataset construction costs to teach fine-tuned models how to use the available tools. We propose a technique in which existing training sets can be directly used for constructing computational environments with task metrics as rewards. This enables a model to autonomously teach itself to use itself or another model as a tool. By doing so, we augment training sets by integrating external signals. The proposed method starts with zero-shot prompts and iteratively refines them by selecting few-shot examples that maximize the task metric on the training set. Our experiments showcase how Gemini learns how to use itself, or another smaller and specialized model such as ScreenAI, to iteratively improve performance on training sets. Our approach successfully generalizes and improves upon zeroshot performance on charts, infographics, and document visual question-answering datasets △ Less

Submitted 30 May, 2024; originally announced May 2024.

arXiv:2404.05465 [pdf, other]

HAMMR: HierArchical MultiModal React agents for generic VQA

Authors: Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, Jasper Uijlings

Abstract: Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal probl… ▽ More Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical MultiModal React. We start from a multimodal ReAct-based system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model by 5.0%. △ Less

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2310.06641 [pdf, other]

How (not) to ensemble LVLMs for VQA

Authors: Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink

Abstract: This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wik… ▽ More This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively these models are highly complementary, which should make them ideal for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (best possible ensemble). So it is a trivial exercise to create an ensemble with substantial real gains. Or is it? △ Less

Submitted 7 December, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: 4th I Can't Believe It's Not Better Workshop (co-located with NeurIPS 2023)

arXiv:2306.09224 [pdf, other]

Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

Authors: Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari

Abstract: We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evi… ▽ More We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [14] is state-of-the-art on OK-VQA [37], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. It is available at https://github.com/google-research/google-research/tree/master/encyclopedic_vqa . △ Less

Submitted 24 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: ICCV'23

arXiv:2206.04453 [pdf, other]

The Missing Link: Finding label relations across datasets

Authors: Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Abstract: Computer vision is driven by the many datasets available for training or evaluating novel methods. However, each dataset has a different set of class labels, visual definition of classes, images following a specific distribution, annotation protocols, etc. In this paper we explore the automatic discovery of visual-semantic relations between labels across datasets. We aim to understand how instance… ▽ More Computer vision is driven by the many datasets available for training or evaluating novel methods. However, each dataset has a different set of class labels, visual definition of classes, images following a specific distribution, annotation protocols, etc. In this paper we explore the automatic discovery of visual-semantic relations between labels across datasets. We aim to understand how instances of a certain class in a dataset relate to the instances of another class in another dataset. Are they in an identity, parent/child, overlap relation? Or is there no link between them at all? To find relations between labels across datasets, we propose methods based on language, on vision, and on their combination. We show that we can effectively discover label relations across datasets, as well as their type. We apply our method to four applications: understand label relations, identify missing aspects, increase label specificity, and predict transfer learning gains. We conclude that label relations cannot be established by looking at the names of classes alone, as they depend strongly on how each of the datasets was constructed. △ Less

Submitted 9 August, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: ECCV 2022

arXiv:2204.01403 [pdf, other]

How stable are Transferability Metrics evaluations?

Authors: Andrea Agostinelli, Michal Pándy, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Abstract: Transferability metrics is a maturing field with increasing interest, which aims at providing heuristics for selecting the most suitable source models to transfer to a given target dataset, without fine-tuning them all. However, existing works rely on custom experimental setups which differ across papers, leading to inconsistent conclusions about which transferability metrics work best. In this pa… ▽ More Transferability metrics is a maturing field with increasing interest, which aims at providing heuristics for selecting the most suitable source models to transfer to a given target dataset, without fine-tuning them all. However, existing works rely on custom experimental setups which differ across papers, leading to inconsistent conclusions about which transferability metrics work best. In this paper we conduct a large-scale study by systematically constructing a broad range of 715k experimental setup variations. We discover that even small variations to an experimental setup lead to different conclusions about the superiority of a transferability metric over another. Then we propose better evaluations by aggregating across many experiments, enabling to reach more stable conclusions. As a result, we reveal the superiority of LogME at selecting good source datasets to transfer from in a semantic segmentation scenario, NLEEP at selecting good source architectures in an image classification scenario, and GBC at determining which target task benefits most from a given source model. Yet, no single transferability metric works best in all scenarios. △ Less

Submitted 20 October, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

Comments: ECCV 2022

arXiv:2111.13011 [pdf, other]

Transferability Metrics for Selecting Source Model Ensembles

Authors: Andrea Agostinelli, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Abstract: We address the problem of ensemble selection in transfer learning: Given a large pool of source models we want to select an ensemble of models which, after fine-tuning on the target training set, yields the best performance on the target test set. Since fine-tuning all possible ensembles is computationally prohibitive, we aim at predicting performance on the target dataset using a computationally… ▽ More We address the problem of ensemble selection in transfer learning: Given a large pool of source models we want to select an ensemble of models which, after fine-tuning on the target training set, yields the best performance on the target test set. Since fine-tuning all possible ensembles is computationally prohibitive, we aim at predicting performance on the target dataset using a computationally efficient transferability metric. We propose several new transferability metrics designed for this task and evaluate them in a challenging and realistic transfer learning setup for semantic segmentation: we create a large and diverse pool of source models by considering 17 source datasets covering a wide variety of image domain, two different architectures, and two pre-training schemes. Given this pool, we then automatically select a subset to form an ensemble performing well on a given target dataset. We compare the ensemble selected by our method to two baselines which select a single source model, either (1) from the same pool as our method; or (2) from a pool containing large source models, each with similar capacity as an ensemble. Averaged over 17 target datasets, we outperform these baselines by 6.0% and 2.5% relative mean IoU, respectively. △ Less

Submitted 31 March, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

arXiv:2111.12780 [pdf, other]

Transferability Estimation using Bhattacharyya Class Separability

Authors: Michal Pándy, Andrea Agostinelli, Jasper Uijlings, Vittorio Ferrari, Thomas Mensink

Abstract: Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can be easily adapted to. In this work, we propose Gaussian Bhattacharyya… ▽ More Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can be easily adapted to. In this work, we propose Gaussian Bhattacharyya Coefficient (GBC), a novel method for quantifying transferability between a source model and a target dataset. In a first step we embed all target images in the feature space defined by the source model, and represent them with per-class Gaussians. Then, we estimate their pairwise class separability using the Bhattacharyya coefficient, yielding a simple and effective measure of how well the source model transfers to the target task. We evaluate GBC on image classification tasks in the context of dataset and architecture selection. Further, we also perform experiments on the more complex semantic segmentation transferability estimation task. We demonstrate that GBC outperforms state-of-the-art transferability metrics on most evaluation criteria in the semantic segmentation settings, matches the performance of top methods for dataset transferability in image classification, and performs best on architecture selection problems for image classification. △ Less

Submitted 11 April, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

Comments: Accepted for CVPR 2022

arXiv:2103.13318 [pdf, other]

Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types

Authors: Thomas Mensink, Jasper Uijlings, Alina Kuznetsova, Michael Gygli, Vittorio Ferrari

Abstract: Transfer learning enables to re-use knowledge learned on a source task to help learning a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tune on any target task. However, previous systematic studies of transfer learning have been limited and the cir… ▽ More Transfer learning enables to re-use knowledge learned on a source task to help learning a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models, i.e. pre-training a model for image classification on the ILSVRC dataset, and then fine-tune on any target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this paper we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output tasks types relevant to modern computer vision applications. In total we carry out over 2000 transfer learning experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyze these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations: (1) for most tasks there exists a source which significantly outperforms ILSVRC'12 pre-training; (2) the image domain is the most important factor for achieving positive transfer; (3) the source dataset should \emph{include} the image domain of the target dataset to achieve best results; (4) at the same time, we observe only small negative effects when the image domain of the source task is much broader than that of the target; (5) transfer across task types can be beneficial, but its success is heavily dependent on both the source and target task types. △ Less

Submitted 20 November, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: Accepted for future publication in TPAMI

arXiv:2004.03898 [pdf, other]

Towards Reusable Network Components by Learning Compatible Representations

Authors: Michael Gygli, Jasper Uijlings, Vittorio Ferrari

Abstract: This paper proposes to make a first step towards compatible and hence reusable network components. Rather than training networks for different tasks independently, we adapt the training process to produce network components that are compatible across tasks. In particular, we split a network into two components, a features extractor and a target task head, and propose various approaches to accompli… ▽ More This paper proposes to make a first step towards compatible and hence reusable network components. Rather than training networks for different tasks independently, we adapt the training process to produce network components that are compatible across tasks. In particular, we split a network into two components, a features extractor and a target task head, and propose various approaches to accomplish compatibility between them. We systematically analyse these approaches on the task of image classification on standard datasets. We demonstrate that we can produce components which are directly compatible without any fine-tuning or compromising accuracy on the original tasks. Afterwards, we demonstrate the use of compatible components on three applications: Unsupervised domain adaptation, transferring classifiers across feature extractors with different architectures, and increasing the computational efficiency of transfer learning. △ Less

Submitted 16 December, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: Preprint; To be presented at AAAI 2021

arXiv:1912.03098 [pdf, other]

Connecting Vision and Language with Localized Narratives

Authors: Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, Vittorio Ferrari

Abstract: We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a… ▽ More We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning. △ Less

Submitted 20 July, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

Comments: ECCV 2020 Camera Ready

arXiv:1911.12709 [pdf, other]

Continuous Adaptation for Interactive Object Segmentation by Learning from Corrections

Authors: Theodora Kontogianni, Michael Gygli, Jasper Uijlings, Vittorio Ferrari

Abstract: In interactive object segmentation a user collaborates with a computer vision model to segment an object. Recent works employ convolutional neural networks for this task: Given an image and a set of corrections made by the user as input, they output a segmentation mask. These approaches achieve strong performance by training on large datasets but they keep the model parameters unchanged at test ti… ▽ More In interactive object segmentation a user collaborates with a computer vision model to segment an object. Recent works employ convolutional neural networks for this task: Given an image and a set of corrections made by the user as input, they output a segmentation mask. These approaches achieve strong performance by training on large datasets but they keep the model parameters unchanged at test time. Instead, we recognize that user corrections can serve as sparse training examples and we propose a method that capitalizes on that idea to update the model parameters on-the-fly to the data at hand. Our approach enables the adaptation to a particular object and its background, to distributions shifts in a test set, to specific object classes, and even to large domain changes, where the imaging modality changes between training and testing. We perform extensive experiments on 8 diverse datasets and show: Compared to a model with frozen parameters, our method reduces the required corrections (i) by 9%-30% when distribution shifts are small between training and testing; (ii) by 12%-44% when specializing to a specific class; (iii) and by 60% and 77% when we completely change domain between training and testing. △ Less

Submitted 8 November, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

Comments: ECCV 2020 Camera Ready

arXiv:1906.06798 [pdf, other]

Panoptic Image Annotation with a Collaborative Assistant

Authors: Jasper R. R. Uijlings, Mykhaylo Andriluka, Vittorio Ferrari

Abstract: This paper aims to reduce the time to annotate images for panoptic segmentation, which requires annotating segmentation masks and class labels for all object instances and stuff regions. We formulate our approach as a collaborative process between an annotator and an automated assistant who take turns to jointly annotate an image using a predefined pool of segments. Actions performed by the annota… ▽ More This paper aims to reduce the time to annotate images for panoptic segmentation, which requires annotating segmentation masks and class labels for all object instances and stuff regions. We formulate our approach as a collaborative process between an annotator and an automated assistant who take turns to jointly annotate an image using a predefined pool of segments. Actions performed by the annotator serve as a strong contextual signal. The assistant intelligently reacts to this signal by annotating other parts of the image on its own, which reduces the amount of work required by the annotator. We perform thorough experiments on the COCO panoptic dataset, both in simulation and with human annotators. These demonstrate that our approach is significantly faster than the recent machine-assisted interface of [4], and 2.4x to 5x faster than manual polygon drawing. Finally, we show on ADE20k that our method can be used to efficiently annotate new datasets, bootstrap** from a very small amount of annotated data. △ Less

Submitted 15 December, 2020; v1 submitted 16 June, 2019; originally announced June 2019.

arXiv:1812.01888 [pdf, other]

Interactive Full Image Segmentation by Considering All Regions Jointly

Authors: Eirikur Agustsson, Jasper R. R. Uijlings, Vittorio Ferrari

Abstract: We address interactive full image annotation, where the goal is to accurately segment all object and stuff regions in an image. We propose an interactive, scribble-based annotation framework which operates on the whole image to produce segmentations for all regions. This enables sharing scribble corrections across regions, and allows the annotator to focus on the largest errors made by the machine… ▽ More We address interactive full image annotation, where the goal is to accurately segment all object and stuff regions in an image. We propose an interactive, scribble-based annotation framework which operates on the whole image to produce segmentations for all regions. This enables sharing scribble corrections across regions, and allows the annotator to focus on the largest errors made by the machine across the whole image. To realize this, we adapt Mask-RCNN into a fast interactive segmentation framework and introduce an instance-aware loss measured at the pixel-level in the full image canvas, which lets predictions for nearby regions properly compete for space. Finally, we compare to interactive single object segmentation on the COCO panoptic dataset. We demonstrate that our interactive full image segmentation approach leads to a 5% IoU gain, reaching 90% IoU at a budget of four extreme clicks and four corrective scribbles per region. △ Less

Submitted 10 April, 2019; v1 submitted 5 December, 2018; originally announced December 2018.

Comments: Accepted to CVPR 2019

arXiv:1811.00982 [pdf, other]

doi 10.1007/s11263-020-01316-z

The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale

Authors: Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, Vittorio Ferrari

Abstract: We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an in… ▽ More We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection. △ Less

Submitted 21 February, 2020; v1 submitted 2 November, 2018; originally announced November 2018.

Comments: Accepted to International Journal of Computer Vision, 2020

arXiv:1806.07527 [pdf, other]

doi 10.1145/3240508.3241916

Fluid Annotation: A Human-Machine Collaboration Interface for Full Image Annotation

Authors: Mykhaylo Andriluka, Jasper R. R. Uijlings, Vittorio Ferrari

Abstract: We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid annotation is based on three principles: (I) Strong Machine-Learning aid. We start from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new… ▽ More We introduce Fluid Annotation, an intuitive human-machine collaboration interface for annotating the class label and outline of every object and background region in an image. Fluid annotation is based on three principles: (I) Strong Machine-Learning aid. We start from the output of a strong neural network model, which the annotator can edit by correcting the labels of existing regions, adding new regions to cover missing objects, and removing incorrect regions. The edit operations are also assisted by the model. (II) Full image annotation in a single pass. As opposed to performing a series of small annotation tasks in isolation, we propose a unified interface for full image annotation in a single pass. (III) Empower the annotator. We empower the annotator to choose what to annotate and in which order. This enables concentrating on what the machine does not already know, i.e. putting human effort only on the errors it made. This helps using the annotation budget effectively. Through extensive experiments on the COCO+Stuff dataset, we demonstrate that Fluid Annotation leads to accurate annotations very efficiently, taking three times less annotation time than the popular LabelMe interface. △ Less

Submitted 20 December, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

Comments: ACM MultiMedia 2018. Live demo is available at fluidann.appspot.com

arXiv:1712.08087 [pdf, other]

Learning Intelligent Dialogs for Bounding Box Annotation

Authors: Ksenia Konyushkova, Jasper Uijlings, Christoph Lampert, Vittorio Ferrari

Abstract: We introduce Intelligent Annotation Dialogs for bounding box annotation. We train an agent to automatically choose a sequence of actions for a human annotator to produce a bounding box in a minimal amount of time. Specifically, we consider two actions: box verification, where the annotator verifies a box generated by an object detector, and manual box drawing. We explore two kinds of agents, one b… ▽ More We introduce Intelligent Annotation Dialogs for bounding box annotation. We train an agent to automatically choose a sequence of actions for a human annotator to produce a bounding box in a minimal amount of time. Specifically, we consider two actions: box verification, where the annotator verifies a box generated by an object detector, and manual box drawing. We explore two kinds of agents, one based on predicting the probability that a box will be positively verified, and the other based on reinforcement learning. We demonstrate that (1) our agents are able to learn efficient annotation strategies in several scenarios, automatically adapting to the image difficulty, the desired quality of the boxes, and the detector strength; (2) in all scenarios the resulting annotation dialogs speed up annotation compared to manual box drawing alone and box verification alone, while also outperforming any fixed combination of verification and drawing in most scenarios; (3) in a realistic scenario where the detector is iteratively re-trained, our agents evolve a series of strategies that reflect the shifting trade-off between verification and drawing as the detector grows stronger. △ Less

Submitted 20 November, 2018; v1 submitted 21 December, 2017; originally announced December 2017.

Comments: This paper appeared at CVPR 2018

arXiv:1708.06128 [pdf, other]

Revisiting knowledge transfer for training object class detectors

Authors: Jasper Uijlings, Stefan Popov, Vittorio Ferrari

Abstract: We propose to revisit knowledge transfer for training object detectors on target classes from weakly supervised training images, helped by a set of source classes with bounding-box annotations. We present a unified knowledge transfer framework based on training a single neural network multi-class object detector over all source classes, organized in a semantic hierarchy. This generates proposals w… ▽ More We propose to revisit knowledge transfer for training object detectors on target classes from weakly supervised training images, helped by a set of source classes with bounding-box annotations. We present a unified knowledge transfer framework based on training a single neural network multi-class object detector over all source classes, organized in a semantic hierarchy. This generates proposals with scores at multiple levels in the hierarchy, which we use to explore knowledge transfer over a broad range of generality, ranging from class-specific (bicycle to motorbike) to class-generic (objectness to any class). Experiments on the 200 object classes in the ILSVRC 2013 detection dataset show that our technique: (1) leads to much better performance on the target classes (70.3% CorLoc, 36.9% mAP) than a weakly supervised baseline which uses manually engineered objectness [11] (50.5% CorLoc, 25.4% mAP). (2) delivers target object detectors reaching 80% of the mAP of their fully supervised counterparts. (3) outperforms the best reported transfer learning results on this dataset (+41% CorLoc and +3% mAP over [18, 46], +16.2% mAP over [32]). Moreover, we also carry out several across-dataset knowledge transfer experiments [27, 24, 35] and find that (4) our technique outperforms the weakly supervised baseline in all dataset pairs by 1.5x-1.9x, establishing its general applicability. △ Less

Submitted 28 March, 2018; v1 submitted 21 August, 2017; originally announced August 2017.

Comments: CVPR 18

arXiv:1708.02750 [pdf, other]

Extreme clicking for efficient object annotation

Authors: Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, Vittorio Ferrari

Abstract: Manually annotating object bounding boxes is central to building computer vision datasets, and it is very time consuming (annotating ILSVRC [53] took 35s for one high-quality box [62]). It involves clicking on imaginary corners of a tight box around the object. This is difficult as these corners are often outside the actual object and several adjustments are required to obtain a tight box. We prop… ▽ More Manually annotating object bounding boxes is central to building computer vision datasets, and it is very time consuming (annotating ILSVRC [53] took 35s for one high-quality box [62]). It involves clicking on imaginary corners of a tight box around the object. This is difficult as these corners are often outside the actual object and several adjustments are required to obtain a tight box. We propose extreme clicking instead: we ask the annotator to click on four physical points on the object: the top, bottom, left- and right-most points. This task is more natural and these points are easy to find. We crowd-source extreme point annotations for PASCAL VOC 2007 and 2012 and show that (1) annotation time is only 7s per box, 5x faster than the traditional way of drawing boxes [62]; (2) the quality of the boxes is as good as the original ground-truth drawn the traditional way; (3) detectors trained on our annotations are as accurate as those trained on the original ground-truth. Moreover, our extreme clicking strategy not only yields box coordinates, but also four accurate boundary points. We show (4) how to incorporate them into GrabCut to obtain more accurate segmentations than those delivered when initializing it from bounding boxes; (5) semantic segmentations models trained on these segmentations outperform those trained on segmentations derived from bounding boxes. △ Less

Submitted 9 August, 2017; originally announced August 2017.

Comments: ICCV 2017

arXiv:1707.05847 [pdf, other]

The Devil is in the Decoder: Classification, Regression and GANs

Authors: Zbigniew Wojna, Vittorio Ferrari, Sergio Guadarrama, Nathan Silberman, Liang-Chieh Chen, Alireza Fathi, Jasper Uijlings

Abstract: Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image. Models for such problems usually consist of encoders which decrease spatial resolution while learning a high-dimensional representation, followed by decoders who recover the original input resolution and result in low-dimensional predictions. While encoders… ▽ More Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image. Models for such problems usually consist of encoders which decrease spatial resolution while learning a high-dimensional representation, followed by decoders who recover the original input resolution and result in low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. This paper presents an extensive comparison of a variety of decoders for a variety of pixel-wise tasks ranging from classification, regression to synthesis. Our contributions are: (1) Decoders matter: we observe significant variance in results between different types of decoders on various problems. (2) We introduce new residual-like connections for decoders. (3) We introduce a novel decoder: bilinear additive upsampling. (4) We explore prediction artifacts. △ Less

Submitted 19 February, 2019; v1 submitted 18 July, 2017; originally announced July 2017.

arXiv:1704.06189 [pdf, other]

Training object class detectors with click supervision

Authors: Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, Vittorio Ferrari

Abstract: Training object class detectors typically requires a large set of images with objects annotated by bounding boxes. However, manually drawing bounding boxes is very time consuming. In this paper we greatly reduce annotation time by proposing center-click annotations: we ask annotators to click on the center of an imaginary bounding box which tightly encloses the object instance. We then incorporate… ▽ More Training object class detectors typically requires a large set of images with objects annotated by bounding boxes. However, manually drawing bounding boxes is very time consuming. In this paper we greatly reduce annotation time by proposing center-click annotations: we ask annotators to click on the center of an imaginary bounding box which tightly encloses the object instance. We then incorporate these clicks into existing Multiple Instance Learning techniques for weakly supervised object localization, to jointly localize object bounding boxes over all training images. Extensive experiments on PASCAL VOC 2007 and MS COCO show that: (1) our scheme delivers high-quality detectors, performing substantially better than those produced by weakly supervised techniques, with a modest extra annotation effort; (2) these detectors in fact perform in a range close to those trained from manually drawn bounding boxes; (3) as the center-click task is very fast, our scheme reduces total annotation time by 9x to 18x. △ Less

Submitted 19 May, 2017; v1 submitted 20 April, 2017; originally announced April 2017.

Comments: CVPR 2017

arXiv:1612.03716 [pdf, other]

COCO-Stuff: Thing and Stuff Classes in Context

Authors: Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Abstract: Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While lots of classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow to explain important aspects of an image, including (1) scene type; (2) wh… ▽ More Semantic classes can be either things (objects with a well-defined shape, e.g. car, person) or stuff (amorphous background regions, e.g. grass, sky). While lots of classification and detection works focus on thing classes, less attention has been given to stuff classes. Nonetheless, stuff classes are important as they allow to explain important aspects of an image, including (1) scene type; (2) which thing classes are likely to be present and their location (through contextual reasoning); (3) physical attributes, material types and geometric properties of the scene. To understand stuff and things in context we introduce COCO-Stuff, which augments all 164K images of the COCO 2017 dataset with pixel-wise annotations for 91 stuff classes. We introduce an efficient stuff annotation protocol based on superpixels, which leverages the original thing annotations. We quantify the speed versus quality trade-off of our protocol and explore the relation between annotation time and boundary complexity. Furthermore, we use COCO-Stuff to analyze: (a) the importance of stuff and thing classes in terms of their surface cover and how frequently they are mentioned in image captions; (b) the spatial relations between stuff and things, highlighting the rich contextual relations that make our dataset unique; (c) the performance of a modern semantic segmentation method on stuff and thing classes, and whether stuff is easier to segment than things. △ Less

Submitted 28 March, 2018; v1 submitted 12 December, 2016; originally announced December 2016.

Comments: CVPR 2018 camera-ready

arXiv:1607.07671 [pdf, other]

Region-based semantic segmentation with end-to-end training

Authors: Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Abstract: We propose a novel method for semantic segmentation, the task of labeling each pixel in an image with a semantic class. Our method combines the advantages of the two main competing paradigms. Methods based on region classification offer proper spatial support for appearance measurements, but typically operate in two separate stages, none of which targets pixel labeling performance at the end of th… ▽ More We propose a novel method for semantic segmentation, the task of labeling each pixel in an image with a semantic class. Our method combines the advantages of the two main competing paradigms. Methods based on region classification offer proper spatial support for appearance measurements, but typically operate in two separate stages, none of which targets pixel labeling performance at the end of the pipeline. More recent fully convolutional methods are capable of end-to-end training for the final pixel labeling, but resort to fixed patches as spatial support. We show how to modify modern region-based approaches to enable end-to-end training for semantic segmentation. This is achieved via a differentiable region-to-pixel layer and a differentiable free-form Region-of-Interest pooling layer. Our method improves the state-of-the-art in terms of class-average accuracy with 64.0% on SIFT Flow and 49.9% on PASCAL Context, and is particularly accurate at object boundaries. △ Less

Submitted 26 July, 2016; originally announced July 2016.

Comments: ECCV 2016 camera-ready

arXiv:1602.08405 [pdf, other]

We don't need no bounding-boxes: Training object class detectors using only human verification

Authors: Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, Vittorio Ferrari

Abstract: Training object class detectors typically requires a large set of images in which objects are annotated by bounding-boxes. However, manually drawing bounding-boxes is very time consuming. We propose a new scheme for training object detectors which only requires annotators to verify bounding-boxes produced automatically by the learning algorithm. Our scheme iterates between re-training the detector… ▽ More Training object class detectors typically requires a large set of images in which objects are annotated by bounding-boxes. However, manually drawing bounding-boxes is very time consuming. We propose a new scheme for training object detectors which only requires annotators to verify bounding-boxes produced automatically by the learning algorithm. Our scheme iterates between re-training the detector, re-localizing objects in the training images, and human verification. We use the verification signal both to improve re-training and to reduce the search space for re-localisation, which makes these steps different to what is normally done in a weakly supervised setting. Extensive experiments on PASCAL VOC 2007 show that (1) using human verification to update detectors and reduce the search space leads to the rapid production of high-quality bounding-box annotations; (2) our scheme delivers detectors performing almost as good as those trained in a fully supervised setting, without ever drawing any bounding-box; (3) as the verification task is very quick, our scheme substantially reduces total annotation time by a factor 6x-9x. △ Less

Submitted 24 April, 2017; v1 submitted 26 February, 2016; originally announced February 2016.

Comments: CVPR 2016, pp. 854-863. Las Vegas, NV

arXiv:1507.01581 [pdf, other]

Joint Calibration for Semantic Segmentation

Authors: Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Abstract: Semantic segmentation is the task of assigning a class-label to each pixel in an image. We propose a region-based semantic segmentation framework which handles both full and weak supervision, and addresses three common problems: (1) Objects occur at multiple scales and therefore we should use regions at multiple scales. However, these regions are overlap** which creates conflicting class predict… ▽ More Semantic segmentation is the task of assigning a class-label to each pixel in an image. We propose a region-based semantic segmentation framework which handles both full and weak supervision, and addresses three common problems: (1) Objects occur at multiple scales and therefore we should use regions at multiple scales. However, these regions are overlap** which creates conflicting class predictions at the pixel-level. (2) Class frequencies are highly imbalanced in realistic datasets. (3) Each pixel can only be assigned to a single class, which creates competition between classes. We address all three problems with a joint calibration method which optimizes a multi-class loss defined over the final pixel-level output labeling, as opposed to simply region classification. Our method outperforms the state-of-the-art on the popular SIFT Flow [18] dataset in both the fully and weakly supervised setting by a considerably margin (+6% and +10%, respectively). △ Less

Submitted 12 August, 2015; v1 submitted 6 July, 2015; originally announced July 2015.

Comments: Includes improved results based on VGG16 CNN

MSC Class: 68T45

Journal ref: BMVC 2015

arXiv:1504.06434 [pdf, other]

Situational Object Boundary Detection

Authors: Jasper Uijlings, Vittorio Ferrari

Abstract: Intuitively, the appearance of true object boundaries varies from image to image. Hence the usual monolithic approach of training a single boundary predictor and applying it to all images regardless of their content is bound to be suboptimal. In this paper we therefore propose situational object boundary detection: We first define a variety of situations and train a specialized object boundary det… ▽ More Intuitively, the appearance of true object boundaries varies from image to image. Hence the usual monolithic approach of training a single boundary predictor and applying it to all images regardless of their content is bound to be suboptimal. In this paper we therefore propose situational object boundary detection: We first define a variety of situations and train a specialized object boundary detector for each of them using [Dollar and Zitnick 2013]. Then given a test image, we classify it into these situations using its context, which we model by global image appearance. We apply the corresponding situational object boundary detectors, and fuse them based on the classification probabilities. In experiments on ImageNet, Microsoft COCO, and Pascal VOC 2012 segmentation we show that our situational object boundary detection gives significant improvements over a monolithic approach. Additionally, our method substantially outperforms [Hariharan et al. 2011] on semantic contour detection on their SBD dataset. △ Less

Submitted 24 April, 2015; originally announced April 2015.

Showing 1–26 of 26 results for author: Uijlings, J