-
Does Object Recognition Work for Everyone?
Authors:
Terrance DeVries,
Ishan Misra,
Changhan Wang,
Laurens van der Maaten
Abstract:
The paper analyzes the accuracy of publicly available object-recognition systems on a geographically diverse dataset. This dataset contains household items and was designed to have a more representative geographical coverage than commonly used image datasets in object recognition. We find that the systems perform relatively poorly on household items that commonly occur in countries with a low household income. Qualitative analyses suggest the drop in performance is primarily due to appearance differences within an object class (e.g., dish soap) and due to items appearing in a different context (e.g., toothbrushes appearing outside of bathrooms). The results of our study suggest that further work is needed to make object-recognition systems work equally well for people across different countries and income levels.
Submitted 18 June, 2019; v1 submitted 6 June, 2019;
originally announced June 2019.
-
Scaling and Benchmarking Self-Supervised Visual Representation Learning
Authors:
Priya Goyal,
Dhruv Mahajan,
Abhinav Gupta,
Ishan Misra
Abstract:
Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning - the ability to scale to large amounts of data because self-supervision requires no manual labels. In this work, we revisit this principle and scale two popular self-supervised approaches to 100 million images. We show that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation (3D) and visual navigation using reinforcement learning. Scaling these methods also provides many interesting insights into the limitations of current self-supervised techniques and evaluations. We conclude that current self-supervised methods are not 'hard' enough to take full advantage of large scale data and do not seem to learn effective high level semantic representations. We also introduce an extensive benchmark across 9 different datasets and tasks. We believe that such a benchmark along with comparable evaluation settings is necessary to make meaningful progress. Code is at: https://github.com/facebookresearch/fair_self_supervision_benchmark.
Submitted 6 June, 2019; v1 submitted 3 May, 2019;
originally announced May 2019.
-
Evaluating Text-to-Image Matching using Binary Image Selection (BISON)
Authors:
Hexiang Hu,
Ishan Misra,
Laurens van der Maaten
Abstract:
Providing systems the ability to relate linguistic and visual content is one of the hallmarks of computer vision. Tasks such as text-based image retrieval and image captioning were designed to test this ability but come with evaluation measures that have a high variance or are difficult to interpret. We study an alternative task for systems that match text and images: given a text query, the system is asked to select the image that best matches the query from a pair of semantically similar images. The system's accuracy on this Binary Image SelectiON (BISON) task is interpretable, eliminates the reliability problems of retrieval evaluations, and focuses on the system's ability to understand fine-grained visual structure. We gather a BISON dataset that complements the COCO dataset and use it to evaluate modern text-based image retrieval and image captioning systems. Our results provide novel insights into the performance of these systems. The COCO-BISON dataset and corresponding evaluation code are publicly available from http://hexianghu.com/bison/.
Submitted 5 April, 2019; v1 submitted 19 January, 2019;
originally announced January 2019.
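The binary selection task described in the BISON entry above reduces to a plain accuracy computation over (query, matching image, distractor image) triples. The following is a minimal sketch under stated assumptions: the `score(query, image_id)` callable, the record layout, the toy captions, and the rule that ties count as errors are all hypothetical illustrations, not the released COCO-BISON evaluation code.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical record layout: (text query, matching image id, distractor image id).
Example = Tuple[str, str, str]


def bison_accuracy(examples: Sequence[Example],
                   score: Callable[[str, str], float]) -> float:
    """Fraction of examples where the matching image outscores the distractor.

    Assumption: a tie counts as an error (strict inequality).
    """
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for query, pos, neg in examples
                  if score(query, pos) > score(query, neg))
    return correct / len(examples)


# Toy stand-in for a real text-image matching model: each image is
# represented by a caption, and the score is the word overlap with the query.
TOY_CAPTIONS: Dict[str, str] = {
    "img_a": "a dog on a red couch",
    "img_b": "a dog on a blue couch",
}


def toy_score(query: str, image_id: str) -> float:
    caption_words = set(TOY_CAPTIONS[image_id].split())
    return float(sum(1 for word in query.split() if word in caption_words))
```

Because each example is a two-way choice, a system that guesses at random is expected to score about 0.5, which is what makes the resulting number easy to interpret.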
-
Learning by Asking Questions
Authors:
Ishan Misra,
Ross Girshick,
Rob Fergus,
Martial Hebert,
Abhinav Gupta,
Laurens van der Maaten
Abstract:
We introduce an interactive learning framework for the development and testing of intelligent visual systems, called learning-by-asking (LBA). We explore LBA in the context of the Visual Question Answering (VQA) task. LBA differs from standard VQA training in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting. We present a model that performs LBA on the CLEVR dataset, and show that it automatically discovers an easy-to-hard curriculum when learning interactively from an oracle. Our LBA-generated data consistently matches or outperforms the CLEVR train data and is more sample efficient. We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.
Submitted 4 December, 2017;
originally announced December 2017.
-
Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection
Authors:
Debidatta Dwibedi,
Ishan Misra,
Martial Hebert
Abstract:
A major impediment in rapidly deploying object detection models for instance detection is the lack of large annotated datasets. For example, finding a large labeled dataset containing instances in a particular kitchen is unlikely. Each new environment with new instances requires expensive data collection and annotation. In this paper, we propose a simple approach to generate large annotated instance datasets with minimal effort. Our key insight is that ensuring only patch-level realism provides enough training signal for current object detector models. We automatically `cut' object instances and `paste' them on random backgrounds. A naive way to do this results in pixel artifacts which result in poor performance for trained models. We show how to make detectors ignore these artifacts during training and generate data that gives competitive performance on real data. Our method outperforms existing synthesis approaches and when combined with real images improves relative performance by more than 21% on benchmark datasets. In a cross-domain setting, our synthetic data combined with just 10% real data outperforms models trained on all real data.
Submitted 4 August, 2017;
originally announced August 2017.
-
Visual Storytelling
Authors:
Ting-Hao Huang,
Francis Ferraro,
Nasrin Mostafazadeh,
Ishan Misra,
Aishwarya Agrawal,
Jacob Devlin,
Ross Girshick,
Xiaodong He,
Pushmeet Kohli,
Dhruv Batra,
C. Lawrence Zitnick,
Devi Parikh,
Lucy Vanderwende,
Michel Galley,
Margaret Mitchell
Abstract:
We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.
Submitted 13 April, 2016;
originally announced April 2016.
-
Cross-stitch Networks for Multi-task Learning
Authors:
Ishan Misra,
Abhinav Shrivastava,
Abhinav Gupta,
Martial Hebert
Abstract:
Multi-task learning in Convolutional Networks has displayed remarkable success in the field of recognition. This success can be largely attributed to learning shared representations from multiple supervisory tasks. However, existing multi-task approaches rely on enumerating multiple network architectures specific to the tasks at hand, that do not generalize. In this paper, we propose a principled approach to learn shared representations in ConvNets using multi-task learning. Specifically, we propose a new sharing unit: "cross-stitch" unit. These units combine the activations from multiple networks and can be trained end-to-end. A network with cross-stitch units can learn an optimal combination of shared and task-specific representations. Our proposed method generalizes across multiple tasks and shows dramatically improved performance over baseline methods for categories with few training examples.
Submitted 12 April, 2016;
originally announced April 2016.
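The cross-stitch entry above describes units that "combine the activations from multiple networks". One simple reading is a learned linear mix of the per-task activations at a given layer. The sketch below assumes that reading: the function name, the plain-list representation, and the fixed `alpha` values are illustrative assumptions, and in a trained network `alpha` would be learned end-to-end rather than set by hand.

```python
from typing import List, Sequence


def cross_stitch(activations: Sequence[Sequence[float]],
                 alpha: Sequence[Sequence[float]]) -> List[List[float]]:
    """Mix per-task activations linearly.

    out[i][k] = sum over j of alpha[i][j] * activations[j][k],
    where i and j index tasks and k indexes positions in the activation.
    """
    n_tasks = len(activations)
    if len(alpha) != n_tasks or any(len(row) != n_tasks for row in alpha):
        raise ValueError("alpha must be an n_tasks x n_tasks matrix")
    width = len(activations[0])
    if any(len(act) != width for act in activations):
        raise ValueError("all tasks must have activations of the same size")
    return [
        [sum(alpha[i][j] * activations[j][k] for j in range(n_tasks))
         for k in range(width)]
        for i in range(n_tasks)
    ]


# Two tasks, two activation values each (hypothetical numbers).
x_a = [1.0, 2.0]
x_b = [3.0, 4.0]
mostly_separate = [[0.9, 0.1],
                   [0.2, 0.8]]
mixed = cross_stitch([x_a, x_b], mostly_separate)
```

With an identity `alpha` the two networks stay fully task-specific, and with equal weights they share everything, so the mixing matrix spans the range between task-specific and shared representations that the abstract refers to.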
-
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification
Authors:
Ishan Misra,
C. Lawrence Zitnick,
Martial Hebert
Abstract:
In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive, or better than approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
Submitted 26 July, 2016; v1 submitted 28 March, 2016;
originally announced March 2016.
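The temporal order verification task in the Shuffle and Learn entry above needs training pairs of the form (frame tuple, is-in-order label), which can be produced from raw video without any semantic labels. The sketch below shows one hypothetical way to generate such pairs from frame indices. The uniform sampling, the 50/50 label balance, and treating every non-ascending ordering as a negative are assumptions made for illustration, not the sampling scheme used in the paper.

```python
import random
from typing import List, Optional, Tuple


def sample_order_example(num_frames: int,
                         tuple_len: int = 3,
                         rng: Optional[random.Random] = None
                         ) -> Tuple[List[int], int]:
    """Return (frame_indices, label) with label 1 if the indices ascend.

    Assumption: positives are frames in ascending temporal order and
    negatives are any other permutation of the same frames.
    """
    if tuple_len < 2 or num_frames < tuple_len:
        raise ValueError("need tuple_len >= 2 and num_frames >= tuple_len")
    rng = rng or random.Random()
    ordered = sorted(rng.sample(range(num_frames), tuple_len))
    if rng.random() < 0.5:
        return ordered, 1
    shuffled = ordered[:]
    # Re-shuffle until the order actually differs from the temporal order.
    while shuffled == ordered:
        rng.shuffle(shuffled)
    return shuffled, 0


# Draw a few labeled tuples from a hypothetical 100-frame video.
demo_rng = random.Random(0)
demo_batch = [sample_order_example(100, rng=demo_rng) for _ in range(8)]
```

A binary classifier trained on such pairs receives its supervision entirely from the frame order, which is the sense in which the abstract calls the task unsupervised.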
-
Generating Natural Questions About an Image
Authors:
Nasrin Mostafazadeh,
Ishan Misra,
Jacob Devlin,
Margaret Mitchell,
Xiaodong He,
Lucy Vanderwende
Abstract:
There has been an explosion of work in the vision & language community during the past few years from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.
Submitted 8 June, 2016; v1 submitted 19 March, 2016;
originally announced March 2016.
-
Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels
Authors:
Ishan Misra,
C. Lawrence Zitnick,
Margaret Mitchell,
Ross Girshick
Abstract:
When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.
Submitted 12 April, 2016; v1 submitted 22 December, 2015;
originally announced December 2015.
-
Watch and Learn: Semi-Supervised Learning of Object Detectors from Videos
Authors:
Ishan Misra,
Abhinav Shrivastava,
Martial Hebert
Abstract:
We present a semi-supervised approach that localizes multiple unknown object instances in long videos. We start with a handful of labeled boxes and iteratively learn and label hundreds of thousands of object instances. We propose criteria for reliable object detection and tracking for constraining the semi-supervised learning process and minimizing semantic drift. Our approach does not assume exhaustive labeling of each object instance in any single frame, or any explicit annotation of negative data. Working in such a generic setting allows us to tackle multiple object instances in video, many of which are static. In contrast, existing approaches either do not consider multiple object instances per video, or rely heavily on the motion of the objects present. The experiments demonstrate the effectiveness of our approach by evaluating the automatically labeled data on a variety of metrics like quality, coverage (recall), diversity, and relevance to training an object detector.
Submitted 21 May, 2015;
originally announced May 2015.
-
CPU and/or GPU: Revisiting the GPU Vs. CPU Myth
Authors:
Kishore Kothapalli,
Dip Sankar Banerjee,
P. J. Narayanan,
Surinder Sood,
Aman Kumar Bahl,
Shashank Sharma,
Shrenik Lad,
Krishna Kumar Singh,
Kiran Matam,
Sivaramakrishna Bharadwaj,
Rohit Nigam,
Parikshit Sakurikar,
Aditya Deshpande,
Ishan Misra,
Siddharth Choudhary,
Shubham Gupta
Abstract:
Parallel computing using accelerators has gained widespread research attention in the past few years. In particular, using GPUs for general purpose computing has brought forth several success stories with respect to time taken, cost, power, and other metrics. However, accelerator based computing has significantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching computational resources, it is important to also include CPUs in the computation. We call this the hybrid computing model. Indeed, most computer systems of the present age offer a degree of heterogeneity and therefore such a model is quite natural.
We reevaluate the claim of a recent paper by Lee et al. (ISCA 2010). We argue that the right question arising out of Lee et al. (ISCA 2010) should be how to use a CPU+GPU platform efficiently, instead of whether one should use a CPU or a GPU exclusively. To this end, we experiment with a set of 13 diverse workloads ranging from databases, image processing, sparse matrix kernels, and graphs. We experiment with two different hybrid platforms: one consisting of a 6-core Intel i7-980X CPU and an NVidia Tesla T10 GPU, and another consisting of an Intel E7400 dual core CPU with an NVidia GT520 GPU. On both these platforms, we show that hybrid solutions offer good advantage over CPU or GPU alone solutions. On both these platforms, we also show that our solutions are 90% resource efficient on average.
Our work therefore suggests that hybrid computing can offer tremendous advantages at not only research-scale platforms but also the more realistic scale systems with significant performance gains and resource efficiency to the large scale user community.
Submitted 9 March, 2013;
originally announced March 2013.
-
Load Balancing with Reduced Unnecessary Handoff in Energy Efficient Macro/Femto-cell based BWA Networks
Authors:
Prasun Chowdhury,
Anindita Kundu,
Iti Saha Misra,
Salil K Sanyal
Abstract:
The hierarchical macro/femto cell based BWA networks are observed to be quite promising for mobile operators as they improve network coverage and capacity at the outskirts of the macro cell. However, this new technology introduces an increased number of macro/femto handoffs and wastage of electrical energy, which in turn may affect the system performance. Users moving with high velocity or undergoing real-time transmission suffer degraded performance due to a huge number of unnecessary macro/femto handoffs. On the other hand, a huge amount of electrical energy is wasted when a femto BS is active in the network but remains unutilized due to low network load. Our proposed energy efficient handoff decision algorithm eliminates unnecessary handoffs while balancing the load of the macro and femto cells at minimal energy consumption. The performance of the proposed algorithm is analyzed using a Continuous Time Markov Chain (CTMC) model. In addition, we have also contributed a method to determine the balanced threshold level of the received signal strength (RSS) from the macro base station (BS). The balanced threshold level provides equal load distribution of the mobile users to the macro and femto BSs. The balanced threshold level is evaluated based on the distant location of the femto cells for small-scale networks. Numerical analysis shows that a threshold level above the balanced threshold results in higher load distribution of the mobile users to the femto BSs.
Submitted 11 July, 2012;
originally announced July 2012.
-
Cross Layer QoS Support Architecture with Integrated CAC and Scheduling Algorithms for WiMAX BWA Networks
Authors:
Prasun Chowdhury,
Iti Saha Misra,
Salil K Sanyal
Abstract:
In this paper, a new technique for cross layer design, based on the present Eb/N0 (bit energy per noise density) ratio of the connections and target values of the Quality of Service (QoS) information parameters from the MAC layer, is proposed to dynamically select the Modulation and Coding Scheme (MCS) at the PHY layer for WiMAX Broadband Wireless Access (BWA) networks. The QoS information parameters include New Connection Blocking Probability (NCBP), Hand off Connection Dropping Probability (HCDP) and Connection Outage Probability (COP). In addition, a Signal to Interference plus Noise Ratio (SINR) based Call Admission Control (CAC) algorithm and a Queue based Scheduling algorithm are integrated for the cross layer design. An analytical model using the Continuous Time Markov Chain (CTMC) is developed for performance evaluation of the algorithms under various MCS. The effect of Eb/N0 is observed for QoS information parameters in order to determine its optimum range. Simulation results show that the integrated CAC and packet Scheduling model maximizes the bandwidth utilization and fair allocation of the system resources for all types of MCS and guarantees the QoS to the connections.
Submitted 7 April, 2012;
originally announced April 2012.
-
VoIP Call Optimization in Diverse Network Scenarios Using Learning Based State-Space Search Technique
Authors:
Tamal Chakraborty,
Atri Mukhopadhyay,
Iti Saha Misra,
Salil Kumar Sanyal
Abstract:
A VoIP based call has stringent QoS requirements with respect to delay, jitter, loss, MOS and R-Factor. Various QoS mechanisms implemented to satisfy these requirements must be adaptive under diverse network scenarios and applied in proper sequence, otherwise they may conflict with each other. The objective of this paper is to address the problem of adaptive QoS maintenance and sequential execution of available QoS implementation mechanisms with respect to VoIP under varying network conditions. In this paper, we generalize this problem as a state-space problem and solve it. First, we map the problem of QoS optimization into the state-space domain and apply incremental heuristic search. We implement the proposed algorithm under various network and user scenarios in a VoIP test-bed for QoS enhancement. A learning strategy is then implemented to refine the knowledge base and improve call quality over time. Finally, we discuss the advantages and uniqueness of our approach.
Submitted 17 November, 2011;
originally announced November 2011.
-
A Fair and Efficient Packet Scheduling Scheme for IEEE 802.16 Broadband Wireless Access Systems
Authors:
Prasun Chowdhury,
Iti Saha Misra
Abstract:
This paper proposes a fair and efficient QoS scheduling scheme for IEEE 802.16 BWA systems that satisfies both throughput and delay guarantees for various real and non-real time applications. The proposed QoS scheduling scheme is compared with an existing QoS scheduling scheme proposed in the recent literature. Simulation results show that the proposed scheduling scheme can provide a tight QoS guarantee in terms of delay, delay violation rate and throughput for all types of traffic as defined in the WiMAX standard, thereby maintaining fairness and helping to eliminate starvation of lower priority class services. Bandwidth utilization of the system and the fairness index of the resources are also examined to validate the QoS provided by our proposed scheduling scheme.
Submitted 30 September, 2010;
originally announced September 2010.
-
Use of Service Curve for Resource Reservation in Wired-cum-Wireless Scenario
Authors:
Nitul Dutta,
Iti Saha Misra
Abstract:
In a network, the arrival process is converted into the departure process through network elements. The departure process suffers propagation delay in the link, processing delay at network elements such as routers, and data loss due to buffer overflow or congestion. To provide guaranteed service, resources need to be reserved before the conversation takes place. To reserve such resources, estimating them is indispensable. The idea of a service curve gives deterministic values of these parameters beforehand. In this paper, we aim to determine the minimum and maximum buffer space required in the router and the minimum link capacity required to guarantee a pre-specified end-to-end delay for an ongoing session in a wired-cum-wireless scenario by analyzing the minimum and maximum service curves. We assume that the network we are analyzing is an IP based mobile network. The findings of the work are presented in the form of tables which can be used for resource reservation to offer quality service to end-users.
Submitted 7 March, 2010;
originally announced March 2010.