Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Authors: Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi

Abstract: The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task's broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 16 pages, 6 figures

ACM Class: I.2.7

arXiv:2406.10889 [pdf, other]

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Authors: Darshana Saravanan, Darshan Singh, Varun Gupta, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

Abstract: Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 26 pages, 17 figures, 3 tables

arXiv:2402.04835 [pdf, other]

Pseudo-labelling meets Label Smoothing for Noisy Partial Label Learning

Authors: Darshana Saravanan, Naresh Manwani, Vineet Gandhi

Abstract: Partial label learning (PLL) is a weakly-supervised learning paradigm where each training instance is paired with a set of candidate labels (partial label), one of which is the true label. Noisy PLL (NPLL) relaxes this constraint by allowing some partial labels to not contain the true label, enhancing the practicality of the problem. Our work centres on NPLL and presents a minimalistic framework that initially assigns pseudo-labels to images by exploiting the noisy partial labels through a weighted nearest neighbour algorithm. These pseudo-label and image pairs are then used to train a deep neural network classifier with label smoothing. The classifier's features and predictions are subsequently employed to refine and enhance the accuracy of pseudo-labels. We perform thorough experiments on seven datasets and compare against nine NPLL and PLL methods. We achieve state-of-the-art results in all studied settings from the prior literature, obtaining substantial gains in fine-grained classification and extreme noise scenarios. Further, we show the promising generalisation capability of our framework in realistic crowd-sourced datasets.

Submitted 28 May, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Comments: 7 tables, 2 figures

arXiv:2311.15581 [pdf, other]

Real Time GAZED: Online Shot Selection and Editing of Virtual Cameras from Wide-Angle Monocular Video Recordings

Authors: Sudheer Achary, Rohit Girmaji, Adhiraj Anil Deshmukh, Vineet Gandhi

Abstract: Eliminating time-consuming post-production processes and delivering high-quality videos in today's fast-paced digital landscape are the key advantages of real-time approaches. To address these needs, we present Real Time GAZED: a real-time adaptation of the GAZED framework integrated with CineFilter, a novel real-time camera trajectory stabilization approach. It enables users to create professionally edited videos in real-time. Comparative evaluations against baseline methods, including the non-real-time GAZED, demonstrate that Real Time GAZED achieves similar editing results, ensuring high-quality video output. Furthermore, a user study confirms the aesthetic quality of the video edits produced by the Real Time GAZED approach. With these advancements in real-time camera trajectory optimization and video editing presented, the demand for immediate and dynamic content creation in industries such as live broadcasting, sports coverage, news reporting, and social media content creation can be met more efficiently.

Submitted 27 November, 2023; originally announced November 2023.

arXiv:2307.01233 [pdf, other]

RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations

Authors: Neha Sahipjohn, Neil Shah, Vishal Tambrahalli, Vineet Gandhi

Abstract: Significant progress has been made in speaker dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that the direct mel-prediction hampers training/model efficiency due to the entanglement of speech content with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, achieving state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2305.12363 [pdf, other]

doi 10.1109/RO-MAN57019.2023.10309534

Instance-Level Semantic Maps for Vision Language Navigation

Authors: Laksh Nanwani, Anmol Agarwal, Kanishk Jain, Raghav Prabhakar, Aaron Monis, Aditya Mathur, Krishna Murthy, Abdul Hafez, Vineet Gandhi, K. Madhava Krishna

Abstract: Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment, allowing them to navigate on-demand when given linguistic instructions. A natural goal in Vision Language Navigation (VLN) research is to impart autonomous agents with similar capabilities. Recent works take a step towards this goal by creating a semantic spatial map representation of the environment without any labeled data. However, their representations are limited for practical applicability as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into spatial map representation using a community detection algorithm and utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves the navigation performance by two-fold (233%) on realistic language commands with instance-specific descriptions compared to the baseline. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.

Submitted 1 July, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Journal ref: IEEE RO-MAN 2023

arXiv:2305.11926 [pdf, other]

MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting

Authors: Neil Shah, Vishal Tambrahalli, Saiteja Kosgi, Niranjan Pedanekar, Vineet Gandhi

Abstract: We present MParrotTTS, a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model that can produce high-quality speech. Benefiting from a modularized training paradigm exploiting self-supervised speech representations, MParrotTTS adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on any bilingual or parallel examples, MParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker's voice and accent. We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis. The proposed model outperforms the state-of-the-art multilingual TTS models and baselines, using only a small fraction of supervised training data. Speech samples from our model can be found at https://paper2438.github.io/tts/

Submitted 19 May, 2023; originally announced May 2023.

Comments: 5 pages, 1 figure

arXiv:2303.01261 [pdf, other]

ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations

Authors: Neil Shah, Saiteja Kosgi, Vishal Tambrahalli, Neha Sahipjohn, Niranjan Pedanekar, Vineet Gandhi

Abstract: We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker's voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual TTS models using only a fraction of the paired data that the latter use.

Submitted 16 December, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

arXiv:2302.00368 [pdf, other]

Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification

Authors: Kanishk Jain, Shyamgopal Karthik, Vineet Gandhi

Abstract: We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces average mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE make it practical to use with any off-the-shelf trained model to improve its predictions further.

Submitted 30 October, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

Comments: 8 pages, 2 figures, 3 tables, Accepted at NeurIPS 2023

arXiv:2212.04003 [pdf, other]

A Systematic Literature Review On Privacy Of Deep Learning Systems

Authors: Vishal Jignesh Gandhi, Sanchit Shokeen, Saloni Koshti

Abstract: The last decade has seen a rise of Deep Learning with its applications ranging across diverse domains. However, the datasets used to drive these systems usually contain data which is highly confidential and sensitive. Deep Learning models can be stolen or reverse engineered, confidential training data can be inferred, and other privacy and security concerns have been identified. Therefore, these systems are highly prone to security attacks. This study surveys academic research on the several types of security attacks and provides a comprehensive overview of the most widely used privacy-preserving solutions. This systematic evaluation also illuminates potential future possibilities for study, instruction, and usage in the fields of privacy and deep learning.

Submitted 7 December, 2022; originally announced December 2022.

Comments: 14 pages, 3 figures, 5 tables

arXiv:2211.14915 [pdf, other]

doi 10.1016/j.jmps.2023.105239

Mesoscale shock structure in particulate composites

Authors: Suraj Ravindran, Vatsa Gandhi, Barry Lawlor, Guruswami Ravichandran

Abstract: Multiscale experiments in heterogeneous materials and the knowledge of their physics under shock compression are limited. This study examines the multiscale shock response of particulate composites comprised of soda-lime glass particles in a PMMA matrix using full-field high-speed digital image correlation (DIC) for the first time. Normal plate impact experiments, and complementary numerical simulations, are conducted at stresses ranging from $1.1$ to $3.1$ GPa to elucidate the mesoscale mechanisms responsible for the distinct shock structure observed in particulate composites. The particle velocity from the macroscopic measurement at continuum scale shows a relatively smooth velocity profile, with shock thickness decreasing with an increase in shock stress, and the composite exhibits strain rate scaling as the second power of the shock stress. In contrast, the mesoscopic response was highly heterogeneous, which led to a rough shock front and the formation of a train of weak shocks traveling at different velocities. Additionally, the normal shock was seen to diffuse the momentum in the transverse direction, affecting the shock rise and the rounding-off observed at the continuum scale measurements. The numerical simulations indicate that the reflections at the interfaces, wave scattering, and interference of these reflected waves are the primary mechanisms for the observed rough shock fronts.

Submitted 27 November, 2022; originally announced November 2022.

arXiv:2210.12568 [pdf, other]

doi 10.1063/5.0131590

Three dimensional full-field velocity measurements in shock compression experiments using stereo digital image correlation

Authors: Suraj Ravindran, Vatsa Gandhi, Akshay Joshi, Guruswami Ravichandran

Abstract: Shock compression plate impact experiments conventionally rely on point-wise velocimetry measurements based on laser-based interferometric techniques. This study presents an experimental methodology to measure the free surface full-field particle velocity in shock compression experiments using high-speed imaging and three-dimensional (3D) digital image correlation (DIC). The experimental setup has a temporal resolution of 100 ns with a spatial resolution varying from 90 to 200 $\mu$m/pixel. Experiments were conducted under three different plate impact configurations to measure spatially resolved free surface velocity and validate the experimental technique. First, a normal impact experiment was conducted on polycarbonate to measure the macroscopic full-field normal free surface velocity. Second, an isentropic compression experiment on a Y-cut quartz-tungsten carbide assembly was performed to measure the particle velocity for experiments involving ramp compression waves. Third, to explore the capability of the technique in multi-axial loading conditions, a pressure shear plate impact experiment was conducted to measure both the normal and transverse free surface velocities under combined normal and shear loading. The velocities measured in the experiments using digital image correlation are validated against previous data obtained from laser interferometry. Numerical simulations were also performed using established material models to compare and validate the experimental velocity profiles for these different impact configurations. The novel ability of the employed experimental setup to measure full-field free surface velocities with high spatial resolutions in shock compression experiments is demonstrated for the first time in this work.

Submitted 22 October, 2022; originally announced October 2022.

Comments: The first two authors contributed equally to this work

arXiv:2209.11972 [pdf, other]

Ground then Navigate: Language-guided Navigation in Dynamic Scenes

Authors: Kanishk Jain, Varun Chhangani, Amogh Tiwari, K. Madhava Krishna, Vineet Gandhi

Abstract: We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node selection problem, given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity in action space, provides interpretability through visual feedback and allows VLN on commands requiring finer manoeuvres like "park between the two cars". Furthermore, we propose a novel meta-dataset CARLA-NAV to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach.

Submitted 24 September, 2022; originally announced September 2022.

arXiv:2112.13031 [pdf, other]

doi 10.1109/IROS51168.2021.9636172

Grounding Linguistic Commands to Navigable Regions

Authors: Nivedita Rufus, Kanishk Jain, Unni Krishnan R Nair, Vineet Gandhi, K Madhava Krishna

Abstract: Humans have a natural ability to effortlessly comprehend linguistic commands such as "park next to the yellow sedan" and instinctively know which region of the road the vehicle should navigate. Extending this ability to autonomous vehicles is the next step towards creating fully autonomous agents that respond and act according to human commands. To this end, we propose the novel task of Referring Navigable Regions (RNR), i.e., grounding regions of interest for navigation based on the linguistic command. RNR is different from Referring Image Segmentation (RIS), which focuses on grounding an object referred to by the natural language expression instead of grounding a navigable region. For example, for a command "park next to the yellow sedan," RIS will aim to segment the referred sedan, and RNR aims to segment the suggested parking region on the road. We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2car dataset with segmentation masks for the regions described by the linguistic commands. A separate test split with concise manoeuvre-oriented commands is provided to assess the practicality of our dataset. We benchmark the proposed dataset using a novel transformer-based architecture. We present extensive ablations and show superior performance over baselines on multiple evaluation metrics. A downstream path planner generating trajectories based on RNR outputs confirms the efficacy of the proposed framework.

Submitted 24 December, 2021; originally announced December 2021.

Journal ref: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 8593-8600

arXiv:2111.04730 [pdf, other]

doi 10.21437/Interspeech.2021-307

Emotional Prosody Control for Speech Generation

Authors: Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi

Abstract: Machine-generated speech is characterized by its limited or unnatural emotional variation. Current text-to-speech systems generate speech with either a flat emotion, emotion selected from a predefined set, average variation learned from prosody sequences in training data or transferred from a source style. We propose a text-to-speech (TTS) system, where a user can choose the emotion of generated speech from a continuous and meaningful emotion space (Arousal-Valence space). The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion. We show that the system works on emotions unseen during training and can scale to previously unseen speakers given a sample of their speech. Our work expands the horizon of the state-of-the-art FastSpeech2 backbone to a multi-speaker setting and gives it much-coveted continuous (and interpretable) affective control, without any observable degradation in the quality of the synthesized speech.

Submitted 7 November, 2021; originally announced November 2021.

arXiv:2110.07981 [pdf, other]

Reappraising Domain Generalization in Neural Networks

Authors: Sarath Sivaprasad, Akshay Goindani, Vaibhav Garg, Ritam Basu, Saiteja Kosgi, Vineet Gandhi

Abstract: Given that Neural Networks generalize unreasonably well in the IID setting (with benign overfitting and betterment in performance with more parameters), OOD presents a consistent failure case to better the understanding of how they learn. This paper focuses on Domain Generalization (DG), which is perceived as the front face of OOD generalization. We find that the presence of multiple domains incentivizes domain agnostic learning and is the primary reason for generalization in Traditional DG. We show that the state-of-the-art results can be obtained by borrowing ideas from IID generalization and the DG-tailored methods fail to add any performance gains. Furthermore, we perform explorations beyond the Traditional DG (TDG) formulation and propose a novel ClassWise DG (CWDG) benchmark, where for each class, we randomly select one of the domains and keep it aside for testing. Despite being exposed to all domains during training, CWDG is more challenging than TDG evaluation. We propose a novel iterative domain feature masking approach, achieving state-of-the-art results on the CWDG benchmark. Overall, while explaining these observations, our work furthers insights into the learning mechanisms of neural networks.

Submitted 28 April, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

arXiv:2109.12227 [pdf, other]

Bringing Generalization to Deep Multi-View Pedestrian Detection

Authors: Jeet Vora, Swetanjal Dutta, Kanishk Jain, Shyamgopal Karthik, Vineet Gandhi

Abstract: Multi-view Detection (MVD) is highly effective for occlusion reasoning in a crowded environment. While recent works using deep learning have made significant advances in the field, they have overlooked the generalization aspect, which makes them impractical for real-world deployment. The key novelty of our work is to formalize three critical forms of generalization and propose experiments to evaluate them: generalization with i) a varying number of cameras, ii) varying camera positions, and finally, iii) to new scenes. We find that existing state-of-the-art models show poor generalization by overfitting to a single scene and camera configuration. To address the concerns: (a) we propose a novel Generalized MVD (GMVD) dataset, assimilating diverse scenes with changing daytime, camera configurations, varying number of cameras, and (b) we discuss the properties essential to bring generalization to MVD and propose a barebones model to incorporate them. We perform a comprehensive set of experiments on the WildTrack, MultiViewX, and the GMVD datasets to motivate the necessity to evaluate the generalization abilities of MVD methods and to demonstrate the efficacy of the proposed approach. The code and the proposed dataset can be found at https://github.com/jeetv/GMVD

Submitted 13 March, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

arXiv:2107.14688 [pdf, other]

doi 10.1109/ICRA.2012.6224771

High-Resolution Depth Maps Based on TOF-Stereo Fusion

Authors: Vineet Gandhi, Jan Cech, Radu Horaud

Abstract: The combination of range sensors with color cameras can be very useful for robot navigation, semantic perception, manipulation, and telepresence. Several methods of combining range- and color-data have been investigated and successfully used in various robotic applications. Most of these systems suffer from the problems of noise in the range-data and resolution mismatch between the range sensor and the color cameras, since the resolution of current range sensors is much less than the resolution of color cameras. High-resolution depth maps can be obtained using stereo matching, but this often fails to construct accurate depth maps of weakly/repetitively textured scenes, or if the scene exhibits complex self-occlusions. Range sensors provide coarse depth information regardless of presence/absence of texture. The use of a calibrated system, composed of a time-of-flight (TOF) camera and of a stereoscopic camera pair, allows data fusion thus overcoming the weaknesses of both individual sensors. We propose a novel TOF-stereo fusion method based on an efficient seed-growing algorithm which uses the TOF data projected onto the stereo image pair as an initial set of correspondences. These initial "seeds" are then propagated based on a Bayesian model which combines an image similarity score with rough depth priors computed from the low-resolution range data. The overall result is a dense and accurate depth map at the resolution of the color cameras at hand. We show that the proposed algorithm outperforms 2D image-based stereo algorithms and that the results are of higher resolution than off-the-shelf color-range sensors, e.g., Kinect. Moreover, the algorithm potentially exhibits real-time performance on a single CPU.

Submitted 30 July, 2021; originally announced July 2021.

Comments: IEEE International Conference on Robotics and Automation, 2012

arXiv:2104.10412 [pdf, other]

doi 10.18653/v1/2022.findings-acl.270

Comprehensive Multi-Modal Interactions for Referring Image Segmentation

Authors: Kanishk Jain, Vineet Gandhi

Abstract: We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intramodal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods.

Submitted 14 August, 2022; v1 submitted 21 April, 2021; originally announced April 2021.

Comments: Findings of ACL 2022

Journal ref: 2022.findings-acl.270

arXiv:2104.00795 [pdf, other]

No Cost Likelihood Manipulation at Test Time for Making Better Mistakes in Deep Networks

Authors: Shyamgopal Karthik, Ameya Prabhu, Puneet K. Dokania, Vineet Gandhi

Abstract: There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining mistake-severity distributions of the top-1 prediction, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reason for the reduction in average mistake-severity can be attributed to the increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. To this end, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. It significantly outperforms the state-of-the-art and consistently obtains large reductions in the average hierarchical distance of top-$k$ predictions across datasets, with very little loss in accuracy. CRM, because of its simplicity, can be used with any off-the-shelf trained model that provides reliable likelihood estimates.
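Editorial illustration (not from the paper): the abstract above describes CRM as choosing, at inference time, the class with the lowest expected cost given a cost matrix and the network's likelihoods. A minimal sketch of that textbook decision rule follows. The function name crm_predict, the toy three-class cost matrix, and the probabilities are all hypothetical and are not taken from the authors' code.

```python
def crm_predict(probs, cost):
    """Return the class index with the lowest expected cost.

    probs[j]   : estimated likelihood of class j (assumed to sum to 1)
    cost[k][j] : cost of predicting class k when the true class is j
                 (e.g. a hierarchical distance between labels)
    """
    num_classes = len(probs)
    # Expected cost (conditional risk) of committing to each candidate class.
    risks = [
        sum(cost[k][j] * probs[j] for j in range(num_classes))
        for k in range(num_classes)
    ]
    return min(range(num_classes), key=risks.__getitem__)


# Hypothetical toy hierarchy: classes 0 and 1 are siblings, class 2 is far away.
hier_cost = [
    [0, 1, 2],
    [1, 0, 2],
    [2, 2, 0],
]
# A 0-1 cost matrix, under which the rule reduces to plain argmax.
zero_one_cost = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
probs = [0.35, 0.25, 0.40]

hier_choice = crm_predict(probs, hier_cost)
argmax_choice = crm_predict(probs, zero_one_cost)
```

With the hierarchical costs the rule picks class 0 rather than the argmax class 2, because most of the probability mass sits on the two sibling classes; with 0-1 costs it falls back to the usual argmax prediction.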

Submitted 1 April, 2021; originally announced April 2021.

arXiv:2012.06170 [pdf, other]

ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Authors: Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, Vineet Gandhi

Abstract: We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first network to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models (Tsiami et al., 2020) for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.

Submitted 7 August, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Comments: Appearing in the proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2021) (camera-ready version)

arXiv:2010.11886 [pdf, other]

doi 10.1145/3313831.3376544

GAZED- Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings

Authors: K L Bhanu Moorthy, Moneish Kumar, Ramanathan Subramaniam, Vineet Gandhi

Abstract: We present GAZED- eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions to generate an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. Effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 users and 12 performance videos.

Submitted 22 October, 2020; originally announced October 2020.

Comments: 10 pages

Journal ref: In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI '20). Association for Computing Machinery, New York, NY, USA, 1-11

arXiv:2009.06066 [pdf, other]

Cosine meets Softmax: A tough-to-beat baseline for visual grounding

Authors: Nivedita Rufus, Unni Krishnan R Nair, K. Madhava Krishna, Vineet Gandhi

Abstract: In this paper, we present a simple baseline for visual grounding for autonomous driving which outperforms the state-of-the-art methods, while retaining minimal design choices. Our framework minimizes the cross-entropy loss over the cosine distance between multiple image ROI features with a text embedding (representing the given sentence/phrase). We use pre-trained networks for obtaining the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. Our investigation suggests reconsideration towards more approaches employing sophisticated attention mechanisms or multi-stage reasoning or complex metric learning loss functions by showing promise in simpler alternatives.

Submitted 13 September, 2020; originally announced September 2020.

arXiv:2006.05103 [pdf, other]

The Curious Case of Convex Neural Networks

Authors: Sarath Sivaprasad, Ankur Singh, Naresh Manwani, Vineet Gandhi

Abstract: In this paper, we investigate a constrained formulation of neural networks where the output is a convex function of the input. We show that the convexity constraints can be enforced on both fully connected and convolutional layers, making them applicable to most architectures. The convexity constraints include restricting the weights (for all but the first layer) to be non-negative and using a non-decreasing convex activation function. Albeit simple, these constraints have profound implications on the generalization abilities of the network. We draw three valuable insights: (a) Input Output Convex Neural Networks (IOC-NNs) self regularize and reduce the problem of overfitting; (b) Although heavily constrained, they outperform the base multi layer perceptrons and achieve similar performance as compared to base convolutional architectures and (c) IOC-NNs show robustness to noise in train labels. We demonstrate the efficacy of the proposed idea using thorough experiments and ablation studies on standard image classification datasets with three different neural network architectures.

Submitted 10 July, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

Comments: 20 pages, accepted at ECML-PKDD

arXiv:2006.02609 [pdf, other]

Simple Unsupervised Multi-Object Tracking

Authors: Shyamgopal Karthik, Ameya Prabhu, Vineet Gandhi

Abstract: Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets. In this work, we remove the need for annotated datasets by proposing an unsupervised re-identification network, thus sidestepping the labeling costs required for training entirely. Given unlabeled videos, our proposed method (SimpleReID) first generates tracking labels using SORT and trains a ReID network to predict the generated labels using cross-entropy loss. We demonstrate that SimpleReID performs substantially better than simpler alternatives, and we recover the full performance of its supervised counterpart consistently across diverse tracking frameworks. The observations are unusual because unsupervised ReID is not expected to excel in crowded scenarios with occlusions and drastic viewpoint changes. By incorporating our unsupervised SimpleReID with CenterTrack trained on augmented still images, we establish a new state-of-the-art performance on popular datasets like MOT16/17 without using tracking supervision, beating current best (CenterTrack) by 0.2-0.3 MOTA and 4.4-4.8 IDF1 scores. We further provide evidence for limited scope for improvement in IDF1 scores beyond our unsupervised ReID in the studied settings. Our investigation suggests reconsideration towards more sophisticated, supervised, end-to-end trackers by showing promise in simpler unsupervised alternatives.

Submitted 3 June, 2020; originally announced June 2020.

arXiv:2003.05970 [pdf, other]

LiDAR guided Small obstacle Segmentation

Authors: Aasheesh Singh, Aditya Kamireddypalli, Vineet Gandhi, K Madhava Krishna

Abstract: Detecting small obstacles on the road is critical for autonomous driving. In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR (VLP-16) and monocular vision. LiDAR is employed to provide additional context in the form of confidence maps to monocular segmentation networks. We show significant performance gains when the context is fed as an additional input to monocular semantic segmentation frameworks. We further present a new semantic segmentation dataset to the community, comprising over 3000 image frames with corresponding LiDAR observations. The images come with pixel-wise annotations of three classes: off-road, road, and small obstacle. We stress that precise calibration between LiDAR and camera is crucial for this task and thus propose a novel Hausdorff distance based calibration refinement method over extrinsic parameters. As a first benchmark over this dataset, we report our results with 73% instance detection up to a distance of 50 meters on challenging scenarios. Qualitatively, by showcasing accurate segmentation of obstacles smaller than 15 cm at 50 m depth, and quantitatively, through favourable comparisons vis-a-vis prior art, we vindicate the method's efficacy. Our project page and dataset are hosted at https://small-obstacle-dataset.github.io/

Submitted 12 March, 2020; originally announced March 2020.

Comments: 8 pages, Submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020

arXiv:2003.04942 [pdf, other]

Tidying Deep Saliency Prediction Architectures

Authors: Navyasri Reddy, Samyak Jain, Pradeep Yarlagadda, Vineet Gandhi

Abstract: Learning computational models for visual attention (saliency estimation) is an effort to inch machines/robots closer to human visual cognitive abilities. Data-driven efforts have dominated the landscape since the introduction of deep neural network architectures. In deep learning research, the choices in architecture design are often empirical and frequently lead to more complex models than necessary. The complexity, in turn, hinders the application requirements. In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions. We review the existing state-of-the-art models on these four components and propose novel and simpler alternatives. As a result, we propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture and brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts parameters of a GMM distribution and is aimed at bringing more interpretability to the prediction maps. The proposed saliency models run inference at 25fps, making them suitable for real-time applications. Code and pre-trained models are available at https://github.com/samyak0210/saliency.

Submitted 10 March, 2020; originally announced March 2020.

arXiv:1912.05636 [pdf, ps, other]

CineFilter: Unsupervised Filtering for Real Time Autonomous Camera Systems

Authors: Sudheer Achary, K L Bhanu Moorthy, Syed Ashar Javed, Nikita Shravan, Vineet Gandhi, Anoop Namboodiri

Abstract: Autonomous camera systems are often subjected to an optimization/filtering operation to smoothen and stabilize the rough trajectory estimates. Most common filtering techniques do reduce the irregularities in data; however, they fail to mimic the behavior of a human cameraman. Global filtering methods modeling human camera operators have been successful; however, they are limited to offline settings. In this paper, we propose two online filtering methods called Cinefilters, which produce smooth camera trajectories that are motivated by cinematographic principles. The first filter (CineConvex) uses a sliding window-based convex optimization formulation, and the second (CineCNN) is a CNN based encoder-decoder model. We evaluate the proposed filters in two different settings, namely a basketball dataset and a stage performance dataset. Our models outperform previous methods and baselines on both quantitative and qualitative metrics. The CineConvex and CineCNN filters operate at about 250fps and 1000fps, respectively, with a minor latency (half a second), making them apt for a variety of real-time applications.

Submitted 27 May, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

arXiv:1910.12273 [pdf, other]

Exploring 3 R's of Long-term Tracking: Re-detection, Recovery and Reliability

Authors: Shyamgopal Karthik, Abhinav Moudgil, Vineet Gandhi

Abstract: Recent works have proposed several long term tracking benchmarks and highlight the importance of moving towards long-duration tracking to bridge the gap with application requirements. The current evaluation methodologies, however, do not focus on several aspects that are crucial in a long term perspective like Re-detection, Recovery, and Reliability. In this paper, we propose novel evaluation strategies for a more in-depth analysis of trackers from a long-term perspective. More specifically, (a) we test re-detection capability of the trackers in the wild by simulating virtual cuts, (b) we investigate the role of chance in the recovery of tracker after failure and (c) we propose a novel metric allowing visual inference on the ability of a tracker to track contiguously (without any failure) at a given accuracy. We present several original insights derived from an extensive set of quantitative and qualitative experiments.

Submitted 27 October, 2019; originally announced October 2019.

arXiv:1812.00739 [pdf, other]

Nose, eyes and ears: Head pose estimation by locating facial keypoints

Authors: Aryaman Gupta, Kalpit Thakkar, Vineet Gandhi, P J Narayanan

Abstract: Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles for pose (yaw, pitch, roll) from an input image of a human face. Annotating ground truth head pose angles for images in the wild is difficult and requires ad-hoc fitting procedures (which provide only coarse and approximate annotations). This highlights the need for approaches which can train on data captured in a controlled environment and generalize to images in the wild (with varying appearance and illumination of the face). Most present-day deep learning approaches which learn a regression function directly on the input images fail to do so. To this end, we propose to use a higher level representation to regress the head pose while using deep learning architectures. More specifically, we use the uncertainty maps in the form of 2D soft localization heatmap images over five facial keypoints, namely left ear, right ear, left eye, right eye and nose, and pass them through a convolutional neural network to regress the head pose. We show head pose estimation results on two challenging benchmarks, BIWI and AFLW, and our approach surpasses the state of the art on both datasets.

Submitted 3 December, 2018; originally announced December 2018.

Comments: 4 pages, ICASSP 2019

arXiv:1807.03125 [pdf, other]

Watch to Edit: Video Retargeting using Gaze

Authors: Kranthi Kumar, Moneish Kumar, Vineet Gandhi, Ramanathan Subramanian

Abstract: We present a novel approach to optimally retarget videos for varied displays with differing aspect ratios by preserving salient scene content discovered via eye tracking. Our algorithm performs editing with cut, pan and zoom operations by optimizing the path of a cropping window within the original video while seeking to (i) preserve salient regions, and (ii) adhere to the principles of cinematography. Our approach is (a) content agnostic as the same methodology is employed to re-edit a wide-angle video recording or a close-up movie sequence captured with a static or moving camera, and (b) independent of video length and can in principle re-edit an entire movie in one shot. Our algorithm consists of two steps. The first step employs gaze transition cues to detect time stamps where new cuts are to be introduced in the original video via dynamic programming. A subsequent step optimizes the cropping window path (to create pan and zoom effects), while accounting for the original and new cuts. The cropping window path is designed to include maximum gaze information, and is composed of piecewise constant, linear and parabolic segments. It is obtained via L(1) regularized convex optimization which ensures a smooth viewing experience. We test our approach on a wide variety of videos and demonstrate significant improvement over the state-of-the-art, both in terms of computational complexity and qualitative aspects. A study performed with 16 users confirms that our approach results in a superior viewing experience as compared to gaze driven re-editing and letterboxing methods, especially for wide-angle static camera recordings.

Submitted 27 June, 2018; originally announced July 2018.

Journal ref: Computer Graphics Forum, Volume 37, Issue 2 (2018), 205-215

arXiv:1803.06508 [pdf, other]

MergeNet: A Deep Net Architecture for Small Obstacle Discovery

Authors: Krishnam Gupta, Syed Ashar Javed, Vineet Gandhi, K. Madhava Krishna

Abstract: We present a novel network architecture called MergeNet for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less data, since the physical setup and the annotation process for small obstacles are hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGBD input and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and is able to achieve state-of-the-art results with just 135 images in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples.

Submitted 17 March, 2018; originally announced March 2018.

arXiv:1803.06506 [pdf, other]

Learning Unsupervised Visual Grounding Through Semantic Self-Supervision

Authors: Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi

Abstract: Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities. In the unsupervised setting, lack of supervisory signals exacerbates this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding which uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions which can explain some semantic property in the data, in our case, the property being the presence of a concept in a set of images. We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on the Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset and performance comparable to the state of the art on the Flickr30k dataset.

Submitted 16 November, 2018; v1 submitted 17 March, 2018; originally announced March 2018.

Comments: NIPS Workshop 2018

arXiv:1712.01358 [pdf, other]

Long-Term Visual Object Tracking Benchmark

Authors: Abhinav Moudgil, Vineet Gandhi

Abstract: We propose a new long video dataset (called Track Long and Prosper - TLP) and benchmark for single object tracking. The dataset consists of 50 HD videos from real world scenarios, encompassing a duration of over 400 minutes (676K frames), making it more than 20-fold larger in average duration per sequence and more than 8-fold larger in terms of total covered duration, as compared to existing generic datasets for visual tracking. The proposed dataset paves a way to suitably assess long term tracking performance and train better deep learning architectures (avoiding/reducing augmentation, which may not reflect real world behaviour). We benchmark the dataset on 17 state-of-the-art trackers and rank them according to tracking accuracy and run time speeds. We further present thorough qualitative and quantitative evaluation highlighting the importance of the long term aspect of tracking. Our most interesting observations are (a) existing short sequence benchmarks fail to bring out the inherent differences in tracking algorithms, which widen while tracking on long sequences, and (b) the accuracy of trackers abruptly drops on challenging long sequences, suggesting the potential need for research efforts in the direction of long-term tracking.

Submitted 1 January, 2019; v1 submitted 4 December, 2017; originally announced December 2017.

Comments: ACCV 2018 (Oral)

arXiv:1703.01437 [pdf, other]

Automated Top View Registration of Broadcast Football Videos

Authors: Rahul Anand Sharma, Bharath Bhat, Vineet Gandhi, C. V. Jawahar

Abstract: In this paper, we propose a novel method to register football broadcast video frames on the static top view model of the playing surface. The proposed method is fully automatic in contrast to the current state of the art which requires manual initialization of point correspondences between the image and the static model. Automatic registration using existing approaches has been difficult due to the lack of sufficient point correspondences. We investigate an alternate approach exploiting the edge information from the line markings on the field. We formulate the registration problem as a nearest neighbour search over a synthetically generated dictionary of edge map and homography pairs. The synthetic dictionary generation allows us to exhaustively cover a wide variety of camera angles and positions and reduce this problem to a minimal per-frame edge map matching procedure. We show that the per-frame results can be improved in videos using an optimization framework for temporal camera stabilization. We demonstrate the efficacy of our approach by presenting extensive results on a dataset collected from matches of football World Cup 2014.

Submitted 4 March, 2017; originally announced March 2017.

arXiv:1508.07593 [pdf, other]

The Prose Storyboard Language: A Tool for Annotating and Directing Movies

Authors: Remi Ronfard, Vineet Gandhi, Laurent Boiron, Vaishnavi Ameya Murukutla

Abstract: The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence. The language uses a simple syntax and limited vocabulary borrowed from working practices in traditional movie-making, and is intended to be readable both by machines and humans. The language is designed to serve as a high-level user interface for intelligent cinematography and editing systems.

Submitted 29 April, 2022; v1 submitted 30 August, 2015; originally announced August 2015.

Comments: 20 pages, extended version includes new figures and references

Showing 1–36 of 36 results for author: Gandhi, V