Showing 1–50 of 60 results for author: Cherian, A

  1. arXiv:2406.15736

    cs.LG cs.AI cs.CL cs.CV

    Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads

    Authors: Anoop Cherian, Kuan-Chuan Peng, Suhas Lohit, Joanna Matthiesen, Kevin Smith, Joshua B. Tenenbaum

    Abstract: Recent years have seen significant progress in the general-purpose problem-solving abilities of large vision and language models (LVLMs), such as ChatGPT, Gemini, etc.; some of these breakthroughs even seem to enable AI models to outperform human abilities in varied tasks that demand higher-order cognitive skills. Are the current large AI models indeed capable of generalized problem solving as h…

    Submitted 22 June, 2024; originally announced June 2024.

  2. arXiv:2404.16306

    cs.CV

    TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

    Authors: Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

    Abstract: Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  3. arXiv:2312.10571

    cs.RO cs.AI cs.CV cs.LG

    Multi-level Reasoning for Robotic Assembly: From Sequence Inference to Contact Selection

    Authors: Xinghao Zhu, Devesh K. Jha, Diego Romeres, Lingfeng Sun, Masayoshi Tomizuka, Anoop Cherian

    Abstract: Automating the assembly of objects from their parts is a complex problem with innumerable applications in manufacturing, maintenance, and recycling. Unlike existing research, which is limited to target segmentation, pose regression, or using fixed target blueprints, our work presents a holistic multi-level framework for part assembly planning consisting of part assembly sequence inference, part mo…

    Submitted 16 December, 2023; originally announced December 2023.

    Comments: Supplementary video is available at https://www.youtube.com/watch?v=XNYkWSHkAaU&ab_channel=MitsubishiElectricResearchLabs%28MERL%29

  4. arXiv:2310.00224

    cs.CV cs.AI cs.LG

    Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

    Authors: Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks

    Abstract: Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidanc…

    Submitted 29 September, 2023; originally announced October 2023.

    Comments: Accepted at ICCV 2023

  5. arXiv:2309.14531

    cs.CV

    Pixel-Grounded Prototypical Part Networks

    Authors: Zachariah Carmichael, Suhas Lohit, Anoop Cherian, Michael Jones, Walter Scheirer

    Abstract: Prototypical part neural networks (ProtoPartNNs), namely PROTOPNET and its derivatives, are an intrinsically interpretable approach to machine learning. Their prototype learning scheme enables intuitive explanations of the form, this (prototype) looks like that (testing image patch). But, does this actually look like that? In this work, we delve into why object part localization and associated hea…

    Submitted 25 September, 2023; originally announced September 2023.

    Comments: 21 pages

  6. arXiv:2306.04047

    cs.CV cs.AI cs.CL cs.LG

    CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

    Authors: Xiulong Liu, Sudipta Paul, Moitreya Chatterjee, Anoop Cherian

    Abstract: Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget…

    Submitted 26 December, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

    Comments: Accepted at AAAI 2024

  7. arXiv:2304.00387

    cs.CV

    HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions

    Authors: Anshul Shah, Aniket Roy, Ketul Shah, Shlok Kumar Mishra, David Jacobs, Anoop Cherian, Rama Chellappa

    Abstract: Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to data au…

    Submitted 1 April, 2023; originally announced April 2023.

    Comments: To be presented at CVPR 2023

  8. arXiv:2303.13800

    cs.CV

    Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

    Authors: Jiahao Zhang, Anoop Cherian, Yanbin Liu, Yizhak Ben-Shabat, Cristian Rodriguez, Stephen Gould

    Abstract: Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos comprising an enactment of the assembly actions in t…

    Submitted 20 March, 2024; v1 submitted 24 March, 2023; originally announced March 2023.

    Comments: Project website: https://academic.davidz.cn/en/publication/zhang-cvpr-2023/

  9. arXiv:2212.09993

    cs.AI cs.CV cs.LG

    Are Deep Neural Networks SMARTer than Second Graders?

    Authors: Anoop Cherian, Kuan-Chuan Peng, Suhas Lohit, Kevin A. Smith, Joshua B. Tenenbaum

    Abstract: Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algor…

    Submitted 11 September, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Extended version of CVPR 2023 paper. For the SMART-101 dataset, see http://smartdataset.github.io/smart101

  10. arXiv:2210.16472

    cs.SD cs.CV cs.LG eess.AS

    Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

    Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian

    Abstract: There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted at NeurIPS 2022

  11. arXiv:2210.12521

    cs.RO cs.AI cs.CV

    H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions

    Authors: Kei Ota, Hsiao-Yu Tung, Kevin A. Smith, Anoop Cherian, Tim K. Marks, Alan Sullivan, Asako Kanezaki, Joshua B. Tenenbaum

    Abstract: The world is filled with articulated objects that are difficult to determine how to use from vision alone, e.g., a door might open inwards or outwards. Humans handle these objects with strategic trial-and-error: first pushing a door then pulling if that doesn't work. We enable these capabilities in autonomous agents by proposing "Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR), a probabil…

    Submitted 22 October, 2022; originally announced October 2022.

  12. arXiv:2210.07940

    cs.CV cs.AI cs.CL cs.LG

    AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

    Authors: Sudipta Paul, Amit K. Roy-Chowdhury, Anoop Cherian

    Abstract: Recent years have seen embodied visual navigation advance in two distinct directions: (i) in equipping the AI agent to follow natural language instructions, and (ii) in making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal, but also often complex, and thus in spite of these advances, agents still need to understand the uncertainty in t…

    Submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted at NeurIPS 2022

  13. arXiv:2202.09277

    cs.CV cs.AI cs.LG

    (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

    Authors: Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux

    Abstract: Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight,…

    Submitted 26 March, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

    Comments: Accepted at AAAI 2022 (Oral)

  14. arXiv:2201.06372

    cs.DC physics.bio-ph physics.comp-ph q-bio.BM

    GROMACS in the cloud: A global supercomputer to speed up alchemical drug design

    Authors: Carsten Kutzner, Christian Kniep, Austin Cherian, Ludvig Nordstrom, Helmut Grubmüller, Bert L. de Groot, Vytautas Gapsys

    Abstract: We assess costs and efficiency of state-of-the-art high performance cloud computing compared to a traditional on-premises compute cluster. Our use case is atomistic simulations carried out with the GROMACS molecular dynamics (MD) toolkit with a focus on alchemical protein-ligand binding free energy calculations. We set up a compute cluster in the Amazon Web Services (AWS) cloud that incorporate…

    Submitted 13 May, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: 59 pages, 11 figures, 11 tables; v2 fixed a typo in the abstract

    Journal ref: Journal of Chemical Information and Modelling, 2022, 62, 1691-1711

  15. arXiv:2112.11450

    cs.LG cs.AI cs.CV

    Max-Margin Contrastive Learning

    Authors: Anshul Shah, Suvrit Sra, Rama Chellappa, Anoop Cherian

    Abstract: Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of negatives used for offering contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learni…

    Submitted 21 December, 2021; originally announced December 2021.

    Comments: Accepted at AAAI 2022

  16. arXiv:2111.01048

    cs.CV cs.AI cs.GR cs.LG

    MOST-GAN: 3D Morphable StyleGAN for Disentangled Face Image Manipulation

    Authors: Safa C. Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B. Tenenbaum, Xiaoming Liu, Tim K. Marks

    Abstract: Recent advances in generative adversarial networks (GANs) have led to remarkable achievements in face image synthesis. While methods that use style-based GANs can generate strikingly photorealistic face images, it is often difficult to control the characteristics of the generated faces in a meaningful and disentangled way. Prior approaches aim to achieve such semantic control and disentanglement w…

    Submitted 1 November, 2021; originally announced November 2021.

    ACM Class: I.2.10

  17. arXiv:2110.06894

    cs.CL

    Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

    Authors: Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

    Abstract: In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the dat…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: https://dstc10.dstc.community/home and https://github.com/dialogtekgeek/AVSD-DSTC10_Official/

  18. arXiv:2110.03446

    cs.CV cs.AI cs.LG

    A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

    Authors: Moitreya Chatterjee, Narendra Ahuja, Anoop Cherian

    Abstract: Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to solve this task typically estimate a latent prior characterizing this stochasticity, but do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) be…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted at ICCV 2021 (Oral)

  19. arXiv:2109.11955

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Visual Scene Graphs for Audio Source Separation

    Authors: Moitreya Chatterjee, Jonathan Le Roux, Narendra Ahuja, Anoop Cherian

    Abstract: State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinc…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: Accepted at ICCV 2021

  20. Measurement and Analysis of GPU-accelerated Applications with HPCToolkit

    Authors: Keren Zhou, Laksono Adhianto, Jonathon Anderson, Aaron Cherian, Dejan Grubisic, Mark Krentel, Yumeng Liu, Xiaozhu Meng, John Mellor-Crummey

    Abstract: To address the challenge of performance analysis on the US DOE's forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit's measurement and analysis tools attribute metrics to calling co…

    Submitted 14 September, 2021; originally announced September 2021.

    Journal ref: Parallel Computing 2021

  21. arXiv:2108.13865

    cs.CV cs.AI cs.LG cs.RO

    InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images

    Authors: Anoop Cherian, Goncalo Dias Pais, Siddarth Jain, Tim K. Marks, Alan Sullivan

    Abstract: In this paper, we present InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. Using an analysis-by-synthesis approach, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vecto…

    Submitted 28 January, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

    Comments: Accepted at ICCV 2021. Code & data @ https://www.merl.com/research/license/InSeGAN

  22. arXiv:2106.13272

    cs.CV cs.LG

    Generalized One-Class Learning Using Pairs of Complementary Classifiers

    Authors: Anoop Cherian, Jue Wang

    Abstract: One-class learning is the classic problem of fitting a model to the data for which annotations are available only for a single class. In this paper, we explore novel objectives for one-class learning, which we collectively refer to as Generalized One-class Discriminative Subspaces (GODS). Our key idea is to learn a pair of complementary classifiers to flexibly bound the one-class data distribution…

    Submitted 24 June, 2021; originally announced June 2021.

    Comments: Accepted at Trans. PAMI. arXiv admin note: text overlap with arXiv:1908.05884

  23. arXiv:2104.06461

    cs.LG cs.CV

    Learning Log-Determinant Divergences for Positive Definite Matrices

    Authors: Anoop Cherian, Panagiotis Stanitsas, Jue Wang, Mehrtash Harandi, Vassilios Morellas, Nikolaos Papanikolopoulos

    Abstract: Representations in the form of Symmetric Positive Definite (SPD) matrices have been popularized in a variety of visual learning applications due to their demonstrated ability to capture rich second-order statistics of visual data. There exist several similarity measures for comparing SPD matrices with documented benefits. However, selecting an appropriate measure for a given problem remains a chal…

    Submitted 22 December, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted at Trans. PAMI (extended version of ICCV 2017 paper). arXiv admin note: substantial text overlap with arXiv:1708.01741

  24. Tensor Representations for Action Recognition

    Authors: Piotr Koniusz, Lei Wang, Anoop Cherian

    Abstract: Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) sequence compatibility kernel (SCK) a…

    Submitted 28 August, 2021; v1 submitted 28 December, 2020; originally announced December 2020.

    Comments: Published with TPAMI, 2020. arXiv admin note: text overlap with arXiv:1604.00239

  25. arXiv:2010.02990

    cs.LG eess.SY

    First-Order Optimization Inspired from Finite-Time Convergent Flows

    Authors: Siqi Zhang, Mouhacine Benosman, Orlando Romero, Anoop Cherian

    Abstract: In this paper, we investigate the performance of two first-order optimization algorithms, obtained from forward Euler discretization of finite-time optimization flows. These flows are the rescaled-gradient flow (RGF) and the signed-gradient flow (SGF), and consist of non-Lipschitz or discontinuous dynamical systems that converge locally in finite time to the minima of gradient-dominated functions.…

    Submitted 17 October, 2022; v1 submitted 6 October, 2020; originally announced October 2020.

    MSC Class: 68T07

  26. arXiv:2007.12130

    cs.CV cs.LG cs.SD eess.AS

    Sound2Sight: Generating Visual Dynamics from Sound and Context

    Authors: Anoop Cherian, Moitreya Chatterjee, Narendra Ahuja

    Abstract: Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on…

    Submitted 23 July, 2020; originally announced July 2020.

    Comments: Accepted at ECCV 2020

  27. arXiv:2007.05840

    cs.LG cs.CV stat.ML

    Representation Learning via Adversarially-Contrastive Optimal Transport

    Authors: Anoop Cherian, Shuchin Aeron

    Abstract: In this paper, we study the problem of learning compact (low-dimensional) representations for sequential data that capture its implicit spatio-temporal cues. To maximize extraction of such informative cues from the data, we set the problem within the context of contrastive representation learning and to that end propose a novel objective via optimal transport. Specifically, our formulation seeks…

    Submitted 11 July, 2020; originally announced July 2020.

    Comments: Accepted at ICML 2020

  28. arXiv:2007.03848

    cs.CV cs.CL

    Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

    Authors: Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian

    Abstract: Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements into which could influence several human-machine interaction applications. To…

    Submitted 2 March, 2021; v1 submitted 7 July, 2020; originally announced July 2020.

    Comments: Accepted at AAAI 2021

  29. arXiv:2006.09197

    cs.CV cs.CG

    Dense Non-Rigid Structure from Motion: A Manifold Viewpoint

    Authors: Suryansh Kumar, Luc Van Gool, Carlos E. P. de Oliveira, Anoop Cherian, Yuchao Dai, Hongdong Li

    Abstract: The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover 3D geometry of a deforming object from its 2D feature correspondences across multiple frames. Classical approaches to this problem assume a small number of feature points and ignore the local non-linearities of the shape deformation, and therefore struggle to reliably model non-linear deformations. Furthermore, available dense NRSf…

    Submitted 15 June, 2020; originally announced June 2020.

    Comments: A comprehensive version that combines our CVPR 2018 and CVPR 2019 work (still under development and refinement; initial version). 13 figures, 1 table. arXiv admin note: text overlap with arXiv:1902.01077

  30. arXiv:2004.13217

    cs.CV

    Inferring Temporal Compositions of Actions Using Probabilistic Automata

    Authors: Rodrigo Santa Cruz, Anoop Cherian, Basura Fernando, Dylan Campbell, Stephen Gould

    Abstract: This paper presents a framework to recognize temporal compositions of atomic actions in videos. Specifically, we propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata to recognize complex actions as satisfying these expressions on the input video features. Our approach is different from existing works that…

    Submitted 27 April, 2020; originally announced April 2020.

    Comments: Accepted in Workshop on Compositionality in Computer Vision at CVPR, 2020

  31. arXiv:2004.02980

    cs.CV cs.LG eess.IV

    LUVLi Face Alignment: Estimating Landmarks' Location, Uncertainty, and Visibility Likelihood

    Authors: Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, Chen Feng

    Abstract: Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations nor predict whether landmarks are visible. In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities. We…

    Submitted 6 April, 2020; originally announced April 2020.

    Comments: Accepted to CVPR 2020

  32. arXiv:2001.06127

    cs.CV

    Spatio-Temporal Ranked-Attention Networks for Video Captioning

    Authors: Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

    Abstract: Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Sp…

    Submitted 16 January, 2020; originally announced January 2020.

  33. arXiv:1911.06394

    cs.CL

    The Eighth Dialog System Technology Challenge

    Authors: Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta

    Abstract: This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and eval…

    Submitted 14 November, 2019; originally announced November 2019.

    Comments: Submitted to NeurIPS 2019 3rd Conversational AI Workshop

  34. arXiv:1909.02856

    cs.CV

    Discriminative Video Representation Learning Using Support Vector Classifiers

    Authors: Jue Wang, Anoop Cherian

    Abstract: Most popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action---many are common across multiple actions---pooling schemes that impose equal importance on all frames might be unfavorable. In an attempt to tack…

    Submitted 5 September, 2019; originally announced September 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1803.10628

  35. arXiv:1908.05884

    cs.CV

    GODS: Generalized One-class Discriminative Subspaces for Anomaly Detection

    Authors: Jue Wang, Anoop Cherian

    Abstract: One-class learning is the classic problem of fitting a model to data for which annotations are available only for a single class. In this paper, we propose a novel objective for one-class learning. Our key idea is to use a pair of orthonormal frames -- as subspaces -- to "sandwich" the labeled data via optimizing for two objectives jointly: i) minimize the distance between the origins of the two s…

    Submitted 16 August, 2019; originally announced August 2019.

    Comments: Accepted by ICCV 2019, 8 pages

  36. arXiv:1905.05927

    cs.LG cs.CV math.OC stat.ML

    Game Theoretic Optimization via Gradient-based Nikaido-Isoda Function

    Authors: Arvind U. Raghunathan, Anoop Cherian, Devesh K. Jha

    Abstract: Computing Nash equilibrium (NE) of multi-player games has witnessed renewed interest due to recent advances in generative adversarial networks. However, computing equilibrium efficiently is challenging. To this end, we introduce the Gradient-based Nikaido-Isoda (GNI) function which serves: (i) as a merit function, vanishing only at the first-order stationary points of each player's optimization pr…

    Submitted 14 May, 2019; originally announced May 2019.

    Comments: Accepted at International Conference on Machine Learning (ICML), 2019

  37. arXiv:1901.09107

    cs.CV

    Audio-Visual Scene-Aware Dialog

    Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh

    Abstract: We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audi…

    Submitted 8 May, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

  38. arXiv:1807.09380

    cs.CV

    Contrastive Video Representation Learning via Adversarial Perturbations

    Authors: Jue Wang, Anoop Cherian

    Abstract: Adversarial perturbations are noise-like patterns that can subtly change the data, while causing an otherwise accurate classifier to fail. In this paper, we propose to use such perturbations within a novel contrastive learning setup to build negative samples, which are then used to produce improved video representations. To this end, given a well-trained deep model for per-frame video recognition, we firs…

    Submitted 15 April, 2020; v1 submitted 24 July, 2018; originally announced July 2018.

    Comments: Revised version of ECCV 2018 Paper: Learning Discriminative Video Representations Using Adversarial Perturbations

  39. arXiv:1807.04409

    cs.CV

    Sem-GAN: Semantically-Consistent Image-to-Image Translation

    Authors: Anoop Cherian, Alan Sullivan

    Abstract: Unpaired image-to-image translation is the problem of mapping an image in the source domain to one in the target domain, without requiring corresponding image pairs. To ensure the translated images are realistically plausible, recent works, such as Cycle-GAN, demand this mapping to be invertible. While this requirement demonstrates promising results when the domains are unimodal, its performance…

    Submitted 11 July, 2018; originally announced July 2018.

  40. arXiv:1806.08409

    cs.CL cs.CV cs.SD eess.AS

    End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

    Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh

    Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog dat…

    Submitted 29 June, 2018; v1 submitted 21 June, 2018; originally announced June 2018.

    Comments: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7

  41. arXiv:1806.00525

    cs.CL cs.CV

    Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

    Authors: Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K. Marks, Chiori Hori

    Abstract: Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, wh…

    Submitted 1 June, 2018; originally announced June 2018.

  42. arXiv:1803.11064  [pdf, other]

    cs.CV

    Non-Linear Temporal Subspace Representations for Activity Recognition

    Authors: Anoop Cherian, Suvrit Sra, Stephen Gould, Richard Hartley

    Abstract: Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action…

    Submitted 27 March, 2018; originally announced March 2018.

    Comments: Accepted at the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, 2018. arXiv admin note: substantial text overlap with arXiv:1705.08583

  43. arXiv:1803.10628  [pdf, ps, other]

    cs.CV

    Video Representation Learning Using Discriminative Pooling

    Authors: Jue Wang, Anoop Cherian, Fatih Porikli, Stephen Gould

    Abstract: Popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action---indeed, many are common across multiple actions---pooling schemes that impose equal importance on all frames might be unfavorable. In an attempt to t…

    Submitted 29 March, 2018; v1 submitted 26 March, 2018; originally announced March 2018.

    Comments: 8 pages, 7 figures, Accepted in CVPR2018. arXiv admin note: substantial text overlap with arXiv:1704.01716

  44. arXiv:1803.00233  [pdf, other]

    cs.CV

    Scalable Dense Non-rigid Structure-from-Motion: A Grassmannian Perspective

    Authors: Suryansh Kumar, Anoop Cherian, Yuchao Dai, Hongdong Li

    Abstract: This paper addresses the task of dense non-rigid structure-from-motion (NRSfM) using multiple images. State-of-the-art methods to this problem are often hurdled by scalability, expensive computations, and noisy measurements. Further, recent methods to NRSfM usually either assume a small number of sparse feature points or ignore local non-linearities of shape deformations, and thus cannot reliably…

    Submitted 23 March, 2018; v1 submitted 1 March, 2018; originally announced March 2018.

    Comments: 10 pages, 7 figures, 4 tables. Accepted for publication in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, typos fixed and acknowledgement added

  45. arXiv:1801.08676  [pdf, other]

    cs.CV cs.LG

    Neural Algebra of Classifiers

    Authors: Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, Stephen Gould

    Abstract: The world is fundamentally compositional, so it is natural to think of visual recognition as the recognition of basic visual primitives that are composed according to well-defined rules. This strategy allows us to recognize unseen complex concepts from simple visual primitives. However, the current trend in visual recognition follows a data greedy approach where huge amounts of data are required…

    Submitted 26 January, 2018; originally announced January 2018.

    Comments: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)

  46. arXiv:1709.06391  [pdf, other]

    cs.CV

    Human Action Forecasting by Learning Task Grammars

    Authors: Tengda Han, Jue Wang, Anoop Cherian, Stephen Gould

    Abstract: For effective human-robot interaction, it is important that a robotic assistant can forecast the next action a human will consider in a given task. Unfortunately, real-world tasks are often very long, complex, and repetitive; as a result forecasting is not trivial. In this paper, we propose a novel deep recurrent architecture that takes as input features from a two-stream Residual action recogniti…

    Submitted 19 September, 2017; originally announced September 2017.

  47. arXiv:1708.01741  [pdf, other]

    cs.CV

    Learning Discriminative Alpha-Beta-divergence for Positive Definite Matrices (Extended Version)

    Authors: Anoop Cherian, Panagiotis Stanitsas, Mehrtash Harandi, Vassilios Morellas, Nikolaos Papanikolopoulos

    Abstract: Symmetric positive definite (SPD) matrices are useful for capturing second-order statistics of visual data. To compare two SPD matrices, several measures are available, such as the affine-invariant Riemannian metric, Jeffreys divergence, Jensen-Bregman logdet divergence, etc.; however, their behaviors may be application dependent, raising the need of manual selection to achieve the best possible p…

    Submitted 5 August, 2017; originally announced August 2017.

    Comments: Accepted at the International Conference on Computer Vision (ICCV)

  48. arXiv:1707.09240  [pdf, other]

    cs.CV cs.LG

    Human Pose Forecasting via Deep Markov Models

    Authors: Sam Toyer, Anoop Cherian, Tengda Han, Stephen Gould

    Abstract: Human pose forecasting is an important problem in computer vision with applications to human-robot interaction, visual surveillance, and autonomous driving. Usually, forecasting algorithms use 3D skeleton sequences and are trained to forecast for a few milliseconds into the future. Long-range forecasting is challenging due to the difficulty of estimating how long a person continues an activity. To…

    Submitted 5 September, 2017; v1 submitted 24 July, 2017; originally announced July 2017.

    Comments: Accepted to DICTA'17

  49. arXiv:1705.08583  [pdf, other]

    cs.CV

    Sequence Summarization Using Order-constrained Kernelized Feature Subspaces

    Authors: Anoop Cherian, Suvrit Sra, Richard Hartley

    Abstract: Representations that can compactly and effectively capture temporal evolution of semantic content are important to machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are characteri…

    Submitted 23 May, 2017; originally announced May 2017.

  50. arXiv:1704.06925  [pdf, other]

    cs.CV

    Second-order Temporal Pooling for Action Recognition

    Authors: Anoop Cherian, Stephen Gould

    Abstract: Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically zero-th (max) or the first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specificall…

    Submitted 6 August, 2018; v1 submitted 23 April, 2017; originally announced April 2017.

    Comments: Accepted in the International Journal of Computer Vision (IJCV)