Showing 1–17 of 17 results for author: Marks, T K

  1. arXiv:2404.16306

    cs.CV

    TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

    Authors: Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks

    Abstract: Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  2. arXiv:2310.00224

    cs.CV cs.AI cs.LG

    Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

    Authors: Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks

    Abstract: Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidanc…

    Submitted 29 September, 2023; originally announced October 2023.

    Comments: Accepted at ICCV 2023

  3. arXiv:2210.12521

    cs.RO cs.AI cs.CV

    H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions

    Authors: Kei Ota, Hsiao-Yu Tung, Kevin A. Smith, Anoop Cherian, Tim K. Marks, Alan Sullivan, Asako Kanezaki, Joshua B. Tenenbaum

    Abstract: The world is filled with articulated objects that are difficult to determine how to use from vision alone, e.g., a door might open inwards or outwards. Humans handle these objects with strategic trial-and-error: first pushing a door then pulling if that doesn't work. We enable these capabilities in autonomous agents by proposing "Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR), a probabil…

    Submitted 22 October, 2022; originally announced October 2022.

  4. arXiv:2202.09277

    cs.CV cs.AI cs.LG

    (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

    Authors: Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux

    Abstract: Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight,…

    Submitted 26 March, 2022; v1 submitted 18 February, 2022; originally announced February 2022.

    Comments: Accepted at AAAI 2022 (Oral)

  5. arXiv:2111.01048

    cs.CV cs.AI cs.GR cs.LG

    MOST-GAN: 3D Morphable StyleGAN for Disentangled Face Image Manipulation

    Authors: Safa C. Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B. Tenenbaum, Xiaoming Liu, Tim K. Marks

    Abstract: Recent advances in generative adversarial networks (GANs) have led to remarkable achievements in face image synthesis. While methods that use style-based GANs can generate strikingly photorealistic face images, it is often difficult to control the characteristics of the generated faces in a meaningful and disentangled way. Prior approaches aim to achieve such semantic control and disentanglement w…

    Submitted 1 November, 2021; originally announced November 2021.

    ACM Class: I.2.10

  6. arXiv:2110.06894

    cs.CL

    Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

    Authors: Ankit P. Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, Chiori Hori

    Abstract: In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the dat…

    Submitted 13 October, 2021; originally announced October 2021.

    Comments: https://dstc10.dstc.community/home and https://github.com/dialogtekgeek/AVSD-DSTC10_Official/

  7. arXiv:2108.13865

    cs.CV cs.AI cs.LG cs.RO

    InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images

    Authors: Anoop Cherian, Goncalo Dias Pais, Siddarth Jain, Tim K. Marks, Alan Sullivan

    Abstract: In this paper, we present InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. Using an analysis-by-synthesis approach, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vecto…

    Submitted 28 January, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

    Comments: Accepted at ICCV 2021. Code & data @ https://www.merl.com/research/license/InSeGAN

  8. arXiv:2004.02980

    cs.CV cs.LG eess.IV

    LUVLi Face Alignment: Estimating Landmarks' Location, Uncertainty, and Visibility Likelihood

    Authors: Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu, Chen Feng

    Abstract: Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations nor predict whether landmarks are visible. In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities. We…

    Submitted 6 April, 2020; originally announced April 2020.

    Comments: Accepted to CVPR 2020

  9. arXiv:2001.06127

    cs.CV

    Spatio-Temporal Ranked-Attention Networks for Video Captioning

    Authors: Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks

    Abstract: Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Sp…

    Submitted 16 January, 2020; originally announced January 2020.

  10. arXiv:1911.06394

    cs.CL

    The Eighth Dialog System Technology Challenge

    Authors: Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta

    Abstract: This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and eval…

    Submitted 14 November, 2019; originally announced November 2019.

    Comments: Submitted to NeurIPS 2019 3rd Conversational AI Workshop

  11. arXiv:1901.09107

    cs.CV

    Audio-Visual Scene-Aware Dialog

    Authors: Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh

    Abstract: We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audi…

    Submitted 8 May, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

  12. arXiv:1901.03461

    cs.CL

    Dialog System Technology Challenge 7

    Authors: Koichiro Yoshino, Chiori Hori, Julien Perez, Luis Fernando D'Haro, Lazaros Polymenakos, Chulaka Gunasekara, Walter S. Lasecki, Jonathan K. Kummerfeld, Michel Galley, Chris Brockett, Jianfeng Gao, Bill Dolan, Xiang Gao, Huda Alamari, Tim K. Marks, Devi Parikh, Dhruv Batra

    Abstract: This paper introduces the Seventh Dialog System Technology Challenges (DSTC), which use shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to various dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation and (…

    Submitted 10 January, 2019; originally announced January 2019.

    Comments: This paper is presented at NIPS2018 2nd Conversational AI workshop

  13. arXiv:1806.08409

    cs.CL cs.CV cs.SD eess.AS

    End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

    Authors: Chiori Hori, Huda Alamri, Jue Wang, Gordon Wichern, Takaaki Hori, Anoop Cherian, Tim K. Marks, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Irfan Essa, Dhruv Batra, Devi Parikh

    Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog dat…

    Submitted 29 June, 2018; v1 submitted 21 June, 2018; originally announced June 2018.

    Comments: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7

  14. arXiv:1806.00525

    cs.CL cs.CV

    Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

    Authors: Huda Alamri, Vincent Cartillier, Raphael Gontijo Lopes, Abhishek Das, Jue Wang, Irfan Essa, Dhruv Batra, Devi Parikh, Anoop Cherian, Tim K. Marks, Chiori Hori

    Abstract: Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, wh…

    Submitted 1 June, 2018; originally announced June 2018.

  15. arXiv:1804.00060

    cs.CV

    Class Subset Selection for Transfer Learning using Submodularity

    Authors: Varun Manjunatha, Srikumar Ramalingam, Tim K. Marks, Larry Davis

    Abstract: In recent years, it is common practice to extract fully-connected layer (fc) features that were learned while performing image classification on a source dataset, such as ImageNet, and apply them generally to a wide range of other tasks. The general usefulness of some large training datasets for transfer learning is not yet well understood, and raises a number of questions. For example, in the con…

    Submitted 30 March, 2018; originally announced April 2018.

  16. arXiv:1701.03126

    cs.CV cs.CL cs.MM

    Attention-Based Multimodal Fusion for Video Description

    Authors: Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks

    Abstract: Currently successful methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has shown the advantage of integrating temporal and/or spatial attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time fra…

    Submitted 9 March, 2017; v1 submitted 11 January, 2017; originally announced January 2017.

    Comments: Resubmitted to the rebuttal for CVPR 2017 for review, 8 pages, 4 figures

  17. Robust Face Alignment Using a Mixture of Invariant Experts

    Authors: Oncel Tuzel, Tim K. Marks, Salil Tambe

    Abstract: Face alignment, which is the task of finding the locations of a set of facial landmark points in an image of a face, is useful in widespread application areas. Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression. To address this issue, we propose a cascade in which each stage consists of a mixture of regress…

    Submitted 23 October, 2016; v1 submitted 13 November, 2015; originally announced November 2015.

    Comments: 17 pages, 6 figures

    Journal ref: Proceedings of 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 11-14, 2016, pp 825-841