Showing 1–50 of 84 results for author: Sigal, L

  1. arXiv:2406.00637  [pdf, other]

    cs.CV cs.AI cs.GR

    Representing Animatable Avatar via Factorized Neural Fields

    Authors: Chunjie Song, Zhijie Wu, Bastian Wandt, Leonid Sigal, Helge Rhodin

    Abstract: For reconstructing high-fidelity human 3D models from monocular videos, it is crucial to maintain consistent large-scale body shapes along with finely matched subtle wrinkles. This paper explores the observation that the per-frame rendering results can be factorized into a pose-independent component and a corresponding pose-dependent equivalent to facilitate frame consistency. Pose adaptive textur… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

  2. arXiv:2404.11732  [pdf, other]

    cs.CV

    Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach

    Authors: Mir Rayat Imtiaz Hossain, Mennatullah Siam, Leonid Sigal, James J. Little

    Abstract: The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we exa… ▽ More

    Submitted 17 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024

  3. arXiv:2403.14797  [pdf, other]

    cs.CV cs.LG

    Preventing Catastrophic Forgetting through Memory Networks in Continuous Detection

    Authors: Gaurav Bhatt, James Ross, Leonid Sigal

    Abstract: Modern pre-trained architectures struggle to retain previous information while undergoing continuous fine-tuning on new tasks. Despite notable progress in continual classification, systems designed for complex vision tasks such as detection or segmentation still struggle to attain satisfactory performance. In this work, we introduce a memory-based detection transformer architecture to adapt a pre-… ▽ More

    Submitted 21 March, 2024; originally announced March 2024.

  4. arXiv:2402.11487  [pdf, other]

    cs.CV

    Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

    Authors: Tanzila Rahman, Shweta Mahajan, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Leonid Sigal

    Abstract: Text-to-image (TTI) diffusion models have demonstrated impressive results in generating high-resolution images of complex and imaginative scenes. Recent approaches have further extended these methods with personalization techniques that allow them to integrate user-illustrated concepts (e.g., the user him/herself) using a few sample image illustrations. However, the ability to generate images with… ▽ More

    Submitted 18 February, 2024; originally announced February 2024.

    Comments: 9 Figures, 8 Pages

  5. arXiv:2401.12419  [pdf, other]

    cs.CV

    Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

    Authors: Shih-Han Chou, Matthew Kowal, Yasmin Niknam, Diana Moyano, Shayaan Mehdi, Richard Pito, Cheng Zhang, Ian Knopke, Sedef Akinli Kocak, Leonid Sigal, Yalda Mohsenzadeh

    Abstract: While progress has been made in the domain of video-language understanding, current state-of-the-art algorithms are still limited in their ability to understand videos at high levels of abstraction, such as news-oriented videos. Alternatively, humans easily amalgamate information from video and language to infer information beyond what is visually observable in the pixels. An example of this is wa… ▽ More

    Submitted 22 January, 2024; originally announced January 2024.

  6. arXiv:2401.01130  [pdf, other]

    cs.CV

    Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

    Authors: Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal

    Abstract: In this paper, we present a novel generative task: joint scene graph - image generation. While previous works have explored image generation conditioned on scene graphs or layouts, our task is distinctive and important as it involves generating scene graphs themselves unconditionally from noise, enabling efficient and interpretable control for image generation. Our task is challenging, requiring t… ▽ More

    Submitted 2 January, 2024; originally announced January 2024.

  7. arXiv:2312.12416  [pdf, other]

    cs.CV cs.LG

    Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

    Authors: Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal

    Abstract: The quality of the prompts provided to text-to-image diffusion models determines how faithful the generated content is to the user's intent, often requiring 'prompt engineering'. To harness visual concepts from target images without prompt engineering, current approaches largely rely on embedding inversion by optimizing and then mapping them to pseudo-tokens. However, working with such high-dimens… ▽ More

    Submitted 19 December, 2023; originally announced December 2023.

  8. arXiv:2312.08514  [pdf, other]

    cs.CV

    TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking

    Authors: Raghav Goyal, Wan-Cyuan Fan, Mennatullah Siam, Leonid Sigal

    Abstract: Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings, which involve long videos with global motion (e.g., in egocentric settings), depicting small objects undergoing both rigid and non-rigid (including state) deformations. While a number of recent approaches have been explored for this task, the… ▽ More

    Submitted 9 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  9. arXiv:2312.01261  [pdf, other]

    cs.CV cs.CY

    TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models

    Authors: Aditya Chinchure, Pushkar Shukla, Gaurav Bhatt, Kiri Salij, Kartik Hosanagar, Leonid Sigal, Matthew Turk

    Abstract: Text-to-Image (TTI) generative models have shown great progress in the past few years in terms of their ability to generate complex and high-quality imagery. At the same time, these models have been shown to suffer from harmful biases, including exaggerated societal biases (e.g., gender, ethnicity), as well as incidental correlations that limit such models' ability to generate more diverse imagery… ▽ More

    Submitted 2 December, 2023; originally announced December 2023.

  10. arXiv:2311.17095  [pdf, other]

    cs.CV cs.AI

    Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

    Authors: Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li

    Abstract: From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Voc… ▽ More

    Submitted 15 June, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

    Comments: Accepted to CVPR 2024; Earlier version of this paper contained an unintentional error stemming from a bug in the code. This version corrects this error, which had to do with filtering of class names. In consultation with CVPR Program Chairs it was suggested errata be submitted as the updated (fixed) code reinforced original findings (albeit with slightly different final numbers)

  11. arXiv:2310.00377  [pdf, other]

    cs.LG

    Mitigating the Effect of Incidental Correlations on Part-based Learning

    Authors: Gaurav Bhatt, Deepayan Das, Leonid Sigal, Vineeth N Balasubramanian

    Abstract: Intelligent systems possess a crucial characteristic of breaking complicated problems into smaller reusable components or parts and adjusting to new tasks using these part representations. However, current part-learners encounter difficulties in dealing with incidental correlations resulting from the limited observations of objects that may appear only in specific arrangements or with specific bac… ▽ More

    Submitted 30 September, 2023; originally announced October 2023.

    Comments: Accepted in 37th Conference on Neural Information Processing Systems (NeurIPS'2023)

  12. arXiv:2307.14071  [pdf, other]

    cs.CV cs.AI

    Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching

    Authors: Junpeng Jing, Jiankun Li, Pengfei Xiong, Jiangyu Liu, Shuaicheng Liu, Yichen Guo, Xin Deng, Mai Xu, Lai Jiang, Leonid Sigal

    Abstract: Correlation based stereo matching has achieved outstanding performance, which pursues cost volume between two feature maps. Unfortunately, current methods with a fixed model do not work uniformly well across various datasets, greatly limiting their real-world applicability. To tackle this issue, this paper proposes a new perspective to dynamically calculate correlation for robust stereo matching.… ▽ More

    Submitted 26 July, 2023; originally announced July 2023.

    Comments: Accepted by ICCV 2023

  13. arXiv:2307.07663  [pdf, other]

    cs.CV

    INVE: Interactive Neural Video Editing

    Authors: Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee

    Abstract: We present Interactive Neural Video Editing (INVE), a real-time video editing solution, which can assist the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insuf… ▽ More

    Submitted 14 July, 2023; originally announced July 2023.

  14. arXiv:2303.07545  [pdf, other]

    cs.CV

    Implicit and Explicit Commonsense for Multi-sentence Video Captioning

    Authors: Shih-Han Chou, James J. Little, Leonid Sigal

    Abstract: Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack the commonsense knowledge of the world required to reason about progression of events, causality, and even the function of certain objects within a scene. To add… ▽ More

    Submitted 8 January, 2024; v1 submitted 13 March, 2023; originally announced March 2023.

    Comments: The paper is under consideration at Computer Vision and Image Understanding Journal

  15. arXiv:2302.08063  [pdf, other]

    cs.CV

    MINOTAUR: Multi-task Video Grounding From Multimodal Queries

    Authors: Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

    Abstract: Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences. These tasks differ in the type of inputs (only video, or video-query pair where query is an image region or sentence) and outputs (temporal segments or spatio-temporal tubes). However, at their core they require the same fundamental understanding of the video, i… ▽ More

    Submitted 17 March, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

    Comments: 22 pages, 8 figures and 13 tables

  16. arXiv:2302.07319  [pdf, other]

    cs.CV

    Frustratingly Simple but Effective Zero-shot Detection and Segmentation: Analysis and a Strong Baseline

    Authors: Siddhesh Khandelwal, Anirudth Nambirajan, Behjat Siddiquie, Jayan Eledath, Leonid Sigal

    Abstract: Methods for object detection and segmentation often require abundant instance-level annotations for training, which are time-consuming and expensive to collect. To address this, the task of zero-shot object detection (or segmentation) aims at learning effective methods for identifying and localizing object instances for the categories that have no supervision available. Constructing architectures… ▽ More

    Submitted 14 February, 2023; originally announced February 2023.

    Comments: 17 Pages, 7 Figures

  17. arXiv:2302.01403  [pdf, other]

    cs.CV

    Self-Supervised Relation Alignment for Scene Graph Generation

    Authors: Bicheng Xu, Renjie Liao, Leonid Sigal

    Abstract: The goal of scene graph generation is to predict a graph from an input image, where nodes correspond to identified and localized objects and edges to their corresponding interaction predicates. Existing methods are trained in a fully supervised manner and focus on message passing mechanisms, loss functions, and/or bias mitigation. In this work we introduce a simple-yet-effective self-supervised re… ▽ More

    Submitted 12 December, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

  18. Vocabulary-informed Zero-shot and Open-set Learning

    Authors: Yanwei Fu, Xiaomei Wang, Hanze Dong, Yu-Gang Jiang, Meng Wang, Xiangyang Xue, Leonid Sigal

    Abstract: Despite significant progress in object categorization, in recent years, a number of important challenges remain; mainly, the ability to learn from limited labeled data and to recognize object classes within large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited sized class vocabularies and typically requ… ▽ More

    Submitted 3 January, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

    Comments: 17 pages, 8 figures. TPAMI 2019 extended from CVPR 2016 (arXiv:1604.07093)

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)

  19. Framework-agnostic Semantically-aware Global Reasoning for Segmentation

    Authors: Mir Rayat Imtiaz Hossain, Leonid Sigal, James J. Little

    Abstract: Recent advances in pixel-level tasks (e.g. segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such aggregated representations, often in the form of attention, fail to model the underlying semantics of the scene (e.g. individual objects and, by extension, their interactions). In this work, we a… ▽ More

    Submitted 17 April, 2024; v1 submitted 6 December, 2022; originally announced December 2022.

    Comments: Published in WACV 2024

    Journal ref: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2024, pp. 988-998

  20. arXiv:2211.15155  [pdf, other]

    cs.LG cs.AI

    GraphPNAS: Learning Distribution of Good Neural Architectures via Deep Graph Generative Models

    Authors: Muchen Li, Jeffrey Yunfan Liu, Leonid Sigal, Renjie Liao

    Abstract: Neural architectures can be naturally viewed as computational graphs. Motivated by this perspective, we, in this paper, study neural architecture search (NAS) through the lens of learning random graph models. In contrast to existing NAS methods which largely focus on searching for a single best architecture, i.e., point estimation, we propose GraphPNAS, a deep graph generative model that learns a di… ▽ More

    Submitted 28 November, 2022; originally announced November 2022.

  21. arXiv:2211.13319  [pdf, other]

    cs.CV

    Make-A-Story: Visual Memory Conditioned Consistent Story Generation

    Authors: Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

    Abstract: There has been a recent explosion of impressive generative models that can produce high quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and main actors in them. Therefore employing such models for more complex task of story visualization, where naturally references and co-referen… ▽ More

    Submitted 5 May, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: 11 pages

  22. arXiv:2210.13626  [pdf, other]

    cs.CV cs.CL

    VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge

    Authors: Sahithya Ravi, Aditya Chinchure, Leonid Sigal, Renjie Liao, Vered Shwartz

    Abstract: There has been a growing interest in solving Visual Question Answering (VQA) tasks that require the model to reason beyond the content present in the image. In this work, we focus on questions that require commonsense reasoning. In contrast to previous methods which inject knowledge from static knowledge bases, we investigate the incorporation of contextualized knowledge using Commonsense Transfor… ▽ More

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted at WACV 2023. For code and supplementary material, see https://github.com/aditya10/VLC-BERT

  23. arXiv:2210.01791  [pdf, other]

    cs.CV cs.AI cs.HC

    Real-Time Monitoring of User Stress, Heart Rate and Heart Rate Variability on Mobile Devices

    Authors: Peyman Bateni, Leonid Sigal

    Abstract: Stress is considered to be the epidemic of the 21st-century. Yet, mobile apps cannot directly evaluate the impact of their content and services on user stress. We introduce the Beam AI SDK to address this issue. Using our SDK, apps can monitor user stress through the selfie camera in real-time. Our technology extracts the user's pulse wave by analyzing subtle color variations across the skin regio… ▽ More

    Submitted 4 October, 2022; originally announced October 2022.

  24. arXiv:2207.13440  [pdf, other]

    cs.CV

    Iterative Scene Graph Generation

    Authors: Siddhesh Khandelwal, Leonid Sigal

    Abstract: The task of scene graph generation entails identifying object entities and their corresponding interaction predicates in a given image (or video). Due to the combinatorially large solution space, existing approaches to scene graph generation assume certain factorization of the joint distribution to make the estimation feasible (e.g., assuming that objects are conditionally independent of predicate… ▽ More

    Submitted 27 July, 2022; originally announced July 2022.

    Comments: 25 pages, 10 images, 9 tables

  25. arXiv:2207.10662  [pdf, other]

    cs.CV

    Generalizable Patch-Based Neural Rendering

    Authors: Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia

    Abstract: Neural rendering has received tremendous attention since the advent of Neural Radiance Fields (NeRF), and has pushed the state-of-the-art on novel-view synthesis considerably. The recent focus has been on models that overfit to a single scene, and the few attempts to learn models that can synthesize novel views of unseen scenes mostly consist of combining deep convolutional features with a NeRF-li… ▽ More

    Submitted 28 July, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

    Comments: Project Page with code and results at https://mohammedsuhail.net/gen_patch_neural_rendering/

  26. arXiv:2203.12054  [pdf, other]

    cs.CV cs.AI

    Self-supervision through Random Segments with Autoregressive Coding (RandSAC)

    Authors: Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal

    Abstract: Inspired by the success of self-supervised autoregressive representation learning in natural language (GPT and its variants), and advances in recent visual architecture design with Vision Transformers (ViTs), in this paper, we explore the effect various design choices have on the success of applying such training strategies for visual feature learning. Specifically, we introduce a novel strategy t… ▽ More

    Submitted 25 October, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

  27. arXiv:2201.05151  [pdf, other]

    cs.CV

    Beyond Simple Meta-Learning: Multi-Purpose Models for Multi-Domain, Active and Continual Few-Shot Learning

    Authors: Peyman Bateni, Jarred Barber, Raghav Goyal, Vaden Masrani, Jan-Willem van de Meent, Leonid Sigal, Frank Wood

    Abstract: Modern deep learning requires large-scale extensively labelled datasets for training. Few-shot learning aims to alleviate this issue by learning effectively from few labelled examples. In previously proposed few-shot visual classifiers, it is assumed that the feature manifold, where classifier decisions are made, has uncorrelated feature dimensions and uniform feature variance. In this work, we fo… ▽ More

    Submitted 12 December, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

  28. arXiv:2112.09687  [pdf, other]

    cs.CV

    Light Field Neural Rendering

    Authors: Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia

    Abstract: Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views, but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations… ▽ More

    Submitted 28 March, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: Project page with code and videos at https://light-field-neural-rendering.github.io

  29. arXiv:2111.12747  [pdf, other]

    cs.CV

    Layered Controllable Video Generation

    Authors: Jiahui Huang, Yuhe Jin, Kwang Moo Yi, Leonid Sigal

    Abstract: We introduce layered controllable video generation, where we, without any supervision, decompose the initial frame of a video into foreground and background layers, with which the user can control the video generation process by simply manipulating the foreground mask. The key challenges are the unsupervised foreground-background separation, which is ambiguous, and ability to anticipate user manip… ▽ More

    Submitted 30 September, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: This paper has been accepted to ECCV 2022 as an Oral paper

  30. arXiv:2110.13412  [pdf, other]

    cs.CV

    TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

    Authors: Tanzila Rahman, Mengyu Yang, Leonid Sigal

    Abstract: The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored its use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-… ▽ More

    Submitted 26 October, 2021; originally announced October 2021.

    Comments: 10 pages, 5 Figures, NeurIPS 2021

    Journal ref: https://nips.cc/Conferences/2021

  31. arXiv:2106.03089  [pdf, other]

    cs.CV

    Referring Transformer: A One-step Approach to Multi-task Visual Grounding

    Authors: Muchen Li, Leonid Sigal

    Abstract: As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or segmentation (RES) either suffer from limited performance, due to a two-stage setup, or require the designing of complex task-specific one-stage architectures. In this… ▽ More

    Submitted 14 July, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

  32. arXiv:2104.14207  [pdf, other]

    cs.CV

    Segmentation-grounded Scene Graph Generation

    Authors: Siddhesh Khandelwal, Mohammed Suhail, Leonid Sigal

    Abstract: Scene graph generation has emerged as an important problem in computer vision. While scene graphs provide a grounded representation of objects, their locations and relations in an image, they do so only at the granularity of proposal bounding boxes. In this work, we propose the first, to our knowledge, framework for pixel-level segmentation-grounded scene graph generation. Our framework is agnosti… ▽ More

    Submitted 29 April, 2021; originally announced April 2021.

    Comments: 11 pages, 3 figures, 4 tables

  33. arXiv:2104.02606  [pdf, other]

    cs.CV cs.SD eess.AS eess.IV

    Weakly-supervised Audio-visual Sound Source Detection and Separation

    Authors: Tanzila Rahman, Leonid Sigal

    Abstract: Learning how to localize and separate individual object sounds in the audio channel of the video is a difficult task. Current state-of-the-art methods predict audio masks from artificially mixed spectrograms, known as Mix-and-Separate framework. We propose an audio-visual co-segmentation, where the network learns both what individual objects look and sound like, from videos labeled with only objec… ▽ More

    Submitted 25 March, 2021; originally announced April 2021.

    Comments: 4 figures, 6 pages

    Journal ref: IEEE International Conference on Multimedia and Expo (ICME) 2021

  34. arXiv:2103.02221  [pdf, other]

    cs.CV cs.LG

    Energy-Based Learning for Scene Graph Generation

    Authors: Mohammed Suhail, Abhay Mittal, Behjat Siddiquie, Chris Broaddus, Jayan Eledath, Gerard Medioni, Leonid Sigal

    Abstract: Traditional scene graph generation methods are trained using cross-entropy losses that treat objects and relationships as independent entities. Such a formulation, however, ignores the structure in the output space, in an inherently structured prediction problem. In this work, we introduce a novel energy-based learning framework for generating scene graphs. The proposed formulation allows for effi… ▽ More

    Submitted 3 March, 2021; originally announced March 2021.

  35. arXiv:2011.02164  [pdf, other]

    cs.CV cs.CL

    An Improved Attention for Visual Question Answering

    Authors: Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini

    Abstract: We consider the problem of Visual Question Answering (VQA). Given an image and a free-form, open-ended question, expressed in natural language, the goal of a VQA system is to provide an accurate answer to this question with respect to the image. The task is challenging because it requires simultaneous and intricate understanding of both visual and textual information. Attention, which captures intra-… ▽ More

    Submitted 3 June, 2021; v1 submitted 4 November, 2020; originally announced November 2020.

    Comments: 8 pages

  36. arXiv:2008.12679  [pdf, other]

    cs.CV cs.LG

    Person-in-Context Synthesis with Compositional Structural Space

    Authors: Weidong Yin, Ziwei Liu, Leonid Sigal

    Abstract: Despite significant progress, controlled generation of complex images with interacting people remains difficult. Existing layout generation methods fall short of synthesizing realistic person instances; while pose-guided generation approaches focus on a single person and assume simple or known backgrounds. To tackle these limitations, we propose a new problem, Persons in Context Synthesis… ▽ More

    Submitted 28 August, 2020; originally announced August 2020.

  37. arXiv:2008.11932  [pdf, other]

    cs.CV

    Attribute-guided image generation from layout

    Authors: Ke Ma, Bo Zhao, Leonid Sigal

    Abstract: Recent approaches have achieved great success in image generation from structured inputs, e.g., semantic segmentation, scene graph or layout. Although these methods allow specification of objects and their locations at image-level, they lack the fidelity and semantic control to specify visual appearance of these objects at an instance-level. To address this limitation, we propose a new image gener… ▽ More

    Submitted 27 August, 2020; originally announced August 2020.

    Journal ref: BMVC 2020

  38. arXiv:2006.14727  [pdf, other]

    cs.CV cs.LG

    Unsupervised Video Decomposition using Spatio-temporal Iterative Inference

    Authors: Polina Zablotskaia, Edoardo A. Dominici, Leonid Sigal, Andreas M. Lehrmann

    Abstract: Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress in static scenes, such models are unable to leverage important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencie… ▽ More

    Submitted 25 June, 2020; originally announced June 2020.

  39. arXiv:2006.12770  [pdf, other]

    cs.CV

    Discriminative Feature Alignment: Improving Transferability of Unsupervised Domain Adaptation by Gaussian-guided Latent Alignment

    Authors: Jing Wang, Jiahong Chen, Jianzhe Lin, Leonid Sigal, Clarence W. de Silva

    Abstract: In this study, we focus on the unsupervised domain adaptation problem where an approximate inference model is to be learned from a labeled data domain and expected to generalize well to an unlabeled data domain. The success of unsupervised domain adaptation largely relies on the cross-domain feature alignment. Previous work has attempted to directly align latent features by the classifier-induced… ▽ More

    Submitted 9 August, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: 14 pages, 11 figures

  40. arXiv:2006.07502  [pdf, other]

    cs.CV

    UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation

    Authors: Siddhesh Khandelwal, Raghav Goyal, Leonid Sigal

    Abstract: Methods for object detection and segmentation rely on large scale instance-level annotations for training, which are difficult and time-consuming to collect. Efforts to alleviate this look at varying degrees and quality of supervision. Weakly-supervised approaches draw on image-level labels to build detectors/segmentors, while zero/few-shot methods assume abundant instance-level data for a set of… ▽ More

    Submitted 3 March, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: 22 Pages, 8 Figures, 13 Tables

  41. arXiv:2004.00760  [pdf, other]

    cs.CV

    Consistent Multiple Sequence Decoding

    Authors: Bicheng Xu, Leonid Sigal

    Abstract: Sequence decoding is one of the core components of most visual-lingual models. However, typical neural decoders, when faced with decoding multiple, possibly correlated, sequences of tokens, resort to simple independent decoding schemes. In this paper, we introduce a consistent multiple sequence decoding architecture, which, while relatively simple, is general and allows for consistent and simultan… ▽ More

    Submitted 15 April, 2020; v1 submitted 1 April, 2020; originally announced April 2020.

  42. arXiv:2002.10501  [pdf, other]

    cs.LG stat.ML

    Variational Hyper RNN for Sequence Modeling

    Authors: Ruizhi Deng, Yanshuai Cao, Bo Chang, Leonid Sigal, Greg Mori, Marcus A. Brubaker

    Abstract: In this work, we propose a novel probabilistic sequence model that excels at capturing high variability in time series data, both across sequences and within an individual sequence. Our method uses temporal latent variables to capture information about the underlying data pattern and dynamically decodes the latent information into modifications of weights of the base decoder and recurrent model. T… ▽ More

    Submitted 24 February, 2020; originally announced February 2020.

  43. arXiv:1912.10589  [pdf, other]

    cs.CV cs.GR

    Front2Back: Single View 3D Shape Reconstruction via Front to Back Prediction

    Authors: Yuan Yao, Nico Schertler, Enrique Rosales, Helge Rhodin, Leonid Sigal, Alla Sheffer

    Abstract: Reconstruction of a 3D shape from a single 2D image is a classical computer vision problem, whose difficulty stems from the inherent ambiguity of recovering occluded or only partially observed surfaces. Recent methods address this challenge through the use of largely unstructured neural networks that effectively distill conditional mapping and priors over 3D shape. In this work, we induce structur…

    Submitted 31 January, 2020; v1 submitted 22 December, 2019; originally announced December 2019.

  44. arXiv:1912.03432  [pdf, other]

    cs.CV

    Improved Few-Shot Visual Classification

    Authors: Peyman Bateni, Raghav Goyal, Vaden Masrani, Frank Wood, Leonid Sigal

    Abstract: Few-shot learning is a fundamental task in computer vision that carries the promise of alleviating the need for exhaustively labeled data. Most few-shot learning approaches to date have focused on progressively more complex neural feature extractors and classifier adaptation strategies, as well as the refinement of the task definition itself. In this paper, we explore the hypothesis that a simple…

    Submitted 11 June, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

  45. arXiv:1912.02401  [pdf, other]

    cs.CV cs.LG eess.IV

    Generating Videos of Zero-Shot Compositions of Actions and Objects

    Authors: Megha Nawhal, Mengyao Zhai, Andreas Lehrmann, Leonid Sigal, Greg Mori

    Abstract: Human activity videos involve rich, varied interactions between people and objects. In this paper we develop methods for generating such videos -- making progress toward addressing the important, open problem of video generation in complex scenes. In particular, we introduce the task of generating human-object interaction videos in a zero-shot compositional setting, i.e., generating videos for act…

    Submitted 17 July, 2020; v1 submitted 5 December, 2019; originally announced December 2019.

    Comments: Accepted at ECCV'20; Project Page: https://www.sfu.ca/~mnawhal/projects/zs_hoi_generation.html

  46. arXiv:1912.00076  [pdf, other]

    cs.CV

    OptiBox: Breaking the Limits of Proposals for Visual Grounding

    Authors: Zicong Fan, Si Yi Meng, Leonid Sigal, James J. Little

    Abstract: The problem of language grounding has attracted much attention in recent years due to its pivotal role in more general image-lingual high level reasoning tasks (e.g., image captioning, VQA). Despite the tremendous progress in visual grounding, the performance of most approaches has been hindered by the quality of bounding box proposals obtained in the early stages of all recent pipelines. To addre…

    Submitted 29 November, 2019; originally announced December 2019.

  47. arXiv:1910.09139  [pdf, other]

    cs.CV cs.LG

    DwNet: Dense warp-based network for pose-guided human video generation

    Authors: Polina Zablotskaia, Aliaksandr Siarohin, Bo Zhao, Leonid Sigal

    Abstract: Generation of realistic high-resolution videos of human subjects is a challenging and important task in computer vision. In this paper, we focus on human motion transfer - generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video. Our GAN-based architecture, DwNet, leverages dense intermediate pose-g…

    Submitted 20 October, 2019; originally announced October 2019.

    Comments: Accepted to BMVC 2019

  48. arXiv:1909.09944  [pdf, other]

    cs.CV

    Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning

    Authors: Tanzila Rahman, Bicheng Xu, Leonid Sigal

    Abstract: Multi-modal learning, particularly among imaging and linguistic modalities, has made amazing strides in many high-level fundamental visual understanding problems, ranging from language grounding to dense event captioning. However, much of the research has been limited to approaches that either do not take audio corresponding to video into account at all, or those that model the audio-visual correl…

    Submitted 25 October, 2019; v1 submitted 22 September, 2019; originally announced September 2019.

    Journal ref: ICCV2019

  49. arXiv:1907.10719  [pdf, other]

    cs.CV

    LayoutVAE: Stochastic Scene Layout Generation From a Label Set

    Authors: Akash Abdu Jyothi, Thibaut Durand, Jiawei He, Leonid Sigal, Greg Mori

    Abstract: Recently there has been increasing interest in scene generation within the research community. However, models used for generating scene layouts from textual description largely ignore plausible visual variations within the structure dictated by the text. We propose LayoutVAE, a variational autoencoder based framework for generating stochastic scene layouts. LayoutVAE is a versatile modeling framewor…

    Submitted 1 June, 2021; v1 submitted 24 July, 2019; originally announced July 2019.

    Comments: 20 pages, 24 figures, accepted in ICCV 2019

  50. arXiv:1905.09400  [pdf, other]

    cs.CV

    AttentionRNN: A Structured Spatial Attention Mechanism

    Authors: Siddhesh Khandelwal, Leonid Sigal

    Abstract: Visual attention mechanisms have proven to be integrally important constituent components of many modern deep neural architectures. They provide an efficient and effective way to utilize visual information selectively, which has been shown to be especially valuable in multi-modal learning tasks. However, all prior attention frameworks lack the ability to explicitly model structural dependencies among a…

    Submitted 22 May, 2019; originally announced May 2019.