Showing 1–19 of 19 results for author: Sigurdsson, G

Searching in archive cs.
  1. arXiv:2405.04834

    cs.CV

    FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

    Authors: Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang

    Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexibl…

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  2. arXiv:2311.16311

    cs.CV

    Characterizing Video Question Answering with Sparsified Inputs

    Authors: Shiyuan Huang, Robinson Piramuthu, Vicente Ordonez, Shih-Fu Chang, Gunnar A. Sigurdsson

    Abstract: In Video Question Answering, videos are often processed as a full-length sequence of frames to ensure minimal loss of information. Recent works have demonstrated evidence that sparse video inputs are sufficient to maintain high performance. However, they usually discuss the case of single frame selection. In our work, we extend the setting to multiple number of inputs and other modalities. We char…

    Submitted 27 November, 2023; originally announced November 2023.

  3. arXiv:2305.19228

    cs.CL cs.AI cs.SD eess.AS

    Unsupervised Melody-to-Lyric Generation

    Authors: Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Yiwen Chen, Tagyoung Chung, Jing Huang, Nanyun Peng

    Abstract: Automatic melody-to-lyric generation is a task in which song lyrics are generated to go with a given melody. It is of significant practical interest and more challenging than unconstrained lyric generation as the music imposes additional constraints onto the lyrics. The training data is limited as most songs are copyrighted, resulting in models that underfit the complicated cross-modal relationshi…

    Submitted 22 December, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: ACL 2023. arXiv admin note: substantial text overlap with arXiv:2305.07760

  4. arXiv:2305.07760

    cs.AI cs.CL cs.MM

    Unsupervised Melody-Guided Lyrics Generation

    Authors: Yufei Tian, Anjali Narayan-Chen, Shereen Oraby, Alessandra Cervone, Gunnar Sigurdsson, Chenyang Tao, Wenbo Zhao, Tagyoung Chung, Jing Huang, Nanyun Peng

    Abstract: Automatic song writing is a topic of significant practical interest. However, its research is largely hindered by the lack of training data due to copyright concerns and challenged by its creative nature. Most noticeably, prior works often fall short of modeling the cross-modal correlation between melody and lyrics due to limited parallel data, hence generating lyrics that are less singable. Exist…

    Submitted 25 May, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: Presented at AAAI23 CreativeAI workshop (Non-Archival). A later version is accepted to ACL23

  5. arXiv:2303.06710

    cs.RO cs.AI cs.LG

    Decision Making for Human-in-the-loop Robotic Agents via Uncertainty-Aware Reinforcement Learning

    Authors: Siddharth Singi, Zhanpeng He, Alvin Pan, Sandip Patel, Gunnar A. Sigurdsson, Robinson Piramuthu, Shuran Song, Matei Ciocarlie

    Abstract: In a Human-in-the-Loop paradigm, a robotic agent is able to act mostly autonomously in solving a task, but can request help from an external expert when needed. However, knowing when to request such assistance is critical: too few requests can lead to the robot making mistakes, but too many requests can overload the expert. In this paper, we present a Reinforcement Learning based approach to this…

    Submitted 14 March, 2023; v1 submitted 12 March, 2023; originally announced March 2023.

  6. arXiv:2301.12614

    cs.RO cs.AI cs.CV

    RREx-BoT: Remote Referring Expressions with a Bag of Tricks

    Authors: Gunnar A. Sigurdsson, Jesse Thomason, Gaurav S. Sukhatme, Robinson Piramuthu

    Abstract: Household robots operate in the same space for years. Such robots incrementally build dynamic maps that can be used for tasks requiring remote object localization. However, benchmarks in robot learning often test generalization through inference on tasks in unobserved environments. In an observed environment, locating an object is reduced to choosing from among all object proposals in the environm…

    Submitted 29 January, 2023; originally announced January 2023.

  7. arXiv:2211.16649

    cs.CV cs.AI cs.CL cs.RO

    CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation

    Authors: Vishnu Sashank Dorbala, Gunnar Sigurdsson, Robinson Piramuthu, Jesse Thomason, Gaurav S. Sukhatme

    Abstract: Household environments are visually diverse. Embodied agents performing Vision-and-Language Navigation (VLN) in the wild must be able to handle this diversity, while also following arbitrary language instructions. Recently, Vision-Language models like CLIP have shown great performance on the task of zero-shot object recognition. In this work, we ask if these models are also capable of zero-shot la…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: 8 pages, Accepted at LangRob Workshop at Conference on Robot Learning (CoRL), 2022

  8. arXiv:2210.08391

    cs.CV

    Video in 10 Bits: Few-Bit VideoQA for Efficiency and Privacy

    Authors: Shiyuan Huang, Robinson Piramuthu, Shih-Fu Chang, Gunnar A. Sigurdsson

    Abstract: In Video Question Answering (VideoQA), answering general questions about a video requires its visual information. Yet, video often contains redundant information irrelevant to the VideoQA task. For example, if the task is only to answer questions similar to "Is someone laughing in the video?", then all other information can be discarded. This paper investigates how many bits are really needed from…

    Submitted 17 October, 2022; v1 submitted 15 October, 2022; originally announced October 2022.

    Comments: ECCV Workshop 2022

  9. arXiv:2206.13396

    cs.CV cs.AI cs.LG cs.RO

    A Simple Approach for Visual Rearrangement: 3D Mapping and Semantic Search

    Authors: Brandon Trabucco, Gunnar Sigurdsson, Robinson Piramuthu, Gaurav S. Sukhatme, Ruslan Salakhutdinov

    Abstract: Physically rearranging objects is an important capability for embodied agents. Visual room rearrangement evaluates an agent's ability to rearrange objects in a room to a desired goal based solely on visual input. We propose a simple yet effective method for this problem: (1) search for and map which objects need to be rearranged, and (2) rearrange each object until the task is complete. Our approa…

    Submitted 9 August, 2022; v1 submitted 20 June, 2022; originally announced June 2022.

    Comments: Winner of the Rearrangement Challenge at CVPR 2022

  10. Identifying rote learning and the supporting effects of hints in drills

    Authors: Gunnar Stefansson, Anna Helga Jonsdottir, Thorarinn Jonmundsson, Gylfi Snaer Sigurdsson, Ingunn Lilja Bergsdottir

    Abstract: Whenever students use any drilling system the question arises how much of their learning is meaningful learning vs memorisation through repetition or rote learning. Although both types of learning have their place in an educational system it is important to be able to distinguish between these two approaches to learning and identify options which can dislodge students from rote learning and motiva…

    Submitted 19 August, 2021; originally announced August 2021.

  11. arXiv:2003.05614

    cs.CV

    Beyond the Camera: Neural Networks in World Coordinates

    Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Karteek Alahari

    Abstract: Eye movement and strategic placement of the visual field onto the retina, gives animals increased resolution of the scene and suppresses distracting information. This fundamental system has been missing from video understanding with deep networks, typically limited to 224 by 224 pixel content locked to the camera frame. We propose a simple idea, WorldFeatures, where each feature at every layer has…

    Submitted 12 March, 2020; originally announced March 2020.

  12. arXiv:2003.05078

    cs.CV cs.CL cs.LG

    Visual Grounding in Video for Unsupervised Word Translation

    Authors: Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

    Abstract: There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instruc…

    Submitted 26 March, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

    Journal ref: CVPR 2020

  13. arXiv:1804.09627

    cs.CV

    Actor and Observer: Joint Modeling of First and Third-Person Videos

    Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

    Abstract: Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data. This paper takes a st…

    Submitted 25 April, 2018; originally announced April 2018.

    Comments: CVPR 2018 spotlight presentation

    Journal ref: CVPR 2018

  14. arXiv:1804.09626

    cs.CV

    Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos

    Authors: Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, Karteek Alahari

    Abstract: In Actor and Observer we introduced a dataset linking the first and third-person video understanding domains, the Charades-Ego Dataset. In this paper we describe the egocentric aspect of the dataset and present annotations for Charades-Ego with 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available. Chara…

    Submitted 30 April, 2018; v1 submitted 25 April, 2018; originally announced April 2018.

  15. arXiv:1708.02696

    cs.CV

    What Actions are Needed for Understanding Human Actions in Videos?

    Authors: Gunnar A. Sigurdsson, Olga Russakovsky, Abhinav Gupta

    Abstract: What is the right way to reason about human activities? What directions forward are most promising? In this work, we analyze the current state of human activity understanding in videos. The goal of this paper is to examine datasets, evaluation metrics, algorithms, and potential future directions. We look at the qualitative attributes that define activities such as pose variability, brevity, and de…

    Submitted 8 August, 2017; originally announced August 2017.

    Comments: ICCV2017

  16. arXiv:1612.06371

    cs.CV

    Asynchronous Temporal Fields for Action Recognition

    Authors: Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta

    Abstract: Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as the higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for r…

    Submitted 24 July, 2017; v1 submitted 19 December, 2016; originally announced December 2016.

  17. arXiv:1607.07429

    cs.HC cs.CV

    Much Ado About Time: Exhaustive Annotation of Temporal Data

    Authors: Gunnar A. Sigurdsson, Olga Russakovsky, Ali Farhadi, Ivan Laptev, Abhinav Gupta

    Abstract: Large-scale annotated datasets allow AI systems to learn from and build upon the knowledge of the crowd. Many crowdsourcing techniques have been developed for collecting image annotations. These techniques often implicitly rely on the fact that a new input image takes a negligible amount of time to perceive. In contrast, we investigate and determine the most cost-effective way of obtaining high-qu…

    Submitted 2 October, 2016; v1 submitted 25 July, 2016; originally announced July 2016.

    Comments: HCOMP 2016 Camera Ready

  18. arXiv:1604.04279

    cs.CV

    Learning Visual Storylines with Skipping Recurrent Neural Networks

    Authors: Gunnar A. Sigurdsson, Xinlei Chen, Abhinav Gupta

    Abstract: What does a typical visit to Paris look like? Do people first take photos of the Louvre and then the Eiffel Tower? Can we visually model a temporal event like "Paris Vacation" using current frameworks? In this paper, we explore how we can automatically learn the temporal aspects, or storylines of visual concepts from web data. Previous attempts focus on consecutive image-to-image transitions and a…

    Submitted 26 July, 2016; v1 submitted 14 April, 2016; originally announced April 2016.

    Comments: European Conference on Computer Vision (ECCV) 2016

  19. arXiv:1604.01753

    cs.CV

    Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

    Authors: Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta

    Abstract: Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So h…

    Submitted 26 July, 2016; v1 submitted 6 April, 2016; originally announced April 2016.