Showing 1–12 of 12 results for author: Suris, D

  1. arXiv:2401.14398  [pdf, other]

    cs.CV cs.LG

    pix2gestalt: Amodal Segmentation by Synthesizing Wholes

    Authors: Ege Ozguroglu, Ruoshi Liu, Dídac Surís, Dian Chen, Achal Dave, Pavel Tokmakov, Carl Vondrick

    Abstract: We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, incl…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: Website: https://gestalt.cs.columbia.edu/

  2. arXiv:2303.08128  [pdf, other]

    cs.CV

    ViperGPT: Visual Inference via Python Execution for Reasoning

    Authors: Dídac Surís, Sachit Menon, Carl Vondrick

    Abstract: Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules sim…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: Website: https://viper.cs.columbia.edu/

  3. arXiv:2211.11903  [pdf, other]

    cs.RO cs.CV

    FLEX: Full-Body Grasping Without Full-Body Grasps

    Authors: Purva Tendulkar, Dídac Surís, Carl Vondrick

    Abstract: Synthesizing 3D human avatars interacting realistically with a scene is an important problem with applications in AR/VR, video games and robotics. Towards this goal, we address the task of generating a virtual human -- hands and full body -- grasping everyday objects. Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data. How…

    Submitted 28 March, 2023; v1 submitted 21 November, 2022; originally announced November 2022.

    Comments: CVPR 2023 Camera-ready

  4. arXiv:2210.01322  [pdf, other]

    cs.LG cs.AI cs.CV

    Representing Spatial Trajectories as Distributions

    Authors: Dídac Surís, Carl Vondrick

    Abstract: We introduce a representation learning framework for spatial trajectories. We represent partial observations of trajectories as probability distributions in a learned latent space, which characterize the uncertainty about unobserved parts of the trajectory. Our framework allows us to obtain samples from a trajectory for any continuous point in time, both interpolating and extrapolating. Our flexib…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted to NeurIPS 2022

  5. arXiv:2206.07148  [pdf, other]

    cs.MM cs.CV

    It's Time for Artistic Correspondence in Music and Video

    Authors: Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon

    Abstract: We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level. We propose a self-supervised approach that learns this correspondence directly from data, without any need of human annotations. In order to capture the high-level concepts that are required to solve the task, we propose mode…

    Submitted 14 June, 2022; originally announced June 2022.

    Comments: CVPR 2022

  6. arXiv:2204.10916  [pdf, other]

    cs.CV cs.LG

    Revealing Occlusions with 4D Neural Fields

    Authors: Basile Van Hoorick, Purva Tendulkar, Didac Suris, Dennis Park, Simon Stent, Carl Vondrick

    Abstract: For computer vision systems to operate in dynamic situations, they need to be able to represent and reason about object permanence. We introduce a framework for learning to estimate 4D visual representations from monocular RGB-D, which is able to persist objects, even once they become obstructed by occlusions. Unlike traditional video representations, we encode point clouds into a continuous repre…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: CVPR 2022 (Oral)

  7. arXiv:2101.01600  [pdf, other]

    cs.CV cs.LG eess.IV

    Learning the Predictability of the Future

    Authors: Dídac Surís, Ruoshi Liu, Carl Vondrick

    Abstract: We introduce a framework for learning from unlabeled video what is predictable in the future. Instead of committing up front to features to predict, our approach learns from data which features are predictable. Based on the observation that hyperbolic geometry naturally and compactly encodes hierarchical structure, we propose a predictive model in hyperbolic space. When the model is most confident…

    Submitted 1 January, 2021; originally announced January 2021.

    Comments: Website: https://hyperfuture.cs.columbia.edu

  8. arXiv:2012.04631  [pdf, other]

    cs.CL cs.CV cs.LG

    Globetrotter: Connecting Languages by Connecting Images

    Authors: Dídac Surís, Dave Epstein, Carl Vondrick

    Abstract: Machine translation between many languages at once is highly challenging, since training with ground truth requires supervision between all language pairs, which is difficult to obtain. Our key insight is that, while languages may vary drastically, the underlying visual appearance of the world remains consistent. We introduce a method that uses visual observations to bridge the gap between languag…

    Submitted 31 March, 2022; v1 submitted 8 December, 2020; originally announced December 2020.

    Comments: CVPR 2022 (Oral)

  9. arXiv:1911.11237  [pdf, other]

    cs.CL cs.CV cs.LG

    Learning to Learn Words from Visual Scenes

    Authors: Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, Carl Vondrick

    Abstract: Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our…

    Submitted 12 July, 2020; v1 submitted 25 November, 2019; originally announced November 2019.

    Comments: 26 pages, 12 figures

    Journal ref: European Conference on Computer Vision (ECCV), 2020

  10. arXiv:1804.01452  [pdf, other]

    cs.CV cs.CL cs.SD

    Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

    Authors: David Harwath, Adrià Recasens, Dídac Surís, Galen Chuang, Antonio Torralba, James Glass

    Abstract: In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform an image-audio retrieval task. Our models operate directly…

    Submitted 4 April, 2018; originally announced April 2018.

  11. arXiv:1801.02200  [pdf, other]

    cs.IR cs.CV cs.SD eess.AS

    Cross-modal Embeddings for Video and Audio Retrieval

    Authors: Didac Surís, Amanda Duarte, Amaia Salvador, Jordi Torres, Xavier Giró-i-Nieto

    Abstract: The increasing amount of online videos brings several opportunities for training self-supervised neural networks. The creation of large scale datasets of videos such as the YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural netwo…

    Submitted 7 January, 2018; originally announced January 2018.

    Comments: 6 pages, 3 figures

  12. arXiv:1801.01423  [pdf, other]

    cs.LG cs.AI cs.NE stat.ML

    Overcoming catastrophic forgetting with hard attention to the task

    Authors: Joan Serrà, Dídac Surís, Marius Miron, Alexandros Karatzoglou

    Abstract: Catastrophic forgetting occurs when a neural network loses the information learned in a previous task after training on subsequent tasks. This problem remains a hurdle for artificial intelligence systems with sequential learning capabilities. In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning. A h…

    Submitted 29 May, 2018; v1 submitted 4 January, 2018; originally announced January 2018.

    Comments: Includes appendix. Accepted for ICML 2018