Showing 1–11 of 11 results for author: Castrejon, L

  1. arXiv:2404.05465

    cs.CV cs.LG

    HAMMR: HierArchical MultiModal React agents for generic VQA

    Authors: Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, Jasper Uijlings

    Abstract: Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal probl…

    Submitted 8 April, 2024; originally announced April 2024.

  2. arXiv:2310.06641

    cs.CV

    How (not) to ensemble LVLMs for VQA

    Authors: Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink

    Abstract: This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to models augmented with Lens-based retrieval of Wik…

    Submitted 7 December, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

    Comments: 4th I Can't Believe It's Not Better Workshop (co-located with NeurIPS 2023)

  3. arXiv:2306.09224

    cs.CV

    Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

    Authors: Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari

    Abstract: We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evi…

    Submitted 24 July, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: ICCV'23

  4. arXiv:2206.00735

    cs.CV cs.LG

    Cascaded Video Generation for Videos In-the-Wild

    Authors: Lluis Castrejon, Nicolas Ballas, Aaron Courville

    Abstract: Videos can be created by first outlining a global view of the scene and then adding local details. Inspired by this idea we propose a cascaded model for video generation which follows a coarse to fine approach. First our model generates a low resolution video, establishing the global scene structure, which is then refined by subsequent cascade levels operating at larger resolutions. We train each…

    Submitted 1 June, 2022; originally announced June 2022.

    Comments: Accepted to the 26th International Conference on Pattern Recognition (ICPR 2022). arXiv admin note: substantial text overlap with arXiv:2106.02719

  5. arXiv:2106.02719

    cs.CV

    Hierarchical Video Generation for Complex Data

    Authors: Lluis Castrejon, Nicolas Ballas, Aaron Courville

    Abstract: Videos can often be created by first outlining a global description of the scene and then adding local details. Inspired by this we propose a hierarchical model for video generation which follows a coarse to fine approach. First our model generates a low resolution video, establishing the global scene structure, that is then refined by subsequent levels in the hierarchy. We train each level in our…

    Submitted 4 June, 2021; originally announced June 2021.

  6. arXiv:2006.10803

    cs.LG cs.CV stat.ML

    Supervision Accelerates Pre-training in Contrastive Semi-Supervised Learning of Visual Representations

    Authors: Mahmoud Assran, Nicolas Ballas, Lluis Castrejon, Michael Rabbat

    Abstract: We investigate a strategy for improving the efficiency of contrastive learning of visual representations by leveraging a small amount of supervised information during pre-training. We propose a semi-supervised loss, SuNCEt, based on noise-contrastive estimation and neighbourhood component analysis, that aims to distinguish examples of different classes in addition to the self-supervised instance-w…

    Submitted 1 December, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

  7. arXiv:1904.12165

    cs.CV cs.LG

    Improved Conditional VRNNs for Video Prediction

    Authors: Lluis Castrejon, Nicolas Ballas, Aaron Courville

    Abstract: Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder. While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this i…

    Submitted 27 April, 2019; originally announced April 2019.

    Comments: Project page: https://sites.google.com/view/videovrnn

  8. arXiv:1712.06761

    cs.CV

    MovieGraphs: Towards Understanding Human-Centric Situations from Videos

    Authors: Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler

    Abstract: There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of…

    Submitted 15 April, 2018; v1 submitted 18 December, 2017; originally announced December 2017.

    Comments: Spotlight at CVPR 2018. Webpage: http://moviegraphs.cs.toronto.edu

  9. arXiv:1704.05548

    cs.CV

    Annotating Object Instances with a Polygon-RNN

    Authors: Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, Sanja Fidler

    Abstract: We propose an approach for semi-automatic annotation of object instances. While most current methods treat object segmentation as a pixel-labeling problem, we here cast it as a polygon prediction task, mimicking how most current datasets have been annotated. In particular, our approach takes as input an image crop and sequentially produces vertices of the polygon outlining the object. This allows…

    Submitted 18 April, 2017; originally announced April 2017.

    Journal ref: CVPR 2017

  10. arXiv:1610.09003

    cs.CV cs.LG cs.MM

    Cross-Modal Scene Networks

    Authors: Yusuf Aytar, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

    Abstract: People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize scenes well, they also learn an intermediate representation not aligned across modalit…

    Submitted 27 October, 2016; originally announced October 2016.

    Comments: See more at http://cmplaces.csail.mit.edu/. arXiv admin note: text overlap with arXiv:1607.07295

  11. arXiv:1607.07295

    cs.CV

    Learning Aligned Cross-Modal Representations from Weakly Aligned Data

    Authors: Lluis Castrejon, Yusuf Aytar, Carl Vondrick, Hamed Pirsiavash, Antonio Torralba

    Abstract: People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize cross-modal scenes well, they also learn an intermediate representation not aligned ac…

    Submitted 25 July, 2016; originally announced July 2016.

    Comments: Conference paper at CVPR 2016