Showing 1–9 of 9 results for author: Smaira, L

  1. arXiv:2305.13786  [pdf, other]

    cs.CV cs.AI cs.LG

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira

    Abstract: We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning…

    Submitted 30 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  2. arXiv:2301.09595  [pdf, other]

    cs.CV

    Zorro: the masked multimodal transformer

    Authors: Adrià Recasens, Jason Lin, João Carreira, Drew Jaegle, Luyu Wang, Jean-Baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

    Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires in…

    Submitted 22 February, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

  3. arXiv:2211.03726  [pdf, other]

    cs.CV stat.ML

    TAP-Vid: A Benchmark for Tracking Any Point in a Video

    Authors: Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang

    Abstract: Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation e…

    Submitted 31 March, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Published in NeurIPS Datasets and Benchmarks track, 2022

  4. arXiv:2111.12124  [pdf, ps, other]

    cs.SD eess.AS

    Towards Learning Universal Audio Representations

    Authors: Luyu Wang, Pauline Luc, Yan Wu, Adria Recasens, Lucas Smaira, Andrew Brock, Andrew Jaegle, Jean-Baptiste Alayrac, Sander Dieleman, Joao Carreira, Aaron van den Oord

    Abstract: The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learni…

    Submitted 23 June, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

  5. arXiv:2011.14124  [pdf, ps, other]

    cs.AI

    Human-Agent Cooperation in Bridge Bidding

    Authors: Edward Lockhart, Neil Burch, Nolan Bard, Sebastian Borgeaud, Tom Eccles, Lucas Smaira, Ray Smith

    Abstract: We introduce a human-compatible reinforcement-learning approach to a cooperative game, making use of a third-party hand-coded human-compatible bot to generate initial training data and to perform initial evaluation. Our learning approach consists of imitation learning, search, and policy iteration. Our trained agents achieve a new state-of-the-art for bridge bidding in three settings: an agent pla…

    Submitted 28 November, 2020; originally announced November 2020.

  6. arXiv:2010.10864  [pdf, other]

    cs.CV cs.LG

    A Short Note on the Kinetics-700-2020 Human Action Dataset

    Authors: Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman

    Abstract: We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results…

    Submitted 21 October, 2020; originally announced October 2020.

  7. arXiv:2006.16228  [pdf, other]

    cs.CV

    Self-Supervised MultiModal Versatile Networks

    Authors: Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

    Abstract: Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalit…

    Submitted 30 October, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: To appear in the Thirty-Fourth Annual Conference on Neural Information Processing Systems (NeurIPS 2020)

  8. arXiv:2003.05078  [pdf, other]

    cs.CV cs.CL cs.LG

    Visual Grounding in Video for Unsupervised Word Translation

    Authors: Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

    Abstract: There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instruc…

    Submitted 26 March, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

    Journal ref: CVPR 2020

  9. arXiv:1912.06430  [pdf, other]

    cs.CV

    End-to-End Learning of Visual Representations from Uncurated Instructional Videos

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narra…

    Submitted 23 August, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: CVPR'2020 Oral