-
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Authors:
Viorica Pătrăucean,
Lucas Smaira,
Ankush Gupta,
Adrià Recasens Continente,
Larisa Markeeva,
Dylan Banarse,
Skanda Koppula,
Joseph Heyward,
Mateusz Malinowski,
Yi Yang,
Carl Doersch,
Tatiana Matejovicova,
Yury Sulsky,
Antoine Miech,
Alex Frechette,
Hanna Klimczak,
Raphael Koster,
Junlin Zhang,
Stephanie Winkler,
Yusuf Aytar,
Simon Osindero,
Dima Damen,
Andrew Zisserman,
João Carreira
Abstract:
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a substantial gap in performance (91.4% vs 46.2%), suggesting that there is significant room for improvement in multimodal video understanding.
Dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
Submitted 30 October, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Zorro: the masked multimodal transformer
Authors:
Adrià Recasens,
Jason Lin,
João Carreira,
Drew Jaegle,
Luyu Wang,
Jean-Baptiste Alayrac,
Pauline Luc,
Antoine Miech,
Lucas Smaira,
Ross Hemsley,
Andrew Zisserman
Abstract:
Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
Submitted 22 February, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
TAP-Vid: A Benchmark for Tracking Any Point in a Video
Authors:
Carl Doersch,
Ankush Gupta,
Larisa Markeeva,
Adrià Recasens,
Lucas Smaira,
Yusuf Aytar,
João Carreira,
Andrew Zisserman,
Yi Yang
Abstract:
Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation existed, until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks, and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline which uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
Submitted 31 March, 2023; v1 submitted 7 November, 2022;
originally announced November 2022.
-
Towards Learning Universal Audio Representations
Authors:
Luyu Wang,
Pauline Luc,
Yan Wu,
Adria Recasens,
Lucas Smaira,
Andrew Brock,
Andrew Jaegle,
Jean-Baptiste Alayrac,
Sander Dieleman,
Joao Carreira,
Aaron van den Oord
Abstract:
The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark. We discover that previous sound event classification or speech models do not generalize outside of their domains. We observe that more robust audio representations can be learned with the SimCLR objective; however, the model's transferability depends heavily on the model architecture. We find the Slowfast architecture is good at learning rich representations required by different domains, but its performance is affected by the normalization scheme. Based on these findings, we propose a novel normalizer-free Slowfast NFNet and achieve state-of-the-art performance across all domains.
Submitted 23 June, 2022; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Human-Agent Cooperation in Bridge Bidding
Authors:
Edward Lockhart,
Neil Burch,
Nolan Bard,
Sebastian Borgeaud,
Tom Eccles,
Lucas Smaira,
Ray Smith
Abstract:
We introduce a human-compatible reinforcement-learning approach to a cooperative game, making use of a third-party hand-coded human-compatible bot to generate initial training data and to perform initial evaluation. Our learning approach consists of imitation learning, search, and policy iteration. Our trained agents achieve a new state-of-the-art for bridge bidding in three settings: an agent playing in partnership with a copy of itself; an agent partnering a pre-existing bot; and an agent partnering a human player.
Submitted 28 November, 2020;
originally announced November 2020.
-
A Short Note on the Kinetics-700-2020 Human Action Dataset
Authors:
Lucas Smaira,
João Carreira,
Eric Noland,
Ellen Clancy,
Amy Wu,
Andrew Zisserman
Abstract:
We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results using the I3D network.
Submitted 21 October, 2020;
originally announced October 2020.
-
Self-Supervised MultiModal Versatile Networks
Authors:
Jean-Baptiste Alayrac,
Adrià Recasens,
Rosalia Schneider,
Relja Arandjelović,
Jason Ramapuram,
Jeffrey De Fauw,
Lucas Smaira,
Sander Dieleman,
Andrew Zisserman
Abstract:
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
Submitted 30 October, 2020; v1 submitted 29 June, 2020;
originally announced June 2020.
-
Visual Grounding in Video for Unsupervised Word Translation
Authors:
Gunnar A. Sigurdsson,
Jean-Baptiste Alayrac,
Aida Nematzadeh,
Lucas Smaira,
Mateusz Malinowski,
João Carreira,
Phil Blunsom,
Andrew Zisserman
Abstract:
There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora and simply by watching many videos of people speaking while doing things.
Submitted 26 March, 2020; v1 submitted 10 March, 2020;
originally announced March 2020.
-
End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Authors:
Antoine Miech,
Jean-Baptiste Alayrac,
Lucas Smaira,
Ivan Laptev,
Josef Sivic,
Andrew Zisserman
Abstract:
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
Submitted 23 August, 2020; v1 submitted 13 December, 2019;
originally announced December 2019.