
Showing 1–44 of 44 results for author: Carreira, J

Searching in archive cs.
  1. arXiv:2402.00847  [pdf, other]

    cs.CV stat.ML

    BootsTAP: Bootstrapped Training for Tracking-Any-Point

    Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman

    Abstract: To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale groundtruth training data for TAP is only available in simulat…

    Submitted 23 May, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  2. arXiv:2312.13090  [pdf, other]

    cs.CV

    Perception Test 2023: A Summary of the First Challenge And Outcome

    Authors: Joseph Heyward, João Carreira, Dima Damen, Andrew Zisserman, Viorica Pătrăucean

    Abstract: The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023, with the goal of benchmarking state-of-the-art video models on the recently proposed Perception Test benchmark. The challenge had six tracks covering low-level and high-level tasks, with both a language and non-language interface, across video, audio,…

    Submitted 20 December, 2023; originally announced December 2023.

  3. arXiv:2312.00598  [pdf, other]

    cs.CV cs.AI

    Learning from One Continuous Video Stream

    Authors: João Carreira, Michael King, Viorica Pătrăucean, Dilara Gokay, Cătălin Ionescu, Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen, Andrew Zisserman

    Abstract: We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of str…

    Submitted 28 March, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: CVPR camera ready version

  4. arXiv:2310.08584  [pdf, other]

    cs.CV

    Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

    Authors: Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, Yannis Avrithis

    Abstract: Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution,…

    Submitted 23 May, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024 (Best paper honorable mention). Project Page: https://shashankvkt.github.io/dora

  5. arXiv:2306.08637  [pdf, other]

    cs.CV

    TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

    Authors: Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, Andrew Zisserman

    Abstract: We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on loc…

    Submitted 30 August, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Published at ICCV 2023

  6. arXiv:2305.13786  [pdf, other]

    cs.CV cs.AI cs.LG

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Authors: Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, João Carreira

    Abstract: We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning…

    Submitted 30 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  7. arXiv:2301.09595  [pdf, other]

    cs.CV

    Zorro: the masked multimodal transformer

    Authors: Adrià Recasens, Jason Lin, Joāo Carreira, Drew Jaegle, Luyu Wang, Jean-baptiste Alayrac, Pauline Luc, Antoine Miech, Lucas Smaira, Ross Hemsley, Andrew Zisserman

    Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires in…

    Submitted 22 February, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

  8. arXiv:2211.03726  [pdf, other]

    cs.CV stat.ML

    TAP-Vid: A Benchmark for Tracking Any Point in a Video

    Authors: Carl Doersch, Ankush Gupta, Larisa Markeeva, Adrià Recasens, Lucas Smaira, Yusuf Aytar, João Carreira, Andrew Zisserman, Yi Yang

    Abstract: Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful to make inferences about 3D shape, physical properties and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark for evaluation e…

    Submitted 31 March, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: Published in NeurIPS Datasets and Benchmarks track, 2022

  9. arXiv:2210.06433  [pdf, other]

    cs.CV cs.AI cs.LG

    Self-supervised video pretraining yields human-aligned visual representations

    Authors: Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff

    Abstract: Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human pe…

    Submitted 25 July, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: Technical report

  10. arXiv:2210.02995  [pdf, other]

    cs.CV

    Compressed Vision for Efficient Video Understanding

    Authors: Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski

    Abstract: Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos require more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long v…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: ACCV

  11. arXiv:2209.15589  [pdf, other]

    cs.CV cs.LG

    Where Should I Spend My FLOPS? Efficiency Evaluations of Visual Pre-training Methods

    Authors: Skanda Koppula, Yazhe Li, Evan Shelhamer, Andrew Jaegle, Nikhil Parthasarathy, Relja Arandjelovic, João Carreira, Olivier Hénaff

    Abstract: Self-supervised methods have achieved remarkable success in transfer learning, often achieving the same or better accuracy than supervised pre-training. Most prior work has done so by increasing pre-training computation by adding complex data augmentation, multiple views, or lengthy training schedules. In this work, we investigate a related, but orthogonal question: given a fixed FLOP budget, what…

    Submitted 18 October, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: 11 pages. 36th Conference on Neural Information Processing Systems, Workshop on Self-Supervised Learning (2022)

  12. arXiv:2203.09494  [pdf, other]

    cs.CV cs.LG

    Transframer: Arbitrary Frame Prediction with Generative Models

    Authors: Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, Peter Battaglia

    Abstract: We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation, to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs s…

    Submitted 9 May, 2022; v1 submitted 17 March, 2022; originally announced March 2022.

  13. arXiv:2203.08777  [pdf, other]

    cs.CV cs.AI cs.LG

    Object discovery and representation networks

    Authors: Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, Relja Arandjelović

    Abstract: The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks. While there has been excellent progress with simple, image-level learning, recent methods have shown the advantage of including knowledge of image structure. However, by introducing hand-crafted image segmentations to define regions of interest, or specialized augmentation strategie…

    Submitted 27 July, 2022; v1 submitted 16 March, 2022; originally announced March 2022.

    Comments: European Conference on Computer Vision (ECCV) 2022

  14. arXiv:2202.10890  [pdf, other]

    cs.CV

    HiP: Hierarchical Perceiver

    Authors: Joao Carreira, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, Relja Arandjelovic, Matt Botvinick, Oriol Vinyals, Karen Simonyan, Andrew Zisserman, Andrew Jaegle

    Abstract: General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This however hinders them from scaling up to the inputs sizes required to process raw high-resolution images or video. In this paper, we show that some degree of l…

    Submitted 3 November, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

  15. arXiv:2202.07765  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    General-purpose, long-context autoregressive modeling with Perceiver AR

    Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel

    Abstract: Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic…

    Submitted 14 June, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: ICML 2022

  16. arXiv:2112.03243  [pdf, other]

    cs.CV

    Input-level Inductive Biases for 3D Reconstruction

    Authors: Wang Yifan, Carl Doersch, Relja Arandjelović, João Carreira, Andrew Zisserman

    Abstract: Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases. In this paper we tackle 3D reconstruction using a domain agnostic architecture and study how instead to inject the same type of inductive biases directly as extra inputs to the model. This approach makes it possible to apply existing general models…

    Submitted 19 March, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

    Comments: CVPR 2022, including supplemental material

  17. arXiv:2111.12124  [pdf, ps, other]

    cs.SD eess.AS

    Towards Learning Universal Audio Representations

    Authors: Luyu Wang, Pauline Luc, Yan Wu, Adria Recasens, Lucas Smaira, Andrew Brock, Andrew Jaegle, Jean-Baptiste Alayrac, Sander Dieleman, Joao Carreira, Aaron van den Oord

    Abstract: The ability to learn universal audio representations that can solve diverse speech, music, and environment tasks can spur many applications that require general sound content understanding. In this work, we introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learni…

    Submitted 23 June, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

  18. arXiv:2107.14795  [pdf, other]

    cs.LG cs.CL cs.CV cs.SD eess.AS

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Authors: Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, Joāo Carreira

    Abstract: A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data f…

    Submitted 15 March, 2022; v1 submitted 30 July, 2021; originally announced July 2021.

    Comments: ICLR 2022 camera ready. Code: https://dpmd.ai/perceiver-code

  19. arXiv:2106.08318  [pdf, other]

    cs.CV cs.DC cs.LG eess.IV

    Gradient Forward-Propagation for Large-Scale Temporal Video Modelling

    Authors: Mateusz Malinowski, Dimitrios Vytiniotis, Grzegorz Swirszcz, Viorica Patraucean, Joao Carreira

    Abstract: How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and incr…

    Submitted 12 July, 2021; v1 submitted 15 June, 2021; originally announced June 2021.

    Comments: Accepted to CVPR 2021. arXiv admin note: text overlap with arXiv:2001.06232

  20. arXiv:2103.10957  [pdf, other]

    cs.CV

    Efficient Visual Pretraining with Contrastive Detection

    Authors: Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira

    Abstract: Self-supervised pretraining has been shown to yield powerful representations for transfer learning. These performance gains come at a large computational cost however, with state-of-the-art methods requiring an order of magnitude more computation than supervised pretraining. We tackle this computational bottleneck by introducing a new self-supervised objective, contrastive detection, which tasks r…

    Submitted 5 August, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: Technical report

  21. arXiv:2103.03206  [pdf, other]

    cs.CV cs.AI cs.LG cs.SD eess.AS

    Perceiver: General Perception with Iterative Attention

    Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

    Abstract: Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. T…

    Submitted 22 June, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

    Comments: ICML 2021

  22. arXiv:2010.10864  [pdf, other]

    cs.CV cs.LG

    A Short Note on the Kinetics-700-2020 Human Action Dataset

    Authors: Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman

    Abstract: We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results…

    Submitted 21 October, 2020; originally announced October 2020.

  23. arXiv:2005.00214  [pdf, other]

    cs.CV cs.LG eess.IV

    The AVA-Kinetics Localized Human Actions Video Dataset

    Authors: Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman

    Abstract: This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe…

    Submitted 20 May, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: 8 pages, 8 figures

  24. arXiv:2003.05078  [pdf, other]

    cs.CV cs.CL cs.LG

    Visual Grounding in Video for Unsupervised Word Translation

    Authors: Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

    Abstract: There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instruc…

    Submitted 26 March, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

    Journal ref: CVPR 2020

  25. arXiv:2001.06232  [pdf, other]

    cs.LG cs.CV stat.ML

    Sideways: Depth-Parallel Training of Video Models

    Authors: Mateusz Malinowski, Grzegorz Swirszcz, Joao Carreira, Viorica Patraucean

    Abstract: We propose Sideways, an approximate backpropagation scheme for training video models. In standard backpropagation, the gradients and activations at every computation step through the model are temporally synchronized. The forward activations need to be stored until the backward pass is executed, preventing inter-layer (depth) parallelization. However, can we leverage smooth, redundant input stream…

    Submitted 30 March, 2020; v1 submitted 17 January, 2020; originally announced January 2020.

    Comments: Accepted at CVPR'20

  26. arXiv:1910.11306  [pdf, other]

    cs.CV cs.NE eess.IV

    Controllable Attention for Structured Layered Video Decomposition

    Authors: Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman

    Abstract: The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to. For example, to be able to separate reflections, transparency or object motion. We make the following three contributions: (i) we introduce a new structured neural network architecture that explicitly incorporates layers (as spatial masks) into its desig…

    Submitted 24 October, 2019; originally announced October 2019.

    Comments: In ICCV 2019

  27. arXiv:1907.06987  [pdf, other]

    cs.CV

    A Short Note on the Kinetics-700 Human Action Dataset

    Authors: Joao Carreira, Eric Noland, Chloe Hillier, Andrew Zisserman

    Abstract: We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos. This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.

    Submitted 17 October, 2022; v1 submitted 15 July, 2019; originally announced July 2019.

    Comments: added note about dangers of training on k700 and evaluating on k400/k600. arXiv admin note: text overlap with arXiv:1808.01340

  28. arXiv:1902.03383  [pdf, ps, other]

    cs.OS

    Cloud Programming Simplified: A Berkeley View on Serverless Computing

    Authors: Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson

    Abstract: Serverless cloud computing handles virtually all the system administration operations needed to make it easier for programmers to use the cloud. It provides an interface that greatly simplifies cloud programming, and represents an evolution that parallels the transition from assembly language to high-level programming languages. This paper gives a quick history of cloud computing, including an acc…

    Submitted 9 February, 2019; originally announced February 2019.

  29. arXiv:1812.02707  [pdf, other]

    cs.CV

    Video Action Transformer Network

    Authors: Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

    Abstract: We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people…

    Submitted 17 May, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

    Comments: CVPR 2019

  30. arXiv:1812.01461  [pdf, other]

    cs.CV

    The Visual Centrifuge: Model-Free Layered Video Representations

    Authors: Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman

    Abstract: True video understanding requires making sense of non-lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain. Layered video representations have the potential of accurately modelling realistic scenes but have so far required stringent as…

    Submitted 4 April, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

    Comments: Appears in: 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019). This arXiv contains the CVPR Camera Ready version of the paper (although we have included larger figures) as well as an appendix detailing the model architecture

  31. arXiv:1808.01340  [pdf, other]

    cs.CV

    A Short Note about Kinetics-600

    Authors: Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman

    Abstract: We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips. In order to scale up the dataset we changed the data collection process so it uses multiple queries per class, with some of them in a language other than english -- portuguese. This paper details the changes between the two…

    Submitted 3 August, 2018; originally announced August 2018.

    Comments: Companion to public release of kinetics-600 test set labels

  32. arXiv:1807.10066  [pdf, other]

    cs.CV

    A Better Baseline for AVA

    Authors: Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

    Abstract: We introduce a simple baseline for action localization on the AVA dataset. The model builds upon the Faster R-CNN bounding box detection framework, adapted to operate on pure spatiotemporal features - in our case produced exclusively by an I3D model pretrained on Kinetics. This model obtains 21.9% average AP on the validation set of AVA v2.1, up from 14.5% for the best RGB spatiotemporal model use…

    Submitted 26 July, 2018; originally announced July 2018.

    Comments: ActivityNet Workshop (AVA Challenge), CVPR 2018

  33. arXiv:1806.03863  [pdf, other]

    cs.CV

    Massively Parallel Video Networks

    Authors: Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, Simon Osindero

    Abstract: We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. Th…

    Submitted 5 September, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

    Comments: Fixed typos in densenet model definition in appendix

  34. arXiv:1705.07750  [pdf, other]

    cs.CV cs.LG

    Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    Authors: Joao Carreira, Andrew Zisserman

    Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human…

    Submitted 12 February, 2018; v1 submitted 22 May, 2017; originally announced May 2017.

    Comments: Removed references to mini-kinetics dataset that was never made publicly available and repeated all experiments on the full Kinetics dataset

  35. arXiv:1705.06950  [pdf, other]

    cs.CV

    The Kinetics Human Action Video Dataset

    Authors: Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, Andrew Zisserman

    Abstract: We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such…

    Submitted 19 May, 2017; originally announced May 2017.

  36. arXiv:1511.07845  [pdf, other]

    cs.CV

    Shape and Symmetry Induction for 3D Objects

    Authors: Shubham Tulsiani, Abhishek Kar, Qixing Huang, João Carreira, Jitendra Malik

    Abstract: Actions as simple as grasping an object or navigating around it require a rich understanding of that object's 3D shape from a given viewpoint. In this paper we repurpose powerful learning machinery, originally developed for object classification, to discover image cues relevant for recovering the 3D shape of potentially unfamiliar objects. We cast the problem as one of local prediction of surface…

    Submitted 24 November, 2015; v1 submitted 24 November, 2015; originally announced November 2015.

  37. arXiv:1509.08147  [pdf, other]

    cs.CV

    Amodal Completion and Size Constancy in Natural Scenes

    Authors: Abhishek Kar, Shubham Tulsiani, João Carreira, Jitendra Malik

    Abstract: We consider the problem of enriching current object detection systems with veridical object sizes and relative depth estimates from a single image. There are several technical challenges to this, such as occlusions, lack of calibration data and the scale ambiguity between object size and distance. These have not been addressed in full generality in previous work. Here we propose to tackle these is…

    Submitted 1 October, 2015; v1 submitted 27 September, 2015; originally announced September 2015.

    Comments: Accepted to ICCV 2015

  38. arXiv:1507.06550  [pdf, other]

    cs.CV cs.LG cs.NE

    Human Pose Estimation with Iterative Error Feedback

    Authors: Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, Jitendra Malik

    Abstract: Hierarchical feature extractors such as Convolutional Networks (ConvNets) have achieved impressive performance on a variety of classification tasks using purely feedforward processing. Feedforward architectures can learn rich representations of the input space but do not explicitly model dependencies in the output spaces, that are quite structured for tasks such as articulated human pose estimatio…

    Submitted 12 June, 2016; v1 submitted 23 July, 2015; originally announced July 2015.

  39. arXiv:1505.01596  [pdf, other]

    cs.CV cs.NE cs.RO

    Learning to See by Moving

    Authors: Pulkit Agrawal, Joao Carreira, Jitendra Malik

    Abstract: The dominant paradigm for feature learning in computer vision relies on training neural networks for the task of object recognition using millions of hand labelled images. Is it possible to learn useful features for a diverse set of visual tasks using any other form of supervision? In biology, living organisms developed the ability of visual perception for the purpose of moving and acting in the w…

    Submitted 14 September, 2015; v1 submitted 7 May, 2015; originally announced May 2015.

    Comments: 12 pages

  40. arXiv:1505.00066  [pdf, other]

    cs.CV

    Pose Induction for Novel Object Categories

    Authors: Shubham Tulsiani, João Carreira, Jitendra Malik

    Abstract: We address the task of predicting pose for objects of unannotated object categories from a small seed set of annotated object classes. We present a generalized classifier that can reliably induce pose given a single instance of a novel category. In case of availability of a large collection of novel instances, our approach then jointly reasons over all instances to improve the initial estimates. W…

    Submitted 28 September, 2015; v1 submitted 30 April, 2015; originally announced May 2015.

  41. arXiv:1503.06465  [pdf, other]

    cs.CV

    Lifting Object Detection Datasets into 3D

    Authors: Joao Carreira, Sara Vicente, Lourdes Agapito, Jorge Batista

    Abstract: While data has certainly taken the center stage in computer vision in recent years, it can still be difficult to obtain in certain scenarios. In particular, acquiring ground truth 3D shapes of objects pictured in 2D images remains a challenging feat and this has hampered progress in recognition-based object reconstruction from a single image. Here we propose to bypass previous solutions such as 3D…

    Submitted 31 July, 2016; v1 submitted 22 March, 2015; originally announced March 2015.

  42. arXiv:1411.6091  [pdf, other]

    cs.CV

    Virtual View Networks for Object Reconstruction

    Authors: João Carreira, Abhishek Kar, Shubham Tulsiani, Jitendra Malik

    Abstract: All that structure from motion algorithms "see" are sets of 2D points. We show that these impoverished views of the world can be faked for the purpose of reconstructing objects in challenging settings, such as from a single image, or from a few ones far apart, by recognizing the object and getting help from a collection of images of other objects from the same class. We synthesize virtual views by…

    Submitted 22 November, 2014; originally announced November 2014.

  43. arXiv:1411.6069  [pdf, other]

    cs.CV

    Category-Specific Object Reconstruction from a Single Image

    Authors: Abhishek Kar, Shubham Tulsiani, João Carreira, Jitendra Malik

    Abstract: Object reconstruction from a single image -- in the wild -- is a problem where we can make progress and get meaningful results today. This is the main message of this paper, which introduces an automated pipeline with pixels as inputs and 3D surfaces of various rigid categories as outputs in images of realistic scenes. At the core of our approach are deformable 3D models that can be learned from 2…

    Submitted 6 May, 2015; v1 submitted 21 November, 2014; originally announced November 2014.

    Comments: First two authors contributed equally. To appear at CVPR 2015

  44. arXiv:1009.4823  [pdf, ps, other]

    cs.CV

    Image Segmentation by Discounted Cumulative Ranking on Maximal Cliques

    Authors: Joao Carreira, Adrian Ion, Cristian Sminchisescu

    Abstract: We propose a mid-level image segmentation framework that combines multiple figure-ground hypothesis (FG) constrained at different locations and scales, into interpretations that tile the entire image. The problem is cast as optimization over sets of maximal cliques sampled from the graph connecting non-overlapping, putative figure-ground segment hypotheses. Potential functions over cliques combine…

    Submitted 24 September, 2010; originally announced September 2010.

    Comments: 11 pages, 5 figures

    Report number: TR-06-2010