Showing 101–150 of 225 results for author: Zisserman, A

  1. arXiv:2011.11630  [pdf, other]

    cs.CV

    Betrayed by Motion: Camouflaged Object Discovery via Motion Segmentation

    Authors: Hala Lamdouar, Charig Yang, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is to design a computational architecture that discovers camouflaged objects in videos, specifically by exploiting motion information to perform object segmentation. We make the following three contributions: (i) We propose a novel architecture that consists of two essential components for breaking camouflage, namely, a differentiable registration module to align consec…

    Submitted 23 November, 2020; originally announced November 2020.

    Comments: ACCV 2020

  2. arXiv:2011.11071  [pdf, other]

    cs.CV

    QuerYD: A video dataset with high-quality text and audio narrations

    Authors: Andreea-Maria Oncescu, João F. Henriques, Yang Liu, Andrew Zisserman, Samuel Albanie

    Abstract: We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content. The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing…

    Submitted 17 February, 2021; v1 submitted 22 November, 2020; originally announced November 2020.

    Comments: 5 pages, 4 figures, accepted at ICASSP 2021

  3. arXiv:2010.15716  [pdf, other]

    cs.SD eess.AS

    Playing a Part: Speaker Verification at the Movies

    Authors: Andrew Brown, Jaesung Huh, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

    Abstract: The goal of this work is to investigate the performance of popular speaker recognition models on speech segments from movies, where often actors intentionally disguise their voice to play a character. We make the following three contributions: (i) We collect a novel, challenging speaker recognition dataset called VoxMovies, with speech for 856 identities from almost 4000 movie clips. VoxMovies con…

    Submitted 11 February, 2021; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: The first three authors contributed equally to this work

  4. arXiv:2010.10864  [pdf, other]

    cs.CV cs.LG

    A Short Note on the Kinetics-700-2020 Human Action Dataset

    Authors: Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, Andrew Zisserman

    Abstract: We describe the 2020 edition of the DeepMind Kinetics human action dataset, which replenishes and extends the Kinetics-700 dataset. In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes. This paper details the changes introduced for this new release of the dataset and includes a comprehensive set of statistics as well as baseline results…

    Submitted 21 October, 2020; originally announced October 2020.

  5. arXiv:2010.09709  [pdf, other]

    cs.CV

    Self-supervised Co-training for Video Representation Learning

    Authors: Tengda Han, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-sup…

    Submitted 11 January, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

    Comments: NeurIPS2020

  6. arXiv:2010.04002  [pdf, other]

    cs.CV

    Watch, read and lookup: learning to spot signs from multiple supervisors

    Authors: Liliane Momeni, Gül Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman

    Abstract: The focus of this work is sign spotting - given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles (readily available trans…

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020) - Oral presentation. 29 pages

  7. arXiv:2009.07833  [pdf, other]

    cs.CV cs.GR

    Layered Neural Rendering for Retiming People in Video

    Authors: Erika Lu, Forrester Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, David Salesin, William T. Freeman, Michael Rubinstein

    Abstract: We present a method for retiming people in an ordinary, natural video -- manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationall…

    Submitted 30 September, 2021; v1 submitted 16 September, 2020; originally announced September 2020.

    Comments: In SIGGRAPH Asia 2020. Project webpage: https://retiming.github.io/. Added references

  8. arXiv:2009.06610  [pdf, other]

    cs.CV

    Adaptive Text Recognition through Visual Matching

    Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman

    Abstract: In this work, our objective is to address the problems of generalization and flexibility for text recognition in documents. We introduce a new model that exploits the repetitive nature of characters in languages, and decouples the visual representation learning and linguistic modelling stages. By doing this, we turn text recognition into a shape matching problem, and thereby achieve generalization…

    Submitted 14 September, 2020; originally announced September 2020.

    Comments: ECCV2020

  9. arXiv:2009.01225  [pdf, other]

    cs.CV eess.AS

    Seeing wake words: Audio-visual Keyword Spotting

    Authors: Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

    Abstract: The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in the wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching, and (ii) patt…

    Submitted 2 September, 2020; originally announced September 2020.

  10. arXiv:2009.00603  [pdf, other]

    cs.CV cs.LG

    Inducing Predictive Uncertainty Estimation for Face Recognition

    Authors: Weidi Xie, Jeffrey Byrne, Andrew Zisserman

    Abstract: Knowing when an output can be trusted is critical for reliably using face recognition systems. While there has been enormous effort in recent research on improving face verification performance, understanding when a model's predictions should or should not be trusted has received far less attention. Our goal is to assign a confidence score for a face image that reflects its quality in terms of rec…

    Submitted 1 September, 2020; originally announced September 2020.

    Comments: To Appear at the British Machine Vision Conference (BMVC), 2020

  11. arXiv:2008.04237  [pdf, other]

    cs.CV cs.SD eess.AS

    Self-Supervised Learning of Audio-Visual Objects from Video

    Authors: Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

    Abstract: Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented…

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: ECCV 2020

  12. arXiv:2008.01065  [pdf, other]

    cs.CV

    Memory-augmented Dense Predictive Coding for Video Representation Learning

    Authors: Tengda Han, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is self-supervised learning from video, in particular for representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework Memory-augmented Dense Predictive Coding (MemDPC) for the task. It is trained with a predictive attention mechanism over the set of compressed memories, such that any future states…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: ECCV2020, Spotlight

  13. arXiv:2008.01018  [pdf, other]

    cs.CV

    RareAct: A video dataset of unusual interactions

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: This paper introduces a manually annotated video dataset of unusual actions, namely RareAct, including actions such as "blend phone", "cut keyboard" and "microwave shoes". RareAct aims at evaluating the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions which were obtained by comb…

    Submitted 3 August, 2020; originally announced August 2020.

  14. arXiv:2008.00744  [pdf, other]

    cs.CV

    The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

    Authors: Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shizhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao

    Abstract: We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval -- the task of searching for content within a corpus of videos using natural language queries. This report summarizes the re…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Individual reports, dataset information, rules, and released source code can be found at the competition webpage (https://www.robots.ox.ac.uk/~vgg/challenges/video-pentathlon)

  15. arXiv:2007.12163  [pdf, other]

    cs.CV

    Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval

    Authors: Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman

    Abstract: Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods. To this end, we introduce an objective that optimises instead a smoothed approximation of AP, coined Smooth-AP. Smooth-AP is a plug-and-play objective function that allows for end-to-end t…

    Submitted 8 September, 2020; v1 submitted 23 July, 2020; originally announced July 2020.

    Comments: Accepted at ECCV 2020

  16. arXiv:2007.12131  [pdf, other]

    cs.CV

    BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

    Authors: Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman

    Abstract: Recent progress in fine-grained gesture and action classification, and machine translation, point to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce…

    Submitted 13 October, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

    Comments: Appears in: European Conference on Computer Vision 2020 (ECCV 2020). 28 pages

  17. arXiv:2007.11498  [pdf, other]

    cs.CV

    CrossTransformers: spatially-aware few-shot transfer

    Authors: Carl Doersch, Ankush Gupta, Andrew Zisserman

    Abstract: Given new tasks with very little data -- such as new classes in a classification problem or a domain shift in the input -- performance of modern vision systems degrades remarkably quickly. In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing t…

    Submitted 17 February, 2021; v1 submitted 22 July, 2020; originally announced July 2020.

    Comments: Published at NeurIPS 2020. Code/checkpoints: https://github.com/google-research/meta-dataset

  18. arXiv:2007.08480  [pdf, other]

    cs.CV

    Co-Attention for Conditioned Image Matching

    Authors: Olivia Wiles, Sebastien Ehrhardt, Andrew Zisserman

    Abstract: We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material. While other approaches find correspondences between pairs of images by treating the images independently, we instead condition on both images to implicitly take account of the differences between them. To achieve this, we introduce (i) a spat…

    Submitted 26 March, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

    Comments: Accepted at CVPR 2021. Project page: https://www.robots.ox.ac.uk/~ow/coam.html. Formerly D2D: Learning to find good correspondences for image matching and manipulation

  19. arXiv:2007.02606  [pdf, other]

    eess.IV cs.CV

    A Convolutional Approach to Vertebrae Detection and Labelling in Whole Spine MRI

    Authors: Rhydian Windsor, Amir Jamaludin, Timor Kadir, Andrew Zisserman

    Abstract: We propose a novel convolutional method for the detection and identification of vertebrae in whole spine MRIs. This involves using a learnt vector field to group detected vertebrae corners together into individual vertebral bodies and convolutional image-to-image translation followed by beam search to label vertebral levels in a self-consistent manner. The method can be applied without modificatio…

    Submitted 13 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

    Comments: Accepted full paper to Medical Image Computing and Computer Assisted Intervention 2020. 11 pages plus appendix

  20. arXiv:2007.01216  [pdf, other]

    cs.SD cs.CV eess.AS eess.IV

    Spot the conversation: speaker diarisation in the wild

    Authors: Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman

    Abstract: The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creat…

    Submitted 15 August, 2021; v1 submitted 2 July, 2020; originally announced July 2020.

    Comments: The dataset will be available for download from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The development set will be released in July 2020, and the test set will be released in October 2020

  21. arXiv:2006.16228  [pdf, other]

    cs.CV

    Self-Supervised MultiModal Versatile Networks

    Authors: Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

    Abstract: Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalit…

    Submitted 30 October, 2020; v1 submitted 29 June, 2020; originally announced June 2020.

    Comments: To appear in the Thirty-Fourth Annual Conference on Neural Information Processing Systems (NeurIPS 2020)

  22. arXiv:2006.15418  [pdf, other]

    cs.CV

    Counting Out Time: Class Agnostic Video Repetition Counting in the Wild

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: We present an approach for estimating the period with which an action is repeated in a video. The crux of the approach lies in constraining the period prediction module to use temporal self-similarity as an intermediate representation bottleneck that allows generalization to unseen repetitions in videos in the wild. We train this model, called Repnet, with a synthetic dataset that is generated fro…

    Submitted 27 June, 2020; originally announced June 2020.

    Comments: Accepted at CVPR 2020. Project webpage: https://sites.google.com/view/repnet

  23. arXiv:2006.10039  [pdf, other]

    cs.CV cs.LG

    LSD-C: Linearly Separable Deep Clusters

    Authors: Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman

    Abstract: We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our algorithm first establishes pairwise connections in the feature space between the samples of the minibatch based on a similarity metric. Then it regroups in clusters the connected samples and enforces a linear separation between clusters. This is achieved by using the pairwise connections as targets together with a…

    Submitted 17 June, 2020; originally announced June 2020.

    Comments: Code available at https://github.com/srebuffi/lsd-clusters

  24. arXiv:2005.04208  [pdf, other]

    cs.CV

    Condensed Movies: Story Based Retrieval with Contextual Embeddings

    Authors: Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman

    Abstract: Our objective in this work is long range understanding of the narrative structure of movies. Instead of considering the entire movie, we propose to learn from the 'key scenes' of the movie, providing a condensed look at the full storyline. To this end, we make the following three contributions: (i) We create the Condensed Movies Dataset (CMD) consisting of the key scenes from over 3K movies: each…

    Submitted 22 October, 2020; v1 submitted 8 May, 2020; originally announced May 2020.

    Comments: Appears in: Asian Conference on Computer Vision 2020 (ACCV 2020) - Oral presentation

  25. arXiv:2005.00214  [pdf, other]

    cs.CV cs.LG eess.IV

    The AVA-Kinetics Localized Human Actions Video Dataset

    Authors: Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman

    Abstract: This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe…

    Submitted 20 May, 2020; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: 8 pages, 8 figures

  26. arXiv:2004.14368  [pdf, other]

    cs.CV cs.SD eess.AS

    VGGSound: A Large-scale Audio-Visual Dataset

    Authors: Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman

    Abstract: Our goal is to collect a large-scale audio-visual dataset with low label noise from videos in the wild using computer vision techniques. The resulting dataset can be used for training and evaluating audio recognition models. We make three contributions. First, we propose a scalable pipeline based on computer vision techniques to create an audio dataset from open-source media. Our pipeline involves…

    Submitted 24 September, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: ICASSP2020

  27. arXiv:2004.05821  [pdf, other]

    cs.CV cs.LG

    Monocular Depth Estimation with Self-supervised Instance Adaptation

    Authors: Robert McCraith, Lukas Neumann, Andrew Zisserman, Andrea Vedaldi

    Abstract: Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision. However, in robotics applications, multiple views of a scene may or may not be available, depending on the actions of the robot, switching between monocular and multi-view reconstruction. To address th…

    Submitted 13 April, 2020; originally announced April 2020.

    Comments: IROS submission, 7 pages

  28. arXiv:2003.13594  [pdf, other]

    cs.CV

    Speech2Action: Cross-modal Supervision for Action Recognition

    Authors: Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

    Abstract: Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to…

    Submitted 30 March, 2020; originally announced March 2020.

    Comments: Accepted to CVPR 2020

  29. arXiv:2003.11794  [pdf, other]

    cs.CV

    Compact Deep Aggregation for Set Retrieval

    Authors: Yujie Zhong, Relja Arandjelović, Andrew Zisserman

    Abstract: The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors. We focus on a specific example of this general problem -- that of retrieving images containing multiple faces from a large scale dataset of images. Here the set consists of the face descriptors in…

    Submitted 26 March, 2020; originally announced March 2020.

    Comments: 20 pages

  30. arXiv:2003.05078  [pdf, other]

    cs.CV cs.CL cs.LG

    Visual Grounding in Video for Unsupervised Word Translation

    Authors: Gunnar A. Sigurdsson, Jean-Baptiste Alayrac, Aida Nematzadeh, Lucas Smaira, Mateusz Malinowski, João Carreira, Phil Blunsom, Andrew Zisserman

    Abstract: There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instruc…

    Submitted 26 March, 2020; v1 submitted 10 March, 2020; originally announced March 2020.

    Comments: CVPR 2020

    Journal ref: CVPR 2020

  31. arXiv:2002.08742  [pdf, other]

    eess.AS cs.CV cs.SD

    Disentangled Speech Embeddings using Cross-modal Self-supervision

    Authors: Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman

    Abstract: The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart -- without annotation -- the representations of linguistic content and speaker identity. We co…

    Submitted 4 May, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

    Comments: ICASSP 2020. The first three authors contributed equally to this work

  32. arXiv:2002.05714  [pdf, other]

    cs.CV

    Automatically Discovering and Learning New Visual Categories with Ranking Statistics

    Authors: Kai Han, Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Andrea Vedaldi, Andrew Zisserman

    Abstract: We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. This setting is similar to semi-supervised learning, but significantly harder because there are no labelled examples for the new classes. The challenge, then, is to leverage the information contained in the labelled images in order to learn a general-purpose clustering model and use…

    Submitted 13 February, 2020; originally announced February 2020.

    Comments: ICLR 2020, code: http://www.robots.ox.ac.uk/~vgg/research/auto_novel

  33. arXiv:1912.06430  [pdf, other]

    cs.CV

    End-to-End Learning of Visual Representations from Uncurated Instructional Videos

    Authors: Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman

    Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narra…

    Submitted 23 August, 2020; v1 submitted 13 December, 2019; originally announced December 2019.

    Comments: CVPR'2020 Oral

  34. Synthetic Humans for Action Recognition from Unseen Viewpoints

    Authors: Gül Varol, Ivan Laptev, Cordelia Schmid, Andrew Zisserman

    Abstract: Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. Our goal in this work is to answer the question whether synthetic humans can improve the performance of human action recognition, with a particular focus on generalization to unseen viewpoints. We make use of the recent advance…

    Submitted 23 May, 2021; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: 21 pages

    Journal ref: International Journal of Computer Vision (2021)

  35. arXiv:1912.02522  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

    Authors: Joon Son Chung, Arsha Nagrani, Ernesto Coto, Weidi Xie, Mitchell McLaren, Douglas A Reynolds, Andrew Zisserman

    Abstract: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or 'in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Inte…

    Submitted 5 December, 2019; originally announced December 2019.

    Comments: ISCA Archive

  36. arXiv:1911.12747  [pdf, other]

    cs.CV cs.SD eess.AS

    ASR is all you need: cross-modal distillation for lip reading

    Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

    Abstract: The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a large-scale audio-only corpus. We use a cross-modal distillation method that combines Connectionist Temporal Classification (CTC) with a frame-wise cross-entropy l…

    Submitted 31 March, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: ICASSP 2020

  37. arXiv:1910.12699  [pdf, other]

    cs.CV

    Self-supervised learning of class embeddings from video

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully p…

    Submitted 28 October, 2019; originally announced October 2019.

    Comments: 4th International Workshop on Compact and Efficient Feature Representation and Learning in Computer Vision 2019

  38. arXiv:1910.11306  [pdf, other]

    cs.CV cs.NE eess.IV

    Controllable Attention for Structured Layered Video Decomposition

    Authors: Jean-Baptiste Alayrac, João Carreira, Relja Arandjelović, Andrew Zisserman

    Abstract: The objective of this paper is to be able to separate a video into its natural layers, and to control which of the separated layers to attend to. For example, to be able to separate reflections, transparency or object motion. We make the following three contributions: (i) we introduce a new structured neural network architecture that explicitly incorporates layers (as spatial masks) into its desig…

    Submitted 24 October, 2019; originally announced October 2019.

    Comments: In ICCV 2019

  39. arXiv:1909.08950  [pdf, other]

    cs.CV

    Count, Crop and Recognise: Fine-Grained Recognition in the Wild

    Authors: Max Bain, Arsha Nagrani, Daniel Schofield, Andrew Zisserman

    Abstract: The goal of this paper is to label all the animal individuals present in every frame of a video. Unlike previous methods that have principally concentrated on labelling face tracks, we aim to label individuals even when their faces are not visible. We make the following contributions: (i) we introduce a 'Count, Crop and Recognise' (CCR) multistage recognition process for frame level labelling. The…

    Submitted 9 October, 2019; v1 submitted 19 September, 2019; originally announced September 2019.

  40. arXiv:1909.04656  [pdf, other]

    cs.CV

    Video Representation Learning by Dense Predictive Coding

    Authors: Tengda Han, Weidi Xie, Andrew Zisserman

    Abstract: The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; Second, we…

    Submitted 26 September, 2019; v1 submitted 10 September, 2019; originally announced September 2019.

  41. arXiv:1909.03140  [pdf, other]

    cs.CV

    Geometry-Aware Video Object Detection for Static Cameras

    Authors: Dan Xu, Weidi Xie, Andrew Zisserman

    Abstract: In this paper we propose a geometry-aware model for video object detection. Specifically, we consider the setting that cameras can be well approximated as static, e.g. in video surveillance scenarios, and scene pseudo depth maps can therefore be inferred easily from the object scale on the image plane. We make the following contributions: First, we extend the recent anchor-free detector (CornerNet…

    Submitted 6 September, 2019; originally announced September 2019.

    Comments: Accepted at BMVC 2019 as ORAL

  42. arXiv:1908.09884  [pdf, other]

    cs.CV

    Learning to Discover Novel Visual Categories via Deep Transfer Clustering

    Authors: Kai Han, Andrea Vedaldi, Andrew Zisserman

    Abstract: We consider the problem of discovering novel object categories in an image collection. While these images are unlabelled, we also assume prior knowledge of related but different image classes. We use such prior knowledge to reduce the ambiguity of clustering, and improve the quality of the newly discovered classes. Our contributions are twofold. The first contribution is to extend Deep Embedded Cl…

    Submitted 26 August, 2019; originally announced August 2019.

    Comments: ICCV 2019

  43. arXiv:1908.08498  [pdf, other]

    cs.CV

    EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

    Authors: Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen

    Abstract: We focus on multi-modal fusion for egocentric action recognition, and propose a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets. We train the architecture with three modalities -- RGB, Flow and Audio -- and combine them with mid-level fusion alongside sparse temporal sampling of fused representations. In contrast with previ…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: Accepted for presentation at ICCV 2019

  44. arXiv:1908.05263  [pdf, other]

    cs.CV

    AutoCorrect: Deep Inductive Alignment of Noisy Geometric Annotations

    Authors: Honglie Chen, Weidi Xie, Andrea Vedaldi, Andrew Zisserman

    Abstract: We propose AutoCorrect, a method to automatically learn object-annotation alignments from a dataset with annotations affected by geometric noise. The method is based on a consistency loss that enables deep neural networks to be trained, given only noisy annotations as input, to correct the annotations. When some noise-free annotations are available, we show that the consistency loss reduces to a s…

    Submitted 14 August, 2019; originally announced August 2019.

    Comments: BMVC 2019 (Spotlight)

  45. arXiv:1907.13487  [pdf, other]

    cs.CV

    Use What You Have: Video Retrieval Using Representations From Collaborative Experts

    Authors: Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman

    Abstract: The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets `in the wild' vary a lot in terms of degree of specificity, with some queries describing specific details such as the names of famous identities, content from speech, or text available on the screen. Our goal is to condens…

    Submitted 13 February, 2020; v1 submitted 31 July, 2019; originally announced July 2019.

    Comments: This update contains a correction to previously reported results

  46. arXiv:1907.06987  [pdf, other]

    cs.CV

    A Short Note on the Kinetics-700 Human Action Dataset

    Authors: Joao Carreira, Eric Noland, Chloe Hillier, Andrew Zisserman

    Abstract: We describe an extension of the DeepMind Kinetics human action dataset from 600 classes to 700 classes, where for each class there are at least 600 video clips from different YouTube videos. This paper details the changes introduced for this new release of the dataset, and includes a comprehensive set of statistics as well as baseline results using the I3D neural network architecture.

    Submitted 17 October, 2022; v1 submitted 15 July, 2019; originally announced July 2019.

    Comments: added note about dangers of training on k700 and evaluating on k400/k600. arXiv admin note: text overlap with arXiv:1808.01340

  47. arXiv:1907.04975  [pdf, other]

    cs.CV cs.SD eess.AS

    My lips are concealed: Audio-visual speech enhancement through obstructions

    Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

    Abstract: Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's l…

    Submitted 10 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019

  48. arXiv:1907.02499  [pdf, other]

    cs.CV

    Sim2real transfer learning for 3D human pose estimation: motion to the rescue

    Authors: Carl Doersch, Andrew Zisserman

    Abstract: Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given re…

    Submitted 14 November, 2019; v1 submitted 4 July, 2019; originally announced July 2019.

    Comments: Accepted at NeurIPS 2019

  49. arXiv:1906.11883  [pdf, other]

    cs.CV cs.LG

    Unsupervised Learning of Object Keypoints for Perception and Control

    Authors: Tejas Kulkarni, Ankush Gupta, Catalin Ionescu, Sebastian Borgeaud, Malcolm Reynolds, Andrew Zisserman, Volodymyr Mnih

    Abstract: The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for d…

    Submitted 19 November, 2019; v1 submitted 19 June, 2019; originally announced June 2019.

    Comments: In NeurIPS 2019. Code https://github.com/deepmind/deepmind-research/tree/master/transporter

  50. arXiv:1906.05661  [pdf, other]

    cs.LG stat.ML

    Training Neural Networks for and by Interpolation

    Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

    Abstract: In modern supervised learning, many deep neural networks are able to interpolate the data: the empirical loss can be driven to near zero on all samples simultaneously. In this work, we explicitly exploit this interpolation property for the design of a new optimization algorithm for deep learning, which we term Adaptive Learning-rates for Interpolation with Gradients (ALI-G). ALI-G retains the two…

    Submitted 1 August, 2020; v1 submitted 13 June, 2019; originally announced June 2019.

    Comments: Published at ICML 2020