
Showing 151–200 of 225 results for author: Zisserman, A

  1. arXiv:1906.05261  [pdf, other]

    cs.CV

    LAEO-Net: revisiting people Looking At Each Other in videos

    Authors: Manuel J. Marin-Jimenez, Vicky Kalogeiton, Pablo Medina-Suarez, Andrew Zisserman

    Abstract: Capturing the `mutual gaze' of people is essential for understanding and interpreting the social interactions between them. To this end, this paper addresses the problem of detecting people Looking At Each Other (LAEO) in video sequences. For this purpose, we propose LAEO-Net, a new deep CNN for determining LAEO in videos. In contrast to previous works, LAEO-Net takes spatio-temporal tracks as inp…

    Submitted 12 June, 2019; originally announced June 2019.

    Comments: CVPR 2019

  2. arXiv:1905.13077  [pdf, other]

    cs.CV

    A Hierarchical Probabilistic U-Net for Modeling Multi-Scale Ambiguities

    Authors: Simon A. A. Kohl, Bernardino Romera-Paredes, Klaus H. Maier-Hein, Danilo Jimenez Rezende, S. M. Ali Eslami, Pushmeet Kohli, Andrew Zisserman, Olaf Ronneberger

    Abstract: Medical imaging only indirectly measures the molecular identity of the tissue within each voxel, which often produces only ambiguous image evidence for target measures of interest, like semantic segmentation. This diversity and the variations of plausible interpretations are often specific to given image regions and may thus manifest on various scales, spanning all the way from the pixel to the im…

    Submitted 30 May, 2019; originally announced May 2019.

    Comments: 25 pages, 15 figures

  3. arXiv:1905.11369  [pdf, other]

    cs.CV cs.AI cs.LG

    Object Discovery with a Copy-Pasting GAN

    Authors: Relja Arandjelović, Andrew Zisserman

    Abstract: We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever. A novel copy-pasting GAN framework is proposed, where the generator learns to discover an object in one image by compositing it into another image such that the discriminator cannot tell that the resulting image is fake. After ca…

    Submitted 27 May, 2019; originally announced May 2019.

  4. arXiv:1905.08845  [pdf, other]

    cs.CV stat.ML

    Semi-Supervised Learning with Scarce Annotations

    Authors: Sylvestre-Alvise Rebuffi, Sebastien Ehrhardt, Kai Han, Andrea Vedaldi, Andrew Zisserman

    Abstract: While semi-supervised learning (SSL) algorithms provide an efficient way to make use of both labelled and unlabelled data, they generally struggle when the number of annotated samples is very small. In this work, we consider the problem of SSL multi-class classification with very few labelled instances. We introduce two key ideas. The first is a simple but effective one: we leverage the power of t…

    Submitted 21 April, 2020; v1 submitted 21 May, 2019; originally announced May 2019.

    Comments: Workshop on Deep Vision, CVPR 2020

  5. arXiv:1905.04266  [pdf, other]

    cs.CV

    Exploiting temporal context for 3D human pose estimation in the wild

    Authors: Anurag Arnab, Carl Doersch, Andrew Zisserman

    Abstract: We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change an…

    Submitted 10 May, 2019; originally announced May 2019.

    Comments: CVPR 2019

  6. arXiv:1905.02231  [pdf, other]

    cs.CV

    A Geometric Approach to Obtain a Bird's Eye View from an Image

    Authors: Ammar Abbas, Andrew Zisserman

    Abstract: The objective of this paper is to rectify any monocular image by computing a homography matrix that transforms it to a bird's eye (overhead) view. We make the following contributions: (i) we show that the homography matrix can be parameterised with only four parameters that specify the horizon line and the vertical vanishing point, or only two if the field of view or focal length is known; (ii)…

    Submitted 27 September, 2020; v1 submitted 6 May, 2019; originally announced May 2019.

    Comments: ICCV Workshop "Geometry Meets Deep Learning" 2019

  7. The VIA Annotation Software for Images, Audio and Video

    Authors: Abhishek Dutta, Andrew Zisserman

    Abstract: In this paper, we introduce a simple and standalone manual annotation tool for images, audio and video: the VGG Image Annotator (VIA). This is a lightweight, standalone and offline software package that does not require any installation or setup and runs solely in a web browser. The VIA software allows human annotators to define and describe spatial regions in images or video frames, and temporal…

    Submitted 9 August, 2019; v1 submitted 24 April, 2019; originally announced April 2019.

    Comments: to appear in Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21-25, 2019, Nice, France. ACM, New York, NY, USA, 4 pages

  8. arXiv:1904.07846  [pdf, other]

    cs.CV cs.LG

    Temporal Cycle-Consistency Learning

    Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

    Abstract: We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be used to find correspondences across time in multiple videos. The resulting per-frame embeddings can be used to align videos by simply matching frames using the ne…

    Submitted 16 April, 2019; originally announced April 2019.

    Comments: Accepted at CVPR 2019. Project webpage: https://sites.google.com/view/temporal-cycle-consistency

  9. arXiv:1903.01292  [pdf, other]

    cs.AI cs.CV cs.RO

    The StreetLearn Environment and Dataset

    Authors: Piotr Mirowski, Andras Banki-Horvath, Keith Anderson, Denis Teplyashin, Karl Moritz Hermann, Mateusz Malinowski, Matthew Koichi Grimes, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell

    Abstract: Navigation is a rich and well-grounded problem domain that drives progress in many different areas of research: perception, planning, memory, exploration, and optimisation in particular. Historically these challenges have been separately considered and solutions built that rely on stationary datasets - for example, recorded trajectories through an environment. These datasets cannot be used for dec…

    Submitted 4 March, 2019; originally announced March 2019.

    Comments: 13 pages, 6 figures, 4 tables. arXiv admin note: text overlap with arXiv:1804.00168

  10. arXiv:1902.10107  [pdf, other]

    eess.AS cs.LG cs.MM cs.SD

    Utterance-level Aggregation For Speaker Recognition In The Wild

    Authors: Weidi Xie, Arsha Nagrani, Joon Son Chung, Andrew Zisserman

    Abstract: The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame level) network, and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a "thin-ResNet" trunk architecture, and a dict…

    Submitted 17 May, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

    Comments: To appear in: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019. (Oral Presentation)

  11. arXiv:1812.02707  [pdf, other]

    cs.CV

    Video Action Transformer Network

    Authors: Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

    Abstract: We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people…

    Submitted 17 May, 2019; v1 submitted 6 December, 2018; originally announced December 2018.

    Comments: CVPR 2019

  12. arXiv:1812.01461  [pdf, other]

    cs.CV

    The Visual Centrifuge: Model-Free Layered Video Representations

    Authors: Jean-Baptiste Alayrac, João Carreira, Andrew Zisserman

    Abstract: True video understanding requires making sense of non-lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain. Layered video representations have the potential of accurately modelling realistic scenes but have so far required stringent as…

    Submitted 4 April, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

    Comments: Appears in: 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019). This arXiv contains the CVPR Camera Ready version of the paper (although we have included larger figures) as well as an appendix detailing the model architecture

  13. arXiv:1811.07591  [pdf, other]

    cs.LG stat.ML

    Deep Frank-Wolfe For Neural Network Optimization

    Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

    Abstract: Learning a deep neural network requires solving a challenging optimization problem: it is a high-dimensional, non-convex and non-smooth minimization problem with a large number of terms. The current practice in neural network optimization is to rely on the stochastic gradient descent (SGD) algorithm or its adaptive variants. However, SGD requires a hand-designed schedule for the learning rate. In…

    Submitted 21 February, 2021; v1 submitted 19 November, 2018; originally announced November 2018.

    Comments: Published as a conference paper at ICLR 2019, last version fixing an inaccuracy (details in appendix A.5, Proposition 2)

    Journal ref: International Conference on Learning Representations 2019

  14. arXiv:1811.00472  [pdf, other]

    cs.CV cs.LG

    Class-Agnostic Counting

    Authors: Erika Lu, Weidi Xie, Andrew Zisserman

    Abstract: Nearly all existing counting methods are designed for a specific object class. Our work, however, aims to create a counting model able to count any class of object. To achieve this goal, we formulate counting as a matching problem, enabling us to exploit the image self-similarity property that naturally exists in object counting problems. We make the following three contributions: first, a Generic…

    Submitted 1 November, 2018; originally announced November 2018.

    Comments: Asian Conference on Computer Vision (ACCV), 2018

  15. arXiv:1810.09951  [pdf, other]

    cs.CV

    GhostVLAD for set-based face recognition

    Authors: Yujie Zhong, Relja Arandjelović, Andrew Zisserman

    Abstract: The objective of this paper is to learn a compact representation of image sets for template-based face recognition. We make the following contributions: first, we propose a network architecture which aggregates and embeds the face descriptors produced by deep convolutional neural networks into a compact fixed-length representation. This compact representation requires minimal memory storage and en…

    Submitted 23 October, 2018; originally announced October 2018.

    Comments: Accepted by ACCV 2018

  16. arXiv:1809.08675  [pdf, other]

    cs.CV

    Learning to Read by Spelling: Towards Unsupervised Text Recognition

    Authors: Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

    Abstract: This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images, with lexically valid strings sampled from target corpora. This enables fully automated, and unsupervised learning from just line-level text-images, and unpaired text-str…

    Submitted 9 December, 2018; v1 submitted 23 September, 2018; originally announced September 2018.

  17. arXiv:1809.06200  [pdf, other]

    cs.CV

    From Same Photo: Cheating on Visual Kinship Challenges

    Authors: Mitchell Dawson, Andrew Zisserman, Christoffer Nellåker

    Abstract: With the propensity for deep learning models to learn unintended signals from data sets there is always the possibility that the network can `cheat' in order to solve a task. In the instance of data sets for visual kinship verification, one such unintended signal could be that the faces are cropped from the same photograph, since faces from the same photograph are more likely to be from the same f…

    Submitted 5 November, 2018; v1 submitted 17 September, 2018; originally announced September 2018.

    Comments: Accepted to ACCV 2018

  18. arXiv:1809.02169  [pdf, other]

    cs.CV

    Turning a Blind Eye: Explicit Removal of Biases and Variation from Deep Neural Network Embeddings

    Authors: Mohsan Alvi, Andrew Zisserman, Christoffer Nellaker

    Abstract: Neural networks achieve the state-of-the-art in image classification tasks. However, they can encode spurious variations or biases that may be present in the training data. For example, training an age predictor on a dataset that is not balanced for gender can lead to gender biased predictions (e.g. wrongly predicting that males are older if only elderly males are in the training set). We present…

    Submitted 27 September, 2018; v1 submitted 6 September, 2018; originally announced September 2018.

    Comments: Will appear in Workshop on Bias Estimation in Face Analytics, ECCV 2018

  19. Deep Audio-Visual Speech Recognition

    Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

    Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) we compare two models for lip reading, on…

    Submitted 22 December, 2018; v1 submitted 6 September, 2018; originally announced September 2018.

    Comments: Accepted for publication by IEEE Transactions on Pattern Analysis and Machine Intelligence

  20. arXiv:1809.02002  [pdf, other]

    cs.CV

    3D Surface Reconstruction by Pointillism

    Authors: Olivia Wiles, Andrew Zisserman

    Abstract: The objective of this work is to infer the 3D shape of an object from a single image. We use sculptures as our training and test bed, as these have great variety in shape and appearance. To achieve this we build on the success of multiple view geometry (MVG) which is able to accurately provide correspondences between images of 3D objects under varying viewpoint and illumination conditions, and m…

    Submitted 4 October, 2018; v1 submitted 6 September, 2018; originally announced September 2018.

    Comments: ECCV workshop on Geometry meets Deep Learning

  21. arXiv:1809.00496  [pdf, ps, other]

    cs.CV

    LRS3-TED: a large-scale dataset for visual speech recognition

    Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

    Abstract: This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research.

    Submitted 28 October, 2018; v1 submitted 3 September, 2018; originally announced September 2018.

    Comments: The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/lip_reading/

  22. arXiv:1808.06882  [pdf, other]

    cs.CV

    Self-supervised learning of a facial attribute embedding from video

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we…

    Submitted 21 August, 2018; originally announced August 2018.

    Comments: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.html

  23. arXiv:1808.05561  [pdf, other]

    cs.CV

    Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

    Authors: Samuel Albanie, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

    Abstract: Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with…

    Submitted 16 August, 2018; originally announced August 2018.

    Comments: Conference paper at ACM Multimedia 2018

  24. arXiv:1808.01340  [pdf, other]

    cs.CV

    A Short Note about Kinetics-600

    Authors: Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman

    Abstract: We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips. In order to scale up the dataset we changed the data collection process so it uses multiple queries per class, with some of them in a language other than English -- Portuguese. This paper details the changes between the two…

    Submitted 3 August, 2018; originally announced August 2018.

    Comments: Companion to public release of kinetics-600 test set labels

  25. arXiv:1807.11440  [pdf, other]

    cs.CV cs.LG

    Comparator Networks

    Authors: Weidi Xie, Li Shen, Andrew Zisserman

    Abstract: The objective of this work is set-based verification, e.g. to decide if two sets of images of a face are of the same person or not. The traditional approach to this problem is to learn to generate a feature vector per image, aggregate them into one vector to represent the set, and then compute the cosine similarity between sets. Instead, we design a neural network architecture that can directly le…

    Submitted 30 July, 2018; originally announced July 2018.

    Comments: To appear in ECCV 2018

  26. arXiv:1807.10550  [pdf, other]

    cs.CV

    X2Face: A network for controlling face generation by using images, audio, and pose codes

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing. We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another…

    Submitted 27 July, 2018; originally announced July 2018.

    Comments: To appear in ECCV 2018. Accompanying video: http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html

  27. arXiv:1807.10066  [pdf, other]

    cs.CV

    A Better Baseline for AVA

    Authors: Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

    Abstract: We introduce a simple baseline for action localization on the AVA dataset. The model builds upon the Faster R-CNN bounding box detection framework, adapted to operate on pure spatiotemporal features - in our case produced exclusively by an I3D model pretrained on Kinetics. This model obtains 21.9% average AP on the validation set of AVA v2.1, up from 14.5% for the best RGB spatiotemporal model use…

    Submitted 26 July, 2018; originally announced July 2018.

    Comments: ActivityNet Workshop (AVA Challenge), CVPR 2018

  28. arXiv:1807.09192  [pdf, other]

    cs.CV

    Multicolumn Networks for Face Recognition

    Authors: Weidi Xie, Andrew Zisserman

    Abstract: The objective of this work is set-based face recognition, i.e. to decide if two sets of images of a face are of the same person or not. Conventionally, the set-wise feature descriptor is computed as an average of the descriptors from individual face images within the set. In this paper, we design a neural network architecture that learns to aggregate based on both "visual" quality (resolution, ill…

    Submitted 24 July, 2018; originally announced July 2018.

    Comments: To appear in BMVC2018

  29. arXiv:1807.08179  [pdf, other]

    cs.CV

    Inductive Visual Localisation: Factorised Training for Superior Generalisation

    Authors: Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

    Abstract: End-to-end trained Recurrent Neural Networks (RNNs) have been successfully applied to numerous problems that require processing sequences, such as image captioning, machine translation, and text recognition. However, RNNs often struggle to generalise to sequences longer than the ones encountered during training. In this work, we propose to optimise neural networks explicitly for induction. The ide…

    Submitted 21 July, 2018; originally announced July 2018.

    Comments: In BMVC 2018 (spotlight)

  30. arXiv:1806.06053  [pdf, other]

    cs.CV

    Deep Lip Reading: a comparison of models and an online application

    Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

    Abstract: The goal of this paper is to develop state-of-the-art models for lip reading -- visual speech recognition. We develop three architectures and compare their accuracy and training times: (i) a recurrent model using LSTMs; (ii) a fully convolutional model; and (iii) the recently proposed transformer model. The recurrent and fully convolutional models are trained with a Connectionist Temporal Classifi…

    Submitted 15 June, 2018; originally announced June 2018.

    Comments: To appear in Interspeech 2018

  31. VoxCeleb2: Deep Speaker Recognition

    Authors: Joon Son Chung, Arsha Nagrani, Andrew Zisserman

    Abstract: The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which contains over a million utterances from over 6,000 speakers. This is several times larger than any p…

    Submitted 26 June, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

    Comments: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 . 1806.05622v2: minor fixes; 5 pages

  32. arXiv:1806.03863  [pdf, other]

    cs.CV

    Massively Parallel Video Networks

    Authors: Joao Carreira, Viorica Patraucean, Laurent Mazare, Andrew Zisserman, Simon Osindero

    Abstract: We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. Th…

    Submitted 5 September, 2018; v1 submitted 11 June, 2018; originally announced June 2018.

    Comments: Fixed typos in densenet model definition in appendix

  33. arXiv:1805.00833  [pdf, other]

    cs.CV

    Learnable PINs: Cross-Modal Embeddings for Person Identity

    Authors: Arsha Nagrani, Samuel Albanie, Andrew Zisserman

    Abstract: We propose and investigate an identity sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a…

    Submitted 26 July, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

    Comments: To appear in ECCV 2018

  34. arXiv:1804.04121  [pdf, other]

    cs.CV cs.SD

    The Conversation: Deep Audio-Visual Speech Enhancement

    Authors: Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

    Abstract: Our goal is to isolate individual speakers from multi-talker simultaneous speech in videos. Existing works in this area have focussed on trying to separate utterances from known speakers in controlled environments. In this paper, we propose a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the…

    Submitted 19 June, 2018; v1 submitted 11 April, 2018; originally announced April 2018.

    Comments: To appear in Interspeech 2018. We provide supplementary material with interactive demonstrations on http://www.robots.ox.ac.uk/~vgg/demo/theconversation

  35. arXiv:1804.00326  [pdf, other]

    cs.CV

    Seeing Voices and Hearing Faces: Cross-modal biometric matching

    Authors: Arsha Nagrani, Samuel Albanie, Andrew Zisserman

    Abstract: We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task "in the wild", employing the datasets that are now publicly available fo…

    Submitted 3 April, 2018; v1 submitted 1 April, 2018; originally announced April 2018.

    Comments: To appear in: IEEE Computer Vision and Pattern Recognition (CVPR), 2018

  36. arXiv:1804.00168  [pdf, other]

    cs.AI

    Learning to Navigate in Cities Without a Map

    Authors: Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, Raia Hadsell

    Abstract: Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support cont…

    Submitted 9 January, 2019; v1 submitted 31 March, 2018; originally announced April 2018.

    Comments: 17 pages, 16 figures, published at NeurIPS 2018

    Journal ref: Neural Information Processing Systems 2018

  37. arXiv:1803.03835  [pdf, other]

    cs.LG

    Kickstarting Deep Reinforcement Learning

    Authors: Simon Schmitt, Jonathan J. Hudson, Augustin Zidek, Simon Osindero, Carl Doersch, Wojciech M. Czarnecki, Joel Z. Leibo, Heinrich Kuttler, Andrew Zisserman, Karen Simonyan, S. M. Ali Eslami

    Abstract: We present a method for using previously-trained 'teacher' agents to kickstart the training of a new 'student' agent. To this end, we leverage ideas from policy distillation and population based training. Our method places no constraints on the architecture of the teacher or student agents, and it regulates itself to allow the students to surpass their teachers in performance. We show that, on a c…

    Submitted 10 March, 2018; originally announced March 2018.

  38. arXiv:1802.07595  [pdf, other]

    cs.LG

    Smooth Loss Functions for Deep Top-k Classification

    Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

    Abstract: The top-k error is a common measure of performance in machine learning and computer vision. In practice, top-k classification is typically performed with deep neural networks trained with the cross-entropy loss. Theoretical results indeed suggest that cross-entropy is an optimal learning objective for such a task in the limit of infinite data. In the context of limited and noisy data however, the…

    Submitted 21 February, 2018; originally announced February 2018.

    Comments: ICLR 2018

  39. arXiv:1801.10442  [pdf, other]

    cs.CV

    From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script

    Authors: Arsha Nagrani, Andrew Zisserman

    Abstract: The goal of this paper is the automatic identification of characters in TV and feature film material. In contrast to standard approaches to this task, which rely on the weak supervision afforded by transcripts and subtitles, we propose a new method requiring only a cast list. This list is used to obtain images of actors from freely available sources on the web, providing a form of partial supervis…

    Submitted 31 January, 2018; originally announced January 2018.

  40. arXiv:1801.01415  [pdf, other]

    cs.CV

    What have we learned from deep representations for action recognition?

    Authors: Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, Andrew Zisserman

    Abstract: As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing what two-stream models have learned in order to recognize actions in video. We show that local detectors for appearance and…

    Submitted 4 January, 2018; originally announced January 2018.

    Comments: This document is best viewed in Adobe Reader where figures play on click. Supplementary material can be downloaded at http://feichtenhofer.github.io/action_vis.pdf

  41. arXiv:1712.06651  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Objects that Sound

    Authors: Relja Arandjelović, Andrew Zisserman

    Abstract: In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is…

    Submitted 25 July, 2018; v1 submitted 18 December, 2017; originally announced December 2017.

    Comments: Appears in: European Conference on Computer Vision (ECCV) 2018

  42. arXiv:1711.07888  [pdf, other]

    cs.CV

    SilNet : Single- and Multi-View Reconstruction by Learning from Silhouettes

    Authors: Olivia Wiles, Andrew Zisserman

    Abstract: The objective of this paper is 3D shape understanding from single and multiple images. To this end, we introduce a new deep-learning architecture and loss function, SilNet, that can handle multiple views in an order-agnostic manner. The architecture is fully convolutional, and for training we use a proxy task of silhouette prediction, rather than directly learning a mapping from 2D images to 3D sh…

    Submitted 21 November, 2017; originally announced November 2017.

    Comments: BMVC 2017; Best Poster

  43. arXiv:1710.08092  [pdf, other]

    cs.CV

    VGGFace2: A dataset for recognising faces across pose and age

    Authors: Qiong Cao, Li Shen, Weidi Xie, Omkar M. Parkhi, Andrew Zisserman

    Abstract: In this paper, we introduce a new large-scale face dataset named VGGFace2. The dataset contains 3.31 million images of 9131 subjects, with an average of 362.6 images for each subject. Images are downloaded from Google Image Search and have large variations in pose, age, illumination, ethnicity and profession (e.g. actors, athletes, politicians). The dataset was collected with three goals in mind:…

    Submitted 13 May, 2018; v1 submitted 23 October, 2017; originally announced October 2017.

    Comments: This paper has been accepted by IEEE Conference on Automatic Face and Gesture Recognition (F&G), 2018. (Oral)

  44. arXiv:1710.03958  [pdf, other]

    cs.CV

    Detect to Track and Track to Detect

    Authors: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

    Abstract: Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous det…

    Submitted 7 March, 2018; v1 submitted 11 October, 2017; originally announced October 2017.

    Comments: ICCV 2017. Code and models: https://github.com/feichtenhofer/Detect-Track Results: https://www.robots.ox.ac.uk/~vgg/research/detect-track/

  45. arXiv:1708.07860  [pdf, other]

    cs.CV

    Multi-task Self-Supervised Visual Learning

    Authors: Carl Doersch, Andrew Zisserman

    Abstract: We investigate methods for combining multiple self-supervised tasks--i.e., supervised tasks where data can be collected without manual labeling--in order to train a single visual representation. First, we provide an apples-to-apples comparison of four different self-supervised tasks using the very deep ResNet-101 architecture. We then combine tasks to jointly train a network. We also explore lasso…

    Submitted 25 August, 2017; originally announced August 2017.

    Comments: Published at ICCV 2017

  46. arXiv:1708.00367  [pdf, other]

    cs.CV

    Self-Supervised Learning for Spinal MRIs

    Authors: Amir Jamaludin, Timor Kadir, Andrew Zisserman

    Abstract: A significant proportion of patients scanned in a clinical setting have follow-up scans. We show in this work that such longitudinal scans alone can be used as a form of 'free' self-supervision for training a deep network. We demonstrate this self-supervised learning for the case of T2-weighted sagittal lumbar Magnetic Resonance Images (MRIs). A Siamese convolutional neural network (CNN) is traine…

    Submitted 1 August, 2017; originally announced August 2017.

    Comments: 3rd Workshop on Deep Learning in Medical Image Analysis

  47. arXiv:1707.00665  [pdf, other]

    cs.CV

    Temporal HeartNet: Towards Human-Level Automatic Analysis of Fetal Cardiac Screening Video

    Authors: Weilin Huang, Christopher P. Bridge, J. Alison Noble, Andrew Zisserman

    Abstract: We present an automatic method to describe clinically useful information about scanning, and to guide image interpretation in ultrasound (US) videos of the fetal heart. Our method is able to jointly predict the visibility, viewing plane, location and orientation of the fetal heart at the frame level. The contributions of the paper are three-fold: (i) a convolutional neural network architecture is…

    Submitted 3 July, 2017; originally announced July 2017.

    Comments: To appear in MICCAI, 2017

  48. VoxCeleb: a large-scale speaker identification dataset

    Authors: Arsha Nagrani, Joon Son Chung, Andrew Zisserman

    Abstract: Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected 'in the wild'. We make two contributions. First, we propose a fully automated pipeline based on computer vision techniques t…

    Submitted 30 May, 2018; v1 submitted 26 June, 2017; originally announced June 2017.

    Comments: The dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb . 1706.08612v2: minor fixes; 6 pages

  49. arXiv:1705.08168  [pdf, other]

    cs.CV cs.LG

    Look, Listen and Learn

    Authors: Relja Arandjelović, Andrew Zisserman

    Abstract: We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks f…

    Submitted 1 August, 2017; v1 submitted 23 May, 2017; originally announced May 2017.

    Comments: Appears in: IEEE International Conference on Computer Vision (ICCV) 2017

  50. arXiv:1705.07750  [pdf, other]

    cs.CV cs.LG

    Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    Authors: Joao Carreira, Andrew Zisserman

    Abstract: The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human…

    Submitted 12 February, 2018; v1 submitted 22 May, 2017; originally announced May 2017.

    Comments: Removed references to mini-kinetics dataset that was never made publicly available and repeated all experiments on the full Kinetics dataset