
Showing 201–225 of 225 results for author: Zisserman, A

  1. arXiv:1705.06950  [pdf, other]

    cs.CV

    The Kinetics Human Action Video Dataset

    Authors: Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, Andrew Zisserman

    Abstract: We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such…

    Submitted 19 May, 2017; originally announced May 2017.

  2. arXiv:1705.02966  [pdf, other]

    cs.CV

    You said that?

    Authors: Joon Son Chung, Amir Jamaludin, Andrew Zisserman

    Abstract: We present a method for generating a video of a talking face. The method takes as inputs: (i) still images of the target face, and (ii) an audio speech segment; and outputs a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we propose an encoder-decoder CNN model that uses a joint emb…

    Submitted 18 July, 2017; v1 submitted 8 May, 2017; originally announced May 2017.

    Comments: https://youtu.be/LeufDSb15Kc British Machine Vision Conference (BMVC), 2017

  3. arXiv:1612.06836  [pdf, other]

    cs.CV

    From Images to 3D Shape Attributes

    Authors: David F. Fouhey, Abhinav Gupta, Andrew Zisserman

    Abstract: Our goal in this paper is to investigate properties of 3D shape that can be determined from a single image. We define 3D shape attributes -- generic properties of the shape that capture curvature, contact and occupied space. Our first objective is to infer these 3D shape attributes from a single image. A second objective is to infer a 3D shape embedding -- a low dimensional vector representing the…

    Submitted 3 December, 2017; v1 submitted 20 December, 2016; originally announced December 2016.

    Comments: Updated based on TPAMI reviews: title changed, sections reordered, moderate modifications throughout text

  4. arXiv:1611.08194  [pdf, other]

    cs.CV

    Interferences in match kernels

    Authors: Naila Murray, Hervé Jégou, Florent Perronnin, Andrew Zisserman

    Abstract: We consider the design of an image representation that embeds and aggregates a set of local descriptors into a single vector. Popular representations of this kind include the bag-of-visual-words, the Fisher vector and the VLAD. When two such image representations are compared with the dot-product, the image-to-image similarity can be interpreted as a match kernel. In match kernels, one has to deal…

    Submitted 24 November, 2016; originally announced November 2016.

    Comments: Accepted as regular paper in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

  5. Lip Reading Sentences in the Wild

    Authors: Joon Son Chung, Andrew Senior, Oriol Vinyals, Andrew Zisserman

    Abstract: The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WL…

    Submitted 30 January, 2017; v1 submitted 16 November, 2016; originally announced November 2016.

  6. arXiv:1611.02185  [pdf, other]

    cs.LG

    Trusting SVM for Piecewise Linear CNNs

    Authors: Leonard Berrada, Andrew Zisserman, M. Pawan Kumar

    Abstract: We present a novel layerwise optimization algorithm for the learning objective of Piecewise-Linear Convolutional Neural Networks (PL-CNNs), a large class of convolutional neural networks. Specifically, PL-CNNs employ piecewise linear non-linearities such as the commonly used ReLU and max-pool, and an SVM classifier as the final layer. The key observation of our approach is that the problem corresp…

    Submitted 6 March, 2017; v1 submitted 7 November, 2016; originally announced November 2016.

  7. arXiv:1608.02059  [pdf, other]

    cs.CV

    Signs in time: Encoding human motion as a temporal image

    Authors: Joon Son Chung, Andrew Zisserman

    Abstract: The goal of this work is to recognise and localise short temporal signals in image time series, where strong supervision is not available for training. To this end we propose an image encoding that concisely represents human motion in a video sequence in a form that is suitable for learning with a ConvNet. The encoding reduces the pose information from an image to a single column, dramatically d…

    Submitted 5 August, 2016; originally announced August 2016.

  8. arXiv:1605.02914  [pdf, other]

    cs.CV cs.NE

    Recurrent Human Pose Estimation

    Authors: Vasileios Belagiannis, Andrew Zisserman

    Abstract: We propose a novel ConvNet model for predicting 2D human body poses in an image. The model regresses a heatmap representation for each body keypoint, and is able to learn and represent both the part appearances and the context of the part configuration. We make the following three contributions: (i) an architecture combining a feed forward module with a recurrent module, where the recurrent module…

    Submitted 5 August, 2017; v1 submitted 10 May, 2016; originally announced May 2016.

    Comments: FG 2017, More Info and Demo: http://www.robots.ox.ac.uk/~vgg/software/keypoint_detection/

  9. arXiv:1604.06646  [pdf, other]

    cs.CV

    Synthetic Data for Text Localisation in Natural Images

    Authors: Ankush Gupta, Andrea Vedaldi, Andrew Zisserman

    Abstract: In this paper we introduce a new method for text detection in natural images. The method comprises two contributions: First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text to existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regr…

    Submitted 22 April, 2016; originally announced April 2016.

  10. arXiv:1604.06573  [pdf, other]

    cs.CV

    Convolutional Two-Stream Network Fusion for Video Action Recognition

    Authors: Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman

    Abstract: Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fus…

    Submitted 26 September, 2016; v1 submitted 22 April, 2016; originally announced April 2016.

    Comments: in Proc. CVPR 2016

  11. arXiv:1603.03958  [pdf, other]

    cs.CV

    Template Adaptation for Face Verification and Identification

    Authors: Nate Crosswhite, Jeffrey Byrne, Omkar M. Parkhi, Chris Stauffer, Qiong Cao, Andrew Zisserman

    Abstract: Face recognition performance evaluation has traditionally focused on one-to-one verification, popularized by the Labeled Faces in the Wild dataset for imagery and the YouTubeFaces dataset for videos. In contrast, the newly released IJB-A face recognition dataset unifies evaluation of one-to-many face identification with one-to-one face verification over templates, or sets of imagery and videos for…

    Submitted 5 April, 2016; v1 submitted 12 March, 2016; originally announced March 2016.

  12. arXiv:1511.06676  [pdf, other]

    cs.CV

    Personalizing Human Video Pose Estimation

    Authors: James Charles, Tomas Pfister, Derek Magee, David Hogg, Andrew Zisserman

    Abstract: We propose a personalized ConvNet pose estimator that automatically adapts itself to the uniqueness of a person's appearance to improve pose estimation in long videos. We make the following contributions: (i) we show that given a few high-precision pose annotations, e.g. from a generic ConvNet pose estimator, additional annotations can be generated throughout the video using a combination of image…

    Submitted 15 June, 2016; v1 submitted 20 November, 2015; originally announced November 2015.

    Comments: CVPR 2016

  13. arXiv:1506.02897  [pdf, other]

    cs.CV

    Flowing ConvNets for Human Pose Estimation in Videos

    Authors: Tomas Pfister, James Charles, Andrew Zisserman

    Abstract: The objective of this work is human pose estimation in videos, where multiple frames are available. We investigate a ConvNet architecture that is able to benefit from temporal context by combining information across the multiple frames using optical flow. To this end we propose a network architecture with the following novelties: (i) a deeper network than previously investigated for regressing h…

    Submitted 8 November, 2015; v1 submitted 9 June, 2015; originally announced June 2015.

    Comments: ICCV'15

  14. arXiv:1506.02025  [pdf, other]

    cs.CV

    Spatial Transformer Networks

    Authors: Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

    Abstract: Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module…

    Submitted 4 February, 2016; v1 submitted 5 June, 2015; originally announced June 2015.

  15. arXiv:1412.6598  [pdf, other]

    cs.CV cs.LG

    Automatic Discovery and Optimization of Parts for Image Classification

    Authors: Sobhan Naderi Parizi, Andrea Vedaldi, Andrew Zisserman, Pedro Felzenszwalb

    Abstract: Part-based representations have been shown to be very useful for image classification. Learning part-based models is often viewed as a two-stage problem. First, a collection of informative parts is discovered, using heuristics that promote part distinctiveness and diversity, and then classifiers are trained on the vector of part responses. In this paper we unify the two stages and learn the image…

    Submitted 11 April, 2015; v1 submitted 19 December, 2014; originally announced December 2014.

    Comments: 19 pages, template changed to camera ready version, 1 reference added, 1 reference fixed, Fig. 3, 4 updated (larger text)

  16. arXiv:1412.5903  [pdf, other]

    cs.CV

    Deep Structured Output Learning for Unconstrained Text Recognition

    Authors: Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

    Abstract: We develop a representation suitable for the unconstrained recognition of words in natural images: the general case of no fixed lexicon and unknown length. To this end we propose a convolutional neural network (CNN) based architecture which incorporates a Conditional Random Field (CRF) graphical model, taking the whole word image as a single input. The unaries of the CRF are provided by a CNN th…

    Submitted 10 April, 2015; v1 submitted 18 December, 2014; originally announced December 2014.

    Comments: arXiv admin note: text overlap with arXiv:1406.2227

  17. arXiv:1412.1842  [pdf, other]

    cs.CV

    Reading Text in the Wild with Convolutional Neural Networks

    Authors: Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

    Abstract: In this work we present an end-to-end system for text spotting -- localising and recognising text in natural scene images -- and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast s…

    Submitted 4 December, 2014; originally announced December 2014.

  18. arXiv:1409.1556  [pdf, ps, other]

    cs.CV

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Authors: Karen Simonyan, Andrew Zisserman

    Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19…

    Submitted 10 April, 2015; v1 submitted 4 September, 2014; originally announced September 2014.

  19. arXiv:1407.4764  [pdf, other]

    cs.CV cs.LG cs.NE

    Efficient On-the-fly Category Retrieval using ConvNets and GPUs

    Authors: Ken Chatfield, Karen Simonyan, Andrew Zisserman

    Abstract: We investigate the gains in precision and speed, that can be obtained by using Convolutional Networks (ConvNets) for on-the-fly retrieval - where classifiers are learnt at run time for a textual query from downloaded images, and used to rank large image or video datasets. We make three contributions: (i) we present an evaluation of state-of-the-art image representations for object category retri…

    Submitted 17 November, 2014; v1 submitted 17 July, 2014; originally announced July 2014.

    Comments: Published in proceedings of ACCV 2014

  20. arXiv:1406.2227  [pdf, other]

    cs.CV

    Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

    Authors: Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

    Abstract: In this work we present a framework for the recognition of natural scene text. Our framework does not require any human-labelled data, and performs word recognition on the whole image holistically, departing from the character based recognition systems of the past. The deep neural network models at the centre of this framework are trained solely on data produced by a synthetic text generation engi…

    Submitted 9 December, 2014; v1 submitted 9 June, 2014; originally announced June 2014.

  21. arXiv:1406.2199  [pdf, other]

    cs.CV

    Two-Stream Convolutional Networks for Action Recognition in Videos

    Authors: Karen Simonyan, Andrew Zisserman

    Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we p…

    Submitted 12 November, 2014; v1 submitted 9 June, 2014; originally announced June 2014.

  22. arXiv:1405.3866  [pdf, other]

    cs.CV

    Speeding up Convolutional Neural Networks with Low Rank Expansions

    Authors: Max Jaderberg, Andrea Vedaldi, Andrew Zisserman

    Abstract: The focus of this paper is speeding up the evaluation of convolutional neural networks. While delivering impressive results across a range of computer vision and machine learning tasks, these networks are computationally demanding, limiting their deployability. Convolutional layers generally consume the bulk of the processing time, and so in this work we present two simple schemes for drastically…

    Submitted 15 May, 2014; originally announced May 2014.

  23. arXiv:1405.3531  [pdf, other]

    cs.CV

    Return of the Devil in the Details: Delving Deep into Convolutional Nets

    Authors: Ken Chatfield, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

    Abstract: The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visua…

    Submitted 5 November, 2014; v1 submitted 14 May, 2014; originally announced May 2014.

    Comments: Published in proceedings of BMVC 2014

  24. arXiv:1312.6034  [pdf, other]

    cs.CV

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Authors: Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

    Abstract: This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first one generates an image, which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class, captured by a ConvNe…

    Submitted 19 April, 2014; v1 submitted 20 December, 2013; originally announced December 2013.

  25. arXiv:0809.3083  [pdf, ps, other]

    cs.CV

    Supervised Dictionary Learning

    Authors: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, Andrew Zisserman

    Abstract: It is now well established that sparse signal models are well suited to restoration tasks and can effectively be learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in…

    Submitted 18 September, 2008; originally announced September 2008.

    Report number: RR-6652