What is the Best Feature Learning Procedure in Hierarchical Recognition Architectures?

Authors: Kevin Jarrett, Koray Kavukcuoglu, Karol Gregor, Yann LeCun

Abstract: (This paper was written in November 2011 and never published. It is posted on arXiv.org in its original form in June 2016). Many recent object recognition systems have proposed using a two-phase training procedure to learn sparse convolutional feature hierarchies: unsupervised pre-training followed by supervised fine-tuning. Recent results suggest that these methods provide little improvement over purely supervised systems when the appropriate nonlinearities are included. This paper presents an empirical exploration of the space of learning procedures for sparse convolutional networks to assess which method produces the best performance. In our study, we introduce an augmentation of the Predictive Sparse Decomposition method that includes a discriminative term (DPSD). We also introduce a new single-phase supervised learning procedure that places an L1 penalty on the output state of each layer of the network. This forces the network to produce sparse codes without the expensive pre-training phase. Using DPSD with a new, complex predictor that incorporates lateral inhibition, combined with multi-scale feature pooling and supervised refinement, the system achieves a 70.6% recognition rate on Caltech-101. With the addition of convolutional training, a 77% recognition rate was obtained on the CIFAR-10 dataset.

Submitted 5 June, 2016; originally announced June 2016.

Comments: 17 pages, 3 figures
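
The single-phase procedure described in the abstract above adds an L1 penalty on the output state of every layer to the supervised loss. The sketch below illustrates only that idea. The function names, the ReLU nonlinearity, the dense (non-convolutional) layers, and the weighting factor `l1_weight` are assumptions made for illustration and are not taken from the paper.

```python
def relu(values):
    """Elementwise rectifier (an assumed nonlinearity for this sketch)."""
    return [max(0.0, v) for v in values]


def dense(weights, values):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def forward_with_l1(layers, x):
    """Run the input through every layer and accumulate the L1 norm
    of each layer's output state along the way."""
    penalty = 0.0
    state = x
    for weights in layers:
        state = relu(dense(weights, state))
        penalty += sum(abs(a) for a in state)
    return state, penalty


def total_loss(task_loss, penalty, l1_weight):
    """Supervised loss plus the weighted sparsity penalty."""
    return task_loss + l1_weight * penalty


# Hypothetical two-layer network and input.
layers = [
    [[1.0, -1.0], [0.5, 0.5]],  # layer 1: 2 inputs -> 2 units
    [[1.0, 1.0]],               # layer 2: 2 inputs -> 1 unit
]
output, penalty = forward_with_l1(layers, [2.0, 1.0])
loss = total_loss(task_loss=0.3, penalty=penalty, l1_weight=0.1)
```

Minimizing `loss` with a larger `l1_weight` pushes more activations toward zero, which is how a penalty of this kind can encourage sparse codes without a separate pre-training phase.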

arXiv:1605.00983 [pdf]

Phase 3: DCL System Using Deep Learning Approaches for Land-based or Ship-based Real-Time Recognition and Localization of Marine Mammals - Bioacoustic Applications

Authors: Peter J. Dugan, Christopher W. Clark, Yann André LeCun, Sofie M. Van Parijs

Abstract: The goal of this research phase is to investigate advanced detection and classification paradigms useful for data-mining large passive acoustic archives. Technical objectives are to develop and refine a High Performance Computing, Acoustic Data Accelerator (HPC-ADA) along with MATLAB-based software based on time series acoustic signal Detection cLassification using Machine learning Algorithms, called DeLMA. Data scientists and biologists work together, using the HPC-ADA and DeLMA technologies to explore data with newly developed techniques aimed at inspection of data extracted at large spatial and temporal scales.

Submitted 5 May, 2016; v1 submitted 3 May, 2016; originally announced May 2016.

Comments: National Oceanic Partnership Program (NOPP) sponsored by ONR and NFWF

Report number: N000141210585

arXiv:1605.00982 [pdf]

Phase 4: DCL System Using Deep Learning Approaches for Land-Based or Ship-Based Real-Time Recognition and Localization of Marine Mammals - Distributed Processing and Big Data Applications

Authors: Peter J. Dugan, Christopher W. Clark, Yann André LeCun, Sofie M. Van Parijs

Abstract: While the animal bioacoustics community at large is collecting huge amounts of acoustic data at an unprecedented pace, processing these data is problematic. Currently in bioacoustics, there is no effective way to achieve high performance computing using commercial off the shelf (COTS) or government off the shelf (GOTS) tools. Although several advances have been made in the open source and commercial software community, these offerings either support specific applications that do not integrate well with data formats in bioacoustics or they are too general. Furthermore, complex algorithms that use deep learning strategies require special considerations, such as very large libraries of exemplars (whale sounds) readily available for algorithm training and testing. Detection-classification for passive acoustics is a data-mining strategy and our goals are aligned with best practices that appeal to the general data mining and machine learning communities where the problem of processing large data is common. Therefore, the objective of this work is to advance the state of the art for data-mining large passive acoustic datasets as they pertain to bioacoustics. With this basic deficiency recognized at the forefront, portions of the grant were dedicated to fostering deep learning by way of international competitions (kaggle.com) meant to attract deep-learning solutions. The focus of this early work was to make significant progress in addressing big data systems and advanced algorithms over the duration of the grant from 2012 to 2015. This early work provided simultaneous advances in systems-algorithms research while supporting various collaborations and projects.

Submitted 5 May, 2016; v1 submitted 3 May, 2016; originally announced May 2016.

Comments: National Oceanic Partnership Program (NOPP) sponsored by ONR and NFWF

Report number: N000141210585

arXiv:1605.00972 [pdf]

Phase 2: DCL System Using Deep Learning Approaches for Land-based or Ship-based Real-Time Recognition and Localization of Marine Mammals - Machine Learning Detection Algorithms

Authors: Peter J. Dugan, Christopher W. Clark, Yann André LeCun, Sofie M. Van Parijs

Abstract: The overarching goal of this work is to advance the state of the art for detection, classification and localization (DCL) in the field of bioacoustics. This goal is primarily achieved by building a generic framework for detection-classification (DC) using a fast, efficient and scalable architecture, and by demonstrating the capabilities of this system on a variety of low-frequency and mid-frequency cetacean sounds. Two primary goals are to develop transferable technologies for detection and classification in, one: the area of advanced algorithms, such as deep learning and other methods; and two: advanced systems, capable of real-time and archival processing. For each key area, we will focus on producing publications from this work and providing tools and software to the community where/when possible. Currently massive amounts of acoustic data are being collected by various institutions, corporations and national defense agencies. The long-term goal is to provide technical capability to analyze the data using automatic algorithms for DC based on machine intelligence. The goal of the automation is to provide effective and efficient mechanisms by which to process large acoustic datasets for understanding the bioacoustic behaviors of marine mammals. This capability will provide insights into the potential ecological impacts and influences of anthropogenic ocean sounds. This work focuses on building technologies using a maturity model based on DARPA 6.1 and 6.2 processes, for basic and applied research, respectively.

Submitted 5 May, 2016; v1 submitted 3 May, 2016; originally announced May 2016.

Comments: National Oceanic Partnership Program (NOPP) sponsored by ONR and NFWF: N000141210585

Report number: N000141210585

arXiv:1605.00971 [pdf]

Phase 1: DCL System Research Using Advanced Approaches for Land-based or Ship-based Real-Time Recognition and Localization of Marine Mammals - HPC System Implementation

Authors: Peter J. Dugan, Christopher W. Clark, Yann André LeCun, Sofie M. Van Parijs

Abstract: We aim to investigate advancing the state of the art of detection, classification and localization (DCL) in the field of bioacoustics. The two primary goals are to develop transferable technologies for detection and classification in: (1) the area of advanced algorithms, such as deep learning and other methods; and (2) advanced systems, capable of real-time and archival processing. This project will focus on long-term, continuous datasets to provide automatic recognition, minimizing human time to annotate the signals. Effort will begin by focusing on several years of multi-channel acoustic data collected in the Stellwagen Bank National Marine Sanctuary (SBNMS) between 2006 and 2010. Our efforts will incorporate existing technologies in the bioacoustics signal processing community, advanced high performance computing (HPC) systems, and new approaches aimed at automatically detecting-classifying and measuring features for species-specific marine mammal sounds within passive acoustic data.

Submitted 5 May, 2016; v1 submitted 3 May, 2016; originally announced May 2016.

Comments: Year 1 National Oceanic Partnership Program Report, sponsored ONR, NFWF. N000141210585

Report number: N000141210585

arXiv:1602.06662 [pdf, other]

Recurrent Orthogonal Networks and Long-Memory Tasks

Authors: Mikael Henaff, Arthur Szlam, Yann LeCun

Abstract: Although RNNs have been shown to be powerful tools for processing sequential data, finding architectures or optimization strategies that allow them to model very long-term dependencies is still an active area of research. In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter and Schmidhuber, 1997) which are used to evaluate the ability of RNNs to store information over many time steps. We explicitly construct RNN solutions to these problems, and using these constructions, illuminate both the problems themselves and the way in which RNNs store different types of information in their hidden states. These constructions furthermore explain the success of recent methods that specify unitary initializations or constraints on the transition matrices.

Submitted 15 March, 2017; v1 submitted 22 February, 2016; originally announced February 2016.

arXiv:1511.06444 [pdf, other]

doi 10.1090/qam/1483

Universal halting times in optimization and machine learning

Authors: Levent Sagun, Thomas Trogdon, Yann LeCun

Abstract: The authors present empirical distributions for the halting time (measured by the number of iterations to reach a given accuracy) of optimization algorithms applied to two random systems: spin glasses and deep learning. Given an algorithm, which we take to be both the optimization routine and the form of the random landscape, the fluctuations of the halting time follow a distribution that, after centering and scaling, remains unchanged even when the distribution on the landscape is changed. We observe two qualitative classes: a Gumbel-like distribution that appears in Google searches, human decision times, the QR eigenvalue algorithm and spin glasses, and a Gaussian-like distribution that appears in the conjugate gradient method, a deep network with MNIST input data and a deep network with random input data. This empirical evidence suggests the presence of a class of distributions for which the halting time is independent of the underlying distribution under some conditions.

Submitted 20 February, 2017; v1 submitted 19 November, 2015; originally announced November 2015.

MSC Class: 65K10; 82D30; 37E20

Journal ref: Quart. Appl. Math. 76 (2018), 289-301

arXiv:1511.05666 [pdf, other]

Super-Resolution with Deep Convolutional Sufficient Statistics

Authors: Joan Bruna, Pablo Sprechmann, Yann LeCun

Abstract: Inverse problems in image and audio, and super-resolution in particular, can be seen as high-dimensional structured prediction problems, where the goal is to characterize the conditional distribution of a high-resolution output given its low-resolution corrupted observation. When the scaling ratio is small, point estimates achieve impressive performance, but soon they suffer from the regression-to-the-mean problem, a result of their inability to capture the multi-modality of this conditional distribution. Modeling high-dimensional image and audio distributions is a hard task, requiring both the ability to model complex geometrical structures and textured regions. In this paper, we propose to use a Gibbs distribution as the conditional model, whose sufficient statistics are given by deep convolutional neural networks. The features computed by the network are stable to local deformation, and have reduced variance when the input is a stationary texture. These properties imply that the resulting sufficient statistics minimize the uncertainty of the target signals given the degraded observations, while being highly informative. The filters of the CNN are initialized by multiscale complex wavelets, and then we propose an algorithm to fine-tune them by estimating the gradient of the conditional log-likelihood, which bears some similarities with Generative Adversarial Networks. We evaluate experimentally the proposed approach in the image super-resolution task, but the approach is general and could be used in other challenging ill-posed problems such as audio bandwidth extension.

Submitted 1 March, 2016; v1 submitted 18 November, 2015; originally announced November 2015.

arXiv:1511.05440 [pdf, other]

Deep multi-scale video prediction beyond mean square error

Authors: Michael Mathieu, Camille Couprie, Yann LeCun

Abstract: Learning to predict future images from a video sequence involves the construction of an internal representation that models the image evolution accurately, and therefore, to some degree, its content and dynamics. This is why pixel-space video prediction may be viewed as a promising avenue for unsupervised feature learning. In addition, while optical flow has been a widely studied problem in computer vision for a long time, future frame prediction is rarely approached. Still, many vision applications could benefit from the knowledge of the next frames of videos, which does not require the complexity of tracking every pixel trajectory. In this work, we train a convolutional network to generate future frames given an input sequence. To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, we propose three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function. We compare our predictions to different published results based on recurrent neural networks on the UCF101 dataset.

Submitted 26 February, 2016; v1 submitted 17 November, 2015; originally announced November 2015.
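
The abstract above names an image gradient difference loss as one remedy for blurry MSE predictions but does not give its formula. The sketch below assumes one plausible form: it compares the absolute differences between neighboring pixels in the target with those in the prediction, vertically and horizontally. The function name, the exponent `alpha`, and the list-of-rows image layout are illustrative assumptions.

```python
def gradient_difference_loss(target, pred, alpha=1.0):
    """Sum over pixels of | |target gradient| - |predicted gradient| | ** alpha,
    for vertical and horizontal neighbor differences.
    Images are lists of rows of floats with identical shapes."""
    height, width = len(target), len(target[0])
    loss = 0.0
    for i in range(height):
        for j in range(width):
            if i > 0:  # vertical neighbor difference
                g_t = abs(target[i][j] - target[i - 1][j])
                g_p = abs(pred[i][j] - pred[i - 1][j])
                loss += abs(g_t - g_p) ** alpha
            if j > 0:  # horizontal neighbor difference
                g_t = abs(target[i][j] - target[i][j - 1])
                g_p = abs(pred[i][j] - pred[i][j - 1])
                loss += abs(g_t - g_p) ** alpha
    return loss


# A sharp vertical edge and a flat "blurry" prediction of it.
sharp = [[0.0, 1.0], [0.0, 1.0]]
blurry = [[0.5, 0.5], [0.5, 0.5]]
edge_penalty = gradient_difference_loss(sharp, blurry)
perfect = gradient_difference_loss(sharp, sharp)
```

A flat prediction is the per-pixel average that MSE tends to favor under uncertainty. A term of this shape penalizes it for having no edge at all, which is the motivation for pairing it with a pixel-wise loss.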

arXiv:1511.05212 [pdf, other]

Binary embeddings with structured hashed projections

Authors: Anna Choromanska, Krzysztof Choromanski, Mariusz Bojarski, Tony Jebara, Sanjiv Kumar, Yann LeCun

Abstract: We consider the hashing mechanism for constructing binary embeddings, that involves pseudo-random projections followed by nonlinear (sign function) mappings. The pseudo-random projection is described by a matrix, where not all entries are independent random variables but instead a fixed "budget of randomness" is distributed across the matrix. Such matrices can be efficiently stored in sub-quadratic or even linear space, provide reduction in randomness usage (i.e. number of required random values), and very often lead to computational speed ups. We prove several theoretical results showing that projections via various structured matrices followed by nonlinear mappings accurately preserve the angular distance between input high-dimensional vectors. To the best of our knowledge, these results are the first that give theoretical ground for the use of general structured matrices in the nonlinear setting. In particular, they generalize previous extensions of the Johnson-Lindenstrauss lemma and prove the plausibility of the approach that was so far only heuristically confirmed for some special structured matrices. Consequently, we show that many structured matrices can be used as an efficient information compression mechanism. Our findings build a better understanding of certain deep architectures, which contain randomly weighted and untrained layers, and yet achieve high performance on different learning tasks. We empirically verify our theoretical findings and show the dependence of learning via structured hashed projections on the performance of neural network as well as nearest neighbor classifier.

Submitted 1 July, 2016; v1 submitted 16 November, 2015; originally announced November 2015.

Comments: arXiv admin note: text overlap with arXiv:1505.03190
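
The mechanism in the abstract above (a structured pseudo-random projection followed by a sign function, with the Hamming distance between codes tracking the angle between inputs) can be sketched as follows. A circulant matrix built from a single Gaussian vector is used here purely as one illustrative structured family with a linear "budget of randomness". The choice of family, the function names, and the sizes are assumptions for illustration, not details from the paper.

```python
import math
import random


def circulant_sign_embedding(x, g):
    """Binary code of x: the signs of C @ x, where row i of C is the
    vector g cyclically shifted by i (only len(g) random values needed)."""
    n = len(g)
    bits = []
    for i in range(n):
        projection = sum(g[(j - i) % n] * x[j] for j in range(n))
        bits.append(1 if projection >= 0 else 0)
    return bits


def hamming_fraction(a, b):
    """Fraction of positions where two binary codes differ."""
    return sum(u != v for u, v in zip(a, b)) / len(a)


def angle_over_pi(x, y):
    """Angular distance between x and y, normalized to [0, 1]."""
    dot = sum(u * v for u, v in zip(x, y))
    norm_x = math.sqrt(sum(u * u for u in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine) / math.pi


rng = random.Random(0)
n = 512
g = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the whole budget of randomness
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

code_x = circulant_sign_embedding(x, g)
code_y = circulant_sign_embedding(y, g)
estimated = hamming_fraction(code_x, code_y)
true_angle = angle_over_pi(x, y)
```

Here `estimated` should land near `true_angle`. The naive product above costs O(n^2); a circulant product can also be computed with a fast Fourier transform, which is one source of the speed-ups that structured matrices offer.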

arXiv:1511.03719 [pdf, other]

Universum Prescription: Regularization using Unlabeled Data

Authors: Xiang Zhang, Yann LeCun

Abstract: This paper shows that simply prescribing "none of the above" labels to unlabeled data has a beneficial regularization effect on supervised learning. We call it universum prescription because the prescribed labels cannot be one of the supervised labels. In spite of its simplicity, universum prescription obtained competitive results in training deep convolutional networks for CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. A qualitative justification of these approaches using Rademacher complexity is presented. The effect of a regularization parameter -- probability of sampling from unlabeled data -- is also studied empirically.

Submitted 17 November, 2016; v1 submitted 11 November, 2015; originally announced November 2015.

Comments: 7 pages for article, 3 pages for supplemental material. To appear in AAAI-17
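
The abstract above describes a training scheme in which unlabeled samples receive a "none of the above" label and are drawn with some probability. A minimal data-sampling sketch follows. It assumes the simplest realization, a single extra class index appended after the supervised classes; the function name, arguments, and toy data are hypothetical.

```python
import random


def sample_training_example(labeled, unlabeled, num_classes, p_unlabeled, rng):
    """Draw one (input, label) pair for a training step.

    With probability p_unlabeled the input comes from the unlabeled pool
    and is given the extra label index num_classes ("none of the above");
    otherwise a labeled pair is returned unchanged.
    """
    if rng.random() < p_unlabeled:
        return rng.choice(unlabeled), num_classes
    return rng.choice(labeled)


# Hypothetical toy data: 3 supervised classes (0, 1, 2).
labeled = [("img_a", 0), ("img_b", 1), ("img_c", 2)]
unlabeled = ["img_u1", "img_u2"]
rng = random.Random(0)

batch = [
    sample_training_example(labeled, unlabeled, num_classes=3,
                            p_unlabeled=0.25, rng=rng)
    for _ in range(8)
]
```

Under this assumption the classifier needs `num_classes + 1` outputs, and `p_unlabeled` plays the role of the regularization parameter mentioned in the abstract.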

arXiv:1510.05970 [pdf, other]

Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches

Authors: Jure Žbontar, Yann LeCun

Abstract: We present a method for extracting depth information from a rectified image pair. Our approach focuses on the first stage of many stereo algorithms: the matching cost computation. We approach the problem by learning a similarity measure on small image patches using a convolutional neural network. Training is carried out in a supervised manner by constructing a binary classification data set with examples of similar and dissimilar pairs of patches. We examine two network architectures for this task: one tuned for speed, the other for accuracy. The output of the convolutional neural network is used to initialize the stereo matching cost. A series of post-processing steps follow: cross-based cost aggregation, semiglobal matching, a left-right consistency check, subpixel enhancement, a median filter, and a bilateral filter. We evaluate our method on the KITTI 2012, KITTI 2015, and Middlebury stereo data sets and show that it outperforms other approaches on all three data sets.

Submitted 18 May, 2016; v1 submitted 20 October, 2015; originally announced October 2015.

Journal ref: JMLR 17(65):1-32, 2016

arXiv:1509.08967 [pdf, other]

Very Deep Multilingual Convolutional Neural Networks for LVCSR

Authors: Tom Sercu, Christian Puhrsch, Brian Kingsbury, Yann LeCun

Abstract: Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in LVCSR have not kept pace with recent advances in other domains where deeper neural networks provide superior performance. In this paper we propose a number of architectural advances in CNNs for LVCSR. First, we introduce a very deep convolutional network architecture with up to 14 weight layers. There are multiple convolutional layers before each pooling layer, with small 3x3 kernels, inspired by the VGG Imagenet 2014 architecture. Then, we introduce multilingual CNNs with multiple untied layers. Finally, we introduce multi-scale input features aimed at exploiting more context at negligible computational cost. We evaluate the improvements first on a Babel task for low resource speech recognition, obtaining an absolute 5.77% WER improvement over the baseline PLP DNN by training our CNN on the combined data of six different languages. We then evaluate the very deep CNNs on the Hub5'00 benchmark (using the 262 hours of SWB-1 training data) achieving a word error rate of 11.8% after cross-entropy training, a 1.4% WER improvement (10.6% relative) over the best published CNN result so far.

Submitted 23 January, 2016; v1 submitted 29 September, 2015; originally announced September 2015.

Comments: Accepted for publication at ICASSP 2016

arXiv:1509.03591 [pdf]

High Performance Computer Acoustic Data Accelerator: A New System for Exploring Marine Mammal Acoustics for Big Data Applications

Authors: Peter Dugan, John Zollweg, Marian Popescu, Denise Risch, Herve Glotin, Yann LeCun, and Christopher Clark

Abstract: This paper presents a new software model designed for distributed sonic signal detection runtime using machine learning algorithms called DeLMA. A new algorithm--Acoustic Data-mining Accelerator (ADA)--is also presented. ADA is a robust yet scalable solution for efficiently processing big sound archives using distributed computing technologies. Together, DeLMA and the ADA algorithm provide a powerful tool currently being used by the Bioacoustics Research Program (BRP) at the Cornell Lab of Ornithology, Cornell University. This paper provides a high level technical overview of the system, and discusses various aspects of the design. Basic runtime performance and project summary are presented. The DeLMA-ADA baseline performance comparing desktop serial configuration to a 64 core distributed HPC system shows runtime execution as much as 44 times faster. Performance tests using 48 cores on the HPC show a 9x to 12x efficiency improvement over a 4 core desktop solution. Project summary results for 19 east coast deployments show that the DeLMA-ADA solution has processed over three million channel hours of sound to date.

Submitted 11 September, 2015; originally announced September 2015.

Comments: Seven pages, submitted at International Conference on Machine Learning 2014, Workshop uLearnBio, unsupervised learning for bioacoustic applications

MSC Class: 68-04

arXiv:1509.01626 [pdf, other]

Character-level Convolutional Networks for Text Classification

Authors: Xiang Zhang, Junbo Zhao, Yann LeCun

Abstract: This article offers an empirical exploration on the use of character-level convolutional networks (ConvNets) for text classification. We constructed several large-scale datasets to show that character-level convolutional networks could achieve state-of-the-art or competitive results. Comparisons are offered against traditional models such as bag of words, n-grams and their TFIDF variants, and deep learning models such as word-based ConvNets and recurrent neural networks.

Submitted 3 April, 2016; v1 submitted 4 September, 2015; originally announced September 2015.

Comments: An early version of this work entitled "Text Understanding from Scratch" was posted in Feb 2015 as arXiv:1502.01710. The present paper has considerably more experimental results and a rewritten introduction, Advances in Neural Information Processing Systems 28 (NIPS 2015)

arXiv:1506.05163 [pdf, other]

Deep Convolutional Networks on Graph-Structured Data

Authors: Mikael Henaff, Joan Bruna, Yann LeCun

Abstract: Deep Learning's recent successes have mostly relied on Convolutional Networks, which exploit fundamental statistical properties of images, sounds and video data: the local stationarity and multi-scale compositional structure, which allows expressing long range interactions in terms of shorter, localized interactions. However, there exist other important examples, such as text documents or bioinformatic data, that may lack some or all of these strong statistical regularities. In this paper we consider the general question of how to construct deep architectures with small learning complexity on general non-Euclidean domains, which are typically unknown and need to be estimated from the data. In particular, we develop an extension of Spectral Networks which incorporates a Graph Estimation procedure, that we test on large-scale classification problems, matching or improving over Dropout Networks with far fewer parameters to estimate.

Submitted 16 June, 2015; originally announced June 2015.

arXiv:1506.03011 [pdf, other]

Learning to Linearize Under Uncertainty

Authors: Ross Goroshin, Michael Mathieu, Yann LeCun

Abstract: Training deep feature hierarchies to solve supervised learning tasks has achieved state of the art performance on many problems in computer vision. However, a principled way in which to train such hierarchies in the unsupervised setting has remained elusive. In this work we suggest a new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences. This is done by training a generative model to predict video frames. We also address the problem of inherent uncertainty in prediction by introducing latent variables that are non-deterministic functions of the input into the network architecture.

Submitted 10 September, 2015; v1 submitted 9 June, 2015; originally announced June 2015.

Comments: To appear at NIPS 2015

arXiv:1506.02351 [pdf, other]

Stacked What-Where Auto-encoders

Authors: Junbo Zhao, Michael Mathieu, Ross Goroshin, Yann LeCun

Abstract: We present a novel architecture, the "stacked what-where auto-encoders" (SWWAE), which integrates discriminative and generative pathways and provides a unified approach to supervised, semi-supervised and unsupervised learning without relying on sampling during training. An instantiation of SWWAE uses a convolutional net (Convnet) (LeCun et al. (1998)) to encode the input, and employs a deconvolutional net (Deconvnet) (Zeiler et al. (2010)) to produce the reconstruction. The objective function includes reconstruction terms that induce the hidden states in the Deconvnet to be similar to those of the Convnet. Each pooling layer produces two sets of variables: the "what", which is fed to the next layer, and its complementary variable, the "where", which is fed to the corresponding layer in the generative decoder.

Submitted 14 February, 2016; v1 submitted 8 June, 2015; originally announced June 2015.

Comments: Workshop track - ICLR 2016

arXiv:1504.02518 [pdf, other]

Unsupervised Feature Learning from Temporal Data

Authors: Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, Yann LeCun

Abstract: Current state-of-the-art classification and detection algorithms rely on supervised training. In this work we study unsupervised feature learning in the context of temporally coherent video data. We focus on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information. This assumption is exploited to train a convolutional pooling auto-encoder regularized by slowness and sparsity. We establish a connection between slow feature learning to metric learning and show that the trained encoder can be used to define a more temporally and semantically coherent metric.

Submitted 15 April, 2015; v1 submitted 9 April, 2015; originally announced April 2015.

Comments: arXiv admin note: substantial text overlap with arXiv:1412.6056

arXiv:1503.03438 [pdf, ps, other]

A mathematical motivation for complex-valued convolutional networks

Authors: Joan Bruna, Soumith Chintala, Yann LeCun, Serkan Piantino, Arthur Szlam, Mark Tygert

Abstract: A complex-valued convolutional network (convnet) implements the repeated application of the following composition of three operations, recursively applying the composition to an input vector of nonnegative real numbers: (1) convolution with complex-valued vectors followed by (2) taking the absolute value of every entry of the resulting vectors followed by (3) local averaging. For processing real-valued random vectors, complex-valued convnets can be viewed as "data-driven multiscale windowed power spectra," "data-driven multiscale windowed absolute spectra," "data-driven multiwavelet absolute values," or (in their most general configuration) "data-driven nonlinear multiwavelet packets." Indeed, complex-valued convnets can calculate multiscale windowed spectra when the convnet filters are windowed complex-valued exponentials. Standard real-valued convnets, using rectified linear units (ReLUs), sigmoidal (for example, logistic or tanh) nonlinearities, max. pooling, etc., do not obviously exhibit the same exact correspondence with data-driven wavelets (whereas for complex-valued convnets, the correspondence is much more than just a vague analogy). Courtesy of the exact correspondence, the remarkably rich and rigorous body of mathematical analysis for wavelets applies directly to (complex-valued) convnets.

Submitted 12 December, 2015; v1 submitted 11 March, 2015; originally announced March 2015.

Comments: 11 pages, 3 figures; this is the retitled version submitted to the journal, "Neural Computation"

Journal ref: Neural Computation, 28 (5): 815-825, May 2016
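
The three operations named in the abstract above (complex-valued filtering, entrywise absolute value, local averaging) can be illustrated with a minimal one-dimensional sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the "valid" sliding inner product used in place of a flipped convolution, the non-overlapping averaging windows, and the rectangular window on the complex exponential.

```python
import cmath
import math

def windowed_exponential(length, freq):
    # Hypothetical helper: a rectangular-windowed complex exponential filter.
    return [cmath.exp(-2j * cmath.pi * freq * j / length) for j in range(length)]

def complex_conv_layer(x, filt, pool):
    m = len(filt)
    # (1) Slide the complex-valued filter over the real input ("valid" positions only).
    conv = [sum(x[i + j] * filt[j] for j in range(m)) for i in range(len(x) - m + 1)]
    # (2) Take the absolute value of every entry.
    mags = [abs(c) for c in conv]
    # (3) Average locally over non-overlapping windows of size `pool`.
    return [sum(mags[i:i + pool]) / pool for i in range(0, len(mags) - pool + 1, pool)]

# A cosine with period 4 responds to the matching frequency and not to the DC filter.
signal = [math.cos(2 * math.pi * t / 4) for t in range(8)]
matched = complex_conv_layer(signal, windowed_exponential(4, 1), pool=2)
dc = complex_conv_layer(signal, windowed_exponential(4, 0), pool=2)
```

With a complex exponential as the filter, the output behaves like a locally averaged magnitude spectrum at that frequency, which is the sense in which the abstract relates these networks to windowed spectra.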

arXiv:1502.01710 [pdf, other]

Text Understanding from Scratch

Authors: Xiang Zhang, Yann LeCun

Abstract: This article demonstrates that we can apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts, using temporal convolutional networks (ConvNets). We apply ConvNets to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization. We show that temporal ConvNets can achieve astonishing performance without the knowledge of words, phrases, sentences and any other syntactic or semantic structures with regards to a human language. Evidence shows that our models can work for both English and Chinese.

Submitted 3 April, 2016; v1 submitted 5 February, 2015; originally announced February 2015.

Comments: This technical report is superseded by a paper entitled "Character-level Convolutional Networks for Text Classification", arXiv:1509.01626. It has considerably more experimental results and a rewritten introduction

arXiv:1412.7580 [pdf, ps, other]

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

Authors: Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, Yann LeCun

Abstract: We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution implementations: one based on NVIDIA's cuFFT library, and another based on a Facebook authored FFT implementation, fbfft, that provides significant speedups over cuFFT (over 1.5x) for whole CNNs. Both of these convolution implementations are available in open source, and are faster than NVIDIA's cuDNN implementation for many common convolutional layers (up to 23.5x for some synthetic kernel configurations). We discuss different performance regimes of convolutions, comparing areas where straightforward time domain convolutions outperform Fourier frequency domain convolutions. Details on algorithmic applications of NVIDIA GPU hardware specifics in the implementation of fbfft are also provided.

Submitted 10 April, 2015; v1 submitted 23 December, 2014; originally announced December 2014.

Comments: Camera ready for ICLR2015

arXiv:1412.7022 [pdf, ps, other]

Audio Source Separation with Discriminative Scattering Networks

Authors: Pablo Sprechmann, Joan Bruna, Yann LeCun

Abstract: In this report we describe an ongoing line of research for solving single-channel source separation problems. Many monaural signal decomposition techniques proposed in the literature operate on a feature space consisting of a time-frequency representation of the input data. A challenge faced by these approaches is to effectively exploit the temporal dependencies of the signals at scales larger than the duration of a time-frame. In this work we propose to tackle this problem by modeling the signals using a time-frequency representation with multiple temporal resolutions. The proposed representation consists of a pyramid of wavelet scattering operators, which generalizes Constant Q Transforms (CQT) with extra layers of convolution and complex modulus. We first show that learning standard models with this multi-resolution setting improves source separation results over fixed-resolution methods. As study case, we use Non-Negative Matrix Factorizations (NMF) that has been widely considered in many audio application. Then, we investigate the inclusion of the proposed multi-resolution setting into a discriminative training regime. We discuss several alternatives using different deep neural network architectures.

Submitted 27 April, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

arXiv:1412.6651 [pdf, other]

Deep learning with Elastic Averaging SGD

Authors: Sixin Zhang, Anna Choromanska, Yann LeCun

Abstract: We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers), is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to the improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose the momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. Asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.

Submitted 25 October, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

Comments: NIPS2015 camera-ready version
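
The elastic coupling between local workers and a center variable described in the abstract above can be sketched on a toy problem. This is a hedged illustration and not the authors' implementation: the scalar quadratic objective, the deterministic gradient, the step sizes, the function names, and the choice to communicate on every step are all assumptions; the update shown is a synchronous form in which each worker is pulled toward the center and the center is pulled toward the workers.

```python
def easgd_step(workers, center, grad, lr, alpha):
    # Each worker takes a gradient step plus an elastic pull toward the center.
    new_workers = [x - lr * grad(x) - alpha * (x - center) for x in workers]
    # The center moves toward the workers by the same elastic coefficient.
    new_center = center + alpha * sum(x - center for x in workers)
    return new_workers, new_center

def run_toy_easgd(steps=500):
    target = 5.0                      # minimiser of the assumed loss 0.5 * (x - target)**2
    grad = lambda x: x - target       # exact gradient; real training would use noisy minibatches
    workers, center = [0.0, 1.0, 2.0, 3.0], 0.0
    for _ in range(steps):
        workers, center = easgd_step(workers, center, grad, lr=0.1, alpha=0.05)
    return workers, center

workers, center = run_toy_easgd()
```

A smaller `alpha` weakens the elastic force, letting workers drift further from the center before being pulled back, which is the exploration effect the abstract refers to.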

arXiv:1412.6615 [pdf, other]

Explorations on high dimensional landscapes

Authors: Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun

Abstract: Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.

Submitted 6 April, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

Comments: 11 pages, 8 figures, workshop contribution at ICLR 2015

arXiv:1412.6056 [pdf, other]

Unsupervised Learning of Spatiotemporally Coherent Metrics

Authors: Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, Yann LeCun

Abstract: Current state-of-the-art classification and detection algorithms rely on supervised training. In this work we study unsupervised feature learning in the context of temporally coherent video data. We focus on feature learning from unlabeled video data, using the assumption that adjacent video frames contain semantically similar information. This assumption is exploited to train a convolutional pooling auto-encoder regularized by slowness and sparsity. We establish a connection between slow feature learning to metric learning and show that the trained encoder can be used to define a more temporally and semantically coherent metric.

Submitted 8 September, 2015; v1 submitted 18 December, 2014; originally announced December 2014.

Comments: To appear at ICCV2015

arXiv:1412.0233 [pdf, other]

The Loss Surfaces of Multilayer Networks

Authors: Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun

Abstract: We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization, and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.

Submitted 21 January, 2015; v1 submitted 30 November, 2014; originally announced December 2014.

arXiv:1411.4280 [pdf, other]

Efficient Object Localization Using Convolutional Networks

Authors: Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christopher Bregler

Abstract: Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. We introduce a novel architecture which includes an efficient `position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model to achieve improved accuracy in human joint location estimation. We show that the variance of our detector approaches the variance of human annotations on the FLIC dataset and outperforms all existing approaches on the MPII-human-pose dataset.

Submitted 9 June, 2015; v1 submitted 16 November, 2014; originally announced November 2014.

Comments: 8 pages with 1 page of citations

arXiv:1410.6973 [pdf, other]

Differentially- and non-differentially-private random decision trees

Authors: Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Yann LeCun

Abstract: We consider supervised learning with random decision trees, where the tree construction is completely random. The method is popularly used and works well in practice despite the simplicity of the setting, but its statistical mechanism is not yet well-understood. In this paper we provide strong theoretical guarantees regarding learning with random decision trees. We analyze and compare three different variants of the algorithm that have minimal memory requirements: majority voting, threshold averaging and probabilistic averaging. The random structure of the tree enables us to adapt these methods to a differentially-private setting thus we also propose differentially-private versions of all three schemes. We give upper-bounds on the generalization error and mathematically explain how the accuracy depends on the number of random decision trees. Furthermore, we prove that only logarithmic (in the size of the dataset) number of independently selected random decision trees suffice to correctly classify most of the data, even when differential-privacy guarantees must be maintained. We empirically show that majority voting and threshold averaging give the best accuracy, also for conservative users requiring high privacy guarantees. Furthermore, we demonstrate that a simple majority voting rule is an especially good candidate for the differentially-private classifier since it is much less sensitive to the choice of forest parameters than other methods.

Submitted 5 February, 2015; v1 submitted 25 October, 2014; originally announced October 2014.

arXiv:1409.7963 [pdf, other]

MoDeep: A Deep Learning Framework Using Motion Features for Human Pose Estimation

Authors: Arjun Jain, Jonathan Tompson, Yann LeCun, Christoph Bregler

Abstract: In this work, we propose a novel and efficient method for articulated human pose estimation in videos using a convolutional network architecture, which incorporates both color and motion features. We propose a new human body pose dataset, FLIC-motion, that extends the FLIC dataset with additional motion features. We apply our architecture to this dataset and report significantly better performance than current state-of-the-art pose detection systems.

Submitted 28 September, 2014; originally announced September 2014.

arXiv:1409.4326 [pdf, other]

doi 10.1109/CVPR.2015.7298767

Computing the Stereo Matching Cost with a Convolutional Neural Network

Authors: Jure Žbontar, Yann LeCun

Abstract: We present a method for extracting depth information from a rectified image pair. We train a convolutional neural network to predict how well two image patches match and use it to compute the stereo matching cost. The cost is refined by cross-based cost aggregation and semiglobal matching, followed by a left-right consistency check to eliminate errors in the occluded regions. Our stereo method achieves an error rate of 2.61% on the KITTI stereo dataset and is currently (August 2014) the top performing method on this dataset.

Submitted 20 October, 2015; v1 submitted 15 September, 2014; originally announced September 2014.

Comments: Conference on Computer Vision and Pattern Recognition (CVPR), June 2015

arXiv:1406.2984 [pdf, other]

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation

Authors: Jonathan Tompson, Arjun Jain, Yann LeCun, Christoph Bregler

Abstract: This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field. We show how this architecture is successfully applied to the challenging problem of articulated human pose estimation in monocular images. The architecture can exploit structural domain constraints such as geometric relationships between body joint locations. We show that joint training of these two model paradigms improves performance and allows us to significantly outperform existing state-of-the-art techniques.

Submitted 17 September, 2014; v1 submitted 11 June, 2014; originally announced June 2014.

arXiv:1404.7195 [pdf, other]

Fast Approximation of Rotations and Hessians matrices

Authors: Michael Mathieu, Yann LeCun

Abstract: A new method to represent and approximate rotation matrices is introduced. The method represents approximations of a rotation matrix $Q$ with linearithmic complexity, i.e. with $\frac{1}{2}n\lg(n)$ rotations over pairs of coordinates, arranged in an FFT-like fashion. The approximation is "learned" using gradient descent. It allows to represent symmetric matrices $H$ as $QDQ^T$ where $D$ is a diagonal matrix. It can be used to approximate covariance matrix of Gaussian models in order to speed up inference, or to estimate and track the inverse Hessian of an objective function by relating changes in parameters to changes in gradient along the trajectory followed by the optimization procedure. Experiments were conducted to approximate synthetic matrices, covariance matrices of real data, and Hessian matrices of objective functions involved in machine learning problems.

Submitted 28 April, 2014; originally announced April 2014.

arXiv:1404.0736 [pdf, other]

Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation

Authors: Remi Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus

Abstract: We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the linear structure present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2x, while keeping the accuracy within 1% of the original model.

Submitted 9 June, 2014; v1 submitted 2 April, 2014; originally announced April 2014.

arXiv:1312.6229 [pdf, ps, other]

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Authors: Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun

Abstract: We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classifications tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model called OverFeat.

Submitted 23 February, 2014; v1 submitted 21 December, 2013; originally announced December 2013.

arXiv:1312.6203 [pdf, other]

Spectral Networks and Locally Connected Networks on Graphs

Authors: Joan Bruna, Wojciech Zaremba, Arthur Szlam, Yann LeCun

Abstract: Convolutional Neural Networks are extremely efficient architectures in image and audio recognition tasks, thanks to their ability to exploit the local translational invariance of signal classes over their domain. In this paper we consider possible generalizations of CNNs to signals defined on more general domains without the action of a translation group. In particular, we propose two constructions, one based upon a hierarchical clustering of the domain, and another based on the spectrum of the graph Laplacian. We show through experiments that for low-dimensional graphs it is possible to learn convolutional layers with a number of parameters independent of the input size, resulting in efficient deep architectures.

Submitted 21 May, 2014; v1 submitted 20 December, 2013; originally announced December 2013.

Comments: 14 pages

arXiv:1312.5851 [pdf, other]

Fast Training of Convolutional Networks through FFTs

Authors: Michael Mathieu, Mikael Henaff, Yann LeCun

Abstract: Convolutional networks are one of the most widely employed architectures in computer vision and machine learning. In order to leverage their ability to learn complex functions, large amounts of data are required for training. Training a large convolutional network to produce state-of-the-art results can take weeks, even when using modern GPUs. Producing labels using a trained network can also be costly when dealing with web-scale datasets. In this work, we present a simple algorithm which accelerates training and inference by a significant factor, and can yield improvements of over an order of magnitude compared to existing state-of-the-art implementations. This is done by computing convolutions as pointwise products in the Fourier domain while reusing the same transformed feature map many times. The algorithm is implemented on a GPU architecture and addresses a number of related challenges.

Submitted 6 March, 2014; v1 submitted 20 December, 2013; originally announced December 2013.
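
The idea in the abstract above of computing convolutions as pointwise products in the Fourier domain rests on the convolution theorem, which the following one-dimensional sketch checks numerically. It is an illustration under stated assumptions and not the paper's GPU algorithm: the function names are hypothetical, a naive O(n^2) discrete Fourier transform stands in for a real FFT, and the reuse of transformed feature maps across many filters is not shown.

```python
import cmath

def dft(values, inverse=False):
    # Naive discrete Fourier transform; a real implementation would use an FFT.
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(total / n if inverse else total)
    return out

def direct_conv(signal, kernel):
    # Full linear convolution computed directly in the time domain.
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def fourier_conv(signal, kernel):
    # Zero-pad both inputs to the full output length, multiply spectra pointwise, invert.
    n = len(signal) + len(kernel) - 1
    spec_signal = dft(list(signal) + [0.0] * (n - len(signal)))
    spec_kernel = dft(list(kernel) + [0.0] * (n - len(kernel)))
    product = [a * b for a, b in zip(spec_signal, spec_kernel)]
    return [v.real for v in dft(product, inverse=True)]

time_domain = direct_conv([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
freq_domain = fourier_conv([1.0, 2.0, 3.0], [0.0, 1.0, 0.5])
```

Both routes give the same result up to floating-point error; the Fourier route pays off when one transformed input can be reused against many kernels, as the abstract notes.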

arXiv:1312.1847 [pdf, other]

Understanding Deep Architectures using a Recursive Convolutional Network

Authors: David Eigen, Jason Rolfe, Rob Fergus, Yann LeCun

Abstract: A key challenge in designing convolutional network models is sizing them appropriately. Many factors are involved in these decisions, including number of layers, feature maps, kernel sizes, etc. Complicating this further is the fact that each of these influence not only the numbers and dimensions of the activation units, but also the total number of parameters. In this paper we focus on assessing the independent contributions of three of these linked variables: The numbers of layers, feature maps, and parameters. To accomplish this, we employ a recursive convolutional network whose weights are tied between layers; this allows us to vary each of the three factors in a controlled setting. We find that while increasing the numbers of layers and parameters each have clear benefit, the number of feature maps (and hence dimensionality of the representation) appears ancillary, and finds most of its benefit through the introduction of more weights. Our results (i) empirically confirm the notion that adding layers alone increases computational power, within the context of convolutional layers, and (ii) suggest that precise sizing of convolutional feature map dimensions is itself of little concern; more attention should be paid to the number of parameters in these layers instead.

Submitted 19 February, 2014; v1 submitted 6 December, 2013; originally announced December 2013.

arXiv:1311.4025 [pdf, ps, other]

Signal Recovery from Pooling Representations

Authors: Joan Bruna, Arthur Szlam, Yann LeCun

Abstract: In this work we compute lower Lipschitz bounds of $\ell_p$ pooling operators for $p=1, 2, \infty$ as well as $\ell_p$ pooling operators preceded by half-rectification layers. These give sufficient conditions for the design of invertible neural network layers. Numerical experiments on MNIST and image patches confirm that pooling layers can be inverted with phase recovery algorithms. Moreover, the regularity of the inverse pooling, controlled by the lower Lipschitz constant, is empirically verified with a nearest neighbor regression.

Submitted 27 February, 2014; v1 submitted 16 November, 2013; originally announced November 2013.

Comments: 17 pages, 3 figures

arXiv:1301.3775 [pdf, other]

Discriminative Recurrent Sparse Auto-Encoders

Authors: Jason Tyler Rolfe, Yann LeCun

Abstract: We present the discriminative recurrent sparse auto-encoder model, comprising a recurrent encoder of rectified linear units, unrolled for a fixed number of iterations, and connected to two linear decoders that reconstruct the input and predict its supervised classification. Training via backpropagation-through-time initially minimizes an unsupervised sparse reconstruction error; the loss function is then augmented with a discriminative term on the supervised classification. The depth implicit in the temporally-unrolled form allows the system to exhibit all the power of deep networks, while substantially reducing the number of trainable parameters. From an initially unstructured network the hidden units differentiate into categorical-units, each of which represents an input prototype with a well-defined class; and part-units representing deformations of these prototypes. The learned organization of the recurrent encoder is hierarchical: part-units are driven directly by the input, whereas the activity of categorical-units builds up over time through interactions with the part-units. Even using a small number of hidden units per layer, discriminative recurrent sparse auto-encoders achieve excellent performance on MNIST.

Submitted 19 March, 2013; v1 submitted 16 January, 2013; originally announced January 2013.

Comments: Added clarifications suggested by reviewers. 15 pages, 10 figures

arXiv:1301.3764 [pdf, other]

Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Authors: Tom Schaul, Yann LeCun

Abstract: Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all need for tuning, while automatically reducing learning rates over time on stationary problems, and permitting learning rates to grow appropriately in non-stationary tasks. Here, we extend the idea in three directions: addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal gradients, and improving robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure that may not always be available by a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity and is hyper-parameter free.

Submitted 27 March, 2013; v1 submitted 16 January, 2013; originally announced January 2013.

Comments: Published at the First International Conference on Learning Representations (ICLR-2013). Public reviews are available at http://openreview.net/document/c14f2204-fd66-4d91-bed4-153523694041#c14f2204-fd66-4d91-bed4-153523694041

arXiv:1301.3577 [pdf, other]

Saturating Auto-Encoders

Authors: Rostislav Goroshin, Yann LeCun

Abstract: We introduce a simple new regularizer for auto-encoders whose hidden-unit activation functions contain at least one zero-gradient (saturated) region. This regularizer explicitly encourages activations in the saturated region(s) of the corresponding activation function. We call these Saturating Auto-Encoders (SATAE). We show that the saturation regularizer explicitly limits the SATAE's ability to reconstruct inputs which are not near the data manifold. Furthermore, we show that a wide variety of features can be learned when different activation functions are used. Finally, connections are established with the Contractive and Sparse Auto-Encoders.

Submitted 20 March, 2013; v1 submitted 15 January, 2013; originally announced January 2013.

arXiv:1301.3572 [pdf, other]

Indoor Semantic Segmentation using depth information

Authors: Camille Couprie, Clément Farabet, Laurent Najman, Yann LeCun

Abstract: This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on hand-crafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. We obtain state-of-the-art results on the NYU-v2 depth dataset with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in video sequences that could be processed in real-time using appropriate hardware such as an FPGA.

Submitted 14 March, 2013; v1 submitted 15 January, 2013; originally announced January 2013.

Comments: 8 pages, 3 figures

arXiv:1301.3537 [pdf, ps, other]

Learning Stable Group Invariant Representations with Convolutional Networks

Authors: Joan Bruna, Arthur Szlam, Yann LeCun

Abstract: Transformation groups, such as translations or rotations, effectively express part of the variability observed in many recognition problems. The group structure enables the construction of invariant signal representations with appealing mathematical properties, where convolutions, together with pooling operators, bring stability to additive and geometric perturbations of the input. Whereas physical transformation groups are ubiquitous in image and audio applications, they do not account for all the variability of complex signal classes. We show that the invariance properties built by deep convolutional networks can be cast as a form of stable group invariance. The network wiring architecture determines the invariance group, while the trainable filter coefficients characterize the group action. We give explanatory examples which illustrate how the network architecture controls the resulting invariance group. We also explore the principle by which additional convolutional layers induce a group factorization enabling more abstract, powerful invariant representations.

Submitted 15 January, 2013; originally announced January 2013.

Comments: 4 pages

arXiv:1301.3476 [pdf, other]

Pushing Stochastic Gradient towards Second-Order Methods -- Backpropagation Learning with Transformations in Nonlinearities

Authors: Tommi Vatanen, Tapani Raiko, Harri Valpola, Yann LeCun

Abstract: Recently, we proposed to transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. We continue the work by firstly introducing a third transformation to normalize the scale of the outputs of each hidden neuron, and secondly by analyzing the connections to second order optimization methods. We show that the transformations make a simple stochastic gradient behave closer to second-order optimization methods and thus speed up learning. This is shown both in theory and with experiments. The experiments on the third transformation show that while it further increases the speed of learning, it can also hurt performance by converging to a worse local optimum, where both the inputs and outputs of many hidden neurons are close to zero.

Submitted 11 March, 2013; v1 submitted 15 January, 2013; originally announced January 2013.

Comments: 10 pages, 5 figures, ICLR2013

arXiv:1301.1671 [pdf, other]

Causal graph-based video segmentation

Authors: Camille Couprie, Clément Farabet, Yann LeCun

Abstract: Numerous approaches in image processing and computer vision are making use of super-pixels as a pre-processing step. Among the different methods producing such over-segmentation of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time. The algorithm may be trivially extended to video segmentation by considering a video as a 3D volume; however, this cannot be the case for causal segmentation, when subsequent frames are unknown. We propose an efficient video segmentation approach that computes temporally consistent pixels in a causal manner, filling the need for causal and real time applications.

Submitted 8 January, 2013; originally announced January 2013.

Comments: 6 pages, 5 figures

arXiv:1212.0142 [pdf, ps, other]

Pedestrian Detection with Unsupervised Multi-Stage Feature Learning

Authors: Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, Yann LeCun

Abstract: Pedestrian detection is a problem of considerable practical interest. Adding to the list of successful applications of deep learning methods to vision, we report state-of-the-art and competitive results on all major pedestrian datasets with a convolutional network model. The model uses a few new twists, such as multi-stage features, connections that skip layers to integrate global shape information with local distinctive motif information, and an unsupervised method based on convolutional sparse coding to pre-train the filters at each stage.

Submitted 2 April, 2013; v1 submitted 1 December, 2012; originally announced December 2012.

Comments: 12 pages

arXiv:1206.1106 [pdf, other]

No More Pesky Learning Rates

Authors: Tom Schaul, Sixin Zhang, Yann LeCun

Abstract: The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.

Submitted 18 February, 2013; v1 submitted 5 June, 2012; originally announced June 2012.

arXiv:1204.3968 [pdf, ps, other]

Convolutional Neural Networks Applied to House Numbers Digit Classification

Authors: Pierre Sermanet, Soumith Chintala, Yann LeCun

Abstract: We classify digits of real-world house numbers using convolutional neural networks (ConvNets). ConvNets are hierarchical feature learning neural networks whose structure is biologically inspired. Unlike many popular vision approaches that are hand-designed, ConvNets can automatically learn a unique set of features optimized for a given task. We augmented the traditional ConvNet architecture by learning multi-stage features and by using Lp pooling and establish a new state-of-the-art of 94.85% accuracy on the SVHN dataset (45.2% error improvement). Furthermore, we analyze the benefits of different pooling methods and multi-stage features in ConvNets. The source code and a tutorial are available at eblearn.sf.net.

Submitted 17 April, 2012; originally announced April 2012.

Comments: 4 pages, 6 figures, 2 tables

arXiv:1202.6384 [pdf, ps, other]

Fast approximations to structured sparse coding and applications to object classification

Authors: Arthur Szlam, Karol Gregor, Yann LeCun

Abstract: We describe a method for fast approximation of sparse coding. The input space is subdivided by a binary decision tree, and we simultaneously learn a dictionary and assignment of allowed dictionary elements for each leaf of the tree. We store a lookup table with the assignments and the pseudoinverses for each node, allowing for very fast inference. We give an algorithm for learning the tree, the dictionary and the dictionary element assignment, and in the process of describing this algorithm, we discuss the more general problem of learning the groups in group structured sparse modelling. We show that our method creates good sparse representations by using it in the object recognition framework of \cite{lazebnik06,yang-cvpr-09}. Implementing our own fast version of the SIFT descriptor, the whole system runs at 20 frames per second on $321 \times 481$ sized images on a laptop with a quad-core CPU, while sacrificing very little accuracy on the Caltech 101 and 15 scenes benchmarks.

Submitted 28 February, 2012; originally announced February 2012.

Showing 101–150 of 156 results for author: LeCun, Y