Search | arXiv e-print repository

Investigating the Nature of 3D Generalization in Deep Neural Networks

Authors: Shoaib Ahmed Siddiqui, David Krueger, Thomas Breuel

Abstract: Visual object recognition systems need to generalize from a set of 2D training views to novel views. The question of how the human visual system can generalize to novel views has been studied and modeled in psychology, computer vision, and neuroscience. Modern deep learning architectures for object recognition generalize well to novel views, but the mechanisms are not well understood. In this paper, we characterize the ability of common deep learning architectures to generalize to novel views. We formulate this as a supervised classification task where labels correspond to unique 3D objects and examples correspond to 2D views of the objects at different 3D orientations. We consider three common models of generalization to novel views: (i) full 3D generalization, (ii) pure 2D matching, and (iii) matching based on a linear combination of views. We find that deep models generalize well to novel views, but they do so in a way that differs from all these existing models. Extrapolation to views beyond the range covered by views in the training set is limited, and extrapolation to novel rotation axes is even more limited, implying that the networks do not infer full 3D structure, nor use linear interpolation. Yet, generalization is far superior to pure 2D matching. These findings help with designing datasets with 2D views required to achieve 3D generalization. Code to reproduce our experiments is publicly available: https://github.com/shoaibahmed/investigating_3d_generalization.git

Submitted 18 April, 2023; originally announced April 2023.

Comments: 15 pages, 15 figures, CVPR format

arXiv:2202.11094 [pdf, other]

GroupViT: Semantic Segmentation Emerges from Text Supervision

Authors: Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang

Abstract: Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision. We open-source our code at https://github.com/NVlabs/GroupViT.

Submitted 18 July, 2022; v1 submitted 22 February, 2022; originally announced February 2022.

Comments: CVPR 2022. Project page and code: https://jerryxu.net/GroupViT

arXiv:2107.04827 [pdf, other]

Identifying Layers Susceptible to Adversarial Attacks

Authors: Shoaib Ahmed Siddiqui, Thomas Breuel

Abstract: In this paper, we investigate the use of pretraining with adversarial networks, with the objective of discovering the relationship between network depth and robustness. For this purpose, we selectively retrain different portions of VGG and ResNet architectures on CIFAR-10, Imagenette, and ImageNet using non-adversarial and adversarial data. Experimental results show that susceptibility to adversarial samples is associated with low-level feature extraction layers. Therefore, retraining of high-level layers is insufficient for achieving robustness. Furthermore, adversarial attacks yield outputs from early layers that differ statistically from features for non-adversarial samples and do not permit consistent classification by subsequent layers. This supports common hypotheses regarding the association of robustness with the feature extractor, insufficiency of deeper layers in providing robustness, and large differences in adversarial and non-adversarial feature vectors.

Submitted 28 October, 2021; v1 submitted 10 July, 2021; originally announced July 2021.

arXiv:2101.10803 [pdf, other]

ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning

Authors: Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, Yale Song

Abstract: The natural association between visual observations and their corresponding sound provides powerful self-supervisory signals for learning video representations, which makes the ever-growing amount of online videos an attractive source of training data. However, large portions of online videos contain irrelevant audio-visual signals because of edited/overdubbed audio, and models trained on such uncurated videos have been shown to learn suboptimal representations. Therefore, existing approaches rely almost exclusively on datasets with predetermined taxonomies of semantic concepts, where there is a high chance of audio-visual correspondence. Unfortunately, constructing such datasets requires labor-intensive manual annotation and/or verification, which severely limits the utility of online videos for large-scale learning. In this work, we present an automatic dataset curation approach based on subset optimization where the objective is to maximize the mutual information between audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data achieve competitive performance compared to models trained on existing manually curated datasets. The most significant benefit of our approach is scalability: we release ACAV100M, which contains 100 million videos with high audio-visual correspondence, ideal for self-supervised video representation learning.

Submitted 16 August, 2021; v1 submitted 26 January, 2021; originally announced January 2021.

Comments: Published to ICCV2021

arXiv:2012.04124 [pdf, other]

Parameter Efficient Multimodal Transformers for Video Representation Learning

Authors: Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

Abstract: The recent success of Transformers in the language domain has motivated adapting them to a multimodal setting, where a new visual model is trained in tandem with an already pretrained language model. However, due to the excessive memory requirements of Transformers, existing work typically fixes the language model and trains only the vision module, which limits its ability to learn cross-modal information in an end-to-end manner. In this work, we focus on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning. We alleviate the high memory requirement by sharing the parameters of Transformers across layers and modalities; we decompose the Transformer into modality-specific and modality-shared parts so that the model learns the dynamics of each modality both individually and together, and propose a novel parameter sharing scheme based on low-rank approximation. We show that our approach reduces the parameters of the Transformers by up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.

Submitted 22 September, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

Comments: Accepted to ICLR 2021

arXiv:2012.00899 [pdf, other]

Displacement-Invariant Cost Computation for Efficient Stereo Matching

Authors: Yiran Zhong, Charles Loop, Wonmin Byeon, Stan Birchfield, Yuchao Dai, Kaihao Zhang, Alexey Kamenev, Thomas Breuel, Hongdong Li, Jan Kautz

Abstract: Although deep learning-based methods have dominated stereo matching leaderboards by yielding unprecedented disparity accuracy, their inference time is typically slow, on the order of seconds for a pair of 540p images. The main reason is that the leading methods employ time-consuming 3D convolutions applied to a 4D feature volume. A common way to speed up the computation is to downsample the feature volume, but this loses high-frequency details. To overcome these challenges, we propose a displacement-invariant cost computation module to compute the matching costs without needing a 4D feature volume. Rather, costs are computed by applying the same 2D convolution network on each disparity-shifted feature map pair independently. Unlike previous 2D convolution-based methods that simply perform context mapping between inputs and disparity maps, our proposed approach learns to match features between the two images. We also propose an entropy-based refinement strategy to refine the computed disparity map, which further improves speed by avoiding the need to compute a second disparity map on the right image. Extensive experiments on standard datasets (SceneFlow, KITTI, ETH3D, and Middlebury) demonstrate that our method achieves competitive accuracy with much less inference time. On typical image sizes, our method processes over 100 FPS on a desktop GPU, making our method suitable for time-critical applications such as autonomous driving. We also show that our approach generalizes well to unseen datasets, outperforming 4D-volumetric methods.

Submitted 1 December, 2020; originally announced December 2020.

Comments: 8 pages

arXiv:2001.01885 [pdf, other]

Discovering Nonlinear Relations with Minimum Predictive Information Regularization

Authors: Tailin Wu, Thomas Breuel, Michael Skuhersky, Jan Kautz

Abstract: Identifying the underlying directional relations from observational time series with nonlinear interactions and complex relational structures is key to a wide range of applications, yet remains a hard problem. In this work, we introduce a novel minimum predictive information regularization method to infer directional relations from time series, allowing deep learning models to discover nonlinear relations. Our method substantially outperforms other methods for learning nonlinear relations in synthetic datasets, and discovers the directional relations in a video game environment and a heart-rate vs. breath-rate dataset.

Submitted 6 January, 2020; originally announced January 2020.

Comments: 26 pages, 11 figures; ICML'19 Time Series Workshop

arXiv:2001.01858 [pdf]

High Performance I/O For Large Scale Deep Learning

Authors: Alex Aizman, Gavin Maltby, Thomas Breuel

Abstract: Training deep learning (DL) models on petascale datasets is essential for achieving competitive and state-of-the-art performance in applications such as speech, video analytics, and object recognition. However, existing distributed filesystems were not developed for the access patterns and usability requirements of DL jobs. In this paper, we describe AIStore, a highly scalable, easy-to-deploy storage system, and WebDataset, a standards-based storage format and library that permits efficient access to very large datasets. We compare system performance experimentally using image classification workloads and storing training data on a variety of backends, including local SSDs, single-node NFS, and two identical bare-metal clusters: HDFS and AIStore.

Submitted 6 January, 2020; originally announced January 2020.

Comments: 6 pages, 8 figures

arXiv:1804.10123 [pdf, other]

IamNN: Iterative and Adaptive Mobile Neural Network for Efficient Image Classification

Authors: Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, Jan Kautz

Abstract: Deep residual networks (ResNets) made a recent breakthrough in deep learning. The core idea of ResNets is to have shortcut connections between layers that allow the network to be much deeper while still being easy to optimize, avoiding vanishing gradients. These shortcut connections have interesting side-effects that make ResNets behave differently from other typical network architectures. In this work, we use these properties to design a network based on a ResNet but with parameter sharing and with adaptive computation time. The resulting network is much smaller than the original network and can adapt the computational cost to the complexity of the input image.

Submitted 26 April, 2018; originally announced April 2018.

Comments: ICLR 2018 Workshop track

arXiv:1804.09534 [pdf, other]

Hand Pose Estimation via Latent 2.5D Heatmap Regression

Authors: Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, Jan Kautz

Abstract: Estimating the 3D pose of a hand is an essential part of human-computer interaction. Estimating 3D pose using depth or multi-view sensors has become easier with recent advances in computer vision; however, regressing pose from a single RGB image is much less straightforward. The main difficulty arises from the fact that 3D pose requires some form of depth estimates, which are ambiguous given only an RGB image. In this paper, we propose a new method for 3D hand pose estimation from a monocular image through a novel 2.5D pose representation. Our new representation estimates pose up to a scaling factor, which can be estimated additionally if a prior of the hand size is given. We implicitly learn depth maps and heatmap distributions with a novel CNN architecture. Our system achieves state-of-the-art estimation of 2D and 3D hand pose on several challenging datasets in the presence of severe occlusions.

Submitted 25 April, 2018; originally announced April 2018.

arXiv:1703.00848 [pdf, other]

Unsupervised Image-to-Image Translation Networks

Authors: Ming-Yu Liu, Thomas Breuel, Jan Kautz

Abstract: Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive at the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high-quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available at https://github.com/mingyuliutw/unit.

Submitted 22 July, 2018; v1 submitted 2 March, 2017; originally announced March 2017.

Comments: NIPS 2017, 11 pages, 6 figures

arXiv:1610.09565 [pdf, other]

Sequence-to-sequence neural network models for transliteration

Authors: Mihaela Rosca, Thomas Breuel

Abstract: Transliteration is a key component of machine translation systems and software internationalization. This paper demonstrates that neural sequence-to-sequence models obtain state-of-the-art or close to state-of-the-art results on existing datasets. In an effort to make machine transliteration accessible, we open source a new Arabic-to-English transliteration dataset and our trained models.

Submitted 29 October, 2016; originally announced October 2016.

arXiv:1609.03415 [pdf, ps, other]

Active Canny: Edge Detection and Recovery with Open Active Contour Models

Authors: Muhammet Bastan, S. Saqib Bukhari, Thomas M. Breuel

Abstract: We introduce an edge detection and recovery framework based on open active contour models (snakelets). This is motivated by the noisy or broken edges output by standard edge detection algorithms, like Canny. The idea is to utilize the local continuity and smoothness cues provided by strong edges and grow them to recover the missing edges. This way, the strong edges are used to recover weak or missing edges by considering the local edge structures, instead of blindly linking them if gradient magnitudes are above some threshold. We initialize short snakelets on the gradient magnitudes or binary edges automatically and then deform and grow them under the influence of gradient vector flow. The output snakelets are able to recover most of the breaks or weak edges, and they provide a smooth edge representation of the image; they can also be used for higher level analysis, like contour segmentation.

Submitted 12 September, 2016; originally announced September 2016.

arXiv:1606.02617 [pdf, other]

doi 10.13140/RG.2.1.5045.4649

Efficient Estimation of k for the Nearest Neighbors Class of Methods

Authors: Aleksander Lodwich, Faisal Shafait, Thomas Breuel

Abstract: The k Nearest Neighbors (kNN) method has received much attention in the past decades, during which some theoretical bounds on its performance were identified and practical optimizations were proposed for making it work fairly well in high dimensional spaces and on large datasets. From countless past experiments, it has become widely accepted that the value of k has a significant impact on the performance of this method. However, the efficient optimization of this parameter has not received much attention in the literature. Today, the most common approach is to cross-validate or bootstrap this value for all values in question. This approach forces distances to be recomputed many times, even if efficient methods are used. Hence, estimating the optimal k can become expensive even on modern systems. Frequently, this circumstance leads to a sparse manual search of k. In this paper, we point out that a systematic and thorough estimation of the parameter k can be performed efficiently. The discussed approach relies on large matrices, but we argue that in practice a higher space complexity is often much less of a problem than repetitive distance computations.

Submitted 13 June, 2016; v1 submitted 8 June, 2016; originally announced June 2016.

Comments: Technical Report, 16p, alternative source: http://lodwich.net/Science.html
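The trade-off described in the abstract above (spend memory on one large matrix to avoid recomputing distances for every candidate k) can be illustrated with a minimal sketch. This is not the report's algorithm: the function name `loo_error_for_all_k`, the use of leave-one-out error, squared Euclidean distance, and the tie-breaking rule are all assumptions made for illustration.

```python
import numpy as np

def loo_error_for_all_k(X, y, k_max):
    """Leave-one-out kNN error for every k in 1..k_max (hypothetical helper).

    Distances are computed once; every k reuses the same sorted neighbor lists.
    Requires k_max <= n - 1, where n is the number of samples.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel()
    classes, y_idx = np.unique(y, return_inverse=True)

    # Pairwise squared Euclidean distances, computed a single time.
    # Note: the intermediate array is n x n x d, so this suits small n only.
    diff = X[:, None, :] - X[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor

    # Indices of the k_max nearest neighbors of each point, nearest first.
    order = np.argsort(d2, axis=1, kind="stable")[:, :k_max]
    neighbor_labels = y_idx[order]                      # shape (n, k_max)

    # Running vote counts: votes[i, k-1, c] = votes for class c among the
    # k nearest neighbors of point i.
    one_hot = np.eye(len(classes), dtype=int)[neighbor_labels]
    votes = np.cumsum(one_hot, axis=1)                  # shape (n, k_max, C)

    # Majority vote for every k at once; ties go to the lowest class index.
    pred = votes.argmax(axis=2)                         # shape (n, k_max)
    return (pred != y_idx[:, None]).mean(axis=0)        # errors[k-1]


X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
y = [0, 0, 0, 1, 1, 1]
errors = loo_error_for_all_k(X, y, k_max=5)
best_k = int(np.argmin(errors)) + 1
```

The cumulative sum over neighbor rank is what lets a single sort serve every k; the cost is the O(n²) distance matrix, which is the space-for-time exchange the abstract argues is usually acceptable.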

arXiv:1603.04871 [pdf, other]

Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation

Authors: Zhicheng Yan, Hao Zhang, Yangqing Jia, Thomas Breuel, Yizhou Yu

Abstract: State-of-the-art results of semantic segmentation are established by Fully Convolutional neural Networks (FCNs). FCNs rely on cascaded convolutional and pooling layers to gradually enlarge the receptive fields of neurons, resulting in an indirect way of modeling the distant contextual dependence. In this work, we advocate the use of spatially recurrent layers (i.e. ReNet layers) which directly capture global contexts and lead to improved feature representations. We demonstrate the effectiveness of ReNet layers by building a Naive deep ReNet (N-ReNet), which achieves competitive performance on the Stanford Background dataset. Furthermore, we integrate ReNet layers with FCNs, and develop a novel Hybrid deep ReNet (H-ReNet). It enjoys a few remarkable properties, including full-image receptive fields, end-to-end training, and efficient network execution. On the PASCAL VOC 2012 benchmark, the H-ReNet improves the results of state-of-the-art approaches Piecewise, CRFasRNN and DeepParsing by 3.6%, 2.3% and 0.2%, respectively, and achieves the highest IoUs for 13 out of the 20 object classes.

Submitted 15 March, 2016; originally announced March 2016.

Comments: 14 pages

arXiv:1511.04401 [pdf, other]

Symbol Grounding Association in Multimodal Sequences with Missing Elements

Authors: Federico Raue, Andreas Dengel, Thomas M. Breuel, Marcus Liwicki

Abstract: In this paper, we extend a symbolic association framework to handle missing elements in multimodal sequences. The general scope of the work is the symbolic association of object-word mappings as it happens in language development in infants. In other words, two different representations of the same abstract concepts can associate in both directions. This scenario has long been of interest in Artificial Intelligence, Psychology, and Neuroscience. In this work, we extend a recent approach for multimodal sequences (visual and audio) to also cope with missing elements in one or both modalities. Our method uses two parallel Long Short-Term Memories (LSTMs) with a learning rule based on the EM algorithm. It aligns both LSTM outputs via Dynamic Time Warping (DTW). We propose to include an extra step for the combination with the max operation for exploiting the common elements between both sequences. The motivation is that the combination acts as a condition selector for choosing the best representation from both LSTMs. We evaluated the proposed extension in the following scenarios: missing elements in one modality (visual or audio) and missing elements in both modalities (visual and audio). The performance of our extension is better than that of the original model and similar to that of individual LSTMs trained on each modality.

Submitted 7 December, 2017; v1 submitted 13 November, 2015; originally announced November 2015.

Comments: Under review on Journal of Artificial Intelligence Research (JAIR) -- Special Track on Deep Learning, Knowledge Representation, and Reasoning
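The abstract above mentions aligning two LSTM output sequences with Dynamic Time Warping. For readers unfamiliar with DTW, the following is a minimal sketch of the textbook dynamic-programming recurrence on scalar sequences. It is an illustration only: the paper aligns LSTM output vectors, and the function name `dtw_distance` and the absolute-difference local cost are assumptions, not the authors' implementation.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two scalar sequences (illustrative helper)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Extend the cheapest of: step in a only, step in b only, or both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# A repeated element in one sequence is absorbed by the warping path.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because the path may advance in either sequence alone, sequences of different lengths can be compared without resampling, which is why DTW suits outputs from two modalities that run at different rates.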

arXiv:1508.02792 [pdf, ps, other]

Possible Mechanisms for Neural Reconfigurability and their Implications

Authors: Thomas M. Breuel

Abstract: The paper introduces a biologically and evolutionarily plausible neural architecture that allows a single group of neurons, or an entire cortical pathway, to be dynamically reconfigured to perform multiple, potentially very different computations. The paper shows that reconfigurability can account for the observed stochastic and distributed coding behavior of neurons and provides a parsimonious explanation for timing phenomena in psychophysical experiments. It also shows that reconfigurable pathways correspond to classes of statistical classifiers that include decision lists, decision trees, and hierarchical Bayesian methods. Implications for the interpretation of neurophysiological and psychophysical results are discussed, and future experiments for testing the reconfigurability hypothesis are explored.

Submitted 11 August, 2015; originally announced August 2015.

ACM Class: K.3.2

arXiv:1508.02790 [pdf, other]

On the Convergence of SGD Training of Neural Networks

Authors: Thomas M. Breuel

Abstract: Neural networks are usually trained by some form of stochastic gradient descent (SGD). A number of strategies are in common use intended to improve SGD optimization, such as learning rate schedules, momentum, and batching. These are motivated by ideas about the occurrence of local minima at different scales, valleys, and other phenomena in the objective function. Empirical results presented here suggest that these phenomena are not significant factors in SGD optimization of MLP-related objective functions, and that the behavior of stochastic gradient descent in these problems is better described as the simultaneous convergence at different rates of many, largely non-interacting subproblems.

Submitted 11 August, 2015; originally announced August 2015.

ACM Class: K.3.2

arXiv:1508.02788 [pdf, other]

The Effects of Hyperparameters on SGD Training of Neural Networks

Authors: Thomas M. Breuel

Abstract: The performance of neural network classifiers is determined by a number of hyperparameters, including learning rate, batch size, and depth. A number of attempts have been made to explore these parameters in the literature, and at times, to develop methods for optimizing them. However, exploration of parameter spaces has often been limited. In this note, I report the results of large scale experiments exploring these different parameters and their interactions.

Submitted 11 August, 2015; originally announced August 2015.

ACM Class: K.3.2

arXiv:1508.02774

Benchmarking of LSTM Networks

Authors: Thomas M. Breuel

Abstract: LSTM (Long Short-Term Memory) recurrent neural networks have been highly successful in a number of application areas. This technical report describes the use of the MNIST and UW3 databases for benchmarking LSTM networks and explores the effect of different architectural and hyperparameter choices on performance. Significant findings include: (1) LSTM performance depends smoothly on learning rates, (2) batching and momentum have no significant effect on performance, (3) softmax training outperforms least square training, (4) peephole units are not useful, (5) the standard non-linearities (tanh and sigmoid) perform best, (6) bidirectional training combined with CTC performs better than other methods.

Submitted 11 August, 2015; originally announced August 2015.

ACM Class: K.3.2

arXiv:1009.3589

Deep Self-Taught Learning for Handwritten Character Recognition

Authors: Frédéric Bastien, Yoshua Bengio, Arnaud Bergeron, Nicolas Boulanger-Lewandowski, Thomas Breuel, Youssouf Chherawala, Moustapha Cisse, Myriam Côté, Dumitru Erhan, Jeremy Eustache, Xavier Glorot, Xavier Muller, Sylvain Pannetier Lebeuf, Razvan Pascanu, Salah Rifai, Francois Savard, Guillaume Sicard

Abstract: Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures, i.e., function classes obtained by composing multiple non-linear transformations. Self-taught learning (exploiting unlabeled examples or examples from other distributions) has already been applied to deep learners, but mostly to show the advantage of unlabeled examples. Here we explore the advantage brought by out-of-distribution examples. For this purpose we developed a powerful generator of stochastic variations and noise processes for character images, including not only affine transformations but also slant, local elastic deformations, changes in thickness, background images, grey level changes, contrast, occlusion, and various types of noise. The out-of-distribution examples are obtained from these highly distorted images or by including examples of object classes different from those in the target test set. We show that deep learners benefit more from out-of-distribution examples than a corresponding shallow learner, at least in the area of handwritten character recognition. In fact, we show that they beat previously published results and reach human-level performance on both handwritten digit classification and 62-class handwritten character recognition.

Submitted 18 September, 2010; originally announced September 2010.

Report number: 1353, Dept. IRO, U. Montreal; MSC Class: 68T05; ACM Class: I.2.6

arXiv:0712.0137

View Based Methods can achieve Bayes-Optimal 3D Recognition

Authors: Thomas M. Breuel

Abstract: This paper proves that visual object recognition systems using only 2D Euclidean similarity measurements to compare object views against previously seen views can achieve the same recognition performance as observers having access to all coordinate information and capable of using arbitrary 3D models internally. Furthermore, it demonstrates that such systems do not require more training views than Bayes-optimal 3D model-based systems. For building computer vision systems, these results imply that using view-based or appearance-based techniques with carefully constructed combination of evidence mechanisms may not be at a disadvantage relative to 3D model-based systems. For computational approaches to human vision, they show that it is impossible to distinguish view-based and 3D model-based techniques for 3D object recognition solely by comparing the performance achievable by human and 3D model-based systems.