Search | arXiv e-print repository

Humans Beat Deep Networks at Recognizing Objects in Unusual Poses, Given Enough Time

Authors: Netta Ollikka, Amro Abbas, Andrea Perin, Markku Kilpeläinen, Stéphane Deny

Abstract: Deep learning is closing the gap with humans on several object recognition benchmarks. Here we investigate this gap in the context of challenging images where objects are seen from unusual viewpoints. We find that humans excel at recognizing objects in unusual poses, in contrast with state-of-the-art pretrained networks (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) which are systematically brittle in this condition. Remarkably, as we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) take place when humans identify objects in unusual poses. Finally, our analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. We conclude that more work is needed to bring computer vision systems to the level of robustness of the human visual system. Understanding the nature of the mental processes taking place during extra viewing time may be key to attaining such robustness.

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.02906

ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis

Authors: Bernard Spiegl, Andrea Perin, Stéphane Deny, Alexander Ilin

Abstract: Deep learning is providing a wealth of new approaches to the old problem of novel view synthesis, from Neural Radiance Field (NeRF) based approaches to end-to-end style architectures. Each approach offers specific strengths but also comes with specific limitations in their applicability. This work introduces ViewFusion, a state-of-the-art end-to-end generative approach to novel view synthesis with unparalleled flexibility. ViewFusion consists in simultaneously applying a diffusion denoising step to any number of input views of a scene, then combining the noise gradients obtained for each view with an (inferred) pixel-weighting mask, ensuring that for each region of the target scene only the most informative input views are taken into account. Our approach resolves several limitations of previous approaches by (1) being trainable and generalizing across multiple scenes and object classes, (2) adaptively taking in a variable number of pose-free views at both train and test time, (3) generating plausible views even in severely undetermined conditions (thanks to its generative nature) -- all while generating views of quality on par or even better than state-of-the-art methods. Limitations include not generating a 3D embedding of the scene, resulting in a relatively slow inference speed, and our method only being tested on the relatively small dataset NMR. Code is available.

Submitted 5 February, 2024; originally announced February 2024.
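
The combination step described in the ViewFusion abstract above (per-view noise predictions merged through a pixel-weighting mask) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the softmax normalization across views, the array shapes, and the names combine_view_predictions, noise_preds and weight_logits are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def combine_view_predictions(noise_preds, weight_logits):
    """Merge per-view noise predictions with a per-pixel weighting mask.

    noise_preds:   array of shape (views, H, W, C), one prediction per input view
                   (hypothetical layout).
    weight_logits: array of shape (views, H, W, 1), unnormalized per-pixel scores
                   (hypothetical; assumed to be inferred by the model).
    Returns the combined prediction (H, W, C) and the normalized weights.
    """
    # Assumption: weights are normalized with a softmax across the view axis,
    # so each pixel's weights sum to 1 regardless of the number of views.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Weighted sum over views: pixels lean on the views with the largest weights.
    combined = (weights * noise_preds).sum(axis=0)
    return combined, weights
```

Because the normalization runs over the view axis only, the same function accepts any number of input views, which mirrors the variable-view property the abstract emphasizes.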

arXiv:2305.17409

On the special role of class-selective neurons in early training

Authors: Omkar Ranadive, Nikhil Thakurdesai, Ari S Morcos, Matthew Leavitt, Stéphane Deny

Abstract: It is commonly observed that deep networks trained for classification exhibit class-selective neurons in their early and intermediate layers. Intriguingly, recent studies have shown that these class-selective neurons can be ablated without deteriorating network function. But if class-selective neurons are not necessary, why do they exist? We attempt to answer this question in a series of experiments on ResNet-50s trained on ImageNet. We first show that class-selective neurons emerge during the first few epochs of training, before receding rapidly but not completely; this suggests that class-selective neurons found in trained networks are in fact vestigial remains of early training. With single-neuron ablation experiments, we then show that class-selective neurons are important for network function in this early phase of training. We also observe that the network is close to a linear regime in this early phase; we thus speculate that class-selective neurons appear early in training as quasi-linear shortcut solutions to the classification task. Finally, in causal experiments where we regularize against class selectivity at different points in training, we show that the presence of class-selective neurons early in training is critical to the successful training of the network; in contrast, class-selective neurons can be suppressed later in training with little effect on final accuracy. It remains to be understood by which mechanism the presence of class-selective neurons in the early phase of training contributes to the successful training of networks.

Submitted 27 May, 2023; originally announced May 2023.

arXiv:2302.01647

Blockwise Self-Supervised Learning at Scale

Authors: Shoaib Ahmed Siddiqui, David Krueger, Yann LeCun, Stéphane Deny

Abstract: Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure, which consists of independently training the 4 main blocks of layers of a ResNet-50 with the Barlow Twins loss function at each block, performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.

Submitted 3 February, 2023; originally announced February 2023.

arXiv:2207.08034

Progress and limitations of deep networks to recognize objects in unusual poses

Authors: Amro Abbas, Stéphane Deny

Abstract: Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks for image classification. We show that classifying these images is still a challenge for all networks tested, with an average accuracy drop of 29.5% compared to when the objects are presented upright. This brittleness is largely unaffected by various network design choices, such as training losses (e.g., supervised vs. self-supervised), architectures (e.g., convolutional networks vs. transformers), dataset modalities (e.g., images vs. image-text pairs), and data-augmentation schemes. However, networks trained on very large datasets substantially outperform others, with the best network tested (Noisy Student EfficientNet-L2 trained on JFT-300M) showing a relatively small accuracy drop of only 14.5% on unusual poses. Nevertheless, a visual inspection of the failures of Noisy Student reveals a remaining gap in robustness with the human visual system. Furthermore, combining multiple object transformations (3D rotations and scaling) further degrades the performance of all networks. Altogether, our results provide another measurement of the robustness of deep networks that is important to consider when using them in the real world. Code and datasets are available at https://github.com/amro-kamal/ObjectPose.

Submitted 16 July, 2022; originally announced July 2022.

arXiv:2103.03230

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Authors: Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

Abstract: Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins requires neither large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly, it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.

Submitted 14 June, 2021; v1 submitted 4 March, 2021; originally announced March 2021.

Comments: 13 pages, 6 figures, to appear at ICML 2021
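
The objective summarized in the Barlow Twins abstract above (push the cross-correlation matrix of the two networks' outputs toward the identity) can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the function name barlow_twins_loss, the argument names, the eps guard, and the default off-diagonal weight are illustrative assumptions.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, eps=1e-12):
    """Redundancy-reduction loss on two batches of embeddings.

    z_a, z_b: arrays of shape (batch, dim), the outputs of the two identical
              networks for two distorted versions of the same samples.
    lambda_offdiag: weight of the off-diagonal (redundancy) term; the default
              here is an illustrative assumption.
    """
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    # Cross-correlation matrix between the two sets of outputs, shape (dim, dim).
    n = z_a.shape[0]
    c = z_a.T @ z_b / n

    # Invariance term: diagonal entries should equal 1.
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)

    # Redundancy term: off-diagonal entries should equal 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)

    return on_diag + lambda_offdiag * off_diag
```

A constant (collapsed) embedding cannot make the diagonal equal to 1 after standardization, which is one way to see why this kind of objective avoids trivial solutions without needing asymmetry between the twins.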

arXiv:2102.05623

Addressing the Topological Defects of Disentanglement via Distributed Operators

Authors: Diane Bouchacourt, Mark Ibrahim, Stéphane Deny

Abstract: A core challenge in Machine Learning is to learn to disentangle natural factors of variation in data (e.g. object shape vs. pose). A popular approach to disentanglement consists in learning to map each of these factors to distinct subspaces of a model's latent representation. However, this approach has shown limited empirical success to date. Here, we show that, for a broad family of transformations acting on images (encompassing simple affine transformations such as rotations and translations), this approach to disentanglement introduces topological defects (i.e. discontinuities in the encoder). Motivated by classical results from group representation theory, we study an alternative, more flexible approach to disentanglement which relies on distributed latent operators, potentially acting on the entire latent space. We theoretically and empirically demonstrate the effectiveness of this approach to disentangle affine transformations. Our work lays a theoretical foundation for the recent success of a new generation of models using distributed operators for disentanglement.

Submitted 10 February, 2021; originally announced February 2021.

arXiv:1904.04544

Predicting synchronous firing of large neural populations from sequential recordings

Authors: Oleksandr Sorochynskyi, Stéphane Deny, Olivier Marre, Ulisse Ferrari

Abstract: A major goal in neuroscience is to understand how populations of neurons code for stimuli or actions. While the number of neurons that can be recorded simultaneously is increasing at a fast pace, in most cases these recordings cannot access a complete population. In particular, it is hard to simultaneously record all the neurons of the same type in a given area. Recent progress has made it possible to profile each recorded neuron in a given area thanks to genetic and physiological tools, and to pool together recordings from neurons of the same type across different experimental sessions. However, it is unclear how to infer the activity of a full population of neurons of the same type from these sequential recordings. Neural networks exhibit collective behaviour, e.g. noise correlations and synchronous activity, that is not directly captured by a conditionally-independent model that would just put together the spike trains from sequential recordings. Here we show that we can infer the activity of a full population of retinal ganglion cells from sequential recordings, using a novel method based on copula distributions and maximum entropy modeling. From just the spiking response of each ganglion cell to a repeated stimulus, and a few pairwise recordings, we could predict the noise correlations using copulas, and then the full activity of a large population of ganglion cells of the same type using maximum entropy modeling. Remarkably, we could generalize to predict the population responses to different stimuli and even to different experiments. We could therefore use our method to construct a very large population merging cells' responses from different experiments. We predicted synchronous activity accurately and showed it grew substantially with the number of neurons. This approach is a promising way to infer population activity from sequential recordings in sensory areas.

Submitted 9 April, 2019; originally announced April 2019.

arXiv:1901.00945

A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs

Authors: Jack Lindsey, Samuel A. Ocko, Surya Ganguli, Stephane Deny

Abstract: The visual system is hierarchically organized to process visual information in successive stages. Neural representations vary drastically across the first stages of visual processing: at the output of the retina, ganglion cell receptive fields (RFs) exhibit a clear antagonistic center-surround structure, whereas in the primary visual cortex, typical RFs are sharply tuned to a precise orientation. There is currently no unified theory explaining these differences in representations across layers. Here, using a deep convolutional neural network trained on image recognition as a model of the visual system, we show that such differences in representation can emerge as a direct consequence of different neural resource constraints on the retinal and cortical networks, and we find a single model from which both geometries spontaneously emerge at the appropriate stages of visual processing. The key constraint is a reduced number of neurons at the retinal output, consistent with the anatomy of the optic nerve as a stringent bottleneck. Second, we find that, for simple cortical networks, visual representations at the retinal output emerge as nonlinear and lossy feature detectors, whereas they emerge as linear and faithful encoders of the visual scene for more complex cortices. This result predicts that the retinas of small vertebrates should perform sophisticated nonlinear computations, extracting features directly relevant to behavior, whereas retinas of large animals such as primates should mostly encode the visual scene linearly and respond to a much broader range of stimuli. These predictions could reconcile the two seemingly incompatible views of the retina as either performing feature extraction or efficient coding of natural scenes, by suggesting that all vertebrates lie on a spectrum between these two objectives, depending on the degree of neural resources allocated to their visual system.

Submitted 3 January, 2019; originally announced January 2019.

Journal ref: International Conference on Learning Representations, 2019 https://openreview.net/forum?id=S1xq3oR5tQ

arXiv:1811.06866

Optogenetic vision restoration with high resolution

Authors: Ulisse Ferrari, Stéphane Deny, Abhishek Sengupta, Romain Caplette, José-Alain Sahel, Deniz Dalkara, Serge Picaud, Jens Duebel, Olivier Marre

Abstract: The majority of inherited retinal degenerations are due to photoreceptor cell death. In many cases ganglion cells are spared, making it possible to stimulate them to restore visual function. Several studies (Bi et al., 2006; Lin et al., 2008; Sengupta et al., 2016; Caporale et al., 2011; Berry et al., 2017) have shown that it is possible to express an optogenetic protein in ganglion cells and make them light sensitive. This is a promising strategy to restore vision since optical targeting may be more precise than electrical stimulation with a retinal prosthesis. However, the spatial resolution of optogenetically-reactivated retinas has not been measured with fine-grained stimulation patterns. Since the optogenetic protein is also expressed in axons, it is unclear if these neurons will only be sensitive to the stimulation of a small region covering their somas and dendrites, or if they will also respond to any stimulation overlapping with their axon, dramatically impairing spatial resolution. Here we recorded responses of mouse and macaque retinas to random checkerboard patterns following an in vivo optogenetic therapy. We show that optogenetically activated ganglion cells are each sensitive to a small region of visual space. A simple model based on this small receptive field accurately predicted their responses to complex stimuli. From this model, we simulated how the entire population of light-sensitive ganglion cells would respond to letters of different sizes. We then estimated the maximal acuity a patient could expect, assuming they could make optimal use of the information delivered by this reactivated retina. The obtained acuity is above the limit of legal blindness. This high spatial resolution is a promising result for future clinical studies.

Submitted 16 November, 2018; originally announced November 2018.

arXiv:1801.01823

doi 10.1103/PhysRevE.98.042410

Separating intrinsic interactions from extrinsic correlations in a network of sensory neurons

Authors: Ulisse Ferrari, Stephane Deny, Matthew Chalk, Gasper Tkacik, Olivier Marre, Thierry Mora

Abstract: Correlations in sensory neural networks have both extrinsic and intrinsic origins. Extrinsic or stimulus correlations arise from shared inputs to the network, and thus depend strongly on the stimulus ensemble. Intrinsic or noise correlations reflect biophysical mechanisms of interactions between neurons, which are expected to be robust to changes of the stimulus ensemble. Despite the importance of this distinction for understanding how sensory networks encode information collectively, no method exists to reliably separate intrinsic interactions from extrinsic correlations in neural activity data, limiting our ability to build predictive models of the network response. In this paper we introduce a general strategy to infer population models of interacting neurons that collectively encode stimulus information. The key to disentangling intrinsic from extrinsic correlations is to infer the couplings between neurons separately from the encoding model, and to combine the two using corrections calculated in a mean-field approximation. We demonstrate the effectiveness of this approach on retinal recordings. The same coupling network is inferred from responses to radically different stimulus ensembles, showing that these couplings indeed reflect stimulus-independent interactions between neurons. The inferred model predicts accurately the collective response of retinal ganglion cell populations as a function of the stimulus.

Submitted 22 February, 2018; v1 submitted 5 January, 2018; originally announced January 2018.

Journal ref: Phys. Rev. E 98, 042410 (2018)

arXiv:1801.01362

A simple model for low variability in neural spike trains

Authors: Ulisse Ferrari, Stephane Deny, Olivier Marre, Thierry Mora

Abstract: Neural noise sets a limit to information transmission in sensory systems. In several areas, the spiking response (to a repeated stimulus) has shown a higher degree of regularity than predicted by a Poisson process. However, a simple model to explain this low variability is still lacking. Here we introduce a new model, with a correction to Poisson statistics, which can accurately predict the regularity of neural spike trains in response to a repeated stimulus. The model has only two parameters, but can reproduce the observed variability in retinal recordings in various conditions. We show analytically why this approximation can work. In a model of the spike-emitting process where a refractory period is assumed, we derive that our simple correction can approximate the spike train statistics well over a broad range of firing rates. Our model can easily be plugged into stimulus processing models, such as the linear-nonlinear model or its generalizations, to replace the Poisson spike train hypothesis that is commonly assumed. It estimates the amount of information transmitted much more accurately than Poisson models in retinal recordings. Thanks to its simplicity, this model has the potential to explain low variability in other areas.

Submitted 4 January, 2018; originally announced January 2018.

arXiv:1605.03373

Nonlinear decoding of a complex movie from the mammalian retina

Authors: Vicente Botella-Soler, Stéphane Deny, Olivier Marre, Gašper Tkačik

Abstract: Retinal circuitry transforms spatiotemporal patterns of light into spiking activity of ganglion cells, which provide the sole visual input to the brain. Recent advances have led to a detailed characterization of retinal activity and stimulus encoding by large neural populations. The inverse problem of decoding, where the stimulus is reconstructed from spikes, has received less attention, in particular for complex input movies that should be reconstructed "pixel-by-pixel". We recorded around a hundred neurons from a dense patch in a rat retina and decoded movies of multiple small discs executing mutually-avoiding random motions. We constructed nonlinear (kernelized) decoders that improved significantly over linear decoding results, mostly due to their ability to reliably separate between neural responses driven by locally fluctuating light signals, and responses at locally constant light driven by spontaneous or network activity. This improvement crucially depended on the precise, non-Poisson temporal structure of individual spike trains, which originated in the spike-history dependence of neural responses. Our results suggest a general paradigm in which downstream neural circuitry could discriminate between spontaneous and stimulus-driven activity on the basis of higher-order statistical structure intrinsic to the incoming spike trains.

Submitted 11 May, 2016; originally announced May 2016.

Comments: 24 pages, 21 figures

arXiv:1410.6769

doi 10.1103/PhysRevLett.114.078105

Dynamical criticality in the collective activity of a population of retinal neurons

Authors: Thierry Mora, Stéphane Deny, Olivier Marre

Abstract: Recent experimental results based on multi-electrode and imaging techniques have reinvigorated the idea that large neural networks operate near a critical point, between order and disorder. However, evidence for criticality has relied on the definition of arbitrary order parameters, or on models that do not address the dynamical nature of network activity. Here we introduce a novel approach to assess criticality that overcomes these limitations, while encompassing and generalizing previous criteria. We find a simple model to describe the global activity of large populations of ganglion cells in the rat retina, and show that their statistics are poised near a critical point. Taking into account the temporal dynamics of the activity greatly enhances the evidence for criticality, revealing it where previous methods would not. The approach is general and could be used in other biological networks.

Submitted 31 January, 2015; v1 submitted 24 October, 2014; originally announced October 2014.

Journal ref: Phys. Rev. Lett. 114, 078105 (2015)

Showing 1–14 of 14 results for author: Deny, S