-
Humans Beat Deep Networks at Recognizing Objects in Unusual Poses, Given Enough Time
Authors:
Netta Ollikka,
Amro Abbas,
Andrea Perin,
Markku Kilpeläinen,
Stéphane Deny
Abstract:
Deep learning is closing the gap with humans on several object recognition benchmarks. Here we investigate this gap in the context of challenging images where objects are seen from unusual viewpoints. We find that humans excel at recognizing objects in unusual poses, in contrast with state-of-the-art pretrained networks (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext) which are systematically brittle in this condition. Remarkably, as we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) take place when humans identify objects in unusual poses. Finally, our analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. We conclude that more work is needed to bring computer vision systems to the level of robustness of the human visual system. Understanding the nature of the mental processes taking place during extra viewing time may be key to attaining such robustness.
Submitted 6 February, 2024;
originally announced February 2024.
-
ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis
Authors:
Bernard Spiegl,
Andrea Perin,
Stéphane Deny,
Alexander Ilin
Abstract:
Deep learning is providing a wealth of new approaches to the old problem of novel view synthesis, from Neural Radiance Field (NeRF) based approaches to end-to-end style architectures. Each approach offers specific strengths but also comes with specific limitations in their applicability. This work introduces ViewFusion, a state-of-the-art end-to-end generative approach to novel view synthesis with unparalleled flexibility. ViewFusion consists in simultaneously applying a diffusion denoising step to any number of input views of a scene, then combining the noise gradients obtained for each view with an (inferred) pixel-weighting mask, ensuring that for each region of the target scene only the most informative input views are taken into account. Our approach resolves several limitations of previous approaches by (1) being trainable and generalizing across multiple scenes and object classes, (2) adaptively taking in a variable number of pose-free views at both train and test time, (3) generating plausible views even in severely undetermined conditions (thanks to its generative nature) -- all while generating views of quality on par or even better than state-of-the-art methods. Limitations include not generating a 3D embedding of the scene, resulting in a relatively slow inference speed, and our method only being tested on the relatively small dataset NMR. Code is available.
Submitted 5 February, 2024;
originally announced February 2024.
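The combination step described in the ViewFusion abstract (per-view noise predictions merged through a pixel-weighting mask) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `combine_view_noise`, the array shapes, and the use of a softmax across views to normalize the mask are all assumptions made for illustration.

```python
import numpy as np


def combine_view_noise(noise_preds, weight_logits):
    """Merge per-view noise predictions with a per-pixel weighting mask.

    noise_preds:   array of shape (V, H, W), one noise prediction per input view.
    weight_logits: array of shape (V, H, W), unnormalized per-pixel view scores
                   (hypothetical; assumed to come from the network).
    Returns the combined (H, W) prediction and the normalized (V, H, W) weights.
    """
    # Softmax over the view axis (assumed normalization), shifted for stability.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Each pixel is a convex combination of the views' predictions.
    combined = (weights * noise_preds).sum(axis=0)
    return combined, weights
```

Because the weights are normalized over the view axis only, the same function accepts any number of views, which matches the variable-input setting the abstract describes.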
-
On the special role of class-selective neurons in early training
Authors:
Omkar Ranadive,
Nikhil Thakurdesai,
Ari S Morcos,
Matthew Leavitt,
Stéphane Deny
Abstract:
It is commonly observed that deep networks trained for classification exhibit class-selective neurons in their early and intermediate layers. Intriguingly, recent studies have shown that these class-selective neurons can be ablated without deteriorating network function. But if class-selective neurons are not necessary, why do they exist? We attempt to answer this question in a series of experiments on ResNet-50s trained on ImageNet. We first show that class-selective neurons emerge during the first few epochs of training, before receding rapidly but not completely; this suggests that class-selective neurons found in trained networks are in fact vestigial remains of early training. With single-neuron ablation experiments, we then show that class-selective neurons are important for network function in this early phase of training. We also observe that the network is close to a linear regime in this early phase; we thus speculate that class-selective neurons appear early in training as quasi-linear shortcut solutions to the classification task. Finally, in causal experiments where we regularize against class selectivity at different points in training, we show that the presence of class-selective neurons early in training is critical to the successful training of the network; in contrast, class-selective neurons can be suppressed later in training with little effect on final accuracy. It remains to be understood by which mechanism the presence of class-selective neurons in the early phase of training contributes to the successful training of networks.
Submitted 27 May, 2023;
originally announced May 2023.
-
Blockwise Self-Supervised Learning at Scale
Authors:
Shoaib Ahmed Siddiqui,
David Krueger,
Yann LeCun,
Stéphane Deny
Abstract:
Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
Submitted 3 February, 2023;
originally announced February 2023.
-
Progress and limitations of deep networks to recognize objects in unusual poses
Authors:
Amro Abbas,
Stéphane Deny
Abstract:
Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks for image classification. We show that classifying these images is still a challenge for all networks tested, with an average accuracy drop of 29.5% compared to when the objects are presented upright. This brittleness is largely unaffected by various network design choices, such as training losses (e.g., supervised vs. self-supervised), architectures (e.g., convolutional networks vs. transformers), dataset modalities (e.g., images vs. image-text pairs), and data-augmentation schemes. However, networks trained on very large datasets substantially outperform others, with the best network tested (Noisy Student EfficientNet-L2 trained on JFT-300M) showing a relatively small accuracy drop of only 14.5% on unusual poses. Nevertheless, a visual inspection of the failures of Noisy Student reveals a remaining gap in robustness with the human visual system. Furthermore, combining multiple object transformations (3D rotations and scaling) further degrades the performance of all networks. Altogether, our results provide another measurement of the robustness of deep networks that is important to consider when using them in the real world. Code and datasets are available at https://github.com/amro-kamal/ObjectPose.
Submitted 16 July, 2022;
originally announced July 2022.
-
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Authors:
Jure Zbontar,
Li Jing,
Ishan Misra,
Yann LeCun,
Stéphane Deny
Abstract:
Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
Submitted 14 June, 2021; v1 submitted 4 March, 2021;
originally announced March 2021.
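The objective described in the Barlow Twins abstract (pushing the cross-correlation matrix of the two networks' outputs toward the identity) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the `eps` stabilizer, and the `offdiag_weight` trade-off parameter with its default value are assumptions introduced here.

```python
import numpy as np


def cross_correlation(z_a, z_b, eps=1e-8):
    """Cross-correlation matrix (D x D) between two batches of embeddings (N x D)."""
    # Standardize every embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    return z_a.T @ z_b / z_a.shape[0]


def redundancy_reduction_loss(z_a, z_b, offdiag_weight=5e-3):
    """Penalize deviation of the cross-correlation matrix from the identity.

    offdiag_weight is an assumed hyperparameter balancing the two terms.
    """
    c = cross_correlation(z_a, z_b)
    diag = np.diag(c)
    # Invariance term: diagonal entries should equal 1.
    on_diag = np.sum((diag - 1.0) ** 2)
    # Redundancy term: off-diagonal entries should equal 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + offdiag_weight * off_diag
```

In this sketch a collapsed (constant) embedding standardizes to all zeros, so every diagonal entry sits at 0 rather than 1 and the invariance term is at its largest; this illustrates how the objective avoids the trivial constant solutions mentioned in the abstract.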
-
Addressing the Topological Defects of Disentanglement via Distributed Operators
Authors:
Diane Bouchacourt,
Mark Ibrahim,
Stéphane Deny
Abstract:
A core challenge in Machine Learning is to learn to disentangle natural factors of variation in data (e.g. object shape vs. pose). A popular approach to disentanglement consists in learning to map each of these factors to distinct subspaces of a model's latent representation. However, this approach has shown limited empirical success to date. Here, we show that, for a broad family of transformations acting on images--encompassing simple affine transformations such as rotations and translations--this approach to disentanglement introduces topological defects (i.e. discontinuities in the encoder). Motivated by classical results from group representation theory, we study an alternative, more flexible approach to disentanglement which relies on distributed latent operators, potentially acting on the entire latent space. We theoretically and empirically demonstrate the effectiveness of this approach to disentangle affine transformations. Our work lays a theoretical foundation for the recent success of a new generation of models using distributed operators for disentanglement.
Submitted 10 February, 2021;
originally announced February 2021.
-
Predicting synchronous firing of large neural populations from sequential recordings
Authors:
Oleksandr Sorochynskyi,
Stéphane Deny,
Olivier Marre,
Ulisse Ferrari
Abstract:
A major goal in neuroscience is to understand how populations of neurons code for stimuli or actions. While the number of neurons that can be recorded simultaneously is increasing at a fast pace, in most cases these recordings cannot access a complete population. In particular, it is hard to simultaneously record all the neurons of the same type in a given area. Recent progress has made it possible to profile each recorded neuron in a given area thanks to genetic and physiological tools, and to pool together recordings from neurons of the same type across different experimental sessions. However, it is unclear how to infer the activity of a full population of neurons of the same type from these sequential recordings. Neural networks exhibit collective behaviour, e.g. noise correlations and synchronous activity, that is not directly captured by a conditionally-independent model that would just put together the spike trains from sequential recordings. Here we show that we can infer the activity of a full population of retinal ganglion cells from sequential recordings, using a novel method based on copula distributions and maximum entropy modeling. From just the spiking response of each ganglion cell to a repeated stimulus, and a few pairwise recordings, we could predict the noise correlations using copulas, and then the full activity of a large population of ganglion cells of the same type using maximum entropy modeling. Remarkably, we could generalize to predict the population responses to different stimuli and even to different experiments. We could therefore use our method to construct a very large population merging cells' responses from different experiments. We predicted synchronous activity accurately and showed it grew substantially with the number of neurons. This approach is a promising way to infer population activity from sequential recordings in sensory areas.
Submitted 9 April, 2019;
originally announced April 2019.
-
A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs
Authors:
Jack Lindsey,
Samuel A. Ocko,
Surya Ganguli,
Stephane Deny
Abstract:
The visual system is hierarchically organized to process visual information in successive stages. Neural representations vary drastically across the first stages of visual processing: at the output of the retina, ganglion cell receptive fields (RFs) exhibit a clear antagonistic center-surround structure, whereas in the primary visual cortex, typical RFs are sharply tuned to a precise orientation. There is currently no unified theory explaining these differences in representations across layers. Here, using a deep convolutional neural network trained on image recognition as a model of the visual system, we show that such differences in representation can emerge as a direct consequence of different neural resource constraints on the retinal and cortical networks, and we find a single model from which both geometries spontaneously emerge at the appropriate stages of visual processing. The key constraint is a reduced number of neurons at the retinal output, consistent with the anatomy of the optic nerve as a stringent bottleneck. Second, we find that, for simple cortical networks, visual representations at the retinal output emerge as nonlinear and lossy feature detectors, whereas they emerge as linear and faithful encoders of the visual scene for more complex cortices. This result predicts that the retinas of small vertebrates should perform sophisticated nonlinear computations, extracting features directly relevant to behavior, whereas retinas of large animals such as primates should mostly encode the visual scene linearly and respond to a much broader range of stimuli. These predictions could reconcile the two seemingly incompatible views of the retina as either performing feature extraction or efficient coding of natural scenes, by suggesting that all vertebrates lie on a spectrum between these two objectives, depending on the degree of neural resources allocated to their visual system.
Submitted 3 January, 2019;
originally announced January 2019.
-
Optogenetic vision restoration with high resolution
Authors:
Ulisse Ferrari,
Stéphane Deny,
Abhishek Sengupta,
Romain Caplette,
José-Alain Sahel,
Deniz Dalkara,
Serge Picaud,
Jens Duebel,
Olivier Marre
Abstract:
The majority of inherited retinal degenerations are due to photoreceptor cell death. In many cases, ganglion cells are spared, making it possible to stimulate them to restore visual function. Several studies (Bi et al., 2006; Lin et al., 2008; Sengupta et al., 2016; Caporale et al., 2011; Berry et al., 2017) have shown that it is possible to express an optogenetic protein in ganglion cells and make them light sensitive. This is a promising strategy to restore vision since optical targeting may be more precise than electrical stimulation with a retinal prosthesis. However, the spatial resolution of optogenetically-reactivated retinas has not been measured with fine-grained stimulation patterns. Since the optogenetic protein is also expressed in axons, it is unclear if these neurons will only be sensitive to the stimulation of a small region covering their somas and dendrites, or if they will also respond to any stimulation overlapping with their axon, dramatically impairing spatial resolution. Here we recorded responses of mouse and macaque retinas to random checkerboard patterns following an in vivo optogenetic therapy. We show that optogenetically activated ganglion cells are each sensitive to a small region of visual space. A simple model based on this small receptive field accurately predicted their responses to complex stimuli. From this model, we simulated how the entire population of light sensitive ganglion cells would respond to letters of different sizes. We then estimated the maximal acuity expected for a patient, assuming they could make optimal use of the information delivered by this reactivated retina. The obtained acuity is above the limit of legal blindness. This high spatial resolution is a promising result for future clinical studies.
Submitted 16 November, 2018;
originally announced November 2018.
-
Separating intrinsic interactions from extrinsic correlations in a network of sensory neurons
Authors:
Ulisse Ferrari,
Stephane Deny,
Matthew Chalk,
Gasper Tkacik,
Olivier Marre,
Thierry Mora
Abstract:
Correlations in sensory neural networks have both extrinsic and intrinsic origins. Extrinsic or stimulus correlations arise from shared inputs to the network, and thus depend strongly on the stimulus ensemble. Intrinsic or noise correlations reflect biophysical mechanisms of interactions between neurons, which are expected to be robust to changes of the stimulus ensemble. Despite the importance of this distinction for understanding how sensory networks encode information collectively, no method exists to reliably separate intrinsic interactions from extrinsic correlations in neural activity data, limiting our ability to build predictive models of the network response. In this paper we introduce a general strategy to infer population models of interacting neurons that collectively encode stimulus information. The key to disentangling intrinsic from extrinsic correlations is to infer the couplings between neurons separately from the encoding model, and to combine the two using corrections calculated in a mean-field approximation. We demonstrate the effectiveness of this approach on retinal recordings. The same coupling network is inferred from responses to radically different stimulus ensembles, showing that these couplings indeed reflect stimulus-independent interactions between neurons. The inferred model predicts accurately the collective response of retinal ganglion cell populations as a function of the stimulus.
Submitted 22 February, 2018; v1 submitted 5 January, 2018;
originally announced January 2018.
-
A simple model for low variability in neural spike trains
Authors:
Ulisse Ferrari,
Stephane Deny,
Olivier Marre,
Thierry Mora
Abstract:
Neural noise sets a limit to information transmission in sensory systems. In several areas, the spiking response (to a repeated stimulus) has shown a higher degree of regularity than predicted by a Poisson process. However, a simple model to explain this low variability is still lacking. Here we introduce a new model, with a correction to Poisson statistics, which can accurately predict the regularity of neural spike trains in response to a repeated stimulus. The model has only two parameters, but can reproduce the observed variability in retinal recordings in various conditions. We show analytically why this approximation can work. In a model of the spike emitting process where a refractory period is assumed, we derive that our simple correction can well approximate the spike train statistics over a broad range of firing rates. Our model can be easily plugged into stimulus processing models, such as the linear-nonlinear model or its generalizations, to replace the Poisson spike train hypothesis that is commonly assumed. It estimates the amount of information transmitted much more accurately than Poisson models in retinal recordings. Thanks to its simplicity, this model has the potential to explain low variability in other areas.
Submitted 4 January, 2018;
originally announced January 2018.
-
Nonlinear decoding of a complex movie from the mammalian retina
Authors:
Vicente Botella-Soler,
Stéphane Deny,
Olivier Marre,
Gašper Tkačik
Abstract:
Retinal circuitry transforms spatiotemporal patterns of light into spiking activity of ganglion cells, which provide the sole visual input to the brain. Recent advances have led to a detailed characterization of retinal activity and stimulus encoding by large neural populations. The inverse problem of decoding, where the stimulus is reconstructed from spikes, has received less attention, in particular for complex input movies that should be reconstructed "pixel-by-pixel". We recorded around a hundred neurons from a dense patch in a rat retina and decoded movies of multiple small discs executing mutually-avoiding random motions. We constructed nonlinear (kernelized) decoders that improved significantly over linear decoding results, mostly due to their ability to reliably separate between neural responses driven by locally fluctuating light signals, and responses at locally constant light driven by spontaneous or network activity. This improvement crucially depended on the precise, non-Poisson temporal structure of individual spike trains, which originated in the spike-history dependence of neural responses. Our results suggest a general paradigm in which downstream neural circuitry could discriminate between spontaneous and stimulus-driven activity on the basis of higher-order statistical structure intrinsic to the incoming spike trains.
Submitted 11 May, 2016;
originally announced May 2016.
-
Dynamical criticality in the collective activity of a population of retinal neurons
Authors:
Thierry Mora,
Stéphane Deny,
Olivier Marre
Abstract:
Recent experimental results based on multi-electrode and imaging techniques have reinvigorated the idea that large neural networks operate near a critical point, between order and disorder. However, evidence for criticality has relied on the definition of arbitrary order parameters, or on models that do not address the dynamical nature of network activity. Here we introduce a novel approach to assess criticality that overcomes these limitations, while encompassing and generalizing previous criteria. We find a simple model to describe the global activity of large populations of ganglion cells in the rat retina, and show that their statistics are poised near a critical point. Taking into account the temporal dynamics of the activity greatly enhances the evidence for criticality, revealing it where previous methods would not. The approach is general and could be used in other biological networks.
Submitted 31 January, 2015; v1 submitted 24 October, 2014;
originally announced October 2014.