-
MindSet: Vision. A toolbox for testing DNNs on key psychological experiments
Authors:
Valerio Biscione,
Dong Yin,
Gaurav Malhotra,
Marin Dujmovic,
Milton L. Montero,
Guillermo Puebla,
Federico Adolfi,
Rachel F. Heaton,
John E. Hummel,
Benjamin D. Evans,
Karim Habashy,
Jeffrey S. Bowers
Abstract:
Multiple benchmarks have been developed to assess the alignment between deep neural networks (DNNs) and human vision. In almost all cases these benchmarks are observational in the sense that they are composed of behavioural and brain responses to naturalistic images that have not been manipulated to test hypotheses regarding how DNNs or humans perceive and identify objects. Here we introduce the toolbox MindSet: Vision, consisting of a collection of image datasets and related scripts designed to test DNNs on 30 psychological findings. In all experimental conditions, the stimuli are systematically manipulated to test specific hypotheses regarding human visual perception and object recognition. In addition to providing pre-generated datasets of images, we provide code to regenerate these datasets, offering many configurable parameters which greatly extend the dataset versatility for different research contexts, and code to facilitate the testing of DNNs on these image datasets using three different methods (similarity judgments, out-of-distribution classification, and decoder method), accessible at https://github.com/MindSetVision/mindset-vision. We test ResNet-152 on each of these methods as an example of how the toolbox can be used.
Submitted 8 April, 2024;
originally announced April 2024.
-
Convolutional Neural Networks Trained to Identify Words Provide a Surprisingly Good Account of Visual Form Priming Effects
Authors:
Dong Yin,
Valerio Biscione,
Jeffrey Bowers
Abstract:
A wide variety of orthographic coding schemes and models of visual word identification have been developed to account for masked priming data that provide a measure of orthographic similarity between letter strings. These models tend to include hand-coded orthographic representations with single unit coding for specific forms of knowledge (e.g., units coding for a letter in a given position). Here we assess how well a range of these coding schemes and models account for the pattern of form priming effects taken from the Form Priming Project and compare these findings to results observed with 11 standard deep neural network models (DNNs) developed in computer science. We find that deep convolutional networks (CNNs) perform as well as or better than the coding schemes and word recognition models, whereas transformer networks did less well. The success of CNNs is remarkable as their architectures were not developed to support word recognition (they were designed to perform well on object recognition), they classify pixel images of words (rather than artificial encodings of letter strings), and their training was highly simplified (not respecting many key aspects of human experience). In addition to these form priming effects, we find that the DNNs can account for visual similarity effects on priming that are beyond all current psychological models of priming. The findings add to the recent work of Hannagan et al. (2021) and suggest that CNNs should be given more attention in psychology as models of human visual word recognition.
Submitted 14 March, 2023; v1 submitted 8 February, 2023;
originally announced February 2023.
-
Mixed Evidence for Gestalt Grouping in Deep Neural Networks
Authors:
Valerio Biscione,
Jeffrey S. Bowers
Abstract:
Gestalt psychologists have identified a range of conditions in which humans organize elements of a scene into a group or whole, and perceptual grouping principles play an essential role in scene perception and object identification. Recently, Deep Neural Networks (DNNs) trained on natural images (ImageNet) have been proposed as compelling models of human vision based on reports that they perform well on various brain and behavioral benchmarks. Here we test a total of 16 networks covering a variety of architectures and learning paradigms (convolutional, attention-based, supervised and self-supervised, feed-forward and recurrent) on dot stimuli (Experiment 1) and more complex shape stimuli (Experiment 2) that produce strong Gestalt effects in humans. In Experiment 1 we found that convolutional networks were indeed sensitive in a human-like fashion to the principles of proximity, linearity, and orientation, but only at the output layer. In Experiment 2, we found that most networks exhibited Gestalt effects only for a few sets, and again only at the latest stage of processing. Overall, self-supervised and Vision Transformer networks appeared to perform worse than convolutional networks in terms of human similarity. Remarkably, no model presented a grouping effect at the early or intermediate stages of processing. This is at odds with the widespread assumption that Gestalts occur prior to object recognition, and indeed, serve to organize the visual scene for the sake of object recognition. Our overall conclusion is that, albeit noteworthy that networks trained on simple 2D images support a form of Gestalt grouping for some stimuli at the output layer, this ability does not seem to transfer to more complex features. Additionally, the fact that this grouping only occurs at the last layer suggests that networks learn fundamentally different perceptual properties than humans.
Submitted 20 February, 2023; v1 submitted 14 March, 2022;
originally announced March 2022.
-
Convolutional Neural Networks Are Not Invariant to Translation, but They Can Learn to Be
Authors:
Valerio Biscione,
Jeffrey S. Bowers
Abstract:
When seeing a new object, humans can immediately recognize it across different retinal locations: the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. In fact, several studies have found that these networks systematically fail to recognise new objects at untrained locations. In this work, we test a wide variety of CNN architectures showing how, apart from DenseNet-121, none of the models tested was architecturally invariant to translation. Nevertheless, all of them could learn to be invariant to translation. We show how this can be achieved by pretraining on ImageNet, and it is sometimes possible with much simpler data sets when all the items are fully translated across the input canvas. At the same time, this invariance can be disrupted by further training due to catastrophic forgetting/interference. These experiments show how pretraining a network on an environment with the right 'latent' characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules which would dramatically improve subsequent generalization.
Submitted 12 October, 2021;
originally announced October 2021.
-
Learning Online Visual Invariances for Novel Objects via Supervised and Self-Supervised Training
Authors:
Valerio Biscione,
Jeffrey S. Bowers
Abstract:
Humans can identify objects following various spatial transformations such as scale and viewpoint. This extends to novel objects, after a single presentation at a single pose, sometimes referred to as online invariance. CNNs have been proposed as a compelling model of human vision, but their ability to identify objects across transformations is typically tested on held-out samples of trained categories after extensive data augmentation. This paper assesses whether standard CNNs can support human-like online invariance by training models to recognize images of synthetic 3D objects that undergo several transformations: rotation, scaling, translation, brightness, contrast, and viewpoint. Through the analysis of models' internal representations, we show that standard supervised CNNs trained on transformed objects can acquire strong invariances on novel classes even when trained with as few as 50 objects taken from 10 classes. This extended to a different dataset of photographs of real objects. We also show that these invariances can be acquired in a self-supervised way, through solving the same/different task. We suggest that this latter approach may be similar to how humans acquire invariances.
Submitted 14 January, 2022; v1 submitted 4 October, 2021;
originally announced October 2021.
-
A case for robust translation tolerance in humans and CNNs. A commentary on Han et al
Authors:
Ryan Blything,
Valerio Biscione,
Jeffrey Bowers
Abstract:
Han et al. (2020) reported a behavioral experiment that assessed the extent to which the human visual system can identify novel images at unseen retinal locations (what the authors call "intrinsic translation invariance") and developed a novel convolutional neural network model (an Eccentricity Dependent Network or ENN) to capture key aspects of the behavioral results. Here we show that their analysis of behavioral data used inappropriate baseline conditions, leading them to underestimate intrinsic translation invariance. When the data are correctly interpreted they show near complete translation tolerance extending to 14° in some conditions, consistent with earlier work (Bowers et al., 2016) and more recent work (Blything et al., in press). We describe a simpler model that provides a better account of translation invariance.
Submitted 14 December, 2020; v1 submitted 10 December, 2020;
originally announced December 2020.
-
Learning Translation Invariance in CNNs
Authors:
Valerio Biscione,
Jeffrey Bowers
Abstract:
When seeing a new object, humans can immediately recognize it across different retinal locations: we say that the internal object representation is invariant to translation. It is commonly believed that Convolutional Neural Networks (CNNs) are architecturally invariant to translation thanks to the convolution and/or pooling operations they are endowed with. In fact, several works have found that these networks systematically fail to recognise new objects at untrained locations. In this work we show how, even though CNNs are not 'architecturally invariant' to translation, they can indeed 'learn' to be invariant to translation. We verified that this can be achieved by pretraining on ImageNet, and we found that it is also possible with much simpler datasets in which the items are fully translated across the input canvas. We investigated how this pretraining affected the internal network representations, finding that the invariance was almost always acquired, even though it was sometimes disrupted by further training due to catastrophic forgetting/interference. These experiments show how pretraining a network on an environment with the right 'latent' characteristics (a more naturalistic environment) can result in the network learning deep perceptual rules which would dramatically improve subsequent generalization.
Submitted 6 November, 2020;
originally announced November 2020.
-
The human visual system and CNNs can both support robust online translation tolerance following extreme displacements
Authors:
Ryan Blything,
Valerio Biscione,
Ivan I. Vankov,
Casimir J. H. Ludwig,
Jeffrey S. Bowers
Abstract:
Visual translation tolerance refers to our capacity to recognize objects over a wide range of different retinal locations. Although translation is perhaps the simplest spatial transform that the visual system needs to cope with, the extent to which the human visual system can identify objects at previously unseen locations is unclear, with some studies reporting near complete invariance over 10° and others reporting zero invariance at 4° of visual angle. Similarly, there is confusion regarding the extent of translation tolerance in computational models of vision, as well as the degree of match between human and model performance. Here we report a series of eye-tracking studies (total N=70) demonstrating that novel objects trained at one retinal location can be recognized at high accuracy rates following translations up to 18°. We also show that standard deep convolutional networks (DCNNs) support our findings when pretrained to classify another set of stimuli across a range of locations, or when a Global Average Pooling (GAP) layer is added to produce larger receptive fields. Our findings provide a strong constraint for theories of human vision and help explain inconsistent findings previously reported with CNNs.
Submitted 8 December, 2020; v1 submitted 27 September, 2020;
originally announced September 2020.