Search | arXiv e-print repository

Understanding top-down attention using task-oriented ablation design

Authors: Freddie Bickford Smith, Brett D Roads, Xiaoliang Luo, Bradley C Love

Abstract: Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task. This is known to enhance performance in visual perception. But it remains unclear how attention brings about its perceptual boost, especially when it comes to naturalistic settings like recognising an object in an everyday scene. What aspects of a visual task does… ▽ More Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task. This is known to enhance performance in visual perception. But it remains unclear how attention brings about its perceptual boost, especially when it comes to naturalistic settings like recognising an object in an everyday scene. What aspects of a visual task does attention help to deal with? We aim to answer this with a computational experiment based on a general framework called task-oriented ablation design. First we define a broad range of visual tasks and identify six factors that underlie task variability. Then on each task we compare the performance of two neural networks, one with top-down attention and one without. These comparisons reveal the task-dependence of attention's perceptual boost, giving a clearer idea of the role attention plays. Whereas many existing cognitive accounts link attention to stimulus-level variables, such as visual clutter and object scale, we find greater explanatory power in system-level variables that capture the interaction between the model, the distribution of training data and the task format. This finding suggests a shift in how attention is studied could be fruitful. We make publicly available our code and results, along with statistics relevant to ImageNet-based experiments beyond this one. Our contribution serves to support the development of more human-like vision models and the design of more informative machine-learning experiments. △ Less

Submitted 8 June, 2021; originally announced June 2021.

arXiv:2102.06406 [pdf, other]

A Too-Good-to-be-True Prior to Reduce Shortcut Reliance

Authors: Nikolay Dagaev, Brett D. Roads, Xiaoliang Luo, Daniel N. Barry, Kaustubh R. Patil, Bradley C. Love

Abstract: Despite their impressive performance in object recognition and other tasks under standard testing conditions, deep networks often fail to generalize to out-of-distribution (o.o.d.) samples. One cause for this shortcoming is that modern architectures tend to rely on "shortcuts" - superficial features that correlate with categories without capturing deeper invariants that hold across contexts. Real-… ▽ More Despite their impressive performance in object recognition and other tasks under standard testing conditions, deep networks often fail to generalize to out-of-distribution (o.o.d.) samples. One cause for this shortcoming is that modern architectures tend to rely on "shortcuts" - superficial features that correlate with categories without capturing deeper invariants that hold across contexts. Real-world concepts often possess a complex structure that can vary superficially across contexts, which can make the most intuitive and promising solutions in one context not generalize to others. One potential way to improve o.o.d. generalization is to assume simple solutions are unlikely to be valid across contexts and avoid them, which we refer to as the too-good-to-be-true prior. A low-capacity network (LCN) with a shallow architecture should only be able to learn surface relationships, including shortcuts. We find that LCNs can serve as shortcut detectors. Furthermore, an LCN's predictions can be used in a two-stage approach to encourage a high-capacity network (HCN) to rely on deeper invariant features that should generalize broadly. In particular, items that the LCN can master are downweighted when training the HCN. Using a modified version of the CIFAR-10 dataset in which we introduced shortcuts, we found that the two-stage LCN-HCN approach reduced reliance on shortcuts and facilitated o.o.d. generalization. △ Less

Submitted 21 October, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: 10 pages, 8 figures

arXiv:2011.11015 [pdf, other]

Enriching ImageNet with Human Similarity Judgments and Psychological Embeddings

Authors: Brett D. Roads, Bradley C. Love

Abstract: Advances in object recognition flourished in part because of the availability of high-quality datasets and associated benchmarks. However, these benchmarks---such as ILSVRC---are relatively task-specific, focusing predominately on predicting class labels. We introduce a publicly-available dataset that embodies the task-general capabilities of human perception and reasoning. The Human Similarity Ju… ▽ More Advances in object recognition flourished in part because of the availability of high-quality datasets and associated benchmarks. However, these benchmarks---such as ILSVRC---are relatively task-specific, focusing predominately on predicting class labels. We introduce a publicly-available dataset that embodies the task-general capabilities of human perception and reasoning. The Human Similarity Judgments extension to ImageNet (ImageNet-HSJ) is composed of human similarity judgments that supplement the ILSVRC validation set. The new dataset supports a range of task and performance metrics, including the evaluation of unsupervised learning algorithms. We demonstrate two methods of assessment: using the similarity judgments directly and using a psychological embedding trained on the similarity judgments. This embedding space contains an order of magnitude more points (i.e., images) than previous efforts based on human judgments. Scaling to the full 50,000 image set was made possible through a selective sampling process that used variational Bayesian inference and model ensembles to sample aspects of the embedding space that were most uncertain. This methodological innovation not only enables scaling, but should also improve the quality of solutions by focusing sampling where it is needed. To demonstrate the utility of ImageNet-HSJ, we used the similarity ratings and the embedding space to evaluate how well several popular models conform to human similarity judgments. One finding is that more complex models that perform better on task-specific benchmarks do not better conform to human semantic judgments. In addition to the human similarity judgments, pre-trained psychological embeddings and code for inferring variational embeddings are made publicly available. Collectively, ImageNet-HSJ assets support the appraisal of internal representations and the development of more human-like models. △ Less

Submitted 22 November, 2020; originally announced November 2020.

arXiv:2010.06512 [pdf, other]

Transforming Neural Network Visual Representations to Predict Human Judgments of Similarity

Authors: Maria Attarian, Brett D. Roads, Michael C. Mozer

Abstract: Deep-learning vision models have shown intriguing similarities and differences with respect to human vision. We investigate how to bring machine visual representations into better alignment with human representations. Human representations are often inferred from behavioral evidence such as the selection of an image most similar to a query image. We find that with appropriate linear transformation… ▽ More Deep-learning vision models have shown intriguing similarities and differences with respect to human vision. We investigate how to bring machine visual representations into better alignment with human representations. Human representations are often inferred from behavioral evidence such as the selection of an image most similar to a query image. We find that with appropriate linear transformations of deep embeddings, we can improve prediction of human binary choice on a data set of bird images from 72% at baseline to 89%. We hypothesized that deep embeddings have redundant, high (4096) dimensional representations; however, reducing the rank of these representations results in a loss of explanatory power. We hypothesized that the dilation transformation of representations explored in past research is too restrictive, and indeed we found that model explanatory power can be significantly improved with a more expressive linear transform. Most surprising and exciting, we found that, consistent with classic psychological literature, human similarity judgments are asymmetric: the similarity of X to Y is not necessarily equal to the similarity of Y to X, and allowing models to express this asymmetry improves explanatory power. △ Less

Submitted 11 January, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

arXiv:2003.00882 [pdf, other]

The perceptual boost of visual attention is task-dependent in naturalistic settings

Authors: Freddie Bickford Smith, Xiaoliang Luo, Brett D. Roads, Bradley C. Love

Abstract: Top-down attention allows people to focus on task-relevant visual information. Is the resulting perceptual boost task-dependent in naturalistic settings? We aim to answer this with a large-scale computational experiment. First, we design a collection of visual tasks, each consisting of classifying images from a chosen task set (subset of ImageNet categories). The nature of a task is determined by… ▽ More Top-down attention allows people to focus on task-relevant visual information. Is the resulting perceptual boost task-dependent in naturalistic settings? We aim to answer this with a large-scale computational experiment. First, we design a collection of visual tasks, each consisting of classifying images from a chosen task set (subset of ImageNet categories). The nature of a task is determined by which categories are included in the task set. Second, on each task we train an attention-augmented neural network and then compare its accuracy to that of a baseline network. We show that the perceptual boost of attention is stronger with increasing task-set difficulty, weaker with increasing task-set size and weaker with increasing perceptual similarity within a task set. △ Less

Submitted 6 April, 2020; v1 submitted 22 February, 2020; originally announced March 2020.

Comments: Published as a workshop paper at "Bridging AI and Cognitive Science" (ICLR 2020)

arXiv:2002.02342 [pdf, other]

The Costs and Benefits of Goal-Directed Attention in Deep Convolutional Neural Networks

Authors: Xiaoliang Luo, Brett D. Roads, Bradley C. Love

Abstract: People deploy top-down, goal-directed attention to accomplish tasks, such as finding lost keys. By tuning the visual system to relevant information sources, object recognition can become more efficient (a benefit) and more biased toward the target (a potential cost). Motivated by selective attention in categorisation models, we developed a goal-directed attention mechanism that can process natural… ▽ More People deploy top-down, goal-directed attention to accomplish tasks, such as finding lost keys. By tuning the visual system to relevant information sources, object recognition can become more efficient (a benefit) and more biased toward the target (a potential cost). Motivated by selective attention in categorisation models, we developed a goal-directed attention mechanism that can process naturalistic (photographic) stimuli. Our attention mechanism can be incorporated into any existing deep convolutional neural network (DCNNs). The processing stages in DCNNs have been related to ventral visual stream. In that light, our attentional mechanism incorporates top-down influences from prefrontal cortex (PFC) to support goal-directed behaviour. Akin to how attention weights in categorisation models warp representational spaces, we introduce a layer of attention weights to the mid-level of a DCNN that amplify or attenuate activity to further a goal. We evaluated the attentional mechanism using photographic stimuli, varying the attentional target. We found that increasing goal-directed attention has benefits (increasing hit rates) and costs (increasing false alarm rates). At a moderate level, attention improves sensitivity (i.e., increases $d^\prime$) at only a moderate increase in bias for tasks involving standard images, blended images, and natural adversarial images chosen to fool DCNNs. These results suggest that goal-directed attention can reconfigure general-purpose DCNNs to better suit the current task goal, much like PFC modulates activity along the ventral stream. In addition to being more parsimonious and brain consistent, the mid-level attention approach performed better than a standard machine learning approach for transfer learning, namely retraining the final network layer to accommodate the new task. △ Less

Submitted 1 October, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

arXiv:1906.09012 [pdf, other]

doi 10.1038/s42256-019-0132-2

Learning as the Unsupervised Alignment of Conceptual Systems

Authors: Brett D. Roads, Bradley C. Love

Abstract: Concept induction requires the extraction and naming of concepts from noisy perceptual experience. For supervised approaches, as the number of concepts grows, so does the number of required training examples. Philosophers, psychologists, and computer scientists, have long recognized that children can learn to label objects without being explicitly taught. In a series of computational experiments,… ▽ More Concept induction requires the extraction and naming of concepts from noisy perceptual experience. For supervised approaches, as the number of concepts grows, so does the number of required training examples. Philosophers, psychologists, and computer scientists, have long recognized that children can learn to label objects without being explicitly taught. In a series of computational experiments, we highlight how information in the environment can be used to build and align conceptual systems. Unlike supervised learning, the learning problem becomes easier the more concepts and systems there are to master. The key insight is that each concept has a unique signature within one conceptual system (e.g., images) that is recapitulated in other systems (e.g., text or audio). As predicted, children's early concepts form readily aligned systems. △ Less

Submitted 17 January, 2020; v1 submitted 21 June, 2019; originally announced June 2019.

Comments: This is a post-peer-review, pre-copyedit version of an article published in Nature Machine Intelligence. The final authenticated version is available online at: https://doi.org/10.1038/s42256-019-0132-2

arXiv:1511.06409 [pdf, other]

Learning to Generate Images with Perceptual Similarity Metrics

Authors: Jake Snell, Karl Ridgeway, Renjie Liao, Brett D. Roads, Michael C. Mozer, Richard S. Zemel

Abstract: Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a… ▽ More Deep networks are increasingly being applied to problems involving image synthesis, e.g., generating images from textual descriptions and reconstructing an input image from a compact representation. Supervised training of image-synthesis networks typically uses a pixel-wise loss (PL) to indicate the mismatch between a generated image and its corresponding target image. We propose instead to use a loss function that is better calibrated to human perceptual judgments of image quality: the multiscale structural-similarity score (MS-SSIM). Because MS-SSIM is differentiable, it is easily incorporated into gradient-descent learning. We compare the consequences of using MS-SSIM versus PL loss on training deterministic and stochastic autoencoders. For three different architectures, we collected human judgments of the quality of image reconstructions. Observers reliably prefer images synthesized by MS-SSIM-optimized models over those synthesized by PL-optimized models, for two distinct PL measures ($\ell_1$ and $\ell_2$ distances). We also explore the effect of training objective on image encoding and analyze conditions under which perceptually-optimized representations yield better performance on image classification. Finally, we demonstrate the superiority of perceptually-optimized networks for super-resolution imaging. Just as computer vision has advanced through the use of convolutional architectures that mimic the structure of the mammalian visual system, we argue that significant additional advances can be made in modeling images through the use of training objectives that are well aligned to characteristics of human perception. △ Less

Submitted 23 January, 2017; v1 submitted 19 November, 2015; originally announced November 2015.

Showing 1–8 of 8 results for author: Roads, B D