-
iWISDM: Assessing instruction following in multimodal models at scale
Authors:
Xiaoxuan Lei,
Lucas Gomez,
Hao Yuan Bai,
Pouya Bashivan
Abstract:
The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions and that of humans. The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.
Submitted 25 June, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Towards Out-of-Distribution Adversarial Robustness
Authors:
Adam Ibrahim,
Charles Guille-Escuret,
Ioannis Mitliagkas,
Irina Rish,
David Krueger,
Pouya Bashivan
Abstract:
Adversarial robustness continues to be a major challenge for deep learning. A core issue is that robustness to one type of attack often fails to transfer to other attacks. While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly used attacks by adopting a domain generalisation approach. Concretely, we treat each type of attack as a domain, and apply the Risk Extrapolation method (REx), which promotes similar levels of robustness against all training attacks. Compared to existing methods, we obtain similar or superior worst-case adversarial robustness on attacks seen during training. Moreover, we achieve superior performance on families or tunings of attacks only encountered at test time. On ensembles of attacks, our approach improves the accuracy from 3.4% with the best existing baseline to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
Submitted 26 June, 2023; v1 submitted 6 October, 2022;
originally announced October 2022.
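The abstract above treats each attack type as a domain and applies Risk Extrapolation (REx) to equalize robustness across them. A minimal sketch of that idea follows, assuming the variance-penalty form of REx (sum of per-domain risks plus a weighted variance term). The function name `vrex_objective`, the `beta` default, the use of population variance, and the example risk values are illustrative assumptions, not details taken from this paper.

```python
from statistics import pvariance


def vrex_objective(per_attack_risks, beta=10.0):
    """Sum of per-domain risks plus beta times their variance.

    per_attack_risks: one scalar loss per attack type (one "domain" each).
    beta: hypothetical penalty weight; larger values push the risks to be equal.
    """
    if len(per_attack_risks) < 2:
        raise ValueError("need risks from at least two attack types")
    # Population variance is an assumption; the source does not specify it.
    return sum(per_attack_risks) + beta * pvariance(per_attack_risks)


# Two hypothetical models with the same total risk across three attacks.
balanced = vrex_objective([0.5, 0.5, 0.5])
unbalanced = vrex_objective([0.1, 0.5, 0.9])
print(balanced, unbalanced)
```

With equal total risk, the model whose risk is spread unevenly across attacks receives the larger objective, which is the pressure toward similar robustness levels that the abstract describes.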
-
Learning Robust Kernel Ensembles with Kernel Average Pooling
Authors:
Pouya Bashivan,
Adam Ibrahim,
Amirozhan Dehghani,
Yifei Ren
Abstract:
Model ensembles have long been used in machine learning to reduce the variance in individual model predictions, making them more robust to input perturbations. Pseudo-ensemble methods like dropout have also been commonly used in deep learning models to improve generalization. However, the application of these techniques to improve neural networks' robustness against input perturbations remains underexplored. We introduce Kernel Average Pooling (KAP), a neural network building block that applies the mean filter along the kernel dimension of the layer activation tensor. We show that ensembles of kernels with similar functionality naturally emerge in convolutional neural networks equipped with KAP and trained with backpropagation. Moreover, we show that when trained on inputs perturbed with additive Gaussian noise, KAP models are remarkably robust against various forms of adversarial attacks. Empirical evaluations on CIFAR10, CIFAR100, TinyImagenet, and Imagenet datasets show substantial improvements in robustness against strong adversarial attacks such as AutoAttack without training on any adversarial examples.
Submitted 30 May, 2023; v1 submitted 30 September, 2022;
originally announced October 2022.
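The abstract above describes Kernel Average Pooling as a mean filter applied along the kernel dimension of a layer's activation tensor. The sketch below illustrates that operation at a single spatial location, where the activations form one value per kernel. The function name `kernel_average_pool`, the odd window size, the zero padding, and the stride of one are assumptions made for illustration; the paper's actual configuration may differ.

```python
def kernel_average_pool(channel_activations, window=3):
    """Mean filter along the kernel (channel) axis at one spatial location.

    channel_activations: one activation value per kernel.
    window: hypothetical odd pooling size; zero padding keeps the length fixed.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    count = len(channel_activations)
    # Zero-pad both ends so every kernel has a full window (an assumption).
    padded = [0.0] * half + list(channel_activations) + [0.0] * half
    return [sum(padded[i:i + window]) / window for i in range(count)]


pooled = kernel_average_pool([3.0, 3.0, 3.0, 3.0], window=3)
print(pooled)
```

Because each output averages neighboring kernels, a perturbation to any single kernel's activation is diluted by the window size, which gives an intuition for the ensemble-like behavior the abstract reports.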
-
Adversarial Feature Desensitization
Authors:
Pouya Bashivan,
Reza Bayat,
Adam Ibrahim,
Kartik Ahuja,
Mojtaba Faramarzi,
Touraj Laleh,
Blake Aaron Richards,
Irina Rish
Abstract:
Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths. Our code is available at https://github.com/BashivanLab/afd.
Submitted 4 January, 2022; v1 submitted 8 June, 2020;
originally announced June 2020.
-
Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs
Authors:
Jonas Kubilius,
Martin Schrimpf,
Kohitij Kar,
Ha Hong,
Najib J. Majaj,
Rishi Rajalingham,
Elias B. Issa,
Pouya Bashivan,
Jonathan Prescott-Roy,
Kailyn Schmidt,
Aran Nayebi,
Daniel Bear,
Daniel L. K. Yamins,
James J. DiCarlo
Abstract:
Deep convolutional artificial neural networks (ANNs) are the leading class of candidate models of the mechanisms of visual processing in the primate ventral stream. While initially inspired by brain anatomy, over the past years, these ANNs have evolved from a simple eight-layer architecture in AlexNet to extremely deep and branching architectures, demonstrating increasingly better object categorization performance, yet bringing into question how brain-like they still are. In particular, typical deep models from the machine learning community are often hard to map onto the brain's anatomy due to their vast number of layers and missing biologically-important connections, such as recurrence. Here we demonstrate that better anatomical alignment to the brain and high performance on machine learning as well as neuroscience measures do not have to be in contradiction. We developed CORnet-S, a shallow ANN with four anatomically mapped areas and recurrent connectivity, guided by Brain-Score, a new large-scale composite of neural and behavioral benchmarks for quantifying the functional fidelity of models of the primate ventral visual stream. Despite being significantly shallower than most models, CORnet-S is the top model on Brain-Score and outperforms similarly compact models on ImageNet. Moreover, our extensive analyses of CORnet-S circuitry variants reveal that recurrence is the main predictive factor of both Brain-Score and ImageNet top-1 performance. Finally, we report that the temporal evolution of the CORnet-S "IT" neural population resembles the actual monkey IT population dynamics. Taken together, these results establish CORnet-S, a compact, recurrent ANN, as the current best model of the primate ventral visual stream.
Submitted 28 October, 2019; v1 submitted 13 September, 2019;
originally announced September 2019.
-
Continual Learning with Self-Organizing Maps
Authors:
Pouya Bashivan,
Martin Schrimpf,
Robert Ajemian,
Irina Rish,
Matthew Riemer,
Yuhai Tu
Abstract:
Despite remarkable successes achieved by modern neural networks in a wide range of applications, these networks perform best in domain-specific stationary environments where they are trained only once on large-scale controlled data repositories. When exposed to non-stationary learning environments, current neural networks tend to forget what they had previously learned, a phenomenon known as catastrophic forgetting. Most previous approaches to this problem rely on memory replay buffers which store samples from previously learned tasks, and use them to regularize the learning on new ones. This approach suffers from the important disadvantage of not scaling well to real-life problems in which the memory requirements become enormous. We propose a memoryless method that combines standard supervised neural networks with self-organizing maps to solve the continual learning problem. The role of the self-organizing map is to adaptively cluster the inputs into appropriate task contexts - without explicit labels - and allocate network resources accordingly. Thus, it selectively routes the inputs in accord with previous experience, ensuring that past learning is maintained and does not interfere with current learning. Our method is intuitive, memoryless, and performs on par with current state-of-the-art approaches on standard benchmarks.
Submitted 19 April, 2019;
originally announced April 2019.
-
Teacher Guided Architecture Search
Authors:
Pouya Bashivan,
Mark Tensen,
James J DiCarlo
Abstract:
Much of the recent improvement in neural networks for computer vision has resulted from the discovery of new network architectures. Most prior work has used the performance of candidate models following limited training to automatically guide the search in a feasible way. Could further gains in computational efficiency be achieved by guiding the search via measurements of a high-performing network with unknown detailed architecture (e.g. the primate visual system)? As one step toward this goal, we use representational similarity analysis to evaluate the similarity of internal activations of candidate networks with those of a (fixed, high-performing) teacher network. We show that adopting this evaluation metric could produce up to an order-of-magnitude improvement in search efficiency over performance-guided methods. Our approach finds a convolutional cell structure with performance similar to that previously found using other methods, but at a total computational cost that is two orders of magnitude lower than Neural Architecture Search (NAS) and more than four times lower than progressive neural architecture search (PNAS). We further show that measurements from only ~300 neurons from the primate visual system provide enough signal to find a network with an Imagenet top-1 error that is significantly lower than that achieved by performance-guided architecture search alone. These results suggest that representational matching can be used to accelerate network architecture search in cases where one has access to some or all of the internal representations of a teacher network of interest, such as the brain's sensory processing networks.
Submitted 6 September, 2019; v1 submitted 3 August, 2018;
originally announced August 2018.
-
A Neurobiological Evaluation Metric for Neural Network Model Search
Authors:
Nathaniel Blanchard,
Jeffery Kinnison,
Brandon RichardWebster,
Pouya Bashivan,
Walter J. Scheirer
Abstract:
Neuroscience theory posits that the brain's visual system coarsely identifies broad object categories via neural activation patterns, with similar objects producing similar neural responses. Artificial neural networks also have internal activation behavior in response to stimuli. We hypothesize that networks exhibiting brain-like activation behavior will demonstrate brain-like characteristics, e.g., stronger generalization capabilities. In this paper we introduce a human-model similarity (HMS) metric, which quantifies the similarity of human fMRI and network activation behavior. To calculate HMS, representational dissimilarity matrices (RDMs) are created as abstractions of activation behavior, measured by the correlations of activations to stimulus pairs. HMS is then the correlation between the fMRI RDM and the neural network RDM across all stimulus pairs. We test the metric on unsupervised predictive coding networks, which specifically model visual perception, and assess the metric for statistical significance over a large range of hyperparameters. Our experiments show that networks with increased human-model similarity are correlated with better performance on two computer vision tasks: next frame prediction and object matching accuracy. Further, HMS identifies networks with high performance on both tasks. An unexpected secondary finding is that the metric can be employed during training as an early-stopping mechanism.
Submitted 26 November, 2018; v1 submitted 27 May, 2018;
originally announced May 2018.
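The abstract above defines HMS in two steps: build a representational dissimilarity matrix (RDM) for each system from correlations between responses to stimulus pairs, then correlate the two RDMs across all pairs. A minimal sketch of those steps follows. The choice of Pearson correlation, the `1 - r` dissimilarity, the function names, and the toy response vectors are assumptions for illustration; the abstract does not specify them.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = sqrt(sum(v * v for v in dx)) * sqrt(sum(v * v for v in dy))
    if norm == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return sum(a * b for a, b in zip(dx, dy)) / norm


def rdm(responses):
    """RDM with entries 1 - r for each stimulus pair (dissimilarity is assumed)."""
    n = len(responses)
    return [
        [0.0 if i == j else 1.0 - pearson(responses[i], responses[j])
         for j in range(n)]
        for i in range(n)
    ]


def upper_triangle(matrix):
    """Flatten the entries above the diagonal: one value per stimulus pair."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def human_model_similarity(fmri_responses, model_responses):
    """Correlate the two RDMs across all stimulus pairs."""
    return pearson(upper_triangle(rdm(fmri_responses)),
                   upper_triangle(rdm(model_responses)))


# Toy data: four stimuli, three measurements each (hypothetical values).
fmri = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]]
# A model whose responses are an affine rescaling of the fMRI responses.
model = [[2.0 * v + 1.0 for v in row] for row in fmri]
score = human_model_similarity(fmri, model)
print(score)
```

Comparing RDMs rather than raw activations means the two systems need not share the same number of measurement units: only the stimulus set has to match.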
-
Learning Neural Markers of Schizophrenia Disorder Using Recurrent Neural Networks
Authors:
Jumana Dakka,
Pouya Bashivan,
Mina Gheiratmand,
Irina Rish,
Shantenu Jha,
Russell Greiner
Abstract:
Smart systems that can accurately diagnose patients with mental disorders and identify effective treatments based on brain functional imaging data are of great applicability and are gaining much attention. Most previous machine learning studies use hand-designed features, such as functional connectivity, which do not preserve the potentially useful information in the spatial relationship between brain regions and the temporal profile of the signal in each region. Here we propose a new method based on recurrent-convolutional neural networks to automatically learn useful representations from segments of 4-D fMRI recordings. Our goal is to exploit both spatial and temporal information in the functional MRI movie (at the whole-brain voxel level) for identifying patients with schizophrenia.
Submitted 1 December, 2017;
originally announced December 2017.
-
Mental State Recognition via Wearable EEG
Authors:
Pouya Bashivan,
Irina Rish,
Steve Heisig
Abstract:
The increasing quality and affordability of consumer electroencephalogram (EEG) headsets make them attractive for situations where medical grade devices are impractical. Predicting and tracking cognitive states is possible for tasks that were previously not conducive to EEG monitoring. For instance, monitoring operators for states inappropriate to the task (e.g. drowsy drivers), tracking mental health (e.g. anxiety) and productivity (e.g. tiredness) are among possible applications for the technology. Consumer grade EEG headsets are affordable and relatively easy to use, but they lack the resolution and quality of signal that can be achieved using medical grade EEG devices. Thus, the key questions remain: to what extent are wearable EEG devices capable of mental state recognition, and what kind of mental states can be accurately recognized with these devices? In this work, we examined responses to two different types of input: instructional (logical) versus recreational (emotional) videos, using a range of machine-learning methods. We tried SVMs, sparse logistic regression, and Deep Belief Networks, to discriminate between the states of mind induced by different types of video input, that can be roughly labeled as logical vs. emotional. Our results demonstrate a significant potential of wearable EEG devices in differentiating cognitive states between situations with large contextual but subtle apparent differences.
Submitted 5 June, 2016; v1 submitted 2 February, 2016;
originally announced February 2016.
-
Learning Representations from EEG with Deep Recurrent-Convolutional Neural Networks
Authors:
Pouya Bashivan,
Irina Rish,
Mohammed Yeasin,
Noel Codella
Abstract:
One of the challenges in modeling cognitive events from electroencephalogram (EEG) data is finding representations that are invariant to inter- and intra-subject differences, as well as to inherent noise associated with such data. Herein, we propose a novel approach for learning such representations from multi-channel EEG time-series, and demonstrate its advantages in the context of a mental load classification task. First, we transform EEG activities into a sequence of topology-preserving multi-spectral images, as opposed to standard EEG analysis techniques that ignore such spatial information. Next, we train a deep recurrent-convolutional network inspired by state-of-the-art video classification to learn robust representations from the sequence of images. The proposed approach is designed to preserve the spatial, spectral, and temporal structure of EEG, which leads to finding features that are less sensitive to variations and distortions within each dimension. Empirical evaluation on the cognitive load classification task demonstrated significant improvements in classification accuracy over current state-of-the-art approaches in this field.
Submitted 29 February, 2016; v1 submitted 19 November, 2015;
originally announced November 2015.