
Dimensions underlying the representational alignment of deep neural networks with humans

Florian P. Mahner 1,2, Lukas Muttenthaler 1,3,4, Umut Güçlü 2, Martin N. Hebart 1,5,6

1 Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
2 Donders Institute for Brain, Cognition and Behaviour, Nijmegen, the Netherlands
3 Machine Learning Group, Technische Universität Berlin, Germany
4 Berlin Institute for the Foundations of Learning and Data, Berlin, Germany
5 Department of Medicine, Justus Liebig University, Giessen, Germany
6 Center for Mind, Brain and Behavior, Universities of Marburg, Giessen, and Darmstadt, Germany
Equal contribution. Work partly done while a Student Researcher at Google DeepMind.
Abstract

Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNNs). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.

Today’s deep neural networks (DNNs) have shown remarkable performance, offering powerful models that often approach or even exceed human performance across diverse perceptual and cognitive benchmarks. Paralleling their success in machine learning, recent work in computational cognitive neuroscience has demonstrated striking similarities between artificial and biological processing systems (Güçlü and van Gerven, 2015; Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014; Kubilius et al., 2016; Rajalingham et al., 2015). This has sparked considerable interest in linking DNN representations and behaviors to those found in humans. From the machine learning perspective, understanding the limitations of DNNs can support the development of artificial intelligence systems that are better aligned with humans, promising improved and more robust performance (Sucholutsky et al., 2023). From the computational cognitive neuroscience perspective, DNNs with stronger human alignment promise to be better candidate computational models of human cognition and behavior (Cichy and Kaiser, 2019; Lindsay, 2021; Kanwisher et al., 2023; Doerig et al., 2023).

Much previous research on the alignment of human and artificial visual systems has compared behavioral strategies (e.g., classification) in both systems and has revealed important limitations in the generalization performance of DNNs (Rajalingham et al., 2018; Geirhos et al., 2018; Rosenfeld et al., 2018; Beery et al., 2018; Szegedy et al., 2013). Other work has focused on directly comparing cognitive and neural representations in humans to those in DNNs, using methods such as representational similarity analysis (RSA; Kriegeskorte et al., 2008) or linear regression (Attarian et al., 2020; Roads and Love, 2020; Peterson et al., 2018; Muttenthaler et al., 2023a). This quantification of alignment has led to a direct comparison of numerous DNNs across diverse visual tasks (Conwell et al., 2022; Schrimpf et al., 2018; Muttenthaler et al., 2023b; Wang et al., 2023), highlighting the role of factors such as architecture, training data, or learning objective in determining the similarity to humans (Storrs et al., 2021; Conwell et al., 2022; Muttenthaler et al., 2023a; Wang et al., 2023).

Despite the appeal of global metrics for comparing the representational alignment of humans and DNNs, they only provide a quantification of the degree of representational or behavioral alignment. However, without explicit hypotheses about potential causes for misalignment, global metrics are limited in their explanatory scope of what features determine this degree of alignment, that is, what representational factors underlie the similarities and differences between humans and DNNs. While diverse methods for interpreting DNN activations have been developed at various levels of analysis, ranging from single units to entire layers (Bau et al., 2020; Zhou et al., 2018; Morcos et al., 2018), the direct comparability to human representations has remained a key challenge.

Inspired by recent work in the cognitive sciences that has revealed core visual and semantic representational dimensions underlying human similarity judgments of object images (Hebart et al., 2020), here we propose a framework for systematically analyzing and comparing the dimensions underlying the representations in DNNs and humans. In this work, we apply this framework to human visual similarity judgments and representations in a DNN trained to classify natural images. Our approach reveals numerous interpretable DNN dimensions that appear to reflect both visual and semantic image properties and that appear to be well-aligned to humans. In contrast to humans who showed a dominance of semantic over visual dimensions, DNNs exhibited a striking visual bias, which only emulates human semantic behavior. While psychophysical experiments on DNN dimensions underscored their global interpretability, a face-to-face comparison of dimensions with humans revealed that DNN representations in fact only approximate human representations but lack the consistency expected from feature-specific visual and semantic dimensions. Together, our results reveal key factors underlying the representational alignment and misalignment between humans and DNNs, shed light on potentially divergent representational strategies, and highlight the potential of this approach for identifying determinants of the similarities and differences between both domains.

1 Results

Figure 1: Overview: A computational framework that captures core DNN object representations in analogy to humans by simulating behavioral decisions in an odd-one-out task. a, The triplet odd-one-out task, where a human participant or a DNN is presented a set of three images and is asked to select the image that is most different from the others. b, Sampling approach of odd-one-out decisions from DNN representations. First, a dot-product similarity space is constructed from DNN features. Next, for a given triplet of objects, the most similar pair in this similarity space is identified, making the remaining object the odd-one-out. c, Illustration of the computational modeling approach to learn a lower-dimensional object representation for human participants and the DNN, optimized to predict behavioral choices made in the triplet task. d, Schematic depiction of the interpretability pipeline that allows predicting object embeddings from pretrained DNN features.

To improve the comparability of human and DNN representations, we aimed at identifying the similarities and differences in key dimensions underlying their image representations. To achieve this aim, we treated the neural network analogously to a human participant carrying out a cognitive behavioral experiment and then derived representational embeddings both from human similarity judgments and a DNN on the same behavioral task. This approach ensured direct comparability between human and DNN representations. As a behavioral task, we chose a triplet odd-one-out similarity task, where from a set of three object images $i$, $j$, $k$ a participant has to select the most dissimilar object (Fig. 1a). In this task, the perceived similarity between two images $i$ and $j$ is defined as the probability of choosing these images to belong together across varying contexts imposed by a third object image $k$. By virtue of providing minimal contexts, the odd-one-out task highlights the information sufficient for capturing the similarity between object images $i$ and $j$ across diverse contexts. In addition, it approximates human categorization behavior for arbitrary visual and semantic categories, even for quite diverse sets of objects (Zheng et al., 2019; Hebart et al., 2020; Muttenthaler et al., 2022). Thus, by focusing on the building blocks of categorization, which underlies diverse behaviors, this task is ideally suited for comparing object representations between humans and DNNs.

Figure 2: Representational embeddings inferred from human and DNN behavior. a, Visualization of example dimensions from human- and DNN-derived representational embeddings, with a selection of dimensions that had been rated as semantic, mixed visual-semantic, and visual, alongside their dimension labels obtained from human judgments. Note that the displayed images reflect only images with a public domain license and not the full image set (Stoinski et al., 2023). b, Rating procedure for each dimension, which was based on visualizing the top $k$ images according to their numeric weights. Human participants labeled each of the human and DNN dimensions as predominantly semantic, visual, mixed visual-semantic, or unclear (unclear ratings not shown: 7.35% of all dimensions for humans, 8.57% for VGG-16). c, Relative importance of dimensions labeled as visual and semantic, where VGG-16 exhibited a dominance of visual and mixed dimensions relative to humans that showed a clear dominance of semantic dimensions.

For humans, we used a set of 4.7 million publicly available odd-one-out judgments (Hebart et al., 2023) on 1,854 diverse object images, derived from the THINGS object concept and image database (Hebart et al., 2019). For the DNN, we collected similarity judgments for 24,102 images of the same objects used for humans (1,854 objects, 13 examples per object). We used a larger set of object images since the DNN was less limited by constraints in dataset size than humans. This allowed us to obtain more precise estimates of their representation. As a DNN, we chose a pretrained VGG-16 model (Simonyan and Zisserman, 2014), given its common use in the computational cognitive neurosciences. Specifically, this network has been shown to exhibit good correspondence to both human behavior (Geirhos et al., 2018) and measured neural activity (Güçlü and van Gerven, 2015; Schrimpf et al., 2018; Nonaka et al., 2021) and performs well at predicting human similarity judgments (Jozwik et al., 2017; Peterson et al., 2018; King et al., 2019; Storrs et al., 2021; Kaniuth and Hebart, 2022; Muttenthaler et al., 2023a). However, for completeness, we additionally ran similar analyses for a broader range of neural network architectures (see Supplementary Information A). We focused on the penultimate layer activations as they reflect the most high-level abstraction of the input and are thus representationally closer to the behavioral outputs. For the DNN, we generated a dataset of behavioral odd-one-out choices for the 24,102 object images (Fig. 1b). To this end, we first extracted the DNN layer activations for all images. 
Next, for a given triplet of activations $\mathbf{z}_i$, $\mathbf{z}_j$, and $\mathbf{z}_k$, we computed the dot product between each pair as a measure of similarity, then identified the most similar pair of images in this triplet, and designated the remaining third image as the odd-one-out. Given the excessively large number of possible triplets for all 24,102 images, we approximated the full set of object choices from a random subset of 20 million triplets (Jain et al., 2016).
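The sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_odd_one_out`, the uniform sampling of triplets, and the toy feature matrix are all hypothetical, and in practice the rows of `features` would be the penultimate-layer activations of the 24,102 images.

```python
import numpy as np

def sample_odd_one_out(features, n_triplets, seed=0):
    """Simulate odd-one-out choices from a (n_images, n_features) matrix.

    Each returned row is ordered so that the first two entries are the
    most similar pair and the last entry is the odd one out.
    """
    rng = np.random.default_rng(seed)
    n_images = features.shape[0]
    choices = np.empty((n_triplets, 3), dtype=int)
    for t in range(n_triplets):
        # Draw three distinct images (assumption: uniform random triplets).
        i, j, k = rng.choice(n_images, size=3, replace=False)
        z_i, z_j, z_k = features[i], features[j], features[k]
        # Dot-product similarity of each of the three pairs.
        pair_sims = [z_i @ z_j, z_i @ z_k, z_j @ z_k]
        # Each ordering lists the pair first and the left-out image last.
        orderings = [(i, j, k), (i, k, j), (j, k, i)]
        choices[t] = orderings[int(np.argmax(pair_sims))]
    return choices

# Toy usage: images 0 and 1 point in nearly the same direction,
# so image 2 is left out whenever the three are compared.
toy_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_choices = sample_odd_one_out(toy_features, n_triplets=5)
```

Computing the three dot products per triplet, rather than a full image-by-image similarity matrix, keeps memory use small when the number of images is large.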

From both sets of available triplet choices, we next generated two representational embeddings, one for humans and one for the DNN. In these embeddings, each object is characterized by a set of representational dimensions. The embeddings were optimized for predicting the odd-one-out choices in humans and DNNs, respectively. For comparability to previous work in humans (Zheng et al., 2019; Hebart et al., 2020; Muttenthaler et al., 2022), we imposed sparsity and non-negativity constraints on the optimization, which support their interpretability and provide cognitively plausible criteria for dimensions (Hebart et al., 2020, 2023; Hoyer, 2002; Murphy et al., 2012; Fyshe et al., 2015). Sparsity limited the number of dimensions used, while non-negativity ensured dimensions only added to the explanation of behavior without canceling each other out. During training, each randomly initialized embedding was optimized using a recent variational embedding technique (Muttenthaler et al., 2022) (see Methods for details). The optimization resulted in two stable, low-dimensional embeddings, with 70 reproducible dimensions for the DNN embedding and 68 for the human embedding. The DNN embedding captured 84.03% of the total variance in image-to-image similarity, while the human embedding captured 90.85% of the explainable variance given the empirical noise ceiling of the dataset.
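For orientation, the choice model under which such embeddings are optimized in the cited prior work (Hebart et al., 2020) can be written as a softmax over the three pairwise similarities of a triplet. The notation below is an assumption introduced for illustration, with $\mathbf{x}_i$ denoting the non-negative $D$-dimensional embedding of object $i$; the exact objective and sparsity-inducing prior used here are specified in the Methods and are not reproduced in this sketch.

```latex
\[
p\bigl(\{i,j\} \mid \{i,j,k\}\bigr)
  = \frac{\exp\bigl(\mathbf{x}_i^{\top}\mathbf{x}_j\bigr)}
         {\exp\bigl(\mathbf{x}_i^{\top}\mathbf{x}_j\bigr)
          + \exp\bigl(\mathbf{x}_i^{\top}\mathbf{x}_k\bigr)
          + \exp\bigl(\mathbf{x}_j^{\top}\mathbf{x}_k\bigr)},
\qquad \mathbf{x}_i \in \mathbb{R}_{\geq 0}^{D}
\]
```

Here the left-hand side is the probability that $i$ and $j$ are chosen as the most similar pair, which makes $k$ the odd-one-out.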

1.1 DNN dimensions reflect diverse conceptual and perceptual properties

Having identified stable, low-dimensional embeddings that are predictive of triplet odd-one-out judgments, we first assessed the interpretability of each identified DNN dimension by visualizing object images with large numeric weights. In addition to this qualitative assessment, we validated these observations for the DNN by asking 12 (6 female, 6 male) human participants to provide labels for each dimension separately. Similar to the core semantic and visual dimensions underlying odd-one-out judgments in humans described previously (Hebart et al., 2020; Muttenthaler et al., 2022; Hebart et al., 2023), the DNN embedding yielded many interpretable dimensions, which appeared to reflect both semantic and visual properties of objects. The semantic dimensions included taxonomic membership (e.g. food-related, technology-related, home-related) and other knowledge-related features (e.g. softness), while the visual dimensions reflected visual-perceptual attributes (e.g. round, green, stringy), with some dimensions reflecting a composite of semantic and visual features (e.g. green and organic) (Fig. 2a). Of note, the DNN dimensions also revealed a sensitivity to basic shapes, including roundness, boxiness and tube-shape. This suggests that, in line with earlier studies (Hermann et al., 2020; Singer et al., 2022), DNNs indeed learn to represent basic shape features, an aspect that might not be apparent in their overt behavior (Geirhos et al., 2019).

Despite the apparent similarities, there were also striking differences between humans and the DNN. First, overall, DNN dimensions were less interpretable than human dimensions, as confirmed by the evaluation of all dimensions by two independent raters (see Supplementary Information B). This indicates a global difference in how the DNN assigns images as being conceptually similar to each other. Second, while human dimensions were clearly dominated by semantic properties, many DNN dimensions were more visual-perceptual in nature or reflected a mixture of visual and semantic information. We quantified this observation by asking the same two independent experts to rate human and DNN dimensions according to whether they were primarily visual-perceptual, semantic, reflected a mixture of both, or were unclear (Fig. 2b). To confirm that the results were not an arbitrary byproduct of the chosen DNN architecture, we provided the raters with four additional DNNs for which we had computed additional representational embeddings. The results revealed a clear dominance of semantic dimensions in humans, with only a small number of mixed dimensions. In contrast, for DNNs we found a consistently larger proportion of dimensions that were dominated by visual information or that reflected a mixture of both visual and semantic information (Fig. 2c; Supplemental Fig. S1b for all DNNs). This demonstrates a clear difference in the relative weight that humans and DNNs assign to visual and semantic information, respectively.

1.2 Linking DNN dimensions to their interpretability

Figure 3: Relevance of image features for embedding dimension. a, General methodology of the approach. We used Grad-CAM (Selvaraju et al., 2017) to visualize the importance of distinct image parts based on the gradients of the penultimate DNN features that we initially used to sample triplet choices. The gradients were obtained in our fully differentiable interpretability model with respect to a dimension $\mathbf{w}$ in our embedding. b, We visualize the heatmaps for three different images and dimensions. Each column shows the relevance of parts of an image for that dimension. For this figure, we filtered the embedding by images available in the public domain. Images used under a CC0 license, from Flickr: Cezary Borysiuk, Wojtek Szkutnik.

Despite the overall differences in human and DNN representational dimensions, the DNN also contained many dimensions that appeared to be interpretable and comparable to those found in humans. Next, we aimed at testing to what degree these interpretable dimensions truly reflected specific visual or semantic properties, or whether they only superficially appeared to show this correspondence. To this end, we experimentally and causally manipulated images and observed the impact on dimension scores. Beyond general interpretability, these analyses further establish which visual features in each image drive individual dimensions and thus determine image representations.

Image manipulation requires a direct mapping from input images to the embedding dimensions. Since the embedding dimensions were derived using a sampling-based approach, they provide no direct link to manipulated or novel images. To overcome this challenge, we used $\ell_2$-regularized linear regression for linking penultimate layer activations of the DNN to each individual dimension of the learned embedding (Fig. 1d). Penultimate layer activations were indeed highly predictive of each embedding dimension, with all dimensions exceeding an $R^2$ of 75%, and the majority exceeding 85%. This allowed us to accurately predict dimension values for novel images.
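A regularized linear readout of this kind can be sketched in a few lines. The code below is an illustrative assumption, not the pipeline used in the paper: it uses the closed-form ridge solution with an unpenalized intercept, a hypothetical regularization strength `alpha`, and synthetic data standing in for layer activations and embedding dimensions.

```python
import numpy as np

def fit_ridge(features, targets, alpha=1.0):
    """Closed-form l2-regularized least squares with an unpenalized intercept."""
    x_mean = features.mean(axis=0)
    y_mean = targets.mean(axis=0)
    xc = features - x_mean
    yc = targets - y_mean
    # Solve (X^T X + alpha * I) W = X^T Y on centered data.
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept

def r_squared(y_true, y_pred):
    """Coefficient of determination, computed separately per target column."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins: 200 "images", 10 "features", 3 "embedding dimensions".
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 10))
true_w = rng.normal(size=(10, 3))
targets = features @ true_w + 0.01 * rng.normal(size=(200, 3))

# Fit on the first 150 rows and evaluate on the held-out 50.
weights, intercept = fit_ridge(features[:150], targets[:150], alpha=1.0)
r2 = r_squared(targets[150:], features[150:] @ weights + intercept)
```

Evaluating $R^2$ on held-out images, as in this sketch, is what licenses using the fitted map to predict dimension values for novel or manipulated images.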

Having established an end-to-end mapping between input image and individual object dimensions, we next used three approaches to both probe the consistency of the interpretation and identify dimension-specific image features. First, to identify image regions relevant for each individual dimension, we used Grad-CAM (Selvaraju et al., 2017), an established technique for providing visual explanations. Grad-CAM generates heatmaps that highlight the image regions that are most influential for model predictions. Unlike traditional usage, which often focuses on creating visual explanations for model categorizations (e.g., dog vs. cat), we employed Grad-CAM to reveal what image regions drive the dimensions in the DNN embedding. The results of this analysis are illustrated with example images in Figure 3. Object dimensions were indeed driven by different image regions that contain relevant information, in line with the dimension’s interpretation derived from human ratings and suggesting that the representations captured by the DNN’s penultimate layer allow distinguishing between different object parts that carry different functional significance.

Figure 4: Most exciting images for dimensions in our embedding. a, Using StyleGAN XL (Sauer et al., 2022), we optimized a latent code to maximize the predicted response in a specific embedding dimension. b, Visualizations for different dimensions in our embedding. We show the top 10 images that score highest in the dimension and the corresponding top 10 generated images.

As a second image explanation approach, to specifically highlight what image features drive a dimension, we used a generative image model to create novel images optimized for maximizing values of a given dimension (Montavon et al., 2018; Yosinski et al., 2015; Erhan et al., 2009). Unlike conventional activation maximization that targets a single DNN unit or a cluster of units, our approach aimed to selectively amplify activation in dimensions of the DNN embedding across the entire DNN layer, using a pretrained generative adversarial neural network (StyleGAN XL; Sauer et al., 2022). The results of this procedure are shown in Figure 4b. The approach successfully generated images with high scores in the dimensions of our DNN embedding. Indeed, the properties highlighted by these generated images appear to align with the common interpretation of each specific dimension, again suggesting that the DNN embedding contained conceptually meaningful and coherent object properties similar to those found in humans.

Third, given that different visual features naturally co-occur across images, in order to tease apart their respective contribution, we causally manipulated individual image features and observed the effect on predicted DNN dimensions. We exemplify this approach with manipulations in color, object shape, and background (see Supplementary Information C). The results largely confirmed our predictions, leading to specific decreases or increases of activation in dimensions that appeared to be representing these features.

1.3 Factors underlying similarities and differences between humans and DNNs

Figure 5: Factors that determine the similarity between human and VGG-16 embedding dimensions. a, Representational similarity matrices reconstructed from the human and VGG-16 embedding. b, Pairwise correlations between human and VGG-16 embedding dimensions. c, Cumulative RSA analysis that shows the amount of variance explained in the human RSM as a function of the number of DNN dimensions. The black line shows the number of dimensions required to explain 95% of the variance. d-f, Intersection (red and blue regions) and differences (orange and green regions) between three highly correlating human and DNN dimensions. For this figure, we filtered the embedding by images from the public domain.
Figure 6: Overt behavioral choices in humans and the DNN. a, Overview of the approach. For one triplet, we computed the original predicted softmax probability based on the entire representational embedding for each object image in the triplet. We then iteratively pruned individual dimensions from the representational embedding and stored the difference between the resulting prediction and the original softmax probability of the entire embedding as a relevance score for that dimension. b, We calculated the relevance scores for a random sample of 10 million triplets and identified the most relevant dimension for each triplet. We then labeled the 10 million most relevant dimensions according to their human-assigned labels as semantic, mixed visual-semantic, visual, or unclear. Semantic dimensions are most relevant for human behavioral choices, whereas for VGG-16 visual and mixed semantic-visual properties are more relevant. c-f, We rank-sorted changes in softmax probability to find triplets where humans and the DNN maximally diverge. Each panel shows a triplet with the behavioral choice made by humans and the DNN. We visualized the most relevant dimension for that triplet alongside the distribution of relevance scores. Each dimension is assigned its human-annotated label. For this figure, we filtered the embedding by images from the public domain.

The previous results have confirmed the overall consistency and interpretation of the DNN’s visual and semantic dimensions based on common interpretability techniques. However, a direct comparison with human image representations is crucial for identifying which representational dimensions align well and which do not. Traditional representational similarity analysis (RSA) provides a global metric of representational alignment, revealing a moderate correlation (Pearson’s $r = 0.55$) between the representational similarity matrices (RSMs) of humans and the DNN (Fig. 5a). While this indicates some degree of alignment in object image representations, it does not clarify the factors driving this alignment. To address this challenge, we directly compared pairs of dimensions from both embeddings, pinpointing which dimensions contributed the most to the overall alignment and which dimensions were less well aligned.
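The global RSA comparison can be illustrated with a short sketch. Two assumptions are made here that the text does not confirm: the RSM is reconstructed as the dot product of embedding vectors (the actual reconstruction is described in the Methods), and both embeddings are aligned to the same set of objects, one row per object. The helper names are hypothetical.

```python
import numpy as np

def rsm_from_embedding(embedding):
    """Assumed reconstruction: dot-product similarity between all objects."""
    return embedding @ embedding.T

def rsa_pearson(rsm_a, rsm_b):
    """Pearson correlation between the off-diagonal upper triangles of two RSMs."""
    upper = np.triu_indices_from(rsm_a, k=1)
    return float(np.corrcoef(rsm_a[upper], rsm_b[upper])[0, 1])

# Toy usage: two random non-negative embeddings over the same 20 objects.
rng = np.random.default_rng(0)
emb_a = rng.random((20, 5))
emb_b = rng.random((20, 7))
r_ab = rsa_pearson(rsm_from_embedding(emb_a), rsm_from_embedding(emb_b))
```

Only the upper triangle is used because an RSM is symmetric and its diagonal carries no information about between-object similarity.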

For each human dimension, we identified the most strongly correlated DNN dimension, once without replacement (unique) and once with replacement, and sorted the dimensions based on their fit (Fig. 5b). This revealed a close alignment, with Pearson’s $r$ reaching up to 0.80 for a select few dimensions, which gradually declined across other representational dimensions. To determine whether the global representational similarity was driven by just a few well-aligned dimensions or required a broader spectrum of dimensions, we assessed the number of dimensions needed to explain human similarity judgments. The analysis revealed that 40 dimensions were required to capture 95% of the variance in representational similarity with the human RSM (Fig. 5c). Although this number is much smaller than the original 4096-dimensional VGG-16 layer, these results demonstrate that the global representational similarity is not solely driven by a small number of well-aligned dimensions.
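The dimension-matching step can be sketched as follows. This is a hedged illustration: it assumes both embeddings have one row per object in the same order, and the greedy one-to-one procedure in `match_unique` is only one plausible reading of matching "without replacement", since the text does not specify the algorithm. All function names are hypothetical.

```python
import numpy as np

def cross_correlation(human, dnn):
    """Pearson r between every human dimension (rows) and DNN dimension (columns)."""
    hz = (human - human.mean(axis=0)) / human.std(axis=0)
    dz = (dnn - dnn.mean(axis=0)) / dnn.std(axis=0)
    return hz.T @ dz / human.shape[0]

def match_with_replacement(corr):
    """Best DNN dimension for each human dimension; DNN dimensions may repeat."""
    return corr.argmax(axis=1)

def match_unique(corr):
    """Greedy one-to-one matching: repeatedly take the highest remaining r."""
    remaining = corr.copy()
    match = np.full(corr.shape[0], -1)
    for _ in range(min(corr.shape)):
        h, d = np.unravel_index(np.argmax(remaining), remaining.shape)
        match[h] = d
        # Remove the matched human row and DNN column from further consideration.
        remaining[h, :] = -np.inf
        remaining[:, d] = -np.inf
    return match

# Toy usage: the DNN columns are a shuffled, slightly noisy copy of the
# human dimensions plus one unrelated column.
rng = np.random.default_rng(0)
human = rng.random((50, 3))
dnn = np.column_stack([human[:, 2], human[:, 0], human[:, 1], rng.random(50)])
dnn = dnn + 0.01 * rng.random((50, 4))
corr = cross_correlation(human, dnn)
```

Sorting the matched correlations in descending order then gives a curve of the kind described for Fig. 5b.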

Given the imperfect alignment of DNN and human dimensions, we explored the similarities and differences in the stimuli represented by these dimensions. For each dimension, we identified which images were most representative of both humans and the DNN. Crucially, to highlight the discrepancies between the two domains, we then identified which images exhibited strong dimension values for humans but weak values for the DNN, and vice versa (Fig. 5d-f). While the results indicated similar visual and semantic representations in the most representative images, they also exposed clear divergences in dimension meanings. For instance, in an animal-related dimension, humans consistently represented animals even for images where the DNN exhibited very low dimension values. Conversely, the DNN dimension strongly represented objects that were not animals, such as natural objects, cages, or mesh (Fig. 5d). Similarly, a string-related dimension maintained a string-like representation in humans but included other objects in the DNN that were not string-like, potentially reflecting features related to thin, curvy objects or specific image features (Fig. 5f).

1.4 Relevance of object dimensions for DNN and human categorization behavior

Since internal representations do not necessarily translate into behavior, we next addressed whether this misalignment would translate to downstream behavioral choices. To this end, we employed a jackknife resampling procedure to determine the relevance of individual dimensions for odd-one-out choices. For each triplet, we iteratively pruned dimensions in both the human and DNN embeddings and observed changes in the predicted probabilities of selecting the odd-one-out, yielding an importance score for each dimension for the odd-one-out choice (Fig. 6a). The results of this analysis showed that while humans and DNNs often aligned in both their representations and choices, a sizable fraction of choices exhibited the same behavior despite strong differences in representations (Fig. 6b). For behavioral choices, the semantic bias in humans was enhanced, as evidenced by an even stronger importance of semantic relative to visual or mixed dimensions in humans as compared to DNNs. Individual triplet choices were affected not only by semantic but also by visual dimensions (Fig. 6c-f). Together, these results demonstrate that the differences in how humans and DNNs represent object images not only translate into behavioral choices but are also further amplified in their categorization behavior.
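The pruning analysis can be sketched for a single triplet as below. Several details are assumptions made for illustration rather than taken from the text: the softmax is taken over the three pairwise dot products, the relevance of a dimension is the drop in probability of the originally predicted odd-one-out when that dimension is removed, and the function names are hypothetical.

```python
import numpy as np

def odd_one_out_probs(x_i, x_j, x_k):
    """Softmax over pair similarities; entry m is the probability that item m is the odd one out."""
    # An item is the odd one out when the other two form the most similar pair.
    logits = np.array([x_j @ x_k, x_i @ x_k, x_i @ x_j])
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def dimension_relevance(x_i, x_j, x_k):
    """Leave-one-dimension-out change in the probability of the predicted choice."""
    full = odd_one_out_probs(x_i, x_j, x_k)
    choice = int(np.argmax(full))
    n_dims = x_i.shape[0]
    relevance = np.empty(n_dims)
    for d in range(n_dims):
        keep = np.ones(n_dims, dtype=bool)
        keep[d] = False  # prune dimension d from all three objects
        pruned = odd_one_out_probs(x_i[keep], x_j[keep], x_k[keep])
        relevance[d] = full[choice] - pruned[choice]
    return choice, relevance

# Toy usage: objects i and j share a strong value on dimension 0,
# so k is the odd one out and dimension 0 carries the decision.
x_i = np.array([2.0, 0.0])
x_j = np.array([2.0, 0.0])
x_k = np.array([0.0, 1.0])
choice, relevance = dimension_relevance(x_i, x_j, x_k)
```

Taking the dimension with the largest relevance per triplet, and repeating this over many sampled triplets, yields the kind of tally summarized in Fig. 6b.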

2 Discussion

A key challenge in understanding the similarities and differences in humans and artificial intelligence (AI) lies in establishing ways to make these two domains directly comparable. Overcoming this challenge would allow us to identify strategies for making AI more human-like (Geirhos et al., 2018) and for using AI as effective models of human perception and cognition. In this work, we propose a framework to identify interpretable factors that determine the similarities and differences between human and AI representations. In this framework, these factors can be identified by using the same experiment to probe behavior in humans and AI systems and applying the same computational strategy to the natural and artificial responses for inferring their respective interpretable embeddings. We successfully applied this approach to human similarity judgments and representations in a deep neural network (DNN) trained to classify natural images, thus allowing for a direct, meaningful comparison of the representations between both domains.

Our results revealed that the DNN contained representations that appeared to be similar to those found in humans, ranging from visual (e.g., "white", "circular/round", "transparent") to semantic features (e.g., "food-related", "fire-related"). However, a direct comparison to humans showed largely different strategies for arriving at these representations. While human representations were dominated by semantic dimensions, the DNN exhibited a pronounced bias towards visual or mixed visual-semantic dimensions. In addition, a face-to-face comparison of seemingly aligned dimensions revealed that DNNs only approximated the semantic representations found in humans. These different strategies were also reflected in their behavior, where similar behavioral outcomes were based on different embedding dimensions. Thus, despite seemingly well aligned human and DNN representations at a global level, deriving dimensions underlying the representational similarities provided a more complete and more fine-grained picture of this alignment, revealing the nature of the representational strategies that humans and DNNs use (Sucholutsky et al., 2023; Cichy and Kaiser, 2019; Kanwisher et al., 2023).

While previous approaches like RSA (Kornblith et al., 2019; Kriegeskorte et al., 2008) are particularly useful for comparing one or multiple representational spaces, they typically only provide a global quantitative measure of alignment and require explicit hypotheses and comparisons to other explicit models of representation to determine what it is about the representational space that drives human alignment. In contrast, other approaches have focused specifically on the interpretability of DNN representations (Zhou et al., 2018; Morcos et al., 2018; Erhan et al., 2009; Mahendran and Vedaldi, 2015; Nguyen et al., 2019; Bau et al., 2017, 2020) but either provide very specific local measures about DNN units or have limited direct comparability to human representations, given that the same interpretability methods can typically not be applied to understand human mental representations. Our framework combines the strengths of the comparability gained from RSA and existing interpretability methods to understand image processing in DNNs. We applied common interpretability methods to show that our work allows for detailed experimental testing and causal probing of DNN representations and behavior in response to diverse images. Yet only the direct comparison to human representations revealed the diverging representational strategies of humans and DNNs and thus limitations of the visualization techniques we used (Geirhos et al., 2023).

Our results are consistent with previous work indicating that DNNs make use of "shortcut" strategies that deviate from those used in humans (Geirhos et al., 2020; Hermann et al., 2023). Beyond the existing known biases, here we found a visual bias in DNNs that diverges from a more semantic bias in humans that underlies similarity judgments. This semantic bias is in line with the known ability of humans to abstract away from their immediate percept and their ability to structure the visual world based on their knowledge about the objects and their meaning (Palmeri and Gauthier, 2004). In contrast, even the highest layers in DNNs carry representations that continue to exhibit striking visual biases to solve the tasks they had been trained on, including image classification or relating images to text. While the identification of the visual bias is useful for understanding differences in representations and behavior, future work needs to determine the foundation of this bias and what changes in architectures or training are required to reduce it.

The framework introduced in this work can be expanded in multiple ways. Future work could translate this approach to a comprehensive overview across arbitrary DNN architectures, training objectives, or training datasets (Muttenthaler et al., 2023a, b; Conwell et al., 2022), which could reveal specific strategies for increasing representational alignment (Peterson et al., 2018; Fel et al., 2022). Varying the behavioral task beyond an odd-one-out task could further establish the alignment and misalignment according to behaviors beyond categorization. Finally, the framework could be applied to other domains, including brain recordings or other types of stimuli. Together, this framework promises a more comprehensive understanding of the relationship between human and AI representations, providing the potential to identify better candidate models of human cognition and behavior and more human-aligned artificial cognitive systems.

3 Methods

3.1 Triplet odd-one-out task

In the triplet odd-one-out task, participants are presented with three objects and must select the one that doesn’t fit. We define a dataset $\mathcal{D} \coloneqq \left\{\left(\{i_s, j_s, k_s\}, \{a_s, b_s\}\right)\right\}_{s=1}^{n}$, where $n$ is the total number of triplets and $\{i_s, j_s, k_s\}$ is a set of three unique objects, with $\{a_s, b_s\}$ being the pair among them determined as most similar. We used a dataset of human responses collected by Hebart et al. (2020) to learn an embedding of human object concepts. In addition, we simulated the triplet choices from a DNN.
For the DNN, we simulated these choices by computing the dot product of the penultimate layer activation $\bm{z}_i \in \mathbb{R}_{+}$ after applying the ReLU function, where $S_{ij} = \bm{z}_i^{\top} \bm{z}_j$. The most similar pair $\{a_s, b_s\}$ was then identified by the largest dot product:

$$\{a_s, b_s\} = \operatorname*{arg\,max}_{(x_s, y_s) \in \{(i_s, j_s), (i_s, k_s), (j_s, k_s)\}} \bm{z}_{x_s}^{\top} \bm{z}_{y_s}. \tag{1}$$

Using this approach, we sampled triplet odd-one-out choices for a total of 20 million triplets for the DNN.
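The selection rule in Eq. 1 can be illustrated with a minimal NumPy sketch. The function names `simulate_triplet_choices` and `sample_triplets`, the toy feature matrix, and the random triplet sampler are assumptions made for illustration and are not taken from the published implementation.

```python
import numpy as np

def simulate_triplet_choices(features, triplets):
    """Return the most similar pair {a_s, b_s} for each triplet (Eq. 1).

    features: (m, d) array of penultimate-layer activations (hypothetical input).
    triplets: (n, 3) integer array of object indices (i_s, j_s, k_s).
    """
    z = np.maximum(features, 0.0)           # ReLU, so that activations are non-negative
    S = z @ z.T                             # S_ij = z_i^T z_j
    i, j, k = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    sims = np.stack([S[i, j], S[i, k], S[j, k]], axis=1)    # (n, 3) pairwise similarities
    pairs = np.stack(
        [np.stack([i, j], axis=1), np.stack([i, k], axis=1), np.stack([j, k], axis=1)],
        axis=1,
    )                                       # (n, 3, 2) candidate pairs
    best = sims.argmax(axis=1)              # index of the largest dot product
    return pairs[np.arange(len(triplets)), best]

def sample_triplets(m, n, rng):
    """Draw n triplets of three distinct object indices out of m objects."""
    return np.stack([rng.choice(m, size=3, replace=False) for _ in range(n)])

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 16))        # toy stand-in for DNN activations
triplets = sample_triplets(m=10, n=5, rng=rng)
choices = simulate_triplet_choices(features, triplets)
```

The object that is not part of the returned pair is the simulated odd-one-out choice.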

3.2 Embedding optimization and pruning

Optimization. Let $\bm{W} \in \mathbb{R}^{m \times p}$ denote a randomly initialized embedding matrix, where $p = 150$ is the initial embedding dimensionality. To learn interpretable concept embeddings, we used VICE, an approximate Bayesian inference approach (Muttenthaler et al., 2022). VICE performs mean-field variational inference to approximate the posterior distribution $p(\bm{W} \mid \mathcal{D})$ with a variational distribution, $q_{\theta}(\bm{W})$, where $q_{\theta} \in \mathcal{Q}$.

VICE imposes sparsity on the embeddings using a spike-and-slab Gaussian mixture prior to update the variational parameters $\theta$. This prior encourages shrinkage towards zero, with the spike approximating a Dirac delta function at zero (responsible for sparsity) and the slab modeled as a wide Gaussian distribution (determining non-zero values). Therefore, it is a sparsity-inducing prior and can be interpreted as a Bayesian version of the Elastic Net (Zou and Hastie, 2005). The optimization objective minimizes the KL divergence between the posterior and the approximate distribution:

$$\operatorname*{arg\,min}_{\theta}~\mathbb{E}_{q_{\theta}(\bm{W})}\left[\frac{1}{n}\left(\log q_{\theta}(\bm{W}) - \log p(\bm{W})\right) - \frac{1}{n}\sum_{s=1}^{n}\log p\left(\{a_s, b_s\} \mid \{i_s, j_s, k_s\}, \bm{W}\right)\right], \tag{2}$$

where the left term represents the complexity loss and the right term is the average negative data log-likelihood.
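To make the role of the prior more concrete, the following sketch evaluates the log-density of a zero-mean, two-component Gaussian mixture with a narrow spike and a wide slab. The mixture form with a single mixing weight, and the values of `pi`, `sigma_spike` and `sigma_slab`, are illustrative assumptions rather than the hyperparameters used in this work.

```python
import numpy as np

def gaussian_pdf(w, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-0.5 * (w / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def log_spike_and_slab(w, pi=0.5, sigma_spike=0.25, sigma_slab=1.0):
    """Log-density of a spike-and-slab Gaussian mixture prior.

    pi, sigma_spike and sigma_slab are hypothetical values for illustration.
    """
    spike = pi * gaussian_pdf(w, sigma_spike)           # narrow component around zero
    slab = (1.0 - pi) * gaussian_pdf(w, sigma_slab)     # wide component for non-zero weights
    return np.log(spike + slab)

weights = np.array([0.0, 0.1, 1.0, 2.0])
log_prior = log_spike_and_slab(weights)
```

Under these assumed settings, weights close to zero receive most of their prior mass from the spike, while larger weights are supported almost entirely by the slab.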

Pruning. Since the variational parameters are composed of two matrices, one for the mean and one for the variance, $\theta = \{\mu, \sigma\}$, we can use the mean representation $\bm{\mu}_i$ as the final embedding for an object $i$. Imposing sparsity and positivity constraints improves the interpretability of our embeddings, ensuring that each dimension meaningfully represents distinct object properties. While sparsity is guaranteed via the spike-and-slab prior, we enforced non-negativity by applying a ReLU function to our final embedding matrix, thereby guaranteeing that $\bm{W} \in \mathbb{R}_{+}^{m \times p}$. Note that this is done both during optimization and at inference time. We used the same procedure as in Muttenthaler et al. (2022) for determining the optimal number of dimensions. Specifically, we initialized our model with $p = 150$ dimensions and reduced the dimensionality iteratively by pruning dimensions based on their probability of exceeding a threshold set for sparsity:

$$\text{Prune if } \Pr(w_{ij} > 0) < 0.05 \text{ for fewer than 5 objects}, \tag{3}$$

where $w_{ij}$ is the weight associated with object $i$ and dimension $j$. Training stopped either when the number of dimensions remained unchanged for 500 epochs or when the embedding was optimized for a maximum of 1000 epochs.

3.3 Embedding reproducibility and selection

We assessed reproducibility across 32 model runs with different seeds using a split-half reliability test. We chose the split-half reliability test for its effectiveness in evaluating the consistency of our model’s performance across different subsets of data, ensuring robustness. We partitioned the objects into two disjoint sets using odd and even masks. For each model run and every dimension in an embedding, we identified the dimension that is most highly correlated among all other models by using the odd mask. Using the even mask, we correlated this highest match with the corresponding dimension. This process generated a sampling distribution of Pearson’s $r$ coefficients for all model seeds. We subsequently Fisher $z$-transformed the Pearson’s $r$ sampling distribution. The average $z$-transformed reliability score for each model run was obtained by taking the mean of these $z$-scores. Inverting this average provides an average Pearson’s $r$ reliability score (see Supplementary Information D). For our final model and all subsequent analyses, we selected the embedding with the highest average reproducibility across all dimensions.
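The split-half procedure can be sketched as follows for a single reference run. This is one possible reading of the description above: the helper names `cross_corr` and `split_half_reliability`, the clipping constant, and the choice to average over all dimensions and comparison runs at once are assumptions for illustration.

```python
import numpy as np

def cross_corr(a, b):
    """Pearson correlations between all columns of a and all columns of b.

    Assumes no column is constant (otherwise the standard deviation is zero).
    """
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return a.T @ b / a.shape[0]

def split_half_reliability(reference, others):
    """Average split-half reliability of one embedding against other runs.

    reference: (m, p) embedding of the run under evaluation.
    others: list of (m, p_r) embeddings from runs with different seeds.
    """
    m, p = reference.shape
    odd = np.arange(m) % 2 == 1             # odd mask over objects
    even = ~odd                             # even mask over objects
    z_scores = []
    for other in others:
        # Find the best-matching dimension using the odd half only.
        match = cross_corr(reference[odd], other[odd]).argmax(axis=1)
        # Evaluate that match on the held-out even half.
        r_even = cross_corr(reference[even], other[even])[np.arange(p), match]
        # Fisher z-transform; clipping avoids infinities for r = 1.
        z_scores.append(np.arctanh(np.clip(r_even, -0.999999, 0.999999)))
    # Average in z-space, then invert the transform to obtain Pearson's r.
    return float(np.tanh(np.mean(z_scores)))

rng = np.random.default_rng(0)
reference = rng.normal(size=(40, 3))
others = [reference[:, ::-1].copy(), reference + 0.01 * rng.normal(size=(40, 3))]
reliability = split_half_reliability(reference, others)
```

Because the matching dimension is chosen on one half of the objects and scored on the other, the resulting correlation is not inflated by the selection step.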

3.4 Labeling dimensions and construction of word clouds

We assigned labels to the human embedding by pairing each dimension with its highest correlating counterpart from Hebart et al. (2020). These dimensions were derived from the same behavioral data, but using a non-Bayesian variant of our method. We then used the human-generated labels that were previously collected for these dimensions, without allowing for repeats.

For the DNN, we labeled dimensions using human judgments. This allowed us to capture a broad and nuanced understanding of each dimension’s characteristics. To collect human judgments, we asked 12 laboratory participants (6 male, 6 female; mean age, 29.08; s.d. 3.09; range 25-35) to label each DNN dimension. Participants were presented with a $5 \times 6$ grid of images, with each row representing a decreasing percentile of importance for that specific dimension. The top row contained the most important images, and the following rows included images within the 8th, 16th, 24th, and 32nd percentiles. Participants were asked to provide up to five labels that they thought best described each dimension. Word clouds showing the provided object labels were weighted by the frequencies of occurrence, and the top 6 labels were visualized.

Study participation was voluntary, and participants were not remunerated for their participation. This study was conducted in accordance with the Declaration of Helsinki and was approved by the local ethics committee of the Medical Faculty of the University Medical Center Leipzig (157/20-ek).

3.5 Dimension ratings

Two independent experts rated the dimensions according to two questions. The first question asked whether the dimensions were primarily visual-perceptual, semantic-conceptual, a mix of both, or whether their nature was unclear. For the second question, they rated the dimensions according to whether they reflected a single concept, several concepts, or were not interpretable. Overall, the two raters agreed 81.86% of the time for question 1 and 90.00% of the time for question 2. Response ambiguity was resolved by a third rater (see Supplementary Information A, B). All raters were part of the laboratory but were blind to whether the dimensions were model- or human-generated.

3.6 Dimension value maximization

To visualize the learned object dimensions, we used an activation maximization technique with a pretrained StyleGAN XL generator $\mathcal{G}$ (Sauer et al., 2022). Our approach combines sampling with gradient-based optimization to generate images that maximize specific dimension values in our embedding space.

Initial sampling. We started by sampling a set of $N = 100{,}000$ concatenated noise vectors $\bm{v}_i \in \mathbb{R}^{d}$, where $d$ is the dimensionality of the StyleGAN XL latent space. For each noise vector, we generated an image $\bm{x}_i = \mathcal{G}(\bm{v}_i)$ and predicted its embedding $\hat{\bm{y}}_i \in \mathbb{R}^{p}$ using our pipeline, where $p$ is the number of dimensions in our embedding space.

For a given dimension $j$, we selected the top $k$ images that yielded the highest values for $\hat{y}_{ij}$, the $j$-th component of $\hat{\bm{y}}_i$. These images served as starting points for our optimization process.

Gradient-based optimization. To refine these initial images, we performed gradient-based optimization in the latent space of StyleGAN XL. Our objective function $\mathcal{L}_{\text{AM}}$ balances two goals: increasing the absolute value of the embedding for dimension $j$ and concentrating probability mass towards dimension $j$. Formally, we define $\mathcal{L}_{\text{AM}}$ as:

$$\mathcal{L}_{\text{AM}}(\bm{v}_i) = -\alpha \cdot \hat{y}_{ij} - \beta \cdot \log p\left(\hat{y}_{ij} \mid \bm{z}_i\right), \tag{4}$$

where $\bm{z}_i = f(\mathcal{G}(\bm{v}_i))$ denotes the penultimate features extracted from the generated image using the pretrained VGG-16 classifier $f$. The term on the left, referred to as the dimension size reward, contributes to increasing the absolute value $\hat{y}_{ij}$ for the object dimension $j$. The term on the right, referred to as the dimension specificity reward, concentrates probability mass towards a dimension without necessarily increasing its absolute value. The balance between these two objectives is controlled by the scalars $\alpha$ and $\beta$. The objective $\mathcal{L}_{\text{AM}}$ was minimized using vanilla stochastic gradient descent. Importantly, only the latent code vector $\bm{v}_i$ was updated, while keeping the parameters of $\mathcal{G}$, the VGG-16 classifier $f$, and the embedding model fixed.

This optimization process was performed for each of the top $k$ images selected in the initial sampling phase. The resulting optimized images provide visual representations that maximally activate specific dimensions in our learned embedding space, offering insights into the semantic content captured by each dimension.
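The trade-off between the two terms of Eq. 4 can be illustrated numerically. The sketch below assumes that $p(\hat{y}_{ij} \mid \bm{z}_i)$ is a softmax over the predicted embedding dimensions, which is an interpretation of "concentrating probability mass" and is not confirmed by the text. The function name `am_loss` and the default values of `alpha` and `beta` are likewise hypothetical, and the gradient step through the generator is omitted.

```python
import numpy as np

def am_loss(y_hat, j, alpha=1.0, beta=1.0):
    """Activation-maximization objective for dimension j (cf. Eq. 4).

    y_hat: (p,) predicted embedding of one generated image.
    Assumption: p(y_hat_j | z) is a softmax over the p embedding dimensions.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    shifted = y_hat - y_hat.max()                          # numerically stable log-softmax
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    size_reward = alpha * y_hat[j]                         # rewards a large value on dimension j
    specificity_reward = beta * log_softmax[j]             # rewards mass concentrated on j
    return -size_reward - specificity_reward

selective = am_loss([3.0, 0.0, 0.0], j=0)     # large and specific to dimension 0
unspecific = am_loss([3.0, 3.0, 3.0], j=0)    # equally large, but shared across dimensions
```

Under this assumption, an image that drives several dimensions equally is penalized relative to one that drives only the target dimension, even when the target value itself is the same.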

3.7 Highlighting image features

To highlight image regions driving individual DNN dimensions, we used Grad-CAM. For each image, we performed a forward pass to obtain an image embedding and computed gradients using a backward pass. We next aggregated the gradients across all feature maps in that layer to compute an average gradient, yielding a two-dimensional dimension importance map.

3.8 RSA analyses

We used RSA to compare the structure of our learned embeddings with human judgments and DNN features. This analysis was conducted in three stages: human RSA, DNN RSA, and a comparative analysis between human and DNN representations.

Human RSA. We reconstructed a similarity matrix from our learned embedding. Given a set of objects $\mathcal{O} = \{o_1, \dots, o_m\}$, we computed the similarity $S_{ij}$ between each pair of objects $(o_i, o_j)$ using the softmax function:

$$S_{ij} = \frac{1}{|\mathcal{O} \setminus \{o_i, o_j\}|} \sum_{k \in \mathcal{O} \setminus \{o_i, o_j\}} \frac{\exp\left(\bm{y}_i^{\top}\bm{y}_j\right)}{\exp\left(\bm{y}_i^{\top}\bm{y}_j\right) + \exp\left(\bm{y}_i^{\top}\bm{y}_k\right) + \exp\left(\bm{y}_j^{\top}\bm{y}_k\right)} \tag{5}$$

where $\bm{y}_i$ is the embedding of object $o_i$, and the softmax function returns the probability that $o_i$ and $o_j$ are chosen as the most similar pair when presented together with a third object $o_k$. To evaluate the explained variance, we used a subset of 48 objects for which a fully sampled similarity matrix and associated noise ceilings were available from previous work (Hebart et al., 2020). We then computed the Pearson correlation between our predicted RSM and the ground truth RSM for these 48 objects.
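A direct, unoptimized NumPy transcription of Eq. 5 is sketched below, together with a helper that correlates the upper triangles of two RSMs. The function names, the choice to set the diagonal to 1, and the toy embedding are assumptions for illustration. The exponentials are not stabilized, so the sketch is only suitable for embeddings with moderate dot products.

```python
import numpy as np

def reconstruct_rsm(Y):
    """Reconstruct a similarity matrix from an (m, p) embedding following Eq. 5."""
    m = Y.shape[0]
    E = np.exp(Y @ Y.T)                     # E[i, j] = exp(y_i^T y_j)
    S = np.eye(m)                           # diagonal set to 1 by convention (assumption)
    for i in range(m):
        for j in range(i + 1, m):
            k = np.setdiff1d(np.arange(m), [i, j])        # all remaining context objects
            p = E[i, j] / (E[i, j] + E[i, k] + E[j, k])   # softmax probability per context
            S[i, j] = S[j, i] = p.mean()                  # average over contexts
    return S

def rsm_correlation(A, B):
    """Pearson correlation between the upper triangles of two RSMs."""
    iu = np.triu_indices(A.shape[0], k=1)
    return float(np.corrcoef(A[iu], B[iu])[0, 1])

Y = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.0, 1.0]])  # toy embedding
S = reconstruct_rsm(Y)
```

In the toy example, the first two objects share the same embedding, so they obtain a much higher similarity than pairs that load on different dimensions.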

DNN RSA. We followed a similar procedure, reconstructing the RSM from our learned embedding of the DNN features. We then correlated this reconstructed RSM with the ground-truth RSM derived from the original DNN features used to sample our behavioral judgments.

Comparative analysis. To compare human and DNN representations, we conducted two analyses. First, we performed a pairwise comparison by matching each human dimension with its most correlated DNN dimension. This was done both with and without replacement, allowing us to assess the degree of alignment between human and DNN representational spaces. Second, we performed a cumulative RSA to determine the number of DNN dimensions needed to accurately reflect the patterns in the human similarity matrix. We took the same ranking of DNN dimensions used for the pairwise RSA, starting with the highest correlating dimension. We then progressively added one DNN dimension at a time to a growing subset. After each addition, we reconstructed the RSM from this subset and correlated the human RSM with the cumulative DNN RSM. This step-by-step process allowed us to observe how the inclusion of each additional DNN dimension contributed to explaining the variance in the human RSM.

Acknowledgments

FPM, LM and MNH acknowledge support by a Max Planck Research Group grant of the Max Planck Society awarded to MNH. MNH acknowledges support by the ERC Starting Grant COREDIM (ERC-StG-2021-101039712) and the Hessian Ministry of Higher Education, Science, Research and Art (LOEWE Start Professorship and Excellence Program “The Adaptive Mind”). UG acknowledges support from the project Dutch Brain Interface Initiative (DBI2) with project number 024.005.022 of the research program Gravitation, which is (partly) financed by the Dutch Research Council (NWO). LM acknowledges support by the German Federal Ministry of Education and Research (BMBF) for the Berlin Institute for the Foundations of Learning and Data (BIFOLD) (01IS18037A) and for the grants BIFOLD22B and BIFOLD23B. This study used the high-performance computing resources of the Raven and Cobra Linux clusters at the Max Planck Computing & Data Facility (MPCDF), Garching, Germany (https://www.mpcdf.mpg.de/services/supercomputing/).

Author Contributions

Conceptualization: FPM, LM, MNH; Funding acquisition: MNH; Software: FPM, LM; Supervision: MNH; Visualization: FPM, LM; Writing – original draft: FPM, LM; Writing – final manuscript: FPM, LM, UG, MNH.

Code availability

A PyTorch implementation of the model alongside all experiments presented in this paper is publicly available at https://github.com/florianmahner/object-dimensions/.

References

  • Attarian et al. [2020] Maria Attarian, Brett D Roads, and Michael C Mozer. Transforming neural network visual representations to predict human judgments of similarity. arXiv preprint arXiv:2010.06512, 2020.
  • Bau et al. [2017] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3319–3327. IEEE Computer Society, 2017.
  • Bau et al. [2020] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.
  • Beery et al. [2018] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV), pages 456–473, 2018.
  • Cichy and Kaiser [2019] Radoslaw M Cichy and Daniel Kaiser. Deep neural networks as scientific models. Trends in cognitive sciences, 23(4):305–317, 2019.
  • Conwell et al. [2022] Colin Conwell, Jacob S Prince, Kendrick N Kay, George A Alvarez, and Talia Konkle. What can 1.8 billion regressions tell us about the pressures shaping high-level visual representation in brains and machines? BioRxiv, pages 2022–03, 2022.
  • Doerig et al. [2023] Adrien Doerig, Rowan P Sommers, Katja Seeliger, Blake Richards, Jenann Ismael, Grace W Lindsay, Konrad P Kording, Talia Konkle, Marcel AJ Van Gerven, Nikolaus Kriegeskorte, et al. The neuroconnectionist research programme. Nature Reviews Neuroscience, pages 1–20, 2023.
  • Erhan et al. [2009] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
  • Fel et al. [2022] Thomas Fel, Ivan F. Rodriguez Rodriguez, Drew Linsley, and Thomas Serre. Harmonizing the object recognition strategies of deep neural networks with humans. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022.
  • Fyshe et al. [2015] Alona Fyshe, Leila Wehbe, Partha Talukdar, Brian Murphy, and Tom Mitchell. A compositional and interpretable semantic space. In Proceedings of the 2015 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 32–41, 2015.
  • Geirhos et al. [2018] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. Advances in neural information processing systems, 31, 2018.
  • Geirhos et al. [2019] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • Geirhos et al. [2020] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  • Geirhos et al. [2023] Robert Geirhos, Roland S Zimmermann, Blair Bilodeau, Wieland Brendel, and Been Kim. Don’t trust your eyes: on the (un)reliability of feature visualizations. arXiv preprint arXiv:2306.04719, 2023.
  • Güçlü and van Gerven [2015] Umut Güçlü and Marcel AJ van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35(27):10005–10014, 2015.
  • Hebart et al. [2019] Martin N Hebart, Adam H Dickter, Alexis Kidder, Wan Y Kwok, Anna Corriveau, Caitlin Van Wicklin, and Chris I Baker. Things: A database of 1,854 object concepts and more than 26,000 naturalistic object images. PloS one, 14(10):e0223792, 2019.
  • Hebart et al. [2020] Martin N Hebart, Charles Y Zheng, Francisco Pereira, and Chris I Baker. Revealing the multidimensional mental representations of natural objects underlying human similarity judgements. Nature human behaviour, 4(11):1173–1185, 2020.
  • Hebart et al. [2023] Martin N Hebart, Oliver Contier, Lina Teichmann, Adam H Rockter, Charles Y Zheng, Alexis Kidder, Anna Corriveau, Maryam Vaziri-Pashkam, and Chris I Baker. Things-data, a multimodal collection of large-scale datasets for investigating object representations in human brain and behavior. Elife, 12:e82580, 2023.
  • Hermann et al. [2020] Katherine Hermann, Ting Chen, and Simon Kornblith. The origins and prevalence of texture bias in convolutional neural networks. Advances in Neural Information Processing Systems, 33:19000–19015, 2020.
  • Hermann et al. [2023] Katherine L Hermann, Hossein Mobahi, Thomas Fel, and Michael C Mozer. On the foundations of shortcut learning. arXiv preprint arXiv:2310.16228, 2023.
  • Hoyer [2002] Patrik O Hoyer. Non-negative sparse coding. In Proceedings of the 12th IEEE workshop on neural networks for signal processing, pages 557–565. IEEE, 2002.
  • Jain et al. [2016] Lalit Jain, Kevin G. Jamieson, and Robert D. Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2703–2711, 2016.
  • Jozwik et al. [2017] Kamila M Jozwik, Nikolaus Kriegeskorte, Katherine R Storrs, and Marieke Mur. Deep convolutional neural networks outperform feature-based but not categorical models in explaining object similarity judgments. Frontiers in psychology, 8:1726, 2017.
  • Kaniuth and Hebart [2022] Philipp Kaniuth and Martin N Hebart. Feature-reweighted representational similarity analysis: A method for improving the fit between computational models, brains, and behavior. NeuroImage, 257:119294, 2022.
  • Kanwisher et al. [2023] Nancy Kanwisher, Meenakshi Khosla, and Katharina Dobs. Using artificial neural networks to ask ‘why’ questions of minds and brains. Trends in Neurosciences, 46(3):240–254, 2023.
  • Khaligh-Razavi and Kriegeskorte [2014] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS computational biology, 10(11):e1003915, 2014.
  • King et al. [2019] Marcie L King, Iris IA Groen, Adam Steel, Dwight J Kravitz, and Chris I Baker. Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. NeuroImage, 197:368–382, 2019.
  • Kornblith et al. [2019] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International Conference on Machine Learning, pages 3519–3529. PMLR, 2019.
  • Kriegeskorte et al. [2008] Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in systems neuroscience, page 4, 2008.
  • Kubilius et al. [2016] Jonas Kubilius, Stefania Bracci, and Hans P Op de Beeck. Deep neural networks as a computational model for human shape sensitivity. PLoS computational biology, 12(4):e1004896, 2016.
  • Lindsay [2021] Grace W Lindsay. Convolutional neural networks as a model of the visual system: Past, present, and future. Journal of cognitive neuroscience, 33(10):2017–2031, 2021.
  • Mahendran and Vedaldi [2015] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5188–5196, 2015.
  • Montavon et al. [2018] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital signal processing, 73:1–15, 2018.
  • Morcos et al. [2018] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
  • Murphy et al. [2012] Brian Murphy, Partha Talukdar, and Tom Mitchell. Learning effective and interpretable semantic models using non-negative sparse embedding. In Proceedings of COLING 2012, pages 1933–1950, 2012.
  • Muttenthaler and Hebart [2021] Lukas Muttenthaler and Martin N. Hebart. THINGSvision: A Python toolbox for streamlining the extraction of activations from deep neural networks. Frontiers in Neuroinformatics, 15:45, 2021. ISSN 1662-5196.
  • Muttenthaler et al. [2022] Lukas Muttenthaler, Charles Y Zheng, Patrick McClure, Robert A Vandermeulen, Martin N Hebart, and Francisco Pereira. VICE: Variational Interpretable Concept Embeddings. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 33661–33675. Curran Associates, Inc., 2022.
  • Muttenthaler et al. [2023a] Lukas Muttenthaler, Jonas Dippel, Lorenz Linhardt, Robert A. Vandermeulen, and Simon Kornblith. Human alignment of neural network representations. In The Eleventh International Conference on Learning Representations, 2023a.
  • Muttenthaler et al. [2023b] Lukas Muttenthaler, Lorenz Linhardt, Jonas Dippel, Robert A Vandermeulen, Katherine Hermann, Andrew Lampinen, and Simon Kornblith. Improving neural network representations using human similarity judgments. In Advances in Neural Information Processing Systems, volume 36, pages 50978–51007. Curran Associates, Inc., 2023b.
  • Nguyen et al. [2019] Anh Nguyen, Jason Yosinski, and Jeff Clune. Understanding neural networks via feature visualization: A survey. Explainable AI: interpreting, explaining and visualizing deep learning, pages 55–76, 2019.
  • Nonaka et al. [2021] Soma Nonaka, Kei Majima, Shuntaro C Aoki, and Yukiyasu Kamitani. Brain hierarchy score: Which deep neural networks are hierarchically brain-like? IScience, 24(9), 2021.
  • Palmeri and Gauthier [2004] Thomas J Palmeri and Isabel Gauthier. Visual object understanding. Nature Reviews Neuroscience, 5(4):291–303, 2004.
  • Peterson et al. [2018] Joshua C. Peterson, Joshua T. Abbott, and Thomas L. Griffiths. Evaluating (and improving) the correspondence between deep neural networks and human representations. Cogn. Sci., 42(8):2648–2669, 2018.
  • Rajalingham et al. [2015] Rishi Rajalingham, Kailyn Schmidt, and James J DiCarlo. Comparison of object recognition behavior in human and monkey. Journal of Neuroscience, 35(35):12127–12136, 2015.
  • Rajalingham et al. [2018] Rishi Rajalingham, Elias B Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J DiCarlo. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38(33):7255–7269, 2018.
  • Roads and Love [2020] Brett D Roads and Bradley C Love. Learning as the unsupervised alignment of conceptual systems. Nature Machine Intelligence, 2(1):76–82, 2020.
  • Rosenfeld et al. [2018] Amir Rosenfeld, Richard Zemel, and John K Tsotsos. The elephant in the room. arXiv preprint arXiv:1808.03305, 2018.
  • Sauer et al. [2022] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022.
  • Schrimpf et al. [2018] Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J Majaj, Rishi Rajalingham, Elias B Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Franziska Geiger, et al. Brain-score: Which artificial neural network for object recognition is most brain-like? BioRxiv, page 407007, 2018.
  • Selvaraju et al. [2017] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Singer et al. [2022] Johannes JD Singer, Katja Seeliger, Tim C Kietzmann, and Martin N Hebart. From photos to sketches - how humans and deep neural networks process objects across different levels of visual abstraction. Journal of vision, 22(2):4–4, 2022.
  • Stoinski et al. [2023] Laura M Stoinski, Jonas Perkuhn, and Martin N Hebart. THINGSplus: New norms and metadata for the THINGS database of 1854 object concepts and 26,107 natural object images. Behavior Research Methods, pages 1–21, 2023.
  • Storrs et al. [2021] Katherine R Storrs, Tim C Kietzmann, Alexander Walther, Johannes Mehrer, and Nikolaus Kriegeskorte. Diverse deep neural networks all predict human inferior temporal cortex well, after training and fitting. Journal of cognitive neuroscience, 33(10):2044–2064, 2021.
  • Sucholutsky et al. [2023] Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C Love, Erin Grant, Jascha Achterberg, Joshua B Tenenbaum, et al. Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018, 2023.
  • Szegedy et al. [2013] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Wang et al. [2023] Aria Y Wang, Kendrick Kay, Thomas Naselaris, Michael J Tarr, and Leila Wehbe. Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset. Nature Machine Intelligence, pages 1–12, 2023.
  • Yamins et al. [2014] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the national academy of sciences, 111(23):8619–8624, 2014.
  • Yosinski et al. [2015] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
  • Zheng et al. [2019] Charles Y Zheng, Francisco Pereira, Chris I Baker, and Martin N Hebart. Revealing interpretable object representations from human behavior. ICLR, 2019.
  • Zhou et al. [2018] Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Revisiting the importance of individual units in CNNs via ablation. arXiv preprint arXiv:1806.02891, 2018.
  • Zou and Hastie [2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320, 2005.

Appendix A Dimension ratings and RSA across models

Figure S1: Dimension ratings and representational similarity across models. a, VGG-16 does not perform poorly when compared to other models, including Resnet50, DenseNet, CLIP, and BarlowTwins-Resnet50. b, The visual bias identified in VGG-16 is also evident across these other architectures, demonstrating consistent differences between human and DNN dimensions.

We used VGG-16 because of its common use in computational cognitive neuroscience. To validate that VGG-16 is a suitable choice for the comparison to humans, we conducted RSA with several other DNN models that differ in training diet, objective function, and architecture. Using thingsvision [Muttenthaler and Hebart, 2021], we extracted the penultimate-layer features of these models in the same way as for VGG-16 and learned representational embeddings based on simulated triplet choices. Each embedding was then compared to the human-derived embedding using RSA. Figure S1a shows that VGG-16 does not perform poorly compared to other architectures, which suggests that it is a suitable choice for our analyses. Additionally, we assessed the visual bias in these architectures by having human raters categorize each dimension as visual, semantic, a mixture of both, or unclear. The visual bias we find for VGG-16 replicates across the other DNNs (Fig. S1b).
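The RSA comparison between two embeddings can be illustrated with a minimal sketch. It assumes that each embedding is an objects-by-dimensions array, that object similarity is reconstructed as a dot product between embedding rows, and that alignment is measured as the Pearson correlation between the upper triangles of the two similarity matrices. The function names and the toy data are hypothetical and the exact similarity and correlation measures used in the paper may differ.

```python
import numpy as np

def similarity_matrix(embedding):
    """Object-by-object similarity from an (objects x dimensions) embedding.
    Assumption: similarity is the dot product between embedding rows."""
    return embedding @ embedding.T

def rsa_score(emb_a, emb_b):
    """Pearson correlation between the off-diagonal upper triangles
    of the two similarity matrices."""
    sim_a = similarity_matrix(emb_a)
    sim_b = similarity_matrix(emb_b)
    upper = np.triu_indices(sim_a.shape[0], k=1)  # exclude the diagonal
    return float(np.corrcoef(sim_a[upper], sim_b[upper])[0, 1])

# Toy data (hypothetical): 50 objects, 8 non-negative dimensions.
rng = np.random.default_rng(0)
human = rng.random((50, 8))
dnn = human + 0.1 * rng.random((50, 8))   # a closely related embedding
unrelated = rng.random((50, 8))           # an independent embedding

print(rsa_score(human, dnn), rsa_score(human, unrelated))
```

Because the score depends only on the similarity matrices, the two embeddings may have different numbers of dimensions as long as they describe the same objects in the same order.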

Appendix B Human ratings of dimension interpretability

Figure S2: Dimension ratings for different DNN models. a, Percentage of interpretable dimensions as rated by human observers. The human embedding has a smaller percentage of uninterpretable dimensions than any of the DNN models. b, Variance explained by uninterpretable dimensions. For this, we weighted each uninterpretable dimension by its importance, given by the numeric value of that dimension. In every DNN, uninterpretable dimensions explain more variance in the embedding than they do in humans.

To understand how interpretable the DNN dimensions were, we additionally asked the human experts to rate the interpretability of the dimensions across all five DNN architectures. Across DNNs, the number of interpretable dimensions was consistently lower than that found in humans (see Supplementary Fig. S2a), and uninterpretable dimensions generally had a higher overall importance for odd-one-out choices, as indicated by the sum of their weights across images (embedding variance explained by uninterpretable dimensions: 3.83% for humans, 8.02% for VGG-16; Supplementary Fig. S2b). Taken together, despite the decent global alignment between human and DNN representations and numerous interpretable visual and semantic dimensions, these results demonstrate largely different strategies used by humans and DNNs for object processing. DNNs use more visual features and exhibit a stronger mix between visual and semantic information than humans, who primarily rely on semantic features. This visual bias in DNNs is accompanied by an overall reduced interpretability of dimensions compared to humans. This demonstrates that sparse and non-negative embeddings do not necessarily result in interpretable dimensions and indicates another potential deviation of DNNs from the way humans represent visual stimuli.
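The importance weighting described above can be sketched as follows, assuming that the importance of a dimension is the sum of its weights across images and that the reported percentage is the share of total importance carried by dimensions rated as uninterpretable. The function name, the toy embedding, and the rating mask are hypothetical, and the paper's exact computation may differ.

```python
import numpy as np

def uninterpretable_share(embedding, uninterpretable):
    """Share of total dimension importance carried by uninterpretable dimensions.

    embedding: (images x dimensions) array of non-negative weights.
    uninterpretable: boolean mask over dimensions (True = rated uninterpretable).
    """
    importance = embedding.sum(axis=0)  # sum of weights across images
    return float(importance[uninterpretable].sum() / importance.sum())

# Toy example (hypothetical): 3 images, 3 dimensions, last one uninterpretable.
embedding = np.array([
    [2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
    [4.0, 1.0, 1.0],
])
mask = np.array([False, False, True])
share = uninterpretable_share(embedding, mask)
print(f"{100 * share:.2f}% of importance is carried by uninterpretable dimensions")
```

In the toy example the column sums are 8, 2, and 2, so the uninterpretable dimension carries 2 of 12 units of importance.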

Appendix C Causal Image Manipulations

Figure S3: Causal manipulation of unique image features. We compared the predicted dimension values between the original and causally manipulated images using our interpretability pipeline, revealing how these manipulations specifically affected various dimensions within our embedding space. The arrows indicate whether the activation level of a dimension increases, decreases, or remains relatively unchanged due to the manipulation. a, Altering the color of a toilet from white to black. b, Modifying the shape of a set of bottles to be more curved. c, Changing the background in an image containing a manhole. Note that only images with a public domain license are displayed, not the full image set [Stoinski et al., 2023].
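The comparison summarized by the arrows can be sketched as a per-dimension difference between the predicted values for the manipulated and the original image. The sketch below is illustrative only: the function name, the example values, and the tolerance that separates "unchanged" from a real change are assumptions, not values taken from the paper.

```python
import numpy as np

def direction_of_change(original, manipulated, tolerance=0.05):
    """Label each dimension as 'increase', 'decrease', or 'unchanged'.

    original, manipulated: predicted dimension values for the two images.
    tolerance: hypothetical threshold below which a change is ignored.
    """
    delta = np.asarray(manipulated) - np.asarray(original)
    labels = np.where(
        delta > tolerance,
        "increase",
        np.where(delta < -tolerance, "decrease", "unchanged"),
    )
    return labels.tolist()

# Hypothetical predicted values for three dimensions of one image pair.
original = [0.20, 0.80, 0.50]
manipulated = [0.60, 0.10, 0.52]
labels = direction_of_change(original, manipulated)
print(labels)
```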

Appendix D Reproducibility

Figure S4: Model reproducibility across different random initializations in humans and the DNN. a, Reproducibility across model runs was evaluated using a split-half reliability test (see Sec. 3, Methods). The model with the highest average reproducibility was selected for subsequent experiments. b, For this model, we present a visualization of its dimensional reproducibility compared to other models and dimensions. The red line indicates the number of dimensions retained in the final model as determined by the VICE criteria.
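One way to implement a split-half reliability test between two model runs is sketched below. It assumes that dimensions are matched across runs on one half of the objects (odd indices) and that reproducibility is then scored as the Pearson correlation of each matched pair on the held-out half (even indices). This procedure, the function name, and the toy data are assumptions made for illustration; the Methods section defines the procedure actually used.

```python
import numpy as np

def split_half_reproducibility(reference, other):
    """Per-dimension reproducibility of `reference` with respect to `other`.

    Both inputs are (objects x dimensions) embeddings of the same objects.
    Dimensions are matched on odd-indexed objects and scored on even-indexed ones.
    """
    n_objects, n_ref = reference.shape
    odd = np.arange(n_objects) % 2 == 1
    even = ~odd
    # Cross-correlation block between reference and other dimensions (odd half).
    cross = np.corrcoef(reference[odd].T, other[odd].T)[:n_ref, n_ref:]
    best_match = cross.argmax(axis=1)
    # Score each matched pair on the held-out even half.
    return np.array([
        np.corrcoef(reference[even, j], other[even, best_match[j]])[0, 1]
        for j in range(n_ref)
    ])

# Toy data (hypothetical): a second run that recovers the same dimensions
# in a different order, with a small amount of noise.
rng = np.random.default_rng(0)
reference = rng.random((100, 5))
other = reference[:, [2, 0, 4, 1, 3]] + 0.01 * rng.standard_normal((100, 5))
scores = split_half_reproducibility(reference, other)
print(scores.round(3))
```

Matching and scoring on disjoint object sets keeps the score from being inflated by the selection of the best-matching dimension.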