Restyling Unsupervised Concept Based Interpretable Networks with Generative Models

Jayneel Parekh 1,2    Quentin Bouniot 2    Pavlo Mozharovskyi 2    Alasdair Newson 1    Florence d’Alché-Buc 2

Developing inherently interpretable models for prediction has gained prominence in recent years. A subclass of these models, wherein the interpretable network relies on learning high-level concepts, is valued because concept representations are close to human communication. However, the visualization and understanding of the learnt unsupervised dictionary of concepts encounters major limitations, especially for large-scale images. We propose here a novel method that relies on mapping the concept features to the latent space of a pretrained generative model. The use of a generative model enables high-quality visualization, and naturally lays out an intuitive and interactive procedure for better interpretation of the learnt concepts. Furthermore, leveraging pretrained generative models has the additional advantage of making the training of the system more efficient. We quantitatively ascertain the efficacy of our method in terms of accuracy of the interpretable prediction network, fidelity of reconstruction, as well as faithfulness and consistency of learnt concepts. The experiments are conducted on multiple image recognition benchmarks for large-scale images. Project page available at https://jayneelparekh.github.io/VisCoIN_project_page/

Keywords: Interpretability · Explainability · Generative models

1 Introduction

Deep neural networks (DNNs) learn complex patterns from data to make predictions or decisions without being explicitly programmed how to perform the task. Interpreting decisions of DNNs, i.e. being able to obtain human-understandable insights about their decisions, is a difficult task [13, 11, 49]. This lack of transparency impacts their trustworthiness [58] and hinders their democratization for critical applications such as assisting medical diagnosis or autonomous driving.

Two different paths have been explored in order to interpret DNN outputs. The simplest approach for practitioners is to provide interpretations post-hoc, i.e. by analysing the so-called black-box model after training [12, 55, 47, 61]. However, post-hoc methods have been criticized for their high computational costs and for a lack of robustness and faithfulness of interpretations [78, 37, 8]. On the other hand, a preferred way to obtain more meaningful interpretations is to use interpretable by-design approaches [5, 3, 14, 21], which aim to integrate the interpretability constraint into the learning process while maintaining state-of-the-art performance.

Concept-based Interpretable Networks (CoINs) are a recent subcategory of these inherently interpretable prediction models that learn a dictionary of high-level concepts for prediction. The concept representation is either learnt in a supervised way using ground-truth concept annotations [40, 59], or in an unsupervised fashion by enforcing properties through carefully designed loss functions [7, 51, 59]. The output of the model can be interpreted by looking at the activations of each concept and how they are combined to obtain the final prediction. When working in the unsupervised setting, learnt concepts additionally have to be interpreted, usually through visualization [51, 59]. Concept-based interpretations have gained prominence as an alternative to popular feature-wise saliency maps [66, 55, 47, 61] for two main reasons: (1) their ability to provide interpretations closer to human reasoning and communication [76], and (2) in the specific case of visual modalities, their ability to more effectively highlight which features are important for a model and not just where in the input image it focuses [16]. However, the underlying concepts in current unsupervised CoINs are understood through a separate visualization pipeline, by finding inputs that highly activate a given concept, either from natural images in the available dataset [7, 59], or from virtual images by solving an optimization problem in the input space that maximally activates the concept [48, 51]. For large-scale images, concepts generally activate for local pattern information (color, texture, shape, etc.) and these visualization approaches face major limitations in highlighting this information to a user. Simply visualizing the most activating samples does not highlight the specific feature a concept activates for. Visualizing with an activation maximization procedure generates repeated patterns linked to the underlying concept in the image, but it is hard for a user to discern any human-interpretable signal in them. For example, in Fig. 1, it can be hard to identify that the concept activates for “Yellow-colored head” from the activation maximization (“FLINT visualization”). Furthermore, previous CoIN systems fail to include the visualization process in their quantitative evaluation of concepts.

We thus propose a novel set of specifications for the concepts to be learnt: in addition to fidelity to output (predictive capability from the concepts), fidelity to input (encoding input-relevant information in concepts) and sparsity (only a few concepts activated simultaneously), we also promote the viewability of concepts during training. We define viewability as the ability of the system to reconstruct high-quality images from the learnt concepts by leveraging a pretrained generative model. In order to obtain this viewability property, we propose to learn a concept translator, i.e., a mapping from the concept representation space to the latent space of the generative model. Learning the concept translator along with the other parameters of this novel CoIN system helps to improve the quality of the concepts. Finally, after training, interpretation of concepts is obtained through translation to the generative model, which allows for a more granular and interactive process. Our contributions are the following:

    We propose Visualizable CoIN (VisCoIN), a novel architecture for unsupervised training of CoINs relying on a concept translator module that maps concept vectors to the latent space of a pretrained generative model.

    We introduce a new property for unsupervised CoIN systems, related to viewability. This property is imposed during the training of the system by enforcing perceptual similarity of the reconstruction, in addition to other constraints, and made possible by the use of a generative model.

    We define a novel concept interpretation pipeline based on the concept translator and the associated generative model that yields a high-quality and more comprehensive visualization of each concept.

    We introduce new metrics in the context of unsupervised CoINs to evaluate the quality of concepts learnt, from the point of view of visualization. We then quantitatively and qualitatively evaluate our proposed method on three different large-scale image datasets, spanning multiple settings.

Figure 1: Comparison of the generated images obtained for the same learnt concept (“Yellow-colored head”) using FLINT visualization [51] and our proposed VisCoIN visualization. Using our concept translator, which maps the concept representation space to the latent space of a generative model, we can visualize each concept at different activation values, allowing for more granular and interactive interpretation.

2 Related works

Interpretable predictive models. In the context of deep learning architectures, a host of early approaches studying interpretability tackled the post-hoc interpretation problem [63, 66, 55, 47, 67, 15]. However, previous works such as those of [5, 45, 7] have contributed to a surge in developing predictive models that are also interpretable by-design [77, 4, 81, 44, 21]. The earlier systems, however, trained the complete model from scratch. Recent approaches reflect a growing interest in building interpretable models on top of pretrained models as backbones [40, 9]. Our approach falls in the latter category, wherein we learn an interpretable predictive model on top of a pretrained backbone.

GANs for interpretations. One of the earliest applications of GANs for interpretability was by [50], to synthesize images for visualizing neurons in a network. More recently, a variety of methods have employed GANs or other generative models (including diffusion models) for generating post-hoc counterfactual interpretations [80, 43, 19, 22]. Their central theme revolves around the idea of embedding any given input into the latent space of a GAN and finding meaningful perturbations in the latent space that affect the given predictor’s output the most. Our intended use of the GAN differs from these methods in a major way: we wish to use the GAN in order to learn and visualize an explicit dictionary of interpretable concept representations that is simultaneously used in a predictive model.

Concept-based interpretability. Providing interpretations via representations for high-level concepts has gained significant prominence recently. Similar to the overall literature, one set of concept-based methods has focused on post-hoc interpretation [23, 76, 43, 2, 20], with most based on the notion of concept activation vectors [36]. The other type of methods tackles the by-design/ante-hoc interpretation problem by learning concepts [7, 40, 51, 59, 60, 62]; these are abbreviated as CoIN systems in Sec. 1. We cover these methods in more detail in Sec. 3.1, with particular focus on networks based on learning completely unsupervised concepts [7, 51, 59], a key starting point of our approach.

3 Approach

3.1 Background

In this part, we provide a design overview of a concept-based interpretable network (CoIN). Our focus in this paper is on CoIN systems that learn an unsupervised dictionary of concepts.

Concept-based interpretable networks. We denote a training set for a supervised image classification task as $\mathcal{S}=\{(x_i,y_i)\}_{i=1}^{N}$. Each input image $x\in\mathcal{X}\subset\mathbb{R}^{n}$ is associated with a class label $y\in\mathcal{Y}$, a one-hot vector whose size is the number of classes $C$. The by-design interpretable network based on learning concept representations is denoted by $g:\mathcal{X}\rightarrow\mathcal{Y}$.

In the standard setup for concept-based prediction models (supervised or unsupervised), given an input $x$, the computation of $g(x)$ is broken down into two parts. First, a concept extraction function $\Phi$ computes the concept activations $\Phi(x)$; then a subnetwork $\Theta$ computes the final prediction from $\Phi(x)$, such that $g(x)=\Theta\circ\Phi(x)$. Supervised concept-based networks [40] use ground-truth concept annotations to train $\Phi$. The core of unsupervised concept-based methods instead lies in learning $\Phi$ by imposing loss functions. These loss functions are typically selected to encourage a certain set of properties that shape $\Phi$ simultaneously for both interpretation and prediction. We list the properties below:

  1. Fidelity to output: This requires $\Phi(x)$ to model the output space via the $\Theta$ function, either by predicting the ground-truth label $y$ [7] or the classification output $f(x)$ [51, 59]. It directly shapes $\Phi(x)$ for the prediction task and, during the interpretation phase, it helps in identifying important concepts for prediction.

  2. Fidelity to input: This requires $\Phi(x)$ to reconstruct the input $x$ via a decoder function. This property is essential to encode features relevant to the input in the concept representation. All the previous methods [7, 51, 59] rely on this loss and employ standard non-generative decoders for pixel-wise reconstruction to learn the concept dictionary $\Phi$.

  3. Sparsity of activations: This requires concept activations $\Phi(x)$ to be sparse for any $x$. It reinforces the high-level nature of $\Phi$ and limits the number of important concepts for prediction, in turn enhancing interpretability as well as reducing the visualization overhead for a user.

Limitations with concept visualization in previous unsupervised CoINs. A central common trait among prior CoINs learning unsupervised concepts [7, 51, 59] is the deployment of a decoder to reconstruct the input $x$ from $\Phi(x)$. Unlike supervised methods, they do not have access to any concept labels and thus need an additional visualization pipeline to understand the information encoded by each concept. However, their visualization pipeline does not utilize the decoder, but instead relies on proxy methods to probe the concept activation. Typically, it consists of finding an input that highly activates a concept, either by selecting from the training data [7, 59] or via input optimization [51]. In the former case, simply visualizing the set of most activating training samples lacks the granularity to highlight the features encoded by the concept. Input optimization, while relatively more insightful, is still difficult for a user to understand as the optimized images are often unnatural. Moreover, these issues are exacerbated for large-scale images, as seen in Fig. 1. A natural strategy to overcome these limitations is to enable direct control of a concept’s activation and to visualize its effect on the input. Since the decoder defines the relationship between concept activations and input samples, a generative model is a perfect candidate for a decoder to unlock this ability, in contrast to the standard decoders used previously.

In the next part, we describe the architecture behind our by-design interpretable network $g$, which additionally includes a concept translator module $\Omega$ to map concept features into the latent space of a pretrained generative model $G$. This defines our VisCoIN method.

3.2 Interpretable prediction network design

Figure 2: Left: Overview of a standard CoIN system $g$, which makes the prediction $g(x)$ from extracted concepts $\Phi(x)$. Right: Design of our unsupervised concept-based interpretable network VisCoIN, leveraging a pretrained generative model $G$ for visualization and a pretrained classifier $f$. Purple blocks denote trainable subnetworks.

We assume a fixed pretrained classification network $f$ and a fixed pretrained generator $G$, trained on the input dataset for classification and generation respectively. We use these two networks to guide our design and learning of $g$ and its concept extraction function $\Phi$. We first discuss the modelling of $g$ by describing its constituents, $\Phi$ and $\Theta$.

Interpretable network design. The dictionary $\Phi$ consists of $K$ concept functions $\phi_1,\dots,\phi_K$. Given an input $x$, each concept activation $\phi_k(x)$ is represented by a small convolutional feature map with non-negative activations. Thus $\Phi(x)=[\phi_1(x),\dots,\phi_K(x)]\in\mathbb{R}_{+}^{K\times b}$, where $b$ is the total number of elements in each feature map. We model the computation of concept activations $\Phi(x)$ using the pretrained classification network $f$ and learn a relatively lightweight network $\Psi$ on top of selected hidden layers of $f$, denoted $f_{\mathcal{I}}(x)$, i.e. $\Phi(x)=\Psi\circ f_{\mathcal{I}}(x)$. $\Theta$ is designed to simply pool the feature maps to obtain a single concept activation vector of size $K$ and make the final prediction by passing it through a linear layer followed by softmax, i.e. $g(x)=\Theta(\Phi(x))=\textrm{softmax}(\Theta_W^{T}\,\textrm{pool}(\Phi(x)))$, where $\Theta_W\in\mathbb{R}^{K\times C}$ are the weights of the linear layer. The simplified design of $\Theta$ makes estimating the importance of each concept function $\phi_k$ for any prediction straightforward.
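
As a rough illustration of the prediction head $\Theta$ described above, the following NumPy sketch pools $K$ concept feature maps and applies a linear layer followed by softmax. The pooling operator is not specified in the text, so max pooling is assumed here; the function name `predict_from_concepts`, the toy sizes and the random weights are hypothetical and stand in for a trained model.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_from_concepts(phi_x, theta_w):
    """Sketch of g(x) = softmax(Theta_W^T pool(Phi(x))).

    phi_x   : (K, b) non-negative concept feature maps, flattened spatially.
    theta_w : (K, C) weights of the linear layer.
    """
    pooled = phi_x.max(axis=1)    # ASSUMPTION: max pooling; shape (K,)
    logits = theta_w.T @ pooled   # shape (C,)
    return softmax(logits), pooled

# Toy example with hypothetical sizes: K=8 concepts, b=9 elements, C=5 classes.
rng = np.random.default_rng(0)
K, b, C = 8, 9, 5
phi_x = np.maximum(rng.normal(size=(K, b)), 0.0)  # non-negative activations
theta_w = rng.normal(size=(K, C))
probs, pooled = predict_from_concepts(phi_x, theta_w)
```

Because the logits are linear in the pooled activations, the contribution of concept $k$ to class $c$ can be read off as the product of `pooled[k]` and `theta_w[k, c]`, which is what makes the importance estimate straightforward.
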
Viewability property. In order to improve visualization for unsupervised CoINs, and thus interpretation of learnt concepts, we propose to add the requirement of a viewability property. Given an input image $x$, this property requires that high-quality images can be reconstructed from $\Phi(x)$ through a pretrained generative model $G$. More specifically, reconstructions should have high enough quality to “view” input samples through generated outputs and thus ground modifications of $\Phi(x)$ back to $x$. In our method, we propose to achieve this by learning an additional concept translator module $\Omega$ that maps $\Phi(x)$ to the latent space of $G$, such that high-quality reconstructed images can be obtained from $\Phi(x)$ through $\Omega$ and $G$. The design of $\Omega$ depends entirely on the generative model used.

We opt for StyleGAN2-ADA [31] as our generative model. This choice is motivated by certain properties of this model: (a) it has a low-dimensional and highly structured latent space, with meaningful transformations in the generated output via latent traversals, (b) it can generate high-quality large-scale images, and (c) it is flexible to train in limited data regimes. We discuss the significance of these properties and the suitability of other generative models in much greater detail in Sec. 0.A.1. It is worth noting that using a pretrained generator also limits the training cost and complexity, and improves reusability.

3.3 Use of StyleGAN as decoder

A short primer on style-based GANs. The work by Karras et al. [33] proposed a style-based generator architecture. The generator learns to map a noise vector ($z\in\mathbb{R}^{512}$, $z\sim\mathcal{N}(0,1)$) to a latent code ($w\in\mathcal{W}\subset\mathbb{R}^{512}$) that controls the scaling and biasing of feature maps at different resolutions in the synthesis network, thereby controlling the style of the generated image. Subsequent updates to the original architecture [34, 32] have improved upon various aspects of generator architecture and regularization. One such update relevant for us is StyleGAN2-ADA [31], which incorporates adaptive discriminator augmentation in training. This enables training StyleGAN2 with a limited amount of data, a regime where GANs are generally prone to collapsing. The StyleGAN architecture has been actively studied for certain aspects relevant to our use case. Primary among these are GAN inversion methods, which aim to embed any given input into the latent space of a pretrained GAN so that the GAN can reconstruct the input from the latent code. The two approaches for this are either optimizing a latent code for any given input [1, 29], or training an encoder to predict the latent code [56, 6, 75]. While the optimization approaches can invert a given input better, they scale poorly compared to encoder-based approaches when inverting many input samples. StyleGAN has also been studied closely with regard to the structure of its latent spaces, including their suitability to embed inputs [1, 72, 73], the discovery of semantically meaningful directions or trajectories in latent space [25, 69, 65], and hierarchical style control with latent vectors for different resolutions [35].

At this juncture, it is worth asking why we do not directly use the latent space/stylespace [72] of the pretrained StyleGAN as the concept dictionary for interpretation, as done previously [43]. We discuss this question in detail in Sec. 0.A.2, including the issues this design raises for use in a by-design interpretable network. Instead, we adhere to the standard setup for unsupervised CoINs and learn a small dictionary of concept functions. The pretrained StyleGAN is used to reconstruct the input $x$ from the concept activations $\Phi(x)$.

The most natural way to design this reconstruction is to use an approach similar to the encoder-based GAN-inversion architectures used for StyleGAN. Following previous works in this regard [1, 75], we use the extended latent space $\mathcal{W}^{+}$ for inversion, which corresponds to different latent vectors for different resolutions. We learn the concept translator $\Omega$ to map the concept activations to $\mathcal{W}^{+}$ and bias its computation with the average latent vector $\bar{w}$ of $G$.

Since $\Phi(x)$ is a much lower-dimensional representation than the elements of $\mathcal{W}^{+}$ and is constrained by losses unrelated to reconstruction, we found it challenging to achieve reconstruction quality close to that of GAN inversion methods. This issue in principle cannot be completely eliminated without compromising the interpretable predictive structure. However, we alleviate it to some extent by learning an unconstrained “supporting” representation as a secondary output of $\Psi$, termed $\Phi'(x)$. The only goal of $\Phi'(x)$ is to assist $\Omega$ in embedding the input in $\mathcal{W}^{+}$. We utilize findings from [35] about the hierarchical nature of latent vectors and use $\Phi'(x)$ to predict only a small number of latent vectors, namely those expected to contain minimal classification-related information. To ensure that the importance of $\Phi(x)$ in reconstruction is not diminished, the majority of latent vectors are still predicted from $\Phi(x)$. The choice of using $\Phi'$ is analyzed with an ablation study. The reconstructed input $\tilde{x}$ is computed as:

$$\tilde{x}=G(w_x^{+}),\quad\text{where } w_x^{+}=\Omega(\Phi(x),\Phi'(x))\in\mathcal{W}^{+} \tag{1}$$

The concept translator $\Omega$ consists of a set of single fully-connected (FC) layers, one for predicting each latent vector. Each FC layer takes either $\Phi(x)$ or $\Phi'(x)$ as input, depending upon the latent vector it predicts. Precise details about $\Omega$ and $\Phi'(x)$ are in Appendix 0.B. Fig. 2 illustrates the complete design of VisCoIN.
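
The structure of the concept translator can be sketched in NumPy as follows: one single-layer linear map per latent vector of $\mathcal{W}^{+}$, each reading either the flattened $\Phi(x)$ or the supporting representation $\Phi'(x)$, with the output offset by $\bar{w}$. This is only an illustrative sketch. The class name, the number of latent vectors (14), the indices assigned to $\Phi'(x)$, the input sizes and the choice to add $\bar{w}$ directly to each FC output are all assumptions; the actual configuration is the one given in the appendix.

```python
import numpy as np

class ConceptTranslator:
    """Hypothetical sketch of Omega: one FC layer per latent vector in W+."""

    def __init__(self, dim_phi, dim_support, n_latents, support_idx, w_avg, seed=0):
        rng = np.random.default_rng(seed)
        self.support_idx = set(support_idx)  # latents predicted from Phi'(x)
        self.w_avg = w_avg                   # average latent vector, (latent_dim,)
        latent_dim = w_avg.shape[0]
        self.layers = []
        for i in range(n_latents):
            d_in = dim_support if i in self.support_idx else dim_phi
            weight = rng.normal(scale=0.01, size=(d_in, latent_dim))
            bias = np.zeros(latent_dim)
            self.layers.append((weight, bias))

    def __call__(self, phi_x, phi_support):
        rows = []
        for i, (weight, bias) in enumerate(self.layers):
            inp = phi_support if i in self.support_idx else phi_x
            # ASSUMPTION: the prediction is an offset from the average latent.
            rows.append(self.w_avg + inp @ weight + bias)
        return np.stack(rows)                # (n_latents, latent_dim)

# Toy usage with made-up sizes: flattened Phi(x) of size 72, Phi'(x) of size 16.
rng = np.random.default_rng(1)
w_avg = rng.normal(size=512)
omega = ConceptTranslator(dim_phi=72, dim_support=16, n_latents=14,
                          support_idx=[12, 13], w_avg=w_avg)
w_plus = omega(rng.uniform(size=72), rng.normal(size=16))
w_plus_zero = omega(np.zeros(72), np.zeros(16))
```

With zero inputs and zero biases every predicted latent vector equals `w_avg`, which shows how the average latent acts as the starting point around which the translator predicts.
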

3.4 Training losses

Based on the previous discussion about concept-based networks and our reconstruction pipeline design inspired by GAN inversion architectures, we define here our training loss $\mathcal{L}_{train}$ and each of its constitutive terms.

  • To satisfy the fidelity to output property, we define an output fidelity loss $\mathcal{L}_{of}$, which grants predictive capabilities to $g$. It is defined as the generalized cross entropy (CE) loss between $g$ and $f$:

    $$\mathcal{L}_{of}(x;\Psi,\Theta)=\alpha\, CE(g(x),f(x)). \tag{2}$$
  • The most critical part of our training loss is the reconstruction loss $\mathcal{L}_{rec}^{G}$ computed through the pretrained generative model $G$, which gathers all constraints between inputs $x$ and their reconstructions $\tilde{x}=G(\Omega(\Phi(x),\Phi'(x)))$. It combines $\ell_1$ and $\ell_2$ penalties, enforcing pixel-wise reconstruction for fidelity to input, with the perceptual similarity LPIPS [82] and a final reconstruction classification term, both linked to viewability. The reconstruction classification term, defined as $CE(f(\tilde{x}),f(x))$, encourages the generative model to reconstruct $\tilde{x}$ with more classification-specific features pertaining to the input $x$. Similar losses have been introduced for inversion with StyleGAN and its training for post-hoc interpretation [43]. Our reconstruction loss is thus defined as follows:

    $$\mathcal{L}_{rec}^{G}(x;\Psi,\Omega)=\|\tilde{x}-x\|_2^2+\|\tilde{x}-x\|_1+\beta\,\textrm{LPIPS}(\tilde{x},x)+\gamma\, CE(f(\tilde{x}),f(x)). \tag{3}$$
  • We impose the sparsity property along with two other regularizations, combined under the term $\mathcal{L}_{reg}$. More specifically, $\Psi$ is regularized to encourage sparsity of activations in $\Phi(x)$ through an $\ell_1$ penalty, and to encourage diversity while reducing redundancy in the learnt dictionary $\Phi$ with a kernel orthogonality loss $\mathcal{L}_{orth}$, applied on the weights of the final convolution layer of $\Psi$ [74, 71]. Then, $\Omega$ is encouraged to predict latent vectors close to the average latent vector $\bar{w}$, a common practice in GAN-inversion systems [68]. The regularization terms are written as follows:

    $$\mathcal{L}_{reg}(x;\Psi,\Omega)=\mathcal{L}_{reg-\Psi}(x;\Psi)+\mathcal{L}_{reg-\Omega}(x;\Omega), \tag{4}$$
    $$\mathcal{L}_{reg-\Omega}(x;\Omega)=\|w_x^{+}-\bar{w}\|_2^2,\quad \mathcal{L}_{reg-\Psi}(x;\Psi)=\delta\,\|\Phi(x)\|_1+\mathcal{L}_{orth}(\Psi).$$
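
To make the regularization terms concrete, here is a minimal NumPy sketch of Eq. (4). The exact form of the kernel orthogonality loss is not given in this section, so the sketch assumes a common choice, the squared Frobenius norm of $WW^{T}-I$ over flattened filters; the function names and toy shapes are hypothetical.

```python
import numpy as np

def kernel_orthogonality(kernel):
    """ASSUMED form of L_orth: ||W W^T - I||_F^2 over flattened filters.

    kernel : (K, ...) weights of the final conv layer, one filter per concept.
    """
    w = kernel.reshape(kernel.shape[0], -1)
    gram = w @ w.T
    return float(np.sum((gram - np.eye(w.shape[0])) ** 2))

def regularization_loss(phi_x, kernel, w_plus, w_avg, delta):
    """L_reg = delta * ||Phi(x)||_1 + L_orth(Psi) + ||w_x^+ - w_avg||_2^2."""
    sparsity = delta * float(np.sum(np.abs(phi_x)))
    latent = float(np.sum((w_plus - w_avg) ** 2))  # w_avg broadcasts over rows
    return sparsity + kernel_orthogonality(kernel) + latent

# Toy check with hand-computable values (all sizes are made up).
kernel = np.eye(4).reshape(4, 1, 2, 2)  # four mutually orthonormal filters
phi_x = np.ones((4, 9))                 # ||Phi(x)||_1 = 36
w_avg = np.zeros(2)
w_plus = np.ones((3, 2))                # squared distance to w_avg = 6
total = regularization_loss(phi_x, kernel, w_plus, w_avg, delta=0.5)
print(total)  # 0.5 * 36 + 0 + 6 = 24.0
```

The orthonormal toy kernel contributes nothing to the loss, so only the sparsity and latent terms remain; filters that overlap would add a positive orthogonality penalty.
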

Finally, the training loss and the optimization can be summarized as:

$$\mathcal{L}_{train}(x;\Psi,\Theta,\Omega)=\mathcal{L}_{of}(x;\Psi,\Theta)+\mathcal{L}_{rec}^{G}(x;\Psi,\Omega)+\mathcal{L}_{reg}(x;\Psi,\Omega), \tag{5}$$
$$\hat{\Psi},\hat{\Theta},\hat{\Omega}=\arg\min_{\Psi,\Theta,\Omega}\frac{1}{N}\sum_{x\in\mathcal{S}}\mathcal{L}_{train}(x;\Psi,\Theta,\Omega).$$

In the above equations, the loss hyperparameters are denoted by $\alpha, \beta, \gamma, \delta$. During training, the overall training loss is optimized simultaneously w.r.t. the parameters of $\Psi$, $\Theta$ and $\Omega$ only, keeping $f$ and $G$ fixed.

3.5 Interpretation phase

We now describe the interpretation generation process, which can be divided into two parts: (1) Concept relevance estimation, which requires estimating the importance of any given concept function $\phi_k$ in the prediction for a particular sample $x$ (local interpretation) or for a class $c$ in general (global interpretation), and (2) Concept visualization, which pertains to visualizing the concept encoded by any given concept function $\phi_k$. We describe each of them in greater detail below:
(1) Concept relevance: Since our $g(x)$ adheres to the structure of CoINs, and among them is closest to [51], the first step of relevance estimation follows almost as is. The estimation is based on the concept activations $\Phi(x)$, and on how the pooled version of $\Phi(x)$ is combined by the fully connected layer in $\Theta$ (with weights $\Theta_W$) to obtain the output logits. Note that this step does not rely on the decoder/generator $G$. Specifically, the local relevance $r_k(x)$ of a concept function $\phi_k$ for a given sample $x$ is computed as its contribution to the logit of the predicted class $\hat{c} = g(x)$, normalized to lie in $[-1, 1]$. The global relevance of concept function $\phi_k$ for a given class $c$, denoted $r_{k,c}$, is computed as the average of the local relevance $r_k(x)$ for samples from class $c$. This description is summarized in the equation below, where $\Theta_W^{k,\hat{c}}$ denotes the weight on concept $k$ for predicted class $\hat{c}$ in the weight matrix $\Theta_W$:

$$r_k(x)=\frac{\alpha_k(x)}{\max_l |\alpha_l(x)|},\quad \alpha_k(x)=\textrm{pool}(\phi_k(x))\,\Theta_W^{k,\hat{c}},\quad r_{k,c}=\mathbb{E}(r_k(x)\mid g(x)=c)$$
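The relevance computation above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `pooled_phi` is taken to be a list of the $K$ already pooled activations, `weights[k][c]` is a hypothetical layout for $\Theta_W$, and the zero-denominator guard is an addition of this sketch rather than something described in the paper.

```python
def local_relevance(pooled_phi, weights, predicted_class):
    """Local relevance r_k(x) for every concept k.

    pooled_phi      : list of K pooled activations pool(phi_k(x))
    weights         : weights[k][c], weight on concept k for class c (assumed layout)
    predicted_class : index of the predicted class c_hat = g(x)
    """
    # alpha_k(x) = pool(phi_k(x)) * Theta_W^{k, c_hat}
    alphas = [p * weights[k][predicted_class] for k, p in enumerate(pooled_phi)]
    scale = max(abs(a) for a in alphas)
    if scale == 0.0:
        # Guard added for this sketch: all contributions are zero.
        return [0.0 for _ in alphas]
    # Dividing by the largest absolute contribution keeps values in [-1, 1].
    return [a / scale for a in alphas]


def global_relevance(local_relevances):
    """Global relevance r_{k,c}: average of local relevance vectors.

    local_relevances : non-empty list of r(x) vectors, one per sample with g(x) = c
    """
    n = len(local_relevances)
    num_concepts = len(local_relevances[0])
    return [sum(r[k] for r in local_relevances) / n for k in range(num_concepts)]
```

In this sketch the caller is responsible for grouping samples by class before calling `global_relevance`, which mirrors the conditional expectation in the equation.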

(2) Concept visualization: Once the importance of a concept function is estimated, one can extract the most important concepts for a sample $x$ or class $c$ by thresholding $r_k(x)$ or $r_{k,c}$ respectively. To visualize any concept $\phi_k$, following previous CoINs, one can start by selecting the most activating training samples for $\phi_k(x)$, either over the whole training set or separately for each class for which it is highly relevant. However, the core of our concept visualization process is to use the generator $G$ to visualize the impact of $\phi_k$ on any input $x$ for which it is relevant. We do so by (1) replacing the activation $\phi_k(x)$ with $\lambda \times \phi_k(x)$, $\lambda \geq 0$, while keeping all other activations in $\Phi(x)$ intact, and (2) visualizing the generated output for increasing values of $\lambda$. The case $\lambda = 1$ corresponds to $\tilde{x}$, the reconstructed version of $x$. This process is summarized in Fig. 3. Extremely high values of $\lambda$ can push the predicted latent vectors far from the average latent vector, in which case the generated output is less reliable. In our experiments, we found qualitatively that a maximum $\lambda$ of up to 3 or 4 gives reliable outputs with consistent modifications.
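The activation edit in step (1) can be illustrated as follows. This is a minimal sketch, assuming $\Phi(x)$ is stored as a list of $K$ flattened concept maps; the helper names are hypothetical, and the subsequent decoding through $\Omega$ and $G$ is not shown because it depends on the trained networks.

```python
def scale_concept(phi_x, k, lam):
    """Return a copy of Phi(x) with concept k multiplied by lam >= 0.

    phi_x : list of K concept maps, each a flat list of floats (assumed layout)
    k     : index of the concept to modify
    lam   : scaling factor lambda; lam = 1 leaves Phi(x) unchanged
    """
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    modified = [list(fmap) for fmap in phi_x]  # copy, so the input is untouched
    modified[k] = [lam * v for v in modified[k]]
    return modified


def lambda_sweep(phi_x, k, lambdas=(1.0, 2.0, 3.0, 4.0)):
    """Build one modified Phi(x) per lambda value, for side-by-side viewing."""
    return {lam: scale_concept(phi_x, k, lam) for lam in lambdas}
```

Each modified dictionary would then be passed, together with the unchanged $\Phi'(x)$, through $\Omega$ and $G$ to obtain the image shown for that value of $\lambda$. The default sweep stops at 4, in line with the range reported as reliable above.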

Figure 3: Overview of visualization for a given image $x$ and concept function $\phi_k$. By imputing a higher activation for $\phi_k(x)$ in $\Phi(x)$ (by a factor $\lambda = 4$ in the figure), and comparing the obtained visualization to the original reconstruction $\tilde{x}$ (obtained with the untouched $\Phi(x)$), we interpret information encoded by $\phi_k$ about image $x$. For simplicity, we omit $\Phi'(x)$ in the figure, as it remains the same.

4 Experiments and Results

Datasets. We experiment on image recognition tasks for large-scale images in three different domains, with a greater focus on multi-class classification tasks: (1) binary age classification (young/old) on CelebA-HQ [30], (2) fine-grained bird classification for 200 classes on Caltech-UCSD-Birds-200 (CUB-200) [70], and (3) fine-grained car model classification for 196 classes on Stanford Cars [41].

Implementation details. We use a ResNet50 [26] as our architecture for $f$. All images are processed at resolution $256 \times 256$. For the pretrained GAN, on CelebA-HQ we use the checkpoint released by NVIDIA [31]. On CUB-200 and Stanford Cars, we fine-tune an ImageNet [24] and an LSUN Cars [79] checkpoint respectively to obtain the pretrained $G$ in a resource-efficient manner. The fine-tuning is performed without any class labels. We use a dictionary size $K = 64$ on CelebA-HQ and $K = 256$ on CUB-200 and Stanford Cars. All experiments were conducted on a single V100-32GB GPU. We will release our code publicly upon publication. Complete details about architecture, training and evaluation are in Appendix 0.B.

4.1 Evaluation strategy

One major goal of by-design interpretable architectures is to obtain high prediction performance. Thus, the prediction accuracy of $g$ is the first metric we evaluate. We next discuss multiple functionally-grounded metrics [17] that evaluate the learnt concept dictionary $\Phi$ from an interpretability perspective and its use in visualization, including two novel metrics in the context of evaluating CoINs (“faithfulness” and “consistency”).

Fidelity of reconstruction. Since the reconstructed output plays a crucial role in our visualization pipeline, we evaluate how well $G$ reconstructs the input. We do so by computing the averaged per-sample mean squared error (MSE), the perceptual distance (LPIPS) [82] and the distance between overall distributions (FID) [28] of reconstructed images $\tilde{x}$ and original input images $x$.

Faithfulness of concept dictionary $\Phi$. The aspect of faithfulness for a generic interpretation method asks the question “are the features identified in the interpretation truly relevant for the prediction process?” [7, 52]. This is generally computed by simulating “feature removal” from the input and observing the change in the predictor’s output [27]. Simulating feature removal from the input is relatively straightforward for saliency methods, for example by setting pixel values to 0. For CoIN systems, this is significantly trickier, as the concept activations $\phi_k(x)$ don’t represent the input $x$ exactly. However, through the decoder, we can evaluate whether the concepts identified as relevant for an input encode information that is important for prediction. We adopt an approach similar to a previous proposal of faithfulness evaluation for audio interpretation systems [52]. Concretely, for a given sample $x$ with activation $\Phi(x)$, predicted class $\hat{c}$ and a threshold $\tau$, we first manipulate $\Phi(x)$ such that $\phi_k(x)$ is set to 0 if $r_k(x) > \tau$, $\forall k \in \{1, \dots, K\}$. That is, we “remove” all concepts with relevance greater than the threshold. This modified version of $\Phi(x)$ is referred to as $\Phi_{rem}(x)$.
To compute faithfulness for a given $x$, denoted by $\text{FF}_x$, we compute the change in probability of the predicted class from the original reconstructed sample $\tilde{x} = G(\Omega(\Phi(x), \Phi'(x)))$ to the new sample $x_{rem} = G(\Omega(\Phi_{rem}(x), \Phi'(x)))$, that is, $\text{FF}_x = g(\tilde{x})_{\hat{c}} - g(x_{rem})_{\hat{c}}$. Ideally, we expect to see a drop in probability ($\text{FF}_x > 0$) if the set of relevant concepts “truly” encodes information relevant for classification. Following [52], we report the median of $\text{FF}_x$ over the test data for different thresholds $0 < \tau < 1$.
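The faithfulness procedure can be sketched as below. The sketch assumes $\Phi(x)$ is a list of $K$ flattened concept maps and that the class probabilities $g(\tilde{x})_{\hat{c}}$ and $g(x_{rem})_{\hat{c}}$ have already been obtained by running the (not shown) networks $\Omega$, $G$ and $g$; all function names are hypothetical.

```python
from statistics import median


def remove_relevant_concepts(phi_x, relevances, tau):
    """Build Phi_rem(x): zero every concept map whose relevance r_k(x) > tau."""
    return [
        [0.0] * len(fmap) if r > tau else list(fmap)
        for fmap, r in zip(phi_x, relevances)
    ]


def faithfulness(prob_reconstructed, prob_removed):
    """FF_x = g(x_tilde)_c_hat - g(x_rem)_c_hat; positive means a probability drop."""
    return prob_reconstructed - prob_removed


def median_faithfulness(probability_pairs):
    """Median FF_x over samples, given (g(x_tilde)_c_hat, g(x_rem)_c_hat) pairs."""
    return median(faithfulness(p, q) for p, q in probability_pairs)
```

In a full pipeline, `remove_relevant_concepts` would be called once per test sample and threshold, and its output decoded and re-classified to obtain the second probability of each pair.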
Consistency of visualization. We expect that, when visualizing a given concept $\phi_k$, a user observes similar semantic modifications across different images. Thus, we hypothesize that if modifying a specific concept activation $\phi_k(x)$ leads to consistent changes for different samples $x$, then the generated outputs for two versions of $\Phi(x)$, one with $\phi_k(x)$ set to a large value and one with $\phi_k(x) = 0$ (all other concept activations unchanged), should be separable in the embedding space of $f$. In other words, embeddings for images with high $\phi_k(x)$ and low $\phi_k(x)$ should be well separated. To compute this metric, we first create a dataset of generated images with two different sets of activations. For each of the training and test sets, this is done by first selecting a set of $N_{cc}$ samples for which $\phi_k$ is highly activated and relevant.
Then we find its maximum activation $\phi_k^{max}$ among these samples, and create two generated outputs for each of the $N_{cc}$ samples: one, $x_k^+$, such that $\phi_k(x_k^+)$ is set to $\lambda \phi_k^{max}$ with $\lambda \geq 1$, and the other, $x_k^-$, such that $\phi_k(x_k^-) = 0$.
The two sets of generated images are then gathered into a single dataset $\mathcal{S}_k = \{(x_k^+, 1), (x_k^-, 0)\}$ such that $|\mathcal{S}_k| = 2N_{cc}$, and we learn a binary classifier $\varphi_k : \mathcal{X} \rightarrow \{0, 1\}$ from pooled feature maps of an intermediate embedding of $f$. We train the binary classifier on sets created from training data, $\mathcal{S}_k^{\text{train}}$, and test it on sets created from test data, $\mathcal{S}_k^{\text{test}}$. Our concept consistency metric $CC_k$ for a given concept $k$ is thus obtained as the accuracy of the binary classifier on $\mathcal{S}_k^{\text{test}}$:

$$CC_k(\mathcal{S}_k^{\text{test}};\varphi_k)=\frac{1}{2N_{cc}}\sum_{(x_k^+,x_k^-)\in\mathcal{S}_k^{\text{test}}}\big[\varphi_k(x_k^+)+(1-\varphi_k(x_k^-))\big] \qquad (6)$$

This performance is tabulated for each concept $k$ for a fixed $\lambda$, and the mean and standard deviation across all concepts are reported.
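Given the binary predictions of $\varphi_k$ on the test pairs, Eq. (6) reduces to a simple accuracy computation, sketched below. The function names are hypothetical, the classifier itself is not shown, and the use of the population standard deviation is an assumption of this sketch since the text does not specify which estimator is used.

```python
from statistics import mean, pstdev


def concept_consistency(preds_plus, preds_minus):
    """CC_k from Eq. (6): accuracy of the binary classifier on the test pairs.

    preds_plus  : predictions (0 or 1) on the N_cc images x_k^+ (true label 1)
    preds_minus : predictions (0 or 1) on the N_cc images x_k^- (true label 0)
    """
    if not preds_plus or len(preds_plus) != len(preds_minus):
        raise ValueError("need two non-empty prediction lists of equal length")
    n_cc = len(preds_plus)
    correct = sum(preds_plus) + sum(1 - p for p in preds_minus)
    return correct / (2 * n_cc)


def summarize_consistency(cc_values):
    """Mean and (population, by assumption) standard deviation over concepts."""
    return mean(cc_values), pstdev(cc_values)
```

A value near 0.5 would indicate that the high- and zero-activation images are barely separable, while a value near 1 indicates consistent, easily separable changes.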

Baselines. Our primary comparison methods are CoINs that learn unsupervised concepts efficiently, even for large-scale images: FLINT [51] and FLAEM [59]. SENN [7] suffers from computational issues for large-scale images, as its loss requires computing the Jacobian of the concept dictionary w.r.t. the input pixels. Thus, for all the metrics we compare with FLINT and FLAEM as our primary baselines. Additionally, for accuracy evaluation, we track the performance of our pretrained classifier $f$. Note that $f$ (ResNet50) is not an interpretable model and is trained entirely for accuracy. Lastly, for faithfulness we compare with a “random” baseline that randomly selects the concepts whose activations are set to 0. Since this selection has no notion of threshold, to make it comparable for a given threshold we randomly select the same number of concepts as our method would remove. Our proposed system is abbreviated as VisCoIN (Visualizable CoIN) in all evaluations.
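The count-matched random baseline described above could look like the following sketch. It assumes $\Phi(x)$ is a list of $K$ flattened concept maps and that concepts are drawn uniformly without replacement; the function name and the optional `rng` argument are hypothetical.

```python
import random


def random_removal(phi_x, relevances, tau, rng=None):
    """Zero as many randomly chosen concepts as the thresholded method would.

    phi_x      : list of K concept maps, each a flat list of floats (assumed layout)
    relevances : local relevances r_k(x), used only to count removals
    tau        : relevance threshold
    rng        : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    # Number of concepts the relevance-based method would remove at this threshold.
    n_remove = sum(1 for r in relevances if r > tau)
    # Uniform sampling without replacement (assumption of this sketch).
    chosen = set(rng.sample(range(len(phi_x)), n_remove))
    return [
        [0.0] * len(fmap) if k in chosen else list(fmap)
        for k, fmap in enumerate(phi_x)
    ]
```

The resulting activations would be decoded and scored in the same way as $\Phi_{rem}(x)$, so that the two faithfulness values differ only in which concepts were removed.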

4.2 Results and discussion

Quantitative results. Tab. 1(a) reports the test accuracy of all the evaluated systems. Our proposed system, VisCoIN, performs competitively with the pretrained $f$, which is considered uninterpretable and is trained purely for performance. It also performs better than the other recent CoIN systems on the more complex classification tasks (CUB-200 and Stanford Cars), which have a large number of classes and diverse images. Metrics quantifying the fidelity of reconstruction on test data are in Tab. 2(a). The other baselines only optimize for pixel-wise reconstruction, and FLINT achieves a lower MSE than VisCoIN. Crucially, however, reconstruction from our method approximates the input data considerably better in terms of perceptual similarity (LPIPS) and overall distribution (FID), which contributes strongly to better viewability. Tab. 2(b) tabulates the median faithfulness $\text{FF}_x$ for the evaluated systems on 1000 random test samples for different thresholds. The performance of the Random baseline being close to 0 even for small thresholds indicates that a random selection of concepts often contains little information relevant for classification of the predicted class. In contrast, concepts identified as relevant by $g$ in VisCoIN tend to encode information about the input that noticeably affects classification. Compared with other CoINs, the faithfulness results are competitive on CelebA-HQ, while on the more complex datasets the concept dictionary in VisCoIN is significantly more faithful than those of FLINT or FLAEM, which do not demonstrate more faithfulness than the Random baseline. Finally, the mean and standard deviation of consistency over all concepts are reported in Tab. 1(b) with $\lambda = 2$. Concepts learnt with VisCoIN demonstrate a higher mean consistency than the baselines. The deviation across concepts is also lower for our method.
We also evaluate consistency with higher values of $\lambda$ and observe increased separation with better classification performance (Appendix 0.C).

Table 1: (a) Accuracy (in %) of interpreter $g$ of CoIN systems, and of the baseline pretrained classifier $f$. (b) Mean and standard deviation for consistency $CC_k$ over all concept functions $\phi_k$ (binary accuracy in %). Higher is better, the best performance is reported in bold, second best in underline.

(a) Accuracy of interpreter $g$

| Dataset | Original-$f$ | FLINT | FLAEM | VisCoIN (Ours) |
|---|---|---|---|---|
| CelebA-HQ | 87.71 | 87.25 | 88.18 | 87.71 |
| CUB-200 | 80.56 | 77.2 | 51.76 | 79.44 |
| Stanford Cars | 82.28 | 75.95 | 50.02 | 79.89 |

(b) Consistency of changes

| Dataset | FLINT | FLAEM | VisCoIN (Ours) |
|---|---|---|---|
| CelebA-HQ | 82.6 ± 22.7 | 57 ± 17.3 | 85.5 ± 13.9 |
| CUB-200 | 72.6 ± 18 | 55.6 ± 13.6 | 85 ± 8.4 |
| Stanford Cars | 70 ± 16.3 | 54.9 ± 13.3 | 82.7 ± 8.3 |

Table 2: (a) Reconstruction quality (MSE, LPIPS and FID) of CoIN systems. Lower is better. (b) Faithfulness (median $\text{FF}_x$) of CoIN systems and random baseline, for different thresholds. Higher is better. Best performance is in bold, second best in underline.

(a) Reconstruction quality

| Dataset | Metric | FLINT | FLAEM | VisCoIN (Ours) |
|---|---|---|---|---|
| CelebA-HQ | MSE | 0.051 | 0.119 | 0.094 |
| CelebA-HQ | LPIPS | 0.533 | 0.688 | 0.405 |
| CelebA-HQ | FID | 30.45 | 39.73 | 8.55 |
| CUB-200 | MSE | 0.113 | 0.217 | 0.161 |
| CUB-200 | LPIPS | 0.712 | 0.75 | 0.545 |
| CUB-200 | FID | 53.16 | 51.15 | 15.85 |
| Stanford Cars | MSE | 0.121 | 0.278 | 0.179 |
| Stanford Cars | LPIPS | 0.697 | 0.734 | 0.488 |
| Stanford Cars | FID | 64.16 | 69.44 | 6.77 |

(b) Faithfulness

| Dataset | Thresh. $\tau$ | Random | FLINT | FLAEM | VisCoIN (Ours) |
|---|---|---|---|---|---|
| CelebA-HQ | 0.1 | 0.03 | 0.254 | 0.091 | 0.267 |
| CelebA-HQ | 0.2 | 0.018 | 0.201 | 0.151 | 0.171 |
| CelebA-HQ | 0.4 | 0.005 | 0.07 | 0.107 | 0.074 |
| CUB-200 | 0.1 | 0.034 | 0.004 | $<10^{-3}$ | 0.251 |
| CUB-200 | 0.2 | 0.007 | 0.002 | $<10^{-3}$ | 0.146 |
| CUB-200 | 0.4 | 0.001 | $<10^{-3}$ | $<10^{-3}$ | 0.044 |
| Stanford Cars | 0.1 | 0.035 | 0.001 | $<10^{-3}$ | 0.161 |
| Stanford Cars | 0.2 | 0.016 | $<10^{-3}$ | $<10^{-3}$ | 0.118 |
| Stanford Cars | 0.4 | 0.002 | $<10^{-3}$ | $<10^{-3}$ | 0.034 |

Qualitative results. Fig. 4 shows visualizations for different class-concept pairs across the three datasets that are determined to have high global relevance $r_{k,c}$ through the predictive structure of $g(x)$, as described in Sec. 3.5. For each class-concept pair, we show two maximum activating training samples for the concept from the corresponding class, the reconstructed input from $\Phi(x)$ ($\lambda = 1$), and the generated output with the concept activation $\phi_k(x)$ scaled by a factor $\lambda = 4$. In all the illustrations, increasing the activation of the concept, i.e. moving from $\lambda = 1$ to $4$, strongly emphasizes a specific concept in the generated output that can be clearly grounded to the input. For instance, increasing the activation of the concept “Red-eye” in Fig. 4(a) increases the size of the red eye of the bird, a key feature of samples from class 25 (“Bronzed-cowbird”). We can also qualitatively verify that the reconstruction is of high enough quality to let us “view” the input sample through the generated output and ground the modifications in the generated output to the input. However, we also observed that learnt concept functions can be prone to modifying more than one high-level feature in the image. For example, in Fig. 4(d), increasing the concept activation increases both “eye-squint” and “beard” in the generated output. A longer discussion about this and other limitations is available in Appendix 0.F.

(a) Concept “Red-eye” in class 25
(b) Concept “Blue upperparts” in class 15
(c) Concept “Makeup” in class Young
(d) Concept “Eye squint” in class Old
(e) Concept “Silver front” in class 2
(f) Concept “Radiator grille” in class 60
Figure 4: Qualitative examples obtained for different concepts and classes on the (a)-(b) CUB-200, (c)-(d) CelebA-HQ, and (e)-(f) Stanford Cars datasets. In each subfigure, the first column corresponds to maximum activated samples $x$ for class-concept pairs with high relevance ($r_{k,c} > 0.5$), the second column to the reconstructed image obtained with the original $\Phi(x)$, and the third column to the image obtained by imputing $4 \times \phi_k(x)$ in $\Phi(x)$.

Ablation study. We ablate multiple aspects of our system, with detailed results in Appendix 0.D. Most notably, we observed a tradeoff induced by the strength of the reconstruction-classification loss (weight $\gamma$). A high $\gamma$ positively impacts the faithfulness but negatively impacts the perceptual similarity of the reconstruction.

5 Conclusion

We introduced a novel architecture for Visualizable CoIN (VisCoIN) that addresses major limitations in visualizing unsupervised concept dictionaries learnt in CoIN systems for large-scale images. Our architecture integrates the visualization process in the pipeline of the model training, by leveraging a pretrained generative model through a concept translator module. This module maps concept representations to the latent space of the fixed generative model. During training, we additionally enforce a viewability property that promotes reconstruction of high-quality images through the generative model. Finally, we defined new evaluation metrics for this novel interpretation pipeline, to better align the evaluation of concept dictionaries with the interpretations provided to a user. Future work includes adapting the design of this system to other types of generative models, or applying it to different visual domains.


Appendix 0.A Design choices considered with the generative model

0.A.1 Desiderata for generative model

We discuss below the reasoning and requirements governing our choice of generative model $G$, i.e., a pretrained StyleGAN2-ADA.

  • The generative model should have the ability to model the input with a low dimensional latent space (compared to input dimensions), since a concept representation is typically much lower dimensional than input. Even so, it must possess a structured latent space so that modifying a concept activation can enable latent traversal for visualization. These requirements discourage the use of invertible neural networks [10, 18], or diffusion models [64, 57] whose structure of latent spaces is still being explored [42, 53]. Furthermore, since we constantly rely on input reconstruction during training, the slow sampling process in diffusion models for large-scale images also hinders their suitability for our use case.

  • One crucial role of the generative model is to generate high-quality natural images. This is essential not only for better reconstruction of large-scale images, but also for grounding any visual modifications in the generated images back to the original input. This encouraged our preference for GANs over Variational Auto-Encoders (VAEs) [39, 54], especially for large-scale images.

  • The generative model also needs to be able to train in limited data regimes if a pretrained generator is not readily available.

Hence, we opted for a competitive, flexible and widely used generative model, StyleGAN2-ADA [31], shown to have demonstrated high-quality generation for various image domains.

0.A.2 Defining concepts in stylespace of pretrained StyleGAN

Previous works using StyleGAN for interpretation deliberately choose to use the Stylespace as concept dictionary [43], which is considered the most disentangled and interpretable latent space of a StyleGAN [72]. However, for a by-design interpretable architecture, there is a crucial issue in using the Stylespace of a pretrained StyleGAN as the concept feature space. Indeed, the size of the Stylespace, even for small-scale images, runs into many thousands, which is much larger than the size of typical dictionaries in CoINs (by a factor of 10 to 100). This directly compromises the interpretable structure of the prediction model [46], which has a much lower number of classes to predict. Moreover, since the model is pretrained for generation, the space doesn’t have sparse activations, further worsening the issue. Note that, by the same reasoning, we considered the extended latent space $\mathcal{W}^+$ unsuitable as a concept dictionary. The final option that remains is using the $\mathcal{W}$ space as our concept dictionary. At first glance it seems an interesting option, since its size is comparable to concept dictionaries in CoINs, and it is also considered better structured than $\mathcal{W}^+$ for latent traversals [73]. However, inverting images to the $\mathcal{W}$ space has been found to be a lot more challenging [1]. Doing so under even more constraints, such as those of concept dictionaries in CoINs, poses an even harder challenge. Hence, we adhered to the CoIN structure to learn a dictionary that is small compared to the size of $\mathcal{W}^+$. Given the constraints imposed on $\Phi$, we designed the decoding part to prioritize better reconstruction.

Appendix 0.B Further system details

0.B.1 Network architectures

The main text already discusses the architectures of $f$ and $G$, which are standard and widely used models, ResNet50 [26] and StyleGAN2-ADA [31] respectively. Both $f$ and $G$ are pretrained and kept fixed during the training of VisCoIN. As part of our training, we train three subnetworks $\Psi$, $\Omega$, $\Theta$. We already described $\Theta$ in the main text as consisting of pooling (maxpool), linear and softmax layers, in that order. Here we describe the architecture of the network $\Psi$, which operates on hidden layers of $f$ and computes the dictionary activations $\Phi(x)$ and the supporting representation $\Phi'(x)$, and the design of the concept translator $\Omega$, which maps $(\Phi(x), \Phi'(x))$ to the extended latent space $\mathcal{W}^{+}$ of StyleGAN to reconstruct $x$.

0.B.1.1 Architecture of $\Psi$

Our architecture of $\Psi$ mostly follows the architecture of $\Psi$ proposed for FLINT [51], which accesses the output of two ResNet18 layers close to the output layer (the output of block 3 and the penultimate layer of block 4). ResNet50 follows a similar structure with 4 blocks. However, its blocks contain 3, 4, 6 and 3 sub-blocks respectively, termed “bottlenecks”. Regarding the set of layers accessed by $\Psi$ for VisCoIN, in addition to the corresponding two layers in ResNet50 (output of block 3 of shape $1024 \times 16 \times 16$, output of the penultimate bottleneck layer in block 4 of shape $2048 \times 8 \times 8$), we also access a third layer for improved reconstruction (output of block 2 of shape $512 \times 32 \times 32$). Each layer output is passed through a convolutional layer and brought to a common shape of $512 \times 8 \times 8$, which is the lowest resolution and the lowest number of feature maps among the accessed layers. We then concatenate all the feature maps and create two branches, one that outputs $\Phi(x)$ and the other that outputs $\Phi'(x)$. The branch computing $\Phi(x)$ is simply two convolutional layers and a pooling layer, yielding an output of shape $K \times 3 \times 3$, where $K$ is the number of concepts and each $\phi_k(x)$ is a convolutional map of size $3 \times 3$. Thus, the total number of elements in each $\phi_k(x)$ is $b = 9$.
The branch computing the support representation $\Phi'(x)$ (unconstrained except by reconstruction) applies two fully connected layers to output the same number of elements as in $\Phi(x)$. The usefulness of the support representation $\Phi'(x)$ is analysed in Tab. 7.
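As a quick sanity check on the shapes quoted above, the following minimal sketch recomputes the block output shapes of a standard ResNet50 for a $256 \times 256$ input. It assumes the usual torchvision-style layout (a stem that reduces the resolution by 4, then blocks of 3, 4, 6 and 3 bottlenecks with 256, 512, 1024 and 2048 output channels). The function names are hypothetical and are not taken from the VisCoIN code.

```python
# Stage layout of a standard ResNet50 (assumed, torchvision-style):
# number of bottlenecks per block and output channels of each block.
BOTTLENECKS = (3, 4, 6, 3)
CHANNELS = (256, 512, 1024, 2048)


def resnet50_block_shapes(input_size=256):
    """Return the (channels, height, width) output shape of each of the 4 blocks."""
    # The stem (stride-2 convolution followed by stride-2 max-pooling) divides
    # the resolution by 4; blocks 2, 3 and 4 each halve it again.
    size = input_size // 4
    shapes = []
    for index, channels in enumerate(CHANNELS):
        if index > 0:
            size //= 2
        shapes.append((channels, size, size))
    return shapes


def conv_layers_up_to_block(block):
    """Count convolutions up to the end of `block` (1-indexed).

    Shortcut (downsampling) projections are not counted: one stem convolution
    plus three convolutions per bottleneck.
    """
    return 1 + 3 * sum(BOTTLENECKS[:block])


shapes = resnet50_block_shapes(256)
print(shapes[1], shapes[2], shapes[3])  # outputs of blocks 2, 3 and 4
print(conv_layers_up_to_block(2))       # index of the last convolution in block 2
```

Under these assumptions, the three layers accessed by $\Psi$ have the shapes stated in the text, and the end of block 2 coincides with the 22nd convolutional layer used later for the consistency metric.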

0.B.1.2 Architecture of $\Omega$

To translate the concept activations $\Phi(x)$ and the supporting representation $\Phi'(x)$ to the extended latent space, we use the concept translator $\Omega$. As described in the main text, it consists of single fully connected layers, one for each latent vector (14 different vectors of size 512 for resolution $256 \times 256$). We use $\Phi'$ to predict a small subset of latent vectors, namely those expected to contain minimal classification information. To decide which latent vectors to control using $\Phi'$, we relied on the findings of [35], which roughly divides the latent vectors in $\mathcal{W}^{+}$ into those controlling the coarse, mid and fine level features of the generated image, with increasing resolution. The first 4 latent vectors typically control the coarse details, such as viewpoint, pose or scale. The next 4 roughly control the mid-level details, such as shape or background, while the rest control the fine-level details, such as textures and colours. We expect the features relevant for classification to be controlled mostly by mid-level and fine-level latent vectors, except possibly at the highest resolution, where very fine-scaled details are controlled. In particular, we predict the first three and last two style vectors using $\Phi'(x)$, as they are expected to contain less relevant information for our concept representation. The rest of the style vectors (9 out of 14 for resolution 256) are predicted from $\Phi(x)$.
Note that most of the style vectors are still predicted using $\Phi(x)$ to preserve its importance in reconstruction. Also note that employing single FC layers associates each concept activation with a direction in the extended latent space.
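The split of style vectors between the two representations can be illustrated with a minimal sketch. It only encodes the assignment rule stated above (first three and last two vectors from the support representation, the rest from the concept activations). The function name and the string labels are hypothetical, and the per-vector fully connected layers themselves are omitted.

```python
def style_vector_sources(num_styles=14, first_from_support=3, last_from_support=2):
    """Label each style vector index with the representation that predicts it.

    "support" stands for Phi'(x) and "concepts" for Phi(x).
    """
    sources = []
    for index in range(num_styles):
        is_coarsest = index < first_from_support
        is_finest = index >= num_styles - last_from_support
        sources.append("support" if (is_coarsest or is_finest) else "concepts")
    return sources


sources = style_vector_sources()
print(sources)
print(sources.count("concepts"), sources.count("support"))
```

For the 14 style vectors used at resolution 256, this yields 9 vectors predicted from $\Phi(x)$ and 5 predicted from $\Phi'(x)$.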

0.B.2 Training details

The steps to train our system on a given dataset can be divided into three modular parts: (1) obtaining a pretrained classifier $f$ with “strong” performance that can provide high-quality source representations to learn from, (2) obtaining a pretrained generator $G$ that approximates well the distribution of the given dataset, and (3) training $g$ with VisCoIN using the pretrained $f$ and $G$. When a pretrained $f$ or $G$ is not easily available, we train them on their respective tasks on the given dataset. Among our 3 datasets, CelebA-HQ [30], CUB-200 [70] and Stanford Cars [41], we easily found a pretrained $G$ only for CelebA-HQ. For all other combinations, we pretrained $f$ and $G$ ourselves. We describe the training details of $f$, $G$ and VisCoIN below:

0.B.2.1 Pretraining $f$

We pretrain $f$ for classification on each of our datasets before using it for training VisCoIN. We use the Adam optimizer [38] with a fixed learning rate of 0.0001 on CUB-200 and 0.001 on CelebA-HQ to train $f$. On Stanford-Cars, we use the SGD optimizer with a starting learning rate of 0.1, decayed by a factor of 0.1 after 30 and 60 epochs. The training is initialized with pretrained weights from ImageNet in each case, and fine-tuned for 10, 30 and 90 epochs on CelebA-HQ, CUB-200 and Stanford-Cars respectively. In all cases, during pretraining, the images are resized to $256 \times 256$. The accuracy of $f$ is already reported in the main paper. The CelebA-HQ and Stanford-Cars experiments were conducted on a single A100 GPU with batch sizes of 64 and 128 respectively, and the CUB-200 experiments on a V100 GPU with a batch size of 32.
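The step schedule used with SGD on Stanford-Cars can be written out as a small sketch. It assumes 0-indexed epochs and that the decay takes effect once 30 and 60 epochs have been completed, which is the usual multi-step convention. The function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(30, 60), factor=0.1):
    """Learning rate at a given 0-indexed epoch under step decay.

    The rate is multiplied by `factor` once for every milestone already reached.
    """
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * factor ** passed


# Learning rate at a few epochs of the 90-epoch Stanford-Cars schedule.
for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay_lr(epoch))
```

With these settings the rate is 0.1 for the first 30 epochs, about 0.01 for the next 30, and about 0.001 for the last 30.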

0.B.2.2 Pretraining $G$

We use a pretrained StyleGAN2-ADA [31] for all our experiments. On CelebA-HQ, we used a pretrained checkpoint available from NVIDIA. For CUB-200 and Stanford-Cars, we pretrain $G$ ourselves. Note that since we want $G$ to generate images entirely from information provided by $\Phi(x)$, we do not use any class labels when training $G$. One can face two separate challenges in these limited data regimes. The first is to train $G$ in a stable fashion. Our solution to this, as discussed previously, is to use the ADA training strategy for StyleGAN2. We also use the official StyleGAN2-ADA PyTorch repository to train our models (https://github.com/NVlabs/stylegan2-ada-pytorch). A second challenge can come from limited training resources, since these models might need to be trained with tens of millions of real/dataset images (“shown” to the discriminator) in order to reach high-quality generation. This could potentially require training with multiple GPUs for multiple days. We address this issue to a reasonable extent by fine-tuning pretrained checkpoints. We utilize the insights from [24] and fine-tune a checkpoint from ImageNet for CUB-200, and from LSUN Cars [79] for Stanford Cars. These specific models were chosen because, among the pretrained checkpoints we had access to, their training datasets were the closest domains.

We train $G$ on a single Tesla V100-32GB GPU with mostly default parameters from the official repository. We only differ in two respects: (1) we learn a mapping function (which learns to predict latent vectors from a Gaussian noise vector) with 2 FC layers, and (2) for Stanford Cars, we observed a collapse in the generation of viewpoints with default training after 600k images shown, so we reduced the strength of the horizontal flip augmentation to 0.1 instead of the default 1.

We use a batch size of 16 for training. The GANs are trained only on the training data. The final pretrained model for CUB-200 is obtained after training the discriminator with 2 million real images (21 hours). The final model for Stanford-Cars was obtained after training with 1.8 million dataset images (18.5 hours). The pretrained models achieve an FID of around 9.4 and 8.3 on CUB-200 and Stanford Cars, respectively.

0.B.2.3 Details for VisCoIN

We train for 50K iterations on CelebA-HQ and 100K iterations on CUB-200 and Stanford-Cars. We use the Adam optimizer with learning rate $0.0001$ for all subnetworks and on all datasets. During training, each batch consists of 8 samples from the training data and 8 synthetic samples randomly generated using $G$. This practice of utilizing synthetic samples from $G$ is fairly common for encoder-based GAN inversion systems [75, 73], and is an additional advantage of using a pretrained $G$ in our system. Note that the use of a fidelity loss with a pretrained $f$, instead of a classification loss on $g(x)$, fits neatly with this, as one cannot obtain any ground-truth annotations for the synthetic samples. The training data samples use random cropping and random horizontal flip augmentation in all cases. All images are normalized to the range $[-1, 1]$ and have resolution $256 \times 256$ for processing. This is the default range and resolution we use for pretraining $f$ and $G$ too. We have already described the architectures of all our components, whether pretrained or trained as part of VisCoIN. We tabulate in Tab. 3 below the hyperparameter values for all our datasets. To limit the number of hyperparameters to tune, we used a fixed $\alpha = 0.5$ (weight for output fidelity loss) and $\beta = 3$ (weight for LPIPS loss) for all datasets. The rationale behind the choice of all hyperparameters is discussed in Appendix 0.D, where we also present ablation studies w.r.t. multiple components.

Table 3: Hyperparameter values for VisCoIN.
Parameter | CelebA-HQ | CUB-200 | Stanford Cars
$K$ (size of concept dictionary $\Phi$) | 64 | 256 | 256
$\alpha$ (weight for output fidelity) | 0.5 | 0.5 | 0.5
$\beta$ (weight for LPIPS) | 3.0 | 3.0 | 3.0
$\gamma$ (weight for reconstruction-classification) | 0.2 | 0.1 | 0.05
$\delta$ (weight for sparsity) | 2 | 0.2 | 0.2
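For readers who wish to reproduce the setup, the values of Tab. 3 together with the training settings above can be collected in one place. The following is only an illustrative sketch: the class name `VisCoINConfig`, its field names and the `CONFIGS` dictionary are hypothetical and do not come from the official code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisCoINConfig:
    """Hypothetical container for the per-dataset settings reported in Tab. 3."""
    num_concepts: int             # K, size of the concept dictionary
    alpha: float                  # weight for output fidelity loss
    beta: float                   # weight for LPIPS loss
    gamma: float                  # weight for reconstruction-classification loss
    delta: float                  # weight for sparsity loss
    iterations: int               # number of training iterations
    learning_rate: float = 1e-4   # Adam, all subnetworks
    real_per_batch: int = 8       # samples from the training data
    synthetic_per_batch: int = 8  # samples generated with G


CONFIGS = {
    "CelebA-HQ": VisCoINConfig(64, 0.5, 3.0, 0.2, 2.0, 50_000),
    "CUB-200": VisCoINConfig(256, 0.5, 3.0, 0.1, 0.2, 100_000),
    "Stanford Cars": VisCoINConfig(256, 0.5, 3.0, 0.05, 0.2, 100_000),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.num_concepts, cfg.gamma, cfg.delta, cfg.iterations)
```

Keeping $\alpha$ and $\beta$ identical across datasets mirrors the choice described above of limiting the number of hyperparameters to tune.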

0.B.3 Evaluation details

Metric computation. The median faithfulness is computed over 1000 random samples from the test data. For consistency, we use $N_{cc} = 100$, $\lambda = 2$, i.e., given any concept $\phi_k$, we extract its 100 most activating samples among samples of the classes it is most relevant for. The constant high activation is twice ($\lambda = 2$) the maximum activation of $\phi_k(x)$ over the pool of 100 samples. Thus, the binary training dataset created from training data consists of 200 samples: 100 “positive” samples with high activation of $\phi_k(x)$ and 100 “negative” samples with zero activation of $\phi_k(x)$. The binary testing dataset contains the same number of samples of each type but is created from test data. The feature maps we extract are the output of the second block of the pretrained $f$ (ResNet50), i.e. the 22nd convolutional layer. The shape of each feature map is $512 \times 32 \times 32$. We pool them across the spatial axes to obtain an embedding of size $512$ for any input sample. The linear classifier we train is a linear SVM. We select its inverse regularization strength $C$ from the set $\{0.01, 0.1, 1.0, 5.0\}$ (a lower value means stronger regularization). The parameter is selected using 5-fold cross validation on the created training data.
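The construction of the binary dataset for one concept can be sketched as follows. This is a simplified illustration under several assumptions: each pooled activation $\phi_k(x)$ is treated as a single scalar per sample, the spatial pooling is taken to be an average (the pooling type is not specified here), and the steps of generating images with $G$ and training the linear SVM are omitted. All function names are hypothetical.

```python
def build_binary_activations(pool_activations, lam=2.0):
    """Return (values, labels) imputed for one concept over a pool of samples.

    Positive samples receive the constant high activation lam * max(pool),
    negative samples receive zero activation.
    """
    high = lam * max(pool_activations)
    n = len(pool_activations)
    values = [high] * n + [0.0] * n
    labels = [1] * n + [0] * n
    return values, labels


def spatial_average_pool(feature_map):
    """Average a C x H x W nested list over H and W, giving C values.

    Average pooling is an assumption made for this sketch.
    """
    pooled = []
    for channel in feature_map:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


# Toy pool of 3 activations (the text uses a pool of 100 samples).
values, labels = build_binary_activations([0.5, 1.5, 1.0], lam=2.0)
print(values, labels)

# Toy 2 x 2 x 2 feature map (the text uses 512 x 32 x 32).
print(spatial_average_pool([[[1, 2], [3, 4]], [[0, 0], [0, 8]]]))
```

With a pool of 100 samples this gives the 200-sample binary dataset described above, and the pooled embeddings are what the linear classifier would be trained on.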

0.B.3.1 Baseline implementations

We utilize the official codebases available for FLINT (https://github.com/jayneelparekh/FLINT) and FLAEM (https://github.com/anirbansarkar-cs/Ante-hoc_Explainability_Concepts) for our baseline implementations. For fairness, we use the same number of concepts for both of them. Since our architecture is closer to FLINT, we update and adapt it to operate in settings similar to ours. We use the same $f$ architecture (ResNet50 instead of ResNet18) for both systems and keep it pretrained and fixed. The $\Psi$ architecture is also similar in that it accesses the same set of hidden layers and has the same structure and depth. For other hyperparameters, we use the default settings they applied earlier for CUB-200. Implementing FLAEM with the same network architecture is more complicated, as it deviates considerably from the proposed architecture, so we mostly use their default settings. In their code, they use a base classifier architecture similar to ResNet101 and use the output of the final conv layer as the concept representation. In both cases, we do not modify the decoders. FLAEM uses a simpler decoder that learns 3 deconvolution layers, while FLINT learns a deeper decoder consisting of transposed convolution layers.

Appendix 0.C Additional experiments

0.C.1 Consistency with higher $\lambda$

Table 4: Mean and standard deviation of consistency $CC_k$ over all concept functions $\phi_k$ (binary accuracy in %), using $\lambda = 3$. Higher is better.
Dataset | FLINT | FLAEM | VisCoIN (Ours)
CelebA-HQ | 83.4 $\pm$ 22.9 | 59.5 $\pm$ 19.2 | 90.9 $\pm$ 13.8
CUB-200 | 76.4 $\pm$ 20.8 | 56.7 $\pm$ 14.1 | 93.4 $\pm$ 6.2
Stanford Cars | 77.3 $\pm$ 19.6 | 55.2 $\pm$ 13.5 | 91.6 $\pm$ 6.3

We present here the results for evaluating consistency with a higher value of $\lambda = 3$, compared to the main text where $\lambda = 2$. Thus, the constant high activation of $\phi_k(x)$ used to generate the “positive” samples of the dataset is increased further. We therefore expect the “separation” in the embedding space to increase, and consequently a higher performance $CC_k$ of the binary classifier $\varphi_k$ for any $k$. The results are presented in Tab. 4. They confirm that emphasizing the concept indeed makes the visual modifications stronger and more consistent. Moreover, we also observe that the increase in consistency of VisCoIN tends to be larger than the increase for other CoIN systems.

0.C.2 Qualitative visualization FID

Table 5: Quantitative evaluation (FID) of the visualization obtained for interpretation. Lower is better.
Dataset | FLINT | VisCoIN (Ours)
CelebA-HQ | 21.12 | 9.83
CUB-200 | 26.55 | 11.71
Stanford Cars | 45.72 | 8

While one can clearly observe, qualitatively, the difficulty of understanding activation-maximization-based visualizations in FLINT, we further support our claim about the unnaturalness of these visualizations compared to those of VisCoIN by computing the distance between the distribution of the visualizations (for FLINT and for VisCoIN) and the original data distribution. Note that we cannot use any reconstruction metrics, as FLINT visualizations do not reconstruct a given input. Instead, they are initialized with a given maximum activating sample and run the optimization procedure of activation maximization to maximally activate a $\phi_k(x)$. We thus compute the FID between the visualizations and the data distribution. For the VisCoIN visualization, we select our most extreme value of $\lambda$. For the FLINT visualization, we follow their implementation and run the input optimization procedure for 1000 iterations. Since the FLINT visualizations are much more expensive to compute (1000 backward passes vs 1 forward pass for VisCoIN), we compute the visualizations for 3 maximum activating samples for each of 400 random relevant class-concept pairs (with $r_{k,c} > 0.5$). Thus, the FIDs are computed between 1200 data samples and the corresponding visualizations.

Appendix 0.D Ablation studies

We present ablation studies for the following components and simultaneously discuss the rationale behind our design choices: (a) effect of the reconstruction-classification loss with weight $\gamma$, (b) role of using a supporting representation $\Phi'(x)$ to assist in reconstruction, (c) selection of the number of concepts $K$, (d) effect of the orthogonality loss, and (e) effect of the fidelity and sparsity loss weights.

We highlight that there is an overarching theme in our various design choices. Most of them are based on shaping the system's suitability for better optimization of perceptual similarity for reconstruction. We constantly aim to achieve better reconstruction without major negative impact on any other property. This is because, for complex datasets (CUB-200, Stanford Cars), the key bottleneck in the design is achieving high-quality reconstruction for viewability.

Table 6: Effect of the weight $\gamma$ of the reconstruction-classification loss in the total training loss, measured by faithfulness, MSE, LPIPS and FID, on CUB-200. Faithfulness is computed with a threshold of 0.2. Setting selected for our experiments: $\gamma = 0.1$.
$\gamma$ | Faithfulness | MSE | LPIPS | FID
0 | 0.001 | 0.142 | 0.52 | 13.11
0.1 | 0.146 | 0.161 | 0.545 | 15.85
0.2 | 0.236 | 0.192 | 0.607 | 8.84
0.5 | 0.24 | 0.209 | 0.634 | 11.54

Table 7: Effect of using the support representation $\Phi'$, measured by MSE, LPIPS and FID, on CUB-200. Setting selected for our experiments: $\Phi'$ used with $K = 256$.
$\Phi'$ | $K$ | MSE | LPIPS | FID
Yes | 256 | 0.161 | 0.545 | 15.85
No | 512 | 0.178 | 0.568 | 13.52
No | 256 | 0.187 | 0.584 | 9.55

0.D.0.1 Reconstruction-classification loss

(a) Original batch of inputs
(b) Corresponding reconstructed images
Figure 5: (a) Final training batch of images shown to the model. (b) Reconstruction obtained using the model trained with $\gamma = 0.2$. Even though FID is better, reconstruction quality is noticeably worse.

Tab. 6 reports the perceptual similarity and faithfulness with threshold $\tau = 0.2$ on test data of CUB-200 for different strengths $\gamma$. Interestingly, it indicates a tradeoff between faithfulness and perceptual similarity. Completely removing this loss heavily impacts faithfulness. However, among the positive values of $\gamma$, our choice was driven strongly by achieving a reconstruction with high enough quality and perceptual similarity to enable effective visualization. Thus, we chose $\gamma = 0.1$. The key reason for this is that perceptual similarity is a much better indicator of viewability. For instance, for $\gamma = 0.2$, even though the FID is better, the reconstruction (LPIPS) is noticeably worse. The reconstruction of the final training batch is shown in Fig. 5 to highlight this issue with high $\gamma$. We thus kept a smaller $\gamma$ for CUB-200 and Stanford Cars, where the reconstruction is more challenging, and a slightly higher value for CelebA-HQ.

0.D.0.2 Use of support representation $\Phi'$

Tab. 7 presents the reconstruction metrics for different concept dictionary sizes, with and without the use of $\Phi'$ in reconstruction. As before, we prioritized the optimization of LPIPS, and using $\Phi'$ helps achieve better reconstruction whilst allowing us to employ a smaller dictionary.

0.D.0.3 Selecting number of concepts $K$

The number of concepts $K$ is an important hyperparameter of the system, and the ablation over different values is given in Tab. 8. While choosing $K$, it is easy to filter out smaller values as they clearly lead to worse reconstruction. However, a higher $K$ should hypothetically improve all the metrics, and thus finding an upper bound for $K$ is more subjective. Our choice was mainly influenced by (1) the observation that increasing from 256 to 512 offered relatively minimal advantage in reconstruction, and (2) the fact that previous methods using supervised concepts train with 312 concepts. Hence, we intended to use a similar dictionary size. Since the Stanford Cars dataset has a comparable number of samples and classes, we used the same number of concepts. For CelebA-HQ, we experimented with a reduced $K$, as there are only two classes and the images are less diverse compared to the other two datasets.

Table 8: Impact of the number of concepts $K$ used in $\Phi$ on accuracy of $g$ (in %), MSE, LPIPS and FID, for CUB-200. Setting selected for our experiments: $K = 256$.
$K$ | Accuracy | MSE | LPIPS | FID
512 | 79.78 | 0.156 | 0.537 | 17.27
256 | 79.44 | 0.161 | 0.545 | 15.85
128 | 79.03 | 0.183 | 0.578 | 8.64
64 | 78.91 | 0.203 | 0.624 | 6.3

0.D.0.4 Orthogonality loss

For the final layer of $\Psi$, we choose a $1 \times 1$ convolutional layer. Thus, the weights/kernels of this layer can be represented as a single matrix whose size is the number of input feature maps times the number of concepts $K$. We encourage the $\ell_2$-normalized columns to be orthogonal, which in turn encourages each $\phi_k(x)$ to be predicted using different feature maps. We report the quantitative effect of this loss in Tab. 9. Incorporating this loss offers a slight advantage in perceptual similarity. However, another key reason we incorporated this loss in our experiments is that, without it, we qualitatively observed a greater propensity of multiple concepts highly relevant for a class to capture a common concept about that class. This loss thus offered a way to encourage different concepts to rely on different feature maps. Note that it only affects the parameters of the final layer of $\Psi$ that outputs $\Phi(x)$, and we do not use any additional hyperparameter for it.
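The exact form of the orthogonality loss is not restated in this appendix. As an illustration only, a common way to encourage orthogonal $\ell_2$-normalized columns is to penalize the squared Frobenius norm of $\hat{W}^{\top}\hat{W} - I$, where $\hat{W}$ is the column-normalized weight matrix. The sketch below implements this assumed formulation in plain Python. The function name is hypothetical, and the loss used in VisCoIN may differ in its details.

```python
import math


def orthogonality_penalty(weight):
    """Squared Frobenius norm of (W_hat^T W_hat - I) for column-normalized W_hat.

    `weight` is a list of rows with shape (num_feature_maps, K), so each
    column holds the kernel weights used to predict one concept.
    This formulation is an assumption made for illustration.
    """
    rows, cols = len(weight), len(weight[0])
    columns = [[weight[r][c] for r in range(rows)] for c in range(cols)]

    # l2-normalize every column (guarding against all-zero columns).
    normalized = []
    for column in columns:
        norm = max(math.sqrt(sum(v * v for v in column)), 1e-12)
        normalized.append([v / norm for v in column])

    # Accumulate squared deviations of the Gram matrix from the identity.
    penalty = 0.0
    for i in range(cols):
        for j in range(cols):
            dot = sum(a * b for a, b in zip(normalized[i], normalized[j]))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


print(orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]]))  # orthogonal columns
print(orthogonality_penalty([[1.0, 1.0], [0.0, 0.0]]))  # identical columns
```

The penalty vanishes when concepts rely on disjoint (orthogonal) combinations of feature maps and grows when two concepts share the same combination, which is the behaviour the paragraph above aims to discourage.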

Table 9: Impact of the orthogonality loss $\mathcal{L}_{orth}$ on accuracy of $g$ (in %), MSE, LPIPS and FID, for CUB-200. Setting selected for our experiments: with $\mathcal{L}_{orth}$.
$\mathcal{L}_{orth}$ | Accuracy | MSE | LPIPS | FID
Yes | 79.44 | 0.161 | 0.545 | 15.85
No | 79.25 | 0.171 | 0.556 | 9.43

0.D.0.5 Other loss weights

We report in Tab. 10 and Tab. 11 the accuracy and reconstruction metrics for different $\alpha$ (weight for output fidelity loss) and $\delta$ (weight for sparsity of activations). A small weight on output fidelity hurts the performance of the system, and a high weight degrades the reconstruction without benefiting the performance much. We found a balance with $\alpha = 0.5$, which we employed for all datasets. A high $\delta$ hurts both the performance and the reconstruction. For CUB-200 and Stanford Cars, since we prioritize reconstruction, we opted for a smaller $\delta = 0.2$, while for CelebA-HQ, since obtaining a good reconstruction was relatively easier, we opted for a higher $\delta = 2$.

We keep a fixed weight $\beta = 3$ for the LPIPS reconstruction loss throughout, for all our datasets and ablations. Even though the system still provides meaningful results for $\beta < 3$, lower values were ruled out mainly because of the importance of a lower perceptual similarity loss, mentioned earlier. Higher values were ruled out because in our initial experiments we observed some instability with high $\beta > 4$. Thus, we fixed $\beta = 3$ for all datasets, which provided a good balance.

Table 10: Effect of the weight $\alpha$ of the output fidelity loss $\mathcal{L}_{of}$, measured on accuracy of $g$ (in %), MSE, LPIPS and FID, for CUB-200. Setting selected for our experiments: $\alpha = 0.5$.
$\alpha$ | Accuracy | MSE | LPIPS | FID
0.1 | 76.9 | 0.162 | 0.545 | 9.37
0.5 | 79.44 | 0.161 | 0.545 | 15.85
2 | 79.63 | 0.187 | 0.586 | 7.58

Table 11: Effect of the weight $\delta$ on sparsity, measured on accuracy of $g$ (in %), MSE, LPIPS and FID, for CUB-200. Setting selected for our experiments: $\delta = 0.2$.
$\delta$ | Accuracy | MSE | LPIPS | FID
0.2 | 79.44 | 0.161 | 0.545 | 15.85
2 | 79.54 | 0.174 | 0.562 | 11.44
20 | 76 | 0.201 | 0.629 | 9.83

Appendix 0.E Additional visualization

We show additional visualizations for different highly relevant class-concept pairs ($r_{k,c} > 0.5$) in Fig. 6. For each class-concept pair, we show the effect of modifying the concept activation on the generated output for three maximum activating training samples. For each sample (far-left), we show the corresponding generated outputs for $\lambda = 0$ (center-left), $\lambda = 1$ (center-right) and $\lambda = 4$ (far-right).

(a) Concept “Red neck” in class 15
(b) Concept “Yellow front” in class 175
(c) Concept “Paleness” in class Old
(d) Concept “Smooth skin” in class Young
(e) Concept “Big headlights + grille bars” in class 194
(f) Concept “Logo” in class 20
Figure 6: Additional qualitative examples obtained for different concepts and classes on (a)-(b) CUB-200, (c)-(d) CelebA-HQ, (e)-(f) Stanford-Cars datasets. In each subfigure, the first column corresponds to maximum activated samples $x$ for class-concept pairs with high relevance ($r_{k,c} > 0.5$), the third column to the reconstructed image obtained with the original $\Phi(x)$, and the second and fourth columns to the images obtained by imputing respectively $\phi_k(x) = 0$ and $4 \times \phi_k(x)$ in $\Phi(x)$.

0.E.0.1 Use of difference images can be useful

(a) Concept “Orange eye” in class 133
(b) Concept “Headlight shine” in class 18
Figure 7: More qualitative examples obtained for different concepts and classes on (a) CUB-200, (b) Stanford-Cars datasets. In each subfigure, the first column corresponds to maximum activated samples $x$ for class-concept pairs with high relevance ($r_{k,c} > 0.5$), the second column to the reconstructed image obtained with the original $\Phi(x)$, the third column to the image obtained by imputing $4 \times \phi_k(x)$ in $\Phi(x)$, and the fourth column shows the difference between the third and second images.

To better highlight the regions of the image impacted by modifying concept activations, one can additionally visualize the difference between two generated outputs. We show visualizations with the difference in the generated outputs in Fig. 7. Again, we selected highly relevant class-concept pairs ($r_{k,c} > 0.5$), and show the effect of modifying the concept activation on the generated output for three maximum activating training samples. For each sample (far-left), we show the corresponding generated outputs for $\lambda = 1$, denoted $\tilde{x}$ (center-left), and $\lambda = 4$, denoted $\tilde{x}'$ (center-right). We then compute and show the difference between the two generated outputs, $\tilde{x}' - \tilde{x}$ (far-right). It is worth noting that this exact strategy might not work as effectively for all types of concepts and can require modifications. For example, if “black feathers” are emphasized by a concept, the increasing “black” color will not be visible in the difference $\tilde{x}' - \tilde{x}$. Instead, one could either visualize the reverse difference $\tilde{x} - \tilde{x}'$ to identify a color being “removed”, or visualize the energy of the difference at each pixel to identify which regions are modified the most.
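The three difference visualizations mentioned above can be sketched in a few lines. This is a minimal illustration on nested lists, assuming images stored as height x width x channel values in $[-1, 1]$ and assuming that negative values are clipped to zero when a difference is displayed (which is why a darkening change is invisible in the forward difference). The function name is hypothetical.

```python
def difference_maps(x_tilde, x_tilde_prime):
    """Return (forward, reverse, energy) maps for two generated images.

    forward: x_tilde_prime - x_tilde, clipped at zero (colors being added)
    reverse: x_tilde - x_tilde_prime, clipped at zero (colors being removed)
    energy:  per-pixel sum of squared channel differences
    Clipping at zero for display is an assumption of this sketch.
    """
    forward, reverse, energy = [], [], []
    for row_a, row_b in zip(x_tilde, x_tilde_prime):
        f_row, r_row, e_row = [], [], []
        for pixel_a, pixel_b in zip(row_a, row_b):
            diff = [b - a for a, b in zip(pixel_a, pixel_b)]
            f_row.append([max(d, 0.0) for d in diff])
            r_row.append([max(-d, 0.0) for d in diff])
            e_row.append(sum(d * d for d in diff))
        forward.append(f_row)
        reverse.append(r_row)
        energy.append(e_row)
    return forward, reverse, energy


# One-pixel toy images: the red channel decreases and the blue channel increases.
x_tilde = [[[0.5, 0.5, 0.5]]]
x_tilde_prime = [[[0.0, 0.5, 1.0]]]
forward, reverse, energy = difference_maps(x_tilde, x_tilde_prime)
print(forward, reverse, energy)
```

In the toy example, the increase in blue appears only in the forward map, the decrease in red appears only in the reverse map, and the energy map registers both, which is why it is the more robust choice for locating modified regions.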

Appendix 0.F Limitations of approach

  • While the idea should in principle be adaptable to other choices of pretrained $G$, provided they are in line with the desiderata discussed in Sec. 0.A.1, doing so requires redesigning $\Omega$ accordingly. The current VisCoIN design in our experiments is only compatible with StyleGAN.

  • As is the case for other CoINs learning unsupervised concepts, the proposed system cannot guarantee that concepts precisely correspond to human concepts and do not encode additional information. However, one interesting aspect is that the visualization process in VisCoIN gives a better handle on identifying any such deviations, as visualizations in other unsupervised CoINs can be much harder to understand at a fine granularity for large-scale images.

  • While VisCoIN significantly upgrades the visualization pipeline over previous CoIN architectures, some of our choices, arising from restrictions to preserve a by-design interpretable predictive structure or from limited computational resources, lead to certain limitations. (1) The choice of using a small dictionary as a bottleneck, under multiple other constraints, leads to worse reconstruction compared to StyleGAN inversion architectures. (2) The choice of using a concept dictionary not defined in the Stylespace results in learnt concepts that are not as well disentangled as those defined in Stylespace. Dictionaries for post-hoc interpretation defined in Stylespace provide more disentangled concepts for interpretation (as in StyleEx [43]), but are significantly larger and thus less suitable for a by-design architecture. Finally, (3) the choice of using a pretrained $G$ improves training time, complexity and reusability, but also implies that the system’s quality is limited by the quality of the pretrained $G$. For instance, if $G$ cannot generate some specific feature, it can be difficult to visualize a concept $\phi_k$ that encodes that feature.

  • For the case of single FC layers in $\Omega$, our visualization process follows linear trajectories in the $\mathcal{W}^{+}$ latent space of StyleGAN when modifying an activation. Recent work has shown that linear trajectories are not necessarily optimal for latent traversals [65].
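To see why the trajectories in the last limitation are linear, consider the following short derivation. It is a sketch under simplifying assumptions that are ours and not stated in the text: $\Omega$ is a single affine map with hypothetical weight matrix $A$ and bias $b$, $\Phi(x)$ is treated as a flattened vector, and $e_k$ denotes the $k$-th standard basis vector.

```latex
\begin{align*}
\Omega(\Phi) &= A\,\Phi + b, \\
\Phi_\lambda(x) &= \Phi(x) + (\lambda - 1)\,\phi_k(x)\,e_k, \\
\Omega(\Phi_\lambda(x)) &= \Omega(\Phi(x)) + (\lambda - 1)\,\phi_k(x)\,A\,e_k .
\end{align*}
```

Under these assumptions, varying $\lambda$ moves the latent code along the fixed direction $A\,e_k$ (the $k$-th column of $A$), with a step size proportional to $\phi_k(x)$. The traversal is therefore a straight line in $\mathcal{W}^{+}$.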

Appendix 0.G Potential negative impacts

Given that understanding neural network decisions is considered a vital feature for many applications employing these models, especially in critical decision-making domains, we expect our method to have an overall positive societal impact. However, in the wrong hands almost any technology can be misused. In the context of VisCoIN, it could be used to provide deceptive interpretations by corrupting its training mechanisms (for example, by training on misleadingly annotated samples or by using deliberately altered pretrained models). Thus, responsible use of the proposed methodology is required to realize its positive impact.