Incorporating Expert Rules into Neural Networks in the Framework of Concept-Based Learning

Andrei V. Konstantinov and Lev V. Utkin
Higher School of Artificial Intelligence Technologies
Peter the Great St.Petersburg Polytechnic University
St.Petersburg, Russia

Abstract

A problem of incorporating the expert rules into machine learning models for extending the concept-based learning is formulated in the paper. It is proposed how to combine logical rules and neural networks predicting the concept probabilities. The first idea behind the combination is to form constraints for a joint probability distribution over all combinations of concept values to satisfy the expert rules. The second idea is to represent a feasible set of probability distributions in the form of a convex polytope and to use its vertices or faces. We provide several approaches for solving the stated problem and for training neural networks which guarantee that the output probabilities of concepts would not violate the expert rules. The solution of the problem can be viewed as a way for combining the inductive and deductive learning. Expert rules are used in a broader sense when any logical function that connects concepts and class labels or just concepts with each other can be regarded as a rule. This feature significantly expands the class of the proposed results. Numerical examples illustrate the approaches. The code of proposed algorithms is publicly available.

Keywords: concept-based learning, expert rules, neural networks, classification, logical function, inductive and deductive learning.

1 Introduction

Concept-based learning (CBL) is a well-established approach to express predictions of a machine learning model in terms of high-level concepts derived from raw features instead of in terms of the raw features themselves [1], i.e. unlike using a pixel-based level, CBL provides a higher level of connection between the image and the decision using concepts. The understanding of the decision becomes straightforward once the interpretation of each concept is determined [2]. High-level concepts can be interpreted as additional expert knowledge. Therefore, CBL aims to integrate expert knowledge or human-like reasoning into machine learning models. In the context of machine learning, incorporating high-level concepts into the learning process may significantly improve the efficiency and accuracy of models. Moreover, high-level concepts may improve the explainability of the machine learning model outputs because they are intuitive to users [3, 4]. On the one hand, concepts can be viewed as high-level features of an object, for example, the color of a bird in a picture. On the other hand, the same concepts can be regarded as complex classification labels. The training of the concept model often requires concept annotations in the form of binary labels, i.e. a concept is “present” or “not present”, for each defined concept and image. However, concepts can be represented in other forms, for example, by means of indices assigned to elements of a concept description set. Note that indices can always be converted into binary concepts.

Many recent CBL approaches consider human-specified concepts as an intermediate step to derive the final predictions. Concept Bottleneck Models (CBMs) [5] and Concept Embedding Models (CEMs) [6] are CBL models that implement these approaches. According to the CBM approach, the CBL model provides the concept prediction in the middle of the decision-making process, i.e. it explicitly predicts the concept labels from images and then predicts the final label based on the concept label predictions. The classifier deriving the final label has access only to the concept representation, and the decision is strongly tied to the concepts [2]. The training procedure can be carried out independently, so that the concept predictor is trained independently of the final label predictor. Another option is to train the whole model in an end-to-end manner.

In contrast to CBLs, we propose quite different models which can roughly be called concept-based models because they use concepts for training like CBLs. However, the main goal of the proposed model is to combine the inductive learning tools (neural networks) with the knowledge-based expert rules of the form “IF …, THEN …”, which are elicited from experts and constructed by means of concepts. For example, the rule from the lung cancer diagnostics can look like “IF Finding is the mass, Contour is spicules, Inclusion is the air cavity, THEN a Diagnosis is the squamous cells carcinoma”. Here concepts are shown in Bold, the concept values are shown in Italic. Another illustrative example is taken from the Bird identification dataset (CUB) [7]: “IF the Head is red, the Back color is black, the Breast color is white, the Crown color is red, the Wing color is white, the Bill shape may be dagger OR all-purpose, THEN the Bird is a red-headed woodpecker”. In the context of medicine, a doctor often diagnoses a disease based on certain rules from a medical handbook. Using such rules is the basis of a doctor’s work. Similar examples of using expert rules can be found in various applied fields, not just in medicine. Therefore, it is important to incorporate the expert rules into machine learning models.

We assume that a set of knowledge-based expert rules specifies how the final labels (consequents) of instances depend on the values of concepts (antecedents). Moreover, we have a partially labeled training set consisting of images with some concept labels and with some targets, which will be called final concepts. The question is how to construct and to train a neural network which deals simultaneously with the concept-based dataset and the knowledge-based rules to provide accurate predictions and to explain the predictions by using concepts. In order to answer this question, we propose two approaches to taking into account the expert rules.

First, we represent the knowledge-based expert rules in the form of logical functions consisting of the disjunction and conjunction operations of indicator functions corresponding to values of concepts. At that, the target value is also represented as a concept. By having the logical functions, we can write constraints for a joint probability distribution over all combinations of concept values to satisfy the expert rules. This allows us to construct and to train a neural network which guarantees that the output probabilities of concepts would not violate the expert rules. We formulate the corresponding feasible set of probability distributions in the form of a convex polytope and analytically find its vertices. By means of the vertices, a point inside the polytope can be constructed that determines marginal probability distributions of concepts. Additionally, we can define the same polytope in H-representation by setting its faces. It is useful because the number of faces can be significantly smaller than the number of vertices. These two ways to define the convex polytope form a base for developing four approaches for constructing neural networks incorporating the expert rules.

An important peculiarity of the proposed models is that the expert rules compensate the incomplete concept labeling of instances in datasets whereas existing concept-based models may lead to overfitting when many images have incomplete concept description. Moreover, the expert rules allow us to compensate a partial availability of targets in the training set.

In sum, we try to incorporate the knowledge-based expert rules into a neural network to improve predictions and their interpretation. The knowledge of expert rules changes probabilities of concepts as well as predictions corresponding to new instances which are classified and explained. The proposed models can be viewed as a way for combining the inductive and deductive learning.

It is also important to point out that the term “expert rules” is used in the proposed models not only to represent the standard “IF …,THEN …” rule, but in a broader sense. Any logical function that connects concepts and class labels or just concepts with each other can be regarded as a rule. This feature significantly expands the class of the proposed results.

The code of the proposed algorithms is available at https://github.com/andruekonst/ecbl.

2 Related work

Concept-based learning models. Many models taking into account various aspects of CBL have been developed by following the works [3, 4]. In particular, the concept attribution approach to tabular learning by providing an idea on how to define concepts over tabular data was proposed in [8]. An algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model was proposed in [9]. A novel concept-based explanation framework named Prototypical Concept-based Explanation is proposed in [10]. An idea of the framework is that it provides differences and similarities to the expected model behavior via prototypes which are representative predictions summarizing the model behavior in condensed fashion. An analysis of correlations between concepts and methods beyond the test accuracy for evaluating concept-based models, with regard to whether a concept has truly been learned by the model, was presented in [11].

Lage et al. [1] claim that many CBL models define concepts which are not inherently interpretable. To overcome this limitation, the authors proposed a CBL model where concepts are fully transparent, thus enabling users to validate whether mental and machine models are aligned. An important peculiarity of the model is that the corresponding learning process incorporates user feedback directly when learning the concept definitions: rather than labeling data, users mark whether specific feature dimensions are relevant to a concept [1]. To relax an assumption that humans are oracles who are always certain and correct in decision making Collins et al. [12] study how existing concept-based models deal with uncertain interventions from humans. An attempt to suppress false positive explanations by providing explanations based on statistically significant concepts was carried out in [13] where the authors guarantee the reliability of the concept-based explanation by controlling an introduced false discovery rate of the selected concepts.

Applications of the concept-based explanation in medicine can be found in [14, 15, 16, 17, 18]. The use of CBL models for time-series data is presented in [19, 20]. Taking into account the anomaly detection problem, a framework for learning a set of concepts that satisfy properties of the out-of-distribution detection and help to explain the out-of-distribution predictions was presented in [21]. In the same work, new metrics for assessing the effectiveness of a particular set of concepts for explaining the out-of-distribution detectors were introduced. The concept-based model for anomaly detection was also considered in [22].

Promises and pitfalls of black-box concept learning models were analyzed in [23]. A review of recent approaches for explaining concepts in neural networks was provided in [24].

Concept bottleneck models. Following [5], many extensions of the CBM model have been proposed. A part of the CBL models belongs to post-hoc models. These models analyze the whole model only after it has finished the training process. The post-hoc CBMs were introduced in [25]. These models convert any pre-trained model into a concept bottleneck model. This conversion can be done by using the concept activation vectors (CAVs) [3] in a special way. A Cooperative-CBM (coop-CBM) model was proposed in [26]. The model aims at addressing the performance gap between CBMs and standard black-box models. It uses an auxiliary loss that facilitates the learning of a rich and expressive concept representation. In order to take into account the ambiguity in the concept predictions, a probabilistic CBM was proposed in [27], which exploits probabilistic embeddings in the concept embedding space and reflects uncertainty in the concept predictions. The model maps an image to the concept embeddings with probabilistic distributions which model concept uncertainties.

According to [6], one of the drawbacks of many CBM models is that they are unable to find optimal compromises between high task accuracy, robust concept-based explanations, and effective interventions on concepts. In order to overcome this drawback, concept embedding models were introduced in [6]. The models can be viewed as a family of CBMs that represents each concept as a supervised vector, i.e. the models learn two embeddings per concept, one for when it is active, and another when it is inactive. Following the concept embedding model [6], the concept bottleneck generative models were introduced in [28], where a concept bottleneck layer is constrained to encode human-understandable features. Raman et al. [29] studied whether CBMs correctly capture the degree of conditional independence across concepts when the concepts are localized spatially and semantically. Margeloiu et al. [30] demonstrated that concepts may not correspond to anything semantically meaningful in input space. A simple procedure allowing to perform concept-based instance-specific interventions on an already trained black-box neural network is proposed in [31].

A novel image representation learning method, called the Concept-based Explainable Image Representation Learning and adept at harnessing human concepts to bolster the semantic richness of image representations was introduced in [32]. A case of implicit knowledge corresponding to the unsupervised (unlabeled) concepts was studied in [33] where the authors propose to adopt self-explaining neural networks to obtain the unsupervised concepts. These networks are composed of an encoder-decoder architecture and a parametrizer estimating weights of each concept. Energy-based CBMs which use a set of neural networks to define the joint energy of candidate (input, concept, class) tuples are introduced in [34].

CBMs were extended in [35] to interactive prediction settings such that the model can query a human collaborator for the label to some concepts. In order to improve the final prediction, an interaction policy was developed in [35] that chooses which concepts should be requested for labeling. An approach to modify CBMs for images segmentation, objects fine classification and tracking was developed in [36]. Two causes of performance disparity between soft (inputs to a label predictor are the concept probabilities) and hard (the label predictor only accepts binary concepts) CBMs were proposed in [37]. They allow hard CBMs to match the predictive performance of soft CBMs without compromising their resilience to leakage. A similar task was solved in [38]. Marconato et al. [39] provided a definition of interpretability in terms of alignment between the representation of a model and an underlying data generation process, and introduced GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment, thus improving the interpretability of the learned concepts.

Drawing inspiration from the CLIP model [40], a foundation model that establishes a shared embedding space for both text and images, the CLIP-based CBMs are proposed in [41].

3 Background

The concept-based classification is a task to construct a potentially black-box classifier and to explain the constructed classifier’s decision process through human-interpretable concepts [13].

We are given a set of inputs $\mathbf{x}_{i}\in\mathcal{X}\subset\mathbb{R}^{d}$ and the corresponding targets $y_{i}\in\mathcal{Y}=\{1,2,...,K\}$. Suppose we are also given a set of $m$ pre-specified concepts $\mathbf{c}_{i}=(c_{i}^{(1)},...,c_{i}^{(m)})\in\mathcal{C}$ such that the training set comprises $(\mathbf{x}_{i},y_{i},\mathbf{c}_{i})$, $i=1,...,n$. Typically, concepts can be represented as a binary $m$-length vector $\mathbf{c}_{i}$ where its $j$-th element $c_{i}^{(j)}$ denotes whether the $j$-th concept is present or not in the input $\mathbf{x}_{i}$.

Generally, the CBL model aims to find how targets depend on concepts and inputs, i.e., to find a function $h:(\mathcal{X},\mathcal{C})\rightarrow\mathcal{Y}$. However, the CBL model can also be used to interpret how predictions depend on concepts corresponding to inputs. In order to solve this task, CBMs have been developed, which learn two mappings, one from the input to the concepts $g:\mathcal{X}\rightarrow\mathcal{C}$, and another from the concepts to the outputs $f:\mathcal{C}\rightarrow\mathcal{Y}$. In this case, the CBM prediction for a new input instance $\mathbf{x}$ is defined as $y=f(g(\mathbf{x}))$.

There are different problem settings in the framework of CBL models [34]:

Prediction: Given the input $\mathbf{x}$, the goal is to predict the class label $y$ and the associated concepts $\mathbf{c}$ to interpret the predicted class label, that is to find the probability $\mathop{\rm Pr}(\mathbf{c},y\mid\mathbf{x})$. Note that CBMs decompose $\mathop{\rm Pr}(\mathbf{c},y\mid\mathbf{x})$ to predict $\mathop{\rm Pr}(\mathbf{c}\mid\mathbf{x})$ and then $\mathop{\rm Pr}(y\mid\mathbf{c})$, in line with the composition $y=f(g(\mathbf{x}))$ above.

Concept Correction/Intervention: Given the input $\mathbf{x}$ and a corrected concept $c^{(k)}$ , predict all the other concepts $c^{(i)}$ , $i=1,...,m,$ $i\neq k$ .

Conditional Interpretations: Given an image with class label $y$ and concept $c^{(j)}$ , what is the probability that the model correctly predicts concept $c^{(k)}$ .

We propose a different problem setting which uses concepts to incorporate the available expert rules of the form: “IF concepts have certain values, THEN the target is equal to a certain class”, into neural networks.

4 Expert rules and concepts

4.1 Problem statement

In contrast to the above definition of concepts as binary variables, which is conventional in many models, we assume that each concept $c^{(i)}$ can take one of $n_{i}$ values denoted as $\mathcal{C}^{(i)}=\{1,...,n_{i}\}$ , $i\in\{0,\dots,m\}$ . We call $\mathcal{C}^{(i)}$ the $i$ -th concept outcome set. The concept vector is $\mathbf{c}=(c^{(0)},c^{(1)},...,c^{(m)})\in\mathcal{C}^{\times}$ , where $\mathcal{C}^{\times}$ is the concept domain produced by the Cartesian product $\mathcal{C}^{\times}=\mathcal{C}^{(0)}\times\dots\times\mathcal{C}^{(m)}$ . We consider the concept $c^{(0)}$ as a special concept corresponding to the target variable $y$ .

Let us introduce the logical literal $h_{i}^{(j)}(\mathbf{c})=\mathbb{I}[c^{(j)}=i]$, which takes the value $1$ if the concept $c^{(j)}$ has the value $i$, and $0$ otherwise. A set of expert rules is formulated as a logical expression $g(\mathbf{c})$ over literals $h_{i}^{(j)}(\mathbf{c})$. Formally, a set of expert rules may be represented as a mapping $g:\mathcal{C}^{\times}\mapsto\{0,1\}$, where $0$ means FALSE, and $1$ means TRUE.

For example, the rule “IF $c^{(1)}=3$ THEN $c^{(2)}=1$ ” is equivalent to the function:

g=(c^{(1)}=3)\rightarrow(c^{(2)}=1)=h_{3}^{(1)}\rightarrow h_{1}^{(2)}=(\lnot h_{3}^{(1)})\vee h_{1}^{(2)},

(1)

where $\rightarrow$ is implication, $\lnot$ is negation, and the argument $\mathbf{c}$ is omitted for brevity. A negated literal is equivalent to the disjunction of the remaining outcomes of the same concept. For example, $\lnot h_{3}^{(1)}\equiv h_{1}^{(1)}\vee h_{2}^{(1)}$ when $\mathcal{C}^{(1)}=\{1,2,3\}$.
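As a minimal sketch, rule (1) and the negation identity can be checked by brute force. The names `literal` and `rule_1`, the dictionary encoding of a concept vector, and the outcome set assumed for $c^{(2)}$ are illustrative choices and are not taken from the authors' code.

```python
from itertools import product

def literal(c, j, i):
    """Indicator h_i^(j)(c): True iff concept j takes value i (c maps concept index -> value)."""
    return c[j] == i

def rule_1(c):
    """Rule (1): IF c^(1) = 3 THEN c^(2) = 1, written as (not h_3^(1)) or h_1^(2)."""
    return (not literal(c, 1, 3)) or literal(c, 2, 1)

# C^(1) = {1, 2, 3} as in the text; C^(2) = {1, 2} is an assumption made for illustration.
outcomes = {1: [1, 2, 3], 2: [1, 2]}

for c1, c2 in product(outcomes[1], outcomes[2]):
    c = {1: c1, 2: c2}
    # A negated literal equals the disjunction of the remaining outcomes of the same concept.
    assert (not literal(c, 1, 3)) == (literal(c, 1, 1) or literal(c, 1, 2))

# Concept vectors (c1, c2) that violate the rule.
violations = [(c1, c2) for c1, c2 in product(outcomes[1], outcomes[2])
              if not rule_1({1: c1, 2: c2})]
```

Under these assumptions the only violating combination is $c^{(1)}=3$, $c^{(2)}=2$, which is exactly the case the implication forbids.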

Let $X$ be a random vector taking values from $\mathcal{X}\subset\mathbb{R}^{a}$ . We introduce $C^{(0)},\dots,C^{(m)}$ as discrete random variables for the concepts, taking values from $\mathcal{C}^{(0)},\dots,\mathcal{C}^{(m)}$ , respectively. The concept random vector is $C=(C^{(0)},C^{(1)},...,C^{(m)})$ .

The task is to estimate marginal concept probabilities $\mathop{\rm Pr}(C^{(i)}=j\mid X=\mathbf{x})$ conditioned on an input object $\mathbf{x}$ (for example, image) for the $i$ -th concept and outcome $j$ under condition of the expert rules.

For brevity, we denote the marginal concept probabilities as vectors:

p^{(i)}=\left(\mathop{\rm Pr}(C^{(i)}=1\mid\mathbf{x}),\dots,\mathop{\rm Pr}(C^{(i)}=n_{i}\mid\mathbf{x})\right),

(2)

which is standard for classification. Since the marginal probabilities are not independent because of the expert rules, we cannot estimate $p^{(i)}$ separately. Instead, the goal is to find the concatenated vector of marginal concept probabilities:

\overline{p}=[(p^{(0)})^{T};\dots;(p^{(m)})^{T}]^{T}.

(3)

4.2 Joint distribution

First, we consider the conditional joint probability distribution over concepts: $\mathop{\rm Pr}(C=\mathbf{c}|X=\mathbf{x})$ . Since all concept random variables are discrete with finite outcome sets, all possible vectors $\mathbf{c}$ can be enumerated. The total number of distinct concept vectors is $t=\prod_{i=0}^{m}n_{i}$ . Let us introduce a function $\mathcal{M}:\mathcal{C}^{\times}\mapsto\{1,\dots,t\}$ , that maps the concept vector to its number, and its inverse function $\mathcal{M}^{-1}$ that maps the number to the concept vector. We define the joint probability distribution vector $\pi=(\pi_{1},\dots,\pi_{t})$ as follows:

\forall\mathbf{c}\in\mathcal{C}^{\times},\ \ \pi_{\mathcal{M}(\mathbf{c})}=\mathop{\rm Pr}(C=\mathbf{c}|X=\mathbf{x}).

(4)
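The enumeration map $\mathcal{M}$ and its inverse can be sketched as follows. The paper does not fix a particular ordering, so the lexicographic order used here (last concept varying fastest) is an assumption, as are the function name `build_index_maps` and the illustrative outcome set sizes.

```python
from itertools import product

def build_index_maps(sizes):
    """Enumerate all concept vectors for sizes = (n_0, ..., n_m).

    Returns (M, M_inv): M maps a concept vector (tuple) to its 1-based number,
    M_inv maps the number back to the concept vector.
    """
    states = list(product(*(range(1, n + 1) for n in sizes)))
    M = {c: k for k, c in enumerate(states, start=1)}
    M_inv = {k: c for c, k in M.items()}
    return M, M_inv

# Illustrative sizes: n_0 = 2, n_1 = 2, n_2 = 3, so t = 2 * 2 * 3 = 12.
M, M_inv = build_index_maps((2, 2, 3))
t = len(M)
```

With this ordering, the joint distribution vector $\pi$ is simply a list of length $t$ whose $k$-th entry is the probability of the concept vector `M_inv[k]`.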

4.3 Reduction to a linear constraint

The joint probability distribution is constrained to satisfy the expert rules formulated as $g$ , therefore:

\mathop{\rm Pr}(g(C)=1)=1.

(5)

Let us consider a binary mask of admissible states, a vector $u=(u_{1},\dots,u_{t})\in\{0,1\}^{t}$ , whose components are equal to $1$ if and only if the rules are satisfied for the corresponding concept vectors:

u_{k}=g(\mathcal{M}^{-1}(k)),\ \ k\in\{1,...,t\}.

(6)

The constraint on the joint probability distribution (5) can be reformulated as a set of the equality constraints on components of $\pi$ , corresponding to invalid states (that violate the rules):

\pi_{k}=0,\ \ k\in\{i\in\{1,\dots,t\}\mid g(\mathcal{M}^{-1}(i))=0\}.

(7)

The vector $\pi$ also obeys the following probability distribution constraints:

\pi\in\Delta_{t},

(8)

so the system (7) can be rewritten in more compact form as one linear equality constraint:

u^{T}\pi=1.

(9)

Here $\Delta_{t}$ is the unit simplex of dimension $t$ .

To construct a feasible solution, a neural network should be able to generate probability distributions satisfying the equality constraint (9).

For illustrative purposes, we consider a toy example with two classes of birds: a red-headed woodpecker ($c^{(0)}=1$) and a European green woodpecker ($c^{(0)}=2$). The corresponding concepts describing the birds are head ($c^{(1)}$), bill shape ($c^{(2)}$), and wing color ($c^{(3)}$). Concept head can take values: red ($c^{(1)}=1$), green ($c^{(1)}=2$); concept bill shape can take values: chisel ($c^{(2)}=1$), dagger ($c^{(2)}=2$), all-purpose ($c^{(2)}=3$). Here $m=2$, $n_{0}=2$, $n_{1}=2$, $n_{2}=3$. Suppose the following expert rule is available:

\begin{array}[c]{llll}\text{IF}&\text{ {head}}&\text{is}&\text{\emph{red}}\\ \text{AND}&\text{ {bill shape}}&\text{is}&\text{\emph{dagger} OR \emph{all-purpose},}\\ \text{THEN}&\text{ {bird}}&\text{is}&\text{\emph{red-headed woodpecker},}\end{array}

(10)

\begin{array}[c]{ll}\text{IF}&c^{(1)}=1\text{ AND }c^{(2)}\in\{2,3\},\\ \text{THEN}&c^{(0)}=1\text{.}\end{array}

(11)

This rule can be represented as follows:

\begin{aligned}g(\mathbf{c})&=\left(h_{1}^{(1)}\wedge\left(h_{2}^{(2)}\vee h_{3}^{(2)}\right)\right)\rightarrow h_{1}^{(0)}\\ &=h_{1}^{(0)}\vee\lnot\left(h_{1}^{(1)}\wedge\left(h_{2}^{(2)}\vee h_{3}^{(2)}\right)\right)\\ &=h_{1}^{(0)}\vee\lnot h_{1}^{(1)}\vee\left(\lnot h_{2}^{(2)}\wedge\lnot h_{3}^{(2)}\right)\\ &=h_{1}^{(0)}\vee h_{2}^{(1)}\vee h_{1}^{(2)}.\end{aligned}

(12)

Table 1 shows in bold all possible combinations of the concept values satisfying the above expert rule. It can be seen from Table 1 that there holds:

\pi_{1}+\pi_{2}+\pi_{3}+\pi_{4}+\pi_{5}+\pi_{6}+\pi_{7}+\pi_{10}+\pi_{11}+\pi_{12}=1.

(13)

Table 1: An example of combinations of the concept values and the corresponding probabilities

\begin{array}[c]{c|cccccccccccc}
c^{(0)}&\mathbf{1}&\mathbf{1}&\mathbf{1}&\mathbf{1}&\mathbf{1}&\mathbf{1}&\mathbf{2}&2&2&\mathbf{2}&\mathbf{2}&\mathbf{2}\\
c^{(1)}&\mathbf{1}&\mathbf{1}&\mathbf{1}&\mathbf{2}&\mathbf{2}&\mathbf{2}&\mathbf{1}&1&1&\mathbf{2}&\mathbf{2}&\mathbf{2}\\
c^{(2)}&\mathbf{1}&\mathbf{2}&\mathbf{3}&\mathbf{1}&\mathbf{2}&\mathbf{3}&\mathbf{1}&2&3&\mathbf{1}&\mathbf{2}&\mathbf{3}\\
\pi&\pi_{1}&\pi_{2}&\pi_{3}&\pi_{4}&\pi_{5}&\pi_{6}&\pi_{7}&\pi_{8}&\pi_{9}&\pi_{10}&\pi_{11}&\pi_{12}
\end{array}
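The mask of admissible states for the toy rule can be reproduced with a short script. This is only a sketch: the tuple encoding `(c0, c1, c2)` and the lexicographic numbering of states are assumptions chosen to match the column order of Table 1.

```python
from itertools import product

def g(c):
    """Rule (12): h_1^(0) or h_2^(1) or h_1^(2), with c = (c0, c1, c2)."""
    c0, c1, c2 = c
    return c0 == 1 or c1 == 2 or c2 == 1

sizes = (2, 2, 3)  # (n_0, n_1, n_2)
states = list(product(*(range(1, n + 1) for n in sizes)))

# Binary mask u from (6) and the 1-based numbers of the admissible states.
u = [int(g(c)) for c in states]
admissible = [k for k, u_k in enumerate(u, start=1) if u_k == 1]
# admissible == [1, 2, 3, 4, 5, 6, 7, 10, 11, 12]
```

The two excluded states, 8 and 9, are the green woodpecker label combined with a red head and a dagger or all-purpose bill, which is what the rule prohibits.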

4.4 Probabilistic approach

An alternative approach is to consider the joint probability under condition that expert rules are satisfied:

\mathop{\rm Pr}\left(C=\mathbf{c}\mid g(C)=1\right)=\frac{\mathop{\rm Pr}\left(C=\mathbf{c},~{}g(C)=1\right)}{\mathop{\rm Pr}\left(g(C)=1\right)}.

(14)

The probability of conjunction $\mathop{\rm Pr}\left(C=\mathbf{c},~{}g(C)=1\right)$ can be expanded as

\mathop{\rm Pr}(C=\mathbf{c},g(C)=1)=\mathop{\rm Pr}(C=\mathbf{c})\cdot\mathop{\rm Pr}(g(C)=1|C=\mathbf{c}),

(15)

where $\mathop{\rm Pr}(C=\mathbf{c})$ is a predicted probability distribution that may not satisfy the expert rules.

The posterior probability depends on the deterministic function $g$ , thus

\mathop{\rm Pr}(g(C)=1\mid C=\mathbf{c})=\begin{cases}1,&\text{if}~{}g(\mathbf{c})=1\\ 0,&\text{else}.\end{cases}

(16)

Let us find the probability $\mathop{\rm Pr}(g(C)=1)$ as follows:

\mathop{\rm Pr}(g(C)=1)=\sum_{\mathbf{k}\in\mathcal{C}^{\times}}\mathbb{I}[g(\mathbf{k})]\cdot\mathop{\rm Pr}(C=\mathbf{k}).

(17)

The prior joint concept probability can be calculated as an output of a neural network. Let us denote the output $\widehat{\pi}$ :

\mathop{\rm Pr}(C=\mathbf{c})=\widehat{\pi}_{\mathcal{M}(\mathbf{c})}.

(18)

Hence, there holds

\pi_{j}=\frac{\widehat{\pi}_{j}\cdot\mathbb{I}[g(\mathcal{M}^{-1}(j))]}{\sum_{k=1}^{t}\widehat{\pi}_{k}\cdot\mathbb{I}[g(\mathcal{M}^{-1}(k))]}.

(19)

In sum, this approach produces a mask for admissible probabilities $\pi_{1},...,\pi_{t}$. The mask is the same as $u$ in (9). This approach requires predicting all components of the joint probability distribution by applying a neural network, while only admissible states will be used. It is therefore quite redundant, which motivates us not to use the approach in practice. However, it is flexible and can be useful, for example, when multiple conflicting expert rules are applied to different parts of one dataset, or when the choice of expert rules depends on the input.
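The masking and renormalization in (19) can be sketched in a few lines. The logits below are hypothetical stand-ins for a network output, and the helper names `softmax` and `condition_on_rules` are illustrative; the mask corresponds to the toy rule, where states 8 and 9 are inadmissible.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def condition_on_rules(pi_hat, u):
    """Eq. (19): zero out inadmissible states and renormalize the remaining mass."""
    masked = [p * u_k for p, u_k in zip(pi_hat, u)]
    z = sum(masked)
    if z == 0.0:
        raise ValueError("no probability mass on admissible states")
    return [v / z for v in masked]

# Hypothetical logits over t = 12 joint states (not produced by a real network).
logits = [0.1 * k for k in range(12)]
u = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]  # mask of the toy rule
pi = condition_on_rules(softmax(logits), u)
```

The network still has to produce all $t$ outputs even though some of them are discarded, which illustrates the redundancy noted above.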

4.5 Solution set as a polytope

Instead of modelling the whole joint distribution, one can estimate probabilities only of the states that satisfy the expert rules. We call this reduced vector the “admissible probability vector” and denote it by $\tilde{\pi}=(\tilde{\pi}_{1},\dots,\tilde{\pi}_{d})$, where $d$ is the number of admissible states. There are no additional constraints on $\tilde{\pi}$, therefore it belongs to the unit simplex of dimension $d$.

The joint probability vector can be found as

\pi=W\tilde{\pi},

(20)

where $W\in\{0,1\}^{t\times d}$ is a placement matrix which contains exactly one non-zero element in each column and at most one non-zero element in each row. It can be interpreted as an arrangement of the entries of $\tilde{\pi}$ into the admissible components of $\pi$.

The desired marginal concept probabilities can be calculated as a summation over relevant entries of $\pi$ :

\mathop{\rm Pr}(C^{(i)}=j\mid\mathbf{x})=\sum_{\mathbf{c}\in\mathcal{C}^{\times}}\pi_{\mathcal{M}(\mathbf{c})}\cdot\mathbb{I}[c^{(i)}=j].

(21)

So each vector $p^{(i)}$ can be represented as

p^{(i)}=B^{(i)}~{}\pi=B^{(i)}W\tilde{\pi},

(22)

where $B_{jk}^{(i)}=\mathbb{I}\left[\left(\mathcal{M}^{-1}(k)\right)^{(i)}=j\right]$ .

Then every solution satisfying rules can be expressed as

\begin{gathered}\overline{p}=V\tilde{\pi},\\ V=[(B^{(0)}W)^{T};\dots;(B^{(m)}W)^{T}]^{T},\\ \tilde{\pi}\succcurlyeq 0,~{}\mathbf{1}^{T}\tilde{\pi}=1.\end{gathered}

(23)

Therefore the solution set is by definition a polytope whose vertices are columns of the matrix $V$ .

In practice, this approach can be used as follows. First, all possible concept vectors are enumerated and passed through the expert rules, represented as $g$, to obtain the mapping $W$ to the admissible states. Then $\tilde{\pi}$ is obtained as an output of a neural network after applying the softmax operation, that is, $\tilde{\pi}$ is formally a discrete probability distribution. The solution is then a point inside the polytope, which is calculated by weighting the columns of the vertex matrix $V$ with the elements of $\tilde{\pi}$.

The main disadvantage of this approach is that we have to pre-calculate and store all vertices, while their number can be enormous for complex logical expressions on multiple concepts.
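The procedure can be sketched in plain Python for the toy rule. This is an illustration under stated assumptions rather than the authors' implementation: the logits are hypothetical, the matrix $V$ is stored column-wise as a list of lists, and each column is built directly as the concatenation of one-hot marginals of an admissible state, which is what the product $B^{(i)}W$ yields.

```python
from itertools import product
import math

sizes = (2, 2, 3)  # (n_0, n_1, n_2) of the toy example

def g(c):
    """Toy rule (12) on c = (c0, c1, c2)."""
    c0, c1, c2 = c
    return c0 == 1 or c1 == 2 or c2 == 1

# Step 1: enumerate concept vectors and keep the admissible ones.
states = list(product(*(range(1, n + 1) for n in sizes)))
admissible = [c for c in states if g(c)]
d = len(admissible)

def vertex(c):
    """One column of V: concatenated one-hot encodings of the concept values in c."""
    col = []
    for value, n in zip(c, sizes):
        col.extend(1.0 if value == j else 0.0 for j in range(1, n + 1))
    return col

V_columns = [vertex(c) for c in admissible]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def marginals(logits):
    """Steps 2-3: pi_tilde = softmax(logits), then p_bar = V @ pi_tilde."""
    pi_tilde = softmax(logits)
    return [sum(w * col[r] for w, col in zip(pi_tilde, V_columns))
            for r in range(sum(sizes))]

p_bar = marginals([0.3 * k for k in range(d)])  # hypothetical network logits
p0, p1, p2 = p_bar[0:2], p_bar[2:4], p_bar[4:7]
```

Whatever logits are supplied, the resulting marginals are a convex combination of the columns of $V$, so they cannot violate the rule; the cost is that all $d$ columns must be stored.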

4.6 Linear inequality system

The solution set is a polytope with a possibly high number of vertices. We discover an alternative definition of this set as an intersection of half-spaces, determined by hyperplanes, the so-called H-representation. Instead of converting from V- to H-representation after calculating vertices, we construct a linear inequality system from scratch based only on $g$ .

The algorithm for constructing the linear inequality system consists of three steps:

1. Convert $g$ into conjunctive normal form (CNF).

2. Map each clause to exactly one linear inequality.

3. Unify (intersect) clause’s linear inequalities into one system, along with the probability distribution constraints on marginals.

The first step is NP-complete in a general case, but can be solved in reasonable time in many practical applications. We stress here that the algorithm is appropriate when the expert rules can be converted to a compact CNF. Let the rule set be represented in CNF as a conjunction of clauses:

\mathcal{R}=\mathcal{K}_{1}\wedge\dots\wedge\mathcal{K}_{b}.

(24)

Each clause $\mathcal{K}_{l}$ is a disjunction of literals:

\mathcal{K}_{l}=\bigvee_{q\in K_{l}}l_{q},

(25)

where $q=(i,j)$ and $l_{q}=h_{j}^{(i)}$ for some set $K_{l}$ of (concept, outcome) index pairs of the clause. Note that if the clause contains negated literals such as $\lnot h_{j}^{(i)}$, they are replaced with the disjunction of the literals of the remaining outcomes of the $i$-th concept:

\lnot h_{j}^{(i)}\equiv\bigvee_{k\in\mathcal{C}^{(i)}\setminus\{j\}}h_{k}^{(i)}.

(26)

In what follows, we assume that this transformation has been applied to all clauses, so that no clause contains negations.
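The elimination of negated literals in (26) can be sketched as a small helper. The representation of a literal as a triple `(i, j, negated)` and the function names are assumptions introduced for illustration only.

```python
def expand_negation(i, j, sizes):
    """Eq. (26): replace not h_j^(i) by the positive literals {(i, k) : k != j}."""
    return {(i, k) for k in range(1, sizes[i] + 1) if k != j}

def positive_clause(literals, sizes):
    """Turn a clause given as (i, j, negated) triples into a negation-free index set K."""
    K = set()
    for i, j, negated in literals:
        K |= expand_negation(i, j, sizes) if negated else {(i, j)}
    return K

# Rule (1): (not h_3^(1)) or h_1^(2); C^(1) = {1, 2, 3}, and C^(2) = {1, 2} is assumed.
sizes = {1: 3, 2: 2}
K = positive_clause([(1, 3, True), (2, 1, False)], sizes)
# K == {(1, 1), (1, 2), (2, 1)}
```

Representing each clause as a set of index pairs makes the later mapping to a linear inequality a direct lookup of the corresponding marginal probabilities.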

The next steps can be applied to any boolean expressions in CNF. Let us describe them in detail. First, consider a clause $\mathcal{K}$ , whose literals are like $h_{j}^{(i)}$ . The goal is to find appropriate marginals $\overline{p}$ , for which some corresponding joint probability distribution satisfying the clause exists. Formally, given a clause of the form:

\mathcal{K}=\bigvee_{q\in K}l_{q},

(27)

the sum of probabilities of dependent literals has a lower bound:

\mathop{\rm Pr}(\mathcal{K})=1\implies\sum_{q}\mathop{\rm Pr}(l_{q})\geq\mathop{\rm Pr}(\mathcal{K})=1.

(28)

This property can be used to formulate a constraint of marginal probabilities. For the clause $\mathcal{K}$ , the constraint is

\sum_{(i,j)\in K}p_{j}^{(i)}\geq 1.

(29)

The lower bound is tight in a sense that there exist feasible marginals that sum up to one. Moreover, no other constraints (except the probability distribution constraints on marginals) restrict the feasible set. It means that, for any solution $\overline{p}$ satisfying the constraint (the sum of different marginals probabilities is not less than $1$ ), there exists at least one joint probability distribution matching rules that have the same marginal probability distributions.
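As a sketch of how a single clause becomes one linear inequality on $\overline{p}$, the snippet below builds the coefficient row for the toy clause (12) and checks two candidate marginal vectors. The helper names `clause_row` and `satisfies` and the two test vectors are illustrative assumptions.

```python
def clause_row(K, sizes):
    """Coefficient row a such that a . p_bar = sum over (i, j) in K of p_j^(i).

    sizes = (n_0, ..., n_m); p_bar is the concatenation of the marginal vectors.
    """
    offsets = [0]
    for n in sizes[:-1]:
        offsets.append(offsets[-1] + n)
    row = [0.0] * sum(sizes)
    for i, j in K:
        row[offsets[i] + (j - 1)] = 1.0
    return row

def satisfies(row, p_bar, tol=1e-9):
    """Check the clause constraint (29): a . p_bar >= 1."""
    return sum(a * p for a, p in zip(row, p_bar)) >= 1.0 - tol

sizes = (2, 2, 3)
K = {(0, 1), (1, 2), (2, 1)}   # clause (12): h_1^(0) or h_2^(1) or h_1^(2)
row = clause_row(K, sizes)     # [1, 0, 0, 1, 1, 0, 0]

# One-hot marginals of the state (1, 1, 2): the bound holds with equality.
p_ok = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
# Green woodpecker with a red head and a non-chisel bill: the constraint is violated.
p_bad = [0.0, 1.0, 1.0, 0.0, 0.0, 0.5, 0.5]
```

The first vector shows that the lower bound can be attained, while the second places all of its mass on combinations that the rule excludes.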

We apply this transformation to each clause to obtain one linear inequality constraint per clause. The last step is to merge the constraints of the individual clauses into one system. Fortunately, this can be achieved by simply intersecting the obtained inequalities.

Theorem 1.

Given a rule $\mathcal{R}$ in CNF, consisting of $b$ clauses:

\mathcal{R}=\mathcal{K}_{1}\wedge\dots\wedge\mathcal{K}_{b},

(30)

the constraint on the expert rule probability is equivalent to the intersection of the clause constraints:

\mathop{\rm Pr}(\mathcal{R})=1\iff\begin{cases}\mathop{\rm Pr}(\mathcal{K}_{1})=1,\\ \dots\\ \mathop{\rm Pr}(\mathcal{K}_{b})=1.\end{cases}

(31)

Proof.

Necessity.

\mathop{\rm Pr}(\mathcal{R})=1\implies\forall j\in\overline{1,b}~{}~{}\mathop{\rm Pr}(\mathcal{K}_{j})\cdot\mathop{\rm Pr}(\bigwedge_{i\neq j}\mathcal{K}_{i}|\mathcal{K}_{j})=1\implies\mathop{\rm Pr}(\mathcal{K}_{j})=1.

Sufficiency.

\forall j\in\overline{1,b}~{}~{}\mathop{\rm Pr}(\mathcal{K}_{j})=1\iff\mathop{\rm Pr}(\overline{\mathcal{K}_{j}})=0\implies

(32)

0=\sum_{i=1}^{b}\mathop{\rm Pr}(\overline{\mathcal{K}_{i}})\geq\mathop{\rm Pr}(\overline{\mathcal{K}_{1}}\lor\dots\lor\overline{\mathcal{K}_{b}})=\mathop{\rm Pr}(\overline{\mathcal{K}_{1}\land\dots\land\mathcal{K}_{b}})\geq 0\implies

(33)

0=\mathop{\rm Pr}(\overline{\mathcal{K}_{1}\land\dots\land\mathcal{K}_{b}})=1-\mathop{\rm Pr}(\mathcal{K}_{1}\land\dots\land\mathcal{K}_{b})\implies

(34)

\mathop{\rm Pr}(\bigwedge_{i=1}^{b}\mathcal{K}_{i})=1.

(35)

∎

Finally, according to the proposition and the theorem, when only marginal distributions are of interest, any set of expert rules can be equivalently transformed into a system of linear inequalities of the form:

\begin{cases}\sum_{(i,j)\in K_{1}}p_{j}^{(i)}\geq 1\\ \dots\\ \sum_{(i,j)\in K_{b}}p_{j}^{(i)}\geq 1,\end{cases}

(36)

where $K_{r}$ is the set of concept-value pairs of the $r$-th clause of the rule set in CNF. Let us denote the matrix of this system by $\hat{A}$, so that the system becomes:

\hat{A}~{}\overline{p}\geq\mathbf{1}.

(37)
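A minimal sketch of how the matrix $\hat{A}$ of (37) could be assembled from negation-free clauses is given below. The column ordering of $\overline{p}$ (concepts in increasing index, outcomes in increasing value), the helper `satisfies_rules`, and the example data are assumptions made for illustration.

```python
outcomes = {0: [1, 2], 1: [1, 2], 2: [1, 2, 3]}   # hypothetical outcome sets C^(i)
clauses = [{(0, 2), (1, 1), (2, 1), (2, 3)}]      # one set K_r of (i, j) pairs per clause

# Column index of p_j^(i) inside the concatenated vector p-bar.
column = {}
for i in sorted(outcomes):
    for j in outcomes[i]:
        column[(i, j)] = len(column)
s = len(column)

# Row r of A-hat has ones exactly in the columns of the literals of clause r.
A_hat = [[1.0 if pair in K else 0.0 for pair in column] for K in clauses]


def satisfies_rules(p_bar, tol=1e-9):
    """Check the system A-hat p-bar >= 1 componentwise."""
    return all(sum(a * p for a, p in zip(row, p_bar)) >= 1.0 - tol for row in A_hat)


# p-bar = (p^(0), p^(1), p^(2)). Claiming c1 = 2 and c2 = 2 with certainty while
# giving y = 2 only probability 0.3 contradicts the rule (c1=2 and c2=2) -> (y=2).
bad = [0.7, 0.3, 0.0, 1.0, 0.0, 1.0, 0.0]
good = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
print(satisfies_rules(bad), satisfies_rules(good))  # False True
```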

The entire system of constraints for the marginal distributions also includes the probability distribution constraints:

p^{(i)}\in\Delta_{n_{i}},

(38)

and becomes:

A~{}\overline{p}\geq b,

(39)

Q~{}\overline{p}=\mathbf{1},

(40)

where

A=\begin{bmatrix}\hat{A}\\ E\end{bmatrix},~{}~{}b=\begin{bmatrix}\mathbf{1}\\ \mathbf{0}\end{bmatrix},

(41)

Q=\begin{bmatrix}1\dots 1&0\dots 0&0\dots 0\\ 0\dots 0&1\dots 1&0\dots 0\\ &\ddots&\\ 0\dots 0&0\dots 0&1\dots 1\end{bmatrix}\in\mathbb{R}^{(m+1)\times s}.

(42)

Note that the dimensionality of the linear inequality system can be reduced from $s$ to $s-(m+1)$, together with the elimination of the equality constraints, because exactly one entry per concept can be removed by using the condition $p_{1}^{(i)}=1-\sum_{j\neq 1}p_{j}^{(i)}$ for each of the $m+1$ concepts.
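The elimination of the equality constraints can be sketched as follows. The code builds $Q$ for hypothetical outcome-set sizes and substitutes $p_{1}^{(i)}=1-\sum_{j\neq 1}p_{j}^{(i)}$ into one inequality row; the function name `reduce_row` and the example row are assumptions made for illustration.

```python
sizes = [2, 2, 3]            # hypothetical n_i, one entry per concept
s = sum(sizes)

# Q has one row per concept, with ones over that concept's block of p-bar.
Q, start = [], 0
for n in sizes:
    Q.append([1.0 if start <= c < start + n else 0.0 for c in range(s)])
    start += n


def reduce_row(row, rhs):
    """Eliminate the first entry of every concept block from  row . p >= rhs.

    Substituting p_1 = 1 - sum_{j != 1} p_j turns a coefficient a_j into
    a_j - a_1 and moves the constant a_1 to the right-hand side.
    """
    reduced, new_rhs, start = [], rhs, 0
    for n in sizes:
        first = row[start]
        reduced.extend(row[start + k] - first for k in range(1, n))
        new_rhs -= first
        start += n
    return reduced, new_rhs


# Inequality p_2^(0) + p_1^(1) + p_1^(2) + p_3^(2) >= 1 in the full coordinates.
row, rhs = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0], 1.0
reduced_row, reduced_rhs = reduce_row(row, rhs)
print(reduced_row, reduced_rhs)  # [1.0, -1.0, -1.0, 0.0] -1.0
```

The reduced row reads $p_{2}^{(0)}-p_{2}^{(1)}-p_{2}^{(2)}\geq-1$ and has one entry fewer per concept. The nonnegativity rows of $E$ can be reduced with the same function, which turns each eliminated constraint $p_{1}^{(i)}\geq 0$ into $\sum_{j\neq 1}p_{j}^{(i)}\leq 1$.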

5 Neural network and expert rules

Consider a partially labeled multi-label multi-class classification problem. The training dataset $\mathcal{D}$ consists of $N$ tuples $(\mathbf{x}_{j},\zeta_{j}^{(0)},\dots,\zeta_{j}^{(m)})$, where $\zeta_{j}^{(i)}\in\mathcal{C}^{(i)}\cup\{-1\}$ is the label of the $i$-th concept of the $j$-th training observation. The label $\zeta_{j}^{(i)}$ is set to $-1$ if it is unknown for the $j$-th observation. The main target $y_{j}$ is denoted as the $0$-th concept $\zeta_{j}^{(0)}$ and can also be partially labeled, that is, $\zeta_{j}^{(0)}$ can be equal to $-1$.

We consider neural networks that simultaneously predict the marginal distribution for each concept. For the $i$-th concept, the prediction mapping is denoted as

f^{(i)}:\mathcal{X}\mapsto\Delta_{n_{i}},

(43)

however, a neural network $f_{\theta}$ with parameters $\theta$ computes $f_{\theta}^{(0)},\dots,f_{\theta}^{(m)}$ simultaneously.

The training loss function is a weighted sum of the masked cross-entropy losses over the concepts:

\mathcal{L}=\sum_{i=0}^{m}\omega^{(i)}\cdot\mathcal{L}^{(i)},

(44)

\mathcal{L}^{(i)}=-\sum_{j=1}^{N}\mathbb{I}\left[\zeta_{j}^{(i)}\neq-1\right]\cdot\left(\sum_{k=1}^{n_{i}}\mathbb{I}[\zeta_{j}^{(i)}=k]\cdot\log{(f^{(i)}(\mathbf{x}_{j}))_{k}}\right),

(45)

where $\omega^{(i)}$ is the weight of the $i$-th concept loss, which by default is the reciprocal of the number of samples labeled for the $i$-th concept. The inner sum in parentheses is the log-likelihood of the label of the $j$-th observation for the $i$-th concept.
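The loss (44)-(45) can be sketched in plain Python as follows. The nested-list data layout, the function name `masked_cross_entropy`, and the tiny example are assumptions made for illustration; an actual implementation would typically operate on tensors of a deep learning framework.

```python
import math


def masked_cross_entropy(probs, labels, weights=None):
    """Weighted sum of masked cross-entropy losses over concepts.

    probs[j][i]  -- predicted distribution f^(i)(x_j), a list over outcomes 1..n_i
    labels[j][i] -- label in {1, ..., n_i}, or -1 if the label is unknown
    weights[i]   -- omega^(i); defaults to 1 / (number of labeled samples of concept i)
    """
    num_concepts = len(labels[0])
    total = 0.0
    for i in range(num_concepts):
        labeled = [j for j in range(len(labels)) if labels[j][i] != -1]
        if not labeled:
            continue  # a concept without any labels contributes nothing
        omega = weights[i] if weights is not None else 1.0 / len(labeled)
        # Outcomes are numbered from 1, list positions from 0.
        loss_i = -sum(math.log(probs[j][i][labels[j][i] - 1]) for j in labeled)
        total += omega * loss_i
    return total


probs = [
    [[0.5, 0.5], [0.25, 0.75]],   # sample 1: f^(0)(x_1), f^(1)(x_1)
    [[0.5, 0.5], [1.0, 0.0]],     # sample 2: f^(0)(x_2), f^(1)(x_2)
]
labels = [[1, 2], [-1, 1]]         # the main target of sample 2 is unknown
loss = masked_cross_entropy(probs, labels)
```

Unlabeled entries are skipped before the logarithm is evaluated, so a zero predicted probability for an unknown label cannot produce an infinite loss.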

The neural network consists of classical layers (fully connected or convolutional, depending on the problem being solved) followed by one special final layer, which we call a concept head. This layer maps the embedding produced by the preceding layers to the marginal class probabilities and guarantees that they satisfy the expert rules for any input. The approaches described above can be used to construct different concept heads. Let us consider them in detail.

5.1 Base head

The simplest concept head computes the prior joint probability vector $\widehat{\pi}$ by applying softmax to a linear layer that maps the embedding to $t$ logits. Then, to satisfy the expert rules, the posterior joint distribution conditioned on the rules is obtained by multiplying $\widehat{\pi}$ elementwise by the mask $u$ of admissible states and renormalizing the result. This approach spends extra computation on the probabilities of invalid states.
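A minimal sketch of the Base head for a small hypothetical problem is shown below. The linear layer is omitted (the function receives the $t$ logits directly), and the outcome sets, the example rule, and the name `base_head` are assumptions made for illustration.

```python
import itertools
import math

outcomes = [[1, 2], [1, 2], [1, 2, 3]]            # hypothetical C^(0), C^(1), C^(2)
states = list(itertools.product(*outcomes))       # all t joint outcomes


def rule(state):
    """Example expert rule: (c1 = 2 and c2 = 2) -> (y = 2)."""
    y, c1, c2 = state
    return not (c1 == 2 and c2 == 2) or y == 2


u = [1.0 if rule(state) else 0.0 for state in states]   # mask of admissible states


def base_head(logits):
    """Map t logits to the posterior joint vector and the per-concept marginals."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    prior = [e / total for e in exps]                     # prior joint vector
    masked = [p * m for p, m in zip(prior, u)]            # zero out invalid states
    norm = sum(masked)
    posterior = [p / norm for p in masked]                # renormalize
    marginals = [[sum(p for s, p in zip(states, posterior) if s[i] == v)
                  for v in values]
                 for i, values in enumerate(outcomes)]
    return posterior, marginals


posterior, marginals = base_head([0.0] * len(states))
```

With uniform logits, the eleven admissible states share the probability mass equally, so the marginal probability of $y=2$ is $6/11$.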

The Base Head approach is schematically shown in the top picture in Fig. 1.

5.2 Admissible state head

The idea of the Admissible State head (AS-head) is to estimate the probability distribution on only admissible outcomes $\tilde{\pi}$ instead of the whole joint probability vector $\pi$. The full joint probability vector is constructed using (20), and the marginals are calculated by using the matrix multiplication (22). In a software implementation, we obtain marginal probabilities $p^{(i)}$ by reshaping the flat vector $\pi$ into a multi-dimensional array with dimensions $(n_{0},\dots,n_{m})$, and then summing up over all dimensions except $i$. Formally, it is equivalent to (22).

To obtain the placement matrix $W$ and the dimension $d$ of $\tilde{\pi}$ before constructing such a layer, one needs to enumerate all possible joint outcomes and evaluate the expert rules on them. This is not a problem for a small number of concepts, when the enumeration can be carried out in a reasonable time. However, it becomes a problem when the number of concepts and their outcomes is large. Therefore, additional optimizations are required in this case.
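The offline enumeration and the forward pass of the AS-head can be sketched as follows. The list `placement` plays the role of the placement matrix $W$ by storing the positions of admissible states in the full joint vector. The outcome sets, the example rule, and the name `as_head` are hypothetical.

```python
import itertools
import math

outcomes = [[1, 2], [1, 2], [1, 2, 3]]             # hypothetical C^(0), C^(1), C^(2)
states = list(itertools.product(*outcomes))        # all t joint outcomes, fixed order


def rule(state):
    """Example expert rule: (c1 = 2 and c2 = 2) -> (y = 2)."""
    y, c1, c2 = state
    return not (c1 == 2 and c2 == 2) or y == 2


# Offline enumeration: positions of admissible states inside the full joint vector.
placement = [k for k, state in enumerate(states) if rule(state)]
t, d = len(states), len(placement)


def as_head(logits):
    """Map d logits to the full joint vector pi and the list of marginals p^(i)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    pi = [0.0] * t
    for k, e in zip(placement, exps):
        pi[k] = e / total                           # scatter the d admissible probabilities
    # Summing pi over all states with a fixed value of concept i gives p^(i).
    marginals = [[sum(p for s, p in zip(states, pi) if s[i] == v) for v in values]
                 for i, values in enumerate(outcomes)]
    return pi, marginals


pi, marginals = as_head([0.1 * k for k in range(d)])
```

Only $d$ logits are needed here instead of $t$, but the enumeration over all $t$ states remains, which is the cost discussed in the paragraph above.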

Figure 1: The Base Head (the top picture) and the AS-Head (the bottom picture) approaches

The bottom picture in Fig. 1 illustrates the Admissible State Head approach.

5.3 Vertex-based head

The placement matrix $W$ is of dimension $t\times d$ , where $t$ is the total number of joint outcomes and $d$ is the number of valid states. Even if $d$ remains small, $t$ grows exponentially with the number of concepts $m$ . Since the main goal is to compute only marginal distributions using (23), one can precompute the polytope vertex matrix $V$ of dimension $s\times d$ , where $s=\sum_{i=0}^{m}n_{i}$ is the dimension of the vector $\overline{p}$ of concatenated marginal probabilities.

The computation of $V$ is carried out offline, before training the neural network. In this case, the total number of operations at training and inference time is reduced, whether dense or sparse matrices are used for storing $W$ and $V$. If dense matrices are used, then the number of operations is reduced exponentially w.r.t. $m$ compared to the approaches described above.

The left picture in Fig. 2 illustrates a scheme of the Vertex Head approach. The neural network generates the vector $\widetilde{\pi}$ in the unit simplex, which is then multiplied by the matrix $V$ of polytope vertices.
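A sketch of the vertex-based head is given below, assuming that each column of $V$ is the concatenation of the one-hot encodings of the concept values of one admissible state. This construction, the outcome sets, the example rule, and the names `one_hot_concat` and `vertex_head` are assumptions made for illustration.

```python
import itertools
import math

outcomes = [[1, 2], [1, 2], [1, 2, 3]]             # hypothetical C^(0), C^(1), C^(2)


def rule(state):
    """Example expert rule: (c1 = 2 and c2 = 2) -> (y = 2)."""
    y, c1, c2 = state
    return not (c1 == 2 and c2 == 2) or y == 2


admissible = [s for s in itertools.product(*outcomes) if rule(s)]


def one_hot_concat(state):
    """Concatenated one-hot encodings of the concept values of one joint state."""
    vec = []
    for value, values in zip(state, outcomes):
        vec.extend(1.0 if v == value else 0.0 for v in values)
    return vec


# Offline step: d columns of V, each of length s = sum of n_i.
V_columns = [one_hot_concat(state) for state in admissible]
s_dim = len(V_columns[0])


def vertex_head(logits):
    """Map d logits to p-bar = V pi-tilde, a convex combination of the columns of V."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    pi_tilde = [e / total for e in exps]
    return [sum(w * col[r] for w, col in zip(pi_tilde, V_columns)) for r in range(s_dim)]


p_bar = vertex_head([0.0] * len(admissible))
```

The full joint vector of length $t$ is never formed at inference. Every output is a convex combination of points that satisfy the rule, so the clause constraint holds by construction.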

Another way to implement the vertex-based approach is to first construct the system of linear inequalities that defines the polytope of feasible solutions $\overline{p}$. Then the vertices $V$ can be found via conversion from the H-representation to the V-representation [42].

5.4 Constraints head

Alternatively, the solution $\overline{p}$ can be generated inside the polytope defined by the linear inequality constraints, i.e., in the H-representation, without computing the vertices. The vertex-based head generates a point in a polytope by weighting its vertices with softmax outputs. However, there are other methods for this problem, considered in [43]. These methods require as input one feasible point that lies strictly inside the polytope, and they map an input embedding to a point of the polytope. Their computational complexity is $O(\nu\cdot\mu)$, where $\nu$ is the number of inequality constraints and $\mu$ is the output dimension.

The right picture in Fig. 2 illustrates a scheme of the Constraints Head approach.

The main advantage of this approach is that it can be applied even when the heads described above fail, namely, when the number of admissible states is enormous, or when the matrix $V$ cannot be computed in a reasonable time or, together with the weight matrix of the preceding linear layer, requires too much memory. The reason is that the number of inequality constraints $b$ (the number of clauses in CNF) can be relatively small compared to the number of admissible joint outcomes $d$. The computational complexity of constructing a point inside the polytope defined by the inequality constraints (39) at inference is $O\left((b+s)\cdot s\right)$, provided that at least one point inside the polytope has been found in advance. Therefore, it is reasonable to apply this layer only when $d\gg b+s$.
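One possible way to generate a point of the polytope from an interior point is to move along a ray until the boundary is reached and then shrink the step. The sketch below follows this idea; it is written in the spirit of the methods of [43] but is not claimed to reproduce them. The interior point `p0`, the block-centering of the direction (used here to preserve the equality constraints), and all names are assumptions made for illustration.

```python
import math

sizes = [2, 2, 3]                                    # hypothetical n_0, n_1, n_2
# A = [A-hat; E] and b = [1; 0] as in (41), for the single clause
# p_2^(0) + p_1^(1) + p_1^(2) + p_3^(2) >= 1.
A = [[0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
A += [[1.0 if r == c else 0.0 for c in range(7)] for r in range(7)]
b = [1.0] + [0.0] * 7
p0 = [0.4, 0.6, 0.5, 0.5, 1 / 3, 1 / 3, 1 / 3]       # strictly feasible point found in advance


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def constraints_head(raw_direction, raw_scale):
    """Map network outputs to a point of {p : A p >= b, Q p = 1}."""
    # Center each concept block so that moving along the direction keeps Q p = 1.
    direction, start = [], 0
    for n in sizes:
        block = raw_direction[start:start + n]
        mean = sum(block) / n
        direction.extend(x - mean for x in block)
        start += n
    # Largest step for which every inequality still holds: O((b + s) * s) operations.
    alpha_max = math.inf
    for row, rhs in zip(A, b):
        slope = dot(row, direction)
        if slope < -1e-12:
            alpha_max = min(alpha_max, (dot(row, p0) - rhs) / -slope)
    if math.isinf(alpha_max):
        return list(p0)                               # zero direction: stay at p0
    scale = 1.0 / (1.0 + math.exp(-raw_scale))        # sigmoid keeps the step in (0, 1)
    return [p + scale * alpha_max * d for p, d in zip(p0, direction)]


p = constraints_head([1.0, -2.0, 0.3, 0.1, 0.0, 5.0, -1.0], 0.7)
```

Since the polytope is convex and `p0` is feasible, every point on the segment between `p0` and the boundary point is feasible as well, so the output satisfies the constraints for any network output.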

5.5 State space reduction

The number of enumeration steps, vertices, or half-spaces, depending on which of the approaches described above is chosen, can be strongly reduced by considering only the concept values that are mentioned in the expert rules. To this end, we first set aside the concepts that are not mentioned in the rules at all and use separate classification heads for them.

Other concepts, which are partially mentioned in the expert rules, can be compressed. For this, we consider all values that were not mentioned in the rules and incorporate them into a special $0$ outcome. For example, if rules are based only on literals $h_{1}^{(1)},h_{2}^{(3)},h_{4}^{(3)}$ , and $\mathcal{C}^{(1)}=\{1,2\},\mathcal{C}^{(3)}=\{1,2,3,4\}$ , then the compressed outcome sets will be $\tilde{\mathcal{C}}^{(1)}=\{0,1\},\tilde{\mathcal{C}}^{(3)}=\{0,2,4\}$ . Here for the first concept $c^{(1)}$ , the outcome $2$ is not used in the rules, therefore, we replace it with the artificial $0$ outcome. For the third concept $c^{(3)}$ , outcomes $\{1,3\}$ are not used and replaced with the outcome $0$ .

After this transformation, the total number of joint distribution outcomes is much smaller than the initial one. To infer the probabilities of the compressed outcomes, we construct additional classification heads that estimate the probability distribution over the outcomes replaced with $0$, for each concept that is partially mentioned in the rules and has at least two replaced values. The final probability of a replaced outcome is calculated as the product of the probability of the $0$ outcome and the estimated probability of that outcome among the replaced ones. In the above example, we have:

\mathop{\rm Pr}(C^{(3)}=1)=\mathop{\rm Pr}_{\text{comp}}(C^{(3)}=0)\cdot\mathop{\rm Pr}_{\text{repl}}(C^{(3)}=1),

(46)

\mathop{\rm Pr}(C^{(3)}=2)=\mathop{\rm Pr}_{\text{comp}}(C^{(3)}=2),

(47)

\mathop{\rm Pr}(C^{(3)}=3)=\mathop{\rm Pr}_{\text{comp}}(C^{(3)}=0)\cdot\mathop{\rm Pr}_{\text{repl}}(C^{(3)}=3),

(48)

\mathop{\rm Pr}(C^{(3)}=4)=\mathop{\rm Pr}_{\text{comp}}(C^{(3)}=4),

(49)

where $\mathop{\rm Pr}_{\text{comp}}$ are the compressed probabilities, $\mathop{\rm Pr}_{\text{repl}}$ are probabilities of replaced values estimated by the separate classification heads.
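The reconstruction (46)-(49) can be sketched as a small function. The dictionary representation, the function name `expand_concept`, and the numerical values are assumptions made for illustration.

```python
def expand_concept(compressed, replaced, original_outcomes):
    """Recover the distribution over the original outcomes of one concept.

    compressed        -- Pr_comp over the compressed outcome set (key 0 is the merged outcome)
    replaced          -- Pr_repl over the outcomes that were merged into 0
    original_outcomes -- the original outcome set C^(i)
    """
    full = {}
    for value in original_outcomes:
        if value in compressed:
            full[value] = compressed[value]                 # outcome used in the rules
        else:
            full[value] = compressed[0] * replaced[value]   # outcome merged into 0
    return full


# Third concept of the example: outcomes 1 and 3 were merged into the 0 outcome.
pr_comp = {0: 0.5, 2: 0.25, 4: 0.25}
pr_repl = {1: 0.75, 3: 0.25}
pr_full = expand_concept(pr_comp, pr_repl, [1, 2, 3, 4])
print(pr_full)  # {1: 0.375, 2: 0.25, 3: 0.125, 4: 0.25}
```

The result is a valid distribution because the replaced outcomes share exactly the mass of the $0$ outcome.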

6 Numerical experiments

6.1 A toy example

The first example is entirely synthetic. Two-dimensional input vectors are randomly generated in the square $[0,1]^{2}$ . The concepts are: $y=c^{(0)}\in\mathcal{C}^{(0)}=\{1,2\}$ , $\mathcal{C}^{(1)}=\{1,2\}$ , $\mathcal{C}^{(2)}=\{1,2,3\}$ , $\mathcal{C}^{(3)}=\{1,2,3\}$ .

Concepts used in the example are illustrated in Fig. 3. In particular, the first concept $c^{(1)}$ is equal to $2$ to the right of the middle and to $1$ to the left. The second concept $c^{(2)}$ is equal to $1$ in the bottom horizontal stripe of height $0.25$, to $2$ in the middle horizontal stripe of height $0.5$, and to $3$ in the top stripe. The third concept is similar to the second but has an “L” shape, that is, it depends on both features $x^{(1)}$ and $x^{(2)}$. The main target $y$ is equal to $2$ if and only if $c^{(1)}=c^{(2)}=2$.

For example, let us consider the rule $g(\mathbf{c})=((c^{(1)}=2)\wedge(c^{(2)}=2))\rightarrow(y=2)$ which is correct for the dataset. It can be expressed with literals as $g(\mathbf{c})=(h_{2}^{(1)}\wedge h_{2}^{(2)})\rightarrow h_{2}^{(0)}$ .
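The following sketch generates toy data for $c^{(1)}$, $c^{(2)}$, and $y$ and checks that the rule $g$ holds on every sample. It assumes that $x^{(1)}$ is the horizontal coordinate, that the stripe boundaries are at $0.25$ and $0.75$, and that boundary points are assigned as coded; the third concept is omitted because its exact shape is given only in Fig. 3. All names are hypothetical.

```python
import random


def concepts(x1, x2):
    """Return (y, c1, c2) for a point of the unit square."""
    c1 = 2 if x1 > 0.5 else 1                            # right or left half
    c2 = 1 if x2 < 0.25 else (2 if x2 < 0.75 else 3)     # bottom, middle, top stripe
    y = 2 if (c1 == 2 and c2 == 2) else 1
    return y, c1, c2


def g(y, c1, c2):
    """Rule (c1 = 2 and c2 = 2) -> (y = 2), written as a Boolean function."""
    return not (c1 == 2 and c2 == 2) or y == 2


rng = random.Random(0)
samples = [concepts(rng.random(), rng.random()) for _ in range(1000)]
rule_holds = all(g(*sample) for sample in samples)
print(rule_holds)  # True
```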

First, we train the model on the dataset with no $y$ labels at all, i.e. $\zeta_{k}^{(0)}=-1$ for each sample $k$. The model is able to reconstruct $y$ by using the rule. The predicted probabilities are shown in Fig. 4. Second, we train the model on the same dataset but with the “if and only if” rule $g(\mathbf{c})=h_{2}^{(0)}\leftrightarrow(h_{2}^{(1)}\wedge h_{2}^{(2)})$. The predicted probabilities are shown in Fig. 5. Note that the shape of $\mathop{\rm Pr}(y=2)$ is fully determined by the predicted $\mathop{\rm Pr}(c^{(1)}=2)$ and $\mathop{\rm Pr}(c^{(2)}=2)$, while $\mathop{\rm Pr}(c^{(3)}=2)$ does not affect $y$.

6.2 Multi-label MNIST

The second example uses an artificial dataset constructed from a part of the real labeled handwritten digit image dataset MNIST, consisting of 5000 randomly selected images. We consider digit labels as the concept $c^{(1)}$ (not the main target). An additional synthetic feature is the digit color. Each digit is randomly colored white or blue, corresponding to $c^{(2)}=1$ or $c^{(2)}=2$, respectively. The main label $y=c^{(0)}$ is defined as follows: odd blue digits and even white digits are assigned to $y=1$, and the rest are assigned to $y=2$.
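The labeling rule for the main target can be written as a short function. The sketch takes the raw digit value from 0 to 9 rather than a concept index, which is an assumption, as is the function name `main_label`.

```python
def main_label(digit, color):
    """Main target y for a colored digit.

    digit -- digit value 0..9
    color -- c^(2): 1 for white, 2 for blue
    """
    odd = digit % 2 == 1
    blue = color == 2
    # Odd blue digits and even white digits get y = 1, all other digits get y = 2.
    return 1 if odd == blue else 2


print(main_label(3, 2), main_label(4, 1), main_label(3, 1), main_label(4, 2))  # 1 1 2 2
```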

We compare three types of heads: the AS-Head, the Joint Distribution Head (an AS-Head without expert rules), and Independent Classification Heads without rules. The first head is given the same rule that was used for constructing the labels $y$. The second head predicts a joint distribution and then calculates the marginals from it. The third head, a baseline, is a plain multi-label multi-class classification head that treats the concepts as independent targets and predicts the probabilities for each of them separately. Results are shown in Fig. 6, which presents the $F_{1}$ measure as a function of the labeled data ratio. The higher the labeled data fraction, the easier the task; therefore, all three curves almost coincide at the fraction of $0.5$. However, when only a small amount of labeled data is available, the proposed AS-Head, which takes the expert rules into account, performs systematically better than the baseline.

7 Conclusion

We formulated the problem of incorporating the expert rules into machine learning models for extending the concept-based learning for the first time. We have shown how to combine logical rules and neural networks predicting the concept probabilities. Several approaches have been proposed in order to solve the stated problem and to implement the idea behind the use of expert rules in machine learning. The proposed approaches have provided ways of constructing and training a neural network which guarantees that the output probabilities of concepts satisfy the expert rules. These ways are based on representing sets of possible probability distributions of concepts by means of a convex polytope such that the use of its vertices or its faces allows the neural network to generate a probability distribution of concepts satisfying the expert rules.

It has been illustrated by the numerical examples that the proposed models compensate for incomplete concept labeling of instances in datasets. Moreover, the expert rules allow us to compensate for partial availability of targets in the training set.

The proposed approaches have different computational complexity depending on the number of concepts, the number of concept values, and the number of training examples.

The general problem of incorporating the expert rules into neural networks has been solved. However, there are problems where an application of the proposed models can significantly improve the accuracy and interpretability of models. In particular, it is interesting to adapt the proposed models to CBMs which also deal with concepts and can be combined with the expert rules. This is an important direction for further research.

References

[1] Isaac Lage and Finale Doshi-Velez. Learning interpretable concept-based models with human feedback. arXiv:2012.02898, Dec 2020.
[2] Bowen Wang, Liangzhi Li, Yuta Nakashima, and Hajime Nagahara. Learning bottleneck concepts in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10962–10971, 2023.
[3] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
[4] Chih-Kuan Yeh, Been Kim, Sercan Arik, Chun-Liang Li, Tomas Pfister, and Pradeep Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In Advances in neural information processing systems, volume 33, pages 20554–20565, 2020.
[5] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338–5348. PMLR, 2020.
[6] Mateo Espinosa Zarlenga, Pietro Barbiero, Gabriele Ciravegna, Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, Zohreh Shams, Frederic Precioso, Stefano Melacci, Adrian Weller, et al. Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems, volume 35, pages 21400–21413, 2022.
[7] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
[8] Varsha Pendyala and Jihye Choi. Concept-based explanations for tabular data. arXiv:2209.05690, Sep 2022.
[9] Yangqing Jia, Joshua T Abbott, Joseph L Austerweil, Tom Griffiths, and Trevor Darrell. Visual concept learning: Combining machine vision and bayesian generalization on concept hierarchies. In Advances in Neural Information Processing Systems, volume 26, pages 1–9, 2013.
[10] Maximilian Dreyer, Reduan Achtibat, Wojciech Samek, and Sebastian Lapuschkin. Understanding the (extra-) ordinary: Validating deep model decisions with prototypical concept-based explanations. arXiv:2311.16681, Nov 2023.
[11] Lena Heidemann, Maureen Monnet, and Karsten Roscher. Concept correlation and its effects on concept-based models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4780–4788, 2023.
[12] Katherine Maeve Collins, Matthew Barker, Mateo Espinosa Zarlenga, Naveen Raman, Umang Bhatt, Mateja Jamnik, Ilia Sucholutsky, Adrian Weller, and Krishnamurthy Dvijotham. Human uncertainty in concept-based ai systems. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 869–889, 2023.
[13] Kaiwen Xu, Kazuto Fukuchi, Youhei Akimoto, and Jun Sakuma. Statistically significant concept-based explanation of image classifiers via model knockoffs. arXiv:2305.18362, May 2023.
[14] Ričards Marcinkevičs, Patricia Reis Wolfertstetter, Ugne Klimiene, Kieran Chin-Cheong, Alyssia Paschke, Julia Zerres, Markus Denzinger, David Niederberger, Sven Wellmann, Ece Ozkan, et al. Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Medical Image Analysis, 91:103042, 2024.
[15] A.A. Meldo, L.V. Utkin, M.S. Kovalev, and E.M. Kasimov. The natural language explanation algorithms for the lung cancer computer-aided diagnosis system. Artificial Intelligence in Medicine, 108(Article 101952):1–10, 2020.
[16] Cristiano Patrício, João C Neves, and Luis F Teixeira. Coherent concept-based explanations in medical image and its application to skin lesion diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3798–3807, 2023.
[17] Cristiano Patrício, Luís F Teixeira, and João C Neves. Towards concept-based interpretability of skin lesion diagnosis using vision-language models. arXiv:2311.14339, Nov 2023.
[18] An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, et al. Robust and interpretable medical image classifiers via concept bottleneck models. arXiv:2310.03182, Oct 2023.
[19] Christoph Obermair, Alexander Fuchs, Franz Pernkopf, Lukas Felsberger, Andrea Apollonio, and Daniel Wollmann. Example or prototype? learning concept-based explanations in time-series. In Asian Conference on Machine Learning, pages 816–831. PMLR, 2023.
[20] Wensi Tang, Lu Liu, and Guodong Long. Interpretable time-series classification on few-shot samples. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
[21] Jihye Choi, Jayaram Raghuram, Ryan Feng, Jiefeng Chen, Somesh Jha, and Atul Prakash. Concept-based explanations for out-of-distribution detectors. In International Conference on Machine Learning, pages 5817–5837. PMLR, 2023.
[22] Laya Rafiee Sevyeri, Ivaxi Sheth, Farhood Farahnak, and Shirin Abbasinejad Enger. Transparent anomaly detection via concept-based explanations. arXiv:2310.10702, Oct 2023.
[23] Anita Mahinpei, Justin Clark, Isaac Lage, Finale Doshi-Velez, and Weiwei Pan. Promises and pitfalls of black-box concept learning models. arXiv:2106.13314, Jun 2021.
[24] Jae Hee Lee, Sergio Lanza, and Stefan Wermter. From neural activations to concepts: A survey on explaining concepts in neural networks. arXiv:2310.11884, Oct 2023.
[25] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. arXiv:2205.15480, May 2022.
[26] I. Sheth and S.E. Kahou. Auxiliary losses for learning generalizable concept-based models. arXiv:2311.11108, Nov 2023.
[27] Eunji Kim, Dahuin Jung, Sangha Park, Siwon Kim, and Sungroh Yoon. Probabilistic concept bottleneck models. arXiv:2306.01574, Jun 2023.
[28] Aya Abdelsalam Ismail, Julius Adebayo, Hector Corrada Bravo, Stephen Ra, and Kyunghyun Cho. Concept bottleneck generative models. In Proceedings of ICML 2023. Workshop on Deployment Challenges for Generative AI, https://openreview.net/group?id=ICML.cc/2023/Workshop, pages 1–10, 2023.
[29] Naveen Raman, Mateo E. Zarlenga, Juyeon Heo, and Mateja Jamnik. Do concept bottleneck models obey locality? arXiv:2401.01259, Jan 2024.
[30] Andrei Margeloiu, Matthew Ashman, Umang Bhatt, Yanzhi Chen, Mateja Jamnik, and Adrian Weller. Do concept bottleneck models learn as intended? arXiv:2105.04289, May 2021.
[31] R. Marcinkevics, S. Laguna, M. Vandenhirtz, and J.E. Vogt. Beyond concept bottleneck models: How to make black boxes intervenable? arXiv:2401.13544, Jan 2024.
[32] Yan Cui, Shuhong Liu, Liuzhuozheng Li, and Zhiyuan Yuan. Ceir: Concept-based explainable image representation learning. arXiv:2312.10747, Dec 2023.
[33] Yoshihide Sawada and Keigo Nakamura. Concept bottleneck model with additional unsupervised concepts. IEEE Access, 10:41758–41765, 2022.
[34] Xinyue Xu, Yi Qin, Lu Mi, Hao Wang, and Xiaomeng Li. Energy-based concept bottleneck models: unifying prediction, concept intervention, and conditional interpretations. arXiv:2401.14142, Jan 2024.
[35] Kushal Chauhan, Rishabh Tiwari, Jan Freyberg, Pradeep Shenoy, and Krishnamurthy Dvijotham. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 5948–5955, 2023.
[36] Federico Pittino, Vesna Dimitrievska, and Rudolf Heer. Hierarchical concept bottleneck models for vision and their application to explainable fine classification and tracking. Engineering Applications of Artificial Intelligence, 118:105674, 2023.
[37] Marton Havasi, Sonali Parbhoo, and Finale Doshi-Velez. Addressing leakage in concept bottleneck models. In Advances in Neural Information Processing Systems, volume 35, pages 23386–23397, 2022.
[38] Ao Sun, Yuanyuan Yuan, Pingchuan Ma, and Shuai Wang. Eliminating information leakage in hard concept bottleneck models with supervised, hierarchical concept learning. arXiv:2402.05945, Feb 2024.
[39] Emanuele Marconato, Andrea Passerini, and Stefano Teso. Glancenets: Interpretable, leak-proof concept-based models. In Advances in Neural Information Processing Systems, volume 35, pages 21212–21227, 2022.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[41] Rémi Kazmierczak, Eloïse Berthier, Goran Frehse, and Gianni Franchi. CLIP-QDA: An explainable concept bottleneck model. arXiv:2312.00110, Dec 2023.
[42] Komei Fukuda. Exact algorithms and software in optimization and polyhedral computation. In Proceedings of the Twenty-First International Symposium on Symbolic and Algebraic Computation, pages 333–334, 2008.
[43] A.V. Konstantinov and L.V. Utkin. A new computationally simple approach for implementing neural networks with output hard constraints. Doklady Mathematics, 2023.