FI-CBL: A Probabilistic Method for Concept-Based Learning with Expert Rules

Lev V. Utkin, Andrei V. Konstantinov and Stanislav R. Kirpichenko
Higher School of Artificial Intelligence Technologies
Peter the Great St.Petersburg Polytechnic University
St.Petersburg, Russia
e-mail: [email protected], [email protected], [email protected]

Abstract

A method for solving the concept-based learning (CBL) problem is proposed. The main idea behind the method is to divide each concept-annotated image into patches, to transform the patches into embeddings by using an autoencoder, and to cluster the embeddings assuming that each cluster will mainly contain embeddings of patches with certain concepts. To find concepts of a new image, the method implements frequentist inference by computing prior and posterior probabilities of concepts based on rates of patches from images with certain values of the concepts. Therefore, the proposed method is called the Frequentist Inference CBL (FI-CBL). FI-CBL allows us to incorporate expert rules in the form of logic functions into the inference procedure. The idea behind the incorporation is to update prior and conditional probabilities of concepts to satisfy the rules. The method is transparent because it has an explicit sequence of probabilistic calculations and a clear frequency interpretation. Numerical experiments show that FI-CBL outperforms the concept bottleneck model when the amount of training data is small. The code of the proposed algorithms is publicly available.

Keywords: concept-based learning, expert rules, Bayes rule, classification, logical function, inductive and deductive learning.

1 Introduction

Concept-based machine learning (CBL) is an innovative and promising approach that focuses on utilizing high-level concepts derived from raw features to express predictions in machine learning models, rather than using the raw features themselves [1]. This approach aims to integrate expert knowledge or human-like reasoning into machine learning models, leading to more efficient and accurate predictions. By incorporating high-level concepts, CBL can significantly improve explainability of the machine learning model outputs, making them more accessible to users [2, 3, 4, 5].

One of the key types of CBL is the concept bottleneck model (CBM), which can be regarded as a technique for learning high-level representations of data by forcing the model to learn a compressed, low-dimensional representation of input features [6]. This low-dimensional representation is referred to as the bottleneck. In the context of CBL, the CBM learns high-level concepts by transforming the raw features into a low-dimensional space, which allows the model to capture the essential features while discarding irrelevant information. Moreover, the classifier deriving final labels has access only to the concept representation, and the decision is strongly tied to the concepts [4]. CBMs have been widely used in various machine learning tasks, such as image recognition, natural language processing, and speech recognition [7]. The effectiveness of CBMs lies in their ability to learn meaningful high-level concepts, which can be easily interpreted and understood by humans. This capability makes CBMs a powerful tool for developing explainable and interpretable machine learning models.

It is important to point out that most models implementing CBL are based on a deep neural network which transforms raw features or images into a specific low-dimensional representation, for example, the bottleneck, which contains information about the concept values. As a result, the neural network may be complex and require a huge number of labeled instances for training, which are not available in some cases. Therefore, we propose an extremely simple and transparent model for CBL which is motivated by the work [8], devoted to annotating histopathology images, and by an interesting observation of the relationship between CBL and multiple instance learning (MIL) models [9, 10, 11, 12]. MIL is a type of weakly supervised learning which deals with two notions: bags and instances. Each bag is labeled, and it consists of many instances, i.e., its elements. For example, a histology digital image obtained from glass microscope slides with a label indicating a disease, for example, cancer or non-cancer, can be viewed as a “bag” consisting of patches extracted from the image, which are referred to as “instances”.

It turns out that MIL can be regarded as a special case of concept-based learning where labels (concepts) of instances are unknown, but there is a label description of the whole image (the bag). Yamamoto et al. [8] proposed a simple algorithm for annotating instances (patches) given annotations of the whole images. The algorithm is based on clustering the patch embeddings and computing, in the simplest way, the probabilities that patches in each cluster are malignant or benign: the rates of patches coming from malignant and from benign images are calculated for each cluster.

We extend the algorithm proposed by Yamamoto et al. [8] to CBL and develop a simple method for determining concepts of new images. The method is based on frequentist inference and on computing prior and posterior probabilities of concepts using rates of patches from images with certain values of concepts. In other words, we calculate the relative frequencies of patches from images with the concepts. Therefore, the proposed method is called the Frequentist Inference CBL (FI-CBL). In addition, we propose approaches to incorporate knowledge-based expert rules of the form “IF …, THEN …”, which are elicited from experts and constructed by means of concepts. For example, a rule from the lung cancer diagnostics of a nodule can look like “IF Contour is $<$ spicules $>$, Inclusion is $<$ necrosis $>$, THEN Diagnosis is $<$ malignant $>$”. Here Contour, Inclusion and Diagnosis are concepts, and their values are in angle brackets. It is important to note that an approach to incorporate knowledge-based expert rules into neural networks in the framework of CBL has been proposed in [13]. Its idea is to add a special layer to a classification neural network, which computes a probability distribution of concepts in accordance with the available expert rules. According to [13], the probability distributions of concepts are approximately generated by the neural network and are corrected by the incorporated expert rules. We propose another approach for taking into account the expert rules. It is based on the combination of the Bayes rule and the multinomial distribution. The expert rules in the form of logic functions update prior probabilities of concepts as well as conditional probabilities in the Bayes rule. This is a simple way of using the expert rules as a specific type of constraints on probabilities of concepts. We have to note that the term “expert rules” is used in the proposed model in a broader sense as an arbitrary logical function of concepts.

The code of proposed algorithms is available in:

https://github.com/NTAILab/simple_concepts.

2 Related work

Concept-based learning models. Starting from the works [3, 5], many CBL models have been proposed to implement ideas behind concept-based learning under various conditions. One of the goals of developing the CBL approaches is to interpret and explain predictions of machine learning models. In order to achieve this goal and to overcome some difficulties of the interpretation, a CBL model was proposed in [1]. Concepts in this model are fully transparent, thus enabling users to validate whether mental and machine models are aligned. A concept-based explanation framework was presented in [14]. An algorithm for learning visual concepts from images by applying a Bayesian generalization model was developed in [15]. It should be noted that not only images (visual data) are used in the CBL models, but also tabular data. For example, the concept attribution approach to tabular learning and a definition of concepts over tabular data were proposed in [16]. Applications of CBL go beyond explanation and extend to a wide variety of problems. Approaches for analyzing time-series data using the CBL models were presented in [17, 18]. The well-known anomaly detection task in the framework of CBL was considered in [19, 20]. One of the important areas of the CBL application is medicine where doctors prefer explanations that are user-friendly and represented via natural language [21]. Several authors have contributed to the development of medical CBL models [22, 23, 24, 25, 26]. Survey papers [2, 27, 28] comprehensively discuss many aspects of the CBL models and their applications.

Concept bottleneck models. Most CBL models are implemented in the form of the CBMs [6]. Due to the efficient and transparent two-module architecture of CBMs, where the first module (a neural network) implements the dependence of concepts on input instances, and the second module implements the dependence of the target variable on the concepts, many modifications and extensions of these models have been proposed [3, 29, 30].

An extension of CBMs is the concept embedding model [31] which learns two embeddings per concept, one for when it is active, and another when it is inactive. The model aims to overcome the current accuracy-vs-interpretability trade-off. Ideas behind the concept embedding model have been used in the concept bottleneck generative models [32].

Similarly to CBL, extensions and modifications of CBMs were motivated by their applications or by different conditions of their applications. For example, to model the ambiguity in the concept predictions, a probabilistic CBM was introduced in [33]. Conditions of independence across concepts were studied in [34]. Different aspects of the concept-based interventions were considered in [35]. An application of CBMs to image segmentation and tracking was presented in [36]. Two causes of performance disparity between soft (inputs to a label predictor are the concept probabilities) and hard (the label predictor only accepts binary concepts) CBMs were identified in [37]. CLIP-based CBMs using the well-known CLIP model [38] were proposed in [39].

The above modifications and extensions of CBMs can be regarded as a small part of all available modifications [40, 41, 42, 43, 44], which are caused by the great interest in CBL.

Incorporating expert rules into machine learning models. The idea of combining the prior expert knowledge with machine learning models has already attracted some interest, and a number of interesting approaches have been proposed for its implementation. A comprehensive and exhaustive review of various available approaches to integrate prior knowledge into the training process was presented in [45]. Authors in [45] also propose a concept of informed machine learning, which can be viewed as a uniting term for different approaches.

Depending on the knowledge representation, there are different approaches for the knowledge integration into the machine learning pipeline. We do not touch upon a large class of methods related to the representation of knowledge in forms of algebraic equations, differential equations, probabilistic relations, etc. An analysis and review of these methods can again be found in [45]. Our goal is to study the knowledge representation in the form of expert rules or logic rules. A common approach to incorporate the logic rules into a machine learning model is to add the rules as constraints to loss functions [46, 47, 48, 49]. However, this approach does not guarantee that the rules will be satisfied for all training and testing instances because violation of the constraints is only penalized, but not eliminated. Another way to integrate the rules into neural networks is to map components of the rules to neurons [50, 51]. In this approach, a neural network implements logic functions corresponding to the expert rules. An interesting approach has been proposed in [52] where authors present an effective safe abductive learning method and show that induction and abduction are mutually beneficial. An extensive review of methods for updating models based on expert feedback is presented in [53].

A quite different approach to incorporate expert rules into machine learning in the framework of CBL was presented in [13]. Following this approach, we present a computationally simple algorithm for predicting the concept probabilities and provide a way to incorporate the expert rules in the form of logic functions.

3 Background

A common statement of the CBL problem is based on considering a classifier which predicts a set of concepts as well as the target variable [54].

Suppose that a training set is represented as a set of triples $(\mathbf{x}_{i},y_{i},\mathbf{c}_{i})$, $i=1,...,N$, where $\mathbf{x}_{i}\in\mathcal{X}\subset\mathbb{R}^{d}$ is the input feature vector; $y_{i}\in\mathcal{Y}=\{1,2,...,K\}$ is the corresponding target defining a $K$-class classification task; $\mathbf{c}_{i}=(c_{i}^{(1)},...,c_{i}^{(m)})\in\mathcal{C}$ is a set of $m$ concepts which are given with targets. In most works, concepts are represented as a vector $\mathbf{c}_{i}$ with $m$ binary elements such that $c_{i}^{(j)}=1$ denotes that the $j$-th concept is present in a description of the input $\mathbf{x}_{i}$, and $c_{i}^{(j)}=0$ denotes that the $j$-th concept is not present.

One of the CBL goals is to predict targets and concepts, that is, to find the mapping $h:\mathcal{X}\rightarrow(\mathcal{C},\mathcal{Y})$ from inputs to concepts and targets. Another goal is to explain what concepts of the input are responsible for the corresponding prediction. In other words, the CBL model aims to interpret how predictions depend on concepts of the corresponding inputs. The above goals can be achieved by applying the CBM proposed by Koh et al. [6] as an important type of the CBL models. The function $h$ in the CBM is represented as two functions: the first one $g:\mathcal{X}\rightarrow\mathcal{C}$ maps the input vector to concepts; the second function $f:\mathcal{C}\rightarrow\mathcal{Y}$ maps the concepts to the outputs. The prediction $y$ for a new instance $\mathbf{x}$ can be obtained as $y=f(g(\mathbf{x}))$. Here concepts act as a bottleneck in the interpretation of predictions.

4 The model and its training

It is assumed that each concept $c^{(i)}$ can take a value from the set $\mathcal{C}^{(i)}=\{1,...,n_{i}\}$ called the $i$ -th concept outcome set, $i\in\{0,\dots,m\}$ . The concept $c^{(0)}$ is a special concept corresponding to the target variable $y$ .

We also suppose that there are $N$ images $\mathbf{x}_{i}$, $i=1,...,N$, in the training set such that the $i$-th image is characterized by a set of concept values $\mathbf{c}_{i}=(c_{i}^{(0)},...,c_{i}^{(m)})$. The whole dataset consists of pairs $(\mathbf{x}_{i},\mathbf{c}_{i})$ of vectors. For example, the lung cancer nodule description can be based on two concepts: Contour $c^{(1)}$ and Inclusion $c^{(2)}$. The concept Contour takes values $<$ smooth $>$, $<$ grainy $>$, $<$ spicules $>$; the concept Inclusion takes values $<$ homogeneous $>$ and $<$ necrosis $>$. The target value $y$ or the concept Diagnosis $c^{(0)}$ takes values $<$ malignant $>$ and $<$ benign $>$. Then we have the formal concept description $\mathcal{C}^{(0)}=\{1,2\}$, $\mathcal{C}^{(1)}=\{1,2,3\}$, $\mathcal{C}^{(2)}=\{1,2\}$.

Let us divide each image $\mathbf{x}_{i}$ into $s$ patches of the same dimension denoted as $\xi_{1}^{(i)},...,\xi_{s}^{(i)}$ . By using an autoencoder, we can obtain the corresponding embeddings $e_{1}^{(i)},...,e_{s}^{(i)}$ of a smaller dimension.
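The patch-splitting step can be sketched as follows. This is a minimal illustration, assuming non-overlapping square patches and cropping of any remainder at the image border; the function name `split_into_patches` and the parameter `patch_size` are hypothetical and not taken from the original text. The autoencoder that maps patches to embeddings is not shown.

```python
import numpy as np


def split_into_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W) or (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (s, patch_size, patch_size[, C]) with
    s = (H // patch_size) * (W // patch_size). Rows and columns that do not
    fill a whole patch are cropped (an assumption of this sketch).
    """
    h, w = image.shape[:2]
    rows, cols = h // patch_size, w // patch_size
    cropped = image[: rows * patch_size, : cols * patch_size]
    # Reshape to (rows, patch, cols, patch, ...) and bring the two grid axes first.
    grid = cropped.reshape(rows, patch_size, cols, patch_size, *image.shape[2:])
    grid = grid.swapaxes(1, 2)
    return grid.reshape(rows * cols, patch_size, patch_size, *image.shape[2:])


# Toy 8x8 "image" split into s = 4 patches of size 4x4.
image = np.arange(8 * 8).reshape(8, 8)
patches = split_into_patches(image, patch_size=4)
```

Patches are returned in row-major order of the patch grid, so `patches[1]` is the top-right block of the toy image.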

In fact, we have a weakly supervised learning task where labels of the whole images are known, including values of all concepts or a part of concepts, but labels of patches are unknown. However, we can compute probabilities of labels for patches by separating embeddings corresponding to patches into groups (clusters) with different contents and by counting up how many whole images having a certain concept value contain patches from each group (cluster).

All embeddings are clustered into $R$ clusters $K_{1},...,K_{R}$, i.e., we obtain subsets of embeddings which fall into the $k$-th cluster, $k=1,...,R$, of the form:

\left\{e_{j}^{(i)},i\in\mathcal{I}_{k},j\in\mathcal{J}_{k}\right\},

(1)

where $\mathcal{I}_{k}$ and $\mathcal{J}_{k}$ are index sets such that the $k$ -th cluster contains embeddings of patches with indices from $\mathcal{J}_{k}$ belonging to indices of images from $\mathcal{I}_{k}$ .
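The text does not fix a particular clustering algorithm. As one possible choice, the sketch below uses plain k-means (Lloyd's algorithm) written in NumPy; the function `kmeans`, the random toy embeddings and the array `image_of_patch` are assumptions made only for illustration.

```python
import numpy as np


def kmeans(embeddings: np.ndarray, n_clusters: int, n_iter: int = 50, seed: int = 0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centers with distinct, randomly chosen embeddings.
    init = rng.choice(len(embeddings), size=n_clusters, replace=False)
    centers = embeddings[init].astype(float)
    for _ in range(n_iter):
        # Squared distances of every embedding to every center: shape (S, R).
        dists = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = embeddings[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    dists = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)


# Toy data: N images, s patches each, embeddings of dimension `dim`.
rng = np.random.default_rng(1)
N, s, dim, R = 6, 4, 3, 2
embeddings = rng.normal(size=(N * s, dim))
image_of_patch = np.repeat(np.arange(N), s)   # image index i of every embedding

centers, labels = kmeans(embeddings, n_clusters=R)
cluster_sizes = np.bincount(labels, minlength=R)   # numbers of embeddings per cluster
# Images contributing at least one patch to each cluster (the index sets I_k).
images_in_cluster = [sorted(set(image_of_patch[labels == k].tolist())) for k in range(R)]
```

The array `labels` is the only output of this step that the frequency estimates need; the cluster centers are kept so that embeddings of a new image can later be assigned to the nearest center.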

Clusters contain $s_{1},...,s_{R}$ embeddings such that $s_{1}+...+s_{R}=S$ , where $S$ is the total number of patches obtained from all images. If all images are divided into the same number of patches, then $S=s\cdot N$ . Let us write all concepts in the form of one vector consisting of concatenated vectors of indices:

\mathcal{C}=(\underbrace{1,...,n_{0}}_{\mathcal{C}^{(0)}},\underbrace{1,...,n_{1}}_{\mathcal{C}^{(1)}},...,\underbrace{1,...,n_{m}}_{\mathcal{C}^{(m)}}).

(2)

A general scheme of the above representation of the images in the form of embeddings of patches divided into $R$ clusters is depicted in Fig. 1.

Figure 1: A general scheme of the image transformation to sets of clustered embeddings of patches: each image is divided into $s$ patches $\xi_{j}^{(i)}$; every patch is transformed into an embedding $e_{j}^{(i)}$; embeddings are clustered into $R$ clusters; probabilities of concepts are computed from the image concept labels $\mathbf{c}_{i}$ and the distributions of embeddings with the corresponding concept values in clusters.

If we assume that each concept is a random variable $C^{(r)}$ taking values from $\mathcal{C}^{(r)}$ , then we aim to find the conditional probability $p(r,v\mid l)=P\left\{C^{(r)}=v\mid e\in K_{l}\right\}$ that the concept $C^{(r)}$ takes the value $v$ under condition that the embedding $e$ of a patch taken from a considered image falls into the cluster $K_{l}$ . Let us define the following additional probabilities and their short notations:

• $p(l\mid r,v)=P\left\{e\in K_{l}\mid C^{(r)}=v\right\}$ is the conditional probability that an embedding from an image having the value $v$ of the $r$-th concept falls into the cluster $K_{l}$;
• $p(r,v)=P\left\{C^{(r)}=v\right\}$ is the prior probability that an embedding is from an image having the value $v$ of the $r$-th concept;
• $p(l)=P\left\{e\in K_{l}\right\}$ is the unconditional probability that an embedding falls into the cluster $K_{l}$.

By using the Bayes rule, we write

p(r,v\mid l)=\frac{p(l\mid r,v)\cdot p(r,v)}{p(l)}.

(3)

It is important to point out that we do not need the probabilities $p(r,v\mid l)$ for the inference when a new instance is classified. We need to know only $p(l\mid r,v)$ and $p(r,v)$ for the inference. On the other hand, the probability $p(r,v\mid l)$ can be regarded as a measure for determining what concept values the embeddings in the $l$-th cluster have. A large probability $p(r,v\mid l)$ implies that the embeddings in $K_{l}$ mainly have the concept value $c^{(r)}=v$. If we assume that clusters are homogeneous to some extent, then embeddings contained in the cluster $K_{l}$ correspond to patches from images with the concept $c^{(r)}=v$. This is important information because it allows us to highlight areas in the image with given concepts.

Let us introduce the following additional notations:

• $s_{v}^{(r)}(l)$ is the number of embeddings in the $l$-th cluster obtained from images with $c^{(r)}=v$;
• $s_{v}^{(r)}$ is the total number of embeddings in all clusters obtained from images with $c^{(r)}=v$.

The conditional probability $p(l\mid r,v)$ is determined as the proportion of embeddings from images with $c^{(r)}=v$ that fall into the cluster $K_{l}$ to the entire set of embeddings from the images with $c^{(r)}=v$ , i.e., there holds $p(l\mid r,v)=s_{v}^{(r)}(l)/s_{v}^{(r)}$ .

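The prior probability $p(r,v)$ is determined as the proportion of embeddings obtained from images with $c^{(r)}=v$ to all embeddings, i.e., there holds $p(r,v)=s_{v}^{(r)}/S$; when all images are divided into the same number $s$ of patches, this coincides with the proportion of images with $c^{(r)}=v$ in the dataset. The unconditional probability $p(l)$ can be computed from the condition: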

\sum_{v=1}^{n_{r}}p(r,v\mid l)=1.

(4)

It can be also determined as the proportion of embeddings in the cluster $K_{l}$ to all embeddings from all images, i.e., there holds $p(l)=s_{l}/S$ . Hence, the posterior probability is computed as $p(r,v\mid l)=s_{v}^{(r)}(l)/s_{l}$ .
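The frequency estimates above reduce to a single table of counts $s_{v}^{(r)}(l)$. A minimal sketch for one concept $r$ is given below; the function name `frequency_tables`, the 0-based indexing and the toy arrays are assumptions of the example, not part of the original method description.

```python
import numpy as np


def frequency_tables(labels, concept_of_patch, n_clusters, n_values):
    """Estimate p(l | r, v), p(r, v), p(l) and p(r, v | l) for one concept r.

    labels[j]           -- cluster index (0-based) of the j-th embedding
    concept_of_patch[j] -- value (0-based) of concept r of the source image
    """
    labels = np.asarray(labels)
    concept_of_patch = np.asarray(concept_of_patch)
    counts = np.zeros((n_values, n_clusters))         # counts[v, l] = s_v^(r)(l)
    np.add.at(counts, (concept_of_patch, labels), 1)
    s_v = counts.sum(axis=1)                          # s_v^(r)
    s_l = counts.sum(axis=0)                          # s_l
    total = counts.sum()                              # S
    with np.errstate(divide="ignore", invalid="ignore"):
        p_l_given_v = np.where(s_v[:, None] > 0, counts / s_v[:, None], 0.0)
        p_v_given_l = np.where(s_l[None, :] > 0, counts / s_l[None, :], 0.0)
    return p_l_given_v, s_v / total, s_l / total, p_v_given_l


# Six embeddings, two clusters, a binary concept.
labels = [0, 0, 1, 1, 1, 0]
concept_of_patch = [0, 0, 0, 1, 1, 1]
p_l_given_v, p_v, p_l, p_v_given_l = frequency_tables(labels, concept_of_patch, 2, 2)
```

In this toy case the counts are $s_{v}^{(r)}(l)=\{(2,1),(1,2)\}$, so `p_l_given_v` and `p_v_given_l` both equal `[[2/3, 1/3], [1/3, 2/3]]`, and the Bayes rule (3) can be checked numerically from the returned arrays.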

5 The model inference

Suppose we have a new instance $\mathbf{x}$ consisting of $s$ patches $\xi_{1},...,\xi_{s}$ which are fed to the trained autoencoder in order to obtain embeddings $e_{1},...,e_{s}$ . These embeddings are distributed among clusters $K_{1},...,K_{R}$ according to their distances to the cluster centers. Let $s_{1},...,s_{R}$ be numbers of embeddings which fall into clusters $K_{1},...,K_{R}$ , respectively, such that $s_{1}+...+s_{R}=s$ . Let us denote the set of embeddings, produced by a new instance, which fall into $K_{l}$ , as $E_{l}^{\ast}=\{e^{\ast}(1,l),...,e^{\ast}(s_{l},l)\}$ . We do not have any information about concepts and their values which describe the instance. However, we can compute the probability that $C^{(r)}$ is equal to $v$ for all $r=1,...,m$ , $v=1,...,n_{r}$ , under condition that $s_{1},...,s_{R}$ embeddings fall into clusters $K_{1},...,K_{R}$ , respectively, i.e., find the conditional probability

p(r,v\mid E_{1:R}^{\ast})=P\left\{C^{(r)}=v\mid E_{1}^{\ast},...,E_{R}^{\ast}\right\}.

(5)

Here $E_{1:R}^{\ast}$ is a short notation for $E_{1}^{\ast},...,E_{R}^{\ast}$. This probability depends on the probabilities that embeddings with the concept $c^{(r)}=v$ fall into $K_{l}$, $l=1,...,R$. These probabilities can be estimated by means of the probability $p(l\mid r,v)$ which has been considered above and defined as the proportion of embeddings from images with $c^{(r)}=v$ that fall into the cluster $K_{l}$ to the entire set of embeddings from the images with $c^{(r)}=v$.

Then we can write the following Bayes rule:

p(r,v\mid E_{1:R}^{\ast})=\frac{p(E_{1:R}^{\ast}\mid r,v)\cdot p(r,v)}{P(E_{1:R}^{\ast})},

(6)

where $p(E_{1:R}^{\ast}\mid r,v)=P\{E_{1:R}^{\ast}\mid C^{(r)}=v\}$ is the conditional probability that $s_{1},...,s_{R}$ embeddings fall into clusters $K_{1},...,K_{R}$, respectively, under condition that the image has the value $v$ of the $r$-th concept; $P(E_{1:R}^{\ast})$ is the unconditional probability that $s_{1},...,s_{R}$ embeddings fall into clusters $K_{1},...,K_{R}$.

We propose to apply the multinomial distribution with probabilities $p(l\mid r,v)$ of events in order to represent $P\{E_{1:R}^{\ast}\mid C^{(r)}=v\}$ :

p(E_{1:R}^{\ast}\mid r,v)=\frac{s!}{s_{1}!\cdots s_{R}!}\prod_{l=1}^{R}p^{s_{l}}(l\mid r,v).

(7)

The probability $p(l\mid r,v)$ has been defined in Section 4 and used in (3). The unconditional probability $P(E_{1:R}^{\ast})$ can be computed by using the condition:

\sum_{v=1}^{n_{r}}p(r,v\mid E_{1:R}^{\ast})=1.

(8)

The main difficulty of using (7) is that some probabilities $p(l\mid r,v)$ may be $0$ . In this case, the product (7) is also $0$ . In order to overcome this problem, we propose to replace the zero-valued probabilities with some small value $\epsilon>0$ . In this case, the probability $P(E_{1:R}^{\ast})$ can be computed only by means of (8).

In sum, for a new instance, we compute $n_{1}+...+n_{m}$ probabilities $p(r,v\mid E_{1:R}^{\ast})$ considering all $v=1,...,n_{r}$ and $r=1,...,m$ .
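The inference step for one concept can be sketched as below. The computation is carried out in the logarithmic domain to avoid underflow for large $s$, which is an implementation choice of this sketch rather than something stated in the text; the function name `concept_posterior` and the default value of `eps` are likewise assumptions.

```python
import numpy as np
from math import lgamma


def concept_posterior(cluster_counts, p_l_given_v, prior, eps=1e-6):
    """Posterior p(r, v | E*) of Eq. (6) with the multinomial likelihood (7).

    cluster_counts[l] -- number s_l of the new image's embeddings in cluster l
    p_l_given_v[v, l] -- trained p(l | r, v)
    prior[v]          -- trained p(r, v)
    """
    cluster_counts = np.asarray(cluster_counts, dtype=float)
    probs = np.asarray(p_l_given_v, dtype=float)
    # Replace zero probabilities by a small eps > 0, as suggested in the text.
    probs = np.where(probs > 0, probs, eps)
    # Logarithm of the multinomial coefficient s! / (s_1! ... s_R!).
    log_coef = lgamma(cluster_counts.sum() + 1) - sum(lgamma(c + 1) for c in cluster_counts)
    log_lik = log_coef + cluster_counts @ np.log(probs).T    # one value per v
    with np.errstate(divide="ignore"):
        log_post = log_lik + np.log(np.asarray(prior, dtype=float))
    log_post -= log_post.max()                               # numerical stability
    post = np.exp(log_post)
    return post / post.sum()                                 # normalisation (8)


# Two clusters, binary concept; the new image has 3 embeddings in K_1 and 1 in K_2.
post = concept_posterior([3, 1], [[0.8, 0.2], [0.2, 0.8]], [0.5, 0.5])
```

For this toy input the likelihoods are $4\cdot 0.8^{3}\cdot 0.2=0.4096$ and $4\cdot 0.2^{3}\cdot 0.8=0.0256$, so the posterior is $(16/17,\,1/17)$. The multinomial coefficient does not depend on $v$ and cancels in the normalisation (8); it is kept only to mirror (7).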

Let us introduce the thresholds $\gamma_{r}$ for all values of the $r$ -th concept. If $p(r,v\mid E_{1:R}^{\ast})\geq\gamma_{r}$ , then the value $v$ of the concept $c^{(r)}$ is assigned to the instance. After considering all the threshold conditions, we get the concept-based description of the image.
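The thresholding step can be illustrated with a few lines of code. The dictionary layout and the function name `assign_concept_values` are hypothetical; concept values are reported 1-based to match the outcome sets $\{1,...,n_{r}\}$.

```python
def assign_concept_values(posteriors, thresholds):
    """Return, for each concept r, the values v with p(r, v | E*) >= gamma_r.

    posteriors[r] -- posterior probabilities over the values of concept r
    thresholds[r] -- threshold gamma_r
    """
    return {
        r: [v + 1 for v, p in enumerate(probs) if p >= thresholds[r]]
        for r, probs in posteriors.items()
    }


posteriors = {1: [0.1, 0.2, 0.7], 2: [0.45, 0.55]}
description = assign_concept_values(posteriors, thresholds={1: 0.5, 2: 0.4})
# description == {1: [3], 2: [1, 2]}
```

Note that a threshold not exceeding $0.5$ may assign several values of the same concept to one image, as happens for the second concept in this toy example.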

6 Incorporating expert rules

Let us introduce the logical literal denoted as $[c^{(j)}=i]$, which takes the value $1$ if the concept $c^{(j)}$ has the value $i$, and the value $0$ otherwise. A set of expert rules can be represented as a logical expression $g(\mathbf{c})$ over literals $[c^{(j)}=i]$ such that $g(\mathbf{c})=0$ means that the rule is FALSE and $g(\mathbf{c})=1$ means that it is TRUE. One of the forms of rules provided by experts is “IF $F$, THEN $G$”, where $F$ is the antecedent and $G$ is the consequent. This rule is expressed through the logical function as $F\rightarrow G=\lnot F\vee G$, where the symbols $\rightarrow$ and $\lnot$ denote the operations of implication and negation, respectively.
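As an illustration, the lung nodule rule from the Introduction can be written as a logical function of concept values. The sketch assumes that the comma in the antecedent denotes conjunction and that values are encoded 1-based in the order in which they are listed in Section 4; both the encoding and the dictionary keys are hypothetical.

```python
def implies(antecedent: bool, consequent: bool) -> bool:
    """Material implication F -> G, i.e. (not F) or G."""
    return (not antecedent) or consequent


# Assumed encoding: Contour {1: smooth, 2: grainy, 3: spicules},
# Inclusion {1: homogeneous, 2: necrosis}, Diagnosis {1: malignant, 2: benign}.
def g(c: dict) -> int:
    """IF Contour is <spicules> AND Inclusion is <necrosis>, THEN Diagnosis is <malignant>."""
    antecedent = c["contour"] == 3 and c["inclusion"] == 2
    consequent = c["diagnosis"] == 1
    return int(implies(antecedent, consequent))


satisfied = g({"contour": 3, "inclusion": 2, "diagnosis": 1})   # rule holds
violated = g({"contour": 3, "inclusion": 2, "diagnosis": 2})    # rule is broken
vacuous = g({"contour": 1, "inclusion": 2, "diagnosis": 2})     # antecedent is false
```

The rule is violated only by combinations in which the antecedent is true and the consequent is false; every combination with a false antecedent satisfies it.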

Expert rules change prior probabilities of concepts $p(r,v)$ as well as conditional probabilities of concepts $p(l\mid r,v)$ . Therefore, the next question we need to answer is how to update these probabilities in order to implement the model inference taking into account the rules.

6.1 Expert rules and prior probabilities of concepts

First, we consider how prior probabilities of concepts are changed by using the expert rules. Let us return to the conditional probability $p(r,v\mid E_{1:R}^{\ast})$ in (6). One of the ways to take into account the expert rules is to assume that the expert rules change prior probabilities of concepts $p(r,v)$, i.e., we have to find the conditional probabilities $P\left\{C^{(r)}=v\mid g(\mathbf{C})=1\right\}$. Here $\mathbf{C}$ is the random vector taking values from combinations of the concept values. There exist $n_{0}\cdots n_{m}$ different combinations of the concept values which are formed from the Cartesian product $\mathcal{C}=\mathcal{C}^{(0)}\times\dots\times\mathcal{C}^{(m)}$ of the concept outcome sets defined above. Let $\mathbf{z}\in\mathcal{C}$ be one of the combinations. Each expert rule takes values TRUE or FALSE ($1$ or $0$) after substituting $\mathbf{z}$ into the logical function $g$. Let us use again the Bayes rule for determining the posterior probability $P\left\{C^{(r)}=v\mid g(\mathbf{C})=1\right\}$:

P\left\{C^{(r)}=v\mid g(\mathbf{C})=1\right\}=\frac{P\left\{g(\mathbf{C})=1\mid C^{(r)}=v\right\}\cdot p(r,v)}{P\left\{g(\mathbf{C})=1\right\}}.

(9)

In order to determine the probability $P\left\{g(\mathbf{C})=1\right\}$, we consider all cases when the random vector takes values $\mathbf{z}$ from the set $\mathcal{C}$. However, when the conditional probability $P\left\{g(\mathbf{C})=1\mid C^{(r)}=v\right\}$ is determined, the set $\mathcal{C}$ of all vectors $\mathbf{z}$ is restricted by the condition $C^{(r)}=v$. A new restricted set denoted as $\mathcal{C}_{r,v}$ is formed by replacing $\mathcal{C}^{(r)}$ in $\mathcal{C}$ with the single value $v$, i.e., $\mathcal{C}^{(r)}=\{v\}$ and

\mathcal{C}_{r,v}=\mathcal{C}^{(0)}\times\dots\times\{v\}\times\dots\times\mathcal{C}^{(m)}.

(10)

By using the rule of total probability, we write:

P\{g(\mathbf{C})=1\mid C^{(r)}=v\}=\sum_{\mathbf{z}\in\mathcal{C}_{r,v}}P\{g(\mathbf{C})=1\mid\mathbf{C}=\mathbf{z}\}P\{\mathbf{C}=\mathbf{z}\}.

(11)

Since the function $g(\mathbf{C})$ takes values $0$ and $1$ , then the conditional probability is determined as follows:

P\{g(\mathbf{C})=1\mid C^{(r)}=v\}=\sum_{\mathbf{z}\in\mathcal{C}_{r,v}}g(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}.

(12)

The probability $P\{\mathbf{C}=\mathbf{z}\}$ can be found by considering all images which have the combination $\mathbf{z}\in\mathcal{C}_{r,v}$ of concept values, i.e., the probability is equal to the proportion of the number of images with concepts $\mathbf{z}\in\mathcal{C}_{r,v}$ to the number $N$ of all images in the dataset.

In sum, for every combination of concepts $\mathbf{z}\in\mathcal{C}_{r,v}$, we check whether the concepts satisfy the expert rule $g(\mathbf{z})$ and then compute $P\{\mathbf{C}=\mathbf{z}\}$ for this combination. It is important to point out that a combination $\mathbf{z}$ is not considered if there are no images with the corresponding set of concept values in the dataset because $P\{\mathbf{C}=\mathbf{z}\}=0$ in this case.

The unconditional probability $P\left\{g(\mathbf{C})=1\right\}$ can be obtained from the condition

\sum_{v=1}^{n_{r}}P\left\{C^{(r)}=v\mid g(\mathbf{C})=1\right\}=1,

(13)

and is equal to

P\left\{g(\mathbf{C})=1\right\}=\sum_{v=1}^{n_{r}}P\left\{g(\mathbf{C})=1\mid C^{(r)}=v\right\}\cdot p(r,v)=\sum_{v=1}^{n_{r}}\sum_{\mathbf{z}\in\mathcal{C}_{r,v}}g(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}\cdot p(r,v).

(14)

Hence, we rewrite the expression (9) for computing $P\left\{C^{(r)}=v\mid g(\mathbf{C})=1\right\}$ as follows:

P\left\{C^{(r)}=v\mid g(\mathbf{C})=1\right\}=U_{v}^{(r)}\cdot p(r,v),

(15)

where

U_{v}^{(r)}=\frac{\sum_{\mathbf{z}\in\mathcal{C}_{r,v}}g(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}}{P\left\{g(\mathbf{C})=1\right\}}.

(16)

The prior probability $p(r,v)$ of the concept is updated in accordance with the rule $g$ by multiplying it by the updating coefficient $U_{v}^{(r)}$.

Finally, we have obtained the updated probabilities $p(r,v)$ which are used in (6) for computing posterior marginal probabilities of concepts $P\left\{C^{(r)}=v\mid E_{1:R}^{\ast}\right\}$ .
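The prior update can be sketched as follows, using the empirical distribution of concept combinations over the training images for $P\{\mathbf{C}=\mathbf{z}\}$ and the normalisation condition (13) for $P\{g(\mathbf{C})=1\}$. The function name `updated_prior`, the tuple layout (Diagnosis, Contour, Inclusion) and the toy dataset are assumptions made for the example.

```python
from collections import Counter
import numpy as np


def updated_prior(image_concepts, r, n_values, g, prior):
    """Update the prior p(r, v) by the rule g following Eqs. (9), (12) and (13).

    image_concepts -- list of tuples z, one per training image (1-based values)
    r              -- position of the concept inside each tuple
    g              -- function mapping a tuple z to 0 or 1
    prior          -- array with p(r, v) for v = 1, ..., n_values
    """
    n_images = len(image_concepts)
    # Empirical P{C = z}; combinations absent from the data have probability 0.
    p_z = {z: k / n_images for z, k in Counter(image_concepts).items()}
    # Eq. (12): sum of g(z) * P{C = z} over combinations with z[r] = v.
    rule_given_v = np.array([
        sum(g(z) * p for z, p in p_z.items() if z[r] == v)
        for v in range(1, n_values + 1)
    ])
    unnormalised = rule_given_v * np.asarray(prior, dtype=float)
    total = unnormalised.sum()           # P{g(C) = 1} from the condition (13)
    if total == 0:
        raise ValueError("no training image satisfies the rule")
    return unnormalised / total


def g_rule(z):
    """IF Contour is 3 AND Inclusion is 2, THEN Diagnosis is 1; z = (diag, contour, incl)."""
    return int(not (z[1] == 3 and z[2] == 2) or z[0] == 1)


images = [(1, 3, 2), (1, 3, 2), (2, 3, 2), (2, 1, 1), (1, 2, 1), (2, 1, 1)]
new_prior = updated_prior(images, r=0, n_values=2, g=g_rule, prior=[0.5, 0.5])
```

In the toy dataset one of the three images with Diagnosis $=2$ violates the rule, so the sums in (12) are $1/2$ and $1/3$, and the prior of Diagnosis moves from $(0.5,\,0.5)$ to $(0.6,\,0.4)$.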

6.2 Expert rules and conditional probabilities

In addition to changes of prior probabilities, it is necessary to consider how conditional probabilities $P\left\{e\in K_{l}\mid C^{(r)}=v\right\}$ are updated due to expert rules. Let us denote the conditional event $e\in K_{l}\mid C^{(r)}=v$ as $E_{l}^{(r)}=v$ . Then we have to find the probability

P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}=\frac{P\left\{g(\mathbf{C})=1\mid E_{l}^{(r)}=v\right\}\cdot P\left\{E_{l}^{(r)}=v\right\}}{P\left\{g(\mathbf{C})=1\right\}}.

(17)

Note that the probability $P\left\{E_{l}^{(r)}=v\right\}$ is nothing else, but the conditional probability $p(l\mid r,v)$ which has been defined above in Sec. 4, i.e., $P\left\{E_{l}^{(r)}=v\right\}=s_{v}^{(r)}(l)/s_{v}^{(r)}$ . Let $\mathcal{C}_{r,v}(l)\subseteq$ $\mathcal{C}_{r,v}$ be a subset of $\mathcal{C}_{r,v}$ , which contains the concept combinations of embeddings from the cluster $K_{l}$ . Then the conditional probability $P\left\{g(\mathbf{C})=1\mid E_{l}^{(r)}=v\right\}$ is determined as

P\{g(\mathbf{C})=1\mid E_{l}^{(r)}=v\}=\sum_{\mathbf{z}\in\mathcal{C}_{r,v}(l)}P\{g(\mathbf{C})=1\mid\mathbf{C}=\mathbf{z}\}P\{\mathbf{C}=\mathbf{z}\}.

(18)

It can be seen from (18) that $P\{g(\mathbf{C})=1\mid E_{l}^{(r)}=v\}$ differs from $P\{g(\mathbf{C})=1\mid C^{(r)}=v\}$ in (11) in that the set $\mathcal{C}_{r,v}(l)$ is limited to considering only the cluster $K_{l}$. Hence, there holds

P\{g(\mathbf{C})=1\mid E_{l}^{(r)}=v\}=\sum_{\mathbf{z}\in\mathcal{C}_{r,v}(l)}g(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}.

(19)

We rewrite the expression (17) for computing $P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}$ as follows:

P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}=V_{v}^{(r)}(l)\cdot P\left\{E_{l}^{(r)}=v\right\}=V_{v}^{(r)}(l)\cdot p(l\mid r,v)=V_{v}^{(r)}(l)\cdot s_{v}^{(r)}(l)/s_{v}^{(r)},

(20)

where

V_{v}^{(r)}(l)=\frac{\sum_{\mathbf{z}\in\mathcal{C}_{r,v}(l)}g(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}}{P\left\{g(\mathbf{C})=1\right\}}.

(21)

Note that there holds

\sum_{l=1}^{R}P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}=1,

because the event $E_{l}^{(r)}=v$ means that the embedding $e$ falls into one of the clusters. Hence, we can write

\sum_{l=1}^{R}V_{v}^{(r)}(l)\cdot p(l\mid r,v)=1.

The conditional probability $p(l\mid r,v)$ is updated in accordance with the rule $g$ by multiplying it by the updating coefficient $V_{v}^{(r)}(l)$.

After updating the probabilities, they are substituted into (6) and (7) in order to obtain probabilities of concepts for a new instance.
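A possible implementation of the update of $p(l\mid r,v)$ is sketched below. It is only one reading of (19)-(21): the set $\mathcal{C}_{r,v}(l)$ is taken to be the distinct concept combinations of images that contribute at least one patch to $K_{l}$, and the denominator is obtained from the condition that the updated probabilities sum to one over the clusters. The function name, the argument layout and the fallback for the case where no combination satisfies the rule are assumptions of the sketch.

```python
from collections import Counter
import numpy as np


def updated_cluster_probs(image_concepts, image_of_patch, labels, r, v, g,
                          p_l_given_v, n_clusters):
    """Update p(l | r, v), l = 1..R, by the rule g for one fixed pair (r, v).

    image_concepts[i] -- tuple z of concept values of the i-th training image
    image_of_patch[j] -- index of the image the j-th embedding comes from
    labels[j]         -- cluster index (0-based) of the j-th embedding
    p_l_given_v[l]    -- trained p(l | r, v) for the fixed pair (r, v)
    """
    n_images = len(image_concepts)
    p_z = {z: k / n_images for z, k in Counter(image_concepts).items()}
    weights = np.zeros(n_clusters)
    for l in range(n_clusters):
        # C_{r,v}(l): distinct combinations seen among patches of cluster l.
        combos = {image_concepts[i] for i, k in zip(image_of_patch, labels) if k == l}
        weights[l] = sum(g(z) * p_z[z] for z in combos if z[r] == v)
    p_l_given_v = np.asarray(p_l_given_v, dtype=float)
    unnormalised = weights * p_l_given_v
    total = unnormalised.sum()
    if total == 0:
        return p_l_given_v               # fallback chosen for this sketch only
    return unnormalised / total          # probabilities over clusters sum to 1


def g_rule(z):
    """IF Contour is 3, THEN Diagnosis is 1; z = (diagnosis, contour)."""
    return int(z[1] != 3 or z[0] == 1)


image_concepts = [(1, 3), (2, 3), (1, 1)]
image_of_patch = [0, 0, 1, 1, 2, 2]
labels = [0, 0, 1, 1, 0, 1]
new_p = updated_cluster_probs(image_concepts, image_of_patch, labels,
                              r=1, v=3, g=g_rule, p_l_given_v=[0.5, 0.5], n_clusters=2)
```

In the toy data the only image with Contour $=3$ that reaches the second cluster violates the rule, so the updated probabilities become $(1,\,0)$. A zero obtained in this way is subject to the same replacement by $\epsilon$ as in Section 5 when the multinomial likelihood (7) is evaluated.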

6.3 Expert rules with uncertainty

So far, we have considered hard rules, i.e., rules which the expert regards as TRUE with probability one. Let us study the case when rules are of the form “IF $F$ , THEN $G$ with probability $\pi$ ”. There are different interpretations of the probability of an implication [55]. We apply the interpretation which considers the probability $P(\lnot F\vee G)$ that either $G$ is true or $F$ is false.

A simple way to adapt the uncertain rules to the proposed scheme is to replace the value $g(\mathbf{z})$ in (16) with probabilities $\pi$ and $1-\pi$ . Let us rewrite (12) taking into account the probability $\pi(\mathbf{z})=P\{g(\mathbf{z})=1\}$ that the rule is TRUE for $\mathbf{z}$ , i.e., is satisfied for the combination $\mathbf{z}$ of concepts, as follows:

P\{g(\mathbf{C})=1\mid C^{(r)}=v\}=\sum_{\mathbf{z}\in\mathcal{C}_{r,v}}\pi(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}.

(22)

It can be seen from the above that only the updating coefficient $U_{v}^{(r)}$ in (16) is changed when the probability of the rule is added.

In the same way, the conditional probabilities $P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}$ in (20) can be updated. In this case, we change only the updating coefficient $V_{v}^{(r)}(l)$ as follows:

V_{v}^{(r)}(l)=\frac{\sum_{\mathbf{z}\in\mathcal{C}_{r,v}(l)}\pi(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}}{P\left\{g(\mathbf{C})=1\right\}}.

(23)

It should be noted that the representation of uncertainty in expert rules is an important question in its own right and requires a separate detailed investigation. Therefore, it is not considered further in this work and will be studied in the future.
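As a rough illustration of the sums in (22) and (23), the following sketch computes the rule-weighted probability mass $\sum\pi(\mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}$ for the four concept combinations of the lung nodule example (Table 1). The function name `rule_weighted_mass`, the confidence value $0.9$ and the reading "$\pi(\mathbf{z})=\pi$ if the hard rule holds for $\mathbf{z}$ and $1-\pi$ otherwise" are assumptions made only for this sketch; the normalization by $P\{g(\mathbf{C})=1\}$ is left out.

```python
from fractions import Fraction

def rule_weighted_mass(prob, pi, subset):
    """Return the sum of pi(z) * P{C = z} over combinations z in `subset`."""
    return sum(pi[z] * prob[z] for z in subset)

# P{C = z} for the four combinations (diagnosis, contour, inclusion) of Table 1.
prob = {
    (2, 1, 1): Fraction(2, 10),  # benign-smooth-homogeneous
    (2, 2, 1): Fraction(2, 10),  # benign-grainy-homogeneous
    (1, 3, 1): Fraction(4, 10),  # malignant-spicules-homogeneous
    (1, 2, 2): Fraction(2, 10),  # malignant-grainy-necrosis
}

# Assumed confidence in the rule "IF contour is grainy THEN malignant".
confidence = Fraction(9, 10)

def rule_holds(z):
    """Hard rule g(z): NOT grainy OR malignant."""
    return z[1] != 2 or z[0] == 1

# Assumed reading: pi(z) = confidence if g(z) = 1, else 1 - confidence.
pi = {z: confidence if rule_holds(z) else 1 - confidence for z in prob}

malignant = [z for z in prob if z[0] == 1]
benign = [z for z in prob if z[0] == 2]
mass_malignant = rule_weighted_mass(prob, pi, malignant)  # 27/50
mass_benign = rule_weighted_mass(prob, pi, benign)        # 1/5
```

Setting the confidence to $1$ recovers the hard rule, for which the combination benign-grainy-homogeneous contributes nothing to the sum.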

7 Illustrative example

7.1 Concepts

7.1.1 Training

To illustrate calculation in FI-CBL, we consider an illustrative example with concepts Contour $c^{(1)}$ and Inclusion $c^{(2)}$ of the lung cancer nodule description given in Section 4. The vector of all concept values is

\mathcal{C}=(\underbrace{1,2}_{\mathcal{C}^{(0)}},\underbrace{1,2,3}_{\mathcal{C}^{(1)}},\underbrace{1,2}_{\mathcal{C}^{(2)}}).

(24)

Table 1: Concept values for the illustrative example with the lung nodules

Diagnosis	Contour	Inclusion	# images
benign	smooth	homogeneous	$2$
benign	grainy	homogeneous	$2$
malignant	spicules	homogeneous	$4$
malignant	grainy	necrosis	$2$

Suppose we have $10$ images whose descriptions are shown in Table 1, where the first three columns contain values of the concepts and the fourth column indicates the number of images having the corresponding concept values. Each image is divided into $4$ patches. The numbers of images with values $<$ malignant $>$ and $<$ benign $>$ are $6$ and $4$ , respectively. Suppose that the embeddings corresponding to patches are grouped into $R=3$ clusters such that:

•

the first cluster contains embeddings of the background patches;
•

the second cluster contains embeddings of nodules with spicules or with the grainy contour and with necrosis;
•

the third cluster contains embeddings of nodules which are smooth or grainy and homogeneous.

An example of the corresponding images and patches is schematically depicted in Fig. 2. Prior probabilities $p(r,v)$ for all concepts are shown in Table 2.

Table 2: Prior probabilities $p(r,v)$ for all concepts

$C^{(0)}$		$C^{(1)}$			$C^{(2)}$
$1$	$2$	$1$	$2$	$3$	$1$	$2$
$6/10$	$4/10$	$2/10$	$4/10$	$4/10$	$8/10$	$2/10$

Unconditional probabilities $p(l)$ are

p(1)=28/40,\quad p(2)=7/40,\quad p(3)=5/40.

(25)

Table 3 shows conditional probabilities $p(l\mid r,v)$ . Table 4 contains posterior probabilities $p(r,v\mid l)$ computed by using (3). It follows from Table 4 that the second cluster contains patches which show that the corresponding images are malignant with the probability $1$ . Moreover, the third cluster contains only patches with homogeneous nodules.

Table 3: Conditional probabilities $p(l\mid r,v)$

$v$	$1$	$2$	$1$	$2$	$3$	$1$	$2$
	$C^{(0)}$		$C^{(1)}$			$C^{(2)}$
$K_{1}$	$16/24$	$3/4$	$3/4$	$6/8$	$10/16$	$22/32$	$3/4$
$K_{2}$	$7/24$	$0$	$0$	$1/8$	$5/16$	$5/32$	$1/4$
$K_{3}$	$1/24$	$1/4$	$1/4$	$1/8$	$1/16$	$5/32$	$0$

Table 4: Posterior probabilities $p(r,v\mid l)$

$v$	$1$	$2$	$1$	$2$	$3$	$1$	$2$
	$C^{(0)}$		$C^{(1)}$			$C^{(2)}$
$K_{1}$	$4/7$	$3/7$	$3/14$	$6/14$	$5/14$	$11/14$	$3/14$
$K_{2}$	$1$	$0$	$0$	$2/7$	$5/7$	$5/7$	$2/7$
$K_{3}$	$1/5$	$4/5$	$2/5$	$2/5$	$1/5$	$1$	$0$
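The step from Tables 2 and 3 to Table 4 can be checked with a few lines of code. The sketch below assumes that (3) is the usual Bayes rule and reproduces the columns of the concept $C^{(0)}$ only; the variable names are illustrative.

```python
from fractions import Fraction as F

# Priors p(0, v) from Table 2, keyed by the value v of the concept C^(0).
prior = {1: F(6, 10), 2: F(4, 10)}
# Conditional probabilities p(l | 0, v) from Table 3 for clusters K_1, K_2, K_3.
cond = {1: [F(16, 24), F(7, 24), F(1, 24)],
        2: [F(3, 4), F(0), F(1, 4)]}

# Unconditional cluster probabilities p(l) by the law of total probability.
p_l = [sum(cond[v][l] * prior[v] for v in prior) for l in range(3)]

# Posterior probabilities p(0, v | l) by the Bayes rule.
posterior = {v: [cond[v][l] * prior[v] / p_l[l] for l in range(3)]
             for v in prior}

print(p_l)           # cluster probabilities, equal to 28/40, 7/40, 5/40
print(posterior[1])  # column C^(0) = 1 of Table 4
```

Exact fractions are used so that the result can be compared with the tables without rounding.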

7.1.2 Inference

Suppose that two embeddings of a new instance fall into the first cluster, and two embeddings fall into the third cluster. This implies that $(s_{1},s_{2},s_{3})=(2,0,2)$ . Probabilities $p(l\mid r,v)$ are taken from Table 3. Hence, there holds for $C^{(r)}=v$ :

p(r,v\mid E_{1:R}^{\ast})=\frac{p(r,v)}{P\left\{E_{1:R}^{\ast}\right\}}\frac{4!}{2!\cdot 0!\cdot 2!}p^{2}(1\mid r,v)\cdot p^{0}(2\mid r,v)\cdot p^{2}(3\mid r,v).

(26)

Probabilities $P\left\{E_{1:R}^{\ast}\right\}$ are computed from (8) for every concept $r=0,1,2$ . They are $0.087$ , $0.067$ , $0.056$ . Final results are shown in Table 5. It can be seen from Table 5 that the tested image is $<$ benign $>$ ( $c^{(0)}=2$ with the probability $0.968$ ) with homogeneous nodules ( $c^{(2)}=1$ with the probability $0.999$ ). The decision about the concept Contour depends on the threshold $\gamma_{2}$ . If $\gamma_{2}\leq 0.630$ , then we can say that nodules are smooth. If $\gamma_{2}>0.630$ , then this concept is not available or too uncertain to be included into the image description. At the same time, the small probability of spicules implies that the Contour is smooth or grainy with probability $0.945$ .

Table 5: Posterior probabilities $p(r,v\mid E_{1:R}^{\ast})$

	$C^{(0)}$		$C^{(1)}$			$C^{(2)}$
$v$	$1$	$2$	$1$	$2$	$3$	$1$	$2$
	$0.032$	$0.968$	$0.630$	$0.315$	$0.055$	$0.999$	$0.001$
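The inference step can be reproduced numerically. The following sketch assumes that the posterior is the prior times the multinomial likelihood of the cluster counts $(2,0,2)$ , normalized over the values of the concept, as in (26). It is applied to the concepts Diagnosis and Contour with the probabilities of Tables 2 and 3; the function name `multinomial_posterior` is illustrative.

```python
from math import factorial

def multinomial_posterior(prior, cond, counts):
    """Posterior over the values of one concept given cluster counts.

    prior[i]   - p(r, v_i)
    cond[i][l] - p(l | r, v_i) for cluster l
    counts[l]  - number of embeddings of the new instance in cluster l
    """
    coef = factorial(sum(counts))
    for s in counts:
        coef //= factorial(s)
    joint = []
    for p_v, p_l in zip(prior, cond):
        likelihood = coef
        for p, s in zip(p_l, counts):
            likelihood *= p ** s  # note that 0 ** 0 == 1 in Python
        joint.append(p_v * likelihood)
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

counts = (2, 0, 2)
diagnosis, ev0 = multinomial_posterior(
    [0.6, 0.4],
    [[16 / 24, 7 / 24, 1 / 24], [3 / 4, 0, 1 / 4]],
    counts)
contour, ev1 = multinomial_posterior(
    [0.2, 0.4, 0.4],
    [[3 / 4, 0, 1 / 4], [6 / 8, 1 / 8, 1 / 8], [10 / 16, 5 / 16, 1 / 16]],
    counts)

print([round(p, 3) for p in diagnosis], round(ev0, 3))
print([round(p, 3) for p in contour], round(ev1, 3))
```

After rounding to three digits, the values for these two concepts agree with the corresponding entries of Table 5.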

7.2 Expert rules

7.2.1 Prior probabilities

Let us consider the illustrative example under condition of using the expert rule “IF Contour is $<$ grainy $>$ , THEN Diagnosis is $<$ malignant $>$ ” which can also be written as “IF $c^{(1)}=2$ , THEN $c^{(0)}=1$ ”. The logical function $g(\mathbf{c})$ corresponding to the rule is of the form:

g(\mathbf{c})=[c^{(1)}=2]\rightarrow[c^{(0)}=1]=\lnot\left([c^{(1)}=2]\right)\vee[c^{(0)}=1]=[c^{(0)}=1]\vee[c^{(1)}=1]\vee[c^{(1)}=3].

(27)

The above rule is TRUE if at least one of the logical literals is TRUE.

Let us return to the above example. We have the following combinations $\mathbf{z}$ of the concept labels (see Table 1): benign-smooth-homogeneous, benign-grainy-homogeneous, malignant-spicules-homogeneous, malignant-grainy-necrosis. Let us find how the prior probability $p(0,1)=P\left\{C^{(0)}=1\right\}$ changes when the expert rule from the above example is available. The rule “IF Contour is $<$ grainy $>$ , THEN Diagnosis is $<$ malignant $>$ ” is FALSE when $[c^{(0)}=2]\wedge[c^{(1)}=2]$ , i.e., when the concept values are $<$ grainy $>$ and $<$ benign $>$ . There are two images simultaneously having the concepts $<$ grainy $>$ and $<$ benign $>$ . This implies that $g(\mathbf{z})=0$ for images with concepts $<$ benign-grainy-homogeneous $>$ (see Fig. 2). Hence, we can write $\mathcal{C}_{0,1}=\{(1,3,1),(1,2,2)\}$ , $\mathcal{C}=\{(1,3,1),(1,2,2),(2,1,1)\}$ . Here combinations with $g(\mathbf{z})=0$ are omitted for brevity. Probabilities $P\{\mathbf{C}=\mathbf{z}\}$ are shown in Table 6. Finally, we can write

U_{1}^{(0)}=\frac{P\{\mathbf{C}=(1,3,1)\}+P\{\mathbf{C}=(1,2,2)\}}{P\left\{g(\mathbf{C})=1\right\}},

(28)

where

P\left\{g(\mathbf{C})=1\right\}=P\{\mathbf{C}=(1,3,1)\}\cdot p(0,1)+P\{\mathbf{C}=(1,2,2)\}\cdot p(0,1)+P\{\mathbf{C}=(2,1,1)\}\cdot p(0,2),

(29)

and

P\left\{C^{(0)}=1\mid g(\mathbf{C})=1\right\}=0.818.

(30)

Let us find the prior probability $P\left\{C^{(0)}=2\right\}$ . In this case, there holds $\mathcal{C}_{0,2}=\{(2,1,1)\}$ . Hence, we obtain

U_{2}^{(0)}=\frac{P\{\mathbf{C}=(2,1,1)\}}{P\left\{g(\mathbf{C})=1\right\}},

(31)

and

P\left\{C^{(0)}=2\mid g(\mathbf{C})=1\right\}=0.182.

(32)

It is interesting to point out that the expert rule increases the prior probability of the $<$ malignant $>$ and decreases the probability of the $<$ benign $>$ . Indeed, two images with the concept values $<$ benign $>$ and $<$ grainy $>$ do not correspond to the expert rule. Therefore, these images can be regarded as inadmissible for analyzing probabilities of the concept $c^{(0)}$ . However, it does not mean that they cannot be used for finding probabilities of other concepts.

Table 6: Prior probabilities $P\{\mathbf{C}=\mathbf{z}\}$

$\mathbf{z}$	$(1,3,1)$	$(1,2,2)$	$(2,1,1)$
$P\{\mathbf{C}=\mathbf{z}\}$	$4/10$	$2/10$	$2/10$
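The numbers in (30) and (32) can be reproduced with the short sketch below. It follows (28), (29) and (31) literally, using the probabilities of Tables 2 and 6; the variable names are illustrative.

```python
# P{C = z} for the rule-consistent combinations (Table 6).
prob = {(1, 3, 1): 0.4, (1, 2, 2): 0.2, (2, 1, 1): 0.2}
# Prior probabilities p(0, v) of the concept C^(0) (Table 2).
prior = {1: 0.6, 2: 0.4}

# Probability mass of rule-consistent combinations with c^(0) = v,
# i.e. the numerators in (28) and (31).
mass = {v: sum(p for z, p in prob.items() if z[0] == v) for v in prior}

# Normalizing constant P{g(C) = 1} as written in (29).
p_rule = sum(mass[v] * prior[v] for v in prior)

# Updating coefficients U and the updated prior probabilities.
U = {v: mass[v] / p_rule for v in prior}
updated = {v: U[v] * prior[v] for v in prior}

print({v: round(p, 3) for v, p in updated.items()})
```

The updated probabilities sum to one by construction, because the same normalizing constant is used for both values of the concept.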

7.2.2 Updating posterior probabilities

Let us again return to the above example and find the conditional probabilities $P\left\{E_{l}^{(0)}=v\mid g(\mathbf{C})=1\right\}$ , $v=1,2$ , for every cluster.

Cluster 1: The subset $\mathcal{C}_{0,1}(1)$ consists of combinations $(1,3,1)$ , $(1,2,2)$ because embeddings corresponding to all concepts are included in $K_{1}$ . The subset $\mathcal{C}_{0,2}(1)$ consists of the combination $(2,1,1)$ .

Cluster 2: The subset $\mathcal{C}_{0,1}(2)$ consists of combinations $(1,3,1)$ , $(1,2,2)$ because embeddings with the corresponding concept values are included in $K_{2}$ . The subset $\mathcal{C}_{0,2}(2)$ is empty.

Cluster 3: The subset $\mathcal{C}_{0,1}(3)$ consists of the combination $(1,3,1)$ because one embedding from an image with the concept values $(1,3,1)$ falls into $K_{3}$ . The subset $\mathcal{C}_{0,2}(3)$ consists of the combination $(2,1,1)$ .

Finally, we can write using Tables 3 and 6:

V_{1}^{(0)}(1)=\frac{P\{\mathbf{C}=(1,3,1)\}+P\{\mathbf{C}=(1,2,2)\}}{P\left\{g(\mathbf{C})=1\right\}}=\frac{0.6}{P\left\{g(\mathbf{C})=1\right\}},

(33)

V_{1}^{(0)}(2)=V_{1}^{(0)}(1)=\frac{0.6}{P\left\{g(\mathbf{C})=1\right\}},

(34)

V_{1}^{(0)}(3)=\frac{P\{\mathbf{C}=(1,3,1)\}}{P\left\{g(\mathbf{C})=1\right\}}=\frac{0.4}{P\left\{g(\mathbf{C})=1\right\}},

(35)

where

P\left\{g(\mathbf{C})=1\right\}=0.6\cdot 16/24+0.6\cdot 7/24+0.4\cdot 1/24=0.592.

(36)

Hence, we obtain $V_{1}^{(0)}(1)=V_{1}^{(0)}(2)=1.014$ , $V_{1}^{(0)}(3)=0.676$ . Similarly, we find

V_{2}^{(0)}(1)=V_{2}^{(0)}(3)=\frac{P\{\mathbf{C}=(2,1,1)\}}{P\left\{g(\mathbf{C})=1\right\}}=\frac{0.2}{P\left\{g(\mathbf{C})=1\right\}},\ V_{2}^{(0)}(2)=0,

(37)

where

P\left\{g(\mathbf{C})=1\right\}=0.2\cdot 3/4+0.0\cdot 0+0.2\cdot 1/4=0.2.

(38)

Hence, we obtain $V_{2}^{(0)}(1)=V_{2}^{(0)}(3)=1$ , $V_{2}^{(0)}(2)=0$ . Finally, we find the conditional probabilities $P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}$ from (20), whose values are given in Table 7.

Table 7: Conditional probabilities $P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}$

$v$	$1$	$2$
	$c^{(0)}$
$K_{1}$	$0.676$	$3/4$
$K_{2}$	$0.296$	$0$
$K_{3}$	$0.028$	$1/4$

If we compare conditional probabilities of the concept $c^{(0)}$ obtained without rules (see Table 3) and with rules (see Table 7), then we can see that the probabilities for $c^{(0)}=1$ have been changed whereas probabilities for $c^{(0)}=2$ have not been changed.
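The column $c^{(0)}=1$ of Table 7 can be recomputed from (20), (21) and (33)-(36). The sketch below uses the subsets $\mathcal{C}_{0,1}(l)$ listed above for the three clusters together with Tables 3 and 6; the variable names are illustrative.

```python
# P{C = z} for the rule-consistent combinations (Table 6).
prob = {(1, 3, 1): 0.4, (1, 2, 2): 0.2, (2, 1, 1): 0.2}

# Subsets C_{0,1}(l) for the clusters K_1, K_2, K_3 as listed in the text.
subsets = [
    [(1, 3, 1), (1, 2, 2)],
    [(1, 3, 1), (1, 2, 2)],
    [(1, 3, 1)],
]

# Conditional probabilities p(l | 0, 1) from Table 3.
cond = [16 / 24, 7 / 24, 1 / 24]

# Numerators of the updating coefficients, see (33)-(35).
mass = [sum(prob[z] for z in subset) for subset in subsets]

# Normalizing constant P{g(C) = 1} as in (36).
p_rule = sum(m * p for m, p in zip(mass, cond))

# Updating coefficients (21) and updated conditional probabilities (20).
V = [m / p_rule for m in mass]
updated = [v * p for v, p in zip(V, cond)]

print([round(v, 3) for v in V])
print([round(p, 3) for p in updated])
```

The updated conditional probabilities sum to one over the clusters, in agreement with the normalization condition stated after (21).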

8 Numerical experiments

To study the proposed model, a synthetic dataset is constructed from the well-known MNIST dataset [56] which represents $28\times 28$ pixel handwritten digit images. The original MNIST dataset has a training set of $60,000$ instances and a test set of $10,000$ instances. The dataset is available at http://yann.lecun.com/exdb/mnist/.

Each instance in the synthetic dataset consists of four different digits randomly taken from MNIST such that the instance has two digits in the first row and two digits in the second row as it is shown in Fig. 3. Each instance has the size $56\times 56$ . A similar dataset is used in [33]. Concepts are formed as follows:

•

The concept $c^{(0)}$ (target) is defined as the largest digit among the four digits in each instance. This corresponds to a classification task with seven classes (digits from $3$ to $9$ ) because the four digits in an instance are different.
•

Concepts $c^{(1)},...,c^{(10)}$ are binary and defined by the presence of the corresponding number ( $1,...,9,0$ ) in the instance.
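The concept annotation of a synthetic instance can be sketched as follows. The function name `concepts_for_instance` is hypothetical, and the code assumes, as stated above, that the four digits of an instance are different and that the binary concepts are ordered as $1,...,9,0$ .

```python
def concepts_for_instance(digits):
    """Return (target, presence flags) for an instance of four digits."""
    if len(digits) != 4 or len(set(digits)) != 4:
        raise ValueError("an instance must contain four different digits")
    # Target concept c^(0): the largest digit in the instance.
    target = max(digits)
    # Binary concepts c^(1), ..., c^(10) flag the digits 1, ..., 9, 0.
    order = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
    presence = [int(d in digits) for d in order]
    return target, presence

target, presence = concepts_for_instance([3, 0, 7, 1])
print(target)    # 7
print(presence)  # [1, 0, 1, 0, 0, 0, 1, 0, 0, 1]
```

Since the digits are different, the smallest possible target is $3$ (for the digits $0,1,2,3$ ), which gives the seven target classes mentioned above.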

The proposed model is compared with the original CBM [6]. Preliminary numerical experiments have shown that the original CBM outperforms the proposed model when the number of instances is large. Therefore, we compare the proposed model with CBM by training the models on small numbers of instances (from $500$ to $5000$ ). The number of testing images is $20,000$ .

Details of the proposed model are the following:

•

Images are divided into 4 patches such that each patch contains one digit.
•

The autoencoder is constructed by using a convolutional network; the embedding size is $16$ .
•

The expectation–maximization (EM) algorithm with $80$ clusters is used for clustering.
•

The model is trained for $30$ to $50$ epochs depending on the number of instances taken for training.

CBM is also built on a convolutional network which transforms the image into an embedding. For each concept, a fully connected two-layer network predicts the concept value. The cross-entropy loss function is used. A modification of CBM with the joint training of the bottleneck and targets is used such that the joint bottleneck minimizes the weighted sum of loss functions with all coefficients equal to 1. The F1 measure is used as the accuracy measure in experiments because the training set is imbalanced due to the considered structure of concepts and images.

F1 measures as functions of the training set size for all concepts of the modified MNIST dataset obtained by using the proposed method and CBM are shown in Fig. 4. It can be seen from Fig. 4 that FI-CBL outperforms the original CBM when the training set size is small (smaller than $5000$ instances). This can be explained by the fact that the neural network implementing CBM requires a significantly larger number of instances for training. On the contrary, FI-CBL allows us to obtain acceptable results with a small number of training instances. At the same time, CBM becomes better as the training set increases. In this case, FI-CBL requires finer tuning of the number of clusters and the size of embeddings.

To study how an expert rule affects the prediction accuracy, we invert some correct labels that satisfy the expert rule. The rule is “IF $c^{(9)}=1$ , THEN $c^{(0)}=1$ ”. This is an obvious rule which means that if there is the digit $9$ among four digits in an instance, then the target is $9$ . Let $\beta$ be the portion of instances in the training set whose labels are inverted. Fig. 5 illustrates how F1 measures depend on values of $\beta$ for two cases: before using the rule (the curve with triangle markers) and after using the rule (the curve with circle markers). It can be seen from Fig. 5 that the use of the expert rule allows us to significantly correct the model and to obtain better predictions (the corresponding function decreases slowly as $\beta$ increases). At the same time, if the rule is not used, then the F1 measure decreases quickly as $\beta$ increases.

We also consider a modification of the Large-scale CelebFaces Attributes (CelebA) dataset [57]. The original dataset contains $200,000$ images, each annotated with $40$ face attributes and assigned to one of $10,000$ classes. The modification is restricted to $20$ classes such that images with labels from $1$ to $500$ have the label $0$ , images with labels from $501$ to $1000$ have the label $1$ , etc. The number of concepts is $40$ . All concepts are binary. To compare FI-CBL with CBM, we find the F1 measure averaged over all $40$ concepts. The number of testing images is approximately $65,000$ , i.e., 40% of the original dataset. The K-means algorithm with $256$ clusters is used for clustering. The embedding size is $32$ . The number of patches is $35$ . To ensure that whole concepts fall into separate patches, we propose using overlapping patches similar to sliding windows in convolutional neural networks.
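The idea of overlapping patches can be illustrated by a minimal sliding-window sketch. The image size, the patch size and the stride below are hypothetical and are chosen only so that the window produces $35$ patches; the geometry actually used in the experiment is not specified here. The function names `patch_corners` and `crop` are also illustrative.

```python
def patch_corners(height, width, patch, stride):
    """Top-left corners of square patches taken with a sliding window."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]

def crop(image, corner, patch):
    """Cut one patch out of an image stored as a list of pixel rows."""
    r, c = corner
    return [row[c:c + patch] for row in image[r:r + patch]]

# Hypothetical geometry: 128 x 96 image, 32 x 32 patches, stride 16.
HEIGHT, WIDTH, PATCH, STRIDE = 128, 96, 32, 16

# Synthetic grayscale image used only to exercise the code.
image = [[(y * WIDTH + x) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

corners = patch_corners(HEIGHT, WIDTH, PATCH, STRIDE)
patches = [crop(image, corner, PATCH) for corner in corners]

print(len(patches))  # 7 rows x 5 columns of windows = 35 patches
```

Because the stride is smaller than the patch size, neighbouring patches share half of their pixels, so a concept lying on the border of one patch is likely to be fully contained in an adjacent one.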

F1 measures as functions of the training set size of the modified CelebA dataset obtained by using FI-CBL and CBM are shown in Fig. 6. Similarly to the previous numerical example with the MNIST dataset, FI-CBL outperforms the original CBM when the training set size is small (smaller than $4000$ instances).

Let us consider another numerical example with the original MNIST dataset. All instances are annotated by the following concepts:

0 - target, $\mathcal{C}^{(0)}=\{0,1\}$ , values of the target are given in (39);

1 - the even/odd digit, $\mathcal{C}^{(1)}=\{0,1\}$ ;

2 - the digit is not less than 5 / is less than 5, $\mathcal{C}^{(2)}=\{0,1\}$ ;

3 - the remainder of the digit after division by 3, $\mathcal{C}^{(3)}=\{0,1,2\}$ .

In sum, digits are represented by the following sets of concepts:

$0\rightarrow[0,0,1,0];\ 1\rightarrow[1,1,1,1];\ 2\rightarrow[0,0,1,2];$
$3\rightarrow[1,1,1,0];\ 4\rightarrow[0,0,1,1];\ 5\rightarrow[0,1,0,2];$
$6\rightarrow[1,0,0,0];\ 7\rightarrow[0,1,0,1];\ 8\rightarrow[1,0,0,2];$
$9\rightarrow[0,1,0,0].$	(39)
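The encoding in (39) can be generated programmatically, which also serves as a consistency check of the concept definitions. In the sketch below, the function name `digit_concepts` and the dictionary `TARGET` are illustrative; the target values are copied from (39), whereas the three remaining concepts are computed from the digit.

```python
def digit_concepts(d):
    """Concept values [c1, c2, c3] of a digit under the encoding in (39)."""
    c1 = d % 2       # 1 for an odd digit, 0 for an even digit
    c2 = int(d < 5)  # 1 if the digit is less than 5, 0 otherwise
    c3 = d % 3       # remainder of division by 3
    return [c1, c2, c3]

# Target values c0 copied from (39).
TARGET = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 0, 6: 1, 7: 0, 8: 1, 9: 0}

# Full concept vector [c0, c1, c2, c3] for every digit.
table = {d: [TARGET[d]] + digit_concepts(d) for d in range(10)}

for d, concepts in table.items():
    print(d, "->", concepts)
```

The printed vectors coincide with the ten vectors listed in (39).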

The EM algorithm with $128$ clusters is used for clustering. The embedding size is $16$ . The testing set consists of $24,000$ images.

F1 measures as functions of the training set size for four concepts of the original MNIST dataset obtained by using the proposed method and CBM are shown in Fig. 7. One can see from Fig. 7 that CBM outperforms FI-CBL in most cases even when the training set size is small. This numerical example shows that CBM can provide better results even when the number of training instances is small. The next numerical experiments study how expert rules can correct wrong concept labels and provide better results in comparison with CBM, i.e., we aim to show that CBM provides worse results than FI-CBL when labels are noisy. We again invert some correct labels that satisfy the expert rule. The rule is “IF $c^{(1)}=1$ AND $c^{(2)}=1$ , THEN $c^{(0)}=1$ ”. Fig. 8 illustrates the difference between F1 measures as functions of $\beta$ for two cases: before using the rule and after using the rule. It can be seen from Fig. 8 that the use of the rule allows us to significantly correct the model and to obtain better predictions. When $\beta$ exceeds $0.16$ , both models show worse results.

In the next experiment, we invert half of the target labels, i.e., the target is completely confused. We aim to improve the model prediction by incorporating the increasingly detailed expert rules:

•

$g_{1}(\mathbf{c})=\left[c^{(1)}=1\wedge c^{(2)}=1\rightarrow c^{(0)}=1\right]$ ;
•

$g_{2}(\mathbf{c})=\left[\left(c^{(1)}=0\wedge c^{(2)}=1\rightarrow c^{(0)}=0\right)\wedge\left(c^{(1)}=1\wedge c^{(2)}=1\rightarrow c^{(0)}=1\right)\right]$ ;
•

$g_{3}(\mathbf{c})=\left[(c^{(1)}=0\wedge c^{(2)}=0)\vee(c^{(1)}=1\wedge c^{(2)}=1)\leftrightarrow c^{(0)}=1\right]$ .
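To see in what sense the three rules are increasingly detailed, one can write them as Boolean functions and count how many of the eight binary combinations $(c^{(0)},c^{(1)},c^{(2)})$ each rule rejects. This is only an illustrative sketch; the helper `implies` and the dictionary `rejected` are not part of the method.

```python
def implies(a, b):
    """Material implication a -> b."""
    return (not a) or b

def g1(c0, c1, c2):
    return implies(c1 == 1 and c2 == 1, c0 == 1)

def g2(c0, c1, c2):
    return implies(c1 == 0 and c2 == 1, c0 == 0) and g1(c0, c1, c2)

def g3(c0, c1, c2):
    # The equivalence "<->" is modelled by comparing two Boolean values.
    return ((c1 == 0 and c2 == 0) or (c1 == 1 and c2 == 1)) == (c0 == 1)

# All binary combinations (c0, c1, c2).
combos = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

# Number of combinations for which each rule is FALSE.
rejected = {g.__name__: sum(not g(*z) for z in combos) for g in (g1, g2, g3)}

print(rejected)  # {'g1': 1, 'g2': 2, 'g3': 4}
```

The rule $g_{3}$ fixes the target for every pair $(c^{(1)},c^{(2)})$ and therefore rejects half of the combinations, whereas $g_{1}$ constrains only a single pair.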

Fig. 9 shows histograms of the binary target prediction probabilities $\alpha$ without rules and with different rules. It can be seen from the first histogram that most prediction probabilities are close to $0.5$ . This implies that the inversion of the target labels leads to total uncertainty, i.e., the model cannot correctly classify instances. The first rule $g_{1}(\mathbf{c})$ partially improves results. One can see from the second histogram that a small part of instances are classified with probabilities close to $1$ . However, most prediction probabilities are still close to $0.5$ despite the rule. The second rule $g_{2}(\mathbf{c})$ consists of two rules and corrects many inverted labels. One can see from the third histogram that the uncertainty of the target probabilities significantly decreases in comparison with the previous cases. Finally, the third rule $g_{3}(\mathbf{c})$ , which is the strongest one due to the logical operation “if and only if” denoted as “ $\leftrightarrow$ ”, provides the best results shown in the fourth histogram. We can see that the uncertainty of predictions is minimal. This is an important observation which illustrates how the incorporated expert rules are able to improve the model.

9 Conclusion

Let us point out advantages and disadvantages of FI-CBL.

Advantages:

•

FI-CBL is transparent. In contrast to the neural network implementation of CBL where the network as well as the bottleneck layer are a black box, the proposed model has a clear and explicit sequence of calculations. All probabilities (conditional and unconditional) have a clear frequency interpretation. It allows us to extend the model in order to take into account the possible limitation of instances in datasets and to apply robust statistical methods which can correct the posterior probabilities and improve the classification. Moreover, one can also observe entire processes of training and inference in order to understand results of modeling.
•

FI-CBL is flexible. We can easily change the number of patches, the number of clusters, the autoencoder architecture, and the decision-making threshold.
•

FI-CBL outperforms CBM on the modified MNIST and CelebA datasets when the number of training instances is small.
•

FI-CBL is interpretable.

Disadvantages:

•

The main difficulty is to obtain a good clustering and to choose a proper number of clusters. We could learn the clustering procedure jointly with the autoencoder, but this approach may significantly complicate the method and weaken its ability to deal with small datasets.
•

The effectiveness of FI-CBL greatly depends on the size of patches. A large difference in the size of concepts in an image can lead to significant deterioration of the model. One of the ways to overcome this problem is to use sliding windows for producing patches as it has been implemented in the numerical experiment with CelebA dataset. However, this way requires additional studies.
•

We need to store the dataset in order to use it every time when a new instance is analyzed. On the other hand, we can store only a matrix of all conditional probabilities $P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}$ if the set of expert rules is not changed. If many clusters and expert rules are implemented, then the matrix can be very large.
•

Experiments have illustrated that FI-CBL may be inferior to CBMs when the amount of training data is large.

The above disadvantages can be regarded as problems whose solutions are directions for further research. In addition, the probabilistic approach used in FI-CBL allows us to extend the method to several other problems of machine learning, including anomaly detection, unlearning, the attention mechanism, etc. These are also directions for further research.

References

[1] I. Lage and F. Doshi-Velez. Learning interpretable concept-based models with human feedback. arXiv:2012.02898, Dec 2020.
[2] A. Gupta and P.J. Narayanan. A survey on concept-based approaches for model improvement. arXiv:2403.14566, Mar 2024.
[3] Been Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
[4] Bowen Wang, Liangzhi Li, Y. Nakashima, and H. Nagahara. Learning bottleneck concepts in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10962–10971, 2023.
[5] Chih-Kuan Yeh, Been Kim, S. Arik, Chun-Liang Li, T. Pfister, and P. Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In Advances in neural information processing systems, volume 33, pages 20554–20565, 2020.
[6] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, S. Mussmann, E. Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338–5348. PMLR, 2020.
[7] E. Poeta, G. Ciravegna, E. Pastor, T. Cerquitelli, and E. Baralis. Concept-based explainable artificial intelligence: A survey. arXiv:2312.12936, Dec 2023.
[8] Y. Yamamoto, T. Tsuzuki, and J. Akatsuka. Automated acquisition of explainable knowledge from unannotated histopathology images. Nature Communications, 10(5642):1–9, 2019.
[9] J. Amores. Multiple instance classification: review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
[10] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
[11] J. Yao, X. Zhu, J. Jonnagaddala, N. Hawkins, and J. Huang. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning network. Medical Image Analysis, 65(101789):1–14, 2020.
[12] Z.-H. Zhou. Multi-instance learning: A survey. Technical report, National Laboratory for Novel Software Technology, Nanjing University, 2004.
[13] A.V. Konstantinov and L.V. Utkin. Incorporating expert rules into neural networks in the framework of concept-based learning. arXiv:2402.14726, Feb 2024.
[14] M. Dreyer, R. Achtibat, W. Samek, and S. Lapuschkin. Understanding the (extra-) ordinary: Validating deep model decisions with prototypical concept-based explanations. arXiv:2311.16681, Nov 2023.
[15] Yangqing Jia, J.T. Abbott, J.L Austerweil, T. Griffiths, and T. Darrell. Visual concept learning: Combining machine vision and bayesian generalization on concept hierarchies. In Advances in Neural Information Processing Systems, volume 26, pages 1–9, 2013.
[16] V. Pendyala and Jihye Choi. Concept-based explanations for tabular data. arXiv:2209.05690, Sep 2022.
[17] C. Obermair, A. Fuchs, F. Pernkopf, L. Felsberger, A. Apollonio, and D. Wollmann. Example or prototype? learning concept-based explanations in time-series. In Asian Conference on Machine Learning, pages 816–831. PMLR, 2023.
[18] Wensi Tang, Lu Liu, and Guodong Long. Interpretable time-series classification on few-shot samples. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
[19] Jihye Choi, J. Raghuram, Ryan Feng, Jiefeng Chen, S. Jha, and A. Prakash. Concept-based explanations for out-of-distribution detectors. In International Conference on Machine Learning, pages 5817–5837. PMLR, 2023.
[20] L.R. Sevyeri, I. Sheth, F. Farahnak, and S.A. Enger. Transparent anomaly detection via concept-based explanations. arXiv:2310.10702, Oct 2023.
[21] L.A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 264–279, 2018.
[22] R. Marcinkevičs, P.R. Wolfertstetter, U. Klimiene, Kieran Chin-Cheong, A. Paschke, J. Zerres, M. Denzinger, D. Niederberger, S. Wellmann, E. Ozkan, et al. Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Medical Image Analysis, 91:103042, 2024.
[23] A.A. Meldo, L.V. Utkin, M.S. Kovalev, and E.M. Kasimov. The natural language explanation algorithms for the lung cancer computer-aided diagnosis system. Artificial Intelligence in Medicine, 108(Article 101952):1–10, 2020.
[24] C. Patrício, J.C. Neves, and L.F. Teixeira. Coherent concept-based explanations in medical image and its application to skin lesion diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3798–3807, 2023.
[25] A. Vats, M. Pedersen, and A. Mohammed. Concept-based reasoning in medical imaging. International Journal of Computer Assisted Radiology and Surgery, 18(7):1335–1339, 2023.
[26] An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, et al. Robust and interpretable medical image classifiers via concept bottleneck models. arXiv:2310.03182, Oct 2023.
[27] Jae Hee Lee, S. Lanza, and S. Wermter. From neural activations to concepts: A survey on explaining concepts in neural networks. arXiv:2310.11884, Oct 2023.
[28] A. Mahinpei, J. Clark, I. Lage, F. Doshi-Velez, and Weiwei Pan. Promises and pitfalls of black-box concept learning models. arXiv:2106.13314, Jun 2021.
[29] I. Sheth and S.E. Kahou. Auxiliary losses for learning generalizable concept-based models. arXiv:2311.11108, Nov 2023.
[30] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. arXiv:2205.15480, May 2022.
[31] Z.M. Espinosa, P. Barbiero, G. Ciravegna, G. Marra, F. Giannini, M. Diligenti, Z. Shams, F. Precioso, S. Melacci, A. Weller, et al. Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems, volume 35, pages 21400–21413, 2022.
[32] A.A. Ismail, J. Adebayo, H.C. Bravo, S. Ra, and Kyunghyun Cho. Concept bottleneck generative models. In Proceedings of ICML 2023. Workshop on Deployment Challenges for Generative AI, https://openreview.net/group?id=ICML.cc/2023/Workshop, pages 1–10, 2023.
[33] Eunji Kim, Dahuin Jung, Sangha Park, Siwon Kim, and Sungroh Yoon. Probabilistic concept bottleneck models. In International Conference on Machine Learning, pages 16521–16540. PMLR, 2023.
[34] Naveen R., Mateo E. Zarlenga, Juyeon Heo, and Mateja Jamnik. Do concept bottleneck models obey locality? arXiv:2401.01259, Jan 2024.
[35] R. Marcinkevics, S. Laguna, M. Vandenhirtz, and J.E. Vogt. Beyond concept bottleneck models: How to make black boxes intervenable? arXiv:2401.13544, Jan 2024.
[36] F. Pittino, V. Dimitrievska, and R. Heer. Hierarchical concept bottleneck models for vision and their application to explainable fine classification and tracking. Engineering Applications of Artificial Intelligence, 118:105674, 2023.
[37] M. Havasi, S. Parbhoo, and F. Doshi-Velez. Addressing leakage in concept bottleneck models. In Advances in Neural Information Processing Systems, volume 35, pages 23386–23397, 2022.
[38] A. Radford, Jong Wook Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[39] R. Kazmierczak, E. Berthier, G. Frehse, and G. Franchi. CLIP-QDA: An explainable concept bottleneck model. arXiv:2312.00110, Dec 2023.
[40] K. Chauhan, R. Tiwari, J. Freyberg, P. Shenoy, and K. Dvijotham. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 5948–5955, 2023.
[41] Yan Cui, Shuhong Liu, Liuzhuozheng Li, and Zhiyuan Yuan. Ceir: Concept-based explainable image representation learning. arXiv:2312.10747, Dec 2023.
[42] E. Marconato, A. Passerini, and S. Teso. Glancenets: Interpretable, leak-proof concept-based models. In Advances in Neural Information Processing Systems, volume 35, pages 21212–21227, 2022.
[43] A. Margeloiu, M. Ashman, U. Bhatt, Yanzhi Chen, M. Jamnik, and A. Weller. Do concept bottleneck models learn as intended? arXiv:2105.04289, May 2021.
[44] Ao Sun, Yuanyuan Yuan, Pingchuan Ma, and Shuai Wang. Eliminating information leakage in hard concept bottleneck models with supervised, hierarchical concept learning. arXiv:2402.05945, Feb 2024.
[45] L. Von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, et al. Informed machine learning–a taxonomy and survey of integrating prior knowledge into learning systems. IEEE Transactions on Knowledge and Data Engineering, 35(1):614–633, 2021.
[46] M. Diligenti, M. Gori, and C. Sacca. Semantic-based regularization for learning and inference. Artificial Intelligence, 244:143–165, 2017.
[47] M. Diligenti, S. Roychowdhury, and M. Gori. Integrating prior knowledge into deep learning. In 2017 16th IEEE international conference on machine learning and applications (ICMLA), pages 920–923. IEEE, 2017.
[48] Zhiting Hu, Zichao Yang, R. Salakhutdinov, and Eric Xing. Deep neural networks with massive learned knowledge. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1670–1679, 2016.
[49] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In International conference on machine learning, pages 5502–5511. PMLR, 2018.
[50] M.V.M. França, G. Zaverucha, and A.S. d’Avila Garcez. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine Learning, 94:81–104, 2014.
[51] A. Garcez, M. Gori, L.C. Lamb, L. Serafini, M. Spranger, and S.N. Tran. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. Journal of Applied Logics, 6(4):611–632, 2019.
[52] Xiao-Wen Yang, Jie-Jing Shao, Wei-Wei Tu, Yu-Feng Li, Wang-Zhou Dai, and Zhi-Hua Zhou. Safe abductive learning in the presence of inaccurate rules. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16361–16369, 2024.
[53] V. Chen, U. Bhatt, H. Heidari, A. Weller, and A. Talwalkar. Perspectives on incorporating expert feedback into model updates. Patterns, 4(7):1–13, 2023.
[54] Kaiwen Xu, K. Fukuchi, Y. Akimoto, and J. Sakuma. Statistically significant concept-based explanation of image classifiers via model knockoffs. arXiv:2305.18362, May 2023.
[55] Hung T. Nguyen, Masao Mukaidono, and V. Kreinovich. Probability of implication, logical version of Bayes theorem, and fuzzy logic operations. In IEEE World Congress on Computational Intelligence. IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02, volume 1, pages 530–535. IEEE, 2002.
[56] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[57] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738. IEEE, 2015.