FI-CBL: A Probabilistic Method for Concept-Based Learning with Expert Rules

Lev V. Utkin, Andrei V. Konstantinov and Stanislav R. Kirpichenko
Higher School of Artificial Intelligence Technologies
Peter the Great St.Petersburg Polytechnic University
St.Petersburg, Russia
Abstract

A method for solving the concept-based learning (CBL) problem is proposed. The main idea behind the method is to divide each concept-annotated image into patches, to transform the patches into embeddings by using an autoencoder, and to cluster the embeddings assuming that each cluster will mainly contain embeddings of patches with certain concepts. To find concepts of a new image, the method implements frequentist inference by computing prior and posterior probabilities of concepts based on rates of patches from images with certain values of the concepts. Therefore, the proposed method is called the Frequentist Inference CBL (FI-CBL). FI-CBL allows us to incorporate expert rules in the form of logic functions into the inference procedure. The idea behind the incorporation is to update prior and conditional probabilities of concepts to satisfy the rules. The method is transparent because it has an explicit sequence of probabilistic calculations and a clear frequency interpretation. Numerical experiments show that FI-CBL outperforms the concept bottleneck model when the amount of training data is small. The code of the proposed algorithms is publicly available.

Keywords: concept-based learning, expert rules, Bayes rule, classification, logical function, inductive and deductive learning.

1 Introduction

Concept-based machine learning (CBL) is an innovative and promising approach that focuses on utilizing high-level concepts derived from raw features to express predictions in machine learning models, rather than using the raw features themselves [1]. This approach aims to integrate expert knowledge or human-like reasoning into machine learning models, leading to more efficient and accurate predictions. By incorporating high-level concepts, CBL can significantly improve explainability of the machine learning model outputs, making them more accessible to users [2, 3, 4, 5].

One of the key types of CBL is the concept bottleneck model (CBM), which can be regarded as a technique for learning high-level representations of data by forcing the model to learn a compressed, low-dimensional representation of input features [6]. This low-dimensional representation is referred to as the bottleneck. In the context of CBL, the CBM learns high-level concepts by transforming the raw features into a low-dimensional space, which allows the model to capture the essential features while discarding irrelevant information. Moreover, the classifier deriving final labels has access only to the concept representation, and the decision is strongly tied to the concepts [4]. CBMs have been widely used in various machine learning tasks, such as image recognition, natural language processing, and speech recognition [7]. The effectiveness of CBMs lies in their ability to learn meaningful high-level concepts, which can be easily interpreted and understood by humans. This capability makes CBMs a powerful tool for developing explainable and interpretable machine learning models.

It is important to point out that most models implementing CBL are based on a deep neural network which transforms raw features or images into a specific low-dimensional representation, for example, the bottleneck, which contains information about the concept values. As a result, the neural network may be complex and require a huge number of labeled instances for training, which are not available in some cases. Therefore, we propose an extremely simple and transparent model for CBL which is motivated by the work [8], devoted to annotating histopathology images, and by an interesting observation of the relationship between CBL and multiple instance learning (MIL) models [9, 10, 11, 12]. MIL is a type of weakly supervised learning, which deals with two notions: bags and instances. Each bag is labeled and consists of many instances, i.e., its elements. For example, a digital histology image obtained from glass microscope slides with a label indicating a disease, for example, cancer or non-cancer, can be viewed as a “bag” consisting of patches extracted from the image, which are referred to as “instances”.

It turns out that MIL can be regarded as a special case of concept-based learning where labels (concepts) of instances are unknown, but there is a label description of the whole image (the bag). Yamamoto et al. [8] proposed a simple algorithm for annotating instances (patches) given annotations of the whole images. The algorithm is based on clustering the patch embeddings and computing the probabilities that patches in each cluster are malignant or benign in the simplest way: by calculating the rates of patches in the cluster that come from malignant and from benign images.

We extend the algorithm proposed by Yamamoto et al. [8] to CBL and develop a simple method for determining concepts of new images. The method is based on frequentist inference and on computing prior and posterior probabilities of concepts using rates of patches from images with certain values of concepts. In other words, we calculate the relative frequencies of patches from images with the concepts. Therefore, the proposed method is called the Frequentist Inference CBL (FI-CBL). In addition, we propose approaches to incorporate knowledge-based expert rules of the form “IF …, THEN …”, which are elicited from experts and constructed by means of concepts. For example, a rule from the lung cancer diagnostics of a nodule can look like “IF Contour is <spicules>, Inclusion is <necrosis>, THEN Diagnosis is <malignant>”. Here Contour, Inclusion, and Diagnosis are concepts, and their values are given in angle brackets. It is important to note that an approach to incorporating knowledge-based expert rules into neural networks in the framework of CBL has been proposed in [13]. Its idea is to add a special layer to a classification neural network, which computes a probability distribution of concepts in accordance with the available expert rules. According to [13], the probability distributions of concepts are approximately generated by the neural network and are corrected by the incorporated expert rules. We propose another approach for taking into account the expert rules. It is based on the combination of the Bayes rule and the multinomial distribution. The expert rules in the form of logic functions update prior probabilities of concepts as well as conditional probabilities in the Bayes rule. This is a simple way of using the expert rules as a specific type of constraints on probabilities of concepts. Note that the term “expert rules” is used in the proposed model in a broader sense as an arbitrary logical function of concepts.

The code of the proposed algorithms is available at:

https://github.com/NTAILab/simple_concepts.

2 Related work

Concept-based learning models. Starting from the works [3, 5], many CBL models have been proposed to implement ideas behind concept-based learning under various conditions. One of the goals of developing CBL approaches is to interpret and explain predictions of machine learning models. In order to achieve this goal and to overcome some difficulties of the interpretation, a CBL model was proposed in [1]. Concepts in this model are fully transparent, thus enabling users to validate whether mental and machine models are aligned. A concept-based explanation framework was presented in [14]. An algorithm for learning visual concepts from images by applying a Bayesian generalization model was developed in [15]. It should be noted that not only images (visual data) are used in the CBL models, but also tabular data. For example, the concept attribution approach to tabular learning and a definition of concepts over tabular data were proposed in [16]. Applications of CBL go beyond explanation and extend to a wide variety of problems. Approaches for analyzing time-series data using the CBL models were presented in [17, 18]. The well-known anomaly detection task in the framework of CBL was considered in [19, 20]. One of the important areas of CBL application is medicine, where doctors prefer explanations that are user-friendly and represented via natural language [21]. Several authors have contributed to the development of medical CBL models [22, 23, 24, 25, 26]. Survey papers [2, 27, 28] comprehensively discuss many aspects of the CBL models and their applications.

Concept bottleneck models. Most CBL models are implemented in the form of the CBMs [6]. Due to the efficient and transparent two-module architecture of CBMs, where the first module (a neural network) implements the dependence of concepts on input instances, and the second module implements the dependence of the target variable on the concepts, many modifications and extensions of these models have been proposed [3, 29, 30].

An extension of CBMs is the concept embedding model [31] which learns two embeddings per concept, one for when it is active, and another when it is inactive. The model aims to overcome the current accuracy-vs-interpretability trade-off. Ideas behind the concept embedding model have been used in the concept bottleneck generative models [32].

Similarly to CBL, extensions and modifications of CBMs were motivated by their applications or different conditions of their applications. For example, to model the ambiguity in the concept predictions, a probabilistic CBM was introduced in [33]. Conditions of independence across concepts were studied in [34]. Different aspects of the concept-based interventions were considered in [35]. An application of CBMs to image segmentation and tracking was presented in [36]. Two causes of performance disparity between soft CBMs (inputs to the label predictor are the concept probabilities) and hard CBMs (the label predictor only accepts binary concepts) were identified in [37]. CLIP-based CBMs using the well-known CLIP model [38] were proposed in [39].

The above modifications and extensions of CBMs can be regarded as a small part of all available modifications [40, 41, 42, 43, 44], which are caused by the great interest in CBL.

Incorporating expert rules into machine learning models. The idea of combining the prior expert knowledge with machine learning models has already attracted some interest, and a number of interesting approaches have been proposed for its implementation. A comprehensive and exhaustive review of various available approaches to integrate prior knowledge into the training process was presented in [45]. Authors in [45] also propose a concept of informed machine learning, which can be viewed as a uniting term for different approaches.

Depending on the knowledge representation, there are different approaches for the knowledge integration into the machine learning pipeline. We do not touch upon a large class of methods related to the representation of knowledge in forms of algebraic equations, differential equations, probabilistic relations, etc. An analysis and review of these methods can again be found in [45]. Our goal is to study the knowledge representation in the form of expert rules or logic rules. A common approach to incorporate the logic rules into a machine learning model is to add the rules as constraints to loss functions [46, 47, 48, 49]. However, this approach does not guarantee that the rules will be satisfied for all training and testing instances because violation of the constraints is only penalized, but not eliminated. Another way to integrate the rules into neural networks is to map components of the rules to neurons [50, 51]. In this approach, a neural network implements logic functions corresponding to the expert rules. An interesting approach has been proposed in [52] where authors present an effective safe abductive learning method and show that induction and abduction are mutually beneficial. An extensive review of methods for updating models based on expert feedback is presented in [53].

A quite different approach to incorporate expert rules into machine learning in the framework of CBL was presented in [13]. Following this approach, we present a computationally simple algorithm for predicting the concept probabilities and provide a way to incorporate the expert rules in the form of logic functions.

3 Background

A common statement of the CBL problem is based on considering a classifier which predicts a set of concepts as well as the target variable [54].

Suppose that a training set is represented as a set of triples $(\mathbf{x}_{i},y_{i},\mathbf{c}_{i})$, $i=1,...,N$, where $\mathbf{x}_{i}\in\mathcal{X}\subset\mathbb{R}^{d}$ is the input feature vector; $y_{i}\in\mathcal{Y}=\{1,2,...,K\}$ is the corresponding target defining a $K$-class classification task; $\mathbf{c}_{i}=(c_{i}^{(1)},...,c_{i}^{(m)})\in\mathcal{C}$ is a set of $m$ concepts which are given with targets. In most works, concepts are represented as a vector $\mathbf{c}_{i}$ with $m$ binary elements such that $c_{i}^{(j)}=1$ denotes that the $j$-th concept is present in a description of the input $\mathbf{x}_{i}$, and $c_{i}^{(j)}=0$ denotes that the $j$-th concept is not present.

One of the CBL goals is to predict targets and concepts, that is, to find the mapping $h:\mathcal{X}\rightarrow(\mathcal{C},\mathcal{Y})$ from inputs to concepts and targets. Another goal is to explain which concepts of the input are responsible for the corresponding prediction. In other words, the CBL model aims to interpret how predictions depend on concepts of the corresponding inputs. The above goals can be achieved by applying the CBM proposed by Koh et al. [6] as an important type of the CBL models. The function $h$ in the CBM is represented as two functions: the first one, $g:\mathcal{X}\rightarrow\mathcal{C}$, maps the input vector to concepts; the second function, $f:\mathcal{C}\rightarrow\mathcal{Y}$, maps the concepts to the outputs. The prediction $y$ for a new instance $\mathbf{x}$ can be obtained as $y=f(g(\mathbf{x}))$. Here concepts act as a bottleneck in the interpretation of predictions.
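The two-stage structure $y=f(g(\mathbf{x}))$ can be illustrated with a minimal sketch. The functions `make_cbm`, `toy_g`, and `toy_f` are hypothetical names introduced only for this illustration; in an actual CBM both $g$ and $f$ are learned models, whereas here they are hand-written toy rules.

```python
from typing import Callable, List, Tuple

Concepts = List[int]


def make_cbm(
    g: Callable[[List[float]], Concepts],
    f: Callable[[Concepts], int],
) -> Callable[[List[float]], Tuple[Concepts, int]]:
    """Compose g: X -> C and f: C -> Y into h: X -> (C, Y)."""

    def h(x: List[float]) -> Tuple[Concepts, int]:
        concepts = g(x)  # the label predictor f never sees x directly
        return concepts, f(concepts)

    return h


def toy_g(x: List[float]) -> Concepts:
    """Toy concept predictor (assumption): concept j is present if x[j] > 0.5."""
    return [int(value > 0.5) for value in x]


def toy_f(c: Concepts) -> int:
    """Toy label predictor (assumption): class 2 if all concepts are present, else class 1."""
    return 2 if all(c) else 1


h = make_cbm(toy_g, toy_f)
concepts, label = h([0.9, 0.7])
```

Because `toy_f` receives only the concept vector, changing a concept value by hand changes the prediction in a traceable way, which is the property the bottleneck is meant to provide.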

4 The model and its training

It is assumed that each concept $c^{(i)}$ can take a value from the set $\mathcal{C}^{(i)}=\{1,...,n_{i}\}$ called the $i$-th concept outcome set, $i\in\{0,\dots,m\}$. The concept $c^{(0)}$ is a special concept corresponding to the target variable $y$.

We also suppose that there are $N$ images $\mathbf{x}_{i}$, $i=1,...,N$, in the training set such that the $i$-th image is characterized by a set of concept values $\mathbf{c}_{i}=(c_{i}^{(0)},...,c_{i}^{(m)})$. The whole dataset consists of pairs $(\mathbf{x}_{i},\mathbf{c}_{i})$ of vectors. For example, the lung cancer nodule description can be based on two concepts: Contour $c^{(1)}$ and Inclusion $c^{(2)}$. The concept Contour takes values <smooth>, <grainy>, <spicules>; the concept Inclusion takes values <homogeneous> and <necrosis>. The target value $y$ or the concept Diagnosis $c^{(0)}$ takes values <malignant> and <benign>. Then we have the formal concept description $\mathcal{C}^{(0)}=\{1,2\}$, $\mathcal{C}^{(1)}=\{1,2,3\}$, $\mathcal{C}^{(2)}=\{1,2\}$.

Let us divide each image $\mathbf{x}_{i}$ into $s$ patches of the same dimension denoted as $\xi_{1}^{(i)},...,\xi_{s}^{(i)}$. By using an autoencoder, we can obtain the corresponding embeddings $e_{1}^{(i)},...,e_{s}^{(i)}$ of a smaller dimension.
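The patch-splitting step can be sketched as follows. This is only an illustration under stated assumptions: the patches are taken to be non-overlapping and in row-major order, which the text does not specify, and the helper name `split_into_patches` is hypothetical. The autoencoder that maps each patch to an embedding is not shown.

```python
from typing import List

Image = List[List[int]]


def split_into_patches(image: Image, patch_h: int, patch_w: int) -> List[Image]:
    """Split a 2-D image (list of rows) into non-overlapping patches of equal size."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Take patch_h consecutive rows and cut the same column window from each.
            patches.append([row[left:left + patch_w] for row in image[top:top + patch_h]])
    return patches


# A 4x4 toy image whose pixel values are 0..15, split into s = 4 patches of size 2x2.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_h=2, patch_w=2)
```

With equal-sized images and a fixed patch size, every image yields the same number $s$ of patches.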

In fact, we have a weakly supervised learning task where labels of the whole images are known, including values of all concepts or a part of concepts, but labels of patches are unknown. However, we can compute probabilities of labels for patches by separating embeddings corresponding to patches into groups (clusters) with different contents and by counting up how many whole images having a certain concept value contain patches from each group (cluster).

All embeddings are clustered into $R$ clusters $K_{1},...,K_{R}$, i.e., we obtain subsets of embeddings, which fall into the $k$-th cluster, of the form:

$$\left\{e_{j}^{(i)},\ i\in\mathcal{I}_{k},\ j\in\mathcal{J}_{k}\right\}, \tag{1}$$

where $\mathcal{I}_{k}$ and $\mathcal{J}_{k}$ are index sets such that the $k$-th cluster contains embeddings of patches with indices from $\mathcal{J}_{k}$ belonging to images with indices from $\mathcal{I}_{k}$.
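The clustering step can be sketched with a few iterations of a plain k-means procedure. This choice is an assumption made for illustration only, since the text above does not name the clustering algorithm; the helper names and the deterministic initialization from the first points are likewise hypothetical. Each embedding is keyed by its (image index $i$, patch index $j$) pair so that the members of a cluster can be read off in the form of (1).

```python
from typing import Dict, List, Tuple


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point: List[float], centers: List[List[float]]) -> int:
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def kmeans(points: List[List[float]], num_clusters: int, num_iters: int = 20):
    """Plain Lloyd iterations; centers start at the first points (illustrative choice)."""
    centers = [list(p) for p in points[:num_clusters]]
    labels = [0] * len(points)
    for _ in range(num_iters):
        labels = [nearest_center(p, centers) for p in points]
        for k in range(num_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy embeddings keyed by (image index i, patch index j).
embeddings: Dict[Tuple[int, int], List[float]] = {
    (1, 1): [0.0, 0.1],
    (1, 2): [5.0, 5.1],
    (2, 1): [0.2, 0.0],
    (2, 2): [5.2, 4.9],
}
keys = list(embeddings)
labels, centers = kmeans([embeddings[key] for key in keys], num_clusters=2)

# Cluster k (0-based here) holds the (i, j) pairs of the embeddings assigned to it.
clusters = {k: [key for key, lab in zip(keys, labels) if lab == k] for k in range(2)}
```

In this toy example each cluster receives one patch from each image, so neither cluster is tied to a single image.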

Clusters contain $s_{1},...,s_{R}$ embeddings such that $s_{1}+...+s_{R}=S$, where $S$ is the total number of patches obtained from all images. If all images are divided into the same number of patches, then $S=s\cdot N$. Let us write all concepts in the form of one vector consisting of concatenated vectors of indices:

$$\mathcal{C}=(\underbrace{1,...,n_{0}}_{\mathcal{C}^{(0)}},\underbrace{1,...,n_{1}}_{\mathcal{C}^{(1)}},...,\underbrace{1,...,n_{m}}_{\mathcal{C}^{(m)}}). \tag{2}$$
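For the lung nodule example, the concatenated vector (2) can be built as in the sketch below. The dictionary layout and the helper `position` are hypothetical conveniences introduced for this illustration and are not taken from the published code.

```python
# Concept outcome sets for the lung nodule example; insertion order gives r = 0, 1, 2.
outcome_sets = {
    "Diagnosis": ["malignant", "benign"],          # C^(0), the target
    "Contour": ["smooth", "grainy", "spicules"],   # C^(1)
    "Inclusion": ["homogeneous", "necrosis"],      # C^(2)
}

# Index sets {1, ..., n_r} for every concept r.
index_sets = [list(range(1, len(values) + 1)) for values in outcome_sets.values()]

# Concatenated vector of indices, as in (2).
concatenated = [v for indices in index_sets for v in indices]

# Start offset of every concept inside the concatenated vector.
offsets = []
total = 0
for indices in index_sets:
    offsets.append(total)
    total += len(indices)


def position(r: int, v: int) -> int:
    """0-based position of value v (1-based) of concept r in the concatenated vector."""
    return offsets[r] + v - 1
```

Such a flat layout is convenient when per-concept probabilities are stored in one array, because the slice belonging to concept $r$ starts at `offsets[r]` and has length $n_r$.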

A general scheme of the above representation of the images in the form of embeddings of patches divided into $R$ clusters is depicted in Fig. 1.

Figure 1: A general scheme of the image transformation to sets of clustered embeddings of patches: each image is divided into $s$ patches $\xi_{j}^{(i)}$; every patch is transformed into an embedding $e_{j}^{(i)}$; embeddings are clustered into $R$ clusters; probabilities of concepts are computed by having the image concept labels $\mathbf{c}_{i}$ and distributions of embeddings with corresponding concept values in clusters

If we assume that each concept is a random variable $C^{(r)}$ taking values from $\mathcal{C}^{(r)}$, then we aim to find the conditional probability $p(r,v\mid l)=P\left\{C^{(r)}=v\mid e\in K_{l}\right\}$ that the concept $C^{(r)}$ takes the value $v$ under the condition that the embedding $e$ of a patch taken from a considered image falls into the cluster $K_{l}$. Let us define the following additional probabilities and their short notations:

  • $p(l\mid r,v)=P\left\{e\in K_{l}\mid C^{(r)}=v\right\}$ is the conditional probability that an embedding from an image having the value $v$ of the $r$-th concept falls into the cluster $K_{l}$;

  • $p(r,v)=P\left\{C^{(r)}=v\right\}$ is the prior probability that an embedding is from an image having the value $v$ of the $r$-th concept;

  • $p(l)=P\left\{e\in K_{l}\right\}$ is the unconditional probability that an embedding falls into the cluster $K_{l}$.

By using the Bayes rule, we write

$$p(r,v\mid l)=\frac{p(l\mid r,v)\cdot p(r,v)}{p(l)}. \tag{3}$$

It is important to point out that we do not need probabilities $p(r,v\mid l)$ for the inference when a new instance is classified. We need to know only $p(l\mid r,v)$ and $p(r,v)$ for the inference. On the other hand, the probability $p(r,v\mid l)$ can be regarded as a measure for determining what values of concepts the embeddings in the $l$-th cluster have. A large probability $p(r,v\mid l)$ implies that the embeddings in $K_{l}$ have mainly the concept value $c^{(r)}=v$. If we assume that clusters are homogeneous to some extent, then embeddings contained in the cluster $K_{l}$ correspond to patches from images with the concept $c^{(r)}=v$. This information is important because it allows us to highlight areas in the image with given concepts.

Let us introduce the following additional notations:

  • $s_{v}^{(r)}(l)$ is the number of embeddings in the $l$-th cluster obtained from images with $c^{(r)}=v$;

  • $s_{v}^{(r)}$ is the total number of embeddings in all clusters obtained from images with $c^{(r)}=v$.

The conditional probability $p(l\mid r,v)$ is determined as the proportion of embeddings from images with $c^{(r)}=v$ that fall into the cluster $K_{l}$ to the entire set of embeddings from the images with $c^{(r)}=v$, i.e., there holds $p(l\mid r,v)=s_{v}^{(r)}(l)/s_{v}^{(r)}$.

The prior probability $p(r,v)$ is determined as the proportion of embeddings from images with $c^{(r)}=v$ among all embeddings, which coincides with the proportion of such images in the dataset when all images are divided into the same number of patches, i.e., there holds $p(r,v)=s_{v}^{(r)}/S$. The unconditional probability $p(l)$ can be computed from the condition:

$$\sum_{v=1}^{n_{r}}p(r,v\mid l)=1. \tag{4}$$

It can also be determined as the proportion of embeddings in the cluster $K_{l}$ to all embeddings from all images, i.e., there holds $p(l)=s_{l}/S$. Hence, the posterior probability is computed as $p(r,v\mid l)=s_{v}^{(r)}(l)/s_{l}$.
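The frequency estimates above can be checked on a small example. The sketch below is an illustration with made-up cluster assignments for a single concept $r$; the function names are hypothetical. It computes $p(l\mid r,v)$, $p(r,v)$, and $p(l)$ from counts with exact fractions and confirms that the Bayes rule (3) reduces to $s_{v}^{(r)}(l)/s_{l}$.

```python
from collections import Counter
from fractions import Fraction

# Made-up data for one concept r: each training embedding is recorded as
# (cluster index l, value v of the concept for the image the patch came from).
assignments = [(1, 1), (1, 1), (1, 2), (2, 2), (2, 2), (2, 1), (1, 1), (2, 2)]

S = len(assignments)                               # total number of embeddings
s_vl = Counter((v, l) for l, v in assignments)     # s_v^(r)(l)
s_v = Counter(v for _, v in assignments)           # s_v^(r)
s_l = Counter(l for l, _ in assignments)           # s_l


def likelihood(l: int, v: int) -> Fraction:
    """p(l | r, v) = s_v^(r)(l) / s_v^(r)."""
    return Fraction(s_vl[(v, l)], s_v[v])


def prior(v: int) -> Fraction:
    """p(r, v) = s_v^(r) / S."""
    return Fraction(s_v[v], S)


def evidence(l: int) -> Fraction:
    """p(l) = s_l / S."""
    return Fraction(s_l[l], S)


def posterior(v: int, l: int) -> Fraction:
    """p(r, v | l) computed by the Bayes rule (3)."""
    return likelihood(l, v) * prior(v) / evidence(l)
```

In this example three of the four embeddings in cluster 1 come from images with $v=1$, so `posterior(1, 1)` equals 3/4, and the posteriors over $v$ for a fixed cluster sum to one, as required by (4).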

5 The model inference

Suppose we have a new instance 𝐱𝐱\mathbf{x}bold_x consisting of s𝑠sitalic_s patches ξ1,,ξssubscript𝜉1subscript𝜉𝑠\xi_{1},...,\xi_{s}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ξ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT which are fed to the trained autoencoder in order to obtain embeddings e1,,essubscript𝑒1subscript𝑒𝑠e_{1},...,e_{s}italic_e start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_e start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. These embeddings are distributed among clusters K1,,KRsubscript𝐾1subscript𝐾𝑅K_{1},...,K_{R}italic_K start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_K start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT according to their distances to the cluster centers. Let s1,,sRsubscript𝑠1subscript𝑠𝑅s_{1},...,s_{R}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_s start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT be numbers of embeddings which fall into clusters K1,,KRsubscript𝐾1subscript𝐾𝑅K_{1},...,K_{R}italic_K start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_K start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT, respectively, such that s1++sR=ssubscript𝑠1subscript𝑠𝑅𝑠s_{1}+...+s_{R}=sitalic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + … + italic_s start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT = italic_s. Let us denote the set of embeddings, produced by a new instance, which fall into Klsubscript𝐾𝑙K_{l}italic_K start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT, as El={e(1,l),,e(sl,l)}superscriptsubscript𝐸𝑙superscript𝑒1𝑙superscript𝑒subscript𝑠𝑙𝑙E_{l}^{\ast}=\{e^{\ast}(1,l),...,e^{\ast}(s_{l},l)\}italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = { italic_e start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( 1 , italic_l ) , … , italic_e start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT , italic_l ) }. We do not have any information about concepts and their values which describe the instance. 
However, we can compute the probability that $C^{(r)}$ is equal to $v$ for all $r = 1, \dots, m$, $v = 1, \dots, n_r$, under the condition that $s_1, \dots, s_R$ embeddings fall into clusters $K_1, \dots, K_R$, respectively, i.e., find the conditional probability

$$p(r,v \mid E_{1:R}^{\ast}) = P\left\{C^{(r)} = v \mid E_{1}^{\ast}, \dots, E_{R}^{\ast}\right\}. \tag{5}$$

Here $E_{1:R}^{\ast}$ is a short notation for $E_1^{\ast}, \dots, E_R^{\ast}$. This probability depends on the probabilities that embeddings with the concept $c^{(r)} = v$ fall into $K_l$, $l = 1, \dots, R$. These probabilities can be estimated by means of the probability $p(l \mid r,v)$, which has been considered above and defined as the proportion of embeddings from images with $c^{(r)} = v$ that fall into the cluster $K_l$ to the entire set of embeddings from the images with $c^{(r)} = v$.

Then we can write the following Bayes rule:

$$p(r,v \mid E_{1:R}^{\ast}) = \frac{p(E_{1:R}^{\ast} \mid r,v) \cdot p(r,v)}{P(E_{1:R}^{\ast})}, \tag{6}$$

where $p(E_{1:R}^{\ast} \mid r,v) = P\{E_{1:R}^{\ast} \mid C^{(r)} = v\}$ is the conditional probability that $s_1, \dots, s_R$ embeddings fall into clusters $K_1, \dots, K_R$, respectively, under the condition that the image has the value $v$ of the $r$-th concept; $P\{E_{1:R}^{\ast}\}$ is the unconditional probability that $s_1, \dots, s_R$ embeddings fall into clusters $K_1, \dots, K_R$.

We propose to apply the multinomial distribution with event probabilities $p(l \mid r,v)$ in order to represent $P\{E_{1:R}^{\ast} \mid C^{(r)} = v\}$:

$$p(E_{1:R}^{\ast} \mid r,v) = \frac{s!}{s_1! \cdots s_R!} \prod_{l=1}^{R} p^{s_l}(l \mid r,v). \tag{7}$$

The probability $p(l \mid r,v)$ has been defined in (3). The unconditional probability $P\{E_{1:R}^{\ast}\}$ can be computed by using the condition:

$$\sum_{v=1}^{n_r} p(r,v \mid E_{1:R}^{\ast}) = 1. \tag{8}$$

The main difficulty of using (7) is that some probabilities $p(l \mid r,v)$ may be $0$. In this case, the product in (7) is also $0$. In order to overcome this problem, we propose to replace the zero-valued probabilities with some small value $\epsilon > 0$. In this case, the probability $P(E_{1:R}^{\ast})$ can be computed only by means of (8).
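The computation in (6)-(8) can be sketched as follows. This is an illustration under assumptions rather than the reference implementation: the function name `concept_posterior`, the default `eps=1e-6` and the toy numbers are hypothetical, and the calculation is carried out in log space for numerical stability, which the text does not prescribe. The multinomial coefficient in (7) does not depend on $v$, so it cancels once the normalization (8) is applied and is omitted.

```python
import math


def concept_posterior(counts, cond_probs, priors, eps=1e-6):
    """Posterior p(r, v | E*) over the values v of a single concept r.

    counts[l]        : s_l, number of new embeddings in cluster l
    cond_probs[v][l] : p(l | r, v) estimated from the training data
    priors[v]        : p(r, v)
    eps              : small value replacing zero-valued p(l | r, v)
    """
    log_scores = []
    for v, prior in enumerate(priors):
        # log of prod_l p(l | r, v)^{s_l}, with zeros replaced by eps.
        log_lik = sum(s * math.log(p if p > 0 else eps)
                      for s, p in zip(counts, cond_probs[v]))
        log_prior = math.log(prior) if prior > 0 else -math.inf
        log_scores.append(log_lik + log_prior)
    top = max(log_scores)  # assumes at least one value has a positive prior
    weights = [math.exp(x - top) for x in log_scores]
    total = sum(weights)   # plays the role of P(E*) obtained from (8)
    return [w / total for w in weights]


# Toy example (hypothetical numbers): R = 3 clusters, a concept with two values.
counts = [3, 1, 0]
cond_probs = [[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]]
priors = [0.5, 0.5]
posterior = concept_posterior(counts, cond_probs, priors)
```

In the toy example the first value receives almost all of the posterior mass, because three of the four embeddings fall into a cluster that is typical for it.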

In sum, for a new instance, we compute $n_1 + \dots + n_m$ probabilities $p(r,v \mid E_{1:R}^{\ast})$ considering all $v = 1, \dots, n_r$ and $r = 1, \dots, m$.

Let us introduce the threshold $\gamma_r$ for all values of the $r$-th concept. If $p(r,v \mid E_{1:R}^{\ast}) \geq \gamma_r$, then the value $v$ of the concept $c^{(r)}$ is assigned to the instance. After considering all the threshold conditions, we get the concept-based description of the image.
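The thresholding step can be illustrated by a short sketch. The function name `assign_concept_values`, the dictionary layout and the concept names in the toy data are hypothetical and serve only as an example.

```python
def assign_concept_values(posteriors, thresholds):
    """Keep every value v of concept r whose posterior reaches gamma_r.

    posteriors[r][v] : p(r, v | E*) for the new instance
    thresholds[r]    : gamma_r
    """
    return {
        r: [v for v, p in probs.items() if p >= thresholds[r]]
        for r, probs in posteriors.items()
    }


# Toy posteriors for two hypothetical concepts.
posteriors = {
    "color": {"red": 0.7, "blue": 0.3},
    "shape": {"round": 0.45, "square": 0.55},
}
thresholds = {"color": 0.5, "shape": 0.5}
description = assign_concept_values(posteriors, thresholds)
```

Note that, depending on $\gamma_r$, a concept may receive no value or several values; the sketch returns whatever satisfies the condition.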

6 Incorporating expert rules

Let us introduce the logical literal denoted as $[c^{(j)} = i]$, which takes the value $1$ if the concept $c^{(j)}$ has the value $i$, and $0$ otherwise. A set of expert rules can be represented as a logical expression $g(\mathbf{c})$ over literals $[c^{(j)} = i]$ such that $g(\mathbf{c}) = 0$ means that the rule is FALSE and $g(\mathbf{c}) = 1$ means that it is TRUE. One of the forms of rules provided by experts is “IF $F$, THEN $G$”, where $F$ is the antecedent and $G$ is the consequent. This rule is expressed through the logical function as $F \rightarrow G = \lnot F \vee G$, where the symbols $\rightarrow$ and $\lnot$ denote the operations of implication and negation, respectively.
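As a small illustration, a rule of the form “IF $F$, THEN $G$” can be encoded as a function over literals. The helper names `literal`, `implies` and `g_rule`, and the particular rule “IF $c^{(1)} = 2$, THEN $c^{(2)} = 1$”, are hypothetical and are used only for the example.

```python
def literal(c, j, i):
    """Literal [c^(j) = i]: 1 if concept j has the value i, otherwise 0."""
    return int(c[j] == i)


def implies(f, g):
    """Implication F -> G written as (not F) or G, returned as 0 or 1."""
    return int((not f) or bool(g))


def g_rule(c):
    """Hypothetical expert rule: IF c^(1) = 2, THEN c^(2) = 1."""
    return implies(literal(c, 1, 2), literal(c, 2, 1))


# Concept vectors are stored as dictionaries: concept index -> value.
satisfied = g_rule({1: 2, 2: 1})   # antecedent and consequent both hold
violated = g_rule({1: 2, 2: 3})    # antecedent holds, consequent fails
vacuous = g_rule({1: 1, 2: 3})     # antecedent fails, so the rule is TRUE
```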

Expert rules change the prior probabilities of concepts $p(r,v)$ as well as the conditional probabilities $p(l \mid r,v)$. Therefore, the next question we need to answer is how to update these probabilities in order to implement the model inference taking into account the rules.

6.1 Expert rules and prior probabilities of concepts

First, we consider how the prior probabilities of concepts are changed by the expert rules. Let us return to the conditional probability $p(r,v \mid E_{1:R}^{\ast})$ in (6). One of the ways to take into account the expert rules is to assume that they change the prior probabilities of concepts $p(r,v)$, i.e., we have to find the conditional probabilities $P\{C^{(r)} = v \mid g(\mathbf{C}) = 1\}$. Here $\mathbf{C}$ is the random vector taking values from combinations of the concept values. There exist $n_0 \cdots n_m$ different combinations of the concept values, which are formed from the Cartesian product $\mathcal{C} = \mathcal{C}^{(0)} \times \dots \times \mathcal{C}^{(m)}$ defined above. Let $\mathbf{z} \in \mathcal{C}$ be one of the combinations. Each expert rule takes the value TRUE or FALSE ($1$ or $0$) after substituting $\mathbf{z}$ into the logical function $g$. Let us use the Bayes rule again for determining the posterior probability $P\{C^{(r)} = v \mid g(\mathbf{C}) = 1\}$:

$$P\left\{C^{(r)} = v \mid g(\mathbf{C}) = 1\right\} = \frac{P\left\{g(\mathbf{C}) = 1 \mid C^{(r)} = v\right\} \cdot p(r,v)}{P\left\{g(\mathbf{C}) = 1\right\}}. \tag{9}$$

In order to determine the probability $P\{g(\mathbf{C}) = 1\}$, we consider all cases when the random vector takes values $\mathbf{z}$ from the set $\mathcal{C}$. However, when the conditional probability $P\{g(\mathbf{C}) = 1 \mid C^{(r)} = v\}$ is determined, the set $\mathcal{C}$ of all vectors $\mathbf{z}$ is restricted by the condition $C^{(r)} = v$. A new restricted set denoted as $\mathcal{C}_{r,v}$ is formed by replacing $\mathcal{C}^{(r)}$ in $\mathcal{C}$ with the single value $v$, i.e., $\mathcal{C}^{(r)} = \{v\}$ and

$$\mathcal{C}_{r,v} = \mathcal{C}^{(0)} \times \dots \times \{v\} \times \dots \times \mathcal{C}^{(m)}. \tag{10}$$

By using the rule of total probability, we write:

$$P\{g(\mathbf{C}) = 1 \mid C^{(r)} = v\} = \sum_{\mathbf{z} \in \mathcal{C}_{r,v}} P\{g(\mathbf{C}) = 1 \mid \mathbf{C} = \mathbf{z}\} \, P\{\mathbf{C} = \mathbf{z}\}. \tag{11}$$

Since the function $g(\mathbf{C})$ takes values $0$ and $1$, the conditional probability is determined as follows:

$$P\{g(\mathbf{C}) = 1 \mid C^{(r)} = v\} = \sum_{\mathbf{z} \in \mathcal{C}_{r,v}} g(\mathbf{z}) \cdot P\{\mathbf{C} = \mathbf{z}\}. \tag{12}$$

The probability $P\{\mathbf{C} = \mathbf{z}\}$ can be found by considering all images which have the combination $\mathbf{z} \in \mathcal{C}_{r,v}$ of concept values, i.e., the probability is equal to the ratio of the number of images with concepts $\mathbf{z} \in \mathcal{C}_{r,v}$ to the number $N$ of all images in the dataset.

In sum, for every combination of concepts $\mathbf{z} \in \mathcal{C}_{r,v}$, we check whether the concepts satisfy the expert rule $g(\mathbf{z})$ and then compute $P\{\mathbf{C} = \mathbf{z}\}$ for this combination. It is important to point out that a combination $\mathbf{z}$ is not considered if there are no images with the corresponding set of concept values in the dataset, because $P\{\mathbf{C} = \mathbf{z}\} = 0$ in this case.

The unconditional probability $P\{g(\mathbf{C}) = 1\}$ can be obtained from the condition

$$\sum_{v=1}^{n_r} P\left\{C^{(r)} = v \mid g(\mathbf{C}) = 1\right\} = 1, \tag{13}$$

and is equal to

$$\begin{aligned} P\left\{g(\mathbf{C}) = 1\right\} &= \sum_{v=1}^{n_r} P\left\{g(\mathbf{C}) = 1 \mid C^{(r)} = v\right\} \cdot p(r,v) \\ &= \sum_{v=1}^{n_r} \sum_{\mathbf{z} \in \mathcal{C}_{r,v}} g(\mathbf{z}_v^{(r)}) \cdot P\{\mathbf{C} = \mathbf{z}\} \cdot p(r,v). \end{aligned} \tag{14}$$

Hence, we rewrite the expression (9) for computing $P\{C^{(r)} = v \mid g(\mathbf{C}) = 1\}$ as follows:

$$P\left\{C^{(r)} = v \mid g(\mathbf{C}) = 1\right\} = U_v^{(r)} \cdot p(r,v), \tag{15}$$

where

$$U_v^{(r)} = \frac{\sum_{\mathbf{z} \in \mathcal{C}_{r,v}} g(\mathbf{z}) \cdot P\{\mathbf{C} = \mathbf{z}\}}{P\left\{g(\mathbf{C}) = 1\right\}}. \tag{16}$$

The prior probability $p(r,v)$ of the concept is updated in accordance with the rule $g$ by multiplying it by the updating coefficient $U_v^{(r)}$.

Finally, we have obtained the updated probabilities $p(r,v)$, which are used in (6) for computing the posterior marginal probabilities of concepts $P\{C^{(r)} = v \mid E_{1:R}^{\ast}\}$.
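The update of the prior probabilities can be sketched as follows. The sketch follows (12), (14) and (15) literally as they are written above; the function name `updated_priors`, the representation of images as tuples of concept values, the 0-based concept positions and the toy data are assumptions made only for illustration. Only combinations observed in the dataset are enumerated, since unobserved ones have $P\{\mathbf{C} = \mathbf{z}\} = 0$.

```python
from collections import Counter


def updated_priors(concept_vectors, r, g):
    """Return P{C^(r) = v | g(C) = 1} for every observed value v of concept r.

    concept_vectors : one tuple of concept values per training image
    r               : position of the concept inside the tuple (0-based)
    g               : rule, maps a tuple z to 0 or 1
    """
    n = len(concept_vectors)
    # P{C = z}: share of images carrying the combination z.
    joint = {z: k / n for z, k in Counter(concept_vectors).items()}
    values = sorted({z[r] for z in joint})
    # p(r, v): share of images with the value v of concept r.
    prior = {v: sum(p for z, p in joint.items() if z[r] == v) for v in values}
    # Sum over z in C_{r,v} of g(z) * P{C = z}, as in (12).
    rule_mass = {v: sum(g(z) * p for z, p in joint.items() if z[r] == v)
                 for v in values}
    unnormalized = {v: rule_mass[v] * prior[v] for v in values}
    total = sum(unnormalized.values())  # P{g(C) = 1}, as in (14)
    if total == 0:
        raise ValueError("no observed combination satisfies the rule")
    return {v: unnormalized[v] / total for v in values}


# Toy dataset of four images with two concepts (hypothetical values).
images = [(1, 1), (1, 2), (2, 1), (2, 1)]


def rule(z):
    """Hypothetical rule: IF first concept = 1, THEN second concept = 1."""
    return int(z[0] != 1 or z[1] == 1)


new_priors = updated_priors(images, 0, rule)
```

In the toy data the original prior of the first concept is $0.5$ for each value; after the update, the value $1$ loses mass because one of the two images carrying it violates the rule.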

6.2 Expert rules and conditional probabilities

In addition to the changes of prior probabilities, it is necessary to consider how the conditional probabilities $P\{e \in K_l \mid C^{(r)} = v\}$ are updated due to expert rules. Let us denote the conditional event $e \in K_l \mid C^{(r)} = v$ as $E_l^{(r)} = v$. Then we have to find the probability

$$P\left\{E_l^{(r)} = v \mid g(\mathbf{C}) = 1\right\} = \frac{P\left\{g(\mathbf{C}) = 1 \mid E_l^{(r)} = v\right\} \cdot P\left\{E_l^{(r)} = v\right\}}{P\left\{g(\mathbf{C}) = 1\right\}}. \tag{17}$$

Note that the probability $P\{E_l^{(r)} = v\}$ is nothing but the conditional probability $p(l \mid r,v)$ defined above in Sec. 4, i.e., $P\{E_l^{(r)} = v\} = s_v^{(r)}(l) / s_v^{(r)}$. Let $\mathcal{C}_{r,v}(l) \subseteq \mathcal{C}_{r,v}$ be the subset of $\mathcal{C}_{r,v}$ which contains the concept combinations of embeddings from the cluster $K_l$. Then the conditional probability $P\{g(\mathbf{C}) = 1 \mid E_l^{(r)} = v\}$ is determined as

$$P\{g(\mathbf{C}) = 1 \mid E_l^{(r)} = v\} = \sum_{\mathbf{z} \in \mathcal{C}_{r,v}(l)} P\{g(\mathbf{C}) = 1 \mid \mathbf{C} = \mathbf{z}\} \, P\{\mathbf{C} = \mathbf{z}\}. \tag{18}$$

It can be seen from (18) that $P\{g(\mathbf{C}) = 1 \mid E_l^{(r)} = v\}$ differs from $P\{g(\mathbf{C}) = 1 \mid C^{(r)} = v\}$ in (11) in that the set $\mathcal{C}_{r,v}(l)$ is limited to considering only the cluster $K_l$. Hence, there holds

$$P\{g(\mathbf{C}) = 1 \mid E_l^{(r)} = v\} = \sum_{\mathbf{z} \in \mathcal{C}_{r,v}(l)} g(\mathbf{z}) \cdot P\{\mathbf{C} = \mathbf{z}\}. \tag{19}$$

We rewrite the expression (17) for computing $P\{E_l^{(r)} = v \mid g(\mathbf{C}) = 1\}$ as follows:

$$\begin{aligned} P\left\{E_l^{(r)} = v \mid g(\mathbf{C}) = 1\right\} &= V_v^{(r)}(l) \cdot P\left\{E_l^{(r)} = v\right\} \\ &= V_v^{(r)}(l) \cdot p(l \mid r,v) \\ &= V_v^{(r)}(l) \cdot s_v^{(r)}(l) / s_v^{(r)}, \end{aligned} \tag{20}$$

where

$$V_v^{(r)}(l) = \frac{\sum_{\mathbf{z} \in \mathcal{C}_{r,v}(l)} g(\mathbf{z}) \cdot P\{\mathbf{C} = \mathbf{z}\}}{P\left\{g(\mathbf{C}) = 1\right\}}. \tag{21}$$

Note that there holds

$$\sum_{l=1}^{R} P\left\{E_l^{(r)} = v \mid g(\mathbf{C}) = 1\right\} = 1,$$

because the event $E_l^{(r)} = v$ means that the embedding $e$ falls into one of the clusters. Hence, we can write

$$\sum_{l=1}^{R} V_v^{(r)}(l) \cdot p(l \mid r,v) = 1.$$

The conditional probability $p(l \mid r,v) = P\{E_l^{(r)} = v\}$ is updated in accordance with the rule $g$ by multiplying it by the updating coefficient $V_v^{(r)}(l)$.

After the probabilities are updated, they are substituted into (6) and (7) in order to obtain the probabilities of concepts for a new instance.
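The update of the conditional probabilities can be sketched in the same spirit. Several details are assumptions not fixed by the text: $\mathcal{C}_{r,v}(l)$ is taken to be the set of concept combinations of images that have at least one patch embedding in $K_l$, the updated values are renormalized over the clusters so that they sum to one (under this normalization the constant $P\{g(\mathbf{C}) = 1\}$ cancels and is not computed), clusters and concept positions are 0-based, and the name `updated_cluster_probs` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def updated_cluster_probs(image_concepts, patches, r, v, g, num_clusters):
    """Return the rule-updated p(l | r, v) for l = 0, ..., num_clusters - 1.

    image_concepts : one tuple of concept values per training image
    patches        : (image_index, cluster_index) for every training patch
    r, v           : concept position (0-based) and its value
    g              : rule, maps a tuple z to 0 or 1
    """
    n = len(image_concepts)
    joint = {z: k / n for z, k in Counter(image_concepts).items()}  # P{C = z}
    counts = [0] * num_clusters   # s_v^(r)(l)
    combos = defaultdict(set)     # C_{r,v}(l), under the assumption above
    for img, l in patches:
        z = image_concepts[img]
        if z[r] == v:
            counts[l] += 1
            combos[l].add(z)
    total = sum(counts)           # s_v^(r)
    if total == 0:
        raise ValueError("no patches from images with the requested value")
    weights = []
    for l in range(num_clusters):
        p_l = counts[l] / total                           # p(l | r, v)
        mass = sum(g(z) * joint[z] for z in combos[l])    # numerator of (21)
        weights.append(mass * p_l)
    norm = sum(weights)
    if norm == 0:
        raise ValueError("the rule removes all probability mass")
    return [w / norm for w in weights]


# Toy data: three images with three patches each, R = 3 clusters.
image_concepts = [(1, 1), (1, 2), (2, 1)]
patches = [(0, 0), (0, 0), (0, 1),
           (1, 1), (1, 1), (1, 2),
           (2, 2), (2, 2), (2, 2)]


def rule(z):
    """Hypothetical rule: IF first concept = 1, THEN second concept = 1."""
    return int(z[0] != 1 or z[1] == 1)


probs = updated_cluster_probs(image_concepts, patches, 0, 1, rule, 3)
```

In the toy data the last cluster contains, among the images with the requested value, only patches of an image that violates the rule, so its updated probability becomes zero.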

6.3 Expert rules with uncertainty

So far, we have considered hard rules of which the expert is certain, i.e., rules taking the value TRUE with probability one. Let us study the case when rules are of the form “IF $F$, THEN $G$ with probability $\pi$”. There are different interpretations of the probability of the implication operation [55]. We apply the interpretation which considers the probability $P(\lnot F \vee G)$ that either $G$ is true or $F$ is false.

A simple way to adapt the uncertain rules to the proposed scheme is to replace the value $g(\mathbf{z})$ in (16) with the probabilities $\pi$ and $1 - \pi$. Let us rewrite (12) taking into account the probability $\pi(\mathbf{z}) = P\{g(\mathbf{z}) = 1\}$ that the rule is TRUE for $\mathbf{z}$, i.e., is satisfied for the combination $\mathbf{z}$ of concepts, as follows:

P{g(𝐂)=1C(r)=v}=𝐳𝒞r,vπ(𝐳)P{𝐂=𝐳}.𝑃conditional-set𝑔𝐂1superscript𝐶𝑟𝑣subscript𝐳subscript𝒞𝑟𝑣𝜋𝐳𝑃𝐂𝐳P\{g(\mathbf{C})=1\mid C^{(r)}=v\}=\sum_{\mathbf{z}\in\mathcal{C}_{r,v}}\pi(% \mathbf{z})\cdot P\{\mathbf{C}=\mathbf{z}\}.italic_P { italic_g ( bold_C ) = 1 ∣ italic_C start_POSTSUPERSCRIPT ( italic_r ) end_POSTSUPERSCRIPT = italic_v } = ∑ start_POSTSUBSCRIPT bold_z ∈ caligraphic_C start_POSTSUBSCRIPT italic_r , italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_π ( bold_z ) ⋅ italic_P { bold_C = bold_z } . (22)

It can be seen from the above that only the updating coefficient Uv(r)superscriptsubscript𝑈𝑣𝑟U_{v}^{(r)}italic_U start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_r ) end_POSTSUPERSCRIPT in (16) is changed when the probability of the rule is added.

In the same way, the conditional probabilities P{El(r)=vg(𝐂)=1}𝑃conditional-setsuperscriptsubscript𝐸𝑙𝑟𝑣𝑔𝐂1P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}italic_P { italic_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_r ) end_POSTSUPERSCRIPT = italic_v ∣ italic_g ( bold_C ) = 1 } in (20) can be updated. In the case, we change only the updating coefficient Vv(r)(l)superscriptsubscript𝑉𝑣𝑟𝑙V_{v}^{(r)}(l)italic_V start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_r ) end_POSTSUPERSCRIPT ( italic_l ) as follows:

Vv(r)(l)=𝐳𝒞r,v(l)π(𝐳)P{𝐂=𝐳}P{g(𝐂)=1}.superscriptsubscript𝑉𝑣𝑟𝑙subscript𝐳subscript𝒞𝑟𝑣𝑙𝜋𝐳𝑃𝐂𝐳𝑃𝑔𝐂1V_{v}^{(r)}(l)=\frac{\sum_{\mathbf{z}\in\mathcal{C}_{r,v}(l)}\pi(\mathbf{z})% \cdot P\{\mathbf{C}=\mathbf{z}\}}{P\left\{g(\mathbf{C})=1\right\}}.italic_V start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_r ) end_POSTSUPERSCRIPT ( italic_l ) = divide start_ARG ∑ start_POSTSUBSCRIPT bold_z ∈ caligraphic_C start_POSTSUBSCRIPT italic_r , italic_v end_POSTSUBSCRIPT ( italic_l ) end_POSTSUBSCRIPT italic_π ( bold_z ) ⋅ italic_P { bold_C = bold_z } end_ARG start_ARG italic_P { italic_g ( bold_C ) = 1 } end_ARG . (23)
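The weighted sums in (22) and (23) can be illustrated by a minimal Python sketch. The joint distribution, the confidence value `PI`, and the function names below are hypothetical and chosen only for illustration. The sketch also assumes one possible reading of the replacement of $g(\mathbf{z})$: a combination that satisfies the rule receives the weight $\pi$, and a combination that violates it receives $1-\pi$.

```python
def soft_rule_mass(joint, pi, r, v):
    """Sum of pi(z) * P{C = z} over combinations z with z[r] == v, as in (22)."""
    return sum(pi(z) * p for z, p in joint.items() if z[r] == v)

# Hypothetical joint distribution P{C = z} over two concepts z = (c0, c1).
joint = {(1, 1): 0.4, (1, 2): 0.2, (2, 1): 0.3, (2, 2): 0.1}

PI = 0.9  # hypothetical confidence in the rule "IF c1 = 2, THEN c0 = 1"

def pi(z):
    # Assumed reading: weight PI if the rule holds for z, 1 - PI otherwise.
    satisfied = (z[1] != 2) or (z[0] == 1)
    return PI if satisfied else 1.0 - PI

mass_c0_1 = soft_rule_mass(joint, pi, r=0, v=1)  # 0.9*0.4 + 0.9*0.2
mass_c0_2 = soft_rule_mass(joint, pi, r=0, v=2)  # 0.9*0.3 + 0.1*0.1
```

With `PI = 1.0` the weights reduce to the indicator $g(\mathbf{z})$, so the hard-rule case is recovered.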

It should be noted that the representation of uncertainty in expert rules is an important special question which requires a separate detailed investigation. Therefore, it is not considered further in this work and will be studied in the future.

7 Illustrative example

7.1 Concepts

7.1.1 Training

To illustrate the calculations in FI-CBL, we consider an example with the concepts Contour $c^{(1)}$ and Inclusion $c^{(2)}$ of the lung cancer nodule description given in Section 4. The vector of all concept values is

$$\mathcal{C}=(\underbrace{1,2}_{\mathcal{C}^{(0)}},\underbrace{1,2,3}_{\mathcal{C}^{(1)}},\underbrace{1,2}_{\mathcal{C}^{(2)}}). \tag{24}$$

Table 1: Concept values for the illustrative example with the lung nodules
| Diagnosis | Contour | Inclusion | # images |
|---|---|---|---|
| benign | smooth | homogeneous | 2 |
| benign | grainy | homogeneous | 2 |
| malignant | spicules | homogeneous | 4 |
| malignant | grainy | necrosis | 2 |

Suppose we have 10 images whose description is shown in Table 1, where the first three columns contain values of the concepts and the fourth column indicates the number of images having the corresponding concept values. Each image is divided into 4 patches. The numbers of images with values <malignant> and <benign> are 6 and 4, respectively. Suppose that embeddings corresponding to patches are clustered into $R=3$ clusters such that:

  • the first cluster contains embeddings of the background patches;

  • the second cluster contains embeddings of nodules with spicules or with the grainy contour and with necrosis;

  • the third cluster contains embeddings of nodules which are smooth or grainy and homogeneous.

An example of the corresponding images and patches is schematically depicted in Fig. 2. Prior probabilities $p(r,v)$ for all concepts are shown in Table 2.

Table 2: Prior probabilities $p(r,v)$ for all concepts
| | $C^{(0)}$ | $C^{(0)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(2)}$ | $C^{(2)}$ |
|---|---|---|---|---|---|---|---|
| $v$ | 1 | 2 | 1 | 2 | 3 | 1 | 2 |
| $p(r,v)$ | 6/10 | 4/10 | 2/10 | 4/10 | 4/10 | 8/10 | 2/10 |

Unconditional probabilities $p(l)$ are

$$p(1)=28/40,\quad p(2)=7/40,\quad p(3)=5/40. \tag{25}$$

Table 3 shows conditional probabilities $p(l\mid r,v)$. Table 4 contains posterior probabilities $p(r,v\mid l)$ computed by using (3). It follows from Table 4 that the second cluster contains patches which show that the corresponding images are malignant with probability 1. Moreover, the third cluster contains only patches with homogeneous nodules.

Table 3: Conditional probabilities $p(l\mid r,v)$
| | $C^{(0)}$ | $C^{(0)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(2)}$ | $C^{(2)}$ |
|---|---|---|---|---|---|---|---|
| $v$ | 1 | 2 | 1 | 2 | 3 | 1 | 2 |
| $K_1$ | 16/24 | 3/4 | 3/4 | 6/8 | 10/16 | 22/32 | 3/4 |
| $K_2$ | 7/24 | 0 | 0 | 1/8 | 5/16 | 5/32 | 1/4 |
| $K_3$ | 1/24 | 1/4 | 1/4 | 1/8 | 1/16 | 5/32 | 0 |

Table 4: Posterior probabilities $p(r,v\mid l)$
| | $C^{(0)}$ | $C^{(0)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(2)}$ | $C^{(2)}$ |
|---|---|---|---|---|---|---|---|
| $v$ | 1 | 2 | 1 | 2 | 3 | 1 | 2 |
| $K_1$ | 4/7 | 3/7 | 3/14 | 6/14 | 5/14 | 11/14 | 3/14 |
| $K_2$ | 1 | 0 | 0 | 2/7 | 5/7 | 5/7 | 2/7 |
| $K_3$ | 1/5 | 4/5 | 2/5 | 2/5 | 1/5 | 1 | 0 |
Figure 2: An illustration of the concept-based description of images consisting of four patches and three clusters containing different patches
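The entries of Table 4 can be recomputed from Tables 2 and 3 and the cluster probabilities in (25). The following minimal Python sketch assumes that (3) is the usual Bayes rule $p(r,v\mid l)=p(l\mid r,v)\,p(r,v)/p(l)$; the dictionary and function names are illustrative and are not part of the method.

```python
from fractions import Fraction as F

# Unconditional cluster probabilities p(l) from (25).
p_l = {1: F(28, 40), 2: F(7, 40), 3: F(5, 40)}

# Prior probabilities p(r, v) from Table 2, keyed by (r, v).
prior = {(0, 1): F(6, 10), (0, 2): F(4, 10),
         (1, 1): F(2, 10), (1, 2): F(4, 10), (1, 3): F(4, 10),
         (2, 1): F(8, 10), (2, 2): F(2, 10)}

# Conditional probabilities p(l | r, v) from Table 3: (r, v) -> [K1, K2, K3].
cond = {(0, 1): [F(16, 24), F(7, 24), F(1, 24)],
        (0, 2): [F(3, 4), F(0), F(1, 4)],
        (1, 1): [F(3, 4), F(0), F(1, 4)],
        (1, 2): [F(6, 8), F(1, 8), F(1, 8)],
        (1, 3): [F(10, 16), F(5, 16), F(1, 16)],
        (2, 1): [F(22, 32), F(5, 32), F(5, 32)],
        (2, 2): [F(3, 4), F(1, 4), F(0)]}

def posterior(r, v, l):
    """Bayes rule (assumed form of (3)): p(r, v | l)."""
    return cond[(r, v)][l - 1] * prior[(r, v)] / p_l[l]

# One row per cluster, columns in the order of Table 4.
table4 = {l: [posterior(r, v, l) for (r, v) in prior] for l in p_l}
```

Exact fractions are used so that the rows of `table4` can be compared with Table 4 without rounding.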

7.1.2 Inference

Suppose that two embeddings of a new instance fall into the first cluster and two embeddings fall into the third cluster. This implies that $(s_{1},s_{2},s_{3})=(2,0,2)$. Probabilities $p(l\mid r,v)$ are taken from Table 3. Hence, there holds for $C^{(r)}=v$:

$$p(r,v\mid E_{1:R}^{\ast})=\frac{p(r,v)}{P\{E_{1:R}^{\ast}\}}\cdot\frac{4!}{2!\cdot 0!\cdot 2!}\,p^{2}(1\mid r,v)\cdot p^{0}(2\mid r,v)\cdot p^{2}(3\mid r,v). \tag{26}$$

Probabilities $P\{E_{1:R}^{\ast}\}$ are computed from (8) for every $r=0,1,2$. They are 0.087, 0.067, 0.056. Final results are shown in Table 5. It can be seen from Table 5 that the tested image is <benign> ($c^{(0)}=2$ with probability 0.968) with homogeneous nodules ($c^{(2)}=1$ with probability 0.999). The decision about the concept Contour depends on the threshold $\gamma_{2}$. If $\gamma_{2}\leq 0.630$, then we can say that nodules are smooth. If $\gamma_{2}>0.630$, then this concept is not available or too uncertain to be included into the image description. At the same time, the small probability of spicules implies that the Contour is smooth or grainy with probability 0.945.

Table 5: Posterior probabilities $p(r,v\mid E_{1:R}^{\ast})$
| | $C^{(0)}$ | $C^{(0)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(1)}$ | $C^{(2)}$ | $C^{(2)}$ |
|---|---|---|---|---|---|---|---|
| $v$ | 1 | 2 | 1 | 2 | 3 | 1 | 2 |
| $p(r,v\mid E_{1:R}^{\ast})$ | 0.032 | 0.968 | 0.630 | 0.315 | 0.055 | 0.999 | 0.001 |
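The inference step can be traced numerically. The minimal Python sketch below evaluates the multinomial expression in (26) for the counts $(2,0,2)$ and normalizes over the values of each concept; it assumes that the normalizing constant from (8) is the sum of the unnormalized terms over $v$. The names are illustrative, and only the concepts Diagnosis and Contour are included.

```python
from math import factorial

# Priors p(r, v) (Table 2) and conditionals p(l | r, v) (Table 3) for r = 0, 1.
prior = {(0, 1): 0.6, (0, 2): 0.4, (1, 1): 0.2, (1, 2): 0.4, (1, 3): 0.4}
cond = {(0, 1): (16/24, 7/24, 1/24), (0, 2): (3/4, 0.0, 1/4),
        (1, 1): (3/4, 0.0, 1/4), (1, 2): (6/8, 1/8, 1/8),
        (1, 3): (10/16, 5/16, 1/16)}
counts = (2, 0, 2)  # (s1, s2, s3): patches of the new image per cluster

def multinomial_likelihood(counts, probs):
    """Multinomial probability of the cluster counts given p(l | r, v)."""
    coef = factorial(sum(counts))
    for s in counts:
        coef //= factorial(s)
    like = float(coef)
    for s, p in zip(counts, probs):
        like *= p ** s  # 0.0 ** 0 evaluates to 1.0 in Python
    return like

def concept_posterior(r, values):
    """Return the evidence (assumed form of (8)) and the posteriors over v."""
    joint = {v: prior[(r, v)] * multinomial_likelihood(counts, cond[(r, v)])
             for v in values}
    evidence = sum(joint.values())
    return evidence, {v: j / evidence for v, j in joint.items()}

evidence0, post0 = concept_posterior(0, (1, 2))     # Diagnosis
evidence1, post1 = concept_posterior(1, (1, 2, 3))  # Contour
```

Rounded to three digits, `post0` and `post1` agree with the first five columns of Table 5. The sketch applies no smoothing, so running it on Inclusion with the zero entry of Table 3 would return exactly 1 and 0 rather than the 0.999 and 0.001 reported in Table 5.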

7.2 Expert rules

7.2.1 Prior probabilities

Let us consider the illustrative example under the condition that the expert rule “IF Contour is <grainy>, THEN Diagnosis is <malignant>” is used. The rule can also be written as “IF $c^{(1)}=2$, THEN $c^{(0)}=1$”. The logical function $g(\mathbf{c})$ corresponding to the rule is of the form:

$$\begin{aligned} g(\mathbf{c}) &= [c^{(1)}=2]\rightarrow[c^{(0)}=1] \\ &= \lnot\left([c^{(1)}=2]\right)\vee[c^{(0)}=1] \\ &= [c^{(0)}=1]\vee[c^{(1)}=1]\vee[c^{(1)}=3]. \end{aligned} \tag{27}$$

The above rule is TRUE if at least one of the logical literals is TRUE.

Let us return to the above example. We have the following combinations $\mathbf{z}$ of the concept labels (see Table 1): benign-smooth-homogeneous, benign-grainy-homogeneous, malignant-spicules-homogeneous, malignant-grainy-necrosis. Let us find how the prior probability $p(0,1)=P\{C^{(0)}=1\}$ changes when the expert rule is available. The rule “IF Contour is <grainy>, THEN Diagnosis is <malignant>” is FALSE when $[c^{(0)}=2]\wedge[c^{(1)}=2]$, i.e., when the concept values are <grainy> and <benign>. There are two images simultaneously having the concepts <grainy> and <benign>. This implies that $g(\mathbf{z})=0$ for images with concepts <benign-grainy-homogeneous> (see Fig. 2). Hence, we can write $\mathcal{C}_{0,1}=\{(1,3,1),(1,2,2)\}$, $\mathcal{C}=\{(1,3,1),(1,2,2),(2,1,1)\}$. Here combinations with $g(\mathbf{z})=0$ are omitted for brevity. Probabilities $P\{\mathbf{C}=\mathbf{z}\}$ are shown in Table 6. Finally, we can write

$$U_{1}^{(0)}=\frac{P\{\mathbf{C}=(1,3,1)\}+P\{\mathbf{C}=(1,2,2)\}}{P\{g(\mathbf{C})=1\}}, \tag{28}$$

where

$$\begin{aligned} P\{g(\mathbf{C})=1\} &= P\{\mathbf{C}=(1,3,1)\}\cdot p(0,1) \\ &\quad + P\{\mathbf{C}=(1,2,2)\}\cdot p(0,1) \\ &\quad + P\{\mathbf{C}=(2,1,1)\}\cdot p(0,2), \end{aligned} \tag{29}$$

and

$$P\{C^{(0)}=1\mid g(\mathbf{C})=1\}=0.818. \tag{30}$$

Let us find the prior probability $P\{C^{(0)}=2\}$. In this case, there holds $\mathcal{C}_{0,2}=\{(2,1,1)\}$. Hence, we obtain

$$U_{2}^{(0)}=\frac{P\{\mathbf{C}=(2,1,1)\}}{P\{g(\mathbf{C})=1\}}, \tag{31}$$

and

$$P\{C^{(0)}=2\mid g(\mathbf{C})=1\}=0.182. \tag{32}$$

It is interesting to point out that the expert rule increases the prior probability of <malignant> and decreases the probability of <benign>. Indeed, the two images with the concept values <benign> and <grainy> do not satisfy the expert rule. Therefore, these images can be regarded as inadmissible for analyzing probabilities of the concept $c^{(0)}$. However, it does not mean that they cannot be used for finding probabilities of other concepts.

Table 6: Prior probabilities $P\{\mathbf{C}=\mathbf{z}\}$
| $\mathbf{z}$ | $(1,3,1)$ | $(1,2,2)$ | $(2,1,1)$ |
|---|---|---|---|
| $P\{\mathbf{C}=\mathbf{z}\}$ | 4/10 | 2/10 | 2/10 |
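The numbers in (30) and (32) can be reproduced with a short Python sketch. It takes $P\{\mathbf{C}=\mathbf{z}\}$ from Table 6 and $p(0,v)$ from Table 2, evaluates $P\{g(\mathbf{C})=1\}$ exactly as written in (29), and assumes that the updated prior is the product of the coefficient $U_{v}^{(0)}$ and $p(0,v)$. The variable and function names are illustrative.

```python
# P{C = z} for the combinations listed in Table 6; z = (c0, c1, c2).
p_z = {(1, 3, 1): 0.4, (1, 2, 2): 0.2, (2, 1, 1): 0.2}
prior_c0 = {1: 0.6, 2: 0.4}  # p(0, v) from Table 2

def g(z):
    """Rule (27): IF c1 = 2 (grainy), THEN c0 = 1 (malignant)."""
    c0, c1, _ = z
    return c1 != 2 or c0 == 1

# P{g(C) = 1} as written in (29).
p_rule = sum(p * prior_c0[z[0]] for z, p in p_z.items() if g(z))

def updated_prior(v):
    """U_v^(0) from (28) or (31), multiplied by p(0, v) (assumed update)."""
    u = sum(p for z, p in p_z.items() if g(z) and z[0] == v) / p_rule
    return u * prior_c0[v]

p_malignant = updated_prior(1)
p_benign = updated_prior(2)
```

The two updated priors sum to one, and the combination benign-grainy-homogeneous, encoded as `(2, 2, 1)`, is rejected by `g`, which is why it does not appear in Table 6.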

7.2.2 Updating posterior probabilities

Let us again return to the above example and find the conditional probabilities $P\{E_{l}^{(0)}=v\mid g(\mathbf{C})=1\}$, $v=1,2$, for every cluster.

Cluster 1: The subset $\mathcal{C}_{0,1}(1)$ consists of the combinations $(1,3,1)$, $(1,2,2)$ because embeddings corresponding to all concepts are included in $K_{1}$. The subset $\mathcal{C}_{0,2}(1)$ consists of the combination $(2,1,1)$.

Cluster 2: The subset $\mathcal{C}_{0,1}(2)$ consists of the combinations $(1,3,1)$, $(1,2,2)$ because embeddings with the corresponding concept values are included in $K_{2}$. The subset $\mathcal{C}_{0,2}(2)$ is empty.

Cluster 3: The subset $\mathcal{C}_{0,1}(3)$ consists of the combination $(1,3,1)$ because one embedding from an image with the concept values $(1,3,1)$ falls into $K_{3}$. The subset $\mathcal{C}_{0,2}(3)$ consists of the combination $(2,1,1)$.

Finally, we can write using Tables 3 and 6:

$$V_{1}^{(0)}(1)=\frac{P\{\mathbf{C}=(1,3,1)\}+P\{\mathbf{C}=(1,2,2)\}}{P\{g(\mathbf{C})=1\}}=\frac{0.6}{P\{g(\mathbf{C})=1\}}, \tag{33}$$

$$V_{1}^{(0)}(2)=V_{1}^{(0)}(1)=\frac{0.6}{P\{g(\mathbf{C})=1\}}, \tag{34}$$

$$V_{1}^{(0)}(3)=\frac{P\{\mathbf{C}=(1,3,1)\}}{P\{g(\mathbf{C})=1\}}=\frac{0.4}{P\{g(\mathbf{C})=1\}}, \tag{35}$$

where

$$P\{g(\mathbf{C})=1\}=0.6\cdot 16/24+0.6\cdot 7/24+0.4\cdot 1/24=0.592. \tag{36}$$

Hence, we obtain $V_{1}^{(0)}(1)=V_{1}^{(0)}(2)=1.014$, $V_{1}^{(0)}(3)=0.676$. Similarly, we find

$$V_{2}^{(0)}(1)=V_{2}^{(0)}(3)=\frac{P\{\mathbf{C}=(1,3,1)\}}{P\{g(\mathbf{C})=1\}}=\frac{0.4}{P\{g(\mathbf{C})=1\}},\quad V_{2}^{(0)}(2)=0, \tag{37}$$

where

$$P\{g(\mathbf{C})=1\}=0.4\cdot 3/4+0.0\cdot 0+0.4\cdot 1/4=0.4. \tag{38}$$

Hence, we obtain $V_{2}^{(0)}(1)=V_{2}^{(0)}(3)=1$, $V_{2}^{(0)}(2)=0$. Finally, we find the conditional probabilities $P\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\}$ from (20), whose values are given in Table 7.

Table 7: Conditional probabilities $P\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\}$
| | $c^{(0)}$ | $c^{(0)}$ |
|---|---|---|
| $v$ | 1 | 2 |
| $K_1$ | 0.676 | 3/4 |
| $K_2$ | 0.296 | 0 |
| $K_3$ | 0.028 | 1/4 |

If we compare the conditional probabilities of the concept $c^{(0)}$ obtained without rules (see Table 3) and with rules (see Table 7), then we can see that the probabilities for $c^{(0)}=1$ have changed, whereas the probabilities for $c^{(0)}=2$ have not.
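The first column of Table 7 can be recomputed from (33)-(36) with a few lines of Python. The sketch assumes, as stated in Section 6, that each updated conditional probability is the product of the coefficient $V_{1}^{(0)}(l)$ and $p(l\mid 0,1)$ from Table 3; the variable names are illustrative.

```python
# p(l | r=0, v=1) for clusters K1, K2, K3 (first column of Table 3).
cond_c0_1 = [16 / 24, 7 / 24, 1 / 24]

# Numerators of (33)-(35): total P{C = z} over the subsets C_{0,1}(l).
numerators = [0.6, 0.6, 0.4]

# Normalizing constant from (36).
p_rule = sum(n * p for n, p in zip(numerators, cond_c0_1))

# Updating coefficients V_1^(0)(l) and updated conditional probabilities.
V = [n / p_rule for n in numerators]
updated = [v * p for v, p in zip(V, cond_c0_1)]
```

By construction the updated probabilities sum to one, which is the normalization condition $\sum_{l} V_{v}^{(r)}(l)\cdot p(l\mid r,v)=1$.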

8 Numerical experiments

To study the proposed model, a synthetic dataset is constructed from the well-known MNIST dataset [56], which consists of $28\times 28$ pixel handwritten digit images. The original MNIST dataset has a training set of 60,000 instances and a test set of 10,000 instances. The dataset is available at http://yann.lecun.com/exdb/mnist/.

Figure 3: Examples of the modified MNIST dataset

Each instance in the synthetic dataset consists of four different digits randomly taken from MNIST such that the instance has two digits in the first row and two digits in the second row, as shown in Fig. 3. Each instance has the size $56\times 56$. A similar dataset is used in [33]. Concepts are formed as follows:

  • The concept $c^{(0)}$ (target) is defined as the largest digit among the four digits in each instance. This corresponds to a classification task with seven classes (digits from 3 to 9) because the four digits in an instance are different.

  • Concepts $c^{(1)},\ldots,c^{(10)}$ are binary and defined by the presence of the corresponding digit ($1,\ldots,9,0$) in the instance.
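The annotation scheme above can be sketched in Python as follows. The function name `annotate` and the constant `CONCEPT_DIGITS` are hypothetical; the sketch only labels a tuple of four distinct digit values and does not load or compose MNIST images.

```python
import random

# Digit whose presence is checked by concepts c1, ..., c10 (order 1, ..., 9, 0).
CONCEPT_DIGITS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0)

def annotate(digits):
    """Return (target, concepts) for a tuple of four distinct digits."""
    if len(set(digits)) != 4:
        raise ValueError("an instance must contain four different digits")
    target = max(digits)  # concept c0: the largest digit in the instance
    concepts = [int(d in digits) for d in CONCEPT_DIGITS]
    return target, concepts

rng = random.Random(0)                    # fixed seed for repeatability
digits = tuple(rng.sample(range(10), 4))  # four different digit classes
target, concepts = annotate(digits)
```

Since the four digits are different, the smallest possible target is 3 (for the digits 0, 1, 2, 3), which gives the seven classes mentioned above.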

The proposed model is compared with the original CBM [6]. Preliminary numerical experiments have shown that the original CBM performs better when the number of instances is large. Therefore, we compare the proposed model with CBM by training the models on small numbers of instances (from 500 to 5000). The number of testing images is 20,000.

Details of the proposed model are the following:

  • Images are divided into 4 patches such that each patch contains one digit.

  • The autoencoder is constructed by using a convolutional network; the embedding size is 16.

  • The expectation–maximization (EM) algorithm with 80 clusters is used for clustering.

  • The model is trained for 30 to 50 epochs depending on the number of instances taken for training.
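The first step of the list above, dividing a $56\times 56$ instance into four patches with one digit each, can be sketched in plain Python. The function name is hypothetical, the image is represented as a list of rows, and the autoencoder and clustering steps are not shown.

```python
def split_into_quadrants(image, patch=28):
    """Split a square image (list of rows) into non-overlapping patches.

    Patches are returned in row-major order: top-left, top-right,
    bottom-left, bottom-right for a 56x56 image with patch=28.
    """
    rows, cols = len(image), len(image[0])
    patches = []
    for top in range(0, rows, patch):
        for left in range(0, cols, patch):
            patches.append([row[left:left + patch]
                            for row in image[top:top + patch]])
    return patches

# Dummy 56x56 image whose pixel value encodes its own position.
image = [[r * 56 + c for c in range(56)] for r in range(56)]
patches = split_into_quadrants(image)
```

Each of the four returned patches would then be passed to the encoder to obtain a 16-dimensional embedding.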

CBM is also constructed on the convolution network which transforms the image to an embedding. For each concept, a fully connected two-layer network predicts the concept value. The cross-entropy loss function is used. A modification of CBM with the joint training of the bottleneck and targets is used such that the joint bottleneck minimizes the weighted sum of loss functions with coefficients 1. The F1 measure is used as an accuracy measure in experiments because the training set is imbalanced due to the considered structure of concepts and images.

F1 measures as functions of the training set size for all concepts of the modified MNIST dataset obtained by using the proposed method and CBM are shown in Fig. 4. It can be seen from Fig. 4 that FI-CBL outperforms the original CBM when the training set size is small (smaller than 5000 instances). This can be explained by the fact that the neural network implementing CBM requires a significantly larger number of instances for training. On the contrary, FI-CBL allows us to obtain acceptable results with a small number of training instances. At the same time, CBM becomes better as the training set increases. In this case, FI-CBL requires finer tuning of the number of clusters and the size of embeddings.

Figure 4: The F1 measures of the proposed method and CBM as functions of the training set size for all concepts of the modified MNIST dataset

To study how an expert rule affects the prediction accuracy, we invert some correct labels that satisfy the expert rule. The rule is “IF $c^{(9)}=1$, THEN $c^{(0)}=1$”. This is an obvious rule which means that if the digit 9 is among the four digits in an instance, then the target is 9. Let $\beta$ be the portion of instances in the training set whose labels are inverted. Fig. 5 illustrates how F1 measures depend on values of $\beta$ for two cases: before using the rule (the curve with triangle markers) and after using the rule (the curve with circle markers). It can be seen from Fig. 5 that the use of the expert rule allows us to significantly correct the model and to obtain better predictions (the corresponding function decreases slowly as $\beta$ increases). At the same time, if the rule is not used, then the F1 measure decreases quickly as $\beta$ increases.

Figure 5: F1 measures as functions of $\beta$ for the cases when no rule is used (the curve with triangle markers) and when the rule “IF $c^{(9)}=1$, THEN $c^{(0)}=1$” is used (the curve with circle markers) for the modified MNIST

We also consider a modification of the Large-scale CelebFaces Attributes (CelebA) dataset [57]. The original dataset contains 200,000 images, each annotated with 40 face attributes and 10,000 classes. The modification is restricted to 20 classes such that images with labels from 1 to 500 have the label 0, images with labels from 501 to 1000 have the label 1, etc. The number of concepts is 40. All concepts are binary. To compare FI-CBL with CBM, we find the F1 measure averaged over all 40 concepts. The number of testing images is approximately 65,000, i.e., 40% of the original dataset. The K-means algorithm with 256 clusters is used for clustering. The embedding size is 32. The number of patches is 35. To ensure that whole concepts fall into separate patches, we propose using overlapping patches similar to sliding windows in convolutional neural networks.
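Overlapping patches of this kind can be generated with a sliding window. The Python sketch below only computes the top-left corners of the windows. The image size, patch size, and stride in the example are hypothetical values chosen so that the grid contains $7\times 5=35$ windows; the actual patch geometry is not stated here.

```python
def window_origins(size, patch, stride):
    """Offsets of windows of length `patch` along an axis of length `size`."""
    if patch > size:
        raise ValueError("patch must not exceed the axis length")
    last = size - patch
    origins = list(range(0, last + 1, stride))
    if origins[-1] != last:
        origins.append(last)  # final window is shifted to touch the border
    return origins

def overlapping_patches(height, width, patch, stride):
    """Top-left corners (top, left) of all windows in row-major order."""
    return [(top, left)
            for top in window_origins(height, patch, stride)
            for left in window_origins(width, patch, stride)]

# Hypothetical geometry: 256x192 image, 64x64 patches, stride 32.
corners = overlapping_patches(256, 192, patch=64, stride=32)
```

With a stride smaller than the patch size, neighbouring windows share pixels, so an attribute that straddles a patch border in one window lies entirely inside an adjacent one.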

F1 measures as functions of the training set size of the modified CelebA dataset obtained by using FI-CBL and CBM are shown in Fig. 6. Similarly to the previous numerical example with the MNIST dataset, FI-CBL outperforms the original CBM when the training set size is small (smaller than 4000 instances).

Figure 6: The averaged F1 measures as functions of the training set size for the modified CelebA dataset obtained by using FI-CBL and CBM

Let us consider another numerical example with the original MNIST dataset. All instances are annotated by the following concepts:

0 - target, $\mathcal{C}^{(0)}=\{0,1\}$, values of the target are given in (39);

1 - the even/odd digit, $\mathcal{C}^{(1)}=\{0,1\}$;

2 - the digit is not less than 5 / is less than 5, $\mathcal{C}^{(2)}=\{0,1\}$;

3 - the remainder of division of the digit by 3, $\mathcal{C}^{(3)}=\{0,1,2\}$.

In sum, digits are represented by the following sets of concepts:

\begin{align}
0 &\rightarrow [0,0,1,0];\ 1\rightarrow [1,1,1,1];\ 2\rightarrow [0,0,1,2]; \nonumber\\
3 &\rightarrow [1,1,1,0];\ 4\rightarrow [0,0,1,1];\ 5\rightarrow [0,1,0,2]; \nonumber\\
6 &\rightarrow [1,0,0,0];\ 7\rightarrow [0,1,0,1];\ 8\rightarrow [1,0,0,2]; \nonumber\\
9 &\rightarrow [0,1,0,0]. \tag{39}
\end{align}
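The encoding in (39) can be reproduced programmatically. The sketch below derives each concept from the digit and checks the result against the table. The closed-form expression for the target (equal to $1$ exactly when the parity concept and the "less than 5" concept coincide) is inferred from the table itself and is an assumption of this sketch; the function name `digit_concepts` is hypothetical.

```python
def digit_concepts(d):
    """Return [target, odd, small, remainder] for a digit d in 0..9."""
    odd = d % 2            # concept 1: 0 = even, 1 = odd
    small = int(d < 5)     # concept 2: 0 = not less than 5, 1 = less than 5
    remainder = d % 3      # concept 3: remainder of division by 3
    target = int(odd == small)  # inferred from (39): 1 for digits 1, 3, 6, 8
    return [target, odd, small, remainder]


# Table (39), copied from the text.
TABLE_39 = {
    0: [0, 0, 1, 0], 1: [1, 1, 1, 1], 2: [0, 0, 1, 2],
    3: [1, 1, 1, 0], 4: [0, 0, 1, 1], 5: [0, 1, 0, 2],
    6: [1, 0, 0, 0], 7: [0, 1, 0, 1], 8: [1, 0, 0, 2],
    9: [0, 1, 0, 0],
}

mismatches = [d for d in range(10) if digit_concepts(d) != TABLE_39[d]]
print(mismatches)
```

An empty list of mismatches confirms that every row of (39) is consistent with the verbal definitions of concepts 1 to 3.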

The EM algorithm with $128$ clusters is used for clustering. The embedding size is $16$. The testing set consists of $24,000$ images.

F1 measures as functions of the training set size for four concepts of the original MNIST dataset obtained by using the proposed method and CBM are shown in Fig. 7. One can see from Fig. 7 that CBM outperforms FI-CBL in most cases even when the training set size is small. This numerical example aims to show that CBM can provide better results even when the number of training instances is small. At the same time, the next numerical experiments aim to study how expert rules can correct incorrect concepts and provide better results in comparison with CBM. We now aim to show that CBM provides worse results than FI-CBL when labels are noisy. We again invert some correct labels that satisfy the expert rule. The rule is “IF $c^{(1)}=1$ AND $c^{(2)}=1$, THEN $c^{(0)}=1$”. Fig. 8 illustrates the difference between F1 measures as functions of $\beta$ for two cases: before using the rule and after using the rule. It can be seen from Fig. 8 that the use of the rule allows us to significantly correct the model and to obtain better predictions. When $\beta$ exceeds $0.16$, both models show worse results.

Figure 7: The F1 measures of the proposed method and CBM as functions of the training set size for all concepts of the original MNIST dataset
Figure 8: F1 measures as functions of $\beta$ for two cases: before using the rule and after using the rule for the MNIST dataset

In the next experiment, we invert half of the target labels, i.e., the target is completely confused. We aim to improve the model prediction by incorporating increasingly detailed expert rules:

  • $g_{1}(\mathbf{c})=\left[c^{(1)}=1\wedge c^{(2)}=1\rightarrow c^{(0)}=1\right]$;

  • $g_{2}(\mathbf{c})=\left[\left(c^{(1)}=0\wedge c^{(2)}=1\rightarrow c^{(0)}=0\right)\wedge\left(c^{(1)}=1\wedge c^{(2)}=1\rightarrow c^{(0)}=1\right)\right]$;

  • $g_{3}(\mathbf{c})=\left[(c^{(1)}=0\wedge c^{(2)}=0)\vee(c^{(1)}=1\wedge c^{(2)}=1)\leftrightarrow c^{(0)}=1\right]$.
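To make the notion of "increasingly detailed" rules concrete, the three rules can be written as Boolean functions and evaluated over all $2\cdot 2\cdot 2\cdot 3=24$ possible concept vectors. The sketch below is an illustration only; the helper `implies` and the list layout `c = [c0, c1, c2, c3]` are assumptions of this sketch.

```python
from itertools import product


def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q


def g1(c):
    return implies(c[1] == 1 and c[2] == 1, c[0] == 1)


def g2(c):
    return (implies(c[1] == 0 and c[2] == 1, c[0] == 0)
            and implies(c[1] == 1 and c[2] == 1, c[0] == 1))


def g3(c):
    # "if and only if": both sides must have the same truth value
    return ((c[1] == 0 and c[2] == 0) or (c[1] == 1 and c[2] == 1)) == (c[0] == 1)


# All concept vectors: binary target, binary c1 and c2, ternary c3.
all_vectors = list(product((0, 1), (0, 1), (0, 1), (0, 1, 2)))
admissible = [sum(1 for c in all_vectors if g(c)) for g in (g1, g2, g3)]
print(admissible)  # number of concept vectors allowed by g1, g2, g3
```

Each rule admits fewer concept vectors than the previous one, and $g_3$ leaves exactly one admissible target value for every combination of $c^{(1)}$ and $c^{(2)}$, which is why it carries the most information about the target.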

Fig. 9 shows histograms of the binary target prediction probabilities $\alpha$ without rules and with different rules. It can be seen from the first histogram that most prediction probabilities are close to $0.5$. This implies that the inversion of the target labels leads to total uncertainty, i.e., the model cannot correctly classify instances. The first rule $g_{1}(\mathbf{c})$ partially improves the results. One can see from the second histogram that a small part of the instances are classified with probabilities close to $1$. However, most prediction probabilities are still close to $0.5$ despite the rule. The second rule $g_{2}(\mathbf{c})$ consists of two rules and corrects many inverted labels. One can see from the third histogram that the uncertainty of the target probabilities significantly decreases in comparison with the previous cases. Finally, the third rule $g_{3}(\mathbf{c})$, which is the strongest one due to the logical operation “if and only if” denoted as “$\leftrightarrow$”, provides the best results shown in the fourth histogram. We can see that the uncertainty of predictions is minimal. This is a very important observation which illustrates how the incorporated expert rules are able to improve the model.

Figure 9: Four histograms of the target prediction probabilities obtained under conditions of confused concept labels and different expert rules for the original MNIST dataset
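The qualitative behaviour of the histograms can be illustrated by a toy calculation. The sketch below is not the FI-CBL inference procedure: it simply conditions a product distribution over $(c^{(0)},c^{(1)},c^{(2)})$ on the event $g(\mathbf{c})=1$ and reports the resulting probability of $c^{(0)}=1$. The independence assumption, the function name `target_posterior` and all input probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product


def implies(p, q):
    return (not p) or q


# Rules over c = (c0, c1, c2); the remainder concept is not involved.
RULES = {
    "none": lambda c: True,
    "g1": lambda c: implies(c[1] == 1 and c[2] == 1, c[0] == 1),
    "g2": lambda c: (implies(c[1] == 0 and c[2] == 1, c[0] == 0)
                     and implies(c[1] == 1 and c[2] == 1, c[0] == 1)),
    "g3": lambda c: ((c[1] == c[2]) == (c[0] == 1)),
}


def target_posterior(p_target, p_odd, p_small, rule):
    """P(c0 = 1 | rule holds) under independent concept marginals (assumption)."""
    marginals = (p_target, p_odd, p_small)
    num = den = 0.0
    for c in product((0, 1), repeat=3):
        weight = 1.0
        for value, p in zip(c, marginals):
            weight *= p if value == 1 else 1.0 - p
        if rule(c):
            den += weight
            if c[0] == 1:
                num += weight
    return num / den


# A confused target (0.5) with confident concepts "even" and "not less than 5".
for name, rule in RULES.items():
    print(name, round(target_posterior(0.5, 0.1, 0.1, rule), 4))
```

For this hypothetical instance the target probability stays near $0.5$ without a rule and under $g_1$ and $g_2$, whose premises are unlikely to hold, and moves away from $0.5$ only under $g_3$, which mirrors the progression described for the four histograms.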

9 Conclusion

Let us point out advantages and disadvantages of FI-CBL.

Advantages:

  • FI-CBL is transparent. In contrast to the neural network implementation of CBL, where the network as well as the bottleneck layer are black boxes, the proposed model has a clear and explicit sequence of calculations. All probabilities (conditional and unconditional) have a clear frequency interpretation. This allows us to extend the model in order to take into account the possibly limited number of instances in datasets and to apply robust statistical methods which can correct the posterior probabilities and improve the classification. Moreover, one can also observe the entire processes of training and inference in order to understand the modeling results.

  • FI-CBL is flexible. We can simply change the number of patches, the number of clusters, the autoencoder architecture, and the decision-making threshold.

  • FI-CBL can outperform CBM when the number of training instances is small.

  • FI-CBL is interpretable.

Disadvantages:

  • The main disadvantage is the need to obtain a good clustering and to choose a proper number of clusters. We could learn the clustering procedure jointly with the autoencoder, but this approach may significantly complicate the method and weaken its ability to deal with small datasets.

  • The effectiveness of FI-CBL greatly depends on the size of patches. A large difference in the size of concepts in an image can lead to significant deterioration of the model. One way to overcome this problem is to use sliding windows for producing patches, as has been implemented in the numerical experiment with the CelebA dataset. However, this approach requires additional study.

  • We need to store the dataset in order to use it every time a new instance is analyzed. On the other hand, we can store only a matrix of all conditional probabilities $P\left\{E_{l}^{(r)}=v\mid g(\mathbf{C})=1\right\}$ if the set of expert rules is not changed. If many clusters and expert rules are implemented, then the matrix can be very large.

  • Experiments have illustrated that FI-CBL may be inferior to CBMs when the number of training instances is large.

The above disadvantages can be regarded as problems whose solutions are directions for further research. In addition, the probabilistic approach used in FI-CBL allows us to simply extend the method to solve several problems of machine learning, including anomaly detection, unlearning, the attention mechanism, etc. These are also directions for further research.

References

  • [1] I. Lage and F. Doshi-Velez. Learning interpretable concept-based models with human feedback. arXiv:2012.02898, Dec 2020.
  • [2] A. Gupta and P.J. Narayanan. A survey on concept-based approaches for model improvement. arXiv:2403.14566, Mar 2024.
  • [3] Been Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
  • [4] Bowen Wang, Liangzhi Li, Y. Nakashima, and H. Nagahara. Learning bottleneck concepts in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10962–10971, 2023.
  • [5] Chih-Kuan Yeh, Been Kim, S. Arik, Chun-Liang Li, T. Pfister, and P. Ravikumar. On completeness-aware concept-based explanations in deep neural networks. In Advances in neural information processing systems, volume 33, pages 20554–20565, 2020.
  • [6] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, S. Mussmann, E. Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338–5348. PMLR, 2020.
  • [7] E. Poeta, G. Ciravegna, E. Pastor, T. Cerquitelli, and E. Baralis. Concept-based explainable artificial intelligence: A survey. arXiv:2312.12936, May 2023.
  • [8] Y. Yamamoto, T. Tsuzuki, and J. Akatsuka. Automated acquisition of explainable knowledge from unannotated histopathology images. Nature Communications, 10(5642):1–9, 2019.
  • [9] J. Amores. Multiple instance classification: review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
  • [10] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition, 77:329–353, 2018.
  • [11] J. Yao, X. Zhu, J. Jonnagaddala, N. Hawkins, and J. Huang. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning network. Medical Image Analysis, 65(101789):1–14, 2020.
  • [12] Z.-H. Zhou. Multi-instance learning: A survey. Technical report, National Laboratory for Novel Software Technology, Nanjing University, 2004.
  • [13] A.V. Konstantinov and L.V. Utkin. Incorporating expert rules into neural networks in the framework of concept-based learning. arXiv:2402.14726, Feb 2024.
  • [14] M. Dreyer, R. Achtibat, W. Samek, and S. Lapuschkin. Understanding the (extra-) ordinary: Validating deep model decisions with prototypical concept-based explanations. arXiv:2311.16681, Nov 2023.
  • [15] Yangqing Jia, J.T. Abbott, J.L Austerweil, T. Griffiths, and T. Darrell. Visual concept learning: Combining machine vision and bayesian generalization on concept hierarchies. In Advances in Neural Information Processing Systems, volume 26, pages 1–9, 2013.
  • [16] V. Pendyala and Jihye Choi. Concept-based explanations for tabular data. arXiv:2209.05690, Sep 2022.
  • [17] C. Obermair, A. Fuchs, F. Pernkopf, L. Felsberger, A. Apollonio, and D. Wollmann. Example or prototype? learning concept-based explanations in time-series. In Asian Conference on Machine Learning, pages 816–831. PMLR, 2023.
  • [18] Wensi Tang, Lu Liu, and Guodong Long. Interpretable time-series classification on few-shot samples. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
  • [19] Jihye Choi, J. Raghuram, Ryan Feng, Jiefeng Chen, S. Jha, and A. Prakash. Concept-based explanations for out-of-distribution detectors. In International Conference on Machine Learning, pages 5817–5837. PMLR, 2023.
  • [20] L.R. Sevyeri, I. Sheth, F. Farahnak, and S.A. Enger. Transparent anomaly detection via concept-based explanations. arXiv:2310.10702, Oct 2023.
  • [21] L.A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), pages 264–279, 2018.
  • [22] R. Marcinkevičs, P.R. Wolfertstetter, U. Klimiene, Kieran Chin-Cheong, A. Paschke, J. Zerres, M. Denzinger, D. Niederberger, S. Wellmann, E. Ozkan, et al. Interpretable and intervenable ultrasonography-based machine learning models for pediatric appendicitis. Medical Image Analysis, 91:103042, 2024.
  • [23] A.A. Meldo, L.V. Utkin, M.S. Kovalev, and E.M. Kasimov. The natural language explanation algorithms for the lung cancer computer-aided diagnosis system. Artificial Intelligence in Medicine, 108(Article 101952):1–10, 2020.
  • [24] C. Patrício, J.C. Neves, and L.F. Teixeira. Coherent concept-based explanations in medical image and its application to skin lesion diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3798–3807, 2023.
  • [25] A. Vats, M. Pedersen, and A. Mohammed. Concept-based reasoning in medical imaging. International Journal of Computer Assisted Radiology and Surgery, 18(7):1335–1339, 2023.
  • [26] An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, et al. Robust and interpretable medical image classifiers via concept bottleneck models. arXiv:2310.03182, Oct 2023.
  • [27] Jae Hee Lee, S. Lanza, and S. Wermter. From neural activations to concepts: A survey on explaining concepts in neural networks. arXiv:2310.11884, Oct 2023.
  • [28] A. Mahinpei, J. Clark, I. Lage, F. Doshi-Velez, and Weiwei Pan. Promises and pitfalls of black-box concept learning models. arXiv:2106.13314, Jun 2021.
  • [29] I. Sheth and S.E. Kahou. Auxiliary losses for learning generalizable concept-based models. arXiv:2311.11108, Nov 2023.
  • [30] Mert Yuksekgonul, Maggie Wang, and James Zou. Post-hoc concept bottleneck models. arXiv:2205.15480, May 2022.
  • [31] Z.M. Espinosa, P. Barbiero, G. Ciravegna, G. Marra, F. Giannini, M. Diligenti, Z. Shams, F. Precioso, S. Melacci, A. Weller, et al. Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems, volume 35, pages 21400–21413, 2022.
  • [32] A.A. Ismail, J. Adebayo, H.C. Bravo, S. Ra, and Kyunghyun Cho. Concept bottleneck generative models. In Proceedings of ICML 2023. Workshop on Deployment Challenges for Generative AI, https://openreview.net/group?id=ICML.cc/2023/Workshop, pages 1–10, 2023.
  • [33] Eunji Kim, Dahuin Jung, Sangha Park, Siwon Kim, and Sungroh Yoon. Probabilistic concept bottleneck models. In International Conference on Machine Learning, pages 16521–16540. PMLR, 2023.
  • [34] Naveen R., Mateo E. Zarlenga, Juyeon Heo, and Mateja Jamnik. Do concept bottleneck models obey locality? arXiv:2401.01259, Jan 2024.
  • [35] R. Marcinkevics, S. Laguna, M. Vandenhirtz, and J.E. Vogt. Beyond concept bottleneck models: How to make black boxes intervenable? arXiv:2401.13544, Jan 2024.
  • [36] F. Pittino, V. Dimitrievska, and R. Heer. Hierarchical concept bottleneck models for vision and their application to explainable fine classification and tracking. Engineering Applications of Artificial Intelligence, 118:105674, 2023.
  • [37] M. Havasi, S. Parbhoo, and F. Doshi-Velez. Addressing leakage in concept bottleneck models. In Advances in Neural Information Processing Systems, volume 35, pages 23386–23397, 2022.
  • [38] A. Radford, Jong Wook Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [39] R. Kazmierczak, E. Berthier, G. Frehse, and G. Franchi. CLIP-QDA: An explainable concept bottleneck model. arXiv:2312.00110, Dec 2023.
  • [40] K. Chauhan, R. Tiwari, J. Freyberg, P. Shenoy, and K. Dvijotham. Interactive concept bottleneck models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 5948–5955, 2023.
  • [41] Yan Cui, Shuhong Liu, Liuzhuozheng Li, and Zhiyuan Yuan. Ceir: Concept-based explainable image representation learning. arXiv:2312.10747, Dec 2023.
  • [42] E. Marconato, A. Passerini, and S. Teso. Glancenets: Interpretable, leak-proof concept-based models. In Advances in Neural Information Processing Systems, volume 35, pages 21212–21227, 2022.
  • [43] A. Margeloiu, M. Ashman, U. Bhatt, Yanzhi Chen, M. Jamnik, and A. Weller. Do concept bottleneck models learn as intended? arXiv:2105.04289, May 2021.
  • [44] Ao Sun, Yuanyuan Yuan, Pingchuan Ma, and Shuai Wang. Eliminating information leakage in hard concept bottleneck models with supervised, hierarchical concept learning. arXiv:2402.05945, Feb 2024.
  • [45] L. Von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch, J. Pfrommer, A. Pick, R. Ramamurthy, et al. Informed machine learning–a taxonomy and survey of integrating prior knowledge into learning systems. IEEE Transactions on Knowledge and Data Engineering, 35(1):614–633, 2021.
  • [46] M. Diligenti, M. Gori, and C. Sacca. Semantic-based regularization for learning and inference. Artificial Intelligence, 244:143–165, 2017.
  • [47] M. Diligenti, S. Roychowdhury, and M. Gori. Integrating prior knowledge into deep learning. In 2017 16th IEEE international conference on machine learning and applications (ICMLA), pages 920–923. IEEE, 2017.
  • [48] Zhiting Hu, Zichao Yang, R. Salakhutdinov, and Eric Xing. Deep neural networks with massive learned knowledge. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1670–1679, 2016.
  • [49] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Broeck. A semantic loss function for deep learning with symbolic knowledge. In International conference on machine learning, pages 5502–5511. PMLR, 2018.
  • [50] M.V.M. França, G. Zaverucha, and d’Avila G.A.S. Fast relational learning using bottom clause propositionalization with artificial neural networks. Machine learning, 94:81–104, 2014.
  • [51] A. Garcez, M. Gori, L.C. Lamb, L. Serafini, M. Spranger, and S.N. Tran. Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. Journal of Applied Logics, 6(4):611–632, 2019.
  • [52] Xiao-Wen Yang, Jie-Jing Shao, Wei-Wei Tu, Yu-Feng Li, Wang-Zhou Dai, and Zhi-Hua Zhou. Safe abductive learning in the presence of inaccurate rules. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16361–16369, 2024.
  • [53] V. Chen, U. Bhatt, H. Heidari, A. Weller, and A. Talwalkar. Perspectives on incorporating expert feedback into model updates. Patterns, 4(7):1–13, 2023.
  • [54] Kaiwen Xu, K. Fukuchi, Y. Akimoto, and J. Sakuma. Statistically significant concept-based explanation of image classifiers via model knockoffs. arXiv:2305.18362, May 2023.
  • [55] Hung T. Nguyen, Masao Mukaidono, and V. Kreinovich. Probability of implication, logical version of bayes theorem, and fuzzy logic operations. In IEEE World Congress on Computational Intelligence. IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02, volume 1, pages 530–535. IEEE, 2002.
  • [56] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [57] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738. IEEE, 2015.