License: arXiv.org perpetual non-exclusive license
arXiv:2401.10749v1 [cs.CY] 29 Dec 2023

ReliCD: A Reliable Cognitive Diagnosis Framework with Confidence Awareness

Yunfei Zhang1,*, Chuan Qin2,*, Dazhong Shen3, Hai** Ma1,†, Le Zhang4, Xingyi Zhang5, Hengshu Zhu2,†
* Yunfei Zhang and Chuan Qin contribute equally to this research. † Hai** Ma and Hengshu Zhu are corresponding authors.
1 Department of Information Materials and Intelligent Sensing Laboratory of Anhui Province, Institutes of Physical Science and Information Technology, Anhui University, China, {[email protected], [email protected]}
2 Career Science Lab, BOSS Zhipin, China, {chuanqin0426, zhuhengshu}@gmail.com
3 Shanghai Artificial Intelligence Laboratory, China, [email protected]
4 Business Intelligence Lab, Baidu Inc, China, [email protected]
5 School of Computer Science and Technology, Anhui University, China, [email protected]
Abstract

During the past few decades, cognitive diagnostics modeling, which is capable of quantifying the learning status and knowledge mastery levels of students, has attracted increasing attention in computational education communities. Indeed, the recent advances in neural networks have greatly enhanced the performance of traditional cognitive diagnosis models through learning the deep representations of students and exercises. Nevertheless, existing approaches often suffer from the issue of overconfidence in predicting students’ mastery levels, which is primarily caused by the unavoidable noise and sparsity in realistic student-exercise interaction data, severely hindering the educational application of diagnostic feedback. To address this, in this paper, we propose a novel Reliable Cognitive Diagnosis (ReliCD) framework, which can quantify the confidence of the diagnosis feedback and is flexible for different cognitive diagnostic functions. Specifically, we first propose a Bayesian method to explicitly estimate the state uncertainty of different knowledge concepts for students, which enables the confidence quantification of diagnostic feedback. In particular, to account for potential differences, we suggest modeling individual prior distributions for the latent variables of different ability concepts using a pre-trained model. Additionally, we introduce a logical hypothesis for ranking confidence levels. Along this line, we design a novel calibration loss to optimize the confidence parameters by modeling the process of student performance prediction. Finally, extensive experiments on four real-world datasets clearly demonstrate the effectiveness of our ReliCD framework.

Index Terms:
Reliable cognitive diagnosis, intelligent education, knowledge state uncertainty

I Introduction

Cognitive diagnosis, as an essential component of computer-aided education, has garnered increasing attention over the past decades [1, 2, 3]. The primary objective of cognitive diagnostics modeling is to quantitatively assess students’ learning status and knowledge mastery levels, providing valuable formative feedback [1, 2]. Indeed, relevant studies have enabled a wide range of downstream educational applications, such as course recommendations [4], student assessment [5], and computerized adaptive testing [6]. As shown in Figure 1(a), given the answering records of student Lano concerning a series of exercises, the cognitive diagnosis model can automatically estimate her mastery levels of various knowledge concepts.

Figure 1: (a) An example of cognitive diagnosis; (b) Lano’s predicted diagnostic feedback on concept $C_2$ with different interaction data and the corresponding accuracy of her performance prediction on all the exercises related to the concept $C_2$ in the test set, where $e_{1:j}$ denotes the exercise set $\{e_1, e_2, \ldots, e_j\}$ and $\bar{h}$ indicates Lano’s actual ability state on $C_2$.

In the literature, traditional cognitive diagnosis models (CDMs) utilize different linear psychometric functions to measure students’ learning status by modeling the process of student performance prediction, such as Deterministic Inputs, Noisy “And” gate (DINA) [7] and Item Response Theory (IRT) [8]. Recently, with the rapid development of deep learning techniques, several neural-based cognitive diagnostic methods have been proposed to enhance diagnostic performance. For instance, the neural cognitive diagnosis framework (NCD) utilizes neural networks to model student-exercise interactions, in order to uncover deeper features of both students and exercises [2]. Moreover, the flexibility of neural model design has enabled researchers to incorporate additional information, such as concept dependency maps [3] and student profiling [9], to further improve the effectiveness and interpretability of the models.

Previous studies have evaluated the effectiveness of cognitive diagnostic models by calculating the accuracy of student performance prediction, but they have not measured the reliability of diagnostic feedback. Meanwhile, because of the noise and sparsity in student-exercise interaction data, existing approaches can be overconfident in predicting students’ mastery, severely reducing the reliability of real-time diagnostic feedback in practical online education systems. More specifically, as illustrated in Figure 1(b), after Lano interacts with each exercise (i.e., from $e_1$ to $e_5$), we present the cognitive diagnosis model’s estimate of her mastery of knowledge concept $C_2$ and the accuracy of her performance prediction on all the exercises related to $C_2$ in the test set. We find that, due to the noise present in the interaction data (i.e., $\langle$Lano, $e_5$, $\checkmark\rangle$), the mastery of $C_2$ at $h_5$ deviates from the actual state $\bar{h}$. This indicates that trust in the diagnostic feedback cannot be assumed to grow monotonically with the number of interactions. Furthermore, traditional evaluation metrics like accuracy are non-smooth functions, which can yield the same evaluation outcome for different diagnostic feedback. Additionally, these indicators are often not available in real time during the diagnostic process in practical use. Consequently, an ideal cognitive diagnosis model should provide both accurate diagnostic feedback and an indication of its reliability.

To this end, in this paper, we propose a novel reliable cognitive diagnosis framework, namely ReliCD. To the best of our knowledge, this is the first framework that quantifies the confidence of the diagnosis feedback while remaining flexible for different cognitive diagnostic functions. Specifically, we first propose a Bayesian method for explicitly estimating the uncertainty of students’ states for various knowledge concepts with Gaussian latent variables, where the mean parameter represents the average ability status and the variance enables the quantification of the diagnostic feedback’s confidence. In particular, to account for potential differences, we model an individual prior distribution for the latent variables of different ability concepts with a pre-trained model. Then, we introduce a logical hypothesis for ranking confidence levels and present a novel calibration loss that optimizes the confidence parameters of the diagnostic feedback by modeling the process of student performance prediction. Finally, extensive experiments on four real-world datasets demonstrate the effectiveness and flexibility of our ReliCD.

II Related Work

Generally, the related work in this paper can be grouped into two categories: cognitive diagnosis and confidence estimation.

II-A Cognitive Diagnosis

The main task of cognitive diagnosis is to use students’ responses to exercises to diagnose their ability state. Over the past decades, experts in educational psychology and related fields have proposed many cognitive diagnostic models. The two most classic ones are IRT [8] and DINA [7]. In IRT, Embretson et al. represented students’ ability state as a one-dimensional continuous scalar, and a logistic function is used to predict the probability that the student responds correctly to the exercise. Later, some researchers improved upon IRT and proposed MIRT [10] by extending the ability state of students to multi-dimensional vectors. Different from IRT, DINA uses a binary vector to model the student’s ability state, with each dimension’s value representing his/her mastery of the relevant knowledge concept. Each dimension takes one of two values, 1 (mastered) or 0 (not mastered). Furthermore, De La Torre argued that DINA imposes strong assumptions and constraints that do not conform to the actual situation, and proposed a generalized DINA (G-DINA) [11] to improve the diagnostic performance by weakening these constraints.

In recent years, neural-based cognitive diagnosis models have achieved state-of-the-art prediction performance, benefiting from the successful application of neural networks in various fields, including recommendation systems [12], knowledge tracing [13], and computer vision [14]. These works can be mainly divided into two aspects. The first aspect focuses on designing diagnostic functions that leverage the power of neural networks to capture complex and non-linear interactions between students and exercises, such as NCD [2]. The second is to use neural networks to enrich the representation of students and exercises by considering more additional information (e.g., the exercise text information, the relationship between knowledge concepts). For example, deep IRT (DIRT) [15] uses the semantic information of the exercise text to enrich the parameter representation of the traditional IRT. Educational context-aware cognitive diagnosis (ECD) [16] was proposed by incorporating the student’s educational background into the modeling of student knowledge status. Gao et al. [3] proposed the relation map-driven cognitive diagnosis (RCD) framework by exploiting the prerequisite relation and similarity relationship of knowledge concepts. Ma et al. [17] proposed a prerequisite attention model (PAKP) for knowledge proficiency diagnosis of students by considering the prerequisite relationship of knowledge concepts and learning the influence weights of predecessor knowledge concepts on successor knowledge concepts. Furthermore, Li et al. [18] proposed a novel CDM, namely HCDF, to enhance diagnostic performance by modeling the hierarchical relationship between knowledge concepts.

The majority of existing studies primarily concentrate on enhancing the accuracy of student performance prediction. However, there has been a notable lack of comprehensive investigation into the aspect of reliability in diagnostic feedback. In this paper, we introduce a novel approach for reliable cognitive diagnosis, which is the first to quantitatively assess the reliability of diagnostic feedback.

II-B Confidence Estimation

Confidence estimation has been incorporated within the machine learning community in some specific areas including autonomous driving [19], medical applications [20], and career mobility analysis [21, 22], so as to provide insight into the reliability of the results while making accurate predictions. The reliability feedback of results can serve as a measure for future tasks. For instance, Yukun et al. [23] suggested that challenging cases with low confidence levels in the field of medicine should be reviewed by skilled surgeons.

In recent years, research on confidence modeling has evolved in two directions. The first direction is to quantify the confidence of predicted results with diverse heuristic approaches. For example, DeVries et al. [24] enhanced the model’s prediction by adding a branch that calculates a confidence value on top of the original classification task. The confidence value is utilized to identify whether the input sample is an out-of-distribution (OOD) sample. Hendrycks et al. [25] used the predicted softmax probability of the sample as the confidence estimation and detected OOD samples by selecting the samples with the minimum softmax probability values. Kendall et al. [26] argued that the model uncertainty could be explained by inherent noise in the captured data with Bayesian approaches. The second direction, confidence calibration, has also received extensive attention recently. For instance, Guo et al. [27] analyzed the reasons for overconfidence (i.e., model capacity, batch normalization, and weight decay) in models based on deep neural networks and gave some post-processing techniques (e.g., temperature scaling, matrix scaling, and vector scaling) to deal with these problems. Moon et al. [28] introduced the correctness ranking loss to ensure the credibility of the predicted probability, which defines the optimization objective that the confidence estimate for a correctly predicted sample is greater than the confidence estimate for an incorrectly predicted sample. However, these confidence estimation methods cannot be directly applied to the task of cognitive diagnosis. In this paper, we propose a novel calibration loss that optimizes the confidence parameters, thereby ensuring the reliability of the predicted probabilities and allowing the model to maintain confidence in its output results.

Figure 2: (a) The distribution of all students’ ability status diagnosed by NCD on the Assist2009 dataset. The blue part represents diagnostic status of knowledge concepts not interacted with, and the red part represents diagnostic status of knowledge concepts interacted with. (b) The density plot of all students’ status on the knowledge concepts that they have interacted with.

III Preliminaries

In this section, we first introduce some currently known cognitive diagnosis functions (i.e., IRT, MIRT, and NCD). Then, we analyze the diagnostic feedback of the previous CDMs using NCD as a case study. Finally, we formally define the research problem being investigated in this paper.

III-A Cognitive Diagnostic Functions

Generally, cognitive diagnosis in computational education aims to determine the student’s ability status through the student exercising performance prediction task.

As a classic and representative diagnostic formula in educational psychology, IRT [8] portrays the ability status of student $s_i$ with an integrated value $\theta_i \in R^1$. The logistic regression function is used to predict the probability $p(y_{ij}=1)$ that the student $s_i$ will answer the exercise $e_j$ correctly as follows,

$$p(y_{ij}=1)=\frac{1}{1+e^{-D\beta_j(\theta_i-\alpha_j)}}, \tag{1}$$

where $\alpha_j$ and $\beta_j \in R^1$ are the difficulty and discrimination of exercise $e_j$, respectively. $D$ is a constant.
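
As a minimal sketch of Eq. (1), the function below evaluates the IRT response probability in plain Python. The function name `irt_probability` and the default value D = 1.702 (a scaling constant commonly used with the logistic IRT model) are assumptions made for illustration; the text above only states that D is a constant.

```python
import math


def irt_probability(theta: float, alpha: float, beta: float, D: float = 1.702) -> float:
    """Probability of a correct response under Eq. (1).

    theta: student ability, alpha: exercise difficulty,
    beta: exercise discrimination, D: scaling constant (assumed default).
    """
    return 1.0 / (1.0 + math.exp(-D * beta * (theta - alpha)))


# When ability equals difficulty the exponent is zero, so the probability is 0.5.
p_equal = irt_probability(theta=0.3, alpha=0.3, beta=1.2)

# With positive discrimination, ability above difficulty pushes the probability above 0.5.
p_high = irt_probability(theta=1.0, alpha=0.3, beta=1.2)
```

The two calls illustrate the roles of the parameters: difficulty shifts the point at which the probability crosses 0.5, and discrimination controls how quickly it moves away from 0.5.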

MIRT [10] expands the student’s ability status and exercise parameters from an integrated value to multi-dimensional vectors on the basis of IRT, so as to assess the student’s ability status from multiple aspects. In this paper, following prior work (e.g., RCD [3]), we map each dimension of MIRT to a specific knowledge concept by integrating the $Q$-matrix [2]. Under such consideration, the probability that student $s_i$ with ability status vector $\theta_i$ makes a correct response to exercise $e_j$ with difficulty vector $\alpha_j$ can be expressed as:

$$p(y_{ij}=1)=\varrho\left(f_{sum}\left(Q_j\circ(\theta_i-\alpha_j)\right)\right), \tag{2}$$

where $Q_j$ indicates which knowledge concepts are relevant to $e_j$, $\theta_i$ and $\alpha_j \in R^K$, $f_{sum}(\cdot)$ is the sum operation, and $\varrho(\cdot)$ is the sigmoid function.
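
The masking role of $Q_j$ in Eq. (2) can be illustrated with a short sketch. The helper name `mirt_probability` and the toy values for $K = 3$ concepts are hypothetical and only serve to show that concepts not covered by the exercise do not influence the prediction.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mirt_probability(q_row: Sequence[int],
                     theta: Sequence[float],
                     alpha: Sequence[float]) -> float:
    """Eq. (2): mask (theta - alpha) with the Q-matrix row, sum, then apply the sigmoid."""
    masked = [q * (t - a) for q, t, a in zip(q_row, theta, alpha)]
    return sigmoid(sum(masked))


# Hypothetical exercise covering concepts 0 and 2 out of K = 3.
q_row = [1, 0, 1]
alpha = [0.5, 0.9, 0.4]

p = mirt_probability(q_row, [0.8, 0.1, 0.6], alpha)

# Changing the ability on the uncovered concept 1 leaves the prediction unchanged.
p_same = mirt_probability(q_row, [0.8, 0.7, 0.6], alpha)
```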

NCD [2] attempts to accommodate complex nonlinear interactions between students and exercises by building a new diagnostic function consisting of three fully connected layers ($f_{MLP}$) and one shallow layer inspired by MIRT. The cognitive diagnostic function of NCD can be formalized as:

$$p(y_{ij}=1)=\varrho\left(f_{MLP}\left(Q_j\circ(\theta_i-\alpha_j)\times\beta_j\right)\right), \tag{3}$$

where $\theta_i$ is a $K$-dimensional vector representing the ability status of student $s_i$, $\alpha_j$ and $\beta_j \in R^K$ are exercise difficulty and exercise discrimination, respectively. The value of each dimension in $\theta_i$ indicates student $s_i$’s mastery level of the corresponding knowledge concept.
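
A rough sketch of the structure of Eq. (3) is given below. It is not the NCD implementation: the layer sizes, the random weights, the zero biases, and the sigmoid activation between hidden layers are all assumptions chosen only to make the example self-contained. What it does show is the shallow interaction layer $Q_j\circ(\theta_i-\alpha_j)\times\beta_j$ feeding three fully connected layers.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    """One fully connected layer: each weight row produces one output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def ncd_probability(q_row, theta, alpha, beta, layers: Sequence[Layer]) -> float:
    # Shallow interaction layer from Eq. (3): Q_j ∘ (theta_i - alpha_j) × beta_j.
    x = [q * (t - a) * b for q, t, a, b in zip(q_row, theta, alpha, beta)]
    # Hidden layers with a sigmoid activation (an assumption of this sketch).
    for layer in layers[:-1]:
        x = [sigmoid(v) for v in dense(x, layer)]
    (logit,) = dense(x, layers[-1])  # the last layer has a single output unit
    return sigmoid(logit)


def random_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Hypothetical untrained layer with random weights and zero biases."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
layers = [random_layer(3, 4, rng), random_layer(4, 2, rng), random_layer(2, 1, rng)]
q_row, alpha, beta = [1, 0, 1], [0.5, 0.9, 0.4], [1.0, 0.7, 1.3]

p = ncd_probability(q_row, [0.8, 0.1, 0.6], alpha, beta, layers)
# As in MIRT, ability on a concept the exercise does not cover has no effect.
p_other = ncd_probability(q_row, [0.8, 0.9, 0.6], alpha, beta, layers)
```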

Figure 3: (a) The density plot of the correct rate of students’ performance prediction task related to knowledge concept #50 by NCD in the test set of Assist2009. (b) The density plot of the correct rate after randomly adding one noisy interaction data on concept #50 for each student.

III-B Diagnostic Feedback Analysis

While current CDMs show remarkable accuracy in predicting student performance, we argue that their diagnostic feedback may not always be meaningful.

Without loss of generality, we take the NCD model as an example. Specifically, we trained NCD on a public real-world dataset, namely Assist2009, and obtained all students’ diagnostic feedback, i.e., their ability status $\theta_i$. Figure 2 presents the distribution of $\theta_i$. Here, the red part indicates the ability status $\theta_{il}$ of student $s_i$ on each knowledge concept $c_l$ that $s_i$ has interacted with. Similarly, the blue part shows the ability status on knowledge concepts that the student has not interacted with. Although the distributions of ability status values for the interacted and non-interacted knowledge concepts are different, both are concentrated within [0.4, 0.6], which makes the diagnostic feedback hard to discriminate.

Moreover, we further analyze the impact of noisy interaction data on the diagnostic model. Figure 3(a) shows the density plot of the correct rate of students’ performance prediction related to knowledge concept #50 in the test set, based on the above NCD model on Assist2009. Next, we add one randomly generated interaction on knowledge concept #50 for each student. The corresponding density plot of the correct rate is shown in Figure 3(b). The students’ performance predictions degrade significantly after incorporating the noisy data, which demonstrates that even a single noisy interaction can undermine the reliability of diagnostic results.

Considering the aforementioned issues in existing CDMs, in this paper we focus on improving the reliability of diagnostic feedback by quantifying the confidence of the student’s ability status. The proposed framework, ReliCD, is designed to be adaptable to various cognitive diagnostic functions, including IRT, MIRT, and NCD.

III-C Problem Statement

III-C1 Task Overview

Cognitive diagnosis in intelligent education consists of three parts: a set of students $S=\{s_1,s_2,\ldots,s_N\}$, a set of exercises $E=\{e_1,e_2,\ldots,e_M\}$, and a set of knowledge concepts $C=\{c_1,c_2,\ldots,c_K\}$, where $N$, $M$, and $K$ represent the number of students, the number of exercises, and the number of knowledge concepts, respectively. The relationship between exercises and knowledge concepts is represented by a $Q$-matrix predefined by experts, where the $Q$-matrix is defined as $\{Q\}_{M\times K}$. If exercise $e_j$ contains knowledge concept $c_l$, then $Q_{jl}=1$; otherwise $Q_{jl}=0$.
The response logs $R$ consist of a set of triplets $\langle s_i, e_j, r_{ij}\rangle$, where $r_{ij}=1$ if student $s_i$ answers exercise $e_j$ correctly and $r_{ij}=0$ otherwise. Along this line, we can formally define the research problem in this paper as follows.

Problem Definition: Given the students’ answer logs $R$ and the experts’ predefined $Q$-matrix, our goal is to diagnose the students’ proficiency level for specific knowledge concepts and provide a confidence level for the diagnosis result.
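
To make the inputs of the problem definition concrete, the sketch below encodes a toy $Q$-matrix and response log in plain Python. All values and the helper name `interacted_concepts` are hypothetical; the helper simply recovers, for one student, which knowledge concepts are touched by at least one answered exercise, which is the distinction drawn in Figure 2 between concepts a student has and has not interacted with.

```python
from typing import List, Set, Tuple

# Hypothetical toy data: N = 2 students, M = 3 exercises, K = 2 knowledge concepts.
# Q[j][l] == 1 means exercise e_j contains knowledge concept c_l.
Q: List[List[int]] = [
    [1, 0],
    [0, 1],
    [1, 1],
]

# Response logs R as (student index i, exercise index j, r_ij) triplets.
R: List[Tuple[int, int, int]] = [
    (0, 0, 1),  # student 0 answered exercise 0 correctly
    (0, 2, 0),  # student 0 answered exercise 2 incorrectly
    (1, 1, 1),  # student 1 answered exercise 1 correctly
]


def interacted_concepts(student: int) -> Set[int]:
    """Concepts covered by at least one exercise the student has answered."""
    concepts: Set[int] = set()
    for s, j, _ in R:
        if s == student:
            concepts.update(l for l, q in enumerate(Q[j]) if q == 1)
    return concepts


concepts_0 = interacted_concepts(0)
concepts_1 = interacted_concepts(1)
```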

IV Method

Cognitive diagnosis is the process of diagnosing the student’s abilities $\theta$ in a particular skill or concept. However, the reliability of diagnosis can be affected by various factors, such as noise in the data and the sparsity of interaction data. To address this issue, it is crucial to incorporate methods of modeling uncertainty in the diagnostic process, which encourages an accurate and reliable final diagnosis. Along this line, we design a novel reliable cognitive diagnosis (ReliCD) framework. As shown in Figure 4, it can be divided into three parts: 1) the student’s state and uncertainty module, 2) the cognitive diagnosis module, and 3) the training objective. Additionally, we have employed two effective strategies, namely prior consensus and uncertainty regularization, to enhance the performance of our framework.

IV-A Student’s State and Uncertainty

To model the uncertainty in the diagnostic process, we argue that there should be a deviation in the ability representations of students diagnosed by traditional score prediction methods. This deviation is caused by errors that can occur when students interact with the exercises, which can lead to unreliable diagnostic results. To address this issue, we model the student’s ability representation as a Gaussian distribution. The mean parameter represents the average ability status, while the variance provides criteria for reliable diagnostic results. If the variance of the distribution is small, the diagnosis tends to be more reliable.

Figure 4: The illustration of our basic idea in ReliCD. Each student $s_i$ is denoted by a personalized Gaussian distribution variable $z_i\sim\mathcal{N}(\mu_i,\sigma_i^2)$ and the corresponding ability state $\theta_i$ can be specified by applying the Sigmoid function on $z_i$, which is also a distribution with support on $[0,1]$. Next, prior common cognition $\mathcal{N}(\mu_{mean},1)$ helps avoid the situation that students master all knowledge concepts in advance to 0. Then, a calibration loss is introduced to link uncertainty with the reliability of the student’s ability states by establishing a ranking relationship.

To obtain a personalized distribution representation (Gaussian distribution) for each student $s_i$, we multiply the student’s one-hot encoding by different transferable matrices to obtain the mean and variance parameters, respectively, i.e.,

$$q(z_i|x_i^s)=\mathcal{N}(\mu_i,\sigma_i^2),\quad \mu_i=\mathbf{W}_\mu^T\times x_i^s,\quad \log\sigma_i^2=\mathbf{W}_\sigma^T\times x_i^s, \tag{4}$$

where $\mu_i$ and $\sigma_i^2\in R^d$ represent the mean and variance parameters for student $s_i$, respectively. $x_i^s\in R^N$ is the one-hot encoded representation of student $s_i$. $\mathbf{W}_\mu$ and $\mathbf{W}_\sigma\in R^{N\times d}$ are different transferable matrices. $N$ denotes the number of students and $d$ indicates the dimensionality of the hidden vector (we will discuss the setting of $d$ in detail in Section IV-B).
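
The following sketch illustrates Eq. (4) under stated assumptions: the sizes $N = 4$ and $d = 3$ and the randomly initialized matrices are hypothetical stand-ins for learned parameters. It shows that multiplying $\mathbf{W}^T$ by a one-hot vector simply selects the student’s row, and that parameterizing $\log\sigma_i^2$ keeps the variance positive after exponentiation.

```python
import math
import random
from typing import List, Tuple

N, d = 4, 3  # hypothetical number of students and hidden dimensionality
rng = random.Random(0)

# Randomly initialized stand-ins for the learned N x d matrices W_mu and W_sigma.
W_mu = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]
W_sigma = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]


def one_hot(i: int, n: int) -> List[float]:
    return [1.0 if k == i else 0.0 for k in range(n)]


def transpose_matvec(W: List[List[float]], x: List[float]) -> List[float]:
    """Compute W^T x for an N x d matrix W and a length-N vector x."""
    return [sum(W[n][k] * x[n] for n in range(len(x))) for k in range(len(W[0]))]


def student_distribution(i: int) -> Tuple[List[float], List[float]]:
    """Return (mu_i, sigma_i^2) for student i following Eq. (4)."""
    x = one_hot(i, N)
    mu = transpose_matvec(W_mu, x)
    log_var = transpose_matvec(W_sigma, x)
    var = [math.exp(v) for v in log_var]  # exponentiation keeps the variance positive
    return mu, var


mu_2, var_2 = student_distribution(2)
```

Because the input is one-hot, `mu_2` coincides with row 2 of `W_mu`, so in practice the product is equivalent to a per-student table lookup.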

Unlike previous student ability modeling techniques, here we randomly sample student $s_i$'s ability representation $\theta_i$ from the constructed Gaussian distribution $q(z_i \mid x_i^s)$. This approach simulates deviations in diagnostic results caused by potential noise in the interactions between student $s_i$ and exercises $e_j$. The details are as follows:

$$\theta_i = \varrho(z_i), \quad z_i \sim q\left(z_i \mid x_i^s\right), \tag{5}$$

where $\theta_i \in R^d$ denotes the ability representation of student $s_i$. $z_i$ is a vector randomly sampled from the constructed Gaussian distribution of $s_i$. $\varrho$ is a Sigmoid activation function that ensures each dimension of $\theta_i$ is in $[0,1]$.
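The sampling step in Eq. (4) and Eq. (5) can be illustrated with a minimal sketch. It assumes the reparameterized form $z_i = \mu_i + \sigma_i \epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$, which is common in VAE-style models but is not stated explicitly here; the function names and toy values are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function mapping a real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_ability(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) per dimension and return theta = sigmoid(z).

    mu and log_var play the role of W_mu^T x and W_sigma^T x in Eq. (4);
    rng is a random.Random instance. The reparameterized draw is an assumption.
    """
    theta = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # sigma = sqrt(exp(log sigma^2))
        eps = rng.gauss(0.0, 1.0)    # standard normal noise
        z = m + sigma * eps          # sample from N(m, sigma^2)
        theta.append(sigmoid(z))     # Eq. (5): squash into (0, 1)
    return theta


rng = random.Random(0)
theta_i = sample_ability(mu=[0.3, -1.2, 2.0], log_var=[-2.0, 0.0, -4.0], rng=rng)
```

Each call yields a different $\theta_i$ for the same student, and dimensions with larger $\log \sigma^2$ vary more across draws.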

IV-B Cognitive Diagnosis

After modeling the student's ability status with uncertainty, we turn to predicting exercise performance with the cognitive diagnosis functions $f_{cd}$ introduced in Section III-A. Specifically, we first extract diagnostic factors from each exercise, i.e., exercise difficulty $h^{diff}$ and exercise discrimination $h^{disc}$, which are required in all cognitive diagnosis functions. The details are as follows:

$$h_j^{diff} = \varrho\left(\mathbf{W}_{diff}^T \times x_j^e\right), \quad h_j^{disc} = \varrho\left(\mathbf{W}_{disc}^T \times x_j^e\right), \tag{6}$$

where $x_j^e \in R^M$ denotes the one-hot representation of exercise $e_j$. $h_j^{diff} \in R^d$ and $h_j^{disc} \in R^1$ are the exercise difficulty and exercise discrimination of $e_j$, respectively. $\mathbf{W}_{diff} \in R^{M \times d}$ and $\mathbf{W}_{disc} \in R^{M \times 1}$ are two transferable matrices. $M$ indicates the number of exercises and $d$ denotes the dimensionality of the hidden vector.
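Because $x_j^e$ is one-hot, the product $\mathbf{W}^T \times x_j^e$ in Eq. (6) simply selects row $j$ of the matrix. The following sketch makes this explicit with toy sizes ($M = 3$, $d = 2$); the matrix values and helper names are illustrative assumptions, not parameters from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def project(W, x):
    """Compute W^T x for W with len(x) rows and d columns."""
    d = len(W[0])
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(d)]


# Toy parameters: M = 3 exercises, d = 2 hidden dimensions.
W_diff = [[0.5, -1.0], [2.0, 0.0], [-0.3, 0.7]]
W_disc = [[1.5], [-0.5], [0.0]]

j = 1                                                 # index of exercise e_j
x_e = one_hot(j, 3)
h_diff = [sigmoid(v) for v in project(W_diff, x_e)]   # vector in R^d
h_disc = [sigmoid(v) for v in project(W_disc, x_e)]   # vector in R^1
```

In practice the same result is usually obtained with an embedding lookup rather than an explicit matrix product.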

Here, the predicted probability $p(y_{ij}=1)$ that student $s_i$ answers exercise $e_j$ correctly can be derived as follows:

$$p\left(y_{ij}=1\right) = f_{cd}\left(\theta_i, h_j^{diff}, h_j^{disc}\right), \tag{7}$$

where $f_{cd}$ denotes the cognitive diagnostic function. Please note that our framework is flexible for various diagnostic functions. Here we further present three popular diagnostic functions, i.e., IRT, MIRT, and NCD, and specify detailed rules for them as follows:

  • $\mathbf{IRT}$: As shown in Eq. (1), IRT models student ability $\theta$, exercise difficulty $h^{diff}$, and exercise discrimination $h^{disc}$ as one-dimensional continuous scalars. Therefore, following the definition of IRT itself, we set the hidden vector dimension $d=1$ in Eq. (4) and Eq. (6).

  • $\mathbf{MIRT}$: For MIRT, we first let $h^{disc}=1$ in Eq. (6). Then, we uniformly map students' ability representation $\theta$ in Eq. (4) and exercise difficulty $h^{diff}$ in Eq. (6) to $K$ dimensions (i.e., $d=K$) to model MIRT from the perspective of knowledge concepts.

  • $\mathbf{NCD}$: For NCD, as shown in Eq. (2), we model students' ability $\theta$ in Eq. (4) and exercise difficulty $h^{diff}$ in Eq. (6) along the dimension of knowledge concepts, i.e., $d=K$.
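The dimension settings above can be summarized in a short sketch. Since Eq. (1) is not reproduced in this section, the IRT function below assumes the standard two-parameter logistic form $\varrho(h^{disc}(\theta - h^{diff}))$; this form, and the helper names, are assumptions for illustration rather than the exact definition used in the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hidden_dim(model, num_concepts):
    """Hidden dimension d used in Eq. (4) and Eq. (6) for each diagnostic function."""
    if model == "IRT":
        return 1                    # scalar ability and difficulty
    if model in ("MIRT", "NCD"):
        return num_concepts         # one dimension per knowledge concept (d = K)
    raise ValueError(f"unknown diagnostic function: {model}")


def irt_2pl(theta, h_diff, h_disc):
    """Assumed 2PL form of f_cd for IRT: sigmoid(h_disc * (theta - h_diff))."""
    return sigmoid(h_disc * (theta - h_diff))


p_correct = irt_2pl(theta=0.8, h_diff=0.4, h_disc=0.9)
```

Under this assumed form, a student whose ability equals the exercise difficulty is predicted to answer correctly with probability 0.5.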

IV-C Training Objective

To optimize the parameters for obtaining the students' ability status, we maximize the likelihood $p(r_{ij} \mid x_i^s)$, which indicates the true probability that student $s_i$ answers exercise $e_j$. Specifically, we follow the literature [29, 30, 31] and utilize the evidence lower bound as the training objective, which is tractable. Formally, we have

$$
\begin{aligned}
\log p_\phi\left(r_{ij} \mid x_i\right) &\geq \int \log p\left(r_{ij} \mid z_i\right) p\left(z_i \mid x_i\right) dz \\
&= -\underbrace{KL\left(q_\varphi(z_i \mid x_i) \,\|\, p_\phi(z_i)\right)}_{L_{KL}} + \underbrace{E_{q_\varphi}\left[\log p(r_{ij} \mid z_i)\right]}_{L_{pred}},
\end{aligned}
\tag{8}
$$

where $p_\phi(z_i)$ is the prior distribution for the ability status of students, $q_\varphi(z_i \mid x_i)$ is the posterior distribution we constructed for student $s_i$, and $\log p(r_{ij} \mid z_i)$ measures the likelihood that a student with ability status $\theta_i = \varrho(z_i)$ answers exercise $e_j$ correctly. We follow the variational autoencoder (VAE) [32, 33, 34] and leverage the sampling strategy to approximate $L_{pred}$ with one sample. Then, $L_{pred}$ can be specified as:

$$L_{pred} = \log p(r_{ij} \mid z_i) = r_{ij} \log y_{ij} + (1 - r_{ij}) \log(1 - y_{ij}). \tag{9}$$
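As a small numerical illustration of Eq. (9), the sketch below evaluates the one-sample log-likelihood for a binary response $r_{ij}$ and a predicted probability $y_{ij}$. The clipping constant `eps` is an assumption added for numerical safety and is not part of the equation; the negative of this quantity is the usual binary cross-entropy.

```python
import math


def log_likelihood(r, y, eps=1e-12):
    """Eq. (9): r*log(y) + (1 - r)*log(1 - y), with y clipped away from 0 and 1."""
    y = min(max(y, eps), 1.0 - eps)   # assumed safeguard against log(0)
    return r * math.log(y) + (1 - r) * math.log(1.0 - y)


# A confident, correct prediction scores close to 0 ...
good = log_likelihood(r=1, y=0.9)
# ... while the same prediction on an incorrect response is penalized heavily.
bad = log_likelihood(r=0, y=0.9)
```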

Assuming that the prior distribution $p_\phi(z_i)$ of student $x_i$ is the standard Gaussian distribution, $L_{KL}$ in Eq. (8) can be calculated by

$$KL\left(q_\varphi(z_i \mid x_i) \,\|\, \mathcal{N}(0,1)\right) = \sum_{k=1}^{K} \frac{1}{2}\left(\mu_{ik}^2 + \sigma_{ik}^2 - \ln \sigma_{ik}^2 - 1\right), \tag{10}$$

where $\mu_{ik}$ and $\sigma_{ik}^2$ are the mean and variance of student $s_i$ on knowledge concept $k$.
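The closed form in Eq. (10) is straightforward to evaluate. The sketch below is a direct transcription for a single student; the function name and the toy inputs are hypothetical.

```python
import math


def kl_to_standard_normal(mu, var):
    """Eq. (10): sum over k of 0.5 * (mu_k^2 + var_k - ln(var_k) - 1)."""
    return sum(0.5 * (m * m + v - math.log(v) - 1.0) for m, v in zip(mu, var))


# The divergence vanishes when the posterior equals the standard Gaussian prior.
kl_zero = kl_to_standard_normal(mu=[0.0, 0.0], var=[1.0, 1.0])
# Shifting the mean or shrinking the variance both increase it.
kl_pos = kl_to_standard_normal(mu=[1.0, 0.0], var=[1.0, 0.25])
```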

Furthermore, based on the cognitive diagnosis scenario, we can define the reliability of a student’s diagnostic feedback on a specific knowledge concept as the probability of correctly predicting a student’s performance on the corresponding knowledge. Formally, the diagnostic feedback reliability can be defined as follows:

Definition 1

Given a student $s_i$, a cognitive diagnosis model $f_{cd}(\cdot)$, and $s_i$'s diagnostic feedback $\theta_i$, the reliability of the diagnostic feedback (i.e., $\theta_{il}$) on a specific knowledge concept $c_l$ based on $f_{cd}(\cdot)$ is the probability of correctly predicting $s_i$'s performance on $c_l$, i.e., $\prod_j p(y_{ij} = r_{ij} \mid \theta_{il})$, where the product runs over all response logs of student $s_i$ on exercises $e_j$ that cover the concept $c_l$ (i.e., $Q_{jl} = 1$).

Since we aim to utilize the standard deviation $\sigma_i$ to assess the reliability of each student $s_i$'s diagnostic feedback on different knowledge concepts, we design a novel calibration loss as a regularization term for the training objective. Specifically, given $\sigma_{il}$ and $\sigma_{uv}$ as the standard deviations of student $s_i$'s and $s_u$'s ability representations on the knowledge concepts $c_l$ and $c_v$, respectively, if $\sigma_{il} > \sigma_{uv}$, we can assume that the reliability of $s_i$'s diagnostic feedback $\theta_{il}$ should be smaller than that of $\theta_{uv}$. Formally, we have the following hypothesis for ranking confidence levels.

Assumption IV.1

Given $\sigma_{il}$ of student $s_i$ and $\sigma_{uv}$ of student $s_u$, we have the relationship:

$$
\begin{aligned}
&\sigma_{il} \geq \sigma_{uv} \Leftrightarrow \\
&\prod_{\substack{\langle s_i, e_j, r_{ij} \rangle \in R_i, \\ Q_{jl}=1}} p\left(y_{ij} = r_{ij} \mid \theta_{il}\right) \leq \prod_{\substack{\langle s_u, e_j, r_{uj} \rangle \in R_u, \\ Q_{jv}=1}} p\left(y_{uj} = r_{uj} \mid \theta_{uv}\right).
\end{aligned}
\tag{11}
$$

Considering that the probability $p(y_{ij} = r_{ij} \mid \theta_{il})$ is generally impractical to obtain directly, we follow the idea from [35] and [28] and hypothesize that the probability of correctly predicting student $s_i$'s performance on a specific knowledge concept is proportional to the frequency of correct predictions for $s_i$ on the triples $\langle s_i, e_j, r_{ij} \rangle \in R_i$ where $Q_{jl} = 1$ during the training process. Along this line, we design a calibration loss in a pairwise manner as follows:

$$L_{RL} = \max\left(0, -g(o_{il}, o_{uv})\left(\sigma_{il}^2 - \sigma_{uv}^2\right) + |o_{il} - o_{uv}|\right), \tag{12}$$

where $o_{il}$ denotes the proportion of correct predictions for $s_i$ on concept $c_l$ over the total number of predictions on the interactions $\langle s_i, e_j, r_{ij} \rangle \in R_i$ with $Q_{jl} = 1$ during the training process. The function $g(\cdot, \cdot)$ is defined as:

$$
g(o_{il}, o_{uv}) =
\begin{cases}
1, & \text{if } o_{il} > o_{uv} \\
0, & \text{if } o_{il} = o_{uv} \\
-1, & \text{otherwise.}
\end{cases}
\tag{13}
$$
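For concreteness, the sketch below is a literal, term-by-term transcription of Eq. (12) and Eq. (13) for a single pair. The function and argument names are hypothetical, and the inputs are toy values rather than statistics from any dataset.

```python
def g(o_a, o_b):
    """Eq. (13): sign of the difference between two correct-prediction proportions."""
    if o_a > o_b:
        return 1
    if o_a == o_b:
        return 0
    return -1


def calibration_loss(o_il, o_uv, var_il, var_uv):
    """Eq. (12): hinge on -g(o_il, o_uv) * (var_il - var_uv) + |o_il - o_uv|."""
    return max(0.0, -g(o_il, o_uv) * (var_il - var_uv) + abs(o_il - o_uv))


# Two toy pairs with identical proportions o but swapped variances.
loss_a = calibration_loss(o_il=0.9, o_uv=0.6, var_il=0.2, var_uv=0.5)
loss_b = calibration_loss(o_il=0.9, o_uv=0.6, var_il=0.8, var_uv=0.2)
# Equal proportions give g = 0 and a zero margin, so the loss is zero.
loss_c = calibration_loss(o_il=0.7, o_uv=0.7, var_il=0.3, var_uv=0.1)
```

The term $|o_{il} - o_{uv}|$ acts as a margin: the further apart the two proportions are, the larger the variance gap needed before the hinge becomes inactive.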

Moreover, to reduce the training time cost, we sample the pair $(\sigma_{il}, \sigma_{uv})$ within the current mini-batch when optimizing Eq. (8). That is, given a mini-batch of input interactions $\{\langle s_1^b, e_1^b, r_1^b \rangle, \langle s_2^b, e_2^b, r_2^b \rangle, \ldots, \langle s_B^b, e_B^b, r_B^b \rangle\}$, we obtain the pair of standard deviations from a sampled pair of the $i$-th and $j$-th training instances (here $i$ and $j$ only index instances within the mini-batch), where $\langle s_i^b, e_i^b, r_i^b \rangle$ denotes the $i$-th training instance in the current mini-batch and $B$ denotes the mini-batch size.
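A minimal sketch of the in-batch pair sampling follows. It assumes that pairs are drawn uniformly at random and that the two instances in a pair are distinct; neither detail is specified in the text, and the helper name is hypothetical.

```python
import random


def sample_pairs(batch_size, num_pairs, rng):
    """Draw `num_pairs` index pairs (i, j) with i != j from a mini-batch.

    Uniform sampling without replacement inside each pair is an assumption.
    """
    pairs = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(batch_size), 2)   # two distinct instance indices
        pairs.append((i, j))
    return pairs


B = 8                                             # mini-batch size
pairs = sample_pairs(batch_size=B, num_pairs=B, rng=random.Random(0))
```

Each pair of instance indices is then mapped to the corresponding variances, which are compared by the calibration loss, so the cost per step grows linearly with $B$ rather than quadratically.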

For IRT, since it models the students' ability representation as a one-dimensional continuous scalar from a macro perspective, we revise Eq. (12) as follows:

$$L_{RL}^{IRT} = \max\left(0, -g(o_i, o_u)\left(\sigma_i^2 - \sigma_u^2\right) + |o_i - o_u|\right), \tag{14}$$

where $o_i$ represents the proportion of correct predictions for student $s_i$ over the total number of predictions on the interactions $\langle s_i, e_j, r_{ij} \rangle \in R_i$, and $\sigma_i$ is the standard deviation of student $s_i$'s ability representation.

Finally, the total loss function is defined as:

$$L = L_{pred} + \gamma * L_{KL} + \beta * L_{RL}, \tag{15}$$

where $\gamma$ and $\beta$ are introduced to balance the different terms. In particular, we follow the approach of $\beta$-VAE [36] to weight $L_{KL}$ for enhancing performance.

IV-D Prior Consensus and Uncertainty Regularization

Here, we propose two strategies to further improve our model: adjusting the prior distribution of the student’s state and regularizing the range of uncertainty.

Algorithm 1 The training process of ReliCD.
Input: Students' response logs $R$ and the $Q$-matrix.
Output: Each student's ability status $\theta_i$ and variance $\sigma_i^2$.
1: Pretrain ReliCD with Eq. (16) (set $\beta = 0$);
2: $\mu_{mean} = \frac{\sum_{m=1}^{N} \mu_m}{N}$;
3: while not converged do
4:     Sample a mini-batch $\langle s_i, e_j, r_{ij} \rangle$;
5:     Obtain $\mu_i$, $\sigma_i$ from Eq. (4);
6:     $\sigma_i^2 \leftarrow g_x(\sigma_i^2 - \alpha) + \alpha$;
7:     Sample $z_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ and obtain $\theta_i$;
8:     Generate $y_{ij}$ based on Eq. (7);
9:     Sample $B$ pairs $(\sigma_{il}, \sigma_{uv})$ randomly in this mini-batch;
10:     Compute gradients based on the loss function in Eq. (15);
11:     Update all parameters;
12: end while
13: Return $\theta_i$ and $\sigma_i^2$ for each student $s_i$.
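The sampling step inside the loop of Algorithm 1 draws $z_i$ from $\mathcal{N}(\mu_i, \sigma_i^2)$. The following is a minimal sketch of that step, assuming the usual reparameterization $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ that is standard for VAE-style models [32]. The sigmoid used to map $z_i$ to $\theta_i$ is an illustrative assumption, and the function name `sample_ability` and the toy inputs are hypothetical.

```python
import math
import random


def sample_ability(mu, var, rng):
    """Draw z ~ N(mu, var) per knowledge concept via z = mu + sqrt(var) * eps."""
    z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
    # Assumption: theta is obtained by squashing z into (0, 1) with a sigmoid.
    theta = [1.0 / (1.0 + math.exp(-x)) for x in z]
    return z, theta


rng = random.Random(0)
# Toy means and variances for three knowledge concepts (hypothetical values).
z, theta = sample_ability(mu=[0.2, -0.5, 1.0], var=[0.3, 0.0, 1.5], rng=rng)
```

A concept with zero variance returns its mean unchanged, while larger variances spread the sampled ability status more widely, which is the behaviour the framework reads as lower confidence.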

IV-D1 Prior Consensus

Because knowledge concepts may differ, for example in difficulty and discrimination, we assume that the prior distribution of a student's status on each knowledge concept is different and individual. To model the individual prior while preventing information leakage, we pre-train our model on the training set only, with $\beta = 0$. Then, we average the ability state vectors $\mu_m$ of all students to obtain the mean parameter $\mu_{mean}$ of the prior distribution, i.e.,

$$\mu_{mean} = \frac{\sum_{m=1}^{N} \mu_m}{N}, \tag{16}$$

which represents the prior common cognition of all knowledge concepts. This gives us the overall level of students in advance and avoids assuming that the prior mastery of every knowledge concept is 0. Then, we train our entire model with the prior distribution $\mathcal{N}(\mu_{mean}, 1)$, where the variance parameter is set to 1. The new KL loss is therefore defined as:

$$\begin{aligned} &KL\left(\mathcal{N}(\mu_i, \sigma_i^2) \,\|\, \mathcal{N}(\mu_{mean}, 1)\right) \\ =\; &\frac{1}{2} \sum_{l=1}^{K} \left( (\mu_{il} - \mu_{mean,l})^2 + \sigma_{il}^2 - \log(\sigma_{il}^2) - 1 \right). \end{aligned} \tag{17}$$
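Eqs. (16) and (17) can be checked numerically with a few lines of plain Python. This is only a sketch: the helper names `prior_mean` and `kl_to_prior` and the toy means are hypothetical, and a real implementation would operate on tensors inside the training loop.

```python
import math


def prior_mean(mu_all):
    """Average the pre-trained means over students, per knowledge concept (Eq. 16)."""
    n = len(mu_all)
    k = len(mu_all[0])
    return [sum(row[l] for row in mu_all) / n for l in range(k)]


def kl_to_prior(mu_i, var_i, mu_mean):
    """KL(N(mu_i, diag(var_i)) || N(mu_mean, I)), summed over concepts (Eq. 17)."""
    return 0.5 * sum(
        (m - p) ** 2 + v - math.log(v) - 1.0
        for m, v, p in zip(mu_i, var_i, mu_mean)
    )


# Hypothetical pre-trained means: 2 students, 2 knowledge concepts.
mu_all = [[0.0, 1.0], [2.0, 3.0]]
mu_mean = prior_mean(mu_all)
kl_same = kl_to_prior(mu_mean, [1.0, 1.0], mu_mean)      # posterior equals the prior
kl_shift = kl_to_prior([2.0, 2.0], [1.0, 1.0], mu_mean)  # mean shifted on one concept
```

The divergence vanishes when a student's posterior coincides with the prior and grows quadratically as the mean moves away from $\mu_{mean}$.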

IV-D2 Uncertainty Regularization

At the same time, inspired by [37, 38], we hold that the modeled variance should stay within a reasonable range, neither fluctuating too strongly nor being overly smooth. Following [38], we therefore apply dropout to the variance parameter of each student $i$, which discourages large variance:

$$\hat{\sigma}^2_{il} = g_{x_i l}\left(\sigma^2_{il} - \alpha\right) + \alpha, \tag{18}$$

where $g_{x_i l}$ is an independent random variable drawn from a standard Bernoulli distribution and $\alpha$ is an empirically chosen constant. The distribution we construct for student $s_i$ can then be rewritten as $q(z_i | x_i) = \mathcal{N}(\mu_i, \hat{\sigma}_i^2)$.
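Read element-wise, Eq. (18) keeps the learned variance when the Bernoulli draw is 1 and falls back to $\alpha$ when it is 0. A minimal sketch follows; the keep probability `keep_prob` is an assumption, since the text does not state the Bernoulli parameter, and the function name and toy values are hypothetical.

```python
import random


def variance_dropout(var, alpha, keep_prob, rng):
    """Apply Eq. (18) per concept: keep var where g = 1, use alpha where g = 0."""
    out = []
    for v in var:
        g = 1 if rng.random() < keep_prob else 0  # g ~ Bernoulli(keep_prob)
        out.append(g * (v - alpha) + alpha)
    return out


rng = random.Random(0)
var = [0.2, 1.5, 3.0, 0.7]  # hypothetical per-concept variances
dropped = variance_dropout(var, alpha=0.5, keep_prob=0.5, rng=rng)
```

Every output entry is either the original variance or $\alpha$, so the regularizer never produces values outside the range spanned by those two.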

After designing the technical details of our framework with two strategies for enhancing performance, we can train our framework with the training data following Algorithm 1.

V Experiment

In this section, we will provide a detailed description of the benchmark datasets, baselines, and experimental setup. The designed experiments aim to answer the following questions:

  • RQ1: How does our framework perform compared to state-of-the-art cognitive diagnosis models?

  • RQ2: Are the specially designed components of our framework effective?

  • RQ3: How do the hyperparameters influence the effectiveness of our framework?

  • RQ4: Can our framework improve the reliability of cognitive diagnosis models?

V-A Dataset Description

We validated the performance of our framework on four real-world datasets: three public datasets, namely Assistments2009 (Assist2009; https://sites.google.com/site/assistmentsdata/home/assistment2009-2010-data/skill-builder-data-2009-2010), Junyi (https://www.kaggle.com/datasets/junyiacademy/learning-activity-public-dataset-by-junyi-academy?resource=download), and ENEM (https://github.com/godtn0/DP-MTL), and one private dataset, e-Math. Assistments2009 (ASSISTments 2009-2010 “skill builder”) is a public dataset collected by the ASSISTments online tutoring system in the 2009-2010 academic year. Junyi is a public dataset collected by the Khan Academy in 2012. e-Math is a private dataset collected by a well-known electronic educational product, mainly containing math exercises and response records of primary and secondary school students. ENEM is a dataset from the Brazilian college entrance examination.

Table I shows the basic information of the four datasets, including the number of students, the number of exercises, the number of knowledge concepts, the total number of answer logs, the average number of answer logs per student, and the average number of knowledge concepts contained in each exercise. Moreover, we uniformly filtered out students with fewer than 15 response logs in all datasets to guarantee that there is enough data for modeling each student. After filtering, we obtained 2,493 students, 17,671 exercises, and 123 knowledge concepts in Assist2009; 1,000 students, 712 exercises, and 39 knowledge concepts in Junyi; 517 students, 1,582 exercises, and 61 knowledge concepts in e-Math; and 10,000 students, 185 exercises, and 4 knowledge concepts in ENEM.

We divided each dataset into a training set, a validation set, and a test set by splitting each student's response records at a ratio of 70%:10%:20%. We trained our framework on the training set, tuned its parameters on the validation set, and verified its performance on the test set.
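The filtering and per-student split described above can be sketched as follows. The function name `filter_and_split`, the `(student, exercise, score)` record layout, and the random shuffling before the split are assumptions made for illustration; the text only states the threshold of 15 logs and the 70%:10%:20% ratio.

```python
import random
from collections import defaultdict


def filter_and_split(logs, min_logs=15, percents=(70, 10, 20), seed=0):
    """Drop students with fewer than min_logs records, then split each student's logs."""
    by_student = defaultdict(list)
    for student, exercise, score in logs:
        by_student[student].append((student, exercise, score))

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for student in sorted(by_student):
        records = by_student[student]
        if len(records) < min_logs:
            continue  # not enough data to model this student
        rng.shuffle(records)  # assumption: the split is random rather than chronological
        n_train = len(records) * percents[0] // 100
        n_valid = len(records) * percents[1] // 100
        train += records[:n_train]
        valid += records[n_train:n_train + n_valid]
        test += records[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Hypothetical toy logs: student "a" has 20 records, student "b" only 5.
logs = [("a", e, e % 2) for e in range(20)] + [("b", e, 1) for e in range(5)]
train, valid, test = filter_and_split(logs)
```

Splitting within each student, rather than across students, keeps every retained student present in all three subsets, which a diagnosis model needs in order to evaluate that student's learned state.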

TABLE I: The statistics of datasets.

| Dataset | Assist2009 | Junyi | e-Math | ENEM |
|---|---|---|---|---|
| # Students | 4.1k | 1.0k | 1.9k | 10k |
| # Exercises | 17.7k | 0.7k | 1.6k | 0.1k |
| # Knowledge concepts | 123 | 39 | 61 | 4 |
| # Response logs | 324k | 203k | 62k | 18500k |
| # Avg logs per student | 77.96 | 203.94 | 120.71 | 185 |
| # Avg concepts per exercise | 1.19 | 1.00 | 1.21 | 1.00 |

TABLE II: Quantitative results on students’ score prediction.

| Dataset | Metric | IRT | Reli-IRT | MIRT | Reli-MIRT | NCD | Reli-NCD |
|---|---|---|---|---|---|---|---|
| Assist2009 | ACC (% ↑) | 68.17 ± 0.06 | 69.56 ± 0.29∗ | 70.62 ± 0.43 | 71.12 ± 0.23 | 72.20 ± 1.11 | 72.55 ± 0.20 |
| Assist2009 | RMSE (↓) | 0.4554 ± 0.0054 | 0.4429 ± 0.0012 | 0.4536 ± 0.0018 | 0.4478 ± 0.0007 | 0.4347 ± 0.0028 | 0.4311 ± 0.0010 |
| Assist2009 | AUC (% ↑) | 69.15 ± 1.35 | 72.36 ± 0.12∗ | 72.53 ± 0.73 | 72.14 ± 0.09 | 75.10 ± 0.14 | 75.10 ± 0.32 |
| Assist2009 | ECE (% ↓) | 4.75 ± 0.03 | 3.13 ± 0.15∗ | 9.97 ± 1.16 | 7.81 ± 0.23∗ | 6.97 ± 0.50 | 1.69 ± 0.19∗ |
| Assist2009 | MCE (% ↓) | 10.91 ± 0.39 | 10.58 ± 0.26 | 13.11 ± 1.28 | 12.21 ± 0.52 | 9.00 ± 0.19 | 3.85 ± 0.75∗ |
| e-Math | ACC (% ↑) | 67.57 ± 0.41 | 70.00 ± 0.05∗ | 67.49 ± 0.42 | 69.20 ± 0.42∗ | 69.11 ± 0.32 | 69.13 ± 0.34 |
| e-Math | RMSE (↓) | 0.4564 ± 0.0014 | 0.4390 ± 0.0008 | 0.4595 ± 0.0022 | 0.4506 ± 0.0007 | 0.4427 ± 0.0030 | 0.4399 ± 0.0013 |
| e-Math | AUC (% ↑) | 69.77 ± 0.36 | 74.20 ± 0.29∗ | 71.23 ± 0.24 | 72.52 ± 0.23∗ | 73.79 ± 0.21 | 74.12 ± 0.20 |
| e-Math | ECE (% ↓) | 3.56 ± 0.37 | 3.14 ± 0.24 | 8.88 ± 0.42 | 5.56 ± 0.39∗ | 4.53 ± 0.01 | 1.17 ± 0.07∗ |
| e-Math | MCE (% ↓) | 10.37 ± 0.78 | 9.85 ± 1.08 | 13.60 ± 0.63 | 13.49 ± 0.34 | 5.90 ± 1.01 | 2.06 ± 0.01∗ |
| Junyi | ACC (% ↑) | 71.56 ± 0.27 | 75.31 ± 0.19∗ | 75.73 ± 0.17 | 75.79 ± 0.15 | 75.60 ± 0.26 | 76.05 ± 0.15 |
| Junyi | RMSE (↓) | 0.4342 ± 0.0021 | 0.4081 ± 0.0009∗ | 0.4291 ± 0.0004 | 0.4279 ± 0.0005 | 0.4068 ± 0.0006 | 0.4042 ± 0.0005 |
| Junyi | AUC (% ↑) | 74.09 ± 0.39 | 78.84 ± 0.15∗ | 77.18 ± 0.11 | 77.34 ± 0.17 | 79.87 ± 0.13 | 80.11 ± 0.14∗ |
| Junyi | ECE (% ↓) | 3.89 ± 0.19 | 2.36 ± 0.12∗ | 10.68 ± 0.17 | 10.13 ± 0.13 | 1.97 ± 0.72 | 1.68 ± 0.88 |
| Junyi | MCE (% ↓) | 8.45 ± 0.36 | 4.85 ± 0.26∗ | 20.90 ± 14.54 | 14.25 ± 0.17∗ | 3.07 ± 0.94 | 2.59 ± 1.04 |
| ENEM | ACC (% ↑) | 71.70 ± 0.26 | 73.09 ± 0.44∗ | 70.91 ± 0.02 | 72.02 ± 0.03∗ | 73.45 ± 0.16 | 73.46 ± 0.12 |
| ENEM | RMSE (↓) | 0.4448 ± 0.0009 | 0.4319 ± 0.020 | 0.4514 ± 0.0002 | 0.4443 ± 0.0001∗ | 0.4288 ± 0.0008 | 0.4286 ± 0.0005 |
| ENEM | AUC (% ↑) | 69.31 ± 0.14 | 72.18 ± 0.06∗ | 69.86 ± 0.08 | 69.92 ± 0.02 | 72.93 ± 0.10 | 72.96 ± 0.06 |
| ENEM | ECE (% ↓) | 5.03 ± 0.15 | 2.10 ± 0.09∗ | 7.78 ± 0.06 | 6.63 ± 0.09∗ | 1.64 ± 0.16 | 0.76 ± 0.08∗ |
| ENEM | MCE (% ↓) | 10.96 ± 0.46 | 3.71 ± 0.14∗ | 12.79 ± 0.11 | 7.56 ± 0.12∗ | 3.82 ± 0.19 | 1.64 ± 0.16∗ |

V-B Experimental Setup

V-B1 Experimental settings

In the experiment, we used Xavier initialization to initialize all parameters in our framework. We used the Adam optimizer to train our reliable CDMs with a batch size of 32 and a learning rate of 0.002. We used five-fold cross-validation to evaluate the performance of our framework more accurately on all datasets. As mentioned in Section IV-D1, we set $\beta = 0$ during the model pre-training phase, while during the training, validation, and testing phases we set $\gamma = 10^{-4}$ and $\beta = 0.1$. Our framework (https://github.com/BIMK/Intelligent-Education/tree/main/ReliCD) and the baselines were implemented with PyTorch 1.7.1 and Python 3.6, and all experiments were conducted on an NVIDIA GeForce RTX 3090 (24 GB).

V-B2 Evaluation metrics

We evaluate our work from two aspects. The first is the predictive performance of our framework, measured by ACC (Accuracy), RMSE (Root Mean Square Error), and AUC (Area Under the ROC Curve), the same metrics used in previous work (e.g., NCD). The second is the quality of confidence estimation on the student’s ability status, which cannot be evaluated directly. We therefore measure the confidence of our framework in predicting exercise performance by the Expected Calibration Error (ECE) [39] and the Maximum Calibration Error (MCE) [39], which are widely used in the confidence estimation literature [27, 28]. The smaller the ECE or MCE, the better the quality of confidence estimation. Specifically, we first divide the prediction probability interval into a fixed number of bins. Within each bin, we compute the absolute difference between the mean predicted probability and the accuracy of the samples in that bin. ECE is the sum of these differences weighted by the fraction of samples in each bin, and MCE is their maximum. The formulas are as follows:

$$\begin{aligned} \mathrm{ECE} &= \sum_{n=1}^{M} \frac{|B_n|}{a} \left| \mathrm{acc}(B_n) - \mathrm{avgProb}(B_n) \right|, \\ \mathrm{MCE} &= \max_{n \in \{1, 2, \ldots, M\}} \left| \mathrm{acc}(B_n) - \mathrm{avgProb}(B_n) \right|, \end{aligned} \tag{19}$$

where $M$ is the number of interval bins, $B_n$ denotes the set of samples with prediction probability in $[\frac{n-1}{M}, \frac{n}{M}]$, $a$ is the total number of samples, $\mathrm{acc}(B_n)$ is the accuracy of the samples in $B_n$, and $\mathrm{avgProb}(B_n)$ is the average prediction probability of our framework for the samples in $B_n$.
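A small reference computation of Eq. (19) is sketched below. It assumes that $\mathrm{acc}(B_n)$ is the observed fraction of correct responses in the bin, which is the usual reading for binary score prediction, and that a probability of exactly 1 belongs to the last bin. The function name `calibration_errors` and the toy predictions are hypothetical.

```python
def calibration_errors(probs, labels, num_bins=10):
    """Return (ECE, MCE) over equal-width probability bins, following Eq. (19)."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * num_bins), num_bins - 1)  # p = 1.0 falls into the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_prob = sum(p for p, _ in members) / len(members)
        # Assumption: acc(B_n) is the fraction of correct responses (label 1) in the bin.
        acc = sum(y for _, y in members) / len(members)
        gap = abs(acc - avg_prob)
        ece += len(members) / total * gap
        mce = max(mce, gap)
    return ece, mce


# Toy predictions: confident and mostly right, so both errors are small.
ece, mce = calibration_errors([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
```

ECE averages the per-bin gaps by bin size, so a single badly calibrated but sparsely populated bin moves MCE far more than ECE.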

Figure 5: Results of Reli-NCD and its variants.

V-C Performance Comparison (RQ1)

To verify the effectiveness of our proposed framework, we applied it to different cognitive diagnostic functions, including IRT, MIRT, and NCD, and obtained three reliable cognitive diagnosis methods: Reli-IRT, Reli-MIRT, and Reli-NCD. Our goal is to substantially improve the confidence metrics (i.e., ECE and MCE) while slightly improving the traditional metrics (i.e., ACC, RMSE, and AUC). Specifically, each of these reliable cognitive diagnosis methods samples the student’s ability state from a constructed distribution to model the student’s uncertainty. As shown in Table II, we compared our ReliCDs with the corresponding baselines on four real-world datasets and marked the best results in bold. Moreover, we conducted the standard Student's t-test between each ReliCD and its baseline on all metrics; significant improvements ($p$-value $< 0.01$) are marked with an asterisk ($\ast$) in Table II.

We make the following observations. First, our ReliCDs achieve significantly lower ECE and MCE than the corresponding baselines on almost all datasets. This shows that capturing the uncertainty of students calibrates the confidence of the prediction results and that our method substantially improves the reliability of the diagnostic results. Second, our ReliCDs significantly improve ACC, AUC, and RMSE on some datasets, which indicates that estimating the uncertainty of students’ ability status on different knowledge concepts can enhance student performance prediction. Third, Reli-NCD achieved the best overall performance, with the lowest ECE and MCE on all four datasets. Meanwhile, the basic NCD did not achieve the best ECE among the baselines on Assist2009 and e-Math, which shows that our solution brings a clear reliability improvement even to strong cognitive diagnostic functions like NCD.
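The significance marks rely on a Student's t-test over repeated runs. The sketch below computes the two-sample t statistic with pooled variance using only the standard library; whether the authors used the paired or the unpaired form is not stated, so the unpaired form here is an assumption, and the per-fold values are hypothetical numbers, not results from the paper.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance; returns (t, df)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # unbiased sample variance
    var_b = statistics.variance(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return diff / math.sqrt(pooled * (1 / n_a + 1 / n_b)), df


# Hypothetical per-fold ECE values (%) for a baseline and its Reli- counterpart.
baseline_ece = [7.1, 6.5, 7.4, 6.6, 7.2]
reli_ece = [1.8, 1.5, 1.9, 1.6, 1.7]
t_stat, df = student_t(baseline_ece, reli_ece)
```

With five folds per method there are 8 degrees of freedom, and the two-sided critical value at the 0.01 level is about 3.355, so a larger absolute statistic would earn the asterisk under this reading.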

V-D Ablation Study (RQ2)

To verify the effectiveness of each specially designed component of our framework, we constructed two variants of our ReliCD by removing the corresponding components. Without loss of generality, we used Reli-NCD as the baseline, which is an implementation of our framework with the specific diagnostic function NCD. The variants are described as follows:

  • w/o pre-training: A variant of Reli-NCD that removes the pre-training process, so as to explore its impact on the experimental results.

  • w/o calibration loss: A variant of Reli-NCD that removes the calibration loss, so as to explore its impact on the experimental results.

The results are summarized in Figure 5. Reli-NCD achieves the best performance on all datasets, and both the pre-training strategy and the calibration loss significantly improve ECE and MCE on almost all datasets. Specifically, the pre-training strategy reduces ECE by about 26.4%, 19.2%, and 37.9% on Assist2009, Junyi, and e-Math, respectively. Correspondingly, the calibration loss reduces ECE by about 10.2%, 7.2%, and 24.8% on the same datasets. This clearly demonstrates the effectiveness of these components in our framework, which answers RQ2.
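The text does not spell out how these percentages are computed. One natural reading, stated here as an assumption, is the relative reduction of the full model with respect to the ablated variant:

$$\text{reduction} = \frac{\mathrm{ECE}_{\text{variant}} - \mathrm{ECE}_{\text{Reli-NCD}}}{\mathrm{ECE}_{\text{variant}}} \times 100\%.$$

As an illustration with numbers from Table II rather than Figure 5, applying the same formula to NCD and Reli-NCD on Assist2009 gives $(6.97 - 1.69) / 6.97 \approx 75.8\%$.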

Figure 6: Impact of different sizes of $\gamma$ on the performance. Panels: (a) e-Math, (b) Assist2009, (c) Junyi, (d) ENEM.

Figure 7: Impact of different sizes of $\beta$ on the performance. Panels: (a) e-Math, (b) Assist2009, (c) Junyi, (d) ENEM.

V-E Parameter Sensitivity Analysis (RQ3)

To evaluate the sensitivity of the hyperparameters $\gamma$ and $\beta$ in the loss function and answer RQ3, we conducted multiple experiments on e-Math, Assist2009, Junyi, and ENEM. We varied $\gamma$ and $\beta$ individually from $10^{-6}$ to $1$ while keeping the other parameter fixed.
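A one-at-a-time sweep of this kind can be sketched as below. The grid of powers of ten is an assumption, since only the end points $10^{-6}$ and $1$ are given, and the helper names `log_grid` and `one_at_a_time` are hypothetical; the default values are the ones reported in the experimental settings.

```python
def log_grid(low_exp=-6, high_exp=0):
    """Powers of ten from 10**low_exp to 10**high_exp (assumed grid spacing)."""
    return [float(f"1e{e}") for e in range(low_exp, high_exp + 1)]


def one_at_a_time(defaults, grid):
    """Vary each hyperparameter over the grid while the others keep their defaults."""
    configs = []
    for name in defaults:
        for value in grid:
            cfg = dict(defaults)
            cfg[name] = value
            configs.append(cfg)
    return configs


defaults = {"gamma": 1e-4, "beta": 0.1}  # settings used for the main experiments
configs = one_at_a_time(defaults, log_grid())
```

Each configuration would then be trained and scored on AUC and ECE, which gives one curve per hyperparameter and per dataset.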

As depicted in Figure 6, the size of $\gamma$ has a significant effect on both AUC and ECE. The model performs best when $\gamma$ lies in the range from 0.0001 to 1. This observation indicates that constraining the student distribution within a reasonable range benefits model performance. As for $\beta$, shown in Figure 7, varying its size did not greatly affect the AUC values on e-Math and Assist2009, while the ECE values varied significantly on all three datasets under different sizes. The observation regarding $\beta$ suggests that the partial order relationship we established has a certain calibration effect on the final prediction of the model.

Figure 8: The distribution of the student’s ability status $\theta$ under different numbers of interaction data.

Figure 9: (a) The distribution of all students’ ability status diagnosed by Reli-NCD on the Assist2009 dataset; the left part covers knowledge concepts the students have not interacted with and the right part covers those they have interacted with. (b) The density plot of all students’ ability status on the knowledge concepts that they have interacted with. The red curve is the ability status predicted by NCD and the blue curve is the one predicted by our Reli-NCD.

V-F Case Study (RQ4)

Here we first show an example of the ability status predicted by our framework. Specifically, we trained our Reli-NCD on the Assist2009 dataset. Figure 8 shows the predicted distribution of student #4164’s ability status $\theta$ on knowledge concept #15 when training with different numbers of exercises on this concept. The fluctuation of the student's ability status decreases as more interaction data on this concept becomes available, while its mean value also tends to stabilize. Therefore, our model can effectively identify the reliability of the diagnostic feedback, which helps educators assess the usability of that feedback.

Moreover, similar to the diagnostic feedback analysis in the preliminaries, we trained our Reli-NCD on Assist2009 and obtained the distributions of students’ ability status on the knowledge concepts they have and have not interacted with. As shown in Figure 9, Reli-NCD provides more discriminative diagnostic feedback than NCD, as the span of its predicted ability status distribution is significantly wider. Meanwhile, Reli-NCD yields a concentrated distribution of students’ ability status on the knowledge concepts that they have not interacted with, which demonstrates the reliability of our diagnostic feedback.

VI Conclusion

In this paper, we introduced a reliable cognitive diagnosis framework with confidence awareness, namely ReliCD, which is the first to quantify the confidence of the diagnosis feedback and is flexible for different cognitive diagnostic functions. To be specific, we first proposed a Bayesian method to explicitly estimate the state uncertainty of different knowledge concepts for students, which enables the confidence quantification of diagnostic feedback. In particular, to account for potential differences between knowledge concepts, we proposed to model individual prior distributions for the latent variables of different ability concepts with a pre-trained model. Then, we introduced a logical hypothesis for ranking confidence levels. Moreover, we designed a novel calibration loss to optimize the confidence parameters by modeling the process of student performance prediction. Finally, we conducted extensive experiments on four real-world datasets, and the experimental results clearly demonstrated the effectiveness of our ReliCD.

VII Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 62107001, in part by the National Key Research and Development Project (NO. 2018AAA0100105), in part by the Anhui Provincial Natural Science Foundation (NO. 2108085QF272 and No. 2208085QF194), in part by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2021-004.

References

  • [1] J. Leighton and M. Gierl, Cognitive diagnostic assessment for education: Theory and applications.   Cambridge University Press, 2007.
  • [2] F. Wang, Q. Liu, E. Chen, Z. Huang, Y. Chen, Y. Yin, Z. Huang, and S. Wang, “Neural cognitive diagnosis for intelligent education systems,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 6153–6161.
  • [3] W. Gao, Q. Liu, Z. Huang, Y. Yin, H. Bi, M.-C. Wang, J. Ma, S. Wang, and Y. Su, “Rcd: Relation map driven cognitive diagnosis for intelligent education systems,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 501–510.
  • [4] C. Wang, H. Zhu, C. Zhu, X. Zhang, E. Chen, and H. Xiong, “Personalized employee training course recommendation with career development awareness,” in Proceedings of the Web Conference 2020, 2020, pp. 1648–1659.
  • [5] M. Khajah, R. Wing, R. V. Lindsey, and M. Mozer, “Integrating latent-factor and knowledge-tracing models to predict individual differences in learning.” in EDM, 2014, pp. 99–106.
  • [6] H. Ma, Y. Zeng, S. Yang, C. Qin, X. Zhang, and L. Zhang, “A novel computerized adaptive testing framework with decoupled learning selector,” Complex & Intelligent Systems, pp. 1–12, 2023.
  • [7] J. De La Torre, “Dina model and parameter estimation: A didactic,” Journal of educational and behavioral statistics, vol. 34, no. 1, pp. 115–130, 2009.
  • [8] S. E. Embretson and S. P. Reise, Item response theory.   Psychology Press, 2013.
  • [9] Y. Zhou, Q. Liu, J. Wu, F. Wang, Z. Huang, W. Tong, H. Xiong, E. Chen, and J. Ma, “Modeling context-aware features for cognitive diagnosis in student learning,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2420–2428.
  • [10] M. D. Reckase, “Multidimensional item response theory models,” in Multidimensional item response theory.   Springer, 2009, pp. 79–112.
  • [11] J. De La Torre, “The generalized dina model framework,” Psychometrika, vol. 76, no. 2, pp. 179–199, 2011.
  • [12] Y. Deng, Y. Xie, Y. Li, M. Yang, W. Lam, and Y. Shen, “Contextualized knowledge-aware attentive neural network: Enhancing answer selection with knowledge,” ACM Transactions on Information Systems (TOIS), vol. 40, no. 1, pp. 1–33, 2021.
  • [13] H. Ma, J. Wang, H. Zhu, X. Xia, H. Zhang, X. Zhang, and L. Zhang, “Reconciling cognitive modeling with knowledge forgetting: A continuous time-aware neural network approach,” in Proceedings of the 31st International Joint Conference on Artificial Intelligence, 2022, pp. 2174–2181.
  • [14] J. Chai, H. Zeng, A. Li, and E. W. Ngai, “Deep learning in computer vision: A critical review of emerging techniques and application scenarios,” Machine Learning with Applications, vol. 6, p. 100134, 2021.
  • [15] S. Cheng, Q. Liu, E. Chen, Z. Huang, Z. Huang, Y. Chen, H. Ma, and G. Hu, “Dirt: Deep learning enhanced item response theory for cognitive diagnosis,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2397–2400.
  • [16] Y. Zhou, Q. Liu, J. Wu, F. Wang, Z. Huang, W. Tong, H. Xiong, E. Chen, and J. Ma, “Modeling context-aware features for cognitive diagnosis in student learning,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 2420–2428.
  • [17] H. Ma, J. Zhu, S. Yang, Q. Liu, H. Zhang, X. Zhang, Y. Cao, and X. Zhao, “A prerequisite attention model for knowledge proficiency diagnosis of students,” in Proceedings of the 31st ACM CIKM, 2022, pp. 4304–4308.
  • [18] J. Li, F. Wang, Q. Liu, M. Zhu, W. Huang, Z. Huang, E. Chen, Y. Su, and S. Wang, “Hiercdf: A bayesian network-based hierarchical cognitive diagnosis framework,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 904–913.
  • [19] S. Hecker, D. Dai, and L. Van Gool, “Failure prediction for autonomous driving,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 1792–1799.
  • [20] M. Bernhardt, F. D. S. Ribeiro, and B. Glocker, “Failure detection in medical image classification: A reality check and benchmarking testbed,” 2022.
  • [21] R. Zha, C. Qin, L. Zhang, D. Shen, T. Xu, H. Zhu, and E. Chen, “Career mobility analysis with uncertainty-aware graph autoencoders: A job title transition perspective,” IEEE Transactions on Computational Social Systems, 2023.
  • [22] C. Qin, L. Zhang, R. Zha, D. Shen, Q. Zhang, Y. Sun, C. Zhu, H. Zhu, and H. Xiong, “A comprehensive survey of artificial intelligence techniques for talent analytics,” arXiv preprint arXiv:2307.03195, 2023.
  • [23] Y. Ding, J. Liu, X. Xu, M. Huang, J. Zhuang, J. Xiong, and Y. Shi, “Uncertainty-aware training of neural networks for selective medical image segmentation,” in Medical Imaging with Deep Learning.   PMLR, 2020, pp. 156–173.
  • [24] T. DeVries and G. W. Taylor, “Learning confidence for out-of-distribution detection in neural networks,” arXiv preprint arXiv:1802.04865, 2018.
  • [25] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
  • [26] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Advances in neural information processing systems, vol. 30, 2017.
  • [27] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning. PMLR, 2017, pp. 1321–1330.
  • [28] J. Moon, J. Kim, Y. Shin, and S. Hwang, “Confidence-aware learning for deep neural networks,” in International conference on machine learning. PMLR, 2020, pp. 7034–7044.
  • [29] A. Pagnoni, K. Liu, and S. Li, “Conditional variational autoencoder for neural machine translation,” arXiv preprint arXiv:1812.04405, 2018.
  • [30] D. Shen, C. Qin, H. Zhu, T. Xu, E. Chen, and H. Xiong, “Joint representation learning with relation-enhanced topic models for intelligent job interview assessment,” ACM Transactions on Information Systems (TOIS), vol. 40, no. 1, pp. 1–36, 2021.
  • [31] D. Shen, H. Zhu, C. Zhu, T. Xu, C. Ma, and H. Xiong, “A joint learning approach to intelligent job interview assessment,” in IJCAI, vol. 18, 2018, pp. 3542–3548.
  • [32] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [33] C. Qin, K. Yao, H. Zhu, T. Xu, D. Shen, E. Chen, and H. Xiong, “Towards automatic job description generation with capability-aware neural networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 5, pp. 5341–5355, 2022.
  • [34] C. Wang, H. Zhu, C. Zhu, X. Zhang, E. Chen, and H. Xiong, “Personalized employee training course recommendation with career development awareness,” in Proceedings of the Web Conference 2020, 2020, pp. 1648–1659.
  • [35] M. Toneva, A. Sordoni, R. T. des Combes, A. Trischler, Y. Bengio, and G. J. Gordon, “An empirical study of example forgetting during deep neural network learning,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=BJlxm30cKm
  • [36] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner, “Understanding disentangling in $\beta$-VAE,” arXiv preprint arXiv:1804.03599, 2018.
  • [37] X. Ma, C. Zhou, and E. Hovy, “MAE: Mutual posterior-divergence regularization for variational autoencoders,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Hke4l2AcKQ
  • [38] D. Shen, C. Qin, C. Wang, H. Zhu, E. Chen, and H. Xiong, “Regularizing variational autoencoder with diversity and uncertainty awareness,” in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, ser. IJCAI-2021. International Joint Conferences on Artificial Intelligence Organization, Aug. 2021. [Online]. Available: http://dx.doi.org/10.24963/ijcai.2021/408
  • [39] M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using bayesian binning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 29, no. 1, 2015.