Unsupervised Concept Drift Detection from Deep Learning Representations in Real-time

Salvatore Greco (ORCID 0000-0001-7239-9602), Bartolomeo Vacchetti (ORCID 0000-0001-5583-4692), Daniele Apiletti (ORCID 0000-0003-0538-9775), and Tania Cerquitelli (ORCID 0000-0002-9039-6226). Politecnico di Torino, Turin, Italy.
Abstract.

Concept Drift is a phenomenon in which the underlying data distribution and statistical properties of a target domain change over time, leading to a degradation of the model’s performance. Consequently, models deployed in production require continuous monitoring through drift detection techniques. Most drift detection methods to date are supervised, i.e., based on ground-truth labels. However, true labels are usually not available in many real-world scenarios. Although recent efforts have been made to develop unsupervised methods, they often lack the required accuracy, have a complexity that makes real-time implementation in production environments difficult, or are unable to effectively characterize drift. To address these challenges, we propose DriftLens, an unsupervised real-time concept drift detection framework. It works on unstructured data by exploiting the distribution distances of deep learning representations. DriftLens can also provide drift characterization by analyzing each label separately. A comprehensive experimental evaluation is presented with multiple deep learning classifiers for text, image, and speech. Results show that (i) DriftLens performs better than previous methods in detecting drift in 11/13 use cases; (ii) it runs at least 5 times faster; (iii) its detected drift value is very coherent with the amount of drift (correlation $\geq 0.85$); (iv) it is robust to parameter changes.

Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/grecosalvatore/drift-lens.

1. Introduction

The basic assumption of deep learning is that the training data mimics the real world. However, deep learning models are typically trained and evaluated on static datasets, whereas the world is dynamic, so the model learned during training may no longer be valid. The underlying data distribution and statistical properties of the target domain may change over time, leading to the degradation (or decay) of the models’ performance (Lu et al., 2019). This phenomenon is called “concept drift”.

Concept drift and performance degradation can affect the reliability and robustness of deep learning models in real-world production applications (Wang et al., 2022). Therefore, it is important to continuously monitor production models and detect concept/data drift at an early stage (Shankar and Parameswaran, 2022). This monitoring process should provide early warnings when a shift occurs and implement adaptive measures to maintain expected performance on newly processed data.

A large body of research has focused on detecting concept drift over time through supervised strategies. They typically rely on error rates or performance-based measures computed from actual labels (Lu et al., 2019). However, in many real-world applications, the actual labels are not available for newly processed data.

A parallel research effort has been devoted to unsupervised strategies for detecting concept/data drift in data streams (Hinder et al., 2023a; Gemaque et al., 2020; Shen et al., 2023). Most of them rely on distribution distances or divergence measures that are evaluated for each individual instance. Therefore, they are often computationally intensive and thus ineffective in detecting concept drift in the case of deep learning models working with unstructured data. In such scenarios, existing techniques are not suitable for detecting drift in (near) real-time. Moreover, many techniques show poor performance in detecting drift in high-dimensional data.

To overcome these limitations, this paper proposes DriftLens, an unsupervised drift detection framework for deep learning classifiers working with unstructured data. In designing DriftLens, we attempt to answer the following four research questions (RQs):

(RQ1) To what extent can DriftLens detect drift of varying severity in deep learning classifiers for unstructured data in the absence of ground truth labels?

(RQ2) How can DriftLens be applied broadly and effectively across various data types, models, and classification tasks?

(RQ3) How efficient is DriftLens at detecting drift in near real-time?

(RQ4) To what extent can DriftLens accurately model and characterize the presence of drift over time?

To this aim, the contribution of this paper is twofold:

(1) We propose DriftLens (§4), a new unsupervised drift detection framework able to detect whether and when drift occurs by computing the distribution distances of the deep learning embedding representations on unstructured data. DriftLens is able to perform drift characterization by determining which labels are most affected by the drift. Due to its low complexity, it can run in real-time.

(2) We perform a comprehensive evaluation of DriftLens for several deep learning models and datasets for text, image, and speech classification (§5). We found that: (i) DriftLens is very effective in discriminating new data with and without drift. Overall, DriftLens achieves better performance than previous techniques across 11/13 use cases. (ii) DriftLens is extremely fast and enables real-time concept drift detection, as it can classify the presence of drift in less than 0.2 seconds independently of the data volumes, running at least 5 times faster than other detectors. (iii) The drift curve modeled by DriftLens is highly correlated with the amount of drift present, as it exhibits a correlation index $\geq 0.85$ for the drift patterns evaluated. We also qualitatively show that the drifting trend is coherently represented by the drift curves, and that it is able to characterize drift by identifying the labels most affected by drift. (iv) DriftLens is robust to its parameter settings.

Before discussing the methodology (§4) and the experimental evaluations (§5), we define the concept drift problem and the application scenario with the specific challenges we focus on in this paper (§2). Then, we highlight the limitations of previous work on concept drift detection in our setting (§3).

2. Problem Formulation

Here, we formally define the concept drift problem (§2.1), we show the patterns in which drift can occur (§2.2), and we define the application scenario, challenges and desiderata in this study (§2.3).

2.1. Concept drift definition

Concept drift can be defined as a phenomenon in which the underlying data distribution and statistical properties of a target data domain change over time. The presence of drift can negatively affect the performance of predictive models over time. Various terms have been proposed to refer to “concept drift” (Bayram et al., 2022), such as data drift, dataset shift, covariate shift, prior probability shift, and concept shift. Although each definition emphasizes a particular facet of the phenomenon, most works broadly refer to all subcategories under the term “concept drift”. Concept drift is defined as a change in the joint distribution between a time period $[0,t]$ and a time window $t+w$. Drift occurs at time window $t+w$ if:

(1) $P_{[0,t]}(X,y)\neq P_{t+w}(X,y)$

where $X$ and $y$ are the feature vectors and the target variable of each data instance $(x_{i},y_{i})$, and $P_{t}(X,y)$ is the joint probability. The time window $t+w$ can be defined as a time period or instant based on how the data stream is processed. The joint probability can be further decomposed as:

(2) $P_{t}(X,y)=P_{t}(y/X)P_{t}(X)=P_{t}(X/y)P_{t}(y)$

where $P_{t}(X/y)$ is the class-conditional probability, $P_{t}(y/X)$ is the posterior probability of the target labels, $P_{t}(X)$ is the prior probability of the input data, and $P_{t}(y)$ is the prior probability of the target labels. Therefore, in classification tasks, concept drift can occur as a change in any of these terms: (1) $P(X)$: A drift in the input data. The marginal probability of the input features $X$ changes. This type of drift is also known as data drift, covariate shift, or virtual drift. (2) $P(y/X)$: The relationships or conditional probabilities of target labels given input features change, but the input features do not necessarily change. This is usually referred to as concept or real drift. (3) $P(y)$: A change in the output data. Therefore, the labels and their probabilities change. This type of drift is sometimes referred to as label drift.
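As a small numerical illustration of Equation (2), the following sketch builds a toy joint distribution over a binary feature and a binary label and checks that both factorizations recover the joint. It then changes only $P(X)$ while keeping $P(y/X)$ fixed, which corresponds to the first type of drift listed above. The probability values and function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Toy joint distribution P(X, y) with X in {0, 1} and y in {0, 1}.
# All probabilities are hypothetical values chosen for illustration.
joint = {(0, 0): 0.32, (0, 1): 0.08, (1, 0): 0.12, (1, 1): 0.48}


def marginal_x(joint):
    """P(X): sum the joint over y."""
    out = {}
    for (x, _), p in joint.items():
        out[x] = out.get(x, 0.0) + p
    return out


def marginal_y(joint):
    """P(y): sum the joint over X."""
    out = {}
    for (_, y), p in joint.items():
        out[y] = out.get(y, 0.0) + p
    return out


def posterior(joint):
    """P(y / X) = P(X, y) / P(X)."""
    px = marginal_x(joint)
    return {(x, y): p / px[x] for (x, y), p in joint.items()}


def class_conditional(joint):
    """P(X / y) = P(X, y) / P(y)."""
    py = marginal_y(joint)
    return {(x, y): p / py[y] for (x, y), p in joint.items()}


px, py = marginal_x(joint), marginal_y(joint)
post, cond = posterior(joint), class_conditional(joint)

# Both factorizations of Equation (2) recover the joint.
for (x, y), p in joint.items():
    assert abs(post[(x, y)] * px[x] - p) < 1e-12
    assert abs(cond[(x, y)] * py[y] - p) < 1e-12

# Drift in P(X) only: keep P(y / X) fixed and change the input prior.
new_px = {0: 0.8, 1: 0.2}
drifted_joint = {(x, y): post[(x, y)] * new_px[x] for (x, y) in joint}
new_py = marginal_y(drifted_joint)
print("P(y) before:", py)
print("P(y) after: ", new_py)
```

In this toy case the label prior $P(y)$ also changes as a side effect of the change in $P(X)$, even though $P(y/X)$ is untouched.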

This work focuses on the detection of drift in production scenarios where no ground truth labels are available for new data. Therefore, drift has to be detected in an unsupervised way. Due to the lack of actual labels, the only possible drift to be considered is the change in the prior probability of the features $P(X)$ (Hinder et al., 2023a).

2.2. Drift patterns

Figure 1. Drift patterns.

Concept drift can occur in several patterns. Some examples are shown in Figure 1 and described below.

Sudden/abrupt change. Drift can occur suddenly if the distribution of new samples changes rapidly. A possible example of a sudden drift is a tweet topic classifier during the outbreak of the COVID-19 pandemic. At that point, the model was suddenly exposed to numerous text samples containing a new topic. Recognizing this scenario is crucial to initiate the retraining process and restore the classifier’s performance.

Recurrent change. Drift occurs repeatedly after the first observed event, with a seasonality unknown during training. For example, a topic or sentiment classifier may face new topics of interest, seasonal events, or cultural changes that alter the labels to be predicted. In an election year, the language and topics of discussion on social media platforms can change significantly, affecting the sentiment and context in which certain words or phrases are used. After the election, discussions return to a normal state, but during another significant event, such as a global sporting event like the Olympics, the state may change again. The classifier must adapt to these changes in order to perform well. However, as these changes occur repeatedly, it faces the challenge of the concept drifting in a cyclical pattern. It is important to recognize this scenario in order to possibly (i) retrain the classifiers to include examples of the new recurrent distribution or (ii) switch to another model that has been explicitly trained on the specific case.

Incremental change. The transition between concepts occurs smoothly and gradually over time. For instance, during image or object classification for autonomous vehicles, the model may encounter new vehicle types, such as scooters (i.e., Personal Light Electric Vehicles), which were not part of the original training data. Initially, these new vehicles appear only rarely and their occurrence is sporadic. Over time, however, they gradually become more commonplace and an integral part of the streetscape.
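The three patterns can be sketched as the fraction of drifted samples contained in each data window over time. The function below is only an illustrative parameterization: the names `drift_fraction`, `start`, `period`, and `max_fraction`, the linear ramp for the incremental case, and the equal-length on/off blocks for the recurrent case are all assumptions made for this example.

```python
def drift_fraction(pattern, n_windows, start=10, period=10, max_fraction=1.0):
    """Fraction of drifted samples in each window for a given drift pattern.

    pattern      -- "sudden", "recurrent", or "incremental"
    n_windows    -- total number of windows in the simulated stream
    start        -- index of the first window affected by drift
    period       -- length (in windows) of each on/off block (recurrent only)
    max_fraction -- fraction of drifted samples at full drift
    """
    if pattern not in ("sudden", "recurrent", "incremental"):
        raise ValueError(f"unknown pattern: {pattern}")

    fractions = []
    for w in range(n_windows):
        if w < start:
            frac = 0.0  # no drift before the starting window
        elif pattern == "sudden":
            frac = max_fraction  # drift appears at once and persists
        elif pattern == "recurrent":
            # Alternate drifted and non-drifted blocks of `period` windows.
            drifted_block = ((w - start) // period) % 2 == 0
            frac = max_fraction if drifted_block else 0.0
        else:
            # Incremental: linear ramp that reaches max_fraction at the end.
            frac = max_fraction * (w - start + 1) / (n_windows - start)
        fractions.append(frac)
    return fractions


for name in ("sudden", "recurrent", "incremental"):
    print(name, [round(f, 2) for f in drift_fraction(name, 30)])
```

Such per-window fractions can then be used to decide how many samples of a new, unseen distribution to inject into each simulated window.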

2.3. Application scenarios

We design DriftLens so that it can be effectively exploited in a variety of application scenarios characterized by:

(1) Unavailability of ground truth labels for new incoming data. Many applications, such as sentiment analysis or topic detection on social media comments, process large volumes of production data that are classified by deep learning models. Most of them do not have actual labels available for newly processed samples. Therefore, both drift detection and adaptation must be performed in an unsupervised way, as done by DriftLens.

(2) High complexity and high-dimensional data. Most applications deal with unstructured data, such as text, images, and speech. Detecting concept drift in unstructured data is even more complex than in structured data due to the high dimensionality (usually characterized by high data sparsity), the complexity of the inputs, and the lack of a fixed structure, such as columns. All those aspects can undermine the effectiveness and increase the complexity of drift detection techniques. Appropriate data modeling is required to detect data changes over time effectively. DriftLens relies on the analysis of the internal representation generated by deep learning models to immediately detect concept drift in classifiers working with unstructured data.

In such a target scenario, the desiderata for the drift detection method are the following: (1) Fast detection. The drift detection technique should be able to detect drift as soon as possible and not only when drift occurs with high severity; (2) Real-time detection. The technique should have low complexity and be able to run in real-time; (3) Drift characterization. It should provide information on what type of drift occurs and evaluate the labels most affected by the data changes. DriftLens addresses all the above desiderata.

3. Related Works

There has been a significant effort in the field of concept drift detection (Lu et al., 2019; Gemaque et al., 2020; Hu et al., 2020; Bayram et al., 2022; Wang et al., 2024; Adams et al., 2023). Drift detection techniques can be categorized into two macro-categories based on the true labels availability assumption: (1) supervised and (2) unsupervised.

(1) Supervised concept drift methods. Most of the previous drift detection techniques are supervised (Gama et al., 2004; Baena-García et al., 2006; Gama and Castillo, 2006; Frías-Blanco et al., 2015; Liu et al., 2017b; Xu and Wang, 2017; Bifet and Gavalda, 2007; Arora et al., 2023; Vorburger and Bernstein, 2006; Grulich et al., 2018; Sun et al., 2017; Mayaki and Riveill, 2022; Kim and Park, 2016; Haque et al., 2016; Jadhav and Deshpande, 2017; Pinagé et al., 2020). These methods typically rely on error rate-based measures or ensemble models to assess the performance degradation over time (e.g., accuracy decrease). However, these methods assume that true labels are available along with the new data or within a short period of time. In practice, the labels for new data are usually not available, and labeling them is very costly and time-consuming. This dependence on the availability of true labels is a major limitation of these techniques, which limits their applicability in real-world scenarios.

(2) Unsupervised concept drift methods. In contrast, unsupervised techniques do not require the availability of true labels (Hinder et al., 2023a; Gemaque et al., 2020; Shen et al., 2023; Werner et al., 2023). Our methodology falls into this category. As outlined by (Hinder et al., 2023a), unsupervised techniques can be further divided into: (2.i) statistical-based, (2.ii) loss-based, and (2.iii) virtual classifier-based.

(2.i) Statistical-based methods. Most of the unsupervised techniques rely on statistical tests or divergence metrics between two distributions (Gretton et al., 2012; dos Reis et al., 2016; Bu et al., 2018; Cerquitelli et al., 2019; Ventura et al., 2019; Bashir et al., 2016; de Mello et al., 2019; Kifer et al., 2004; Nishida and Yamauchi, 2007; Rabanser et al., 2018), usually between a reference and a new window. Our methodology belongs to this category. These techniques are independent of the type of data and models and do not require external models or resources. Most of these techniques use every single sample in the reference and new window to calculate the distances. Therefore, they scale poorly with the number of samples and the dimensionality of the inputs. This can affect both their runtime and drift prediction performance.

(2.ii) Loss-based methods. These techniques exploit machine learning model loss functions to evaluate the similarity of newly arrived data points with previous ones (Lughofer et al., 2015; Suprem et al., 2020; Hushchyn and Ustyuzhanin, 2020; Yamanishi and Takeuchi, 2002). Usually, they rely on auto-encoders or are used in conjunction with supervised drift detectors. In the first case, these methods are widely used but are difficult to transfer to other types of distributions. In the second case, they rely on the assumption that there is a correlation between the increase in model losses and the concept drift, as shown in (Hinder et al., 2023b). Moreover, their implementation varies depending on the data type, typically being less effective for texts or speech compared to images.

(2.iii) Virtual classifier-based methods. These techniques, such as (Gözüaçık et al., 2019; Liu et al., 2017a; Hido et al., 2008), also use classifiers to detect drift, but in a different way. The general idea is to divide data into two sets, before and after a certain moment in time, and to train a classifier to distinguish between them. If the classifier achieves an accuracy higher than random in classifying samples into the two classes, it implies divergent data properties between the two class distributions, which means that there is drift. The limitation of these approaches lies in their need to train and maintain a new model to detect drift.
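The general idea of virtual classifier-based detection can be sketched as follows. This is a minimal illustration, not the implementation of any of the cited methods: it assumes NumPy is available, uses a simple nearest-centroid classifier as a stand-in for whatever model those methods train, and runs on synthetic Gaussian data. The function name `domain_classifier_accuracy` is hypothetical.

```python
import numpy as np


def domain_classifier_accuracy(before, after, seed=0):
    """Held-out accuracy of a nearest-centroid classifier that separates
    'before' samples (class 0) from 'after' samples (class 1)."""
    rng = np.random.default_rng(seed)
    X = np.vstack([before, after])
    y = np.concatenate([np.zeros(len(before), dtype=int),
                        np.ones(len(after), dtype=int)])

    # Random split: half of the samples for fitting, half for evaluation.
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, test = idx[:half], idx[half:]

    # Fit: one centroid per class, computed on the training half.
    centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

    # Predict: assign each held-out sample to the closest centroid.
    dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
    preds = dists.argmin(axis=1)
    return float((preds == y[test]).mean())


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=(200, 5))  # data before the split point
same = rng.normal(0.0, 1.0, size=(200, 5))       # later data, same distribution
shifted = rng.normal(3.0, 1.0, size=(200, 5))    # later data, shifted mean

acc_no_drift = domain_classifier_accuracy(reference, same)
acc_drift = domain_classifier_accuracy(reference, shifted)
print("accuracy without drift:", acc_no_drift)
print("accuracy with drift:   ", acc_drift)
```

An accuracy close to chance (0.5) suggests the two sets are indistinguishable, whereas an accuracy well above chance suggests drift. The example also makes the stated limitation concrete: a classifier has to be fitted for every comparison.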

Drift adaptation and incremental learning. Some supervised works are related to incremental learning for drift detection and adaptation (Gama et al., 2014). However, in unsupervised settings, the adaptation to drift is much more challenging due to the lack of annotated samples (Hinder et al., 2023a). The drift adaptation problem is out of the scope of this paper. Instead, we focus on the drift detection problem only.

This paper presents a new unsupervised statistical-based drift detection technique, namely DriftLens. The preliminary idea was presented in (Greco and Cerquitelli, 2021). DriftLens differs from previous work in three key aspects. Firstly, like other statistical-based methods, it is completely unsupervised. Thus, it does not require any external model to detect drift, unlike supervised, loss-based, and virtual classifier methods. Secondly, it exploits statistical distances that better scale with data dimensionality than other statistical-based methods, thereby enabling efficient real-time drift detection in large data volumes. Thirdly, DriftLens is able to better characterize drift by identifying each drifting label.

Figure 2. DriftLens framework. It comprises an offline and an online phase. In the offline phase, it estimates the reference distributions and distance thresholds from historical (training) data. The distributions are modeled as multivariate normal and are computed: (i) for the entire batch (per-batch), and (ii) conditioned on the predicted label (per-label). In the online phase, it analyzes data streams in fixed windows, comparing new and reference distributions, and using thresholds to identify drift, visualized through a drift trend monitor.
Table 1. Symbols and notation.

| Symbol | Description |
|---|---|
| $L$; $\lvert L\rvert$ | Set of labels used to train the model; number of labels. |
| $\phi(X)$ | Encoder function that takes a set of inputs $X$ and outputs the embedding $E$. |
| $d$; $d'$ | Embedding dimensionality; reduced embedding dimensionality. |
| $\hat{y}$ | Vector of predicted labels. |
| $E$; $E'$ | Embedding matrix; reduced embedding matrix. |
| $m$; $m_{b}$; $m_{w}$ | Number of samples; number of baseline samples; number of window samples (window size). |
| $\mu$; $\mu_{b}$; $\mu_{w}$ | Mean vector; baseline mean vector; window mean vector. |
| $\Sigma$; $\Sigma_{b}$; $\Sigma_{w}$ | Covariance matrix; baseline covariance matrix; window covariance matrix. |
| $m_{w}$; $m_{b}$ | Window size; size of the reference set. |
| $n_{th}$ | Number of randomly sampled windows to estimate the threshold. |
| $T$; $T_{\alpha}$ | Threshold value; threshold sensitivity. |

4. DriftLens

DriftLens is an unsupervised drift detection technique based on distribution distances within the embeddings, i.e., the internal dense representations generated by deep learning models. It is specifically designed to deal with unstructured data and uses the nuanced patterns and relationships represented in the embeddings to detect data shifts or changes over time. The methodology includes an offline and an online phase, as shown in Figure 2. In the offline phase, DriftLens estimates the reference distributions and threshold values from the historical (e.g., training) dataset. The reference distributions, called the baseline, represent the distribution of features (i.e., embeddings) of the concepts that the model has learned during the training phase. They, therefore, represent the absence of drift. In the online phase, the new data stream is processed in windows of fixed size. Firstly, the distributions of the new data windows are estimated. Secondly, the distribution distances are computed with respect to the reference distributions. If the distance exceeds the threshold, the presence of drift is predicted.
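The control flow of the online phase can be sketched as a simple windowed loop. The sketch below is deliberately simplified and rests on several assumptions: it uses the Euclidean distance between mean vectors as a stand-in for the distribution distance of §4.2, a fixed hypothetical threshold instead of one estimated in the offline phase, and toy two-dimensional vectors in place of real embeddings. The names `monitor`, `mean_vector`, and `euclidean` are illustrative.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean(u, v):
    """Euclidean distance between two vectors (stand-in distance)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def monitor(stream, baseline_mean, window_size, threshold):
    """Split the stream into fixed-size windows and flag drifted ones.

    Returns a list of (window_index, distance, drift_flag) tuples.
    A trailing partial window is ignored.
    """
    results = []
    for start in range(0, len(stream) - window_size + 1, window_size):
        window = stream[start:start + window_size]
        dist = euclidean(mean_vector(window), baseline_mean)
        results.append((start // window_size, dist, dist > threshold))
    return results


# Offline: summarize the historical embeddings (toy 2-D values).
baseline = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
baseline_mean = mean_vector(baseline)

# Online: the first window resembles the baseline, the second is shifted.
stream = baseline + [[x + 5.0, y + 5.0] for x, y in baseline]
results = monitor(stream, baseline_mean, window_size=4, threshold=1.0)
for index, dist, drift in results:
    print(index, round(dist, 3), drift)
```

Only the summary of the baseline is needed at monitoring time, so each window costs one distribution estimate and one distance computation.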

We first present the data modeling (§4.1) and the distribution distance used (§4.2), as they are performed similarly in both phases. Then, we describe the offline (§4.3) and online (§4.4) phases. Table 1 reports a list of symbols used to describe the methodology.

4.1. Data modeling

The goal of data modeling is to estimate the distribution of a batch of data. Instead of using the raw input data, we exploit the internal representation that the deep learning model generates when processing a batch of data (i.e., the embeddings). Figure 3 summarizes the main steps of the data modeling performed by DriftLens. It estimates (i) the embedding distribution of the entire batch independent of the predicted class labels (per-batch) and (ii) the embedding distributions for each predicted class separately (per-label). The per-batch models the entire set of embedding vectors as a multivariate normal distribution. For the per-label, $|L|$ normal distributions are estimated instead, where $|L|$ is the number of labels on which the model was trained. In our framework, we assume that the embeddings are distributed as a multivariate normal distribution. Although this is usually a strong assumption for the raw input features, it may be a good approximation when applied to the features in the embedding space. This is because the dense features in the embedding space, especially in the last layers, have simpler relationships (Bengio et al., 2013) and can be better approximated as a multivariate normal distribution.

Figure 3. DriftLens data modeling. This process inputs a deep learning model and a set of data, and estimates the multivariate normal embedding distributions. It extracts embeddings from the model, enriches them with labels, reduces the embedding dimensionality, and estimates the per-batch distribution by computing the mean vector $\mu$ and covariance matrix $\Sigma$. It further estimates label-specific distributions by grouping embeddings by label, reducing the dimensionality, and computing the per-label $\mu_{l}$ and $\Sigma_{l}$, $\forall l\in L$.

Consider a deep learning classifier capable of distinguishing between a set of class labels $L$. The classifier consists of an encoder part $\phi(X)$ that maps the sparse and complex representations of the raw input data to a dense and simpler latent representation by applying several nonlinear transformations using some learned weights $W$. The encoder is usually followed by one or more fully connected neural network layers to predict a class label based on the latent embedding representation.

In data modeling, a dataset $X$ (i.e., the entire baseline or a new window) is fed into the deep learning model to extract the corresponding embedding matrix $E=\phi(X)\in\mathbb{R}^{m\times d}$ and the vector of predicted labels $\hat{y}\in\mathbb{R}^{m}$, where $m$ is the number of samples in the set and $d$ is the dimensionality of the extracted embedding layer. For deep learning classifiers, the dimensionality of the embedding space is usually large (i.e., in the range of thousands). Since we model the embedding as a multivariate normal distribution, we need to compute the mean vector $\mu$ and the covariance matrix $\Sigma$ over the set of vectors. However, the covariance matrix $\Sigma$ requires at least $d$ linearly independent vectors (i.e., inputs) to be full rank. Otherwise, $\Sigma$ is rank-deficient, which can introduce complex numbers into the computation of the distribution distance (see §4.2). This problem is particularly pronounced when estimating a separate multivariate normal distribution for each label, as we need $d$ linearly independent vectors predicted with each label. Although this is usually not a problem when modeling the baseline, as it is computed over the entire historical dataset (e.g., the training set), in the online phase $d$ linearly independent vectors are needed to estimate the per-batch distribution, and $d\times|L|$ linearly independent vectors are needed to estimate the per-label distributions in each data stream window. Therefore, the fixed size used to divide the data stream into windows would have to be very large.
To solve this problem, DriftLens performs a dimensionality reduction of the embedding by applying a principal component analysis (PCA) step.
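The rank problem described above can be checked numerically. The sketch below assumes NumPy is available and uses synthetic Gaussian vectors in place of real embeddings; the sizes `m`, `d`, and `d_reduced` are hypothetical. It shows that a window with fewer samples than embedding dimensions yields a rank-deficient covariance matrix, and that projecting onto a few principal directions (here via a plain SVD, used as a stand-in PCA) yields a full-rank covariance in the reduced space.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 50  # fewer samples than embedding dimensions (hypothetical sizes)
E = rng.normal(size=(m, d))  # synthetic stand-in for an embedding matrix

# d x d sample covariance: its rank is at most m - 1 after mean-centering,
# so it cannot be full rank when m <= d.
cov = np.cov(E, rowvar=False)
rank = np.linalg.matrix_rank(cov)

# Project the centered data onto the first d' principal directions.
d_reduced = 5
centered = E - E.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
E_reduced = centered @ vt[:d_reduced].T  # shape: m x d'

# The covariance in the reduced space is d' x d' and full rank.
cov_reduced = np.cov(E_reduced, rowvar=False)
rank_reduced = np.linalg.matrix_rank(cov_reduced)

print("original:", cov.shape, "rank", rank)
print("reduced: ", cov_reduced.shape, "rank", rank_reduced)
```

With the reduced dimensionality, the number of samples a window needs depends on $d'$ rather than on $d$.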

(3) $E'=PCA(E)$

This leads to a reduced embedding matrix $E'\in\mathbb{R}^{m\times d'}$, where $0<d'\leq d$ is a user-defined parameter that determines the number of principal components for the entire batch. Dimensionality reduction is also performed separately for each embedding matrix, depending on the predicted label.

(4) $E'_{l}=PCA(E_{l}),\ \forall l\in L$

where $E_{l}\in\mathbb{R}^{m_{l}\times d}$ are the embedding vectors associated with the $m_{l}$ inputs predicted with the label $l$, $E'_{l}\in\mathbb{R}^{m_{l}\times d'_{l}}$ is the corresponding reduced matrix, and $0<d'_{l}\leq d$ is a user-defined parameter that specifies the number of principal components for each label. Note that $d'$ and $d'_{l}$ can be set to different values. For the per-batch, $d'$ can possibly be set to $d'=\min(m_{w},d)$, where $m_{w}$ is the window size used in the online phase and $d$ is the embedding dimensionality. In our experiments (see §5), we always set $d'=150$. Instead, $d'_{l}$ depends on the window size, the dimensionality of the embedding, and the number of labels. A reasonable suggestion for a balanced data stream is to set its value close to $d'_{l}=\min(m_{w}/|L|,d)$.
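The two suggestions above amount to a short piece of arithmetic, sketched below. The function name and the example sizes are hypothetical, and rounding $m_{w}/|L|$ down to an integer is an assumption made here, since the number of principal components must be a whole number.

```python
def suggested_components(window_size, embedding_dim, n_labels):
    """Suggested upper values for d' (per-batch) and d'_l (per-label).

    window_size   -- m_w, number of samples in each online window
    embedding_dim -- d, dimensionality of the extracted embedding
    n_labels      -- |L|, number of labels the model was trained on
    """
    d_batch = min(window_size, embedding_dim)
    # Assumes a balanced stream: roughly m_w / |L| samples per label.
    d_label = min(window_size // n_labels, embedding_dim)
    return d_batch, d_label


print(suggested_components(window_size=1000, embedding_dim=768, n_labels=4))
print(suggested_components(window_size=500, embedding_dim=768, n_labels=20))
```

For an unbalanced stream, the least frequent predicted label bounds how many components can be supported, so a smaller per-label value may be needed.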

Note that the PCA models are fitted in the offline phase, where they are also used to reduce the dimensionality of the baseline embedding. In the online phase, the same fitted PCA models are reused, without refitting, to reduce the embedding of the new data.
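The fit-once, reuse-later pattern can be sketched with a minimal SVD-based PCA. This is an illustration under stated assumptions rather than the authors' implementation: it assumes NumPy is available, the class `SimplePCA` is hypothetical, the embeddings and predicted labels are random placeholders, and the sizes (`d`, `d_reduced`, three labels) are arbitrary.

```python
import numpy as np


class SimplePCA:
    """Minimal SVD-based PCA: fit once, then reuse the same projection."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, E):
        # Store the mean and the leading principal directions of E.
        self.mean_ = E.mean(axis=0)
        _, _, vt = np.linalg.svd(E - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def transform(self, E):
        # Project new data with the stored mean and directions (no refit).
        return (E - self.mean_) @ self.components_.T


rng = np.random.default_rng(0)
d, d_reduced, labels = 64, 8, (0, 1, 2)

# Offline phase: fit one per-batch PCA and one PCA per predicted label
# on the baseline embeddings (random placeholders here).
E_baseline = rng.normal(size=(500, d))
y_baseline = rng.integers(0, len(labels), size=500)
pca_batch = SimplePCA(d_reduced).fit(E_baseline)
pca_label = {l: SimplePCA(d_reduced).fit(E_baseline[y_baseline == l])
             for l in labels}

# Online phase: reduce a new window with the already-fitted models.
E_window = rng.normal(size=(100, d))
y_window = rng.integers(0, len(labels), size=100)
E_window_reduced = pca_batch.transform(E_window)
E_window_reduced_l = {l: pca_label[l].transform(E_window[y_window == l])
                      for l in labels}

print(E_window_reduced.shape)
print({l: E.shape for l, E in E_window_reduced_l.items()})
```

Reusing the offline projection keeps baseline and window embeddings in the same reduced space, so their distributions remain comparable.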

Once the dimensionality reduction has been performed, the reduced embedding matrices $E'$ and $E'_{l}$ are used to estimate the multivariate normal distribution of the per-batch (Equation 5) and the $|L|$ multivariate normal distributions of each predicted label (Equation 6).

(5) $P(\phi(x); W) \sim \mathcal{N}(\mu, \Sigma)$
(6) $P(\phi(x) \mid \hat{y} = l; W) \sim \mathcal{N}(\mu_l, \Sigma_l), \quad \forall l \in L$

Where $\mu \in \mathbb{R}^{d'}$ and $\Sigma \in \mathbb{R}^{d' \times d'}$ are the mean vector and the covariance matrix of the per-batch multivariate normal distribution, and each pair $\mu_l \in \mathbb{R}^{d'_l}$ and $\Sigma_l \in \mathbb{R}^{d'_l \times d'_l}$ represents the per-label multivariate normal distribution of a label $l \in L$, separately. The multivariate normal distributions are straightforward and fast to estimate because they are fully characterized by the mean vector and the covariance matrix. The main advantage of estimating the distributions as multivariate normal distributions is that they can be represented with dimensionality $d'$, regardless of the number of samples in the reference and new windows. This allows the method to scale well even with large amounts of data and enables the detection of drift in real time regardless of the data volume.
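As a rough sketch of this estimation step, the snippet below computes the mean vector and covariance matrix for the per-batch and for each predicted label. The function names and the synthetic data are illustrative assumptions. For brevity, the per-label groups reuse the same reduced matrix, whereas DriftLens reduces each label with its own PCA.

```python
import numpy as np


def estimate_gaussian(E_red):
    """Return the mean vector and covariance matrix of reduced embeddings (m x d')."""
    mu = E_red.mean(axis=0)
    sigma = np.cov(E_red, rowvar=False)  # rows are samples, columns are features
    return mu, sigma


def estimate_per_label(E_red_by_label):
    """Estimate one Gaussian per predicted label from a {label: matrix} mapping."""
    return {l: estimate_gaussian(E_l) for l, E_l in E_red_by_label.items()}


rng = np.random.default_rng(0)
E_red = rng.normal(size=(1000, 5))     # reduced embeddings (m x d'), synthetic
y_hat = rng.integers(0, 3, size=1000)  # predicted labels in {0, 1, 2}, synthetic

mu, sigma = estimate_gaussian(E_red)
per_label = estimate_per_label({l: E_red[y_hat == l] for l in range(3)})

# The estimated parameters have the same size whatever the number of samples.
mu_small, sigma_small = estimate_gaussian(E_red[:50])
```

The last two lines illustrate the scalability argument: a window of 50 samples and a set of 1000 samples are both summarized by a $d'$-dimensional mean and a $d' \times d'$ covariance.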

4.2. Distribution distance

DriftLens uses the Fréchet distance (Dowson and Landau, 1982) to calculate the distance between two multivariate normal distributions. The Fréchet distance, also known as the Wasserstein-2 distance (Wasserstein, 1969), has been widely used in deep learning to measure the distance between distributions of model features, although in very different scenarios (Heusel et al., 2017). DriftLens uses it to measure the distance between the embedding distributions of a baseline (reference distributions) and of the new windows in the data stream; we call the resulting score the Fréchet Drift Distance ($FDD$).

Starting from a reference multivariate normal distribution (e.g., the baseline distribution) $b$, characterized by a mean vector $\mu_b$ and a covariance matrix $\Sigma_b$, and the multivariate normal distribution of the new data window $w$, characterized by $\mu_w$ and $\Sigma_w$, the $FDD$ is computed as follows:

(7) $FDD(b, w) = \|\mu_b - \mu_w\|_2^2 + \mathrm{Tr}\left(\Sigma_b + \Sigma_w - 2\sqrt{\Sigma_b \Sigma_w}\right)$

Where $FDD(b, w) \in [0, \infty)$. The higher the $FDD$, the greater the distance, and the more likely the drift. $FDD$ takes into account changes in the mean (i.e., center) and the diagonal elements of the covariance (i.e., spread) between the two distributions. The former is captured by the squared L2 norm of the difference between the mean vectors, $\|\mu_b - \mu_w\|_2^2$. The latter is captured by the term $\mathrm{Tr}(\Sigma_b + \Sigma_w - 2\sqrt{\Sigma_b \Sigma_w})$, which can be considered the generalization of the squared difference between the standard deviations in one-dimensional space. Therefore, the $FDD$ score can potentially identify more subtle types of drift that affect not only the center of the distributions but also the spread.
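To make the one-dimensional analogy concrete, the following short derivation specializes equation 7 to $d' = 1$, writing $\Sigma_b = \sigma_b^2$ and $\Sigma_w = \sigma_w^2$. The numeric values in the final comment are chosen only for illustration.

```latex
% One-dimensional case: \Sigma_b = \sigma_b^2 and \Sigma_w = \sigma_w^2
\begin{aligned}
FDD(b, w) &= (\mu_b - \mu_w)^2 + \sigma_b^2 + \sigma_w^2 - 2\sqrt{\sigma_b^2 \sigma_w^2} \\
          &= (\mu_b - \mu_w)^2 + (\sigma_b - \sigma_w)^2 .
\end{aligned}
% Example: \mu_b = \mu_w = 0, \sigma_b = 1, \sigma_w = 2 gives FDD = 0 + (1 - 2)^2 = 1.
```

In this example the two distributions share the same center, so a detector based only on the means would report no change, while the $FDD$ is strictly positive because the spread differs.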

DriftLens computes a single per-batch distance by computing the $FDD$ score between the baseline $(\mu_b, \Sigma_b)$ and the new window $(\mu_w, \Sigma_w)$ per-batch distributions, and $|L|$ per-label distances by computing the distance between the distributions $(\mu_{b,l}, \Sigma_{b,l})$ and $(\mu_{w,l}, \Sigma_{w,l})$ for each label $l \in L$ separately.
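A minimal NumPy sketch of equation 7 is given below; it is an illustration under stated assumptions rather than the DriftLens code. The function names are hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of $\Sigma_b \Sigma_w$ (which are real and non-negative for covariance matrices) instead of computing the square root explicitly.

```python
import numpy as np


def frechet_drift_distance(mu_b, sigma_b, mu_w, sigma_w):
    """FDD between N(mu_b, sigma_b) and N(mu_w, sigma_w), following equation 7."""
    mean_term = np.sum((mu_b - mu_w) ** 2)
    # Tr(sqrt(sigma_b @ sigma_w)) equals the sum of the square roots of the
    # eigenvalues of the product; clip guards against tiny negative round-off.
    eigvals = np.linalg.eigvals(sigma_b @ sigma_w)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(mean_term + np.trace(sigma_b) + np.trace(sigma_w) - 2.0 * tr_sqrt)


def per_label_fdd(baseline_stats, window_stats):
    """One FDD per label from {label: (mu, sigma)} dictionaries."""
    return {
        l: frechet_drift_distance(*baseline_stats[l], *window_stats[l])
        for l in baseline_stats
    }


rng = np.random.default_rng(0)
baseline = rng.normal(size=(2000, 4))
window = rng.normal(loc=0.5, size=(200, 4))  # shifted mean to mimic drift

stats_b = (baseline.mean(axis=0), np.cov(baseline, rowvar=False))
stats_w = (window.mean(axis=0), np.cov(window, rowvar=False))

fdd_same = frechet_drift_distance(*stats_b, *stats_b)     # identical distributions
fdd_shifted = frechet_drift_distance(*stats_b, *stats_w)  # baseline vs. shifted window
```

The same function serves both granularities: it is called once with the per-batch statistics and once per label through `per_label_fdd`.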

4.3. Offline phase

In the offline phase (steps 1 and 2 in Figure 2), DriftLens estimates (i) the probability distribution of a reference dataset that represents what the model learned during training, and (ii) distance thresholds to distinguish between normal and abnormal (i.e., possible drift) distances. The offline phase is run once, and its results are permanently stored on disk for later use during the online phase.

4.3.1. Baseline estimation

The baseline estimation consists of performing the data modeling (§4.1) on the baseline dataset and permanently storing the estimated distributions and the PCA models on disk. Specifically, given a baseline dataset $X_b$ containing $m_b$ samples, the entire baseline dataset $X_b$ is fed into the deep learning model to extract the embedding vectors $E_b = \phi(X_b)$ and estimate the vector of predicted labels $\hat{y}_b$. Then, the per-batch PCA is fitted over the entire set of vectors, and $|L|$ different PCAs are fitted, grouping the embedding vectors according to the predicted labels. The embedding vectors are then reduced both for the per-batch, to obtain the reduced embedding matrix $E'_b$, and for the per-label, to obtain the $|L|$ reduced embedding matrices $E'_{b,l}$. The reduced embedding matrices are then used to estimate the baseline per-batch and per-label multivariate normal distributions.
The per-batch distribution is fully characterized by the baseline per-batch mean vector $\mu_b$ and the covariance matrix $\Sigma_b$, which are obtained by calculating the mean and covariance over the entire set of reduced embedding vectors in the baseline dataset. The per-label distributions are obtained by computing the $|L|$ mean vectors $\mu_{b,l}$ and the covariance matrices $\Sigma_{b,l}$ on the reduced embedding vectors, grouped by predicted labels. Note that regardless of the size $m_b$ of the reference set (i.e., the baseline), the offline phase estimates distributions characterized by vectors of size $d'$. Thus, the computation required to predict drift for new windows is not influenced by the reference set size $m_b$.

4.3.2. Threshold estimation

The threshold estimation aims to identify the maximum possible distance ($FDD$) that a given window without drift can reach. To this end, DriftLens takes as input the threshold dataset $X_{th}$, the window size $m_w$ (equal to the one that will be used in the online phase), the baseline, and a parameter $n_{th}$ defining the number of windows to be randomly sampled from the threshold dataset. The $n_{th}$ parameter should be as large as possible to better estimate the largest distance in data considered without drift. In our experiments (see §5.1), we empirically set $n_{th} = 10000$. However, we found that varying this parameter does not significantly impact the framework (see §5.5).

Specifically, DriftLens randomly samples $n_{th}$ windows from the threshold dataset $X_{th}$, each one containing $m_w$ inputs. For each window, it performs the data modeling phase and computes the per-batch and per-label distribution distances with respect to the baseline distributions. Therefore, $n_{th}$ distribution distances for the entire batch and for each label are computed. Finally, the distribution distances are sorted in descending order. The first element contains the maximum distance that a window of data considered without drift can have with respect to the baseline. Thus, distances that exceed this value are potential warnings of drift. However, there are potentially outlier distances due to the large number of randomly sampled windows. Therefore, DriftLens provides a parameter to define the threshold sensitivity $T_\alpha \in [0, 1]$. This parameter removes the left tail of the distances sorted in descending order, i.e., the largest $T_\alpha$ fraction, to discard outliers. The final thresholds $T$ and $T_l$ are set to the maximum distance that remains after this removal. Therefore, the higher the value of $T_\alpha$, the lower the values of the thresholds, and the higher the sensitivity to possible drift and false alarms.
In our experiments (see §5.1), we use $T_\alpha = 0.01$ as the default threshold sensitivity because we want to take the maximum possible distance after removing only the outliers (top 1%).
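The threshold procedure above can be sketched as follows. This is an illustrative outline, not the authors' implementation: the function names, the synthetic embeddings, and the placeholder distance `mean_shift` (which keeps only the squared mean difference term of the $FDD$) are assumptions, as is the choice of sampling without replacement inside each window.

```python
import numpy as np


def sample_window_distances(E_th, m_w, n_th, distance_to_baseline, rng):
    """Distances of n_th windows of m_w samples drawn at random from E_th."""
    distances = []
    for _ in range(n_th):
        idx = rng.choice(len(E_th), size=m_w, replace=False)
        distances.append(distance_to_baseline(E_th[idx]))
    return distances


def estimate_threshold(distances, t_alpha=0.01):
    """Sort in descending order, drop the largest t_alpha fraction, return the max left."""
    ordered = np.sort(np.asarray(distances, dtype=float))[::-1]
    n_outliers = int(len(ordered) * t_alpha)
    return float(ordered[min(n_outliers, len(ordered) - 1)])


rng = np.random.default_rng(0)
E_th = rng.normal(size=(2000, 4))  # stand-in for threshold-set embeddings
mu_b = np.zeros(4)                 # stand-in for the baseline mean


def mean_shift(E_w):
    """Placeholder distance: only the squared mean difference term of the FDD."""
    return float(np.sum((E_w.mean(axis=0) - mu_b) ** 2))


distances = sample_window_distances(
    E_th, m_w=100, n_th=400, distance_to_baseline=mean_shift, rng=rng
)
threshold = estimate_threshold(distances, t_alpha=0.01)
```

In the full framework the same routine would be run once with the per-batch distances to obtain $T$ and once per label to obtain each $T_l$.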

4.3.3. Choice of the reference dataset

The reference set must be historical data representing what the model learned during training. Ideally, the entire training set is used as the baseline, while the threshold dataset is either the test set or a number of windows from the real data stream, assuming they represent distributions without drift. In our experiments, we use the training set for the baseline and the test set for the threshold estimation (see §5.1).

4.4. Online phase

In the online phase (steps 3, 4, and 5 in Figure 2), DriftLens processes the new data stream in fixed-size windows. The window size is defined by the parameter $m_w$. Given a new data window $X_w$ containing $m_w$ new samples, the data modeling is performed by (i) extracting the embedding $E_w \in \mathbb{R}^{m_w \times d}$ and the predicted labels $\hat{y}_w \in \mathbb{R}^{m_w}$, (ii) performing the embedding dimensionality reduction to obtain $E'_w \in \mathbb{R}^{m_w \times d'}$, and (iii) estimating the per-batch and per-label multivariate normal distributions. The PCA models fitted during the offline phase are used in this step.
The per-batch multivariate normal distribution is obtained by computing the mean vector $\mu_w \in \mathbb{R}^{d'}$ and the covariance matrix $\Sigma_w \in \mathbb{R}^{d' \times d'}$ over the entire set of embeddings in the window. For the per-label, a multivariate normal distribution for each distinct label is obtained by selecting the embedding vectors predicted with that label and computing the mean and covariance on that subset, obtaining the $|L|$ multivariate normal distributions characterized by $\mu_{w,l} \in \mathbb{R}^{d'_l}$ and $\Sigma_{w,l} \in \mathbb{R}^{d'_l \times d'_l}$. Finally, the per-batch and the per-label $FDD$ distances between the window and baseline distributions are computed (equation 7).
If the per-batch and/or the per-label distribution distances exceed the threshold values, drift is predicted. The former is essential to detect if the entire window is affected by drift. The latter characterizes drift and identifies the most impacted labels.
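The decision rule at the end of the online phase can be written in a few lines. The sketch below is only illustrative: the function name, the dictionary layout, and all numeric values are invented for the example.

```python
def detect_drift(batch_distance, label_distances, batch_threshold, label_thresholds):
    """Flag drift when the per-batch or any per-label FDD exceeds its threshold."""
    batch_drift = batch_distance > batch_threshold
    drifted_labels = sorted(
        l for l, dist in label_distances.items() if dist > label_thresholds[l]
    )
    return {
        "drift": batch_drift or bool(drifted_labels),
        "batch_drift": batch_drift,
        "drifted_labels": drifted_labels,
    }


# Made-up distances and thresholds for one window.
result = detect_drift(
    batch_distance=3.2,
    label_distances={"World": 5.0, "Business": 1.0, "Sport": 0.2},
    batch_threshold=2.5,
    label_thresholds={"World": 2.0, "Business": 1.5, "Sport": 1.0},
)
```

Returning the list of labels whose distance exceeds its own threshold is what supports drift characterization in addition to detection.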

Figure 4. DriftLens monitor example. It shows the per-batch and each per-label distribution distances ($FDD$) over time.

4.4.1. Drift monitoring over time.

The same process is repeated for each window. Once DriftLens processes a new window, the $FDD$ distribution distances are added to the drift monitor. The drift monitor shows the per-batch and per-label distribution distances in two separate charts. A warning symbol is added to the plot when a given distance exceeds the corresponding threshold. The drift monitor provides valuable insights to understand (i) whether and when drift occurs, (ii) the severity of the drift and its patterns, and (iii) which labels are the most affected by drift.

Figure 4 shows an example of the drift monitor, generated with the DriftLens web tool (Greco et al., 2024). The top chart shows the per-batch distribution distances, while the bottom chart shows the per-label distribution distances for all the windows. The x-axis reports the timestamps and the window identifiers, while the distribution distances ($FDD$) are on the y-axis. When drift is detected (i.e., the distribution distance in a given window is above the threshold), the area under the curve is filled, and a warning is displayed at the corresponding x-tick of the charts. In this case, the monitor shows that drift occurs for the first time after 50 windows with high severity, and then reoccurs periodically, alternating with non-drifted windows. The label World is the most affected by drift, followed by Business. Instead, the label Sports is negligibly impacted.

Table 2. Overview of the experimental use cases. Use cases are partitioned into groups based on the dataset and task. The training labels, the way the drift is simulated, and the split of the dataset are given in the description for each group. Within each group, different deep learning models are considered, and the corresponding F1 scores obtained on the test set are given.
| Data Type | Dataset | Task | Use Case | Models | F1 | Description |
|---|---|---|---|---|---|---|
| Text | Ag News | Topic Detection | 1.1 | BERT | 0.98 | Training Labels: World, Business, and Sport. Drift: Simulated with one new class label: Science/Tech. Dataset Split: 59,480 train - 5,700 test - 30,520 without drift data stream - 31,900 drifted dataset stream |
| | | | 1.2 | DistilBERT | 0.97 | |
| | | | 1.3 | RoBERTa | 0.98 | |
| Text | 20 Newsgroups | Topic Detection | 2.1 | BERT | 0.88 | Training Labels: Technology, Sale-Ads, Politics, Religion, Science (5 macro-labels). Drift: Simulated with one new class: Recreation (1 macro-label). Dataset Split: 5,080 train - 3,387 test - 5,560 without drift data stream - 3,655 drifted dataset stream |
| | | | 2.2 | DistilBERT | 0.87 | |
| | | | 2.3 | RoBERTa | 0.88 | |
| Text | 20 Newsgroups | Topic Detection | 3 | BERT | 0.87 | Training Labels: Training on 6 labels: 5 macro-labels (Technology, Sale-Ads, Politics, Religion, Science) and a subset of the macro-label Recreation (baseball and hockey). Drift: Simulated with another subset of the macro-label Recreation: motorcycles and autos. Dataset Split: 5,744 train - 3,829 test - 6,304 without drift data stream - 1,805 drifted dataset stream |
| Image | Intel Image | Image Classification | 4.1 | Vi-T | 0.90 | Training Labels: Forest, Glacier, Mountain, Building, Street. Drift: Simulated with one new class label: Sea. Dataset Split: 6,000 train - 4,000 test - 4,256 without drift data stream - 2,780 drifted dataset stream |
| | | | 4.2 | VGG16 | 0.89 | |
| Image | STL-10 | Image Classification | 5.1 | Vi-T | 0.96 | Training Labels: Airplane, Bird, Car, Cat, Deer, Dog, Horse, Monkey, Sheep. Drift: Simulated with one new class label: Truck. Dataset Split: 5,850 train - 2,925 test - 2,925 without drift data stream - 1,300 drifted dataset stream |
| | | | 5.2 | VGG16 | 0.82 | |
| Image | STL-10 | Image Classification | 6 | Vi-T | 0.90 | Training Labels: Airplane, Bird, Car, Cat, Deer, Dog, Horse, Monkey, Sheep, Truck. Drift: Simulated by introducing blur in the images within the same labels. Dataset Split: 6,500 train - 3,250 test - 3,250 without drift data stream - 3,250 drifted dataset stream |
| Speech | Common Voice | Gender Identification | 7 | Wav2Vec | 0.91 | Training Labels: Male, Female (US and UK accent). Drift: Introduced with speeches from same labels but different accent (Australian, Canadian, Scottish). Dataset Split: 70,578 train - 9,951 test - 29,556 without drift data stream - 42,697 drifted dataset stream |

5. Evaluation

We evaluate DriftLens on various deep learning classifiers for text, image, and speech data (introduced in §5.1). We first assess its drift detection performance (§5.2) and execution time (§5.3) in comparison with state-of-the-art drift detectors. Then, we evaluate its ability to accurately characterize drift (§5.4), and analyze its sensitivity to parameter setting (§5.5).

5.1. Experimental settings

To demonstrate the effectiveness of DriftLens in detecting different types of drift in different scenarios, we conduct experiments with several use cases, which are summarized in Table 2. Use cases are categorized into groups based on the dataset used, the task chosen, and the way the drift is simulated. Within each group, different deep learning models are considered and evaluated. By evaluating DriftLens on a variety of experiments, we aim to demonstrate the broad applicability and generalizability of the proposed framework.

Deep learning classifiers. We perform experiments with several deep learning models suitable for NLP (BERT (Devlin et al., 2019; https://huggingface.co/google-bert/bert-base-uncased), DistilBERT (Sanh et al., 2020; https://huggingface.co/distilbert/distilbert-base-uncased), and RoBERTa (Liu et al., 2019; https://huggingface.co/FacebookAI/roberta-base)), computer vision (VGG16 (Simonyan and Zisserman, 2014; https://keras.io/api/applications/vgg/) and Visual Transformer (Dosovitskiy et al., 2020; https://huggingface.co/google/vit-base-patch16-224)), and speech (Wav2Vec (Schneider et al., 2019; https://huggingface.co/facebook/wav2vec2-base)) classification tasks.

Datasets. Since we focus on working with unstructured data, we have selected several well-known datasets. In the NLP domain, we trained models for topic detection by using the Ag News (Zhang et al., 2015) and 20 Newsgroups (Mitchell, 1999) datasets. In the computer vision domain, we used the Intel Image (Rahimzadeh et al., 2021) and STL-10 (Coates et al., 2011) datasets for image classification. In the speech domain, we used the Common Voice (Ardila et al., 2020) dataset to train a model for classifying the gender of speakers.

Drift simulation. To investigate the applicability of DriftLens to different application scenarios, we simulated different drift sources. In use cases 1 to 5 in Table 2, drift is simulated by introducing a new unknown class label. Specifically, we used two subsets of the dataset to train (fine-tune) and test the models. A third part of the dataset is held out to generate the windows in the data stream (i.e., to simulate new, unseen data with a distribution similar to the training data). Drift is simulated by removing one of the class labels during training and presenting examples of that class in the data stream windows. In use case 3, the same dataset is used as in use case 2, but the nature of drift is more subtle, as only a subset of a class is used to simulate drift by exploiting the hierarchical categorization of the dataset. In contrast, in use cases 6 and 7 in Table 2, we simulate drift by changing the properties of the input features. In use case 6, drift is simulated by blurring the input images. Specifically, we introduce a Gaussian blur with radius 4 on a circular patch that covers an area of $D\%$ of the entire image, where $D\%$ corresponds to the percentage of drift we want to insert. For use case 7, drift is simulated instead by presenting the model with speech samples in English accents other than those used for training (i.e., Australian, Canadian, and Scottish).

Embedding extraction. For the BERT, DistilBERT, RoBERTa, and Visual Transformer (Vi-T) models, we extract the embedding of the [CLS] token from the last hidden layer (embedding dimensionality $d = 768$). For VGG16, we extract and flatten the last convolutional layer ($d = 4608$). For Wav2Vec, we extract and average all the representations of the last transformer hidden layer ($d = 768$).
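The three extraction strategies amount to simple array operations on the model outputs. The sketch below uses random arrays as stand-ins for real activations, so no model is loaded. The tensor layouts are assumptions for illustration; in particular, the 3 x 3 x 512 feature map is chosen only because it flattens to the reported $d = 4608$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a transformer's last hidden layer: (batch, sequence length, d).
hidden = rng.normal(size=(4, 16, 768))
cls_embedding = hidden[:, 0, :]       # first token, i.e., the [CLS] position
mean_embedding = hidden.mean(axis=1)  # average over all positions (Wav2Vec case)

# Stand-in for a last convolutional feature map: (batch, height, width, channels).
feature_map = rng.normal(size=(4, 3, 3, 512))
flat_embedding = feature_map.reshape(len(feature_map), -1)  # flatten per sample
```

Whatever the strategy, the result is one fixed-length vector per input, which is the only requirement of the subsequent data modeling.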

Windows generation. When generating windows of the data streams, we use samples with or without drift that the model has never seen during the training or testing phase. The size of each data split is specified in the last column of Table 2. When creating a window without drift, we randomly select a balanced sample by class label from the new data without drift. These examples represent new data, never seen by the model, but with the same distribution as the training data. When generating windows containing $D\%$ of drift, we randomly select $D\%$ of the window size ($m_w$) from the drifted samples, while the remaining $(100 - D)\%$ are balanced samples from the new unseen samples without drift. Use case 6 is the only exception: there, a percentage $D\%$ of the pixels in all images is blurred. Windows are sampled with replacement due to the limited size of the datasets (i.e., some samples may belong to more than one window).
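A possible way to assemble such a window is sketched below. The function `make_window`, the index layout, and the rounding of the number of drifted samples are assumptions made for illustration; the experiments in the paper may differ in these details.

```python
import numpy as np


def make_window(clean_idx_by_label, drift_idx, m_w, drift_pct, rng):
    """Indices for one window: drift_pct% drifted samples, the rest balanced by label."""
    n_drift = int(round(m_w * drift_pct / 100))
    n_clean = m_w - n_drift
    labels = list(clean_idx_by_label)
    per_label, extra = divmod(n_clean, len(labels))
    # Shuffle and slice, so that requesting zero drifted samples is also valid.
    window = list(rng.permutation(drift_idx)[:n_drift])
    for i, l in enumerate(labels):
        k = per_label + (1 if i < extra else 0)  # spread any remainder over labels
        window.extend(rng.permutation(clean_idx_by_label[l])[:k])
    return np.array(window)


rng = np.random.default_rng(0)
clean = {0: np.arange(0, 100), 1: np.arange(100, 200), 2: np.arange(200, 300)}
drifted = np.arange(1000, 1100)  # indices of drifted samples (hypothetical layout)

w_drift = make_window(clean, drifted, m_w=60, drift_pct=10, rng=rng)
w_clean = make_window(clean, drifted, m_w=60, drift_pct=0, rng=rng)
```

Calling the function repeatedly with the same pools reproduces the sampling with replacement across windows mentioned above, since a sample can be drawn for more than one window.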

DriftLens configuration. The following are the default experimental parameters of DriftLens. The offline phase exploits the entire training and test sets, the former for the baseline modeling and the latter for the threshold estimation. The size of the training and test sets for each use case is indicated in Table 2. We set $d' = 150$ as the number of principal components used to reduce the per-batch embedding, except for use case 7, where we set $d' = 25$ to obtain full-rank matrices. The number of windows in the threshold estimation is $n_{th} = 10{,}000$, and the threshold sensitivity is $T_\alpha = 0.01$. The window size is kept fixed in both the offline and online phases.

Comparison techniques. We compare DriftLens with four unsupervised statistical-based drift detection techniques from previous work: Maximum Mean Discrepancy (MMD; https://docs.seldon.io/projects/alibi-detect/en/stable/cd/methods/mmddrift.html) (Gretton et al., 2012), Kolmogorov-Smirnov (KS; https://docs.seldon.io/projects/alibi-detect/en/stable/cd/methods/ksdrift.html) (dos Reis et al., 2016), Least-Squares Density Difference (LSDD; https://docs.seldon.io/projects/alibi-detect/en/stable/cd/methods/lsdddrift.html) (Bu et al., 2018), and Cramér-von Mises (CVM; https://docs.seldon.io/projects/alibi-detect/en/stable/cd/methods/cvmdrift.html). We use the implementations provided by the Alibi Detect library (Van Looveren et al., 2019) and keep the default parameter configuration of each technique. The p-value used to discriminate between drifted and non-drifted distributions is left at its default of 0.05. All the techniques require a reference dataset. For computational complexity reasons, the reference dataset is obtained by randomly sampling a subset of the training set with $m_b = 5000$ balanced samples, except for the 20 Newsgroups dataset, where we used $m_b \approx 1700$ for use case 2 and $m_b \approx 2050$ for use case 3 to keep them balanced. Similarly to our framework, these techniques take the embedding vectors as input.

5.2. Drift detection performance evaluation

This evaluation aims to determine the effectiveness of DriftLens in detecting windows containing drifted samples of varying severity (to answer RQ1 in §1). We treat the drift detection problem as a binary classification task. The task is to predict whether a window of new samples contains drift. We perform the evaluation for different models, data types, and tasks to assess the general applicability of DriftLens (to answer RQ2 in §1).

Evaluation metrics. We use the accuracy as an evaluation measure for the drift prediction performance with different degrees of severity $D\% \in \{0\%, 5\%, 10\%, 15\%, 20\%\}$. If a window contains any percentage of drifted examples $D\%$, the ground truth is set to 1; otherwise, it is set to 0. Data windows without drift (i.e., $D = 0\%$) are used to measure type I errors (i.e., false alarms, where there is no drift but the technique detects one). For each drift percentage $D\%$, use case, and window size $m_w$, we randomly draw 100 windows. Each window contains $m_w$ samples, with $D\%$ of the samples drawn from the drift dataset. The accuracy is calculated over the 100 windows, repeated 5 times, and then averaged. At each run, the DriftLens threshold is re-estimated and the reference set of the other detectors is re-sampled. However, since the predictions of drifted and non-drifted windows are closely related, a technique that always predicts drift should be considered unreliable. Therefore, we first calculate the average accuracy in detecting drift:

(8) $\bar{A}_{\text{drift}} = (A_{5\%} + A_{10\%} + A_{15\%} + A_{20\%}) / 4$

Then, we introduce the Harmonic Drift Detection mean ($H_{DD}$), computed between the accuracy for non-drifted windows $A_{0\%}$ and the mean accuracy for drifted windows $\bar{A}_{\text{drift}}$, as follows:

(9) $H_{DD} = \frac{2}{\frac{1}{A_{0\%}} + \frac{1}{\bar{A}_{\text{drift}}}}$

The $H_{DD}$ is a real number in the range $[0, 1]$ that measures the overall quality of the drift detector in distinguishing between windows with and without drift. The $H_{DD}$ disfavors techniques that always predict the presence or absence of drift (i.e., $A_{0\%} = 0$ or $\bar{A}_{\text{drift}} = 0$). In such cases, $H_{DD} = 0$.
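As a worked example of Equations (8) and (9), the sketch below computes the $H_{DD}$ from a dictionary of accuracies keyed by drift percentage. The function names are illustrative assumptions; the first set of accuracies is the DriftLens row for use case 1.1 with $m_w = 500$ in Table 3, and the second mimics a detector that always predicts drift.

```python
def mean_drift_accuracy(acc_by_drift):
    """Eq. (8): mean accuracy over the windows that contain drift (D% > 0)."""
    drifted = [acc for d, acc in acc_by_drift.items() if d > 0]
    return sum(drifted) / len(drifted)

def harmonic_drift_detection(acc_by_drift):
    """Eq. (9): harmonic mean of A_0% and the mean drift accuracy."""
    a_no_drift = acc_by_drift[0]
    a_drift = mean_drift_accuracy(acc_by_drift)
    if a_no_drift == 0 or a_drift == 0:
        return 0.0  # degenerate detector; also avoids a division by zero
    return 2 / (1 / a_no_drift + 1 / a_drift)

# Accuracies keyed by drift percentage D%.
driftlens_1_1 = {0: 0.99, 5: 0.83, 10: 1.00, 15: 1.00, 20: 1.00}
always_drift = {0: 0.00, 5: 1.00, 10: 1.00, 15: 1.00, 20: 1.00}

print(round(harmonic_drift_detection(driftlens_1_1), 2))  # 0.97, as in Table 3
print(harmonic_drift_detection(always_drift))             # 0.0
```

The second case shows why a plain average would be misleading: a detector that flags every window reaches perfect accuracy on all drifted windows, yet its $H_{DD}$ collapses to 0.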

Table 3. Drift detection performance evaluation for larger data volumes. For each drift detector and window size ($m_w \in \{500, 1000, 2000\}$), the table reports (i) the accuracy for each drift percentage ($D\% \in \{0\%, 5\%, 10\%, 15\%, 20\%\}$) and (ii) the harmonic drift detection mean $H_{DD}$. Each accuracy is computed over 100 windows and averaged over 5 runs. The best-performing detector based on the $H_{DD}$ for each use case and window size is in bold.
Use Case, Drift Detector, then one group of six columns per data stream window size $m_w$: $m_w = 500$, $m_w = 1000$, $m_w = 2000$
Each group: accuracy per Drift Percentage $D\%$, then $H_{DD}$
Case Detector 0% 5% 10% 15% 20% $H_{DD}$ 0% 5% 10% 15% 20% $H_{DD}$ 0% 5% 10% 15% 20% $H_{DD}$
1.1 MMD 1.00 0.00 0.15 0.96 1.00 0.69 1.00 0.00 0.73 1.00 1.00 0.81 1.00 0.00 1.00 1.00 1.00 0.86
KS 1.00 0.00 0.14 0.95 1.00 0.69 1.00 0.00 0.83 1.00 1.00 0.97 1.00 0.24 1.00 1.00 1.00 0.90
LSDD 1.00 0.00 0.08 0.87 1.00 0.66 1.00 0.00 0.37 1.00 1.00 0.74 1.00 0.03 1.00 1.00 1.00 0.86
CVM 1.00 0.00 0.15 0.96 1.00 0.69 1.00 0.03 0.84 1.00 1.00 0.84 1.00 0.31 1.00 1.00 1.00 0.91
DriftLens 0.99 0.83 1.00 1.00 1.00 0.97 1.00 0.98 1.00 1.00 1.00 1.00 0.97 1.00 1.00 1.00 1.00 0.98
1.2 MMD 1.00 0.00 0.06 0.83 1.00 0.64 1.00 0.00 0.73 1.00 1.00 0.81 1.00 0.00 1.00 1.00 1.00 0.86
KS 1.00 0.01 0.21 0.76 1.00 0.66 1.00 0.02 0.67 1.00 1.00 0.80 1.00 0.04 1.00 1.00 1.00 0.86
LSDD 1.00 0.00 0.03 0.51 0.98 0.55 1.00 0.00 0.32 1.00 1.00 0.73 1.00 0.00 0.99 1.00 1.00 0.86
CVM 1.00 0.00 0.08 0.86 1.00 0.65 0.99 0.02 0.80 1.00 1.00 0.82 1.00 0.09 1.00 1.00 1.00 0.87
DriftLens 0.99 0.71 1.00 1.00 1.00 0.96 0.99 0.98 1.00 1.00 1.00 0.99 0.99 1.00 1.00 1.00 1.00 0.99
1.3 MMD 1.00 0.00 0.14 0.95 1.00 0.69 1.00 0.00 0.97 1.00 1.00 0.85 1.00 0.00 1.00 1.00 1.00 0.86
KS 1.00 0.00 0.17 0.78 1.00 0.66 1.00 0.01 0.75 1.00 1.00 0.82 1.00 0.02 1.00 1.00 1.00 0.86
LSDD 1.00 0.00 0.12 0.94 1.00 0.68 1.00 0.00 0.82 1.00 1.00 0.83 1.00 0.00 1.00 1.00 1.00 0.86
CVM 1.00 0.01 0.06 0.86 1.00 0.65 1.00 0.01 0.84 1.00 1.00 0.83 1.00 0.02 1.00 1.00 1.00 0.86
DriftLens 1.00 0.09 0.98 1.00 1.00 0.87 1.00 0.47 1.00 1.00 1.00 0.93 1.00 0.96 1.00 1.00 1.00 1.00
7 MMD 0.15 0.94 0.99 1.00 1.00 0.26 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
KS 0.15 0.90 0.96 0.99 1.00 0.26 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
LSDD 0.05 0.98 0.99 1.00 1.00 0.10 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
CVM 0.08 0.94 0.98 1.00 1.00 0.15 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 0.97 0.07 0.17 0.27 0.42 0.38 0.93 0.19 0.38 0.58 0.85 0.65 0.84 0.42 0.75 0.95 0.99 0.81
Table 4. Drift detection performance evaluation for smaller data volumes. For each drift detector and window size ($m_w \in \{250, 500, 1000\}$), the table reports (i) the accuracy for each drift percentage ($D\% \in \{0\%, 5\%, 10\%, 15\%, 20\%\}$) and (ii) the harmonic drift detection mean $H_{DD}$. Each accuracy is computed over 100 windows and averaged over 5 runs. The best-performing detector based on the $H_{DD}$ for each use case and window size is in bold.
Use Case, Drift Detector, then one group of six columns per data stream window size $m_w$: $m_w = 250$, $m_w = 500$, $m_w = 1000$
Each group: accuracy per Drift Percentage $D\%$, then $H_{DD}$
Case Detector 0% 5% 10% 15% 20% $H_{DD}$ 0% 5% 10% 15% 20% $H_{DD}$ 0% 5% 10% 15% 20% $H_{DD}$
2.1 MMD 0.83 0.33 0.74 0.98 1.00 0.80 0.06 1.00 1.00 1.00 1.00 0.11 0.00 1.00 1.00 1.00 1.00 0.00
KS 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
LSDD 1.00 0.01 0.04 0.19 0.48 0.31 0.98 0.14 0.41 0.76 0.96 0.72 0.61 0.65 0.93 1.00 1.00 0.72
CVM 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 0.92 0.28 0.68 0.96 1.00 0.81 0.89 0.42 0.92 1.00 1.00 0.86 0.84 0.78 1.00 1.00 1.00 0.89
2.2 MMD 1.00 0.01 0.08 0.61 0.98 0.59 0.95 0.12 0.63 1.00 1.00 0.80 0.56 0.72 1.00 1.00 1.00 0.70
KS 0.42 0.68 0.89 1.00 1.00 0.57 0.01 0.99 1.00 1.00 1.00 0.02 0.00 1.00 1.00 1.00 1.00 0.00
LSDD 1.00 0.00 0.01 0.05 0.25 0.14 1.00 0.01 0.06 0.35 0.83 0.48 0.98 0.03 0.33 0.92 0.99 0.72
CVM 0.31 0.74 0.90 1.00 1.00 0.46 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 1.00 0.15 0.72 0.99 1.00 0.83 1.00 0.23 0.95 1.00 1.00 0.89 1.00 0.40 1.00 1.00 1.00 0.92
2.3 MMD 1.00 0.00 0.06 0.40 0.91 0.51 0.96 0.22 0.78 1.00 1.00 0.84 0.37 0.95 1.00 1.00 1.00 0.54
KS 0.33 0.84 0.06 0.40 0.91 0.41 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
LSDD 1.00 0.00 0.00 0.03 0.09 0.06 1.00 0.00 0.03 0.17 0.51 0.30 0.97 0.09 0.36 0.86 1.00 0.72
CVM 0.26 0.87 0.98 1.00 1.00 0.41 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 0.98 0.08 0.26 0.57 0.88 0.61 0.98 0.07 0.34 0.82 0.99 0.71 0.99 0.07 0.53 0.98 1.00 0.78
3 MMD 1.00 0.00 0.04 0.59 0.98 0.57 0.99 0.08 0.76 1.00 1.00 0.83 0.48 0.89 1.00 1.00 1.00 0.61
KS 0.01 1.00 1.00 1.00 1.00 0.02 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
LSDD 1.00 0.00 0.00 0.03 0.35 0.17 1.00 0.00 0.02 0.55 0.99 0.56 1.00 0.01 0.41 1.00 1.00 0.75
CVM 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 0.97 0.21 0.65 0.96 1.00 0.82 0.98 0.35 0.94 1.00 1.00 0.89 0.98 0.63 1.00 1.00 1.00 0.94
4.1 MMD 1.00 0.00 0.00 0.43 0.99 0.52 1.00 0.00 0.18 1.00 1.00 0.71 1.00 0.00 0.97 1.00 1.00 0.85
KS 1.00 0.00 0.02 0.39 0.97 0.51 1.00 0.00 0.18 0.99 1.00 0.70 1.00 0.01 0.92 1.00 1.00 0.85
LSDD 1.00 0.00 0.00 0.20 0.73 0.38 1.00 0.00 0.05 0.86 1.00 0.65 1.00 0.00 0.57 1.00 1.00 0.78
CVM 1.00 0.00 0.02 0.45 0.99 0.54 1.00 0.00 0.25 1.00 1.00 0.72 1.00 0.01 0.98 1.00 1.00 0.86
DriftLens 1.00 0.01 0.30 1.00 1.00 0.73 1.00 0.00 0.35 1.00 1.00 0.74 1.00 0.00 0.50 1.00 1.00 0.77
4.2 MMD 1.00 0.00 0.09 0.67 0.99 0.61 1.00 0.01 0.47 0.99 1.00 0.76 1.00 0.05 0.98 1.00 1.00 0.86
KS 1.00 0.00 0.00 0.02 0.32 0.16 1.00 0.00 0.00 0.31 0.97 0.47 1.00 0.00 0.15 1.00 1.00 0.70
LSDD 0.98 0.05 0.05 0.07 0.10 0.13 0.98 0.04 0.07 0.13 0.22 0.21 0.96 0.03 0.11 0.20 0.47 0.33
CVM 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 0.95 0.06 0.17 0.61 0.94 0.61 0.94 0.14 0.66 0.99 1.00 0.80 0.93 0.31 0.98 1.00 1.00 0.87
5.1 MMD 1.00 0.00 0.06 1.00 1.00 0.68 1.00 0.00 1.00 1.00 1.00 0.86 1.00 0.07 1.00 1.00 1.00 0.87
KS 0.97 0.11 0.33 0.99 1.00 0.75 0.83 0.54 0.97 1.00 1.00 0.85 0.37 0.98 1.00 1.00 1.00 0.54
LSDD 1.00 0.00 0.01 1.00 1.00 0.67 1.00 0.00 0.97 1.00 1.00 0.85 1.00 0.03 1.00 1.00 1.00 0.86
CVM 0.95 0.14 0.39 0.99 1.00 0.76 0.75 0.62 0.97 1.00 1.00 0.82 0.26 0.99 1.00 1.00 1.00 0.41
DriftLens 0.96 0.82 1.00 1.00 1.00 0.96 0.96 1.00 1.00 1.00 1.00 0.98 0.97 1.00 1.00 1.00 1.00 0.99
5.2 MMD 0.98 0.03 0.25 0.93 1.00 0.71 0.99 0.04 0.76 1.00 1.00 0.82 0.99 0.03 1.00 1.00 1.00 0.86
KS 1.00 0.00 0.01 0.11 0.57 0.29 1.00 0.00 0.05 0.70 1.00 0.61 1.00 0.01 0.32 1.00 1.00 0.74
LSDD 0.95 0.04 0.05 0.20 0.63 0.63 0.99 0.03 0.05 0.72 1.00 0.62 0.95 0.06 0.43 1.00 1.00 0.75
CVM 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00 0.00 1.00 1.00 1.00 1.00 0.00
DriftLens 0.98 0.00 0.05 0.20 0.60 0.35 1.00 0.04 0.12 0.77 1.00 0.65 1.00 0.07 0.70 1.00 1.00 0.82
6 MMD 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
KS 0.93 1.00 1.00 1.00 1.00 0.96 0.65 1.00 1.00 1.00 1.00 0.79 0.11 1.00 1.00 1.00 1.00 0.20
LSDD 1.00 0.00 0.03 0.24 0.77 0.41 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
CVM 0.92 1.00 1.00 1.00 1.00 0.96 0.61 1.00 1.00 1.00 1.00 0.76 0.08 1.00 1.00 1.00 1.00 0.15
DriftLens 0.95 1.00 1.00 1.00 1.00 0.97 0.93 1.00 1.00 1.00 1.00 0.96 0.92 1.00 1.00 1.00 1.00 0.96

Results. Table 3 and Table 4 show the drift prediction accuracy, broken down by severity, and the $H_{DD}$ score for all the techniques and use cases. Table 3 uses larger data windows than Table 4 because its datasets contain more samples. For each use case, each drift detector, and each window size, the following values are reported: (i) the mean accuracy by drift percentage, and (ii) the overall harmonic drift detection mean $H_{DD}$.

For larger data volumes (Table 3), DriftLens achieves the best drift prediction performance in all the experimental use cases, independently of the data type and window size. For use cases 1.1, 1.2, and 1.3, it is particularly effective in detecting drift, achieving $H_{DD} \geq 0.93$, except for use case 1.3 with window size 500, where the $H_{DD}$ is slightly lower (0.87). Interestingly, DriftLens is the only effective technique for the speech task (use case 7). The better performance with a high volume of data is probably due to the $FDD$ distance, which is more reliable with large data volumes when estimating the reference distributions and the threshold values.

For smaller data volumes (Table 4), DriftLens is overall the most effective technique. However, for some use cases (e.g., 5.2 and 6), other techniques such as MMD or LSDD achieve better $H_{DD}$ scores. Surprisingly, all detectors are able to identify the drift simulated with blur (use case 6). However, some of them (KS and CVM) exhibit a large number of false positives, especially for larger window sizes. In addition, DriftLens is the only technique that consistently achieves effective performance across all the use cases and window sizes ($H_{DD} \geq 0.60$), except for use case 5.2 with window size 250. In contrast, each of the other techniques has some use cases for which it is totally unreliable ($H_{DD} = 0$).

In summary, DriftLens is highly effective in detecting windows containing drift (answering RQ1 in §1). It is also the most reliable and generally applicable technique across models, datasets, and data volumes, as it is the only detector that achieves good performance over all the experimental use cases (answering RQ2 in §1).

5.3. Complexity evaluation

This evaluation aims to ascertain the ability of DriftLens to perform near real-time drift detection. To this end, we compare the running time of the drift detectors by varying the reference and data stream window sizes, and the embedding dimensionality.

Evaluation metrics. We measure the mean running time in seconds to provide the drift prediction for each data window, given the embeddings already extracted. The experiments are executed on an Apple M1 MacBook Pro 13 2020 with 16GB of RAM.
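A timing harness of this kind can be sketched as below. It is only an illustration of the measurement protocol (mean and standard deviation over repeated runs on pre-extracted embeddings): `time_detector` and `dummy_detector` are hypothetical names, and the placeholder detector merely averages the embeddings rather than performing any real drift test.

```python
import statistics
import time

def time_detector(predict, window, runs=5):
    """Mean and std of the wall-clock time of predict(window), in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(window)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

def dummy_detector(window):
    # Placeholder workload (assumption): per-dimension mean of the embeddings.
    dims = len(window[0])
    return [sum(row[j] for row in window) / len(window) for j in range(dims)]

# Toy window of 1000 pre-extracted embeddings with 16 dimensions each.
window = [[float(i + j) for j in range(16)] for i in range(1000)]
mean_s, std_s = time_detector(dummy_detector, window)
print(f"{mean_s:.6f} s +/- {std_s:.6f} s")
```

Embedding extraction is deliberately kept outside the timed region, matching the statement that the embeddings are already extracted when the running time is measured.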

Results. Figure 5 shows the running time in seconds, on a logarithmic scale, of each drift detector when varying (a) the reference window size $m_b$, (b) the data stream window size $m_w$, and (c) the embedding dimensionality $d$. One dimension at a time is varied, and the others are held fixed at the following values: window size $m_w = 1000$, embedding dimensionality $d = 1000$, and reference window size $m_b = 5000$. DriftLens outperforms all the evaluated techniques in terms of running time, running at least 5 times faster. Moreover, its execution time increases almost negligibly as the window sizes and the embedding dimensionality increase, while the other techniques are highly affected by them.

Figure 6 shows the running time of DriftLens when dealing with a large volume of data. In this case, the reference window is increased up to $500k$ samples, and the data stream window up to $10k$ samples. The other drift detectors do not work at these data sizes. When varying the reference window size ($m_b$), the data stream window size is kept fixed at $m_w = 5000$. Instead, when varying the data stream window size ($m_w$), the reference window size is kept fixed at $m_b = 500k$. Figure 6-(a) confirms that the running time of DriftLens is almost independent of the size of the reference window, as only the distributions are loaded in the online phase (mean vector and covariance matrix with dimensionality $d'$). Additionally, Figure 6-(b) shows that the running time increases almost negligibly as the window size increases. DriftLens can, therefore, be used for real-time drift detection even on data streams with high throughput. Notably, the running time is always below 0.2 seconds.

In summary, DriftLens is the fastest detector in terms of running time. It can detect drift in near real-time independently of the amount of data in the reference set and in the data stream, as well as the embedding dimensionality (answering RQ3 in §1).

Figure 5. Running time comparison. For each drift detector, the mean and the standard deviation of the running time in seconds to process a new window are reported by varying (a) the reference window size, (b) the size of the data stream window, and (c) the embedding dimensionality, while keeping the other sizes fixed. The fixed values are: reference window size $m_b = 5000$, window size $m_w = 1000$, and embedding dimensionality $d = 1000$. Mean and std. are computed over 5 runs. Time is on a logarithmic scale.
Figure 6. DriftLens mean running time in seconds as the reference $m_b$ (a) and data stream $m_w$ (b) window sizes change.

5.4. Drift curve evaluation

This evaluation aims to measure the ability of DriftLens to coherently represent and characterize the drift curve.

Evaluation metrics. We use the Spearman correlation coefficient (ref, 2008) to measure the correlation between the per-batch distances ($FDD$) over time and the curve of injected drift. The Spearman correlation evaluates the monotonic relationship between two variables (i.e., whether they tend to move in the same direction, though not necessarily at a constant rate). Thus, it is more suitable for the non-linearities present in the evaluated patterns (e.g., sudden and periodic). The coefficient ranges from -1 to +1, where +1 (-1) indicates a perfect positive (negative) monotonic relationship, and 0 indicates no monotonic relationship. The curve of injected drift is 0 in the windows without drift and equal to the percentage of drift ($D\%$) in the windows containing some drift. We also qualitatively evaluate the per-batch and per-label drift curves for three drift patterns.

Results. Table 5 reports the mean and the standard deviation of the Spearman correlation coefficient computed over all the experimental use cases. The data streams are generated by randomly sampling 100 windows containing $1000$ samples each. In the sudden pattern, drift comes after 50 windows and is constant with a percentage of $D\% = 40\%$. In the incremental pattern, drift comes after 50 windows with a percentage of $D\% = 20\%$ and increases by $\Delta D\% = 1\%$ after each window. In the periodic pattern, 20 windows without drift and 20 windows containing $D\% = 40\%$ of drift reoccur periodically. The experiments are repeated 5 times and averaged.
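The three injected drift curves and the correlation measure can be sketched as follows. This is a stdlib-only illustration under stated assumptions: the function names are hypothetical, the Spearman coefficient is computed as the Pearson correlation of average ranks (the standard treatment of ties), and the "distances" are a toy monotonic function of $D\%$, not real $FDD$ values.

```python
def injected_drift_curve(pattern, n_windows=100):
    """Injected drift percentage D% for each window of the stream."""
    curve = []
    for w in range(n_windows):
        if pattern == "sudden":
            d = 40 if w >= 50 else 0
        elif pattern == "incremental":
            d = 20 + (w - 50) if w >= 50 else 0
        elif pattern == "periodic":
            d = 40 if (w // 20) % 2 == 1 else 0
        else:
            raise ValueError(pattern)
        curve.append(d)
    return curve

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined for a constant curve

curves = {p: injected_drift_curve(p) for p in ("sudden", "incremental", "periodic")}
# Toy per-batch distances (assumption): nonlinear but monotonic in D%.
toy_distances = {p: [0.1 + (d / 100) ** 2 for d in c] for p, c in curves.items()}
correlations = {p: spearman(curves[p], toy_distances[p]) for p in curves}
print(correlations)
```

Because the toy distances are a monotonic transformation of the injected curve, the coefficient is 1 up to floating-point error even though the relationship is not linear, which is the property that motivates the choice of Spearman over a linear correlation.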

Table 5 reveals that the per-batch distance ($FDD$) is highly correlated with the generated drift curves. The Spearman correlation coefficient is at least $0.85$ (after rounding) for the three considered drift patterns. Notably, for the incremental pattern, it is almost $1$. These results quantitatively demonstrate the ability of DriftLens to correctly characterize the drift trend over time.

Figure 7 and Figure 8 show two examples of DriftLens monitors obtained by generating the sudden (a), incremental (b), and periodic (c) drift patterns, using the previously described settings. For both use cases, the per-batch and per-label drift curves are coherent with the generated pattern. In use case 1.1 (Figure 7), the labels World (red) and Business (green) are the most impacted by drift. This probably happens because most of the examples of the new injected class (i.e., Science/Technology) are classified with those labels. Similarly, in use case 4.1 (Figure 8), the most impacted labels are Mountain (green) and Street (orange). These plots show that the per-label $FDD$ curves provide valuable insights into the drift characterization by showing which labels are the most impacted by drift.

In summary, the trend of the drift curve generated by DriftLens is highly coherent with the amount of drift present, and thanks to the per-label analysis, it is able to provide useful insights for characterizing the drift (answering RQ4 in §1).

Table 5. Drift patterns evaluation. Spearman correlation between the amount of drift and the per-batch distribution distance ($FDD$) for different drift patterns and use cases.
Drift Pattern
Sudden Incremental Recurrent
Corr. $0.875 \pm .02$ $0.993 \pm .01$ $0.849 \pm .00$
Figure 7. Drift patterns qualitative evaluation for use case 1.1: (a) sudden drift, (b) incremental drift, (c) periodic drift.
Figure 8. Drift patterns qualitative evaluation for use case 4.1: (a) sudden drift, (b) incremental drift, (c) periodic drift.
Table 6. Parameters sensitivity.
Use Case, Parameter, accuracy per Drift Percentage $D\%$, then $H_{DD}$
Case Parameter 0% 5% 10% 15% 20% $H_{DD}$
Number of sampled windows for threshold estimation $n_{th}$
1.1 $n_{th} = 1k$ 1.00 1.00 1.00 1.00 1.00 1.00
$n_{th} = 5k$ 0.99 0.99 1.00 1.00 1.00 0.99
$n_{th} \in \{100, \underline{10k}, 25k\}$ 0.99 1.00 1.00 1.00 1.00 0.99
5.1 $n_{th} = 1k$ 0.93 1.00 1.00 1.00 1.00 0.96
$n_{th} \in \{5k, \underline{10k}, 15k, 20k\}$ 0.96 1.00 1.00 1.00 1.00 0.98
Threshold sensitivity $T_\alpha$
1.1 $T_\alpha = 0.00$ 1.00 0.82 1.00 1.00 1.00 0.97
$T_\alpha = \underline{0.01}$ 0.99 1.00 1.00 1.00 1.00 0.99
$T_\alpha = 0.05$ 0.95 1.00 1.00 1.00 1.00 0.97
$T_\alpha = 0.10$ 0.90 1.00 1.00 1.00 1.00 0.95
$T_\alpha = 0.25$ 0.75 1.00 1.00 1.00 1.00 0.86
5.1 $T_\alpha = 0.00$ 1.00 1.00 1.00 1.00 1.00 1.00
$T_\alpha = \underline{0.01}$ 0.97 1.00 1.00 1.00 1.00 0.98
$T_\alpha = 0.05$ 0.90 1.00 1.00 1.00 1.00 0.95
$T_\alpha = 0.10$ 0.83 1.00 1.00 1.00 1.00 0.91
$T_\alpha = 0.25$ 0.68 1.00 1.00 1.00 1.00 0.81
Number of principal components $d'$
1.1 $d' = 50$ 0.97 1.00 1.00 1.00 1.00 0.98
$d' = 100$ 0.99 0.99 1.00 1.00 1.00 0.99
$d' \in \{\underline{150}, 200, 250\}$ 0.99 1.00 1.00 1.00 1.00 0.99
5.1 $d' = 50$ 0.95 1.00 1.00 1.00 1.00 0.97
$d' \in \{100, \underline{150}, 200, 250\}$ 0.97 1.00 1.00 1.00 1.00 0.98

5.5. Parameters sensitivity evaluation

This evaluation aims to determine the robustness and sensitivity of DriftLens to its parameters. To this end, we evaluate its performance in drift prediction by varying the values of the following parameters: (i) the number of randomly sampled windows used to estimate the threshold $n_{th}$, (ii) the threshold sensitivity parameter $T_\alpha$, and (iii) the number of principal components used to reduce the dimensionality of the embedding $d'$.

Evaluation metrics. We measure the accuracy in predicting drift with different severity levels and the $H_{DD}$.

Results. Table 6 reports the accuracy and $H_{DD}$ obtained by varying one parameter at a time while keeping the others fixed. The default fixed values are underlined and set as follows: $n_{th} = 10k$, $T_\alpha = 0.01$, and $d' = 150$. The experiments are repeated 5 times and averaged for use cases 1.1 and 5.1.

The results indicate that varying the parameter values has minimal impact on performance, with a maximum reduction of 0.03. The only exception is the threshold sensitivity $T_\alpha$. As this parameter increases, DriftLens reduces its estimated threshold values, resulting in an increase in false positives (normal windows incorrectly identified as drift). This leads to reduced accuracy, especially when there is no actual drift ($D\% = 0\%$). However, we can conclude that the performance of DriftLens is not significantly affected by the choice of parameters, except for the threshold sensitivity.
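The effect of $T_\alpha$ on false positives can be illustrated with a small simulation. The sketch below assumes, for illustration only, that the threshold is the largest distance left after discarding the top $T_\alpha$ fraction of the distances computed on $n_{th}$ drift-free windows; the function name and the Gaussian toy distances are hypothetical and are not taken from the DriftLens implementation.

```python
import random

def estimate_threshold(distances, t_alpha):
    """Largest distance left after discarding the top t_alpha fraction (assumed rule)."""
    ordered = sorted(distances, reverse=True)
    n_discard = min(round(len(ordered) * t_alpha), len(ordered) - 1)
    return ordered[n_discard]

rng = random.Random(0)
# Toy distances of drift-free windows: one set to fit the threshold, one to test it.
fit_distances = [rng.gauss(1.0, 0.1) for _ in range(10_000)]
new_distances = [rng.gauss(1.0, 0.1) for _ in range(10_000)]

alphas = (0.0, 0.01, 0.05, 0.10, 0.25)
thresholds = [estimate_threshold(fit_distances, a) for a in alphas]
# A window is flagged as drifted when its distance exceeds the threshold.
false_alarm_rates = [
    sum(d > t for d in new_distances) / len(new_distances) for t in thresholds
]
for a, t, far in zip(alphas, thresholds, false_alarm_rates):
    print(f"T_alpha={a:.2f}  threshold={t:.3f}  false alarms={far:.3f}")
```

Under this assumed rule the false-alarm rate on drift-free windows is roughly $T_\alpha$, which is consistent with the $A_{0\%}$ values reported in Table 6 for use case 1.1 (0.99, 0.95, 0.90, and 0.75 for $T_\alpha$ of 0.01, 0.05, 0.10, and 0.25).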

6. Conclusion

This paper presents DriftLens, an unsupervised drift detection framework for deep learning models and unstructured data. It can be used to continuously monitor deep learning production models to detect whether and when drift occurs to increase their reliability and robustness in real-world production applications. Our experiments show that DriftLens is effective in detecting drift and also more efficient than state-of-the-art techniques. Thanks to its fast execution time, it enables the detection of concept drift in real-time. Moreover, it correctly represents the drift trend over time and characterizes each drifting label.

Limitations. DriftLens uses the Fréchet distance to compute the distances between the baseline and the new window distributions (i.e., $FDD$). It can, therefore, inherit some of its limitations: (1) Noise and selection bias with small datasets. For small datasets, the $FDD$ can potentially be affected by noise and selection bias in the distance calculation. However, we show empirically that DriftLens performs well even with smaller reference data and window sizes. (2) Limited number of statistics used. The $FDD$ score uses a limited number of statistics in the distance calculation (i.e., mean and covariance). Therefore, it may not cover all aspects of the distributions. For example, while it covers the first two moments of the distribution, it does not take into account higher-order moments (e.g., skewness and kurtosis). (3) Distance score range. The $FDD$ score takes values in the range $[0, \infty)$. However, it would be more interpretable in the range $[0, 1]$. In the future, we can overcome this limitation by expressing the distance as a value relative to the threshold. Finally, (4) we always addressed drift in windows with balanced label distributions in our evaluations. However, in many scenarios, this hypothesis might not hold.
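Limitation (2) can be made concrete in one dimension, where the Fréchet distance between two fitted Gaussians reduces to $(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$. The sketch below is a simplified illustration with hypothetical helper names and synthetic data, not the multivariate $FDD$ used by DriftLens: two samples with matching mean and standard deviation but very different skewness obtain a distance of essentially zero.

```python
import random
import statistics

def frechet_distance_1d(x, y):
    """Frechet distance between 1-D Gaussians fitted to samples x and y."""
    mu_x, mu_y = statistics.fmean(x), statistics.fmean(y)
    sd_x, sd_y = statistics.pstdev(x), statistics.pstdev(y)
    # With scalar covariances the trace term reduces to (sd_x - sd_y)^2.
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2

def skewness(x):
    """Third standardized moment of the sample."""
    mu, sd = statistics.fmean(x), statistics.pstdev(x)
    return sum(((v - mu) / sd) ** 3 for v in x) / len(x)

def standardize(x):
    """Rescale the sample to zero mean and unit standard deviation."""
    mu, sd = statistics.fmean(x), statistics.pstdev(x)
    return [(v - mu) / sd for v in x]

rng = random.Random(0)
symmetric = standardize([rng.gauss(0.0, 1.0) for _ in range(5000)])
skewed = standardize([rng.expovariate(1.0) for _ in range(5000)])
shifted = [v + 0.5 for v in symmetric]

print(frechet_distance_1d(symmetric, skewed))   # ~0: the shape change is invisible
print(skewness(symmetric), skewness(skewed))    # clearly different skewness
print(frechet_distance_1d(symmetric, shifted))  # ~0.25: the mean shift is detected
```

A drift that changes only the shape of the distribution, leaving mean and covariance untouched, would therefore go unnoticed by a distance built on these two statistics alone.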

In future work, we would like to (i) integrate explainability techniques based on the embedding representations (Ventura et al., 2022, 2023) or concept-based explanations (Poeta et al., 2023) to propose an explainable concept drift detection tool; (ii) address the problem of drift adaptation in unsupervised or supervised environments with limited labels; (iii) test the drift detectors in scenarios with unbalanced data distributions; (iv) extend the proposed approach to tasks other than classification; and (v) detect concept drift on multimodal models.

References

  • ref (2008) 2008. Spearman Rank Correlation Coefficient. Springer New York, New York, NY, 502–505. https://doi.org/10.1007/978-0-387-32833-1_379
  • Adams et al. (2023) Jan Niklas Adams, Cameron Pitsch, Tobias Brockhoff, and Wil M. P. van der Aalst. 2023. An Experimental Evaluation of Process Concept Drift Detection. Proc. VLDB Endow. 16, 8 (apr 2023), 1856–1869. https://doi.org/10.14778/3594512.3594517
  • Ardila et al. (2020) Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4218–4222. https://aclanthology.org/2020.lrec-1.520
  • Arora et al. (2023) Shruti Arora, Rinkle Rani, and Nitin Saxena. 2023. SETL: a transfer learning based dynamic ensemble classifier for concept drift detection in streaming data. Cluster Computing (Oct. 2023). https://doi.org/10.1007/s10586-023-04149-w
  • Baena-García et al. (2006) M Baena-García, J Del Campo-Ávila, R Fidalgo, A Bifet, R Gavalda, and R Morales-Bueno. 2006. Early drift detection method. Fourth international workshop on knowledge discovery from data streams 6 (2006), 77–86.
  • Bashir et al. (2016) Sulaimon A. Bashir, Andrei Petrovski, and Daniel Doolan. 2016. UDetect: Unsupervised Concept Change Detection for Mobile Activity Recognition. In Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi Media (Singapore, Singapore) (MoMM ’16). Association for Computing Machinery, New York, NY, USA, 20–27. https://doi.org/10.1145/3007120.3007144
  • Bayram et al. (2022) Firas Bayram, Bestoun S. Ahmed, and Andreas Kassler. 2022. From Concept Drift to Model Degradation: An Overview on Performance-Aware Drift Detectors. Know.-Based Syst. 245, C (jun 2022). https://doi.org/10.1016/j.knosys.2022.108632
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828. https://doi.org/10.1109/TPAMI.2013.50
  • Bifet and Gavalda (2007) Albert Bifet and Ricard Gavalda. 2007. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM international conference on data mining. SIAM, 443–448.
  • Bu et al. (2018) Li Bu, Cesare Alippi, and Dongbin Zhao. 2018. A pdf-Free Change Detection Test Based on Density Difference Estimation. IEEE Transactions on Neural Networks and Learning Systems 29, 2 (2018), 324–334. https://doi.org/10.1109/TNNLS.2016.2619909
  • Cerquitelli et al. (2019) Tania Cerquitelli, Stefano Proto, Francesco Ventura, Daniele Apiletti, and Elena Baralis. 2019. Towards a Real-Time Unsupervised Estimation of Predictive Model Degradation. In Proceedings of Real-Time Business Intelligence and Analytics (Los Angeles, CA, USA) (BIRTE 2019). Association for Computing Machinery. https://doi.org/10.1145/3350489.3350494
  • Coates et al. (2011) Adam Coates, Andrew Ng, and Honglak Lee. 2011. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. Journal of Machine Learning Research - Proceedings Track 15 (01 2011), 215–223.
  • de Mello et al. (2019) Rodrigo F. de Mello, Yule Vaz, Carlos H. Grossi, and Albert Bifet. 2019. On learning guarantees to unsupervised concept drift detection on data streams. Expert Systems with Applications 117 (2019), 90–102. https://doi.org/10.1016/j.eswa.2018.08.054
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • dos Reis et al. (2016) Denis Moreira dos Reis, Peter Flach, Stan Matwin, and Gustavo Batista. 2016. Fast Unsupervised Online Drift Detection Using Incremental Kolmogorov-Smirnov Test. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 1545–1554. https://doi.org/10.1145/2939672.2939836
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. CoRR abs/2010.11929 (2020). https://arxiv.org/abs/2010.11929
  • Dowson and Landau (1982) D.C Dowson and B.V Landau. 1982. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis 12, 3 (1982), 450–455. https://doi.org/10.1016/0047-259X(82)90077-X
  • Frías-Blanco et al. (2015) Isvani Frías-Blanco, José del Campo-Ávila, Gonzalo Ramos-Jiménez, Rafael Morales-Bueno, Agustín Ortiz-Díaz, and Yailé Caballero-Mota. 2015. Online and Non-Parametric Drift Detection Methods Based on Hoeffding’s Bounds. IEEE Transactions on Knowledge and Data Engineering 27, 3 (2015), 810–823. https://doi.org/10.1109/TKDE.2014.2345382
  • Gama and Castillo (2006) João Gama and Gladys Castillo. 2006. Learning with Local Drift Detection. In Advanced Data Mining and Applications, Xue Li, Osmar R. Zaïane, and Zhanhuai Li (Eds.). Springer Berlin Heidelberg, 42–55.
  • Gama et al. (2004) João Gama, Pedro Medas, Gladys Castillo, and Pedro Rodrigues. 2004. Learning with Drift Detection. In Advances in Artificial Intelligence – SBIA 2004, Ana L. C. Bazzan and Sofiane Labidi (Eds.). Berlin, Heidelberg, 286–295.
  • Gama et al. (2014) João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A Survey on Concept Drift Adaptation. ACM Comput. Surv. 46, 4 (2014). https://doi.org/10.1145/2523813
  • Gemaque et al. (2020) Rosana Noronha Gemaque, Albert França Josuá Costa, Rafael Giusti, and Eulanda Miranda dos Santos. 2020. An overview of unsupervised drift detection methods. WIREs Data Mining and Knowledge Discovery 10, 6 (2020), e1381. https://doi.org/10.1002/widm.1381
  • Gözüaçık et al. (2019) Ömer Gözüaçık, Alican Büyükçakır, Hamed Bonab, and Fazli Can. 2019. Unsupervised Concept Drift Detection with a Discriminative Classifier. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 2365–2368. https://doi.org/10.1145/3357384.3358144
  • Greco and Cerquitelli (2021) Salvatore Greco and Tania Cerquitelli. 2021. Drift Lens: Real-time unsupervised Concept Drift detection by evaluating per-label embedding distributions. In 2021 International Conference on Data Mining Workshops (ICDMW). 341–349. https://doi.org/10.1109/ICDMW53433.2021.00049
  • Greco et al. (2024) Salvatore Greco, Bartolomeo Vacchetti, Daniele Apiletti, and Tania Cerquitelli. 2024. DriftLens: A Concept Drift Detection Tool. In Proceedings 27th International Conference on Extending Database Technology, EDBT 2024, Paestum, Italy, March 25 - March 28. OpenProceedings.org, 806–809. https://doi.org/10.48786/edbt.2024.75
  • Gretton et al. (2012) Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A Kernel Two-Sample Test. Journal of Machine Learning Research 13, 25 (2012), 723–773. http://jmlr.org/papers/v13/gretton12a.html
  • Grulich et al. (2018) Philipp Grulich, René Saitenmacher, Jonas Traub, Sebastian Breß, Tilmann Rabl, and Volker Markl. 2018. Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing. https://doi.org/10.5441/002/edbt.2018.51
  • Haque et al. (2016) Ahsanul Haque, Latifur Khan, and Michael Baron. 2016. SAND: Semi-Supervised Adaptive Novel Class Detection and Classification over Data Stream. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (Phoenix, Arizona) (AAAI’16). AAAI Press, 1652–1658.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, Vol. 30.
  • Hido et al. (2008) Shohei Hido, Tsuyoshi Idé, Hisashi Kashima, Harunobu Kubo, and Hirofumi Matsuzawa. 2008. Unsupervised Change Analysis Using Supervised Learning. In Advances in Knowledge Discovery and Data Mining, Takashi Washio, Einoshin Suzuki, Kai Ming Ting, and Akihiro Inokuchi (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 148–159.
  • Hinder et al. (2023b) Fabian Hinder, Valerie Vaquet, Johannes Brinkrolf, and Barbara Hammer. 2023b. On the Hardness and Necessity of Supervised Concept Drift Detection. 164–175. https://doi.org/10.5220/0011797500003411
  • Hinder et al. (2023a) Fabian Hinder, Valerie Vaquet, and Barbara Hammer. 2023a. One or Two Things We know about Concept Drift – A Survey on Monitoring Evolving Environments. arXiv:2310.15826 [cs.LG]
  • Hu et al. (2020) Hanqing Hu, Mehmed Kantardzic, and Tegjyot S. Sethi. 2020. No Free Lunch Theorem for concept drift detection in streaming data classification: A review. WIREs Data Mining and Knowledge Discovery 10, 2 (2020), e1327. https://doi.org/10.1002/widm.1327
  • Hushchyn and Ustyuzhanin (2020) Mikhail Hushchyn and Andrey Ustyuzhanin. 2020. Generalization of Change-Point Detection in Time Series Data Based on Direct Density Ratio Estimation. CoRR abs/2001.06386 (2020). arXiv:2001.06386 https://arxiv.org/abs/2001.06386
  • Jadhav and Deshpande (2017) Aditee Jadhav and Leena Deshpande. 2017. An Efficient Approach to Detect Concept Drifts in Data Streams. In 2017 IEEE 7th International Advance Computing Conference (IACC). 28–32. https://doi.org/10.1109/IACC.2017.0021
  • Kifer et al. (2004) Daniel Kifer, Shai Ben-David, and Johannes Gehrke. 2004. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada) (VLDB ’04). VLDB Endowment, 180–191.
  • Kim and Park (2016) Young In Kim and Cheong Hee Park. 2016. Concept Drift Detection on Streaming Data under Limited Labeling. In 2016 IEEE International Conference on Computer and Information Technology (CIT). 273–280. https://doi.org/10.1109/CIT.2016.34
  • Liu et al. (2017a) Anjin Liu, Yiliao Song, Guangquan Zhang, and Jie Lu. 2017a. Regional Concept Drift Detection and Density Synchronized Drift Adaptation. 2280–2286. https://doi.org/10.24963/ijcai.2017/317
  • Liu et al. (2017b) Anjin Liu, Guangquan Zhang, and Jie Lu. 2017b. Fuzzy time windowing for gradual concept drift adaptation. In 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). 1–6. https://doi.org/10.1109/FUZZ-IEEE.2017.8015596
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
  • Lu et al. (2019) Jie Lu, Anjin Liu, Fan Dong, Feng Gu, João Gama, and Guangquan Zhang. 2019. Learning under Concept Drift: A Review. IEEE Transactions on Knowledge and Data Engineering 31, 12 (2019), 2346–2363. https://doi.org/10.1109/TKDE.2018.2876857
  • Lughofer et al. (2015) Edwin Lughofer, Eva Weigl, Wolfgang Heidl, Christian Eitzinger, and Thomas Radauer. 2015. Drift detection in data stream classification without fully labelled instances. In 2015 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS). 1–8. https://doi.org/10.1109/EAIS.2015.7368802
  • Mayaki and Riveill (2022) Mansour Zoubeirou A Mayaki and Michel Riveill. 2022. Autoregressive based Drift Detection Method. In 2022 International Joint Conference on Neural Networks (IJCNN). 1–8. https://doi.org/10.1109/IJCNN55064.2022.9892066
  • Mitchell (1999) Tom Mitchell. 1999. Twenty Newsgroups. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5C323.
  • Nishida and Yamauchi (2007) Kyosuke Nishida and Koichiro Yamauchi. 2007. Detecting Concept Drift Using Statistical Testing. In Discovery Science, Vincent Corruble, Masayuki Takeda, and Einoshin Suzuki (Eds.). Springer Berlin Heidelberg, 264–269.
  • Pinagé et al. (2020) Felipe Pinagé, Eulanda M. dos Santos, and João Gama. 2020. A Drift Detection Method Based on Dynamic Classifier Selection. Data Min. Knowl. Discov. 34, 1 (jan 2020), 50–74. https://doi.org/10.1007/s10618-019-00656-w
  • Poeta et al. (2023) Eleonora Poeta, Gabriele Ciravegna, Eliana Pastor, Tania Cerquitelli, and Elena Baralis. 2023. Concept-based Explainable Artificial Intelligence: A Survey. arXiv:2312.12936
  • Rabanser et al. (2018) Stephan Rabanser, Stephan Günnemann, and Zachary Chase Lipton. 2018. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. In Neural Information Processing Systems. https://api.semanticscholar.org/CorpusID:53096511
  • Rahimzadeh et al. (2021) Mohammad Rahimzadeh, Soroush Parvin, Elnaz Safi, and Mohammad Reza Mohammadi. 2021. Wise-SrNet: A Novel Architecture for Enhancing Image Classification by Learning Spatial Resolution of Feature Maps. CoRR abs/2104.12294 (2021). arXiv:2104.12294 https://arxiv.org/abs/2104.12294
  • Sanh et al. (2020) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL]
  • Schneider et al. (2019) Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised Pre-training for Speech Recognition. CoRR abs/1904.05862 (2019). arXiv:1904.05862 http://arxiv.org/abs/1904.05862
  • Shankar and Parameswaran (2022) Shreya Shankar and Aditya G. Parameswaran. 2022. Towards Observability for Production Machine Learning Pipelines. Proc. VLDB Endow. 15, 13 (sep 2022), 4015–4022. https://doi.org/10.14778/3565838.3565853
  • Shen et al. (2023) Pei Shen, Yongjie Ming, Hongpeng Li, Jingyu Gao, and Wanpeng Zhang. 2023. Unsupervised Concept Drift Detectors: A Survey. In Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery. Springer International Publishing, Cham, 1117–1124.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Sun et al. (2017) Nick Sun, Ke Tang, Zexuan Zhu, and Xin Yao. 2017. Concept Drift Adaptation by Exploiting Historical Knowledge. IEEE Transactions on Neural Networks and Learning Systems (2017). https://doi.org/10.1109/TNNLS.2017.2775225
  • Suprem et al. (2020) Abhijit Suprem, Joy Arulraj, Calton Pu, and Joao Ferreira. 2020. ODIN: automated drift detection and recovery in video analytics. Proc. VLDB Endow. 13, 12 (jul 2020), 2453–2465. https://doi.org/10.14778/3407790.3407837
  • Van Looveren et al. (2019) Arnaud Van Looveren, Janis Klaise, Giovanni Vacanti, Oliver Cobb, Ashley Scillitoe, Robert Samoilescu, and Alex Athorne. 2019. Alibi Detect: Algorithms for outlier, adversarial and drift detection. https://github.com/SeldonIO/alibi-detect
  • Ventura et al. (2022) Francesco Ventura, Salvatore Greco, Daniele Apiletti, and Tania Cerquitelli. 2022. Trusting deep learning natural-language models via local and global explanations. Knowledge and Information Systems (2022). https://doi.org/10.1007/s10115-022-01690-9
  • Ventura et al. (2023) Francesco Ventura, Salvatore Greco, Daniele Apiletti, and Tania Cerquitelli. 2023. Explaining deep convolutional models by measuring the influence of interpretable features in image classification. Data Mining and Knowledge Discovery (2023), 1–58.
  • Ventura et al. (2019) Francesco Ventura, Stefano Proto, Daniele Apiletti, Tania Cerquitelli, Simone Panicucci, Elena Baralis, Enrico Macii, and Alberto Macii. 2019. A New Unsupervised Predictive-Model Self-Assessment Approach That SCALEs. In 2019 IEEE International Congress on Big Data, Milan, Italy, July 8-13, 2019. IEEE, 144–148. https://doi.org/10.1109/BIGDATACONGRESS.2019.00033
  • Vorburger and Bernstein (2006) P. Vorburger and A. Bernstein. 2006. Entropy-based Concept Shift Detection. In Sixth International Conference on Data Mining (ICDM’06). https://doi.org/10.1109/ICDM.2006.66
  • Wang et al. (2024) Pingfan Wang, Hang Yu, Nanlin Jin, Duncan Davies, and Wai Lok Woo. 2024. QuadCDD: A Quadruple-based Approach for Understanding Concept Drift in Data Streams. Expert Systems with Applications 238 (2024), 122114. https://doi.org/10.1016/j.eswa.2023.122114
  • Wang et al. (2022) Xuezhi Wang, Haohan Wang, and Diyi Yang. 2022. Measure and Improve Robustness in NLP Models: A Survey. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4569–4586.
  • Wasserstein (1969) L. N. Wasserstein. 1969. Markov processes over denumerable products of spaces describing large systems of automata. Probl. Inform.
  • Werner et al. (2023) Elias Werner, Nishant Kumar, Matthias Lieber, Sunna Torge, Stefan Gumhold, and Wolfgang E. Nagel. 2023. Examining Computational Performance of Unsupervised Concept Drift Detection: A Survey and Beyond. arXiv:2304.08319 [cs.LG]
  • Xu and Wang (2017) Shuliang Xu and Junhong Wang. 2017. Dynamic extreme learning machine for data stream classification. Neurocomputing 238 (2017), 433–449. https://doi.org/10.1016/j.neucom.2016.12.078
  • Yamanishi and Takeuchi (2002) Kenji Yamanishi and Jun-ichi Takeuchi. 2002. A unifying framework for detecting outliers and change points from non-stationary time series data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Alberta, Canada) (KDD ’02). Association for Computing Machinery, New York, NY, USA, 676–681. https://doi.org/10.1145/775047.775148
  • Zhang et al. (2015) Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In NIPS.