1 IDIADA Fahrzeugtechnik GmbH, Munich, Germany; email: [email protected]
2 Dr. Ing. h.c. F. Porsche AG, Stuttgart, Germany; email: {extern.karim.belaid,maximilian.rabus2}@porsche.de
3 Ludwig-Maximilians-Universität, Munich, Germany; email: [email protected]

Pairwise Difference Learning for Classification

Mohamed Karim Belaid (1, 2; ORCID 0000-1111-2222-3333), Maximilian Rabus (2; ORCID 0000-0003-0755-1772), Eyke Hüllermeier (3; ORCID 0000-0002-9944-4108)

Abstract

Pairwise difference learning (PDL) has recently been introduced as a new meta-learning technique for regression. Instead of learning a mapping from instances to outcomes in the standard way, the key idea is to learn a function that takes two instances as input and predicts the difference between the respective outcomes. Given a function of this kind, predictions for a query instance are derived from every training example and then averaged. This paper extends PDL toward the task of classification and proposes a meta-learning technique for inducing a PDL classifier by solving a suitably defined (binary) classification problem on a paired version of the original training data. We analyze the performance of the PDL classifier in a large-scale empirical study and find that it outperforms state-of-the-art methods in terms of prediction performance. Last but not least, we provide an easy-to-use and publicly available implementation of PDL in a Python package.

Keywords: Supervised learning · Multiclass classification · Meta-learning

1 Introduction

Pairwise difference learning (PDL) has recently been introduced independently by Tynes et al. [19] and Wetzel et al. [22] as a meta-learning technique for regression, which transforms the original task of learning to predict outcomes for individual inputs into the task of learning to predict differences between the outcomes of input pairs: Noting that the value of a function $f$ at a point $x$ can be written “from the perspective” of any other point $x^{\prime}$ as $f(x)=f(x^{\prime})+\Delta(x,x^{\prime})$ with $\Delta(x,x^{\prime})=f(x)-f(x^{\prime})$ , the simple idea of PDL is to train an approximation $\tilde{\Delta}$ of the difference function $\Delta$ and obtain predictions of new outcomes $y=f(x)$ by averaging over the predicted differences to the outcomes in the training data:

y\approx\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}+\tilde{\Delta}(x,x_{i})\right)

(1)

One of the main motivations of PDL is the quadratic increase of the training data: If the original training data contains $N$ data points $(x_{1},y_{1}),\ldots,(x_{N},y_{N})$, the difference function can be trained on potentially $\mathcal{O}(N^{2})$ training examples of the form $((x_{i},x_{j}),y_{i}-y_{j})$. This increase might be especially useful in the “small data” regime (even if the transformed examples are of course no longer independent of each other). Moreover, note that the prediction (1) benefits from a statistically useful averaging effect.
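As a rough illustration of Eq. (1), the following minimal sketch trains a difference model on all $N^{2}$ pairs and predicts by averaging over the training points. The toy data, the linear target, and the use of ordinary least squares on the plain difference $x_{i}-x_{j}$ are assumptions made for brevity; they are not the setup used in [19] or [22].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): a noiseless linear target, so the difference
# function Delta(x, x') = w . (x - x') can be fitted exactly.
N, d = 20, 3
w_true, b_true = np.array([1.5, -2.0, 0.5]), 4.0
X = rng.normal(size=(N, d))
y = X @ w_true + b_true

# Build all N^2 ordered pairs ((x_i, x_j), y_i - y_j).
i_idx = np.repeat(np.arange(N), N)
j_idx = np.tile(np.arange(N), N)
pair_features = X[i_idx] - X[j_idx]   # simplest possible pair representation
pair_targets = y[i_idx] - y[j_idx]

# Fit the difference model by ordinary least squares.
w_hat, *_ = np.linalg.lstsq(pair_features, pair_targets, rcond=None)


def predict(x_query):
    """Eq. (1): average y_i + Delta~(x_query, x_i) over all training points."""
    deltas = (x_query - X) @ w_hat
    return float(np.mean(y + deltas))


x_new = np.array([0.2, -0.1, 1.0])
y_true = float(x_new @ w_true + b_true)
y_pred = predict(x_new)
```

Note that the difference model never sees the offset `b_true`; it is recovered at prediction time through the outcomes $y_{i}$ of the training points.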

Building on the basic idea of PDL, we make the following contributions. (i) We extend the idea of PDL toward the task of classification and propose the PDL classifier, a meta-learning approach that transforms any multiclass classification problem into a single binary problem. This method leverages the concept of learning inter-class differences and improves average prediction performance in our experiments (Section 3). (ii) We introduce the “pairwise difference learning library” (pdll) on PyPI, which incorporates our implementation of the PDL classifier and is compatible with any scikit-learn model (Section 3.5). (iii) We conduct a large-scale experimental analysis of PDL and compare the results to state-of-the-art ML estimators (Section 4). (iv) We discuss the architecture of PDL and how it can lead to an improvement of the accuracy (Section 5).

2 Related Work

Tynes et al. introduced the pairwise difference regressor [19], a meta-learner for chemical tasks that improves prediction performance compared to random forests and provides robust uncertainty quantification. In computational chemistry, estimating differences between data points helps mitigate systematic errors [19]. In parallel, Wetzel et al. used twin neural network architectures for semi-supervised regression tasks, focusing on predicting differences between the target values of distinct data points [22]. Their approach enables training on unlabelled data points when these are paired with labeled anchor data points. By ensembling predicted differences between target values, the method achieves high prediction performance on regression problems. While conceptually similar to the pairwise difference regressor in emphasizing differences between data points, it is specialized to neural network architectures for semi-supervised regression tasks [23].

The pairwise difference learning (PDL) literature has since evolved into diverse methodologies and applications. Spiers et al. measured sample similarity in chemistry, emphasizing spectral shape differences using metrics such as the Euclidean and Mahalanobis distances. They extended the approach by calculating a Z-score that offers insights into prediction accuracy, facilitating outlier detection and model adaptation [18]. PDL was developed mainly for regression tasks. It can also be adapted to targets that are either known exactly or only bounded; examples of such target annotations are $y=5.3$, $y<2.1$, or $y>6.5$. In this setting, predicting whether the target increases or decreases between the two members of a pair is a possible solution [8]. The PDL regressor and its variants have demonstrated efficacy in various applications, including regression with image input [11], learning chemical properties [7], quantum mechanical reactions [5], and drug activity ranking [21].

3 PDL Classification

Consider a standard setting of supervised (classification) learning: Given a set of training data

\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}\subset\mathbb{R}^{d}\times\mathcal{Y}\,,

comprised of training instances in the form of feature vectors $x_{i}\in\mathbb{R}^{d}$ together with observed discrete labels $y_{i}\in\mathcal{Y}=\{1,\ldots,K\}$, and assumed to be generated i.i.d. according to an underlying (unknown) joint probability measure $P$, the task is to learn a predictor $\textup{PDC}:\mathbb{R}^{d}\rightarrow\mathcal{Y}$ with low risk (expected loss). The PDL classifier transforms the original training data $\mathcal{D}$ into the new data

\mathcal{D}_{pair}=\big\{(z_{i,j},y_{i,j})\,|\,1\leq i,j\leq N\big\}\,,

(2)

where $z_{i,j}=\phi(x_{i},x_{j})$ is a joint feature representation of the instance pair $(x_{i},x_{j})$ and

y_{i,j}=\begin{cases}0&\text{for }y_{i}\neq y_{j},\\ 1&\text{for }y_{i}=y_{j}\end{cases}\,.

(3)

Thus, we seek a binary classifier $\gamma:\,\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[0,1]$ that, given two instances $x$ and $x^{\prime}$ as input, predicts whether or not the respective classes $y$ and $y^{\prime}$ are the same. More specifically, we assume $\gamma$ to be a probabilistic classifier, so that $\gamma(x,x^{\prime})\in[0,1]$ is the probability that $y=y^{\prime}$. Deterministic classifiers that return a binary label as a prediction are treated as degenerate $\{0,1\}$-valued probabilistic classifiers. Leveraging the joint feature representation, $\gamma$ is of the form $\gamma(x,x^{\prime})=h(\phi(x,x^{\prime}))$, where $h$ is trained on the transformed data (2). To this end, any binary classification method can be used. Note, however, that the binary problem might be quite imbalanced, as the transformation (3) will produce many more negative (unequal) than positive (equal) examples. One can address this issue by introducing class weights [13] that rebalance the loss function of the classifier $\gamma$. As for the joint feature representation, the original proposal was to define $z_{i,j}$ as a concatenation of $x_{i}$ and $x_{j}$. It turned out, however, that expanding this vector by the difference $x_{i}-x_{j}$ has a positive influence on performance [19], which is why we also adopt this representation in our work.
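The transformation (2)-(3) with the joint representation $\phi(x_{i},x_{j})=[x_{i},x_{j},x_{i}-x_{j}]$ can be sketched as follows. The function names `make_pairs` and `balanced_class_weights` are hypothetical helpers introduced for illustration, and the inverse-frequency weighting is one common heuristic (the same formula scikit-learn uses for `class_weight="balanced"`), not necessarily the weighting used in our experiments.

```python
import numpy as np


def make_pairs(X, y):
    """Return (Z, y_pair) for all N^2 ordered pairs, self-pairs included.

    Z[k] = [x_i, x_j, x_i - x_j] and y_pair[k] = 1 if y_i == y_j else 0.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    i_idx = np.repeat(np.arange(n), n)   # 0,0,...,1,1,...
    j_idx = np.tile(np.arange(n), n)     # 0,1,...,0,1,...
    Z = np.hstack([X[i_idx], X[j_idx], X[i_idx] - X[j_idx]])
    y_pair = (y[i_idx] == y[j_idx]).astype(int)
    return Z, y_pair


def balanced_class_weights(y_pair):
    """Weights inversely proportional to the frequency of each pair label."""
    counts = np.bincount(y_pair, minlength=2)
    return {c: len(y_pair) / (2.0 * counts[c]) for c in (0, 1)}


# Tiny example: 4 points, 3 classes, 2 features.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [0.5, 1.5]])
y_toy = np.array([1, 2, 3, 1])
Z_toy, y_pair_toy = make_pairs(X_toy, y_toy)
weights_toy = balanced_class_weights(y_pair_toy)
```

In this toy case, 6 of the 16 pairs are positive (the 4 self-pairs plus the two orderings of the two class-1 points), so the negative label already dominates even on very small data.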

Figure 1: Illustration of the PDL classifier.

Since (class) equality is a symmetric relation, $\gamma$ is naturally expected to be symmetric in the sense that $\gamma(x_{i},x_{j})=\gamma(x_{j},x_{i})$. By adding both $(\phi(x_{i},x_{j}),y_{i,j})$ and $(\phi(x_{j},x_{i}),y_{j,i})$ to $\mathcal{D}_{pair}$, this symmetry can also be reflected in the training data. Even then, however, $\gamma$ is not guaranteed to be symmetric. Therefore, we additionally “symmetrize” the predictor as follows:

\gamma_{sym}(x_{i},x_{j})=\frac{\gamma(x_{i},x_{j})+\gamma(x_{j},x_{i})}{2}

(4)

Given a query $x_{q}$ , we finally estimate the probability of class labels $y\in\mathcal{Y}$ as follows: Considering each training example $(x_{i},y_{i})$ as a piece of evidence for the unknown class $y_{q}$ , the semantics of the above prediction suggests that the probability of the event $y_{q}=y_{i}$ is given by (4). More formally, $P(E)=\gamma_{sym}(x_{q},x_{i})$ , where $E$ denotes the event $y_{q}=y_{i}$ (and hence $P(\neg E)=1-\gamma_{sym}(x_{q},x_{i})$ ). Let $p$ denote the prior distribution on the class labels $\mathcal{Y}$ (which can easily be estimated by relative frequencies on the training data). This distribution is then updated by conditioning it on the (uncertain) event $E$ , which yields the following posterior suggested by $(x_{i},y_{i})$ :

p_{post,i}(y)=\begin{cases}\gamma_{sym}(x_{q},x_{i})&\text{if }y=y_{i},\\ \dfrac{p(y)\cdot(1-\gamma_{sym}(x_{q},x_{i}))}{1-p(y_{i})}&\text{otherwise}\end{cases}

(5)

Thus, the (posterior) probability of $y_{i}$ is fixed to $\gamma_{sym}(x_{q},x_{i})$ , and all other probabilities are rescaled in a proportional way, to guarantee that the sum of posterior probabilities adds to 1. Finally, we average over the evidences from all training examples to obtain

p_{post}(y)=\frac{1}{N}\sum_{i=1}^{N}p_{post,i}(y)\,.

(6)

In case a deterministic prediction is sought, the class with the highest (estimated) probability is chosen:

\hat{y}_{q}=\arg\max_{y\in\mathcal{Y}}p_{post}(y)

(7)
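The update (5), the averaging (6), and the decision rule (7) can be traced on a small example. The sketch below assumes the symmetrized probabilities $\gamma_{sym}(x_{q},x_{i})$ have already been computed; the prior, the anchor labels, and the $\gamma_{sym}$ values are made-up numbers, and the function names are hypothetical.

```python
def posterior_from_anchor(gamma_sym, anchor_label, prior):
    """Eq. (5): posterior over classes suggested by a single anchor.

    gamma_sym    -- symmetrized probability that query and anchor share a class
    anchor_label -- class label y_i of the anchor
    prior        -- dict mapping each class label to its prior probability
    """
    rest = 1.0 - prior[anchor_label]   # prior mass of all other classes
    return {
        label: gamma_sym if label == anchor_label
        else p * (1.0 - gamma_sym) / rest
        for label, p in prior.items()
    }


def pdc_posterior(gammas, anchor_labels, prior):
    """Eq. (6): average the per-anchor posteriors."""
    per_anchor = [posterior_from_anchor(g, lab, prior)
                  for g, lab in zip(gammas, anchor_labels)]
    n = len(per_anchor)
    return {label: sum(p[label] for p in per_anchor) / n for label in prior}


# Made-up numbers for illustration: three anchors, three classes.
prior = {1: 0.5, 2: 0.25, 3: 0.25}
anchor_labels = [1, 2, 3]
gammas = [0.2, 0.1, 0.9]

p_post = pdc_posterior(gammas, anchor_labels, prior)
y_hat = max(p_post, key=p_post.get)   # Eq. (7)
```

For the first anchor, for instance, class 1 receives probability 0.2 and the remaining mass of 0.8 is split between classes 2 and 3 in proportion to their priors (0.4 each), so every per-anchor posterior sums to 1.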

3.1 Uncertainty Quantification

Interestingly, PDL also offers a natural approach to uncertainty quantification, a topic that has received increasing attention in the recent machine learning literature. In particular, recent research has focused on the distinction between so-called aleatoric uncertainty (caused by inherent randomness in the data) and epistemic uncertainty (caused by the learner’s incomplete knowledge of the true data-generating process); we refer to [12] for a detailed exposition of this topic.

Within the Bayesian approach, these two types of uncertainty can be captured by properties of the posterior predictive distribution, which in turn can be approximated through ensemble learning [15]. In a sense, PDL parallels this approach, with each anchor playing the role of an ensemble member, and (6) mimicking Bayesian model averaging. This suggests the following quantification of aleatoric (AU), epistemic (EU), and total uncertainty (TU) of a prediction, with $H$ denoting Shannon entropy:

	TU	$\displaystyle=H(p_{post}(y))=H\left(\frac{1}{N}\sum_{i=1}^{N}p_{post,i}(y)\right)$
	AU	$\displaystyle=\frac{1}{N}\sum_{i=1}^{N}H(p_{post,i}(y))$
	EU	$\displaystyle=\text{TU}-\text{AU}$

Theoretically, these measures are justified based on a well-known result from information theory, according to which entropy additively decomposes into conditional entropy and mutual information [6]. Broadly speaking, the more uniform the (averaged) distribution $p_{post}$ , the higher the total uncertainty, and the more diverse the individual predictions $p_{post,i}$ , the higher the epistemic uncertainty.
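A minimal sketch of this decomposition is given below. It assumes the per-anchor distributions $p_{post,i}$ are available as lists of probabilities and measures entropy in bits (the base of the logarithm is an assumption, as it is not fixed above); the function names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def pdl_uncertainty(per_anchor_posteriors):
    """Return (TU, AU, EU) from a list of per-anchor distributions p_post,i."""
    n = len(per_anchor_posteriors)
    k = len(per_anchor_posteriors[0])
    # Averaged distribution p_post, as in Eq. (6).
    mean_dist = [sum(p[c] for p in per_anchor_posteriors) / n for c in range(k)]
    total = entropy(mean_dist)                                      # TU
    aleatoric = sum(entropy(p) for p in per_anchor_posteriors) / n  # AU
    return total, aleatoric, total - aleatoric                      # EU = TU - AU


# Anchors that agree on an uncertain answer: purely aleatoric uncertainty.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Anchors that are individually certain but contradict each other: epistemic.
disagree = [[1.0, 0.0], [0.0, 1.0]]

tu_a, au_a, eu_a = pdl_uncertainty(agree)
tu_d, au_d, eu_d = pdl_uncertainty(disagree)
```

Both toy cases have the same total uncertainty of one bit, but it is attributed entirely to AU in the first case and entirely to EU in the second.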

3.2 Illustration

Thanks to its structure, the PDL classifier can solve a multiclass classification task by training exactly one instance of a base learner on a binary task. Fig. 1 illustrates the PDL classifier algorithm, showcasing both the training and prediction phases on a simple multiclass task. Fig. 1.a shows a traditional multiclass classifier $g$ that maps each of the $N$ training data points to its assigned class label (star, square, or circle). In Fig. 1.b, the PDL classifier transforms the data by creating $N^{2}$ pairs of data points. During training, a binary classifier $\gamma$ learns to distinguish pairs that belong to the same class (positive label) from pairs of different classes (negative label). In Fig. 1.c, given one query input, the PDL classifier pairs it with each of the $N$ training data points. For each pair, the classifier predicts a probability of similarity (belonging to the same class). Predicted probabilities are mapped to the column corresponding to the initial label of each training point. Missing posterior probabilities, in grey, are estimated by updating the prior probabilities, assuming a uniform distribution in this example. Finally, averaging across all training points yields the predicted probabilities for each class. The class with the highest predicted probability is chosen as the final class label for the query point (e.g., Class 3).
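The training and prediction phases described above can be put together in a short end-to-end sketch. It is an illustrative re-implementation under stated assumptions, not the pdll library code: the Iris data, the ExtraTrees base learner with 50 trees, the balanced class weights, and all variable names are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split


def phi(A, B):
    """Joint representation [a, b, a - b] for row-aligned arrays A and B."""
    return np.hstack([A, B, A - B])


X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=50, stratify=y, random_state=0)
n, classes = len(X_tr), np.unique(y_tr)

# Training: all N^2 ordered pairs, label 1 iff both points share a class.
i_idx, j_idx = np.repeat(np.arange(n), n), np.tile(np.arange(n), n)
gamma = ExtraTreesClassifier(
    n_estimators=50, class_weight="balanced", random_state=0)
gamma.fit(phi(X_tr[i_idx], X_tr[j_idx]),
          (y_tr[i_idx] == y_tr[j_idx]).astype(int))
pos = list(gamma.classes_).index(1)   # column of the "same class" label

# Prior from relative class frequencies, plus per-anchor lookups.
prior = np.array([np.mean(y_tr == c) for c in classes])
anchor_class_idx = np.searchsorted(classes, y_tr)
anchor_prior = prior[anchor_class_idx]


def predict_proba(x_query):
    """Eqs. (4)-(6) for one query, using every training point as an anchor."""
    Q = np.tile(x_query, (n, 1))
    g = 0.5 * (gamma.predict_proba(phi(Q, X_tr))[:, pos]
               + gamma.predict_proba(phi(X_tr, Q))[:, pos])        # Eq. (4)
    # Eq. (5): rescale the prior for non-anchor classes, then fix the
    # probability of each anchor's own class to gamma_sym.
    post = prior[None, :] * ((1.0 - g) / (1.0 - anchor_prior))[:, None]
    post[np.arange(n), anchor_class_idx] = g
    return post.mean(axis=0)                                       # Eq. (6)


probs = np.array([predict_proba(x) for x in X_te])
y_pred = classes[probs.argmax(axis=1)]                             # Eq. (7)
accuracy = float(np.mean(y_pred == y_te))
```

Each query costs two base-learner calls over all $N$ anchors, one per pair ordering, which is where the factor $2$ in the prediction cost discussed in Section 3.4 comes from.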

Fig. 2 illustrates the patterns learned by nine baseline models across three 2D datasets. The baseline 3-Nearest-Neighbor (3-NN) classifier can only predict four probabilities: $0,\frac{1}{3},\frac{2}{3},$ and $1$. This is evident in the figure, where each dataset shows only four discrete regions. In contrast, when using PDL on top of 3-NN, the predicted probability is obtained by averaging over $N$ discrete predictions, which results in finer-grained probability estimates. Even with simple base estimators, PDL can capture more complex patterns. The contrast between DecisionTree with and without PDL clearly illustrates PDL’s capability to learn non-linear patterns. The underfitting observed when incompatible base models learn corrupted patterns underscores the critical role of the choice of base learners.

3.3 Choice of Base Learners

As noted above, PDL can theoretically be implemented with any (probabilistic) binary classifier as a base learner; stated differently, it can be used as a wrapper for any (binary or multinomial) classifier. Practically, however, some classifiers might be more suitable as base learners than others.

One thing to keep in mind is that even if the original data $\mathcal{D}$ is i.i.d., independence will be lost for $\mathcal{D}_{pair}$ as soon as the same instance $x_{i}$ is paired with various other instances. This is very similar to the setting of metric learning, where models are also trained on pairs of data points [2]. In practice, although many machine learning algorithms turn out to be quite robust against violations of the i.i.d. assumption [14], some methods may be affected more than others.

Another important aspect is the joint feature representation $z=\phi(x,x^{\prime})$ . For example, by defining $z$ as a concatenation of $x$ , $x^{\prime}$ , and the difference $x-x^{\prime}$ , one obviously introduces (perfect) multicollinearity. Again, while this is problematic for some machine learning methods, notably linear models [19, p.8], others can deal with this property more easily.

While an in-depth analysis of the suitability of different base learners is beyond the scope of this paper, we generally found that non-parametric methods are more robust and tend to show better performance than parametric ones. In our experimental evaluation, we will therefore mainly use tree-based methods, which have the additional advantage of being fast to train.

3.4 Complexity

Looking at the complexity of PDL, suppose the complexity of a base learner to be $\mathcal{O}(p(N,M,F,K))$, where $p(\cdot)$ is polynomial in the number of training points ($N$), the number of test points ($M$), the number of input features ($F$), and the number of output classes ($K$). The complexity of PDL is then $\mathcal{O}(p(N^{2},2MA,3F,2))$, for the following reasons. The $N$ training points become $N^{2}$ pairs. The number of features grows to $3F$: $F$ features of point $x_{i}$, $F$ features of point $x_{j}$, and $F$ features of the difference $x_{i}-x_{j}$, a construction previously shown to improve results [19]. Each test point is paired with the $A$ anchor points, and each pair is evaluated in both orders for the symmetrization (4), so $M$ test predictions of PDL require $2MA$ predictions of the base learner. Finally, the number of output classes shrinks from $K$ to $2$, since the model only predicts whether the two points of a pair share the same class.
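As a worked instance of these counts, consider a hypothetical dataset with $N=600$ training points, $F=10$ features, and $M=100$ test points, with all training points used as anchors ($A=N$). These numbers are illustrative and are not taken from the benchmark:

```latex
\begin{aligned}
\text{training pairs:}\quad & N^{2} = 600^{2} = 360\,000,\\
\text{features per pair:}\quad & 3F = 3\cdot 10 = 30,\\
\text{base-learner predictions:}\quad & 2MA = 2\cdot 100\cdot 600 = 120\,000,\\
\text{output classes:}\quad & 2 .
\end{aligned}
```

The base learner is thus trained on a much larger but binary problem, and the prediction cost grows linearly in the number of anchors.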

3.5 PDL Library

Our library (available at https://github.com/Karim-53/pdll) includes a Python implementation of the PDL classifier that adheres to the scikit-learn conventions. Consequently, integrating the PDL classifier into existing codebases is straightforward, requiring minimal modifications. As demonstrated in the example below, only two additional lines of code are needed:

[Code listing shown as an image in the original; not reproduced here]

4 Evaluation

In this section, we test PDC on various public datasets from OpenML [20] and compare it to 7 Scikit-learn state-of-the-art learners.

4.1 Data

OpenML provides a diverse range of datasets, many of which are small, with 37% having fewer than 600 data points. This study focuses on small datasets, for which the pairwise learning approach is presumably most effective. We applied grid search CV for parameter tuning, leveraging the search space from TPOT [16]. To accommodate our grid search setup, we subsampled the search space to 1,000 parameter combinations per estimator. Following dataset selection constraints similar to the OpenML-CC18 benchmark [3], we randomly selected 99 datasets (see summary statistics in Fig. 3). Although these datasets are relatively small, the effective data size for PDC is squared due to the pairing, reaching up to $360\,000$ pairs ($600^{2}$). We also monitored class imbalance using the “minority class” meta-data, which represents the percentage of the minority class relative to the total size of each dataset. Considering the 7 baseline models, we performed 5 times 5-fold CV with an inner 3-fold grid search CV, totaling 66 528 000 train-test runs and 3 weeks of wall time on an HPC cluster.

4.2 Data Processing Pipeline

Using scikit-learn [17], we implemented a common data processing pipeline for all runs, with standardization for numeric features, one-hot encoding for nominal features, and ordinal encoding for ordinal features. Since PDL needs the pair difference $x_{i}-x_{j}$ as additional inputs, processed features are all treated as numeric when applying the difference.

4.3 Performance Measures

We measure performance in terms of the (macro) F1 score, which is arguably more meaningful than the standard misclassification rate in the case of imbalanced data. In binary classification, the F1 score is defined as the harmonic mean of precision and recall. For multinomial problems, the macro version of this score is the (unweighted) mean of the F1 scores for the individual classes:

\text{Macro}F1=\frac{1}{K}\sum_{i=1}^{K}F1_{i}\,,

where $F1_{i}$ is the F1 score on the $i^{th}$ class (treating test examples of this class as positive and all others as negative). We also report the improvement of PDL over the base learner in terms of the difference $\Delta F1=\text{Macro}F1_{PDC}-\text{Macro}F1_{base}$ . We aggregate the results using the mean $\pm$ standard error.
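The definition can be checked on a small example. The sketch below is a plain-Python illustration with made-up labels; the function name `macro_f1` and the convention of scoring an empty class as 0 are assumptions.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of one-vs-rest F1 scores over the given class labels."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and
        # recall. Convention (assumption): 0 if the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up predictions of a baseline and of PDC on eight test points.
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_base = [1, 1, 1, 1, 1, 2, 1, 3]
y_pdc = [1, 1, 1, 2, 2, 2, 3, 3]

f1_base = macro_f1(y_true, y_base, labels=[1, 2, 3])
f1_pdc = macro_f1(y_true, y_pdc, labels=[1, 2, 3])
delta_f1 = f1_pdc - f1_base
```

In this toy case the baseline has the same accuracy on the majority class but misses half of each minority class, which the macro average penalizes because every class carries equal weight.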

To aggregate the results of all data sets, we count the number of wins/losses by comparing the average performance of models over 25 runs (5 times 5-fold CV) per dataset. A win is counted when PDC’s average score is higher than that of the baseline; a loss is counted when it is lower. To determine the number of significant wins/losses, a Student’s t-test is conducted for each dataset to assess the statistical significance of the difference in performance. A significant win/loss is recorded when the p-value of the t-test is below a predetermined threshold $\alpha=0.05$. In some cases, the average scores are tied, so the number of wins and losses does not always sum to 99, the total number of datasets benchmarked.

As an alternative to counting wins and losses, and despite being aware of the questionable nature of this statistic, we also average performance over data sets. Average performance may provide a first overall impression, although we agree that it should always be interpreted in a cautious way.

4.4 Results

First, the PDL classifier, on top of ExtraTrees, obtained the best average Macro F1 score over the 99 datasets, outperforming all baselines, see Fig. 4. In Tab. 1, the ratio of significant wins demonstrates an advantage for the PDL classifier, suggesting that, in a one-to-one comparison, PDL is more likely to outperform its equivalent baseline.

Table 1: Comparing baseline classifiers to PDC using 99 datasets.

Classifier	Significant wins (base)	Significant wins (PDC)	Wins (base)	Wins (PDC)	Test Macro F1 (base)	$\pm$ sem	Test Macro F1 (PDC)	$\pm$ sem
Bagging	3	26	27	70	0.7906	0.0035	0.8062	0.0034
DecisionTree	2	50	22	76	0.7694	0.0037	0.7982	0.0034
ExtraTree	1	61	9	90	0.7434	0.0037	0.7987	0.0035
ExtraTrees	6	24	21	77	0.7951	0.0036	0.8113	0.0035
GradientBoosting	9	23	25	72	0.7839	0.0037	0.7903	0.0039
HistGradientBoosting	2	32	15	82	0.7888	0.0037	0.8053	0.0035
RandomForest	5	27	22	73	0.7933	0.0035	0.8073	0.0034

The PDL classifier can be viewed as a method to simplify the trained model. As shown in Fig. 4, the test performance of PDC(DecisionTree) is equivalent to or better than that of the seven benchmarked state-of-the-art estimators. This indicates that, with the help of PDL, training a single tree can compete with ensemble methods that typically train around 100 trees. In this context, explaining a single tree may provide a more straightforward solution.

Analyzing the unique contribution.

While PDL classifiers have high probabilities of outperforming baseline models in a one-to-one comparison, the ultimate goal of a data scientist is to obtain the best performance on each dataset. Before introducing PDC, the maximum achievable Macro F1 score, i.e., the score obtained with the best optimized baseline on each dataset, was $0.8112\pm 0.0035$ averaged over the 99 datasets. With the help of PDC, we achieve higher scores on 75 datasets, and the new record becomes $0.8243\pm 0.0031$. This gain reflects the unique contribution of PDC compared to existing algorithms. Moreover, PDC’s unique contribution is the largest among all estimators. Indeed, PDC’s leave-one-out contribution to this record is $0.8243-0.8112=0.0131$, while popular estimators like HistGradientBoosting make no unique contribution, i.e., they do not outperform all other estimators on any of the 99 datasets, see Tab. 2. PDC’s contribution is about 32 times larger than that of the best baseline (ExtraTrees, $0.00041$).

Table 2: Unique contribution of each estimator to the average Macro F1 score using the best optimized model on each dataset.

Estimator	Unique contribution	Wins
ExtraTree	0	0
HistGradientBoosting	0	0
RandomForest	0.00002	1
Bagging	0.00004	2
GradientBoosting	0.00006	2
DecisionTree	0.00020	10
ExtraTrees	0.00041	9
PDC	0.01312	75
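The leave-one-out reading of the “unique contribution” can be sketched as follows: the record is the mean over datasets of the best score among all estimators, and an estimator’s contribution is the drop in that record when the estimator is removed. The function names and the scores below are made up for illustration and are not results from the benchmark.

```python
def best_average(scores, estimators):
    """Mean over datasets of the best score among the given estimators."""
    n_datasets = len(next(iter(scores.values())))
    return sum(max(scores[e][d] for e in estimators)
               for d in range(n_datasets)) / n_datasets


def unique_contributions(scores):
    """Leave-one-out drop in the record when each estimator is removed."""
    everyone = list(scores)
    record = best_average(scores, everyone)
    return {e: record - best_average(scores, [o for o in everyone if o != e])
            for e in everyone}


# Made-up Macro F1 scores of three estimators on three datasets.
toy_scores = {
    "A": [0.80, 0.70, 0.90],
    "B": [0.78, 0.75, 0.85],
    "C": [0.79, 0.72, 0.88],
}
contrib = unique_contributions(toy_scores)
```

Estimator C is never the best on any dataset, so removing it leaves the record unchanged and its contribution is zero, mirroring the zero entries in Tab. 2.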

Analyzing the overfitting.

PDL classifiers have the advantage of reducing overfitting. Indeed, looking at the 199 cross-validation (CV) runs in which the baseline and the PDL classifier obtain train Macro F1 scores that are not significantly different, we notice that PDL classifiers have a smaller train-test gap. Less overfitting is also observed when grouping by base classifier, see Tab. 3. This even remains true without conditioning on non-significantly different train scores.

Table 3: Comparing test Macro F1 on the subset of runs where train scores are not significantly different.

Estimator	# CV runs	Baseline Train Macro F1	Baseline Test Macro F1	PDC Train Macro F1	PDC Test Macro F1	Test $\Delta F1$	Test p-value
Bagging	20	0.998	0.835	0.999	0.859	0.024	$10^{-15}$
DecisionTree	14	0.950	0.884	0.955	0.895	0.011	$10^{-05}$
ExtraTree	11	0.915	0.844	0.924	0.861	0.017	$10^{-04}$
ExtraTrees	26	0.985	0.828	0.991	0.853	0.025	$10^{-16}$
GradientBoosting	58	0.930	0.822	0.926	0.840	0.018	$10^{-17}$
HistGradientBoosting	52	0.961	0.820	0.963	0.839	0.019	$10^{-19}$
RandomForest	18	0.992	0.855	0.997	0.881	0.026	$10^{-11}$
Total	199	0.958	0.832	0.960	0.852	0.020	$10^{-74}$

5 Why Does PDL Yield Improved Performance?

The empirical results reveal that the PDL classifier significantly improves over the baseline methods. In this section, we elaborate on possible reasons for this improvement.

5.1 Combining Instance-based and Model-based Learning

A distinguishing feature of PDL is a unique combination of (local) instance-based learning and (global) model-based learning. Like the well-known nearest-neighbor principle, a prediction for a new query is produced by other instances from the training set, namely the anchor points; yet, as opposed to NN, these instances are not restricted to nearby cases but can be located anywhere in the instance space. This becomes possible through the model-based component of PDL, namely the classifier $\gamma$ , which is a global model that generalizes over the entire instance space. Broadly speaking, by constructing $\gamma$ , the classifier learns how to transfer class information from one data point to another.

Of course, there are other learning methods with similar characteristics. For example, instead of using a predefined distance function, the nearest neighbor method can be instantiated with a distance function $\delta$ that is learned on the training data. Metric learning typically proceeds from sets of similar instances (belonging to the same class) and dissimilar instances (belonging to different classes), and seeks to learn a function $\delta$ that keeps the distance low for the former while making it high for the latter [10, 2]. In a sense, this is indeed quite comparable to PDL, especially because both $\delta$ and $\gamma$ are two-place functions taking pairs of instances as input. Moreover, $\gamma$ could indeed also be seen as a kind of distance measure, if “distance” is defined in terms of “probability of belonging to the same class”. Yet, PDL is arguably more flexible, because $\gamma$ is not required to satisfy properties of a distance or metric.

5.2 Simplification through Binary Reduction

Another advantage of PDL is simplicity: The original classification task is effectively reduced to a binary problem, namely, to decide whether or not two instances share the same class label. This is comparable to binary decomposition techniques such as one-vs-rest and all-pairs [4, p.202], which reduce a single multinomial classification problem to several binary problems. Instead, PDL constructs a single binary problem, although the total number of training examples produced essentially coincides for all methods (it is roughly quadratic in the size of the original data). In any case, binary problems are normally easier to solve, which explains the improved classification accuracy commonly reported for reduction techniques. In this regard, a decomposition can even be useful for methods that are able to handle multinomial problems right away (such as decision trees).

5.3 Error Reduction through Averaging

Last but not least, by instantiating the global model for every anchor and collecting predictions from all of them, PDL benefits from a kind of ensemble effect and reduces error through averaging. In particular, since prediction errors of individual anchors can be compensated by other anchors, PDL is able to reduce the variance of the prediction error. Again, this is somewhat comparable to the nearest-neighbor method. Given the model $\gamma$, the anchor predictions can even be considered as independent (although this independence is lost if the anchor points are also part of the data used to train $\gamma$), which, under the simplifying assumption of homoscedasticity, means that the prediction error is reduced by a factor of $1/\sqrt{A}$, with $A$ the number of anchors [23, p.4].
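The $1/\sqrt{A}$ factor can be checked with a small simulation. The sketch below assumes idealized anchor errors that are independent and identically normally distributed, which is exactly the simplification discussed above and will generally not hold for real anchors; the function name and parameter values are illustrative.

```python
import random
import statistics


def std_of_anchor_average(num_anchors, sigma=1.0, trials=4000, seed=0):
    """Empirical std of the mean of `num_anchors` i.i.d. N(0, sigma^2) errors."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(num_anchors))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


single = std_of_anchor_average(1)      # error spread with one anchor
averaged = std_of_anchor_average(25)   # error spread after averaging 25 anchors
ratio = averaged / single              # theory: 1 / sqrt(25) = 0.2
```

With correlated anchor errors the reduction would be smaller, which is one reason the empirical curve can deviate from the idealized one.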

Even if these assumptions may not be completely satisfied, an expected improvement through averaging can clearly be observed in empirical studies. Fig. 5 shows four cases encountered with four different datasets and DecisionTree as a baseline. We compare the loss of the baseline (baseline loss) with the actual PDL loss, i.e., the loss given all available anchors. The empirical approximation curve is meant to show how the loss depends on the number of anchor points. Its value at $A$ is produced by averaging the performance over randomly selected anchor subsets of size $A$. The curve goes from the average loss when only one anchor is used ($\gamma$ loss) until reaching the actual PDL loss. The theoretical approximation curve is an optimal fit of a theoretical model to the empirical approximation, namely, the decrease of the error under the ideal assumption of independent prediction errors distributed normally with mean $\mu$ and standard deviation $\sigma$. As can be seen, even if this assumption may not fully hold, the two curves deviate only slightly.

In case (a), the loss of the PDC’s $\gamma$ estimator is better than the loss of the baseline model. As expected in this case, PDC is better than the baseline with any number of anchors. In case (b), the baseline loss is between $\gamma$ loss and PDC’s loss. With the theoretical approximation, we estimate how many anchors are enough to outperform the baseline. In case (c), the baseline model is better than PDL. Nevertheless, the theoretical approximation allows us to estimate the additional anchors needed to outperform the baseline and the best reachable loss. It becomes less and less efficient to improve the score by adding more anchors. It might become more interesting, starting from a certain size, to work more on the base learner or the data quality. In case (d), the baseline model is even better than the approximated asymptote because learning the dual problem is more difficult. Adding more anchors is less likely to help.

6 Conclusion

Building on the concept of pairwise difference learning (PDL), we proposed the PDL classifier (PDC), a meta-learner able to reduce a multiclass classification problem into a binary problem. Our extensive empirical evaluation across 99 diverse datasets demonstrates that PDL consistently outperforms state-of-the-art machine learning models, resulting in improved F1 scores in a majority of cases. This highlights PDL’s effectiveness in enhancing performance over baseline methods, facilitated through its straightforward integration via our Python package. To explain its strong performance, we also elaborated on several properties and features of PDC.

Future research directions include the exploration of instance (anchor) weighting through regularization or Shapley data importance [9] and interaction [1]. Moreover, we plan to elaborate more closely on PDC’s potential to quantify predictive uncertainty (cf. Section 3.1).

In conclusion, PDL emerges as a practical solution for improving ML models, offering versatility and performance improvements across diverse applications. Its adaptability and robust performance make it a valuable addition to the ML toolkit, promising more accurate and reliable predictions in various domains.

References

[1] Belaid, M.K., El Mekki, D., Rabus, M., Hüllermeier, E.: Optimizing Data Shapley Interaction calculation from $\mathcal{O}(2^{N})$ to $\mathcal{O}(TN^{2})$ for KNN models. arXiv preprint arXiv:2304.01224 (2023)
[2] Bian, W., Tao, D.: Learning a Distance Metric by Empirical Loss Minimization. In: Proc. IJCAI, International Joint Conference on Artificial Intelligence (2013)
[3] Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: OpenML benchmarking suites. arXiv preprint arXiv:1708.03731 (2017)
[4] Bishop, C.: Pattern recognition and machine learning. Springer 2, 183 (2006)
[5] Chen, Y., Ou, Y., Zheng, P., Huang, Y., Ge, F., Dral, P.O.: Benchmark of general-purpose machine learning-based quantum mechanical method AIQM1 on reaction barrier heights. The Journal of Chemical Physics 158(7) (2023)
[6] Depeweg, S., Hernandez-Lobato, J., Doshi-Velez, F., Udluft, S.: Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In: Proc. ICML, 35th Int. Conf. on Machine Learning. Stockholm, Sweden (2018)
[7] Fralish, Z., Chen, A., Skaluba, P., Reker, D.: DeepDelta: predicting ADMET improvements of molecular derivatives with deep learning. Journal of Cheminformatics 15(1), 101 (2023)
[8] Fralish, Z., Skaluba, P., Reker, D.: Leveraging bounded datapoints to classify molecular potency improvements. RSC Medicinal Chemistry (2024)
[9] Ghorbani, A., Zou, J.: Data Shapley: Equitable valuation of data for machine learning. In: International Conference on Machine Learning. pp. 2242–2251. PMLR (2019)
[10] Globerson, A., Roweis, S.: Metric learning by collapsing classes. Advances in neural information processing systems 18 (2005)
[11] Hu, J., Yang, S., Mao, J., Shi, C., Wang, G., Liu, Y., Pu, X.: Exploring a general convolutional neural network-based prediction model for critical casting diameter of metallic glasses. Journal of Alloys and Compounds 947, 169479 (2023)
[12] Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning 110(3), 457–506 (2021). https://doi.org/10.1007/s10994-021-05946-3
[13] King, G., Zeng, L.: Logistic regression in rare events data. Political analysis 9(2), 137–163 (2001)
[14] Kutner, M.H., Nachtsheim, C.J., Neter, J., Li, W.: Applied linear statistical models. McGraw-hill (2005)
[15] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proc. NeurIPS, 31st Conf. on Neural Information Processing Systems. Long Beach, California, USA (2017)
[16] Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. In: Proceedings of the genetic and evolutionary computation conference 2016. pp. 485–492 (2016)
[17] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
[18] Spiers, R.C., Norby, C., Kalivas, J.H.: Physicochemical Responsive Integrated Similarity Measure (PRISM) for a Comprehensive Quantitative Perspective of Sample Similarity Dynamically Assessed with NIR Spectra. Analytical Chemistry (2023)
[19] Tynes, M., Gao, W., Burrill, D.J., Batista, E.R., Perez, D., Yang, P., Lubbers, N.: Pairwise difference regression: A machine learning meta-algorithm for improved prediction and uncertainty quantification in chemical search. Journal of Chemical Information and Modeling 61(8), 3846–3857 (2021)
[20] Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. SIGKDD Explorations 15(2), 49–60 (2013). https://doi.org/10.1145/2641190.2641198, http://doi.acm.org/10.1145/2641190.264119
[21] Wang, Y., King, R.D.: Extrapolation is Not the Same as Interpolation. In: International Conference on Discovery Science. pp. 277–292. Springer (2023)
[22] Wetzel, S.J., Melko, R.G., Tamblyn, I.: Twin neural network regression is a semi-supervised regression algorithm. Machine Learning: Science and Technology 3(4), 045007 (2022)
[23] Wetzel, S.J., Ryczko, K., Melko, R.G., Tamblyn, I.: Twin neural network regression. Applied AI Letters 3(4), e78 (2022)