1 IDIADA Fahrzeugtechnik GmbH, Munich, Germany. Email: [email protected]
2 Dr. Ing. h.c. F. Porsche AG, Stuttgart, Germany. Email: {extern.karim.belaid,maximilian.rabus2}@porsche.de
3 Ludwig-Maximilians-Universität Munich, Germany. Email: [email protected]

Pairwise Difference Learning for Classification

Mohamed Karim Belaid (1, 2; ORCID 0000-1111-2222-3333), Maximilian Rabus (2; ORCID 0000-0003-0755-1772), Eyke Hüllermeier (3; ORCID 0000-0002-9944-4108)
Abstract

Pairwise difference learning (PDL) has recently been introduced as a new meta-learning technique for regression. Instead of learning a mapping from instances to outcomes in the standard way, the key idea is to learn a function that takes two instances as input and predicts the difference between the respective outcomes. Given a function of this kind, predictions for a query instance are derived from every training example and then averaged. This paper extends PDL toward the task of classification and proposes a meta-learning technique for inducing a PDL classifier by solving a suitably defined (binary) classification problem on a paired version of the original training data. We analyze the performance of the PDL classifier in a large-scale empirical study and find that it outperforms state-of-the-art methods in terms of prediction performance. Last but not least, we provide an easy-to-use and publicly available implementation of PDL in a Python package.

Keywords: Supervised learning · Multiclass classification · Meta-learning

1 Introduction

Pairwise difference learning (PDL) has recently been introduced independently by Tynes et al. [19] and Wetzel et al. [22] as a meta-learning technique for regression, which transforms the original task of learning to predict outcomes for individual inputs into the task of learning to predict differences between the outcomes of input pairs: Noting that the value of a function $f$ at a point $x$ can be written “from the perspective” of any other point $x'$ as $f(x) = f(x') + \Delta(x, x')$ with $\Delta(x, x') = f(x) - f(x')$, the simple idea of PDL is to train an approximation $\tilde{\Delta}$ of the difference function $\Delta$ and obtain predictions of new outcomes $y = f(x)$ by averaging over the predicted differences to the outcomes in the training data:

$$y \approx \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i + \tilde{\Delta}(x, x_i) \bigr) \tag{1}$$

One of the main motivations of PDL is the quadratic increase of the training data: If the original training data contains $N$ data points $(x_1, y_1), \ldots, (x_N, y_N)$, the difference function can be trained on potentially $\mathcal{O}(N^2)$ training examples of the form $((x_i, x_j), y_i - y_j)$. This increase might be especially useful in the “small data” regime (even if the transformed examples are of course no longer independent of each other). Moreover, note that the prediction (1) benefits from a statistically useful averaging effect.
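The regression variant can be illustrated with a minimal sketch. It assumes one-dimensional inputs and a deliberately simple difference model $\tilde{\Delta}(x, x') = w\,(x - x')$ fitted by least squares on all ordered pairs; the function names and the toy data are illustrative assumptions, not part of the original method description.

```python
from itertools import product
from statistics import fmean


def fit_difference_slope(xs, ys):
    """Least-squares fit of delta(x, x') = w * (x - x') on all N^2 ordered pairs.

    The linear form of the difference model is an assumption of this sketch.
    """
    num = den = 0.0
    for (xi, yi), (xj, yj) in product(zip(xs, ys), repeat=2):
        dx, dy = xi - xj, yi - yj  # pair input and pair target y_i - y_j
        num += dx * dy
        den += dx * dx
    return num / den


def pdl_predict(x, xs, ys, w):
    """Equation (1): average y_i + delta(x, x_i) over all training points."""
    return fmean(yi + w * (x - xi) for xi, yi in zip(xs, ys))


# Toy data generated from f(x) = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]
w = fit_difference_slope(xs, ys)
prediction = pdl_predict(10.0, xs, ys, w)
```

On this noise-free toy problem every training point votes for the same value; with noisy targets the votes differ, and the mean is where the averaging effect comes in.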

Building on the basic idea of PDL, we make the following contributions. We extend the idea of PDL toward the task of classification and propose the PDL classifier, a meta-learning approach that transforms any multiclass classification problem into a single binary problem. This method leverages the concept of learning inter-class differences, which improves average prediction accuracy in our experiments (Section 3). We introduce the “pairwise difference learning library” (pdll) on PyPI, which incorporates our implementation of the PDL classifier and ensures compatibility with any scikit-learn model (Section 3.5). We conduct a large-scale experimental analysis of PDL and compare the results to state-of-the-art ML estimators (Section 4). We discuss the architecture of PDL and how it can lead to an improvement of the accuracy (Section 5).

2 Related Work

Tynes et al. introduced the pairwise difference regressor [19], a novel meta-learner for chemical tasks that improves prediction performance compared to random forest and provides robust uncertainty quantification. In computational chemistry, estimating differences between data points helps mitigate systematic errors [19]. In parallel, Wetzel et al. used twin neural network architectures for semi-supervised regression tasks, focusing on predicting differences between target values of distinct data points [22]. The approach of Wetzel et al. enabled training on unlabelled data points when paired with labeled anchor data points. By ensembling predicted differences between target values, the method achieved high prediction performance for regression problems. While conceptually similar to the pairwise difference regressor in emphasizing differences between data points, it is specialized to neural network architectures for semi-supervised regression tasks [23].

The pairwise difference learning (PDL) literature has since evolved into diverse methodologies and applications. Spiers et al. measured sample similarity in chemistry, emphasizing spectral shape differences using metrics like Euclidean and Mahalanobis distances. They extended the approach by calculating a Z-score, which offers insights into prediction accuracy and facilitates outlier detection and model adaptation [18]. PDL was developed mainly for regression tasks. It can also be adapted to targets that are either known exactly or only bounded. Examples of such target annotations are $y = 5.3$, $y < 2.1$, or $y > 6.5$. Predicting an increase/decrease between a pair is a possible solution [8]. The PDL regressor and its variants have demonstrated efficacy in various applications, including regression with image input [11], learning chemical properties [7], quantum mechanical reactions [5], and drug activity ranking [21].

3 PDL Classification

Consider a standard setting of supervised (classification) learning: Given a set of training data

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \mathcal{Y}\,,$$

comprised of training instances in the form of feature vectors $x_i \in \mathbb{R}^d$ together with observed discrete labels $y_i \in \mathcal{Y} = \{1, \ldots, K\}$, and assumed to be generated i.i.d. according to an underlying (unknown) joint probability measure $P$, the task is to learn a predictor $\textup{PDC} : \mathbb{R}^d \rightarrow \mathcal{Y}$ with low risk (expected loss). The PDL classifier transforms the original training data $\mathcal{D}$ into the new data

$$\mathcal{D}_{pair} = \bigl\{ (z_{i,j}, y_{i,j}) \,|\, 1 \leq i, j \leq N \bigr\}\,, \tag{2}$$

where $z_{i,j} = \phi(x_i, x_j)$ is a joint feature representation of the instance pair $(x_i, x_j)$ and

$$y_{i,j} = \begin{cases} 0 & \text{for } y_i \neq y_j, \\ 1 & \text{for } y_i = y_j. \end{cases} \tag{3}$$

Thus, we seek a binary classifier $\gamma : \mathbb{R}^d \times \mathbb{R}^d \rightarrow [0,1]$ that, given two instances $x$ and $x'$ as input, predicts whether or not the respective classes $y$ and $y'$ are the same. More specifically, we assume $\gamma$ to be a probabilistic classifier, so that $\gamma(x, x') \in [0,1]$ is the probability that $y = y'$. Deterministic classifiers that return a binary label as a prediction are treated as degenerate $\{0,1\}$-valued probabilistic classifiers. Leveraging the joint feature representation, $\gamma$ is of the form $\gamma(x, x') = h(\phi(x, x'))$, where $h$ is trained on the transformed data (2). To this end, any binary classification method can be used. Note, however, that the binary problem might be quite imbalanced, as the transformation (3) will produce many more negative (unequal) than positive (equal) examples. One can address this issue by introducing class weights [13] to rebalance the loss function of the classifier $\gamma$.
As for the joint feature representation, the original proposal was to define $z_{i,j}$ as a concatenation of $x_i$ and $x_j$. It turned out, however, that expanding this vector by the difference $x_i - x_j$ has a positive influence on performance [19], which is why we also adopt this representation in our work.
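The construction of the paired data in (2) and (3), together with the extended joint representation, can be sketched in a few lines. The helper names and the toy data are hypothetical, and the weighting formula (total count divided by the number of classes times the class count) is the common “balanced” heuristic, used here only as one plausible way to realize class weights.

```python
from collections import Counter


def phi(xi, xj):
    """Joint representation: x_i, x_j, and the difference x_i - x_j."""
    return list(xi) + list(xj) + [a - b for a, b in zip(xi, xj)]


def build_pairs(X, y):
    """Return D_pair: all N^2 ordered pairs, labeled 1 iff the classes match."""
    Z, labels = [], []
    for xi, yi in zip(X, y):
        for xj, yj in zip(X, y):
            Z.append(phi(xi, xj))
            labels.append(1 if yi == yj else 0)
    return Z, labels


def balanced_class_weights(labels):
    """Weight n / (n_classes * count), so each class carries equal total weight.

    This heuristic is an assumption of the sketch, not prescribed by the text.
    """
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}


# Hypothetical toy data: four points, three classes.
X = [[0.0, 1.0], [0.2, 0.9], [3.0, 3.5], [5.0, 0.1]]
y = [1, 1, 2, 3]
Z, pair_labels = build_pairs(X, y)
weights = balanced_class_weights(pair_labels)
```

Here 6 of the 16 pairs are positive, and the weights give the positive and the negative class the same total weight, which is the kind of rebalancing the paragraph above refers to.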

Figure 1: Illustration of the PDL classifier.

Since (class) equality is a symmetric relation, $\gamma$ is naturally expected to be symmetric in the sense that $\gamma(x_i, x_j) = \gamma(x_j, x_i)$. By adding both $(\phi(x_i, x_j), y_{i,j})$ and $(\phi(x_j, x_i), y_{j,i})$ to $\mathcal{D}_{pair}$, this symmetry can also be reflected in the training data. Even then, however, $\gamma$ is not guaranteed to preserve symmetry. Therefore, we additionally “symmetrize” the predictor as follows:

$$\gamma_{sym}(x_i, x_j) = \frac{\gamma(x_i, x_j) + \gamma(x_j, x_i)}{2} \tag{4}$$
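A small sketch of the symmetrization in (4) follows. The scorer `toy_gamma` is a hypothetical, deliberately asymmetric stand-in for a trained pairwise classifier; only the wrapper reflects the equation.

```python
import math


def symmetrize(gamma):
    """Wrap a pairwise scorer so that the result obeys equation (4)."""
    def gamma_sym(xi, xj):
        return 0.5 * (gamma(xi, xj) + gamma(xj, xi))
    return gamma_sym


def toy_gamma(xi, xj):
    """Hypothetical asymmetric scorer in [0, 1], standing in for a trained model."""
    similarity = math.exp(-math.dist(xi, xj))
    # The order-dependent factor makes the raw scorer asymmetric on purpose.
    return similarity * (0.9 if xi[0] < xj[0] else 1.0)


gamma_sym = symmetrize(toy_gamma)
a, b = (0.0, 0.0), (1.0, 0.0)
```

Calling `gamma_sym(a, b)` and `gamma_sym(b, a)` gives the same value even though `toy_gamma` does not, at the cost of two base-learner evaluations per pair.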

Given a query $x_q$, we finally estimate the probability of class labels $y \in \mathcal{Y}$ as follows: Considering each training example $(x_i, y_i)$ as a piece of evidence for the unknown class $y_q$, the semantics of the above prediction suggests that the probability of the event $y_q = y_i$ is given by (4). More formally, $P(E) = \gamma_{sym}(x_q, x_i)$, where $E$ denotes the event $y_q = y_i$ (and hence $P(\neg E) = 1 - \gamma_{sym}(x_q, x_i)$). Let $p$ denote the prior distribution on the class labels $\mathcal{Y}$ (which can easily be estimated by relative frequencies on the training data).
This distribution is then updated by conditioning it on the (uncertain) event $E$, which yields the following posterior suggested by $(x_i, y_i)$:

$$p_{post,i}(y) = \begin{cases} \gamma_{sym}(x_q, x_i) & \text{if } y = y_i, \\[4pt] \dfrac{p(y) \cdot (1 - \gamma_{sym}(x_q, x_i))}{1 - p(y_i)} & \text{otherwise.} \end{cases} \tag{5}$$

Thus, the (posterior) probability of $y_i$ is fixed to $\gamma_{sym}(x_q, x_i)$, and all other probabilities are rescaled proportionally, so that the posterior probabilities sum to 1. Finally, we average over the evidence from all training examples to obtain

$$p_{post}(y) = \frac{1}{N} \sum_{i=1}^{N} p_{post,i}(y)\,. \tag{6}$$

In case a deterministic prediction is sought, the class with the highest (estimated) probability is chosen:

$$\hat{y}_q = \arg\max_{y \in \mathcal{Y}} p_{post}(y) \tag{7}$$
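The prediction step in (5) to (7) can be written out as a short sketch. It assumes the symmetrized scores $\gamma_{sym}(x_q, x_i)$ have already been computed for every training point; the function names and the example values are illustrative assumptions.

```python
from collections import Counter


def class_prior(y_train):
    """Prior p(y), estimated by relative class frequencies."""
    n = len(y_train)
    return {c: k / n for c, k in Counter(y_train).items()}


def posterior_from_anchor(gamma_value, anchor_label, prior):
    """Equation (5): posterior suggested by a single training example."""
    rest = 1.0 - prior[anchor_label]
    return {
        c: gamma_value if c == anchor_label
        else prior[c] * (1.0 - gamma_value) / rest
        for c in prior
    }


def pdl_posterior(gamma_values, y_train):
    """Equations (6) and (7): average the per-anchor posteriors, then argmax."""
    prior = class_prior(y_train)
    n = len(y_train)
    post = {c: 0.0 for c in prior}
    for g, yi in zip(gamma_values, y_train):
        for c, p in posterior_from_anchor(g, yi, prior).items():
            post[c] += p / n
    return post, max(post, key=post.get)


# Hypothetical scores gamma_sym(x_q, x_i) for four training points.
y_train = ["a", "a", "b", "c"]
gamma_values = [0.9, 0.8, 0.1, 0.2]
post, label = pdl_posterior(gamma_values, y_train)
```

Each per-anchor posterior sums to 1 by construction, so the average in (6) is again a probability distribution.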

3.1 Uncertainty Quantification

Interestingly, the PDL approach also offers a natural approach to uncertainty quantification, a topic that has received increasing attention in the recent machine learning literature. In particular, recent research has focused on the distinction between so-called aleatoric uncertainty (caused by inherent randomness in the data) and epistemic uncertainty (caused by the learner’s incomplete knowledge of the true data-generating process) — we refer to [12] for a detailed exposition of this topic.

Within the Bayesian approach, these two types of uncertainty can be captured by properties of the posterior predictive distribution, which in turn can be approximated through ensemble learning [15]. In a sense, PDL parallels this approach, with each anchor playing the role of an ensemble member, and (6) mimicking Bayesian model averaging. This suggests the following quantification of aleatoric (AU), epistemic (EU), and total uncertainty (TU) of a prediction, with $H$ denoting Shannon entropy:

$$\begin{aligned} \text{TU} &= H\bigl(p_{post}(y)\bigr) = H\left(\frac{1}{N}\sum_{i=1}^{N} p_{post,i}(y)\right) \\ \text{AU} &= \frac{1}{N}\sum_{i=1}^{N} H\bigl(p_{post,i}(y)\bigr) \\ \text{EU} &= \text{TU} - \text{AU} \end{aligned}$$

Theoretically, these measures are justified based on a well-known result from information theory, according to which entropy additively decomposes into conditional entropy and mutual information [6]. Broadly speaking, the more uniform the (averaged) distribution $p_{post}$, the higher the total uncertainty, and the more diverse the individual predictions $p_{post,i}$, the higher the epistemic uncertainty.
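As a rough illustration of this decomposition, the following sketch computes TU, AU, and EU from a list of per-anchor posteriors $p_{post,i}$. Measuring entropy in bits (base-2 logarithm) is an assumption of the sketch, as are the function names and the two toy inputs.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def uncertainty(member_dists):
    """Return (TU, AU, EU) for a list of per-anchor posteriors p_post,i."""
    n = len(member_dists)
    k = len(member_dists[0])
    # Averaged distribution p_post, as in equation (6).
    mean_dist = [sum(d[c] for d in member_dists) / n for c in range(k)]
    tu = entropy(mean_dist)
    au = sum(entropy(d) for d in member_dists) / n
    return tu, au, tu - au


# Anchors that agree on a uniform distribution: purely aleatoric uncertainty.
agree = uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Anchors that are individually certain but contradict each other: purely epistemic.
disagree = uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both toy cases have the same total uncertainty of one bit, but it is attributed entirely to AU in the first case and entirely to EU in the second.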

3.2 Illustration

Thanks to its novel structure, the PDL classifier can solve a multiclass classification task by training exactly one instance of a base learner on a binary task. Fig. 1 illustrates the PDL classifier algorithm, showcasing both the training and prediction phases on a simple multiclass task. Fig. 1.a shows a traditional multiclass classifier $g$ that maps each of the $N$ training data points to its assigned class label (star, square, or circle). In Fig. 1.b, the PDL classifier transforms the data by creating $N^2$ pairs of data points. During training, a binary classifier $\gamma$ learns to distinguish pairs that belong to the same class (positive label) from pairs of different classes (negative label). In Fig. 1.c, given one query input, the PDL classifier pairs it with each of the $N$ training data points. For each pair, the classifier predicts a probability of similarity (belonging to the same class). Predicted probabilities are mapped to the column corresponding to the initial label of each training point. Missing posterior probabilities, in grey, are estimated by updating the prior probabilities, assuming a uniform distribution in this example. Finally, averaging across all training points yields the predicted probabilities for each class. The class with the highest predicted probability is chosen as the final class label for the query point (e.g., Class 3).

Figure 2: Comparing learned patterns using PDL classifiers and baseline models.

Fig. 2 illustrates the patterns learned by nine baseline models across three 2D datasets. The baseline 3-Nearest-Neighbor (3-NN) classifier can only predict four probabilities: $0$, $\frac{1}{3}$, $\frac{2}{3}$, and $1$. This is evident in the figure, where each dataset shows only four discrete regions. In contrast, when using PDL on top of 3-NN, the predicted probability is derived by averaging over $N$ discrete predictions. This results in more fine-grained probability estimates. Despite the simplicity of some estimators, PDL leverages more complex patterns. The contrast between DecisionTree with and without PDL clearly illustrates PDL’s capability to learn non-linear patterns. The underfitting observed when incompatible base models learn corrupted patterns underscores the critical role of the choice of base learners.

3.3 Choice of Base Learners

As already mentioned, PDL can theoretically be implemented with any (probabilistic) binary classifier as a base learner or, stated differently, it can be used as a wrapper for any (binary or multinomial) classifier. Practically, however, some classifiers might be more suitable as base learners than others.

One thing to keep in mind is that even if the original data $\mathcal{D}$ is i.i.d., independence will be lost for $\mathcal{D}_{pair}$ as soon as the same instance $x_i$ is paired with various other instances. This is very similar to the setting of metric learning, where models are also trained on pairs of data points [2]. In practice, although many machine learning algorithms turn out to be quite robust against violations of the i.i.d. assumption [14], some methods may be affected more than others.

Another important aspect is the joint feature representation $z = \phi(x, x')$. For example, by defining $z$ as a concatenation of $x$, $x'$, and the difference $x - x'$, one obviously introduces (perfect) multicollinearity. Again, while this is problematic for some machine learning methods, notably linear models [19, p.8], others can deal with this property more easily.
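The multicollinearity can be made concrete with a tiny sketch: for every coordinate $k$, the three corresponding entries of $z$ satisfy an exact linear relation. The helper names and the example vectors are hypothetical.

```python
def phi(x, x_prime):
    """Concatenate x, x', and the difference x - x'."""
    return list(x) + list(x_prime) + [a - b for a, b in zip(x, x_prime)]


def collinearity_residuals(z, d):
    """For each k, evaluate z[k] - z[d + k] - z[2d + k], which is always 0."""
    return [z[k] - z[d + k] - z[2 * d + k] for k in range(d)]


# Hypothetical pair of 3-dimensional instances.
x, x_prime = [1.5, -2.0, 4.0], [0.5, 3.0, 1.0]
z = phi(x, x_prime)
residuals = collinearity_residuals(z, d=len(x))
```

Because the last block of $z$ is a fixed linear combination of the first two, the design matrix of any linear model on $z$ is rank-deficient, whereas tree-based learners simply choose among the redundant features when splitting.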

While an in-depth analysis of the suitability of different base learners is beyond the scope of this paper, we generally found that non-parametric methods are more robust and tend to show better performance than parametric ones. In our experimental evaluation, we will therefore mainly use tree-based methods, which have the additional advantage of being fast to train.

3.4 Complexity

Looking at the complexity of PDL, suppose the complexity of a base learner to be $\mathcal{O}(p(N, M, F, K))$, where $p(\cdot)$ is polynomial in the number of training points ($N$), the number of test points ($M$), the number of input features ($F$), and the number of output classes ($K$). The complexity of PDL is then $\mathcal{O}(p(N^2, 2MA, 3F, 2))$. The training points are scaled to $N^2$ pairs. The features are scaled to $3F$: $F$ features of point $x_i$, $F$ features of point $x_j$, and $F$ features of the difference $x_i - x_j$, a feature construction that has previously been shown to improve results [19]. Each test point is paired with the $A$ anchor points, and each pair is evaluated in both orders to symmetrize the prediction (4), so $M$ test predictions of PDL require $2MA$ predictions of the base learner. The number of output classes $K$ shrinks to $2$, since the model only predicts whether the two points of a pair have the same class.
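To get a feeling for these quantities, the arguments of $p(\cdot)$ can be computed for a concrete setting. The helper below is a hypothetical illustration; it assumes that all training points serve as anchors ($A = N$) unless stated otherwise.

```python
def pdl_problem_size(n_train, n_test, n_features, n_anchors=None):
    """Arguments of p(.) for PDL: (N^2, 2MA, 3F, 2).

    Using every training point as an anchor (A = N) is an assumption
    of this sketch.
    """
    if n_anchors is None:
        n_anchors = n_train
    return {
        "train_pairs": n_train ** 2,                 # N^2 paired training examples
        "base_predictions": 2 * n_test * n_anchors,  # both orders of each pair
        "pair_features": 3 * n_features,             # x_i, x_j, and x_i - x_j
        "classes": 2,                                # same class vs. different class
    }


size = pdl_problem_size(n_train=600, n_test=100, n_features=10)
```

For 600 training points this gives 360 000 training pairs and 120 000 base-learner predictions for 100 queries, which shows why the approach is mainly attractive for small datasets.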

3.5 PDL Library

Our library (https://github.com/Karim-53/pdll) includes a Python implementation of the PDL classifier, adhering to the scikit-learn standards. Consequently, integrating the PDL classifier into existing codebases is straightforward, requiring minimal modifications. As demonstrated in the example below, only two additional lines of code are needed:

[Uncaptioned image]

4 Evaluation

In this section, we test PDC on various public datasets from OpenML [20] and compare it to 7 Scikit-learn state-of-the-art learners.

4.1 Data

OpenML provides a diverse range of datasets, many of which are small, with 37% having fewer than 600 data points. This study focuses on small datasets, for which the pairwise learning approach is presumably most effective. We applied grid search CV for parameter tuning, leveraging the search space from TPOT [16]. To accommodate our grid search setup, we subsampled the search space to 1,000 parameter combinations per estimator. Following dataset selection constraints similar to the OpenML-CC18 benchmark [3], we randomly selected 99 datasets (see summary statistics in Fig. 3). Although these datasets are relatively small, the effective data size for PDC grows quadratically due to the pairing ($N^2$ pairs for $N$ data points), reaching 360 000 data points. We also monitored class imbalance using the “minority class” meta-data, which represents the percentage of the minority class relative to the total size of each dataset. Considering the 7 baseline models, we performed 5 times 5-fold CV with an inner 3-fold grid search CV, totaling 66 528 000 train-test runs and 3 weeks of wall time on an HPC.

Figure 3: Distribution of key characteristics of the 99 OpenML classification datasets (minimum, mean, maximum).

4.2 Data Processing Pipeline

Using scikit-learn [17], we implemented a common data processing pipeline for all runs, with standardization for numeric features, one-hot encoding for nominal features, and ordinal encoding for ordinal features. Since PDL needs the pair difference $x_i - x_j$ as additional input, all processed features are treated as numeric when computing the difference.

4.3 Performance Measures

We measure performance in terms of the (macro) F1 score, which is arguably more meaningful than the standard misclassification rate in the case of imbalanced data. In binary classification, the F1 score is defined as the harmonic mean of precision and recall. For multinomial problems, the macro version of this score is the (unweighted) mean of the F1 scores for the individual classes:

$$\text{Macro}F1 = \frac{1}{K} \sum_{i=1}^{K} F1_i\,,$$

where $F1_i$ is the F1 score on the $i^{th}$ class (treating test examples of this class as positive and all others as negative). We also report the improvement of PDL over the base learner in terms of the difference $\Delta F1 = \text{Macro}F1_{PDC} - \text{Macro}F1_{base}$. We aggregate the results using the mean $\pm$ standard error.
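As a worked illustration of the definition above, the following sketch computes the macro F1 score from scratch. Two conventions are assumptions made here for the example: the set of classes is taken as the union of true and predicted labels, and a class with no true or predicted examples receives an F1 of 0.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class (one-vs-rest) F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)  # per-class F1: 0.5, 0.8, 2/3
```

The difference $\Delta F1$ is then simply `macro_f1(y_true, pdc_pred) - macro_f1(y_true, base_pred)` for predictions of the two models on the same test fold.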

To aggregate the results of all data sets, we count the number of wins/losses by comparing the average performance of models over 25 runs (5 times 5-fold CV) per dataset. A win is counted when PDC’s average score is higher than the baseline’s; a loss is counted when it is lower. To determine the number of significant wins/losses, a Student’s t-test is conducted for each dataset to assess the statistical significance of the difference in performance. A significant win/loss is recorded when the p-value of the t-test is below a predetermined threshold $\alpha = 0.05$. In some cases, the average scores are tied, so the number of wins and losses does not always sum to 99, the total number of datasets benchmarked.
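The counting scheme can be sketched as follows. The data layout (a dictionary from dataset name to the list of per-run scores) and the function name `count_outcomes` are assumptions made for illustration. The significance test is left out of the sketch because the text does not state which variant of the t-test (paired or two-sample) was used.

```python
from statistics import mean

def count_outcomes(pdc_runs, base_runs):
    """Count PDC wins, losses and ties from per-dataset lists of run scores."""
    counts = {"win": 0, "loss": 0, "tie": 0}
    for dataset, pdc_scores in pdc_runs.items():
        pdc_avg = mean(pdc_scores)
        base_avg = mean(base_runs[dataset])
        if pdc_avg > base_avg:
            counts["win"] += 1
        elif pdc_avg < base_avg:
            counts["loss"] += 1
        else:
            counts["tie"] += 1
    return counts

# Toy scores for three hypothetical datasets (two runs each instead of 25).
pdc_runs = {"d1": [0.80, 0.90], "d2": [0.60, 0.70], "d3": [0.75, 0.85]}
base_runs = {"d1": [0.70, 0.80], "d2": [0.70, 0.80], "d3": [0.85, 0.75]}
outcomes = count_outcomes(pdc_runs, base_runs)
```

Keeping ties as a separate outcome is what makes the win and loss columns of a comparison table sum to less than the number of datasets.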

As an alternative to counting wins and losses, and despite being aware of the questionable nature of this statistic, we also average performance over data sets. Average performance may provide a first overall impression, although we agree that it should always be interpreted in a cautious way.

4.4 Results

First, the PDL classifier, on top of ExtraTrees, obtained the best average Macro F1 score over the 99 datasets, outperforming all baselines, see Fig. 4. In Tab. 1, the ratio of significant wins demonstrates an advantage for the PDL classifier, suggesting that, in a one-to-one comparison, PDL is more likely to outperform its equivalent baseline.

Figure 4: Comparing average Macro F1 score of optimized baseline classifiers and PDL classifiers.

Table 1: Comparing baseline classifiers to PDC using 99 datasets.

| Classifier | Significant wins (base) | Significant wins (PDC) | Wins (base) | Wins (PDC) | Average test Macro F1, base ± sem | Average test Macro F1, PDC ± sem |
|---|---|---|---|---|---|---|
| Bagging | 3 | 26 | 27 | 70 | 0.7906 ± 0.0035 | 0.8062 ± 0.0034 |
| DecisionTree | 2 | 50 | 22 | 76 | 0.7694 ± 0.0037 | 0.7982 ± 0.0034 |
| ExtraTree | 1 | 61 | 9 | 90 | 0.7434 ± 0.0037 | 0.7987 ± 0.0035 |
| ExtraTrees | 6 | 24 | 21 | 77 | 0.7951 ± 0.0036 | 0.8113 ± 0.0035 |
| GradientBoosting | 9 | 23 | 25 | 72 | 0.7839 ± 0.0037 | 0.7903 ± 0.0039 |
| HistGradientBoosting | 2 | 32 | 15 | 82 | 0.7888 ± 0.0037 | 0.8053 ± 0.0035 |
| RandomForest | 5 | 27 | 22 | 73 | 0.7933 ± 0.0035 | 0.8073 ± 0.0034 |

The PDL classifier can be viewed as a method to simplify the trained model. As shown in Fig. 4, the test performance of PDC(DecisionTree) is equivalent to or better than that of the seven benchmarked state-of-the-art estimators. This indicates that, with the help of PDL, training a single tree can compete with ensemble methods that typically train around 100 trees. In this context, explaining a single tree may provide a more straightforward solution.

Analyzing the unique contribution.

While PDL classifiers have high probabilities of outperforming baseline models in a one-to-one comparison, the ultimate goal of a data scientist is to obtain the best performance on each dataset. Before introducing PDC, the maximum achievable Macro F1 score was $0.8112 \pm 0.0035$ averaged over the 99 datasets. With the help of PDC, we achieve higher scores in 75 datasets, and the new record becomes $0.8243 \pm 0.0031$. This advance showcases the unique contribution of PDC to the field of ML compared to existing algorithms. Moreover, PDC’s unique contribution to the record is not only substantial but also the highest among all estimators. Indeed, PDC’s leave-one-out contribution to this record is $0.8243 - 0.8112 = 0.0131$, while popular estimators like HistGradientBoosting make no unique contribution, i.e., they are not able to outperform all other estimators on any of the 99 datasets, see Tab. 2. PDC’s contribution is even 32 times larger than that of the best baseline.

Table 2: Unique contribution of each estimator to the average Macro F1 score using the best optimized model on each dataset.

| Estimator | Unique contribution | Wins |
|---|---|---|
| ExtraTree | 0 | 0 |
| HistGradientBoosting | 0 | 0 |
| RandomForest | 0.00002 | 1 |
| Bagging | 0.00004 | 2 |
| GradientBoosting | 0.00006 | 2 |
| DecisionTree | 0.00020 | 10 |
| ExtraTrees | 0.00041 | 9 |
| PDC | 0.01312 | 75 |
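The leave-one-out contribution can be sketched as the drop in the record (the average over datasets of the best score) when one estimator is removed from the pool. The nested-dictionary layout and the toy scores below are assumptions for illustration. In the paper, the PDC entry presumably groups all PDC variants, whereas the sketch treats each key as a single estimator.

```python
from statistics import mean

def best_average(scores, exclude=None):
    """Average over datasets of the best score among the remaining estimators."""
    return mean(
        max(s for est, s in per_dataset.items() if est != exclude)
        for per_dataset in scores.values()
    )

def unique_contribution(scores, estimator):
    """Record with all estimators minus the record without `estimator`."""
    return best_average(scores) - best_average(scores, exclude=estimator)

# Toy Macro F1 scores: dataset -> estimator -> score (hypothetical values).
scores = {
    "d1": {"A": 0.80, "B": 0.70, "PDC": 0.90},
    "d2": {"A": 0.60, "B": 0.65, "PDC": 0.62},
}
contributions = {est: unique_contribution(scores, est) for est in ("A", "B", "PDC")}
```

An estimator that is never the sole best model on any dataset, such as `A` in the toy data, has a contribution of zero under this definition.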

Analyzing the overfitting.

PDL classifiers have the advantage of decreasing overfitting. Indeed, looking at the 199 cross-validation (CV) runs in which the baseline and the PDL classifier obtain train Macro F1 scores that are not significantly different, we notice that PDL classifiers have a smaller train-test gap. Lower overfitting is also observed when grouping by base classifier, see Tab. 3. This even remains true without conditioning on non-significantly different train scores.

Table 3: Comparing test Macro F1 on the subset of runs where train scores are not significantly different.

| Estimator | # CV runs | Baseline Macro F1 (train) | Baseline Macro F1 (test) | PDC Macro F1 (train) | PDC Macro F1 (test) | Test $\Delta F1$ | Test p-value |
|---|---|---|---|---|---|---|---|
| Bagging | 20 | 0.998 | 0.835 | 0.999 | 0.859 | 0.024 | $10^{-15}$ |
| DecisionTree | 14 | 0.950 | 0.884 | 0.955 | 0.895 | 0.011 | $10^{-5}$ |
| ExtraTree | 11 | 0.915 | 0.844 | 0.924 | 0.861 | 0.017 | $10^{-4}$ |
| ExtraTrees | 26 | 0.985 | 0.828 | 0.991 | 0.853 | 0.025 | $10^{-16}$ |
| GradientBoosting | 58 | 0.930 | 0.822 | 0.926 | 0.840 | 0.018 | $10^{-17}$ |
| HistGradientBoosting | 52 | 0.961 | 0.820 | 0.963 | 0.839 | 0.019 | $10^{-19}$ |
| RandomForest | 18 | 0.992 | 0.855 | 0.997 | 0.881 | 0.026 | $10^{-11}$ |
| Total | 199 | 0.958 | 0.832 | 0.960 | 0.852 | 0.020 | $10^{-74}$ |

5 Why Does PDL Yield Improved Performance?

The empirical results reveal that the PDL classifier significantly improves over the baseline methods. In this section, we elaborate on possible reasons for this improvement.

5.1 Combining Instance-based and Model-based Learning

A distinguishing feature of PDL is a unique combination of (local) instance-based learning and (global) model-based learning. Like the well-known nearest-neighbor principle, a prediction for a new query is produced by other instances from the training set, namely the anchor points; yet, as opposed to NN, these instances are not restricted to nearby cases but can be located anywhere in the instance space. This becomes possible through the model-based component of PDL, namely the classifier $\gamma$, which is a global model that generalizes over the entire instance space. Broadly speaking, by constructing $\gamma$, the classifier learns how to transfer class information from one data point to another.
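To make the transfer of class information concrete, the sketch below turns same-class probabilities into a class distribution for a query. It is a simplified illustration and not the exact aggregation rule of PDC: the pairwise model `gamma` is a hypothetical callable returning the probability that two instances share a class, and the remaining probability mass is spread uniformly over the other classes, which is an assumption made here.

```python
def predict_proba_from_anchors(gamma, x_query, anchors, labels, classes):
    """Average per-anchor class distributions derived from same-class probabilities."""
    totals = {c: 0.0 for c in classes}
    for x_anchor, y_anchor in zip(anchors, labels):
        p_same = gamma(x_query, x_anchor)
        for c in classes:
            if c == y_anchor:
                totals[c] += p_same
            else:
                # Assumption: remaining mass is shared equally by the other classes.
                totals[c] += (1.0 - p_same) / (len(classes) - 1)
    return {c: total / len(anchors) for c, total in totals.items()}

# Hypothetical pairwise model on 1-D inputs: close points likely share a class.
def toy_gamma(a, b):
    return 0.9 if abs(a - b) < 1.0 else 0.2

proba = predict_proba_from_anchors(
    toy_gamma, 0.2, anchors=[0.0, 0.5, 5.0], labels=["a", "a", "b"], classes=["a", "b"]
)
```

Note that the distant anchor at 5.0 still contributes: a low same-class probability for class "b" is evidence for class "a", which is how anchors far from the query remain informative.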

Of course, there are other learning methods with similar characteristics. For example, instead of using a predefined distance function, the nearest neighbor method can be instantiated with a distance function $\delta$ that is learned on the training data. Metric learning typically proceeds from sets of similar instances (belonging to the same class) and dissimilar instances (belonging to different classes), and seeks to learn a function $\delta$ that keeps the distance low for the former while making it high for the latter [10, 2]. In a sense, this is indeed quite comparable to PDL, especially because both $\delta$ and $\gamma$ are two-place functions taking pairs of instances as input. Moreover, $\gamma$ could indeed also be seen as a kind of distance measure, if “distance” is defined in terms of “probability of belonging to the same class”. Yet, PDL is arguably more flexible, because $\gamma$ is not required to satisfy properties of a distance or metric.

5.2 Simplification through Binary Reduction

Another advantage of PDL is simplicity: The original classification task is effectively reduced to a binary problem, namely, to decide whether or not two instances share the same class label. This is comparable to binary decomposition techniques such as one-vs-rest and all-pairs [4, p.202], which reduce a single multinomial classification problem to several binary problems. Instead, PDL constructs a single binary problem, although the total number of training examples produced essentially coincides for all methods (it is roughly quadratic in the size of the original data). In any case, binary problems are normally easier to solve, which explains the improved classification accuracy commonly reported for reduction techniques. In this regard, a decomposition can even be useful for methods that are able to handle multinomial problems right away (such as decision trees).
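The reduction to a single binary problem can be sketched in a few lines: every ordered pair of training labels is mapped to 1 if the two labels agree and to 0 otherwise. The function name `make_pair_labels` and the toy labels are illustrative assumptions.

```python
from itertools import product

def make_pair_labels(y):
    """Binary target for every ordered pair: 1 if both instances share a class."""
    return [int(a == b) for a, b in product(y, repeat=2)]

y = ["cat", "dog", "bird", "cat"]  # a three-class toy problem
z = make_pair_labels(y)
print(len(z), sum(z))  # number of pairs, number of same-class pairs
```

Whatever the number of original classes, the resulting target takes only two values, and the number of pairs is quadratic in the number of instances.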

5.3 Error Reduction through Averaging

Last but not least, by instantiating the global model for every anchor and collecting predictions from all of them, PDL benefits from a kind of ensemble effect and reduces error through averaging. In particular, since prediction errors of individual anchors can be compensated by other anchors, PDL is able to reduce the variance of the prediction error. Again, this is somewhat comparable to the nearest-neighbor method. Given the model $\gamma$, the anchor predictions can even be considered as independent (of course, this independence is lost if the anchor points are also part of the data used to train $\gamma$), which, under the simplified assumption of homoscedasticity, means that the prediction error is reduced by a factor of $1/\sqrt{A}$, with $A$ the number of anchors [23, p.4].
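The $1/\sqrt{A}$ scaling can be checked with a small simulation, sketched here under the idealized assumptions of the argument: anchor errors are drawn independently from a zero-mean normal distribution with a common standard deviation. The function name, the number of trials, and the seed are arbitrary choices for the example.

```python
import random
from statistics import pstdev

def std_of_averaged_error(n_anchors, sigma=1.0, n_trials=4000, seed=0):
    """Empirical standard deviation of the mean of n_anchors independent errors."""
    rng = random.Random(seed)
    averaged = [
        sum(rng.gauss(0.0, sigma) for _ in range(n_anchors)) / n_anchors
        for _ in range(n_trials)
    ]
    return pstdev(averaged)

# Under independence, each fourfold increase in anchors should roughly halve the spread.
for n_anchors in (1, 4, 16):
    print(n_anchors, round(std_of_averaged_error(n_anchors), 3))
```

With correlated anchor errors, as in practice, the reduction is weaker than this idealized rate.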

Even if these assumptions may not be completely satisfied, an expected improvement through averaging can clearly be observed in empirical studies. Fig. 5 represents four cases encountered with four different datasets and DecisionTree as a baseline. We compare the loss of the baseline (baseline loss) with the actual PDL loss, i.e., the loss given all available anchors. The empirical approximation curve is meant to show how the loss depends on the number of anchor points. Its value at $A$ is produced by averaging the performance over randomly selected anchor subsets of size $A$. The curve goes from the average loss when only one anchor is used ($\gamma$ loss) until reaching the actual PDL loss. The theoretical approximation curve is an optimal fit of a theoretical model to the empirical approximation, namely, the decrease of the error under the ideal assumption of independent prediction errors distributed normally with mean $\mu$ and standard deviation $\sigma$. As can be seen, even if this assumption may not fully hold, the two curves deviate but slightly.
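A possible way to compute such an empirical curve is sketched below. It assumes that the per-anchor class probabilities for each query are already available as a nested list (anchor, query, class), that labels are class indices, and that the loss is the error rate of the averaged prediction; the paper does not specify these details here, so they are assumptions of the example.

```python
import random

def empirical_curve(per_anchor_proba, y_true, sizes, n_subsets=50, seed=0):
    """Mean error rate of the averaged prediction over random anchor subsets."""
    rng = random.Random(seed)
    n_anchors = len(per_anchor_proba)
    n_classes = len(per_anchor_proba[0][0])
    curve = {}
    for size in sizes:
        errors = []
        for _ in range(n_subsets):
            subset = rng.sample(range(n_anchors), size)
            wrong = 0
            for q, label in enumerate(y_true):
                avg = [
                    sum(per_anchor_proba[a][q][c] for a in subset) / size
                    for c in range(n_classes)
                ]
                wrong += avg.index(max(avg)) != label
            errors.append(wrong / len(y_true))
        curve[size] = sum(errors) / n_subsets
    return curve

# Three anchors, one query with true class 0; the third anchor is misleading.
per_anchor_proba = [[[0.9, 0.1]], [[0.9, 0.1]], [[0.2, 0.8]]]
curve = empirical_curve(per_anchor_proba, y_true=[0], sizes=[1, 2, 3])
```

In the toy data, a single randomly chosen anchor is sometimes wrong, while any subset of two or three anchors outvotes the misleading one.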

In case (a), the loss of the PDC’s $\gamma$ estimator is better than the loss of the baseline model. As expected in this case, PDC is better than the baseline with any number of anchors. In case (b), the baseline loss is between $\gamma$ loss and PDC’s loss. With the theoretical approximation, we estimate how many anchors are enough to outperform the baseline. In case (c), the baseline model is better than PDL. Nevertheless, the theoretical approximation allows us to estimate the additional anchors needed to outperform the baseline and the best reachable loss. It becomes less and less efficient to improve the score by adding more anchors. It might become more interesting, starting from a certain size, to work more on the base learner or the data quality. In case (d), the baseline model is even better than the approximated asymptote because learning the dual problem is more difficult. Adding more anchors is less likely to help.

Figure 5: Effect of the anchor set size on PDC’s loss relative to the baseline.

6 Conclusion

Building on the concept of pairwise difference learning (PDL), we proposed the PDL classifier (PDC), a meta-learner able to reduce a multiclass classification problem into a binary problem. Our extensive empirical evaluation across 99 diverse datasets demonstrates that PDL consistently outperforms state-of-the-art machine learning models, resulting in improved F1 scores in a majority of cases. This highlights PDL’s effectiveness in enhancing performance over baseline methods, facilitated through its straightforward integration via our Python package. To explain its strong performance, we also elaborated on several properties and features of PDC.

Future research directions include the exploration of instance (anchor) weighting through regularization or Shapley data importance [9] and interaction [1]. Moreover, we plan to elaborate more closely on PDC’s potential to quantify predictive uncertainty (cf. Section 3.1).

In conclusion, PDL emerges as a practical solution for improving ML models, offering versatility and performance improvements across diverse applications. Its adaptability and robust performance make it a valuable addition to the ML toolkit, promising more accurate and reliable predictions in various domains.

References

  • [1] Belaid, M.K., El Mekki, D., Rabus, M., Hüllermeier, E.: Optimizing Data Shapley Interaction calculation from $\mathcal{O}(2^N)$ to $\mathcal{O}(TN^2)$ for KNN models. arXiv preprint arXiv:2304.01224 (2023)
  • [2] Bian, W., Tao, D.: Learning a Distance Metric by Empirical Loss Minimization. In: Proc. IJCAI, International Joint Conference on Artificial Intelligence (2013)
  • [3] Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: OpenML benchmarking suites. arXiv preprint arXiv:1708.03731 (2017)
  • [4] Bishop, C.: Pattern recognition and machine learning. Springer 2, 183 (2006)
  • [5] Chen, Y., Ou, Y., Zheng, P., Huang, Y., Ge, F., Dral, P.O.: Benchmark of general-purpose machine learning-based quantum mechanical method AIQM1 on reaction barrier heights. The Journal of Chemical Physics 158(7) (2023)
  • [6] Depeweg, S., Hernandez-Lobato, J., Doshi-Velez, F., Udluft, S.: Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In: Proc. ICML, 35th Int. Conf. on Machine Learning. Stockholm, Sweden (2018)
  • [7] Fralish, Z., Chen, A., Skaluba, P., Reker, D.: DeepDelta: predicting ADMET improvements of molecular derivatives with deep learning. Journal of Cheminformatics 15(1), 101 (2023)
  • [8] Fralish, Z., Skaluba, P., Reker, D.: Leveraging bounded datapoints to classify molecular potency improvements. RSC Medicinal Chemistry (2024)
  • [9] Ghorbani, A., Zou, J.: Data Shapley: Equitable valuation of data for machine learning. In: International Conference on Machine Learning. pp. 2242–2251. PMLR (2019)
  • [10] Globerson, A., Roweis, S.: Metric learning by collapsing classes. Advances in neural information processing systems 18 (2005)
  • [11] Hu, J., Yang, S., Mao, J., Shi, C., Wang, G., Liu, Y., Pu, X.: Exploring a general convolutional neural network-based prediction model for critical casting diameter of metallic glasses. Journal of Alloys and Compounds 947, 169479 (2023)
  • [12] Hüllermeier, E., Waegeman, W.: Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning 110(3), 457–506 (2021). https://doi.org/10.1007/s10994-021-05946-3
  • [13] King, G., Zeng, L.: Logistic regression in rare events data. Political analysis 9(2), 137–163 (2001)
  • [14] Kutner, M.H., Nachtsheim, C.J., Neter, J., Li, W.: Applied linear statistical models. McGraw-hill (2005)
  • [15] Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proc. NeurIPS, 31st Conf. on Neural Information Processing Systems. Long Beach, California, USA (2017)
  • [16] Olson, R.S., Bartley, N., Urbanowicz, R.J., Moore, J.H.: Evaluation of a tree-based pipeline optimization tool for automating data science. In: Proceedings of the genetic and evolutionary computation conference 2016. pp. 485–492 (2016)
  • [17] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • [18] Spiers, R.C., Norby, C., Kalivas, J.H.: Physicochemical Responsive Integrated Similarity Measure (PRISM) for a Comprehensive Quantitative Perspective of Sample Similarity Dynamically Assessed with NIR Spectra. Analytical Chemistry (2023)
  • [19] Tynes, M., Gao, W., Burrill, D.J., Batista, E.R., Perez, D., Yang, P., Lubbers, N.: Pairwise difference regression: A machine learning meta-algorithm for improved prediction and uncertainty quantification in chemical search. Journal of Chemical Information and Modeling 61(8), 3846–3857 (2021)
  • [20] Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L.: OpenML: networked science in machine learning. SIGKDD Explorations 15(2), 49–60 (2013). https://doi.org/10.1145/2641190.2641198, http://doi.acm.org/10.1145/2641190.264119
  • [21] Wang, Y., King, R.D.: Extrapolation is Not the Same as Interpolation. In: International Conference on Discovery Science. pp. 277–292. Springer (2023)
  • [22] Wetzel, S.J., Melko, R.G., Tamblyn, I.: Twin neural network regression is a semi-supervised regression algorithm. Machine Learning: Science and Technology 3(4), 045007 (2022)
  • [23] Wetzel, S.J., Ryczko, K., Melko, R.G., Tamblyn, I.: Twin neural network regression. Applied AI Letters 3(4), e78 (2022)