1 Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin TX, USA
2 Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
3 Center for Data Science, New York University, New York NY, USA
Email: [email protected]

Novel Node Category Detection Under Subpopulation Shift

Hsing-Huan Chung (✉) 1, Shravan Chaudhari 2, Yoav Wald 3, Xing Han 2, Joydeep Ghosh 1
Abstract

In real-world graph data, distribution shifts can manifest in various ways, such as the emergence of new categories and changes in the relative proportions of existing categories. It is often important to detect nodes of novel categories under such distribution shifts for safety or insight discovery purposes. We introduce a new approach, Recall-Constrained Optimization with Selective Link Prediction (RECO-SLIP), to detect nodes belonging to novel categories in attributed graphs under subpopulation shifts. By integrating a recall-constrained learning framework with a sample-efficient link prediction mechanism, RECO-SLIP addresses the dual challenges of resilience against subpopulation shifts and the effective exploitation of graph structure. Our extensive empirical evaluation across multiple graph datasets demonstrates the superior performance of RECO-SLIP over existing methods. The experimental code is available at: https://github.com/hsinghuan/novel-node-category-detection.

Keywords:
Novel category detection · Positive-unlabeled learning · Graph neural networks

1 Introduction

Distribution shifts may occur in real-world graphs either through natural temporal evolution [5] or the manual integration of new data sources [30]. Such shifts manifest in various ways, including the emergence of new categories or alterations in the relative proportions of existing ones. The task of novel category detection [3, 27] involves identifying data samples that do not fit into pre-existing categories even in the face of these shifts. As a motivating example, consider the product co-purchasing network of a consumer-to-consumer (C2C) e-commerce platform in Fig. 1. The co-purchasing network may evolve as the platform becomes available in new regions. This evolution can lead to the introduction of new product categories and changes in the relative popularity of existing ones. It is crucial to identify products in potential novel categories, either for safety reasons or to develop new insights for platform improvement. Additionally, detection must work under subpopulation shift among the existing (non-novel) categories, ideally without knowing the category labels of individual products, which are hard to obtain given the chaotic nature of a C2C platform. Similar phenomena and demands also exist in academic citation graphs, where papers introducing new research topics are to be detected. These applications underscore the importance of studying novel node category detection amidst subpopulation shifts.

Figure 1: An illustration of novel node category detection under subpopulation shift using a product co-purchasing network. The target domain consists of products from the two categories that exist in the source domain (sports and kitchen) and a novel category (weapons). Meanwhile, the relative proportion of the two original categories changes from source to target. The goal is to detect the products belonging to the novel category in the target domain.

In this study, we consider an attributed graph where nodes belong to either the source or the target domain. The source nodes all fall into the non-novel categories whereas the target nodes may belong to the non-novel categories or a novel category.¹ Using the co-purchasing network example, one can think of the source nodes as the products that have already existed before a certain time point and the target nodes as the products introduced afterward. Our primary goal is to identify nodes within the target domain that fall into the novel category while the relative proportions of non-novel categories between the source and target domains may vary. As shown by previous work [3, 10, 27], novel category detection can be naturally reframed as a positive-unlabeled (PU) learning problem [1]. In this context, non-novel categories are collectively considered as the positive class, and the novel category as the negative class. Since all source nodes belong to the non-novel categories and the novel/non-novel labels of individual target nodes are not known, the source and target nodes can be viewed as positively labeled and unlabeled data, respectively. Consequently, the problem of novel node category detection is equivalent to learning from positively labeled and unlabeled nodes.

¹ This formulation can be readily extended to multiple novel categories by viewing them as a single category. The detected novel category can be subsequently partitioned into multiple sub-categories by a suitable graph partitioning algorithm.

Despite the connection to PU learning, existing standard and graph PU learning methods encounter significant limitations. These methods often rely on the Selected Completely At Random (SCAR) assumption [8], which posits that each positive instance is equally likely to be labeled. However, this assumption breaks in the face of subpopulation shifts, leading to compromised PU learning performance. Several approaches have sought to relax the SCAR assumption. For instance, propensity weighting-based methods [2, 12, 28] estimate the probability of each positive sample being labeled and integrate the probabilities into the loss function. One other recent method that loosens the SCAR assumption, CoNoC [27], solves a constrained learning problem and has finite sample guarantees under distribution shift. Although these methods do not require SCAR to hold, they essentially treat all unlabeled samples as equally negative, or in our language, all target samples as equally novel. They do not explicitly utilize the subgroup structure provided by edges in a graph to further distinguish nodes of the novel category from the rest.

To address the challenges of vulnerability to subpopulation shifts and the ineffective use of graph structure, we introduce REcall-Constrained Optimization with Selective LInk Prediction (RECO-SLIP). RECO-SLIP builds upon the constrained learning framework of CoNoC, which has only shown empirical success on tabular datasets. RECO-SLIP enhances this framework by integrating a link prediction loss induced by a sample-efficient edge sampling strategy to preserve the novel subgroup structure in the node representation space. Our comprehensive experiments showcase the effectiveness of RECO-SLIP. In summary, our key contributions are threefold:

  • We formally define the problem of detecting nodes from novel categories in attributed graphs, particularly under conditions of subpopulation shift.

  • We introduce RECO-SLIP, which synergizes a recall-constrained learning framework with a sample-efficient link prediction mechanism. This approach addresses the limitations of existing methods under subpopulation shifts and the underutilization of graph structures.

  • We conduct a comprehensive empirical evaluation of our approach, comparing its performance against standard PU learning, propensity-weighting, and graph PU learning methods on five graph datasets. Our findings affirm the effectiveness and robustness of the proposed solution.

2 Related Work

2.1 PU Learning and Novel Category Detection

PU learning [1] is the problem of learning from a dataset with only positively labeled and unlabeled data. Novel category detection [3, 27] can be naturally reframed as a PU learning problem. The concepts of novel and non-novel in our problem map to the concepts of negative and positive in PU learning, respectively. The source nodes then correspond to the positively labeled samples in PU learning, since all source nodes are known to be from the non-novel (positive) categories. The target nodes correspond to the unlabeled samples because they could be from a novel (negative) or non-novel (positive) category and the learner does not know which ones are novel and which ones are not. This connection has been drawn by several prior works [3, 10, 27]. Mainstream PU learning methods [7, 8, 17, 38] deal with the absence of negative labels through risk estimator design, treating unlabeled samples as negatives and labeled samples as weighted combinations of positives and negatives. These risk estimators assume the class priors are given. In practice, the class priors have to be estimated from data by mixture proportion estimation (MPE) techniques [11, 35]. These PU learning approaches are based on the Selected Completely At Random (SCAR) assumption [8]. In our context, SCAR would assume that every node from a non-novel category has an equal probability of being in the source domain.

2.2 Subpopulation Shift and PU Learning

Subpopulation shift [18, 23, 33] is a specific type of distribution shift where the proportions of categories differ between the source and target domains. Prior work in subpopulation shift mostly focuses on learning a classifier with a decent worst group accuracy in the target domain [20, 22, 34]. Our focus is different as we are interested in detecting the emergence of new categories in the face of subpopulation shifts among the pre-existing ones.

SCAR would break when there exists a subpopulation shift between the source and target. Using Fig. 1 as an example, a kitchen product would have a higher probability of being in the source domain than a sports product due to the subpopulation shift, violating SCAR. We show this connection more formally in Appendix 0.A. One line of work [2, 12, 28] relaxes SCAR through learning the propensity scores, i.e. the probability of a positive sample being labeled, and incorporating them into the risk estimator. However, they require other assumptions in order to estimate propensity scores. A relatively mild assumption is that the propensity score depends on fewer attributes than the PU classifier [2] while some stronger ones are Local Certainty and Probabilistic Gap [12, 28]. One other recent work [27] uses a constrained learning method, CoNoC, to minimize the error of classifying labeled positive data as negative while reserving enough unlabeled data as negative. The constrained learning framework has PAC-like finite sample guarantees but its empirical success has only been shown on tabular data so far. Our method, RECO-SLIP, leverages the constrained learning framework due to its suitability to our problem. In addition, RECO-SLIP uses a selective link prediction strategy to preserve the novel category subgroup structure in the node representation space for better separation.
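The SCAR violation described above can be illustrated numerically. The following is a minimal sketch, not the formal argument of Appendix 0.A: it assumes expected node counts, and the function name `labeling_propensity` as well as all proportions and domain sizes are hypothetical values chosen for illustration. It computes, for each non-novel category, the probability that a node of that category lies in the source domain (i.e. is "labeled").

```python
def labeling_propensity(source_props, target_props, n_source, n_target, novel_ratio):
    """P(node is in the source domain | node belongs to non-novel category i).

    Uses expected counts: n_source * source_props[i] source nodes and
    n_target * (1 - novel_ratio) * target_props[i] non-novel target nodes.
    """
    propensities = []
    for p_src, p_tgt in zip(source_props, target_props):
        src_count = n_source * p_src
        tgt_count = n_target * (1.0 - novel_ratio) * p_tgt
        propensities.append(src_count / (src_count + tgt_count))
    return propensities


# Hypothetical two-category example (kitchen, sports), 20% novel nodes in target.
no_shift = labeling_propensity([0.5, 0.5], [0.5, 0.5], 1000, 1000, 0.2)
with_shift = labeling_propensity([0.7, 0.3], [0.3, 0.7], 1000, 1000, 0.2)

print(no_shift)    # identical propensities: SCAR holds
print(with_shift)  # kitchen propensity exceeds sports propensity: SCAR is violated
```

Without shift, both categories share the same propensity, which is what SCAR requires. Once the proportions differ between domains, the propensity depends on the category.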

2.3 PU Learning on Graphs

Graph PU learning utilizes the edge relations between node samples to learn a PU classifier. LSDAN [29] is the first method proposed for PU learning on attributed graphs. LSDAN trains a long-short distance attention model with the non-negative PU loss [17]. LP-PUL [4] uses the average shortest path distances from the positive nodes to identify the most negative nodes and performs label propagation to obtain the final predictions. GRAB [36] estimates the class prior and learns the PU classifier through iterative belief propagation on the graph. PU-GNN [32] is a state-of-the-art graph PU learning method. PU-GNN segregates unlabeled nodes into two sets by proximity to source nodes and aligns the expectations of predicted label distributions with the class priors separately on the two sets. It also employs structural regularization, which is similar to a link prediction loss. However, its edge sampling space is all potential edges, making the regularization sample inefficient. In contrast, RECO-SLIP reduces the edge sampling space for link prediction by identifying the subgraph on which the classifier would underperform and which needs further preservation of the subgroup structure.

2.4 Anomaly Detection and OOD Detection on Graphs

We briefly clarify the differences between our problem and node-level anomaly detection and out-of-distribution (OOD) detection. Node-level anomaly detection [6, 19, 21] aims to identify a set of anomalous nodes or rank nodes in a given graph according to the degree of abnormality. There are two major differences between anomaly detection and our problem. Firstly, the concept of anomaly is at the level of individual nodes whereas the concept of novelty in our problem is at the level of node categories. Secondly, there is no notion of source and target in anomaly detection. This difference prevents node-level anomaly detection algorithms from leveraging domain labels and utilizing PU learning techniques.

Node-level OOD detection [26, 31, 37] is the task of detecting if a node is outside of the training distribution. Most approaches leverage the uncertainty estimates from the probabilistic predictions of a graph neural network to quantify if a node is OOD. There are two key differences between our problem and OOD detection. Firstly, out-of-distribution nodes are not seen during training for OOD detection. In our problem, novel categories are present during training in the target domain but the learning algorithm does not know which target nodes belong to the novel category. Secondly, in-distribution category labels are typically provided in OOD detection whereas they are not provided in our problem setting. The only type of labels we have is the source/target domain labels.

3 Problem Formulation

We consider a set of nodes $\mathcal{V}$ that can be partitioned into a subset of source nodes $\mathcal{V}_S$ and another subset of target nodes $\mathcal{V}_T$, i.e. $\mathcal{V} = \mathcal{V}_S \cup \mathcal{V}_T$, $\mathcal{V}_S \cap \mathcal{V}_T = \varnothing$. Each node $v$ is accompanied by a feature vector $\mathbf{x}_v$. The source nodes are generated from a source distribution $P_S$, which is a mixture of $K$ category distributions $G_1, G_2, \ldots, G_K$. The target nodes consist of nodes from the aforementioned categories and those from a novel category that does not appear in the source. We denote the target distribution by $P_T$. Among the target distribution, we denote the distribution of the non-novel categories by $P_{T,0}$ and the distribution of the novel category by $P_{T,1}$. Let $\alpha$ be the novel ratio in the target, i.e. $P_T = (1-\alpha) P_{T,0} + \alpha P_{T,1}$. We consider the scenario where subpopulation shifts may happen between the source and target among the non-novel categories. More concretely, let $\bm{\gamma}, \hat{\bm{\gamma}} \in \Delta^{K-1}$ be probability vectors that determine the mixture proportions of the non-novel categories in source and target, respectively. Then,

$$P_S = \sum_{i=1}^{K} \gamma_i G_i, \quad P_{T,0} = \sum_{i=1}^{K} \hat{\gamma}_i G_i, \quad \bm{\gamma} \neq \hat{\bm{\gamma}} \tag{1}$$

Apart from the nodes, we also have access to an edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. We assume the graph to be homophilous, meaning that two nodes having an edge between them have a higher probability of belonging to the same category than those that do not. This assumption is common for large-scale graphs such as online social networks, citation graphs, or co-purchasing networks.
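The homophily assumption can be checked empirically on a small labeled graph. The sketch below is only an illustration: it assumes an undirected graph without self-loops, a hypothetical helper name `same_category_rates`, and access to category labels, which are not available in the problem setting considered here. It compares how often connected and unconnected node pairs share a category.

```python
from itertools import combinations


def same_category_rates(categories, edges):
    """Return (same-category rate among connected pairs,
               same-category rate among unconnected pairs).

    `categories` maps node id -> category label; `edges` is a list of
    undirected node pairs. Assumes at least one connected and one
    unconnected pair exist. Enumerates all pairs, so toy graphs only.
    """
    edge_set = {frozenset(e) for e in edges}
    same_conn = n_conn = same_unconn = n_unconn = 0
    for u, v in combinations(sorted(categories), 2):
        same = categories[u] == categories[v]
        if frozenset((u, v)) in edge_set:
            n_conn += 1
            same_conn += same
        else:
            n_unconn += 1
            same_unconn += same
    return same_conn / n_conn, same_unconn / n_unconn


# Toy graph with made-up categories "a" and "b".
toy_categories = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
toy_edges = [(0, 1), (1, 2), (3, 4), (2, 3)]
connected_rate, unconnected_rate = same_category_rates(toy_categories, toy_edges)
print(connected_rate, unconnected_rate)
```

A graph is homophilous in the sense used above when the first rate is larger than the second.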

The goal of a learning algorithm for novel node category detection is to take the graph $\mathcal{G} = (\mathcal{V}_S, \mathcal{V}_T, \mathcal{E})$ and node features $\{\mathbf{x}_v \mid v \in \mathcal{V}\}$ as input, and learn a binary classifier $f: \mathcal{V} \rightarrow [0, 1]$ that minimizes the following expected risk over the target distribution:

$$R_T(f) = (1-\alpha)\, \mathbb{E}_{v \sim P_{T,0}}[f(v)] + \alpha\, \mathbb{E}_{v \sim P_{T,1}}[1 - f(v)] \tag{2}$$

Simply put, it would be ideal for the classifier to output a score close to $0$ if an input target node is from a non-novel category and output a score close to $1$ otherwise. It is worth noting that the learning algorithm has access to the domain labels, i.e. whether a node belongs to $\mathcal{V}_S$ or $\mathcal{V}_T$, but not the category labels. To jointly utilize the graph structure and node features, we mainly consider binary classifiers composed of a graph neural network (GNN) encoder $g$ and a multi-layer perceptron (MLP) head $h$, i.e. $f = h \circ g$. Although the concept of novel/non-novel is at the category level, we will also say a node is novel/non-novel if it is from a novel/non-novel category for brevity in the remaining sections.
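For evaluation purposes, the target risk in Eq. 2 can be estimated from a finite sample. The following sketch is an assumption-laden illustration rather than part of the method: the function name `empirical_target_risk` is hypothetical, $\alpha$ is estimated as the fraction of novel nodes in the sample, and the ground-truth novel/non-novel labels it requires are not available to the learning algorithm during training.

```python
def empirical_target_risk(scores, is_novel):
    """Sample estimate of Eq. 2 on target nodes.

    `scores` are classifier outputs f(v) in [0, 1]; `is_novel` are the
    ground-truth booleans (evaluation only, never seen during training).
    """
    novel = [s for s, y in zip(scores, is_novel) if y]
    non_novel = [s for s, y in zip(scores, is_novel) if not y]
    alpha = len(novel) / len(scores)  # estimated novel ratio

    # E_{v ~ P_{T,0}}[f(v)]: non-novel nodes should score near 0.
    risk_non_novel = sum(non_novel) / len(non_novel) if non_novel else 0.0
    # E_{v ~ P_{T,1}}[1 - f(v)]: novel nodes should score near 1.
    risk_novel = sum(1.0 - s for s in novel) / len(novel) if novel else 0.0

    return (1.0 - alpha) * risk_non_novel + alpha * risk_novel


# Made-up scores for four target nodes, two of which are novel.
risk_good = empirical_target_risk([0.1, 0.2, 0.9, 0.8], [False, False, True, True])
risk_perfect = empirical_target_risk([0.0, 0.0, 1.0, 1.0], [False, False, True, True])
print(risk_good, risk_perfect)
```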

4 Recall-Constrained Optimization with Selective Link Prediction (RECO-SLIP)

4.1 Recall-Constrained Optimization

Recently, Wald et al. [27] proposed a constrained optimization-based method as a principled alternative to PU learning under distribution shift. They show that the expected target risk can be bounded by the false positive rate (FPR)² on the source domain, the negative of the recall on the target domain, and the divergence between the source and target distributions. Motivated by the bound, they suggest minimizing the FPR while keeping the recall above a certain value. Due to the suitability of this recall-constrained method to our problem, we adopt the same principle and describe it more formally below.

² The positive in false positive rate refers to the novel category being detected, not the positive in PU learning. We use FPR in the remainder of the paper to avoid confusion.

Let $\beta(f) = \mathbb{E}_{v \sim P_S}[f(v)]$ be the FPR on the source domain, namely the error rate of identifying source nodes as novel. Let $\alpha(f) = \mathbb{E}_{v \sim P_T}[f(v)]$ be the recall on the target domain, which is the rate of identifying target nodes as novel. We denote the empirical estimate of $\alpha(f)$ and $\beta(f)$ by $\hat{\alpha}(f)$ and $\hat{\beta}(f)$. Note that the FPR defined here is not evaluated on all non-novel nodes but only the non-novel nodes within the source domain. Also, the recall here is not just evaluated on the novel nodes but all target nodes. We do not have ground-truth novel/non-novel labels in the target domain so we could only use domain labels as a proxy.

We use the average output score of $f$ over the source nodes and the target nodes as a differentiable proxy of $\hat{\beta}(f)$ and $\hat{\alpha}(f)$, respectively. Then, the optimization problem of minimizing the empirical FPR and keeping the empirical recall above a value $\tilde{\alpha}$ can be written as:

$$\min_{f} \frac{1}{|\mathcal{V}_S|} \sum_{v \in \mathcal{V}_S} f(v), \quad \text{s.t. } \frac{1}{|\mathcal{V}_T|} \sum_{v \in \mathcal{V}_T} f(v) \geq \tilde{\alpha} \tag{3}$$
Figure 2: An illustration of RECO-SLIP. The upper-right module is the recall-constrained optimization component (Eq. 3) where the classifier adjusts its scores to minimize the FPR on the source while reserving enough target nodes as novel. The bottom module is the selective link prediction component where the link prediction loss (Eq. 7) is imposed on the target subgraph excluding the nodes with the highest scores (Eq. 4, 5, 6). A solid orange arrow pair between two nodes denotes their representation similarity is maximized whereas a bidirectional hollow arrow denotes the similarity is minimized.

In practice, we solve the optimization problem using different values of $\tilde{\alpha}$. Then we select the model that has the highest empirical recall out of the models that achieve an empirical FPR below a user-specified threshold $\tilde{\beta}$.
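The selection rule can be sketched in a few lines. This is an illustrative sketch only: it assumes the models have already been trained, uses average scores as the FPR and recall proxies described above, and the names `mean_score` and `select_model` as well as the score values are hypothetical.

```python
def mean_score(scores):
    """Average classifier score, used as the proxy for empirical FPR or recall."""
    return sum(scores) / len(scores)


def select_model(candidates, beta_tilde):
    """Pick the recall threshold whose model has the highest empirical recall
    among models with empirical FPR below `beta_tilde`.

    `candidates` maps each alpha_tilde to (source_scores, target_scores)
    produced by the model trained with that constraint value.
    Returns None when no model satisfies the FPR threshold.
    """
    best_alpha, best_recall = None, float("-inf")
    for alpha_tilde, (source_scores, target_scores) in candidates.items():
        fpr = mean_score(source_scores)     # empirical FPR on source nodes
        recall = mean_score(target_scores)  # empirical recall on target nodes
        if fpr < beta_tilde and recall > best_recall:
            best_alpha, best_recall = alpha_tilde, recall
    return best_alpha


# Made-up scores for three models trained with different alpha_tilde values.
toy_candidates = {
    0.1: ([0.00, 0.02], [0.1, 0.1]),
    0.3: ([0.04, 0.04], [0.3, 0.4]),
    0.5: ([0.20, 0.30], [0.6, 0.5]),  # highest recall, but FPR too large
}
chosen = select_model(toy_candidates, beta_tilde=0.05)
print(chosen)
```

In this toy example the model with the largest recall is rejected because its source-domain FPR exceeds the threshold.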

The intuition for why solving the constrained learning problem can outperform standard PU learning methods under subpopulation shift is illustrated in the upper-right module of Fig. 2. Standard PU learning directly discriminates source from target nodes. Since there are more target nodes in non-novel category 2, standard PU learning would result in the tilted decision boundary represented by the dashed line. By solving the constrained learning problem, the classifier avoids identifying source nodes as novel while reserving enough target nodes as novel, resulting in a more horizontal decision boundary represented by the solid line.

4.2 Selective Link Prediction

Using the domain labels as a proxy of novel/non-novel labels essentially views all target nodes as equally novel. The novel and non-novel node representations within the target domain will not be sufficiently separated since the domain labels confuse them as the same. However, the edge connection pattern can reveal additional information due to the homophily property. For instance, the target novel nodes should have few edge connections to the target non-novel nodes since they belong to different categories. Therefore, we use link prediction with the graph autoencoder (GAE) [15] objective as an auxiliary task during training to preserve the category subgroup structure in the node representation space.

Link prediction maximizes the similarity between two node representations if the two nodes are connected and minimizes their similarity otherwise. To determine which pairs of unconnected nodes should have minimized representation similarities, we perform negative sampling from non-existent edges. As real-world graphs are sparse, the negative sampling space is huge compared to the existent edges. Therefore, we reduce the sampling space according to our problem to improve sample efficiency. The non-existent edges $\mathcal{E}^{\mathsf{C}}$ can be categorized into three subsets $\mathcal{E}^{\mathsf{C}}_{S,S}$, $\mathcal{E}^{\mathsf{C}}_{S,T}$, and $\mathcal{E}^{\mathsf{C}}_{T,T}$ defined as follows:

$$
\begin{aligned}
\mathcal{E}^{\mathsf{C}}_{S,S} &= \{(v_i, v_j) \mid v_i \in \mathcal{V}_S \land v_j \in \mathcal{V}_S \land (v_i, v_j) \in \mathcal{E}^{\mathsf{C}}\} \\
\mathcal{E}^{\mathsf{C}}_{S,T} &= \{(v_i, v_j) \mid ((v_i \in \mathcal{V}_S \land v_j \in \mathcal{V}_T) \lor (v_i \in \mathcal{V}_T \land v_j \in \mathcal{V}_S)) \land (v_i, v_j) \in \mathcal{E}^{\mathsf{C}}\} \\
\mathcal{E}^{\mathsf{C}}_{T,T} &= \{(v_i, v_j) \mid v_i \in \mathcal{V}_T \land v_j \in \mathcal{V}_T \land (v_i, v_j) \in \mathcal{E}^{\mathsf{C}}\}
\end{aligned}
$$
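The three subsets can be enumerated directly on a toy graph. The sketch below is for illustration only and is not how a practical implementation would sample negatives: it assumes an undirected graph without self-loops, enumerates all node pairs (quadratic cost), and uses the hypothetical function name `partition_non_edges`.

```python
from itertools import combinations


def partition_non_edges(source_nodes, target_nodes, edges):
    """Split the non-existent edges into source-source, source-target,
    and target-target subsets, following the three set definitions above.
    """
    source_nodes, target_nodes = set(source_nodes), set(target_nodes)
    existing = {frozenset(e) for e in edges}
    non_ss, non_st, non_tt = [], [], []
    for u, v in combinations(sorted(source_nodes | target_nodes), 2):
        if frozenset((u, v)) in existing:
            continue  # only non-existent edges are partitioned
        u_src, v_src = u in source_nodes, v in source_nodes
        if u_src and v_src:
            non_ss.append((u, v))
        elif not u_src and not v_src:
            non_tt.append((u, v))
        else:
            non_st.append((u, v))
    return non_ss, non_st, non_tt


# Toy graph: nodes 0-1 are source, nodes 2-4 are target.
toy_ss, toy_st, toy_tt = partition_non_edges(
    source_nodes=[0, 1],
    target_nodes=[2, 3, 4],
    edges=[(0, 1), (1, 2), (2, 3)],
)
print(toy_ss, toy_st, toy_tt)
```

Even in this five-node example, most non-existent edges involve a source node, which hints at how much the sampling space shrinks when only target-target pairs are kept.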
Algorithm 1 RECO-SLIP
Dataset: $\mathcal{V}_S, \mathcal{V}_T, \mathcal{E}, \{\mathbf{x}_v \mid v \in \mathcal{V}_S \cup \mathcal{V}_T\}$, hyper-parameters: $\xi, \tilde{\bm{\alpha}}, \tilde{\beta}, T$
for $\tilde{\alpha} \in \tilde{\bm{\alpha}}$ do
     Initialize binary classifier $f^{(0)}$ and dual variable $\lambda^{(0)}$.
     for $t \leftarrow 1$ to $T$ do
         Construct $\mathcal{E}_{-}$ and $\mathcal{E}_{+}$ based on Eq. 5 and Eq. 6.
         Sample $\mathcal{E}_{-}^{*}$ from $\mathcal{E}_{-}$.
         $f^{(t)}, \lambda^{(t)} \leftarrow$ Primal-dual optimization update with the Lagrangian in Eq. 8.
     end for
     $f_{\tilde{\alpha}} \leftarrow f^{(T)}$
     Calculate empirical recall $\hat{\alpha}(f_{\tilde{\alpha}})$ and empirical FPR $\hat{\beta}(f_{\tilde{\alpha}})$.
end for
Return $\operatorname*{arg\,max}_{f_{\tilde{\alpha}} :\, \tilde{\alpha} \in \tilde{\bm{\alpha}},\ \hat{\beta}(f_{\tilde{\alpha}}) < \tilde{\beta}} \hat{\alpha}(f_{\tilde{\alpha}})$

$\mathcal{E}^{\mathsf{C}}_{S,S}$ are non-existent edges where the nodes on both sides are from the source domain. We do not sample from this subset because all source nodes are non-novel and there is no need to separate them from one another. $\mathcal{E}^{\mathsf{C}}_{S,T}$ are non-existent edges where the node on one side is from the source domain and the other is from the target domain. We do not sample from this subset either because the main task using the domain labels is already separating source and target nodes. The strategy above restricts us to sampling from the target subgraph. We can further reduce the sampling space based on the score produced by the classifier. When solving the constrained optimization in Eq. 3, the classifier reserves at least an $\tilde{\alpha}$ portion of the target nodes as novel. The other target nodes, scored in the bottom $1-\tilde{\alpha}$ portion, are the samples that the classifier is less confident in identifying as novel and that require the auxiliary task to separate node representations of different subgroups. Therefore, we reduce the negative sampling space to the non-existent edges among the target nodes scored in the bottom $1-\tilde{\alpha}$ portion. Let $z_{\tilde{\alpha}}$ denote the score separating the top $\tilde{\alpha}$ and bottom $1-\tilde{\alpha}$ target nodes, and $\mathcal{E}_{-}$ denote the final negative sampling space:

$$z_{\tilde{\alpha}}=\min\Big\{z\;\Big|\;\frac{1}{|\mathcal{V}_{T}|}\sum_{v\in\mathcal{V}_{T}}\mathbbm{1}[f(v)\leq z]>1-\tilde{\alpha}\Big\}\tag{4}$$
$$\mathcal{E}_{-}=\{(v_{i},v_{j})\mid f(v_{i})<z_{\tilde{\alpha}}\land f(v_{j})<z_{\tilde{\alpha}}\land(v_{i},v_{j})\in\mathcal{E}^{\mathsf{C}}_{T,T}\}\tag{5}$$

To counterbalance the separation induced by $\mathcal{E}_{-}$, we take the existing edges among the same nodes involved in $\mathcal{E}_{-}$ as the positive samples:

$$\mathcal{E}_{+}=\{(v_{i},v_{j})\mid f(v_{i})<z_{\tilde{\alpha}}\land f(v_{j})<z_{\tilde{\alpha}}\land v_{i}\in\mathcal{V}_{T}\land v_{j}\in\mathcal{V}_{T}\land(v_{i},v_{j})\in\mathcal{E}\}\tag{6}$$
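The threshold and the two edge sets in Eqs. 4 to 6 can be illustrated with a minimal sketch. It assumes an undirected target subgraph without self-loops, target nodes indexed from 0, and classifier scores given as a plain list; the function names `score_threshold` and `selective_edge_sets` are hypothetical and are not taken from the released code. Enumerating all node pairs is quadratic and only meant for small examples.

```python
import numpy as np

def score_threshold(target_scores, alpha_tilde):
    """Eq. 4: smallest score z such that the fraction of target nodes
    with f(v) <= z exceeds 1 - alpha_tilde."""
    scores = np.sort(np.asarray(target_scores, dtype=float))
    # k / n for the k-th smallest score (k = 1..n); with tied scores the
    # first qualifying position still holds the correct threshold value.
    cdf = np.arange(1, scores.size + 1) / scores.size
    # argmax on a boolean array returns the first True position.
    return float(scores[np.argmax(cdf > 1.0 - alpha_tilde)])

def selective_edge_sets(target_scores, target_edges, alpha_tilde):
    """Return (E_minus, E_plus) as in Eq. 5 and Eq. 6.

    Assumption: edges are undirected pairs of target-node indices.
    """
    z = score_threshold(target_scores, alpha_tilde)
    # Target nodes scored strictly below the threshold.
    low = [v for v, s in enumerate(target_scores) if s < z]
    existing = {frozenset(e) for e in target_edges}
    e_minus, e_plus = [], []
    for a, vi in enumerate(low):
        for vj in low[a + 1:]:
            if frozenset((vi, vj)) in existing:
                e_plus.append((vi, vj))   # existing edge among low-scored nodes
            else:
                e_minus.append((vi, vj))  # non-existent edge among low-scored nodes
    return e_minus, e_plus
```

With four target nodes scored `[0.9, 0.1, 0.2, 0.3]` and $\tilde{\alpha}=0.25$, node 0 is reserved as potentially novel, so only pairs among nodes 1, 2, and 3 enter either set.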

At each iteration, we uniformly sample as many negative edges from $\mathcal{E}_{-}$ as there are positive edges in $\mathcal{E}_{+}$. We denote the sampled negative edges by $\mathcal{E}_{-}^{*}$. Let $\mathbf{g}_{v}=g(v)$ denote the representation of node $v$ produced by the GNN encoder and $\sigma$ denote the sigmoid function. The final link prediction loss is shown below:

$$\mathcal{L}_{lp}(g)=-\frac{1}{|\mathcal{E}_{+}|}\sum_{(v_{i},v_{j})\in\mathcal{E}_{+}}\log\sigma(\mathbf{g}_{v_{i}}\cdot\mathbf{g}_{v_{j}})-\frac{1}{|\mathcal{E}_{-}^{*}|}\sum_{(v_{i},v_{j})\in\mathcal{E}_{-}^{*}}\log(1-\sigma(\mathbf{g}_{v_{i}}\cdot\mathbf{g}_{v_{j}}))\tag{7}$$
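As a minimal NumPy sketch of Eq. 7, the loss can be evaluated from a matrix of node representations and two edge lists. The helper names are hypothetical, and sampling without replacement is an assumption, since the text only states that the negatives are sampled uniformly.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def sample_negatives(neg_space, num_samples, rng):
    """Uniformly draw negative edges from E_minus (assumed without replacement)."""
    k = min(num_samples, len(neg_space))
    idx = rng.choice(len(neg_space), size=k, replace=False)
    return [neg_space[i] for i in idx]

def link_prediction_loss(g, pos_edges, neg_edges):
    """Eq. 7 with dot-product logits; g has one row per node."""
    g = np.asarray(g, dtype=float)
    pos = np.asarray(pos_edges)
    neg = np.asarray(neg_edges)
    pos_logits = np.sum(g[pos[:, 0]] * g[pos[:, 1]], axis=1)
    neg_logits = np.sum(g[neg[:, 0]] * g[neg[:, 1]], axis=1)
    # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
    return float(-log_sigmoid(pos_logits).mean() - log_sigmoid(-neg_logits).mean())
```

When all representations are zero, every logit is 0 and the loss equals $2\log 2$, which is a convenient sanity check for an implementation.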

We multiply the link prediction loss by a hyper-parameter $\xi$ and add it to the constrained learning objective. Then the Lagrangian can be computed as follows:

$$\mathcal{L}(g,h,\lambda)=\hat{\beta}(h\circ g)+\xi\mathcal{L}_{lp}(g)+\lambda(\tilde{\alpha}-\hat{\alpha}(h\circ g))\tag{8}$$

where $\lambda$ is a non-negative dual variable. We apply $T$ primal-dual optimization [9] updates to learn the classifier. The overall procedure is in Algorithm 1.
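Two steps of this procedure are simple enough to sketch in plain Python: the projected gradient-ascent update of the dual variable implied by Eq. 8, and the final selection rule of Algorithm 1. This is an illustration under stated assumptions, not the released implementation, which relies on the toolkit in [9]; the function names, the learning rate, and the candidate numbers in the usage note are all hypothetical.

```python
def dual_ascent_step(lmbda, alpha_tilde, recall_hat, lr):
    """One dual update for Eq. 8.

    The derivative of the Lagrangian with respect to lambda is
    (alpha_tilde - recall_hat); the max keeps lambda non-negative.
    """
    return max(0.0, lmbda + lr * (alpha_tilde - recall_hat))

def select_classifier(candidates, beta_tilde):
    """Final step of Algorithm 1: among classifiers whose empirical FPR is
    below beta_tilde, return the one with the highest empirical recall.

    Each candidate is a dict with keys "alpha_tilde", "recall", and "fpr".
    Returns None when no candidate satisfies the FPR bound.
    """
    feasible = [c for c in candidates if c["fpr"] < beta_tilde]
    return max(feasible, key=lambda c: c["recall"]) if feasible else None
```

For example, with hypothetical candidates whose (recall, FPR) pairs are (0.04, 0.002), (0.09, 0.008), and (0.16, 0.03), a bound of $\tilde{\beta}=0.01$ selects the second one, and a bound that no candidate meets yields no selection.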

5 Experiments

5.1 Experimental Setup

Table 1: Source ratio per category for Cora-S, CiteSeer-S, Computers-S, and Photo-S. Each entry represents the ratio of nodes belonging to a category that is assigned to the source domain. The source ratio of the last available category in each dataset is 0 because the category is novel.

| Dataset / Category label | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cora-S | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0 | - | - | - |
| CiteSeer-S | 0.9 | 0.1 | 0.9 | 0.1 | 0.5 | 0 | - | - | - | - |
| Computers-S | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0.5 | 0 |
| Photo-S | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.5 | 0 | - | - |
Table 2: Dataset statistics.

| | Cora-S | CiteSeer-S | Computers-S | Photo-S | arxiv |
|---|---|---|---|---|---|
| # categories | 7 | 6 | 10 | 8 | 6 |
| # source nodes | 1317 | 1265 | 5252 | 3068 | 859 |
| # target nodes | 1391 | 2062 | 8500 | 4582 | 3442 |
| # novel nodes | 180 | 508 | 291 | 331 | 176 |
| Novel ratio (# novel / # target) | 0.129 | 0.246 | 0.034 | 0.072 | 0.051 |

5.1.1 Data

We evaluate RECO-SLIP and baseline methods on five widely used public benchmark datasets: Cora [24], CiteSeer [24], Computers [25], Photo [25], and arxiv [13]. Cora, CiteSeer, and arxiv are academic citation graphs while Computers and Photo are product co-purchasing networks. We use the original labels provided by the datasets as the category labels. For Cora, CiteSeer, Computers, and Photo, we view the last class as the novel category and simulate subpopulation shifts by splitting nodes in each category into source and target based on Table 1. We name the preprocessed versions of the four datasets with “-S” at the end, representing “shift”. arxiv is a subset of the ogbn-arxiv graph [13] containing 6 robotics-related categories from 1990 to 2012. arxiv enables a natural source/target split and a novel category emergence since each node is associated with a timestamp and nodes belonging to the category “cs.SY: systems and control” did not exist until 2007. Therefore, we define nodes with timestamps before 2007 as the source and the others as the target. In terms of train/test data split, we consider the commonly used transductive node classification setup. The test nodes of arxiv are the nodes with timestamps in 2012 while the test nodes of the other four datasets are randomly selected from the target domain. We further split a part of the training nodes into the validation set for model selection. We show the dataset-specific statistics in Table 2.
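The simulated split for the four “-S” datasets can be sketched as follows. The text does not say how individual nodes are chosen within a category, so this sketch assumes a uniformly random choice with the per-category count rounded to the nearest integer; the function name `split_source_target` is hypothetical.

```python
import numpy as np

def split_source_target(labels, source_ratio, rng):
    """Assign each node to source (True) or target (False).

    labels: category label per node.
    source_ratio: dict mapping category label -> fraction of that category
        placed in the source domain (a row of Table 1); 0 for the novel one.
    Assumption: nodes within a category are picked uniformly at random.
    """
    labels = np.asarray(labels)
    is_source = np.zeros(labels.shape[0], dtype=bool)
    for category, ratio in source_ratio.items():
        members = np.flatnonzero(labels == category)
        n_source = int(round(ratio * members.size))
        if n_source > 0:
            chosen = rng.choice(members, size=n_source, replace=False)
            is_source[chosen] = True
    return is_source
```

Because the novel category has source ratio 0, all of its nodes end up in the target domain, which is what makes the target a mixture of non-novel and novel nodes.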

5.1.2 Baselines

We compare RECO-SLIP with PU learning approaches, both those that rely on the SCAR assumption and those that do not, as well as graph PU learning approaches.

  • Domain discriminator: The domain discriminator directly discriminates source from target nodes. It is the basis of most PU-learning approaches.

  • uPU [7]: uPU is an unbiased risk estimator for PU-learning. It treats unlabeled samples as negative and labeled samples as weighted combinations of positives and negatives.

  • nnPU [17]: nnPU is a non-negative risk estimator for PU-learning. It addresses the overfitting problem of uPU since the empirical risk of uPU could be negative and unbounded below.

  • SAR-EM [2]: SAR-EM is a propensity-weighting approach for PU-learning when SCAR does not hold. It jointly learns the PU classifier and the propensity score of each data point via the EM algorithm. We select this method as the representative of propensity-weighting approaches due to its relatively flexible assumptions and publicly available code. Later work using propensity-weighting [12, 28] imposes strong assumptions such as Local Certainty and Probabilistic Gap (see footnote 3) and does not provide code. Therefore, we omit these methods in our experiments. Footnote 3: Local Certainty assumes the relationship between the observed features and the true class is a deterministic function, meaning that the class distributions do not overlap. Probabilistic Gap allows overlapping class distributions but assumes that the propensity scores follow the ordering of the posterior probabilities, i.e. $p(y=1\mid\mathbf{x})$.

  • LP-PUL [4]: LP-PUL is a graph-based PU learning approach. It first identifies the furthermost target nodes from the source nodes as the most novel nodes based on the shortest path distances, then performs label propagation on top of the graph. Note that LP-PUL is the only experimented method that does not learn a graph neural network.

  • PU-GNN [32]: PU-GNN is a state-of-the-art graph PU learning method. It adapts the Dist-PU risk estimator [38] towards graph data and applies structural regularization to learn pairwise relations between nodes.

Aside from the methods mentioned above, we also train a classifier using the Oracle novel/non-novel labels as an upper bound reference.

5.1.3 Hyper-Parameters

All methods except LP-PUL use a two-layer graph convolutional network (GCN) [16] coupled with a two-layer MLP as the classifier architecture. We find that the Adam optimizer [14] with a learning rate of 0.001 works well with the domain discriminator components in all methods except LP-PUL. This optimizer setting also works well for the primal and dual optimizers of RECO-SLIP. For hyper-parameters specific to RECO-SLIP, we set $\xi=0.001$, $T=1000$, and $\tilde{\bm{\alpha}}=[0.05,0.1,0.15,0.2,0.25]$ for all datasets. By default, we set $\tilde{\beta}$ to 0.01. However, since the empirical FPR estimate under all $\tilde{\alpha}$ could not go below 0.01 on Photo-S and CiteSeer-S due to dataset characteristics, we set $\tilde{\beta}$ to 0.05 for these two datasets.

5.1.4 Evaluation Metrics

Following the convention of novelty and anomaly detection, we use AU-ROC to evaluate the ability of a classifier to rank novel nodes above non-novel nodes. We test all methods with 10 different random seeds and report the average scores and standard errors.
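For concreteness, AU-ROC can be computed as the fraction of (novel, non-novel) node pairs that the classifier ranks correctly, with ties counted as one half. The sketch below uses this pairwise definition, which is quadratic in the number of nodes and intended only as an illustration. The standard error uses the sample standard deviation (`ddof=1`), which is an assumption because the text does not state the convention; both function names are hypothetical.

```python
import numpy as np

def au_roc(scores, is_novel):
    """Probability that a random novel node is scored above a random
    non-novel node, counting ties as 0.5."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(is_novel, dtype=bool)
    diff = scores[y][:, None] - scores[~y][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def mean_and_standard_error(values):
    """Mean and standard error of per-seed scores (assumes ddof=1)."""
    v = np.asarray(values, dtype=float)
    return float(v.mean()), float(v.std(ddof=1) / np.sqrt(v.size))
```

A value of 0.5 corresponds to a no-skill ranking, and 1.0 means every novel node is scored above every non-novel node.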

We list additional details of the experimental setup in Appendices B and C.

5.2 Results and Discussions

Table 3: AU-ROC on five datasets. The best performance on each dataset is highlighted in bold and the second best performance is highlighted by underline.

| Method | Cora-S | CiteSeer-S | Computers-S | Photo-S | arxiv |
|---|---|---|---|---|---|
| Oracle | 0.953 ± 0.004 | 0.864 ± 0.011 | 0.986 ± 0.005 | 0.985 ± 0.001 | 0.847 ± 0.011 |
| Domain discriminator | 0.705 ± 0.016 | 0.717 ± 0.011 | 0.877 ± 0.025 | 0.671 ± 0.035 | 0.628 ± 0.020 |
| uPU | 0.705 ± 0.016 | 0.701 ± 0.023 | 0.834 ± 0.032 | 0.677 ± 0.038 | 0.628 ± 0.018 |
| nnPU | 0.705 ± 0.015 | 0.701 ± 0.023 | 0.830 ± 0.031 | 0.676 ± 0.037 | 0.629 ± 0.020 |
| SAR-EM | 0.740 ± 0.039 | 0.695 ± 0.015 | 0.721 ± 0.073 | 0.607 ± 0.036 | 0.615 ± 0.032 |
| LP-PUL | 0.666 ± 0.000 | 0.638 ± 0.000 | 0.830 ± 0.000 | 0.256 ± 0.000 | 0.658 ± 0.000 |
| PU-GNN | 0.705 ± 0.015 | 0.701 ± 0.024 | 0.833 ± 0.034 | 0.678 ± 0.040 | 0.627 ± 0.018 |
| RECO-SLIP | 0.770 ± 0.025 | 0.745 ± 0.019 | 0.948 ± 0.008 | 0.810 ± 0.011 | 0.710 ± 0.022 |

The overall results are shown in Table 3. One can estimate how good the domain labels are as a proxy for the novel/non-novel labels by examining the performance gap between Oracle and the domain discriminator. For example, the performance gap on Photo-S is above 0.3 while the gaps on the other datasets range from 0.1 to 0.25, indicating that Photo-S is harder for PU learning. Out of all methods, RECO-SLIP is the best-performing one on all datasets, demonstrating its effectiveness in novel node category detection under subpopulation shifts.
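The gaps between Oracle and the domain discriminator follow directly from the mean AU-ROC values in Table 3. A short check, using only those reported means:

```python
# Mean AU-ROC values copied from the Oracle and domain discriminator rows of Table 3.
oracle = {"Cora-S": 0.953, "CiteSeer-S": 0.864, "Computers-S": 0.986,
          "Photo-S": 0.985, "arxiv": 0.847}
domain_discriminator = {"Cora-S": 0.705, "CiteSeer-S": 0.717, "Computers-S": 0.877,
                        "Photo-S": 0.671, "arxiv": 0.628}

# Gap per dataset, rounded to the three decimals reported in the table.
gaps = {name: round(oracle[name] - domain_discriminator[name], 3) for name in oracle}

for name, gap in gaps.items():
    print(f"{name}: {gap:.3f}")
```

The resulting gaps are 0.248 (Cora-S), 0.147 (CiteSeer-S), 0.109 (Computers-S), 0.314 (Photo-S), and 0.219 (arxiv).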

Standard PU learning approaches such as the domain discriminator, uPU, and nnPU perform relatively stably across datasets. Perhaps because AU-ROC is a ranking metric, the advantage of uPU and nnPU in avoiding overly novel predictions does not show. In addition, uPU and nnPU require mixture proportion estimation in the warm-up phase, which could divert the PU classifier learning process and lead to a slight performance decrease relative to the domain discriminator. RECO-SLIP outperforms these three methods since they neither address the subpopulation shift problem nor utilize the subgroup information provided by the graph structure.

SAR-EM addresses subpopulation shifts by jointly learning the propensity scores and the PU classifier. However, it is extremely unstable due to the application of the EM algorithm to neural networks. Its instability is reflected in the results: SAR-EM achieves the second highest AU-ROC on Cora-S but the worst AU-ROC on Computers-S and arxiv. LP-PUL utilizes label propagation to encourage each subgroup to have consistent predictions. Nevertheless, it uses the shortest path distances to identify the most novel nodes and does not leverage the node feature information. This approach can be brittle when the node features are informative but the edge connections are noisy. For instance, LP-PUL performs worse than a no-skill classifier on Photo-S, potentially because it identifies incorrect nodes as novel in the initialization phase. PU-GNN leverages an adapted Dist-PU risk estimator coupled with structural regularization, which is similar to the idea of link prediction. However, it does not improve much over standard PU learning approaches because it is also based on SCAR and the sampling space of its structural regularization is all potential edges, leading to low sample efficiency. RECO-SLIP simultaneously addresses subpopulation shifts and utilizes the subgroup information provided by the graph structure in a stable and sample-efficient manner, resulting in higher performance.

Table 4: Ablation study on selective link prediction evaluated by AU-ROC. The best performance is highlighted in bold and the second best is highlighted by underline.

| Method | Cora-S | CiteSeer-S | Computers-S | Photo-S | arxiv |
|---|---|---|---|---|---|
| w/o link prediction | 0.755 ± 0.023 | 0.714 ± 0.035 | 0.946 ± 0.006 | 0.795 ± 0.013 | 0.703 ± 0.019 |
| w/ full link prediction | 0.750 ± 0.031 | 0.719 ± 0.030 | 0.943 ± 0.013 | 0.793 ± 0.012 | 0.693 ± 0.024 |
| w/ target link prediction | 0.767 ± 0.027 | 0.723 ± 0.032 | 0.943 ± 0.011 | 0.812 ± 0.010 | 0.699 ± 0.020 |
| RECO-SLIP | 0.770 ± 0.025 | 0.745 ± 0.019 | 0.948 ± 0.008 | 0.810 ± 0.011 | 0.710 ± 0.022 |

5.3 Auxiliary Experiments

5.3.1 Ablation Study

We conduct an ablation study on the selective link prediction component of RECO-SLIP. To understand the effectiveness of link prediction, we consider dropping the link prediction loss (w/o link prediction). In addition, we consider vanilla full link prediction without reducing the edge sampling space (w/ full link prediction) and link prediction on the target subgraph without the classifier score filtering rule shown in Eqs. 4 and 5 (w/ target link prediction). We show the results in Table 4. RECO-SLIP performs the best on all datasets except Photo-S, where it slightly under-performs w/ target link prediction. This is potentially because Photo-S is the hardest dataset for PU learning, so relying on the classifier scores for filtering does not improve further over target subgraph sampling. It is worth noting that w/ target link prediction also performs the second best on Cora-S and CiteSeer-S. This demonstrates the effectiveness of not sampling source-source and source-target pairs.

Figure 3: Overall results of the shift intensity study. The x-axis represents shift intensity (NS: no shift, MS: minor shift, S: shift) and the y-axis represents AU-ROC performance.
Table 5: Average performance rank of representative methods under different shift intensities (highest rank in bold, second-highest underlined).

| Method | NS | MS | S | Overall |
|---|---|---|---|---|
| Domain discriminator | 2.75 | 2.75 | 2.50 | 2.66 |
| SAR-EM | 3.00 | 2.75 | 3.50 | 3.08 |
| PU-GNN | 1.75 | 3.25 | 3.00 | 2.66 |
| RECO-SLIP | 2.50 | 1.25 | 1.00 | 1.58 |

5.3.2 Shift Intensity Study

We further study how different classes of methods react to the shift intensity on the four datasets where we can control the intensities. We select representative methods from standard PU learning (domain discriminator), propensity-weighting (SAR-EM), and graph PU learning (PU-GNN) to compare with RECO-SLIP. We consider datasets with no shift (NS), where the source ratio of each category except the novel one is 0.5. We also consider datasets with minor shift (MS), which is a midpoint interpolation between shift (S) and no shift (NS). The overall results are shown in Fig. 3 and the performance rank of each method averaged over datasets under each shift intensity is presented in Table 5. The advantage of RECO-SLIP diminishes as the shift intensity reduces. However, it still ranks the highest under minor shift (MS) and the second highest under no shift (NS), where it is surpassed by the state-of-the-art graph PU learning method, PU-GNN. Overall, we can observe from Fig. 3 and Table 5 that RECO-SLIP is the most robust to subpopulation shifts and its overall performance rank is the highest.
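One way to read the midpoint interpolation is as a linear blend of each non-novel source ratio with 0.5, while the novel category keeps a source ratio of 0. The sketch below follows this reading; it assumes the novel category is the last entry, as in Table 1, and the function name and `weight` parameter are hypothetical.

```python
def interpolate_source_ratios(shift_ratios, weight=0.5, no_shift_ratio=0.5):
    """Blend the shift (S) source ratios toward the no-shift (NS) ratio.

    weight=0 returns the S setting, weight=1 the NS setting, and
    weight=0.5 the midpoint used here for minor shift (MS).
    Assumption: the last entry is the novel category and stays unchanged.
    """
    *known, novel = shift_ratios
    blended = [(1 - weight) * r + weight * no_shift_ratio for r in known]
    return blended + [novel]

# Cora-S row of Table 1 (shift setting).
cora_shift = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0]
cora_minor = interpolate_source_ratios(cora_shift)
```

Under this reading, the Cora-S source ratios of 0.1 and 0.9 become approximately 0.3 and 0.7 in the MS setting, up to floating-point rounding.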

6 Conclusion

In this study, we present RECO-SLIP, a new method for identifying nodes belonging to novel categories in attributed graphs. RECO-SLIP builds upon a recall-constrained learning framework to address subpopulation shifts and leverages a sample-efficient link prediction mechanism to preserve node subgroup structure. Our experimental results demonstrate the superiority of RECO-SLIP over standard PU learning, propensity-weighting, and graph PU learning methods. Furthermore, we conduct an ablation study and a shift intensity study, confirming the importance of selective link prediction and the robustness of RECO-SLIP across multiple shift intensities. In terms of future work, novel node category detection under shifts in intra/inter-category edge connection probabilities is an exciting direction. This extension would capture a realistic scenario where the node interaction pattern changes from source to target, making novelty detection even more robust when deployed in the wild.

References

  • [1] Bekker, J., Davis, J.: Learning from positive and unlabeled data: A survey. Machine Learning (2020)
  • [2] Bekker, J., Robberechts, P., Davis, J.: Beyond the selected completely at random assumption for learning from positive and unlabeled data. In: Joint European conference on machine learning and knowledge discovery in databases (2019)
  • [3] Blanchard, G., Lee, G., Scott, C.: Semi-supervised novelty detection. The Journal of Machine Learning Research (2010)
  • [4] Carnevali, J.C., Rossi, R.G., Milios, E., de Andrade Lopes, A.: A graph-based approach for positive and unlabeled learning. Information Sciences (2021)
  • [5] Chung, H.H., Ghosh, J.: Incremental unsupervised domain adaptation on evolving graphs. In: Proceedings of The 2nd Conference on Lifelong Learning Agents (2023)
  • [6] Ding, K., Li, J., Bhanushali, R., Liu, H.: Deep anomaly detection on attributed networks. In: Proceedings of the 2019 SIAM International Conference on Data Mining (2019)
  • [7] Du Plessis, M.C., Niu, G., Sugiyama, M.: Analysis of learning from positive and unlabeled data. Advances in neural information processing systems (2014)
  • [8] Elkan, C., Noto, K.: Learning classifiers from only positive and unlabeled data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008)
  • [9] Gallego-Posada, J., Ramirez, J.: Cooper: a toolkit for Lagrangian-based constrained optimization. https://github.com/cooper-org/cooper (2022)
  • [10] Garg, S., Balakrishnan, S., Lipton, Z.: Domain adaptation under open set label shift. Advances in Neural Information Processing Systems (2022)
  • [11] Garg, S., Wu, Y., Smola, A.J., Balakrishnan, S., Lipton, Z.: Mixture proportion estimation and pu learning: a modern approach. Advances in Neural Information Processing Systems (2021)
  • [12] Gerych, W., Hartvigsen, T., Buquicchio, L., Agu, E., Rundensteiner, E.: Recovering the propensity score from biased positive unlabeled data. In: Proceedings of the AAAI Conference on Artificial Intelligence (2022)
  • [13] Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., Leskovec, J.: Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems (2020)
  • [14] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [15] Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016)
  • [16] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (2017)
  • [17] Kiryo, R., Niu, G., Du Plessis, M.C., Sugiyama, M.: Positive-unlabeled learning with non-negative risk estimator. Advances in neural information processing systems (2017)
  • [18] Koh, P.W., Sagawa, S., Marklund, H., Xie, S.M., Zhang, M., Balsubramani, A., Hu, W., Yasunaga, M., Phillips, R.L., Gao, I., et al.: Wilds: A benchmark of in-the-wild distribution shifts. In: International conference on machine learning (2021)
  • [19] Li, Y., Huang, X., Li, J., Du, M., Zou, N.: Specae: Spectral autoencoder for anomaly detection in attributed networks. In: Proceedings of the 28th ACM international conference on information and knowledge management (2019)
  • [20] Liu, E.Z., Haghgoo, B., Chen, A.S., Raghunathan, A., Koh, P.W., Sagawa, S., Liang, P., Finn, C.: Just train twice: Improving group robustness without training group information. In: International Conference on Machine Learning (2021)
  • [21] Ma, X., Wu, J., Xue, S., Yang, J., Zhou, C., Sheng, Q.Z., Xiong, H., Akoglu, L.: A comprehensive survey on graph anomaly detection with deep learning. IEEE Transactions on Knowledge and Data Engineering (2021)
  • [22] Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks. In: International Conference on Learning Representations (2020)
  • [23] Santurkar, S., Tsipras, D., Madry, A.: BREEDS: Benchmarks for subpopulation shift. In: International Conference on Learning Representations (2021)
  • [24] Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI magazine (2008)
  • [25] Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018)
  • [26] Stadler, M., Charpentier, B., Geisler, S., Zügner, D., Günnemann, S.: Graph posterior network: Bayesian predictive uncertainty for node classification. Advances in Neural Information Processing Systems (2021)
  • [27] Wald, Y., Saria, S.: Birds of an odd feather: guaranteed out-of-distribution (ood) novel category detection. In: Uncertainty in Artificial Intelligence (2023)
  • [28] Wang, X., Chen, H., Guo, T., Wang, Y.: Pue: Biased positive-unlabeled learning enhancement by causal inference. Advances in Neural Information Processing Systems (2024)
  • [29] Wu, M., Pan, S., Du, L., Tsang, I., Zhu, X., Du, B.: Long-short distance aggregation networks for positive unlabeled graph learning. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019)
  • [30] Wu, M., Pan, S., Zhou, C., Chang, X., Zhu, X.: Unsupervised domain adaptive graph convolutional networks. In: Proceedings of The Web Conference 2020 (2020)
  • [31] Wu, Q., Chen, Y., Yang, C., Yan, J.: Energy-based out-of-distribution detection for graph neural networks. In: The Eleventh International Conference on Learning Representations (2023)
  • [32] Yang, H., Zhang, Y., Yao, Q., Kwok, J.: Positive-unlabeled node classification with structure-aware graph learning. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023)
  • [33] Yang, Y., Zhang, H., Katabi, D., Ghassemi, M.: Change is hard: A closer look at subpopulation shift. In: International Conference on Machine Learning (2023)
  • [34] Yao, H., Wang, Y., Li, S., Zhang, L., Liang, W., Zou, J., Finn, C.: Improving out-of-distribution robustness via selective augmentation. In: International Conference on Machine Learning (2022)
  • [35] Yao, Y., Liu, T., Han, B., Gong, M., Niu, G., Sugiyama, M., Tao, D.: Rethinking class-prior estimation for positive-unlabeled learning. In: International Conference on Learning Representations (2022)
  • [36] Yoo, J., Kim, J., Yoon, H., Kim, G., Jang, C., Kang, U.: Accurate graph-based pu learning without class prior. In: 2021 IEEE International Conference on Data Mining (ICDM) (2021)
  • [37] Zhao, X., Chen, F., Hu, S., Cho, J.H.: Uncertainty aware semi-supervised learning on graph data. Advances in Neural Information Processing Systems (2020)
  • [38] Zhao, Y., Xu, Q., Jiang, Y., Wen, P., Huang, Q.: Dist-pu: Positive-unlabeled learning from a label distribution perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2022)

Appendix 0.A SCAR Does Not Hold Under Subpopulation Shift

In the context of novel category detection, the selected completely at random (SCAR) assumption [8] assumes each sample from a non-novel category has an equal probability of being in the source domain regardless of its data attributes. We show that SCAR does not hold under subpopulation shift.

As a reminder, $P_S$ and $P_{T,0}$ denote the source distribution and the non-novel distribution in the target. They are mixtures of $K$ category distributions $G_1, G_2, \ldots, G_K$ with different mixture proportions defined by two probability vectors $\bm{\gamma}, \hat{\bm{\gamma}} \in \Delta^{K-1}$.

$$P_S = \sum_{i=1}^{K} \gamma_i G_i, \quad P_{T,0} = \sum_{i=1}^{K} \hat{\gamma}_i G_i, \quad \bm{\gamma} \neq \hat{\bm{\gamma}} \tag{9}$$

Let $S$ be a binary random variable denoting whether a sample belongs to the source domain or not, i.e., $S=1$ if the sample is in the source domain and $S=0$ otherwise. Let $N$ be another binary random variable indicating if a sample is novel or not, i.e., $N=1$ if the sample is from the novel category and $N=0$ otherwise. We use $\mathbf{X}$ and $\mathbf{x}$ to denote a random feature vector and its realization. Let $p_S$ and $p_{T,0}$ be the probability density functions (PDFs) associated with $P_S$ and $P_{T,0}$, respectively. SCAR essentially assumes $P(S=1 \mid N=0, \mathbf{X}=\mathbf{x}) = P(S=1 \mid N=0)$. However, we can show:

\begin{align*}
& P(S=1 \mid N=0, \mathbf{X}=\mathbf{x}) \\
&= \frac{P(\mathbf{X}=\mathbf{x} \mid S=1, N=0)\, P(S=1, N=0)}{P(\mathbf{X}=\mathbf{x} \mid N=0)\, P(N=0)} \\
&= \frac{P(\mathbf{X}=\mathbf{x} \mid S=1)}{P(\mathbf{X}=\mathbf{x} \mid N=0)} \, \frac{P(S=1, N=0)}{P(N=0)} \\
&= \frac{P(\mathbf{X}=\mathbf{x} \mid S=1)}{P(S=0 \mid N=0)\, P(\mathbf{X}=\mathbf{x} \mid N=0, S=0) + P(S=1 \mid N=0)\, P(\mathbf{X}=\mathbf{x} \mid S=1)} \times \frac{P(S=1, N=0)}{P(N=0)} \\
&= \frac{p_S(\mathbf{x})}{P(S=0 \mid N=0)\, p_{T,0}(\mathbf{x}) + P(S=1 \mid N=0)\, p_S(\mathbf{x})} \, \frac{P(S=1, N=0)}{P(N=0)} \\
&\neq \frac{P(S=1, N=0)}{P(N=0)} \\
&= P(S=1 \mid N=0)
\end{align*}

We get the first and final equalities through Bayes' rule. The second equality holds because $N=0$ does not provide additional information when we already condition on $S=1$ (every source sample is non-novel). The third equality holds due to the law of total probability and the same reason that leads to the second equality. We get the fourth equality by the definition of the PDFs. Since $p_S(\mathbf{x}) \neq p_{T,0}(\mathbf{x})$, the inequality above holds unless $P(S=0 \mid N=0)=0$, which is the extreme case in which all non-novel samples are in the source domain; we do not consider this case in our problem setting. Therefore, we can conclude that SCAR does not hold under subpopulation shift.
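The derivation above can be checked numerically on a toy example. The sketch below is illustrative only: the two discrete category distributions, the mixture proportions, and the value of $P(S=1 \mid N=0)$ are all assumed numbers, and the helper names `mixture` and `source_posterior` are hypothetical. It evaluates the last closed-form expression of the derivation and shows that the result depends on $\mathbf{x}$ when $\bm{\gamma} \neq \hat{\bm{\gamma}}$, and is constant when the proportions coincide.

```python
# Toy check of the SCAR derivation; every number below is an assumption.
# Two non-novel categories over a discrete feature x in {0, 1, 2}.
G = [
    [0.7, 0.2, 0.1],  # G_1(x)
    [0.1, 0.2, 0.7],  # G_2(x)
]
gamma = [0.9, 0.1]      # source mixture proportions (assumed)
gamma_hat = [0.1, 0.9]  # non-novel target mixture proportions (assumed)
p_s1_given_n0 = 0.5     # P(S=1 | N=0) (assumed)


def mixture(weights, components):
    """Return the mixture pmf sum_i weights[i] * components[i][x]."""
    n_values = len(components[0])
    return [
        sum(w * comp[x] for w, comp in zip(weights, components))
        for x in range(n_values)
    ]


def source_posterior(p_s, p_t0, prior):
    """P(S=1 | N=0, X=x) using the closed form from the derivation."""
    return [
        prior * ps / ((1.0 - prior) * pt + prior * ps)
        for ps, pt in zip(p_s, p_t0)
    ]


p_s = mixture(gamma, G)
p_t0 = mixture(gamma_hat, G)
posterior = source_posterior(p_s, p_t0, p_s1_given_n0)
for x, value in enumerate(posterior):
    print(f"x={x}: P(S=1 | N=0, x) = {value:.3f}")

# Without subpopulation shift (gamma_hat == gamma) the posterior is constant.
no_shift = source_posterior(p_s, mixture(gamma, G), p_s1_given_n0)
```

In this toy setting the posterior ranges from 0.2 to 0.8 across the three feature values, whereas SCAR would require it to equal 0.5 everywhere.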

Table 6: Source ratio per category for three versions of Cora, CiteSeer, Computers, and Photo. Each entry represents the ratio of nodes belonging to a category that is assigned to the source domain. The source ratio of the last available category in each dataset is 0 because the category is novel.

| Dataset \ Category label | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cora-S | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0 | - | - | - |
| CiteSeer-S | 0.9 | 0.1 | 0.9 | 0.1 | 0.5 | 0 | - | - | - | - |
| Computers-S | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0.5 | 0 |
| Photo-S | 0.9 | 0.1 | 0.9 | 0.1 | 0.9 | 0.1 | 0.5 | 0 | - | - |
| Cora-MS | 0.3 | 0.7 | 0.3 | 0.7 | 0.3 | 0.7 | 0 | - | - | - |
| CiteSeer-MS | 0.7 | 0.3 | 0.7 | 0.3 | 0.5 | 0 | - | - | - | - |
| Computers-MS | 0.3 | 0.7 | 0.3 | 0.7 | 0.3 | 0.7 | 0.3 | 0.7 | 0.5 | 0 |
| Photo-MS | 0.7 | 0.3 | 0.7 | 0.3 | 0.7 | 0.3 | 0.5 | 0 | - | - |
| Cora-NS | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | - | - | - |
| CiteSeer-NS | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | - | - | - | - |
| Computers-NS | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 |
| Photo-NS | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | - | - |

Appendix 0.B Additional Dataset Details

Cora [24] and CiteSeer [24] are academic citation graphs where nodes represent papers and edges represent citations. Computers [25] and Photo [25] are product co-purchasing networks where nodes denote products and an edge between two products indicates that they were frequently bought together. For our main experiments and the study on shift intensity, we simulate 3 versions of each of these 4 datasets, denoted by the suffixes “-S”, “-MS”, and “-NS”, to represent subpopulation shift intensities ranging from significant to nonexistent. Table 6 shows how we split the nodes of each category into source and target for each version, and Figs. 4, 5, 6, and 7 plot the resulting source and target distributions to visualize the subpopulation shifts. For the train/validation/test split, we randomly select 80% of the source nodes for training and 20% for validation, and 60% of the target nodes for training, 20% for validation, and 20% for testing.
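The splitting procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released experimental code: the function name `split_nodes`, the toy label list, and the rounding of fractional counts are hypothetical choices, and the per-category source ratios mimic the pattern of Table 6 on a made-up three-category graph.

```python
import random


def split_nodes(labels, source_ratio, seed=10):
    """Assign nodes to source/target per category, then split each domain.

    labels: list with one category label per node.
    source_ratio: dict mapping category -> fraction sent to the source domain.
    """
    rng = random.Random(seed)
    source, target = [], []
    for category, ratio in source_ratio.items():
        nodes = [i for i, y in enumerate(labels) if y == category]
        rng.shuffle(nodes)
        n_source = round(ratio * len(nodes))
        source.extend(nodes[:n_source])
        target.extend(nodes[n_source:])

    rng.shuffle(source)
    rng.shuffle(target)
    n_src_train = round(0.8 * len(source))  # source: 80% train, 20% validation
    n_tgt_train = round(0.6 * len(target))  # target: 60% train
    n_tgt_val = round(0.2 * len(target))    # target: 20% validation, rest test
    return {
        "source_train": source[:n_src_train],
        "source_val": source[n_src_train:],
        "target_train": target[:n_tgt_train],
        "target_val": target[n_tgt_train:n_tgt_train + n_tgt_val],
        "target_test": target[n_tgt_train + n_tgt_val:],
    }


# Hypothetical toy graph: 3 categories with 100 nodes each; category 3 is novel.
labels = [1] * 100 + [2] * 100 + [3] * 100
splits = split_nodes(labels, {1: 0.1, 2: 0.9, 3: 0.0})
print({name: len(nodes) for name, nodes in splits.items()})
```

A source ratio of 0 for the last category keeps every node of that category in the target, which is how the novel category arises in the simulated datasets.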

arxiv is a subgraph of the ogbn-arxiv (https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv) [13] academic citation graph from 1990 to 2012 containing 6 robotics-related categories: cs:AI (Artificial Intelligence), cs:MA (Multiagent Systems), cs:CV (Computer Vision and Pattern Recognition), cs:SY (Systems and Control), cs:LG (Machine Learning), and cs:RO (Robotics). Nodes belonging to cs:SY did not exist until 2007. Therefore, we set the nodes with timestamps before 2007 as the source and the rest as the target for our experiments. The source and target distributions are visualized in Fig. 8. We select the nodes having timestamps in 2012 as the test set. For both the source nodes and the target nodes excluding the test set, we randomly select 80% as the training set and 20% as the validation set.
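The temporal split can be illustrated as follows. This is a hedged sketch rather than the authors' implementation: `temporal_split` is a hypothetical helper, the list of publication years is synthetic, and it assumes each node carries a single integer year between 1990 and 2012.

```python
import random


def temporal_split(years, shift_year=2007, test_year=2012, seed=10):
    """Split node indices by publication year as described in the text."""
    rng = random.Random(seed)
    source = [i for i, y in enumerate(years) if y < shift_year]
    # Target nodes that are not reserved for testing.
    target = [i for i, y in enumerate(years) if shift_year <= y < test_year]
    test = [i for i, y in enumerate(years) if y == test_year]

    def train_val(nodes):
        """Random 80% / 20% train / validation split."""
        shuffled = list(nodes)
        rng.shuffle(shuffled)
        cut = round(0.8 * len(shuffled))
        return shuffled[:cut], shuffled[cut:]

    source_train, source_val = train_val(source)
    target_train, target_val = train_val(target)
    return {
        "source_train": source_train,
        "source_val": source_val,
        "target_train": target_train,
        "target_val": target_val,
        "target_test": test,
    }


# Synthetic years: 10 nodes for every year from 1990 to 2012 (assumed data).
years = [1990 + (i % 23) for i in range(230)]
arxiv_splits = temporal_split(years)
print({name: len(nodes) for name, nodes in arxiv_splits.items()})
```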

Figure 4: The source and target distributions of Cora-S, Cora-MS, and Cora-NS. The last category (Category 7) is the novel category, so it does not appear in the source domain. The subpopulation shift exhibited in Cora-S is significant: more than half of the source nodes belong to Category 4, while Category 4 accounts for less than 10% of the target nodes. In contrast, the relative proportions of the non-novel categories in the source and target are the same in Cora-NS.
Figure 5: The source and target distributions of CiteSeer-S, CiteSeer-MS, and CiteSeer-NS.
Figure 6: The source and target distributions of Computers-S, Computers-MS, and Computers-NS.
Figure 7: The source and target distributions of Photo-S, Photo-MS, and Photo-NS.
Figure 8: A comparison between the source and target distributions of arxiv. The novel category cs:SY does not exist in the source but appears in the target domain. The proportion of nodes in category cs:AI decreases significantly from source to target, while the proportions of cs:CV and cs:LG increase notably.

Appendix 0.C Model and Hyper-Parameter Configurations

0.C.0.1 Model Structure

Models on all datasets share the following structure; only gcn_input_dim, gcn_hidden_dim, gcn_output_dim, mlp_input_dim, and mlp_hidden_dim vary across datasets.

  • GCN encoder

    • GCNConv(gcn_input_dim, gcn_hidden_dim)

    • BatchNorm1d

    • ReLU

    • Dropout(0.5)

    • GCNConv(gcn_hidden_dim, gcn_output_dim)

    • BatchNorm1d

    • ReLU

  • MLP head

    • Linear(mlp_input_dim, mlp_hidden_dim)

    • BatchNorm1d

    • ReLU

    • Dropout(0.5)

    • Linear(mlp_hidden_dim, 2)
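The layer list above can be expanded for a concrete dataset with a small bookkeeping helper. The sketch below is an assumption-laden illustration and not the released model code: `ModelDims` and `layer_specs` are hypothetical names, the BatchNorm1d widths are inferred from the preceding layer, and the check that the MLP head consumes the encoder output is an assumption (it is consistent with mlp_input_dim matching gcn_output_dim in Table 7). The example values are the Cora column of Table 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDims:
    """Dataset-dependent layer widths (names follow the text)."""
    gcn_input_dim: int
    gcn_hidden_dim: int
    gcn_output_dim: int
    mlp_input_dim: int
    mlp_hidden_dim: int


def layer_specs(d):
    """Return the layer list as strings, in forward order."""
    # Assumption: the MLP head is fed the GCN encoder output.
    if d.mlp_input_dim != d.gcn_output_dim:
        raise ValueError("MLP head input must match the encoder output width")
    encoder = [
        f"GCNConv({d.gcn_input_dim}, {d.gcn_hidden_dim})",
        f"BatchNorm1d({d.gcn_hidden_dim})",
        "ReLU()",
        "Dropout(0.5)",
        f"GCNConv({d.gcn_hidden_dim}, {d.gcn_output_dim})",
        f"BatchNorm1d({d.gcn_output_dim})",
        "ReLU()",
    ]
    head = [
        f"Linear({d.mlp_input_dim}, {d.mlp_hidden_dim})",
        f"BatchNorm1d({d.mlp_hidden_dim})",
        "ReLU()",
        "Dropout(0.5)",
        f"Linear({d.mlp_hidden_dim}, 2)",  # two logits: non-novel vs. novel
    ]
    return encoder + head


cora = ModelDims(1433, 16, 16, 16, 8)  # Cora column of Table 7
specs = layer_specs(cora)
print("\n".join(specs))
```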

0.C.0.2 Random Seeds

We use the following 10 random seeds for all experiments: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.

0.C.0.3 Optimizer

We use the Adam optimizer [14] with a learning rate of 0.001 for all methods, including both the primal and dual optimizers of RECO-SLIP.

0.C.0.4 Mixture Proportion Estimation (MPE)

For PU learning methods that require MPE, we use the best bin estimator (BBE) [11] (https://github.com/acmi-lab/PU_learning/blob/main/estimator.py) to estimate the prior.

0.C.0.5 Domain Discriminator

  • Maximum epochs: 2000

0.C.0.6 uPU

  • MPE epochs: 150

  • Maximum total epochs: 1000

  • Patience: 50
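The interaction between the maximum epoch budget and the patience setting can be sketched as below. This is a generic early-stopping loop written under assumptions: the text does not say which validation quantity is monitored, so the sketch assumes a validation loss to be minimized, and `train_with_patience` and the toy loss curve are hypothetical.

```python
def train_with_patience(run_epoch, max_epochs=1000, patience=50):
    """Train until max_epochs or until no improvement for `patience` epochs.

    run_epoch: callable taking the epoch index and returning a validation
    loss (assumed criterion; lower is better).
    Returns (best_epoch, best_loss, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        epochs_run = epoch + 1
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # `patience` epochs have passed without improvement
    return best_epoch, best_loss, epochs_run


# Toy validation curve with its minimum at epoch 100 (assumed, for illustration).
best_epoch, best_loss, epochs_run = train_with_patience(
    lambda epoch: abs(epoch - 100) / 100
)
print(best_epoch, best_loss, epochs_run)
```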

0.C.0.7 nnPU

  • MPE epochs: 150

  • Maximum total epochs: 1000

  • Patience: 50

0.C.0.8 SAR-EM

  • Maximum EM steps: 500

  • Inner epochs for maximization (M): 200

  • Patience for EM steps: 50

0.C.0.9 LP-PUL

  • Initial novel ratio: 0.5

  • Label propagation layers: 3

  • Label propagation $\alpha$: 0.9

0.C.0.10 PU-GNN

  • MPE epochs: 150

  • Adapted Dist-PU $\delta$: 3

  • Structural regularization weight: 0.1

  • Maximum total epochs: 1000

  • Patience: 50

0.C.0.11 RECO-SLIP

  • Link prediction loss weight $\xi$: 0.001

  • $\tilde{\bm{\alpha}} = [0.05, 0.1, 0.15, 0.2, 0.25]$

  • Epochs: 1000

We list dataset-dependent hyper-parameters in Table 7.

Table 7: Dataset-dependent hyper-parameters

| Group | Hyper-parameter | Cora | CiteSeer | Computers | Photo | arxiv |
|---|---|---|---|---|---|---|
| Model structure | gcn_input_dim | 1433 | 3703 | 767 | 745 | 128 |
| Model structure | gcn_hidden_dim | 16 | 64 | 16 | 64 | 64 |
| Model structure | gcn_output_dim | 16 | 32 | 16 | 32 | 64 |
| Model structure | mlp_input_dim | 16 | 32 | 16 | 32 | 64 |
| Model structure | mlp_hidden_dim | 8 | 4 | 8 | 32 | 32 |
| PU-GNN | Structural regularization K | 50 | 50 | 30 | 30 | 50 |
| RECO-SLIP | Initial dual variable $\lambda$ | 0.1 | 0.1 | 0.4 | 0.1 | 0.2 |
| RECO-SLIP | $\tilde{\beta}$ | 0.01 | 0.05 | 0.01 | 0.05 | 0.01 |