License: arXiv.org perpetual non-exclusive license
arXiv:2402.17242v1 [cs.SI] 27 Feb 2024

Scalable Community Search with Accuracy Guarantee on Attributed Graphs

Yuxiang Wang1, Shuzhan Ye1, Xiaoliang Xu1, Yuxia Geng1, Zhenghe Zhao1, Xiangyu Ke2, Tianxing Wu3
{lsswyx,yeshuzhan123,xxl,yuxia.geng,zhaozh}@hdu.edu.cn, [email protected], [email protected]
1Hangzhou Dianzi University, China, 2Zhejiang University, Hangzhou, China, 3Southeast University, Nanjing, China
Abstract

Given an attributed graph $G$ and a query node $q$, Community Search over Attributed Graphs (CS-AG) aims to find a structure- and attribute-cohesive subgraph of $G$ that contains $q$. Although CS-AG has been widely studied, existing methods still face three challenges. (1) Exact methods based on graph traversal are time-consuming, especially on large graphs. Tailored indices can improve efficiency, but they introduce non-negligible storage and maintenance overhead. (2) Approximate methods with a loose approximation ratio provide only a coarse-grained evaluation of a community’s quality, rather than a reliable evaluation with an accuracy guarantee at runtime. (3) Attribute cohesiveness metrics often ignore the important correlation with the query node $q$. We formally define our CS-AG problem atop a $q$-centric attribute cohesiveness metric that considers both textual and numerical attributes, for the $k$-core model on homogeneous graphs. We show that the problem is NP-hard. To solve it, we first propose an exact baseline with three pruning strategies. Then, we propose an index-free sampling-estimation-based method that quickly returns an approximate community with an accuracy guarantee in the form of a confidence interval. Once a good result satisfying a user-desired error bound is reached, the method terminates early. We extend it to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Comprehensive experimental studies on ten real-world datasets show its superiority: it is at least $1.54\times$ ($41.1\times$ on average) faster in response time and achieves a reliable relative error of attribute cohesiveness (within a user-specific error bound).

I Introduction

Many large-scale, real-world attributed graphs have recently emerged in various domains, e.g., social networks, collaboration networks, and knowledge graphs [1, 2, 3, 4, 5]. In such graphs, nodes represent entities with attributes and edges represent the relationships between entities. Given an attributed graph $G$ and a query node $q$, community search (CS) aims to find a cohesive community $H \subseteq G$ that contains $q$. CS is widely recognized as important for many real-life applications [6, 7], such as event planning [8], biological data analysis [9, 10], and recommendation [11, 12, 7]. For example, one can input a specific disease-related gene to find a community of similar genes in a biological network (e.g., GEO [13]), which helps reveal the hidden causes of diseases; a system can recommend movies to a user by taking one of her favorite movies as a query and returning a cohesive community of similar movies from IMDB.

Cohesiveness is an important metric for measuring a community’s quality. It is two-fold, comprising structure cohesiveness and attribute cohesiveness. Many models, such as $k$-core [8, 14, 15], $k$-truss [16, 17, 18, 5], and $k$-clique [19, 20], have been proposed to measure structure cohesiveness. In terms of the ability to measure structure cohesiveness, [21] ranks them as $k$-core $\preceq$ $k$-truss $\preceq$ $k$-clique. Here, $A \preceq B$ indicates that model $B$ is more cohesive than model $A$. As structure cohesiveness increases, computational efficiency decreases: $k$-core $\succeq$ $k$-truss $\succeq$ $k$-clique [21]. Users may choose an appropriate model according to their actual demands. Attribute cohesiveness is another metric for enhancing a community’s quality. Given an attribute cohesiveness metric, one can formulate an optimization problem to find the most attribute-cohesive community. Essentially, existing CS on attributed graphs requires users first to define an appropriate community model and an attribute cohesiveness metric, and then to design an exact or approximate CS algorithm that returns a community of interest [22, 23, 5, 3]. Figure 1(a) shows an example on IMDB. Each node represents an audiovisual work with attributes provided at the bottom (i.e., $\langle$type, $\{$genres$\}\rangle$, $\langle$average rating, # ratings$\rangle$). Each edge indicates that two works have a common actor. Suppose a user likes The Godfather (i.e., $v_1$); it is easy to explore a high-rating community of crime drama movies for her by running a CS algorithm with $v_1$ as the query.

Figure 1: An example of CS: (a) A snapshot of IMDB with attributes at the bottom. (b)-(e) Different results of four methods.

Challenges and solutions. We next use the example of movie community search mentioned above to illustrate the challenges faced by existing methods and our solutions.

Challenge I: How to design a metric to better measure a community’s attribute cohesiveness? Figure 1(b)-(d) shows the results of three representative CS methods on IMDB given $q = v_1$ (The Godfather) and $k$-core ($k = 3$) as the community model (a $k$-core is a subgraph in which each node has a degree $\geq k$). ATC [3] uses the weighted sum of the contributions of $q$’s attributes as the metric. The contribution of an attribute $a$ is defined as $\frac{|V_a \cap V_H|^2}{|V_H|}$, where $V_a$ is the set of nodes that have attribute $a$ and $V_H$ is the node set of a community $H$.
ATC excludes $v_{13}$ and $v_{14}$, which have a different type (TV series) and different genres, and returns an $H$ with the largest weighted sum of $\frac{13^2}{13} + \frac{12^2}{13} + \frac{12^2}{13} = 35.2$ in Figure 1(b), where the contributions of $\langle$movie, $\{$crime, drama$\}\rangle$ are $\frac{13^2}{13}$, $\frac{12^2}{13}$, and $\frac{12^2}{13}$. Since this metric is tailored for textual attributes based on equality matching, the result includes six movies whose numerical attributes are dissimilar to those of $q$ (colored nodes), e.g., two low-rating action movies $v_{11}$ and $v_{12}$. Figure 1(c) shows the result of ACQ [22], which uses the number of shared attributes as the metric. ACQ increases the number of shared attributes from 1 (movie) to 3 ($\langle$movie, $\{$crime, drama$\}\rangle$) by deleting $v_{11}$ and $v_{12}$. It still includes four movies dissimilar to $q$, as the number of shared attributes is only meaningful for textual attributes.
VAC [5] is the state-of-the-art method, which aims to minimize the maximum attribute distance between any two nodes in a community. Since the maximum attribute distance is dominated by the most dissimilar pair of nodes, VAC optimizes only the worst case in a community and overlooks the similarity of the other nodes to $q$. In Figure 1(d), $v_4$ is the node most dissimilar to $v_1$; however, deleting it would collapse the $k$-core of $v_1$. VAC therefore halts at this step, as the worst case cannot be further improved, and retains $v_8$ and $v_9$, which are numerically dissimilar to $q$. In a nutshell, none of these methods supports CS on attributed graphs well, either because they cannot handle textual and numerical attributes simultaneously, or because they rely on metrics designed from a global perspective (e.g., optimizing the worst case) and neglect the crucial cohesiveness w.r.t. the query node $q$.
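The ATC score quoted for Figure 1(b) can be checked with a few lines of arithmetic. This is only a sketch: the function name `atc_contribution` is hypothetical, and unit weights for the three attributes are an assumption that happens to reproduce the reported value; the counts (13, 12, 12) and the community size (13) come from the example above.

```python
def atc_contribution(nodes_with_attr: int, community_size: int) -> float:
    """Contribution |V_a ∩ V_H|^2 / |V_H| of a single query attribute a."""
    return nodes_with_attr ** 2 / community_size


# Counts from the Figure 1(b) example: H has 13 nodes; 'movie' is shared by
# all 13, 'crime' and 'drama' by 12 each. Unit weights are an assumption.
counts = {"movie": 13, "crime": 12, "drama": 12}
score = sum(atc_contribution(c, 13) for c in counts.values())
print(round(score, 1))  # 35.2
```

Because every term counts only exact attribute matches, numerical attributes such as ratings never enter the score, which is the limitation the paragraph above points out.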

Therefore, we define a $q$-centric attribute distance $\delta(H)$ of a community $H$ that considers both textual and numerical attributes (§II), based on which we define our CS problem on attributed graphs (CS-AG) and prove that it is NP-hard (§III). We present an exact algorithm with three pruning strategies (§IV), which serves as the exact baseline in our experimental study.

Challenge II: How to design an efficient index-free approximate CS algorithm with a reliable accuracy guarantee? The conventional solution for CS is to design algorithms based on graph traversal, which is time-consuming for large graphs [7, 6, 5, 3, 22]. To improve efficiency, researchers resort to community-aware indices or approximate algorithms. Index-based solutions trade space overhead for efficiency. However, an index whose size is comparable to the graph size is usually required [22]. In dynamic scenarios, index updates or reconstruction introduce non-negligible maintenance overhead. Approximate solutions trade effectiveness for efficiency, e.g., [16, 5] provide a $2$-approximation to the exact results via the triangle inequality [24]. However, the approximation ratio provides only an (often loose) upper bound on attribute cohesiveness and fails to offer a reliable evaluation of how good a community is at runtime, i.e., what is the relative error of a community’s attribute distance w.r.t. that of the ground-truth community?

Compared to a tardy exact result or an approximate result with a loose approximation ratio, it is more desirable for a method to quickly return an approximate result with a reliable accuracy guarantee [25, 26, 27, 28]. Thus, we present an index-free sampling-estimation-based approximate algorithm that first collects a set of high-quality samples (nodes) based on the Hoeffding inequality to form a candidate community. We then estimate its attribute distance as $\delta^{\star}$ with a $1-\alpha$ level confidence interval CI = $\delta^{\star} \pm \varepsilon$ using the Bag of Little Bootstraps, where $\varepsilon$ is the half width of the CI. Given a user-desired error bound $e$, we guarantee that the relative error of $\delta^{\star}$ w.r.t. the exact $\delta$ is bounded by $e$ (i.e., $|\delta^{\star} - \delta| / \delta \leq e$) if $\varepsilon$ is small enough to satisfy Theorem 11. We terminate the query early once such a good $\delta^{\star}$ is reached. Otherwise, we enlarge the sample and repeat the above steps until Theorem 11 holds. Figure 1(e) shows the result for $e = 1\%$ and $1-\alpha = 95\%$: it takes only 66 ms (at least $12\times$ faster than the others, see the table at the top) to return the same community as our exact algorithm (§IV), indicating a relative error of 0% ($\delta^{\star} = \delta = 0.123$).
It contains seven crime drama movies similar to $v_1$, with higher ratings and more ratings. By varying $e$, we found that the results for $e = 10\%$ and $25\%$ are the same as (d) and (c), with bounded relative errors. For example, given the CIs provided above Figure 1(e), the relative error for $e = 10\%$ is $\frac{0.133 - 0.123}{0.123} = 8.1\%$. A larger $e$ and a smaller $1-\alpha$ usually result in less estimation time, e.g., 66 ms, 55 ms, and 49 ms for $e = 1\%$, $10\%$, and $25\%$. Users may configure them based on their preferences for accuracy and efficiency.
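The sample, estimate, check, enlarge loop described above can be illustrated with a deliberately simplified Python sketch. It is not the paper's algorithm: uniform sampling stands in for the Hoeffding-based sample collection, a plain percentile bootstrap stands in for the Bag of Little Bootstraps, the doubling schedule is an assumption, and the quantity estimated is just the mean of a fixed list of hypothetical distances rather than the attribute distance of a searched community. The stopping test uses the condition $\varepsilon \leq \delta^{\star} \cdot e/(1+e)$ that §II-B states for Theorem 11; all function and variable names are illustrative.

```python
import random
import statistics


def bootstrap_half_width(sample, alpha, rounds, rng):
    """Half width of a percentile-bootstrap CI for the mean of `sample`."""
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(rounds)
    )
    low = means[int((alpha / 2) * (rounds - 1))]
    high = means[int((1 - alpha / 2) * (rounds - 1))]
    return (high - low) / 2


def estimate_with_early_stop(distances, e=0.10, alpha=0.05,
                             start=16, rounds=200, seed=0):
    """Enlarge a uniform sample until eps <= delta_star * e / (1 + e)."""
    rng = random.Random(seed)
    order = list(distances)
    rng.shuffle(order)                  # a prefix of `order` is a uniform sample
    size = min(start, len(order))
    while True:
        sample = order[:size]
        delta_star = statistics.fmean(sample)          # point estimate
        eps = bootstrap_half_width(sample, alpha, rounds, rng)
        bounded = eps <= delta_star * e / (1 + e)      # early-termination test
        if bounded or size == len(order):
            return delta_star, eps, size, bounded
        size = min(2 * size, len(order))               # assumed doubling schedule


# Hypothetical per-node distances f(u, q); not data from the paper.
gen = random.Random(1)
population = [gen.uniform(0.1, 0.3) for _ in range(5000)]
delta_star, eps, size, bounded = estimate_with_early_stop(population, e=0.10)
print(f"delta*={delta_star:.3f}, eps={eps:.4f}, sample size={size}, bounded={bounded}")
```

A looser error bound `e` lets the test pass with a wider interval, so the loop tends to stop at a smaller sample, which mirrors the time/accuracy trade-off reported above.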

Challenge III: How to enable a CS algorithm to be scalable for different scenarios? A CS algorithm is usually tailored for a specific graph type (homogeneous or heterogeneous) and a specific community model ($k$-core, $k$-truss, etc.), with or without a size constraint (size-bounded CS [29]), and thus depends strongly on a specific scenario. This implies that adapting a CS algorithm from one scenario to another usually requires redesigning it partially or even completely.

Since our approximate CS algorithm consists of two standard steps, sampling and estimation, it can easily be extended to more general scenarios with lightweight modifications, including heterogeneous graphs (§VI-A), size-bounded CS (§VI-B), and other community models (§VI-C). For example, we can simply add a constraint on the sample size in the sampling step to support size-bounded CS with an accuracy guarantee.

Contributions. The main contributions are as follows.

  • We define the CS-AG problem and its approximate version, Approx-CS-AG, based on a $q$-centric attribute cohesiveness metric in §II. We prove that CS-AG is NP-hard in §III.

  • We present an exact method for CS-AG with three pruning strategies (§IV) to avoid redundant searches.

  • We propose an index-free sampling-estimation-based approximate CS method for Approx-CS-AG in §V, with a reliable runtime accuracy guarantee on attribute cohesiveness.

  • We extend our approximate CS method to more general scenarios such as CS on heterogeneous graphs, size-bounded CS, and CS with different community models in §VI.

  • We conduct extensive experiments to evaluate our method in terms of effectiveness and efficiency (§VII-B to §VII-C), the effect of pruning strategies (§VII-D), scalability (§VII-E), a case study (§VII-F), and parameter sensitivity (§VII-G).

II Preliminaries and Problems

II-A Preliminaries

Definition 1

Attributed Graph. An attributed graph is defined as $G = (V_G, E_G, A_G)$, where $V_G$ ($E_G$) is the node (edge) set and $A_G$ is the set of attributes associated with nodes. Each node $v \in V_G$ has a set of attributes $A(v) = \{a_0, \dots, a_m\}$, where $A(v) \subseteq A_G$ and $A(v)_i$ denotes the $i$-th attribute $a_i$ of node $v$. In this paper, we consider both the textual and the numerical attributes of a node, denoted by $A^t(v) \subseteq A(v)$ and $A^{\#}(v) \subseteq A(v)$, respectively.

Figure 2: An example of $k$-core and connected $k$-core

Structure cohesiveness. We first employ the $k$-core model [30] to measure the structure cohesiveness of a community, because it is the most efficient model compared to the alternatives [21], i.e., $k$-truss [16], $k$-ECC [31], and $k$-clique [19], and it performs well in cohesiveness evaluation [7, 21, 32, 15, 2]. In §VI, we show how to extend our approach to other community models.

Definition 2

$k$-core. Given a graph $G$ and a non-negative integer $k$, the $k$-core of $G$ is the largest subgraph $H_k \subseteq G$ such that every node $v \in H_k$ has degree at least $k$, i.e., $deg(v, H_k) \geq k$.

In Figure 2(a), $H_1$ is the entire graph, as every node has at least one neighbor; $H_2$ is the subgraph excluding $v_{12}$; $H_3$ consists of two components in which each node has at least three neighbors. The larger $k$ is, the more cohesive $H_k$ is.

Definition 3

Connected $k$-core [4]. Given a graph $G$ and a non-negative integer $k$, a connected $k$-core of $G$ is a connected subgraph $\tilde{H}_k \subseteq H_k \subseteq G$ such that $\forall v \in \tilde{H}_k$, $deg(v, \tilde{H}_k) \geq k$.

To incorporate connectivity into a community, similar to [4, 33, 14, 22], we use the connected $k$-core as the community model. Given a query node $q$, we aim to find a connected $k$-core containing $q$ as the desired community of $q$. Figure 2(b)-(c) shows two communities, $\tilde{H}_3$ and $\tilde{H}_2$, containing $q = v_5$.
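Definitions 2 and 3 can be made concrete with a small Python sketch. It computes the $k$-core by the standard peeling procedure (repeatedly deleting nodes of degree below $k$) and then takes the connected component of $q$ inside it, which yields the maximal connected $k$-core containing $q$. The toy graph, the adjacency-dictionary representation, and the function names are assumptions for illustration and do not correspond to Figure 2.

```python
from collections import deque


def k_core(adj, k):
    """Node set of the k-core, obtained by peeling nodes of degree < k."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    queue = deque(v for v in alive if deg[v] < k)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue                      # already peeled
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    return alive


def connected_k_core(adj, k, q):
    """Maximal connected k-core containing q (empty set if none exists)."""
    core = k_core(adj, k)
    if q not in core:
        return set()
    seen, queue = {q}, deque([q])
    while queue:                          # BFS restricted to the k-core
        v = queue.popleft()
        for u in adj[v]:
            if u in core and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen


# Toy graph: two 4-cliques {0,1,2,3} and {4,5,6,7} joined through node 8,
# plus a pendant node 9 attached to node 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),
         (3, 8), (8, 4), (0, 9)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

print(sorted(connected_k_core(adj, 3, 0)))  # [0, 1, 2, 3]
print(sorted(connected_k_core(adj, 2, 0)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In the toy graph the 3-core splits into two components, so the community of node 0 for $k = 3$ is only its own clique, whereas for $k = 2$ the bridge node 8 survives and the community spans both cliques.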

Attribute cohesiveness. We consider two types of attributes: textual and numerical. Two nodes $u, v \in V_G$ are similar in textual attributes if their textual attributes overlap substantially [5]. We use the Jaccard distance to measure their textual attribute distance, denoted by $f^t(u,v) = 1 - \frac{|A^t(u) \cap A^t(v)|}{|A^t(u) \cup A^t(v)|}$. The essence of the Jaccard distance is equality matching: the higher the ratio of equally matched textual attributes, the smaller $f^t(u,v)$. Unlike for textual attributes, equality matching is often meaningless for numerical attributes. Many alternative choices are available, such as the Manhattan distance and the Euclidean distance.
In this paper, we compute the numerical attribute distance based on the Manhattan distance, $f^{\#}(u,v) = \frac{\sum_{i=1}^{m} |Z(A^{\#}(u)_i) - Z(A^{\#}(v)_i)|}{m}$, where $Z(\cdot)$ normalizes $A^{\#}(\cdot)$ to $[0,1]$, thereby eliminating the influence of differing units and scales. We next compute the composite attribute distance between two nodes as $f(u,v) = \gamma \cdot f^t(u,v) + (1-\gamma) \cdot f^{\#}(u,v)$, where the parameter $\gamma \in [0,1]$ is a balance factor that adjusts user preferences between the two types of attribute cohesiveness. It is worth mentioning that other options exist for combining the two types of attribute distances. For example, if the two types of attributes are vectorized as a unified high-dimensional vector (i.e., an embedding), we can simply measure the attribute distance by computing the cosine similarity of the attribute vectors [34]. Using the composite distance, we can measure how similar a node is to a query node $q$.
Intuitively, the more nodes of a community are similar to $q$, the more attribute-cohesive the community is w.r.t. $q$. We next define the $q$-centric attribute distance of a community $\tilde{H}_k$ as follows.

Definition 4

Attribute distance of $\tilde{H}_k$. Given a community $\tilde{H}_k$ containing $q$, its $q$-centric attribute distance is defined as the average composite attribute distance to $q$ over all nodes in $\tilde{V}_k \setminus q$, i.e., $\delta(\tilde{H}_k) = \frac{\sum_{\forall u \in \tilde{V}_k \setminus q} f(u,q)}{|\tilde{V}_k| - 1}$, where $\tilde{V}_k$ is the node set of $\tilde{H}_k$.
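The composite distance $f(u,v)$ and the $q$-centric attribute distance of Definition 4 translate directly into code. The sketch below is a minimal illustration under stated assumptions: numerical attributes are taken to be already normalized to $[0,1]$ (the role of $Z(\cdot)$), two empty textual attribute sets are treated as identical, and the node names, attribute values, and function names are hypothetical rather than taken from the IMDB example.

```python
def jaccard_distance(a, b):
    """Textual distance f^t: 1 - |a ∩ b| / |a ∪ b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty attribute sets count as identical
    return 1 - len(a & b) / len(union)


def numeric_distance(x, y):
    """Numerical distance f^#: mean absolute difference of normalized values."""
    return sum(abs(p - r) for p, r in zip(x, y)) / len(x)


def composite_distance(u, v, gamma=0.5):
    """f(u, v) = gamma * f^t(u, v) + (1 - gamma) * f^#(u, v)."""
    return (gamma * jaccard_distance(u["text"], v["text"])
            + (1 - gamma) * numeric_distance(u["num"], v["num"]))


def q_centric_distance(nodes, q, attrs, gamma=0.5):
    """delta(H): average composite distance to q over all nodes except q."""
    others = [u for u in nodes if u != q]
    return sum(composite_distance(attrs[u], attrs[q], gamma)
               for u in others) / len(others)


# Hypothetical attributes; the numeric values are already scaled to [0, 1].
attrs = {
    "q": {"text": {"movie", "crime", "drama"}, "num": (0.9, 0.8)},
    "a": {"text": {"movie", "crime", "drama"}, "num": (0.9, 0.6)},
    "b": {"text": {"movie", "action"},         "num": (0.5, 0.4)},
}
delta = q_centric_distance(["q", "a", "b"], "q", attrs, gamma=0.5)
print(round(delta, 4))  # 0.3125
```

Here node `a` contributes $f = 0.05$ and node `b` contributes $f = 0.575$, so the average over the two non-query nodes is $0.3125$; dropping `b` from the community would lower $\delta$ to $0.05$.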

II-B Problem Definition

Problem 1. (CS-AG problem) Given an attributed graph $G = (V_G, E_G, A_G)$, a query node $q \in V_G$, and a parameter $k > 0$, Community Search over Attributed Graphs (CS-AG) returns a subgraph $\tilde{H}_k \subseteq G$ satisfying the following properties:

  • Query participation. $\tilde{H}_k$ contains $q$, i.e., $q \in \tilde{V}_k$;

  • Structure cohesiveness. $\tilde{H}_k$ is a connected $k$-core;

  • Attribute cohesiveness. $\tilde{H}_k$ has the smallest $\delta(\tilde{H}_k)$.
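To make the three properties concrete, the following sketch solves Problem 1 on a tiny graph by naive enumeration. It is not the exact algorithm with pruning strategies of §IV: it simply tries every node subset that contains $q$, keeps those whose induced subgraph is a connected $k$-core, and returns the one with the smallest $\delta$. The toy graph, the precomputed distances `dist_to_q` (standing in for $f(u,q)$), and the function names are assumptions; the running time is exponential, so this is usable only for illustration.

```python
from itertools import combinations


def is_connected_k_core(nodes, adj, k):
    """True if the subgraph induced by `nodes` is connected with min degree >= k."""
    nodes = set(nodes)
    if any(len(adj[v] & nodes) < k for v in nodes):
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                                  # DFS inside the induced subgraph
        v = stack.pop()
        for u in adj[v] & nodes:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == nodes


def brute_force_cs_ag(adj, q, k, dist_to_q):
    """Exhaustively find the connected k-core containing q with minimum delta."""
    others = [v for v in adj if v != q]
    best, best_delta = None, float("inf")
    for r in range(k, len(others) + 1):           # q needs at least k neighbors
        for subset in combinations(others, r):
            if not is_connected_k_core((q, *subset), adj, k):
                continue
            delta = sum(dist_to_q[u] for u in subset) / len(subset)
            if delta < best_delta:
                best, best_delta = {q, *subset}, delta
    return best, best_delta


# Toy instance: triangles {0,1,2} and {0,3,4}; node 5 is adjacent to 1 and 2.
edges = [(0, 1), (0, 2), (1, 2), (0, 3), (0, 4), (3, 4), (1, 5), (2, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
dist_to_q = {1: 0.10, 2: 0.30, 3: 0.15, 4: 0.20, 5: 0.05}  # hypothetical f(u, q)

community, delta = brute_force_cs_ag(adj, q=0, k=2, dist_to_q=dist_to_q)
print(sorted(community), round(delta, 3))  # [0, 1, 2, 5] 0.15
```

Restricting the search to induced subgraphs loses nothing here, because $\delta$ depends only on the node set and extra edges can only help the degree and connectivity constraints. The toy result also shows that the optimum is not simply the smallest feasible community: adding node 5 to the triangle $\{0, 1, 2\}$ lowers the average distance from $0.2$ to $0.15$.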

We prove that the CS-AG problem is NP-hard in §III and provide an exact baseline used in our experimental study (§VII). Since the exact method is time-consuming for large graphs, we define an approximate version, Approx-CS-AG, and present a sampling-estimation-based method with an accuracy guarantee in §V.

Problem 2. (Approx-CS-AG problem) Given an attributed graph $G = (V_G, E_G, A_G)$, a query node $q \in V_G$, a parameter $k > 0$, a user-specific error bound $e$, and a confidence level $1-\alpha$, Approx-CS-AG returns an approximate community $\tilde{H}^{\star}_k \subseteq G$ containing $q$ that satisfies: (1) the attribute distance $\delta(\tilde{H}_k)$ (abbreviated as $\delta$) of the exact community $\tilde{H}_k$ is covered by a confidence interval of the attribute distance $\delta(\tilde{H}^{\star}_k)$ (abbreviated as $\delta^{\star}$) of $\tilde{H}^{\star}_k$ (Eq. 1), and (2) the relative error of $\delta^{\star}$ w.r.t. $\delta$ is bounded by the user-specific error bound $e$ (Eq. 2).

${\rm Pr}[\delta^{\star} - \varepsilon \leq \delta \leq \delta^{\star} + \varepsilon] = 1 - \alpha$ (1)
$|\delta^{\star} - \delta| / \delta \leq e$ (2)

We use a confidence interval CI = $\delta^{\star} \pm \varepsilon$ to quantify the quality of $\delta^{\star}$; it states that $\delta$ is covered by the range $\delta^{\star} \pm \varepsilon$ with probability $1-\alpha$. The half width $\varepsilon$ is called the Margin of Error (MoE). A smaller $\varepsilon$ indicates a higher quality of $\delta^{\star}$ [27]. In §V-B, we prove that the accuracy guarantee $|\delta^{\star} - \delta| / \delta \leq e$ is ensured if $\varepsilon \leq \delta^{\star} \cdot e / (1+e)$ (Theorem 11).
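One short way to see why a condition of this form suffices is sketched below. It is an informal argument, not necessarily the proof given in §V-B, and it assumes that $\delta$ lies inside the interval $\delta^{\star} \pm \varepsilon$ (which Eq. 1 asserts with probability $1-\alpha$) and that $\delta^{\star} > \varepsilon \geq 0$.

```latex
\begin{align*}
\delta^{\star} - \varepsilon \le \delta \le \delta^{\star} + \varepsilon
  &\;\Longrightarrow\; |\delta^{\star} - \delta| \le \varepsilon
     \quad\text{and}\quad \delta \ge \delta^{\star} - \varepsilon > 0, \\[2pt]
\frac{|\delta^{\star} - \delta|}{\delta}
  &\;\le\; \frac{\varepsilon}{\delta^{\star} - \varepsilon}, \\[2pt]
\frac{\varepsilon}{\delta^{\star} - \varepsilon} \le e
  &\;\Longleftrightarrow\; \varepsilon\,(1 + e) \le e\,\delta^{\star}
   \;\Longleftrightarrow\; \varepsilon \le \delta^{\star} \cdot \frac{e}{1 + e}.
\end{align*}
```

As a plain arithmetic illustration, an estimate $\delta^{\star} = 0.133$ with $e = 10\%$ would require $\varepsilon \leq 0.133 \cdot 0.1 / 1.1 \approx 0.0121$.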

III Hardness Analysis

We first define the decision version of CS-AG as $\rho$CS-AG. Next, we outline the reduction from the NP-hard R-set Maximum-Weight Connected Subgraph (R-MWCS) problem [35] to $\rho$CS-AG, and give the proof in Theorem 2.

Problem 3. ($\rho$CS-AG problem) Given an attributed graph $G=(V_G,E_G,A_G)$, a query node $q\in V_G$, and two parameters $k$, $\rho\geq 0$, the problem checks if there exists a connected $k$-core $\tilde{H}_{k}\subseteq G$ containing $q$ with the attribute distance $\delta(\tilde{H}_{k})\leq\rho$.
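A candidate answer to this decision problem can be verified directly, as the following minimal sketch shows. It assumes that $\delta(\cdot)$ is the average of $f(v,q)$ over the community's nodes other than $q$, which matches the form of $\delta$ used in the proof of Theorem 2. The names `attribute_distance` and `is_yes_certificate`, and the representation of the graph as a dictionary of neighbor sets, are illustrative choices and not from the paper.

```python
from collections import deque


def attribute_distance(f, q, nodes):
    """Average distance to q over the community's nodes other than q."""
    others = [v for v in nodes if v != q]
    return sum(f[v] for v in others) / len(others) if others else 0.0


def is_yes_certificate(adj, f, q, k, rho, nodes):
    """Check that `nodes` induces a connected k-core containing q with delta <= rho."""
    nodes = set(nodes)
    if q not in nodes:
        return False
    # Every node needs at least k neighbors inside the induced subgraph.
    if any(len(adj[v] & nodes) < k for v in nodes):
        return False
    # Breadth-first search from q restricted to the induced subgraph.
    seen, queue = {q}, deque([q])
    while queue:
        v = queue.popleft()
        for u in adj[v] & nodes:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return seen == nodes and attribute_distance(f, q, nodes) <= rho
```

Here `adj` maps each node to the set of its neighbors and `f` maps each node $v$ to its precomputed $f(v,q)$.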

Generally, the decision version of a problem is no harder than its optimization version [36]. Thus, if $\rho$CS-AG is NP-hard, then CS-AG is NP-hard. To show this, we first prove that the decision version of R-MWCS (called $\tau$R-MWCS, defined below) is NP-hard. Then, we reduce $\tau$R-MWCS to our $\rho$CS-AG to complete the proof.

Problem 4. ($\tau$R-MWCS problem) Given a graph $G=(V_G,E_G)$, a node set $R\subseteq V_G$, and node weights $w:V_G\rightarrow\mathbb{R}$ ($w(v)$ indicates the weight of $v\in V_G$, which could be any real number), the problem checks if there exists a connected subgraph $H$ that contains $R$ and satisfies the graph weight $w(H)=\sum_{v\in V_H}w(v)\leq\tau$.

Theorem 1

The $\tau$R-MWCS problem is NP-hard.

Proof:

We reduce the NP-hard Steiner tree problem (decision version, shortened as STP) [37, 38] to $\tau$R-MWCS. Given a graph $G=(V_G,E_G)$, a node set $S\subseteq V_G$, and a number $\tau\in\mathbb{N}$, STP checks if there is a tree that contains $S$ and includes at most $\tau$ edges. We complete the proof following logic similar to that of [35]. First, we construct an instance graph $G'$ via a polynomial-time transformation from $G$. For each edge $(v,u)\in E_G$, we introduce a middle node $w$ to split $(v,u)$ into two edges $(v,w)$, $(w,u)\in E_{G'}$, and assign weights $0,0,1$ to $v,u,w$, respectively. We next prove that the instance of STP is a Yes-instance iff the instance of $\tau$R-MWCS is a Yes-instance.

($\Rightarrow$) If a tree $T\subseteq G$ is a solution of STP, then it has $|E_T|\leq\tau$. Since we split each edge in $E_T$ into two edges with a new middle node having a weight of 1 and the other two original nodes having a weight of 0, the corresponding tree $T'\subset G'$ must have $w(T')\leq\tau$. $T'$ is also a connected subgraph of $G'$, so $T'$ is a Yes-instance of $\tau$R-MWCS.

($\Leftarrow$) If a subgraph $H'\subseteq G'$ is a solution of $\tau$R-MWCS, then it has $w(H')\leq\tau$. Since $H'$ is connected, a spanning tree $T'\subseteq H'$ still satisfies $w(T')\leq\tau$. Each node with a weight of 1 in $T'$ corresponds to an edge $(v,u)\in E_G$. This implies that the corresponding $T\subseteq G$ has at most $\tau$ edges. Thus, $T$ is a Yes-instance of STP. ∎
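The edge-splitting transformation used in this proof can be sketched in a few lines. The sketch assumes the graph is given as a list of edges, and the tuple label `("mid", v, u)` for the new middle node is a hypothetical naming choice; nodes of $G$ that have no incident edge are not covered.

```python
def subdivide_edges(edges):
    """Return (edges', weights) of G' where each edge (v, u) of G gets a middle node."""
    new_edges, weights = [], {}
    for v, u in edges:
        mid = ("mid", v, u)  # hypothetical label for the new middle node
        new_edges += [(v, mid), (mid, u)]
        weights[v] = 0       # original endpoints carry weight 0
        weights[u] = 0
        weights[mid] = 1     # each middle node stands for one original edge
    return new_edges, weights


def subgraph_weight(nodes, weights):
    """w(H): sum of node weights over the nodes of H."""
    return sum(weights[v] for v in nodes)
```

Under this construction, the weight of the image of a tree in $G'$ equals the number of edges of the tree in $G$, which is the correspondence both directions of the proof rely on.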

Theorem 2

The $\rho$CS-AG problem is NP-hard.

Proof:

We reduce the $\tau$R-MWCS problem to $\rho$CS-AG. Given a graph $G=(V_G,E_G,A_G)$, we first construct an instance graph $G^{*}$ via a polynomial-time transformation from $G$. First, we add all nodes and edges of $G$ into $G^{*}$. Second, we add $k|V_G|$ additional nodes into $G^{*}$ and assign a unique node ID to each of the $(k+1)|V_G|$ nodes: (1) nodes from $V_G$ have IDs $\in[0,|V_G|-1]$ and (2) the additional $k|V_G|$ nodes have IDs $\in[|V_G|,(k+1)|V_G|-1]$. Third, we add an edge between each pair of nodes $u$ and $v$ if $u.\mathrm{ID}\ \%\ |V_G|=v.\mathrm{ID}\ \%\ |V_G|$ (e.g., nodes with IDs of $\{1,|V_G|+1,\dots,k|V_G|+1\}$ have edges). Hence, we ensure that each node in $G^{*}$ has at least $k$ neighbors. Finally, given a node set $R$ with only one query node $q$, we assign weights on both $G$ and $G^{*}$ as follows: (1) $\forall v\in G$ has $w(v)=f(v,q)-\rho$, where $f(v,q)$ is the composite attribute distance between $v$ and $q$. (2) $\forall v\in G^{*}$ with ID $\in[0,|V_G|-1]$ has $w^{*}(v)=w(v)+\rho=f(v,q)$, which equals $v$'s original composite attribute distance to $q$. (3) $\forall u\in G^{*}$ with ID $\in[|V_G|,(k+1)|V_G|-1]$, we assign $u$ the same weight $w^{*}(u)=\rho$ as its $f(u,q)$. In summary, in $G^{*}$, the nodes inherited from $G$ have weights equal to their original $f(v,q)$ and the others are assigned the same weight $\rho$. Given $G^{*}$ with the query node $q$ and $\rho\in[0,1]$, and $G$ with $R=\{q\}$ and $\tau=-\rho$, we show that the instance of $\tau$R-MWCS is a Yes-instance iff the instance of $\rho$CS-AG is a Yes-instance.

($\Rightarrow$) If $H\subseteq G$ is a solution to the $\tau$R-MWCS problem that satisfies $w(H)=\sum_{v\in V_H}w(v)\leq-\rho$, then the induced graph formed by the nodes of $V_H\cup V'$ is a connected $k$-core $\tilde{H}_{k}$, where $V'$ includes the $k|V_H|$ nodes with IDs $\in[|V_G|,(k+1)|V_G|-1]$ that have edges with nodes in $V_H$. Since $w(H)=\sum_{v\in V_H}w(v)\leq-\rho$, we have $\sum_{v\in V_H\setminus q}w(v)\leq 0$ (as $w(q)=-\rho$) and $\sum_{v\in V_H\setminus q}(w(v)+\rho)=\sum_{v\in V_H\setminus q}w^{*}(v)\leq\rho\,|V_H\setminus q|$. Thus, $\delta(\tilde{H}_{k})=\frac{\sum_{v\in V_H\setminus q}w^{*}(v)+\sum_{u\in V'}\rho}{|V_H\cup V'\setminus q|}\leq\rho$. So, $\tilde{H}_{k}$ is a Yes-instance of $\rho$CS-AG if $H$ is a Yes-instance of $\tau$R-MWCS.

($\Leftarrow$) Assume a connected $k$-core $\tilde{H}_{k}$ containing $q$ is a solution to the $\rho$CS-AG problem. $\tilde{H}_{k}$ involves some nodes with IDs $\in[0,|V_G|-1]$ (i.e., $V_H\subseteq V_{\tilde{H}_{k}}$) and the remaining nodes with IDs $\in[|V_G|,(k+1)|V_G|-1]$ (denoted by $V'=V_{\tilde{H}_{k}}\setminus V_H$). Given $\delta(\tilde{H}_{k})=\frac{\sum_{v\in V_H\setminus q}w^{*}(v)+\sum_{u\in V'}\rho}{|V_H\cup V'\setminus q|}\leq\rho$, we have $\delta(\tilde{H}_{k})-\rho=\frac{\sum_{v\in V_H\setminus q}(w^{*}(v)-\rho)}{|V_H\cup V'\setminus q|}\leq 0$, i.e., $\sum_{v\in V_H\setminus q}(w^{*}(v)-\rho)=\sum_{v\in V_H\setminus q}w(v)\leq 0$. Thus, we have $w(H)=\sum_{v\in V_H\setminus q}w(v)+w(q)\leq-\rho=\tau$ (as $w(q)=-\rho$). So, $H$ is a Yes-instance of $\tau$R-MWCS if $\tilde{H}_{k}$ is a Yes-instance of $\rho$CS-AG. ∎
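The construction of $G^{*}$ in this proof can be sketched as follows. The sketch assumes the nodes of $G$ are already numbered $0,\dots,|V_G|-1$ and that `f[v]` holds the precomputed $f(v,q)$; the function name `build_g_star` and the adjacency-dictionary representation are illustrative, not from the paper.

```python
def build_g_star(n, edges, f, k, rho):
    """Build G* from G with nodes 0..n-1; returns (adjacency, weights w*)."""
    adj = {i: set() for i in range((k + 1) * n)}
    # Step 1: copy the nodes and edges of G.
    for v, u in edges:
        adj[v].add(u)
        adj[u].add(v)
    # Steps 2-3: node i and its k added copies (IDs equal modulo n)
    # are made pairwise adjacent, so every node gets at least k neighbors.
    for i in range(n):
        group = [i + j * n for j in range(k + 1)]
        for a in group:
            adj[a].update(b for b in group if b != a)
    # Original nodes keep f(v, q); every added node gets weight rho.
    w_star = {v: (f[v] if v < n else rho) for v in adj}
    return adj, w_star
```

Each group of IDs that agree modulo $|V_G|$ forms a clique of $k+1$ nodes, which is why every node of $G^{*}$, including one that is isolated in $G$, has at least $k$ neighbors.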

IV Exact Baseline

Given an attributed graph $G$, we present an Exact method to solve CS-AG. Since the ground-truth community must be included in the maximal connected $k$-core $\tilde{H}_{k}\subseteq G$, we first find the maximal $\tilde{H}_{k}$ containing $q$ (§IV-A). Then, we enumerate all the $\tilde{H}^{i}_{k}\subseteq\tilde{H}_{k}$ containing $q$ with three pruning strategies (§IV-B), and return the one with the smallest $\delta(\cdot)$.

IV-A Find the Maximal Connected $k$-core

We have two straightforward ways to obtain the maximal $\tilde{H}_{k}$: one is the classic core decomposition [39] and the other is search with expansion [7]. For the latter, we start the search from $q$ and maintain up to $k$ neighbors for each explored node $v$; if $v$ does not have $k$ neighbors, then we delete $v$ and, for each previously explored node whose degree drops to $<k$ after removing $v$, maintain its degree up to $k$. We repeat this until all nodes have been explored. Whichever method is adopted, the essence is the same: recursively remove nodes with degree $<k$ from the connected component of $q$. We implement the former in Exact.
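The shared essence of both methods, recursively removing nodes with degree $<k$ and keeping the part connected to $q$, can be sketched as below. This is a plain queue-based peeling under the assumption that the graph is a dictionary of neighbor sets; it is not the core-decomposition algorithm of [39] nor the authors' implementation, and the function name is illustrative.

```python
from collections import deque


def maximal_connected_k_core(adj, q, k):
    """Return the node set of the maximal connected k-core containing q, or an empty set."""
    degree = {v: len(neigh) for v, neigh in adj.items()}
    removed = set()
    queue = deque(v for v, d in degree.items() if d < k)
    removed.update(queue)
    # Peel nodes whose remaining degree is below k.
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1
                if degree[u] < k:
                    removed.add(u)
                    queue.append(u)
    if q in removed:
        return set()
    # Keep only the surviving nodes reachable from q.
    core, queue = {q}, deque([q])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in removed and u not in core:
                core.add(u)
                queue.append(u)
    return core
```

The sketch peels the whole graph before restricting to the component of $q$; restricting first and peeling afterwards yields the same node set.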

IV-B Enumeration with Pruning Strategies

Given the maximal $\tilde{H}_{k}\subseteq G$, we continuously peel nodes from it to form a new candidate community $\tilde{H}^{i}_{k}\subseteq\tilde{H}_{k}$ and check its attribute distance. This enumeration is represented as a search tree, where each state indicates an $\tilde{H}^{i}_{k}$ and the root $\tilde{H}^{0}_{k}$ is initialized as $\tilde{H}_{k}$. If $\tilde{H}_{k}$ includes $n$ nodes, then we can delete nodes iteratively (except $q$) to generate $n-1$ substates. Figure 3 illustrates a search tree starting from the $\tilde{H}_{2}$ in Figure 2 (c), given $q=v_5$. For example, $\tilde{H}^{1}_{2}\subseteq\tilde{H}_{2}$ is obtained by deleting $v_1$. A full enumeration is computationally expensive for a large $\tilde{H}_{k}$, but not all $\tilde{H}^{i}_{k}\subseteq\tilde{H}_{k}$ need to be visited: some of them can be pruned safely to improve efficiency.

Prune for duplicate states. In Figure 3, we show duplicate states with the same color, e.g., the green states are generated from $\tilde{H}_{2}$ by deleting $\{v_1,v_2,v_6\}$. A simple way to avoid duplicate states is to record all the visited states and check if each new state has been visited before. The main drawback is the extra memory needed to maintain all visited states. To handle this, we present our first pruning strategy based on priority enumeration. More precisely, we enumerate states in a partial order w.r.t. every node's composite attribute distance to $q$ (i.e., $f(\cdot,q)$), so that we can quickly decide to prune a state by simply evaluating $f(\cdot,q)$ between the node to be deleted and $q$, without maintaining visited states.

(1) Priority enumeration. Given an arbitrary state, we enumerate its substates in a DFS manner by deleting nodes in descending order of their composite attribute distance $f(\cdot,q)$ to $q$. In Figure 3, we first enumerate the state $\tilde{H}^{1}_{2}$ by deleting $v_1$, as $f(v_1,q)=0.7$ is larger than the other nodes' distances to $q$ (distance information is provided at the top of Figure 3).

Figure 3: A search tree for the $\tilde{H}_{2}$ in Figure 2 (c)
Lemma 1

Given a state $\tilde{H}^{i}_{k}$ that includes two nodes $v_x$ and $v_y$ with $f(v_x,q)>f(v_y,q)$, there must exist a substate $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$ generated by successively deleting $v_x$ and $v_y$ from $\tilde{H}^{i}_{k}$.

This lemma holds as we enumerate states by deleting nodes in descending order of $f(\cdot,q)$ (i.e., priority enumeration). For example, $\tilde{H}^{1}_{2}$ (in Figure 3) has a substate with nodes $\{v_3,v_4,v_5\}$ after deleting $v_2$, $v_6$ in turn ($f(v_2,q)>f(v_6,q)$).

(2) Prune based on $f(\cdot,q)$. Consider a state $\tilde{H}^{i}_{k}$ that is generated from its parent state $\tilde{H}^{p}_{k}$ by deleting a node $v_y\in\tilde{H}^{p}_{k}$. Suppose we try to enumerate a substate $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$ by deleting a current node $v_x\in\tilde{H}^{i}_{k}$; then we can prune $\tilde{H}^{j}_{k}$ in the following cases.

• Case 1. Let us consider a simple case, where deleting $v_x$ does not result in other nodes from $\tilde{H}^{i}_{k}$ having a degree less than $k$, i.e., no other nodes would be removed after deleting $v_x$. In this case, we can prune $\tilde{H}^{j}_{k}$ according to Theorem 3.

Theorem 3

Given a state $\tilde{H}^{i}_{k}\subseteq\tilde{H}^{p}_{k}$ that is obtained by deleting a node $v_y\in\tilde{H}^{p}_{k}$ and a substate $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$ obtained by deleting only one node $v_x\in\tilde{H}^{i}_{k}$, if $f(v_x,q)>f(v_y,q)$, then we can prune $\tilde{H}^{j}_{k}$ and its subsequent substates.

Proof:

According to Lemma 1, a previously visited state must exist that results from successively deleting $v_x$ and $v_y$ from $\tilde{H}^{p}_{k}$, because $f(v_x,q)>f(v_y,q)$. This state duplicates $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{p}_{k}$, which is generated by successively deleting $v_y$ and $v_x$. Thus, we can directly prune $\tilde{H}^{j}_{k}$ if $f(v_x,q)>f(v_y,q)$. ∎

Example 1

Given the yellow state (in Figure 3) including nodes $\{v_{2},v_{3},v_{6},v_{5}\}$ that is obtained from $\tilde{H}^{1}_{2}$ by deleting $v_{4}$, $v_{2}$ is the next node to be deleted to get a substate. Since $f(v_{2},v_{5})>f(v_{4},v_{5})$, there exists a duplicate state that has been visited before, i.e., the one obtained by successively deleting $v_{2}$ and $v_{4}$ from $\tilde{H}^{1}_{2}$. So, we can prune this substate.

$\bullet$ Case 2. Let us consider a more general case where deleting $v_{x}$ results in other nodes from $\tilde{H}^{i}_{k}$ having a degree less than $k$, i.e., additional nodes would be removed when $v_{x}$ is deleted. In this case, we can prune $\tilde{H}^{j}_{k}$ according to Theorem 4.

Theorem 4

Given a state $\tilde{H}^{i}_{k}\subseteq\tilde{H}^{p}_{k}$ that is obtained by deleting a node $v_{y}\in\tilde{H}^{p}_{k}$ and a substate $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$ obtained by deleting multiple nodes besides $v_{x}$, i.e., $\tilde{H}^{j}_{k}=\tilde{H}^{p}_{k}\setminus\{v_{y},v_{x},\cdots,v_{m}\}$, where $v_{m}$ is the node with the maximal $f(\cdot,q)$ among all deleted nodes. If $f(v_{m},q)>f(v_{y},q)$, then we can prune $\tilde{H}^{j}_{k}$ and its subsequent substates.

Proof:

Similar to Theorem 3, a previously visited state must exist that is caused by successively deleting $v_{m}$ and other nodes from $\tilde{H}^{p}_{k}$, as $f(v_{m},q)>f(v_{y},q)$. It is duplicated with the $\tilde{H}^{j}_{k}$ generated by successively deleting $\{v_{y},v_{x},\cdots,v_{m}\}$ from $\tilde{H}^{p}_{k}$. So, we can prune it when $f(v_{m},q)>f(v_{y},q)$. ∎

Since Case 1 is a special case of Case 2 when $v_{x}=v_{m}$, we only apply Theorem 4 in our implementation of Exact.

Prune for unnecessary states. We say a state is unnecessary to visit if the optimal community is guaranteed to be visited before this state, so that we can directly prune this state.

Theorem 5

Given a state $\tilde{H}^{i}_{k}$ with an attribute distance $\delta(\tilde{H}^{i}_{k})$, the substates that are generated by deleting nodes with $f(\cdot,q)\leq\delta(\tilde{H}^{i}_{k})$ are unnecessary to visit and can be pruned.

Proof:

Suppose we enumerate a new substate of $\tilde{H}^{i}_{k}$ by deleting a node $v_{y}\in\tilde{H}^{i}_{k}$ with $f(v_{y},q)\leq\delta(\tilde{H}^{i}_{k})$. If $v_{y}$'s deletion only causes nodes $v_{x}$ with $f(v_{x},q)\leq\delta(\tilde{H}^{i}_{k})$ to be deleted, then this new substate's attribute distance must be $\geq\delta(\tilde{H}^{i}_{k})$, indicating that it is not the optimal community. This implies that if the optimal community exists, then at least one node $v_{x}$ with $f(v_{x},q)>\delta(\tilde{H}^{i}_{k})$ would be deleted recursively after $v_{y}$ is deleted. According to Lemma 1, a state $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$ generated by successively deleting $v_{x}$ and $v_{y}$ must have been visited previously, as $f(v_{x},q)>\delta(\tilde{H}^{i}_{k})\geq f(v_{y},q)$. So, we can prune $\tilde{H}^{j}_{k}$ based on Theorems 3-4. In summary, substates generated by deleting nodes with $f(\cdot,q)\leq\delta(\tilde{H}^{i}_{k})$ can be pruned. ∎

According to Theorem 5, we only need to enumerate substates of $\tilde{H}^{i}_{k}$ by deleting those nodes with $f(\cdot,q)>\delta(\tilde{H}^{i}_{k})$.

Example 2

In Figure 3, states $\tilde{H}^{4}_{2}$ and $\tilde{H}^{5}_{2}$ are not duplicated with any previously visited states, but they are unnecessary to visit. Note that the attribute distance of $\tilde{H}_{2}$ is $\delta(\tilde{H}_{2})=\frac{0.7+0.6+0.6+0.5+0.3}{5}=0.54$. According to Theorem 5, the optimal community must exclude one node $v_{x}\in\{v_{1},v_{2},v_{3}\}$ ($f(v_{x},v_{5})>0.54$), and it must be visited before $\tilde{H}^{4}_{2}$ and $\tilde{H}^{5}_{2}$, as $f(v_{x},v_{5})>f(v_{4},v_{5}),f(v_{6},v_{5})$ (Lemma 1).
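The candidate filter implied by Theorem 5 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `deletion_candidates`, the dictionary `f_to_q`, and the assignment of distances to individual nodes are assumptions chosen to agree with the numbers quoted in Example 2, and $\delta$ is assumed to be the mean of $f(\cdot,q)$ over the non-query nodes.

```python
def deletion_candidates(nodes, q, f_to_q):
    """Return (delta, candidates) for one state.

    delta is the mean distance to q over the non-query nodes (assumed form).
    candidates are the nodes with f(v, q) > delta, largest distance first;
    by Theorem 5 only these nodes need to be tried for deletion.
    """
    others = [v for v in nodes if v != q]
    delta = sum(f_to_q[v] for v in others) / len(others)
    candidates = sorted(
        (v for v in others if f_to_q[v] > delta),
        key=lambda v: f_to_q[v],
        reverse=True,
    )
    return delta, candidates


# Hypothetical distances f(v, v5), picked to match the values in Example 2.
f_to_q = {"v1": 0.7, "v2": 0.6, "v3": 0.6, "v4": 0.5, "v6": 0.3, "v5": 0.0}
delta, candidates = deletion_candidates(f_to_q.keys(), "v5", f_to_q)
print(round(delta, 2), candidates)
```

Under these assumed values the filter keeps $v_1$, $v_2$, and $v_3$ and skips $v_4$ and $v_6$, which mirrors the reasoning of Example 2.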

Prune for unpromising states. Another case where we can prune is when the lower bound of $\delta(\cdot)$ for all substates of $\tilde{H}^{i}_{k}$ is no smaller than the optimal $\delta^{*}(\cdot)$ found so far. This implies that we cannot find a better community by digging deeper from $\tilde{H}^{i}_{k}$.

(1) Compute the lower bound $\underline{\delta}(\cdot)$. Since a $(k+1)$-clique is the smallest $k$-core, each state (which represents a connected $k$-core) in a search tree must have at least $k+1$ nodes ($q$ and $k$ other nodes). So, we obtain the $k$ nodes with the smallest $f(\cdot,q)$ from the current state $\tilde{H}^{i}_{k}$ (Eq. 3) and use the average $f(\cdot,q)$ of these $k$ nodes as the lower bound of attribute distance $\underline{\delta}(\cdot)$ (Eq. 4).

$$V_{\min}=\{\mathop{\arg\min}_{v\in\tilde{H}^{i}_{k}}f(v,q)\}\quad s.t.\ |V_{\min}|=k \tag{3}$$
$$\underline{\delta}(\cdot)=\frac{\sum_{v\in V_{\min}}f(v,q)}{k} \tag{4}$$

Lemma 2

Given an arbitrary substate $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$, its attribute distance must be lower bounded by $\underline{\delta}(\cdot)$, i.e., $\delta(\tilde{H}^{j}_{k})\geq\underline{\delta}(\cdot)$.

This naturally holds as any $\tilde{H}^{j}_{k}$ containing at least one node with $f(\cdot,q)\geq\max\{f(v,q)\,|\,v\in V_{\min}\}$ would increase $\delta(\tilde{H}^{j}_{k})$.
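The argument behind Lemma 2 can be spelled out as a short chain of inequalities. The sketch below assumes, consistently with Examples 2 and 3, that $\delta(\cdot)$ is the mean of $f(\cdot,q)$ over the non-query nodes of a state. The symbols $S$ and $S_{\min}$ are introduced here for illustration only: $S$ denotes the non-query nodes of $\tilde{H}^{j}_{k}$ (so $|S|\geq k$), and $S_{\min}\subseteq S$ denotes the $k$ nodes of $S$ with the smallest $f(\cdot,q)$.

$$
\delta(\tilde{H}^{j}_{k})=\frac{1}{|S|}\sum_{v\in S}f(v,q)\;\geq\;\frac{1}{k}\sum_{v\in S_{\min}}f(v,q)\;\geq\;\frac{1}{k}\sum_{v\in V_{\min}}f(v,q)=\underline{\delta}(\cdot)
$$

The first inequality holds because every node in $S\setminus S_{\min}$ has a distance no smaller than any node in $S_{\min}$, so adding those nodes cannot lower the mean. The second holds because $S$ is a subset of the nodes of $\tilde{H}^{i}_{k}$, so the $k$ smallest distances in $S$ are no smaller than the $k$ smallest distances in $\tilde{H}^{i}_{k}$.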

(2) Prune based on $\underline{\delta}(\cdot)$. For each visited state $\tilde{H}^{i}_{k}$, we record its $\delta(\tilde{H}^{i}_{k})$ to update the optimal $\delta^{*}(\cdot)$ so far, compute the lower bound $\underline{\delta}(\cdot)$, and decide whether to prune according to Theorem 6.

Theorem 6

Given a state $\tilde{H}^{i}_{k}$, if $\underline{\delta}(\cdot)\geq\delta^{*}(\cdot)$, then it is unpromising to find a better state $\tilde{H}^{j}_{k}\subseteq\tilde{H}^{i}_{k}$ with a smaller $\delta(\tilde{H}^{j}_{k})$ than the optimal $\delta^{*}(\cdot)$, and $\tilde{H}^{i}_{k}$ can be safely pruned.

Proof:

Given $\underline{\delta}(\cdot)\geq\delta^{*}(\cdot)$ and Lemma 2, we have $\delta(\tilde{H}^{j}_{k})\geq\underline{\delta}(\cdot)\geq\delta^{*}(\cdot)$. So, we can prune $\tilde{H}^{i}_{k}$ safely. ∎

Example 3

In Figure 3, suppose we are now at state $\tilde{H}^{3}_{2}$ and the current optimal community is formed by nodes $\{v_{3},v_{5},v_{6}\}$ with $\delta^{*}(\cdot)=0.45$. The lower bound $\underline{\delta}(\cdot)$ for all substates of $\tilde{H}^{3}_{2}$ is $\frac{0.6+0.5}{2}=0.55>\delta^{*}(\cdot)$, so we can prune $\tilde{H}^{3}_{2}$ safely.
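The lower bound of Eq. 3-4 and the test of Theorem 6 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the helper names `lower_bound` and `is_unpromising` are hypothetical, the input is assumed to be the list of $f(\cdot,q)$ values of the non-query nodes in the current state, and the four values below are invented so that the two smallest match the $0.6$ and $0.5$ used in Example 3.

```python
import heapq


def lower_bound(f_values, k):
    """Eq. 3-4: mean of the k smallest f(., q) values in the current state."""
    return sum(heapq.nsmallest(k, f_values)) / k


def is_unpromising(f_values, k, best_delta):
    """Theorem 6: prune when no substate can beat the best delta found so far."""
    return lower_bound(f_values, k) >= best_delta


# Hypothetical f(., q) values for the non-query nodes of the state in Example 3.
state_f = [0.7, 0.6, 0.6, 0.5]
print(round(lower_bound(state_f, k=2), 2))
print(is_unpromising(state_f, k=2, best_delta=0.45))
```

Using `heapq.nsmallest` avoids sorting the whole state when only the $k$ smallest distances are needed.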

Input: The maximal connected $k$-core $\tilde{H}_{k}$, a query node $q$
Output: The connected $k$-core $\tilde{H}^{*}_{k}$ with the smallest $\delta(\cdot)$
1  $\tilde{H}^{*}_{k}\leftarrow\tilde{H}_{k}$, $\delta^{*}(\cdot)\leftarrow\delta(\tilde{H}_{k})$;
2  $u\leftarrow\mathsf{NULL}$;   /* Previously deleted node */
3  $f(u,q)\leftarrow+\infty$;
4  Enumerate($\tilde{H}_{k}$, $q$, $u$);
5  return $\tilde{H}^{*}_{k}$;
Procedure Enumerate($\tilde{H}_{k}$, $q$, $u$)
   // Prune for unpromising states
6  $\underline{\delta}(\cdot)\leftarrow$ lower bound for substates $\subseteq\tilde{H}_{k}$;   /* Eq. 3-4 */
7  if $\underline{\delta}(\cdot)\geq\delta^{*}(\cdot)$ then
8      return;   /* Theorem 6 */
9  else
       // Prune for unnecessary states
10     $D_{>\delta}\leftarrow$ nodes of $f(\cdot,q)>\delta(\tilde{H}_{k})$;   /* Theorem 5 */
11     while $D_{>\delta}\neq\emptyset$ do
12         $v\leftarrow D_{>\delta}$.pop_max();   /* node to delete */
13         $\langle\tilde{H}^{i}_{k},v_{m}\rangle\leftarrow$ maintain a $k$-core for $\tilde{H}_{k}\setminus v$;
           // Prune for duplicated states
14         if $f(v_{m},q)>f(u,q)$ then
15             continue;   /* Theorem 4 */
16         else
17             if $\delta(\tilde{H}^{i}_{k})<\delta^{*}(\cdot)$ then
18                 $\tilde{H}^{*}_{k}\leftarrow\tilde{H}^{i}_{k}$, $\delta^{*}(\cdot)\leftarrow\delta(\tilde{H}^{i}_{k})$;
               /* Update previously deleted node */
19             $u\leftarrow v$;
20             Enumerate($\tilde{H}^{i}_{k}$, $q$, $u$);
21     return;
Algorithm 1 Enumeration with three prunings

Combine three pruning strategies together. Algorithm 1 shows the whole procedure of enumeration with three pruning strategies. We study the effect of prunings on Exact's efficiency in §VII-C. Given the maximal connected $k$-core $\tilde{H}_{k}$ and a query node $q$, we first initialize the optimal community $\tilde{H}^{*}_{k}$ and attribute distance $\delta^{*}(\cdot)$ as $\tilde{H}_{k}$ and $\delta(\tilde{H}_{k})$ (line 1). Since $\tilde{H}_{k}$ is the root of enumeration, it does not have a previously deleted node $u$ ($u$ is used to indicate that the current $\tilde{H}_{k}$ is obtained from its parent state by deleting $u$). Thus, we set $f(u,q)$ to positive infinity (lines 2-3) and use it in the prune for duplicated states. We next enumerate from $\tilde{H}_{k}$ as follows: (1) We compute the lower bound of attribute distance $\underline{\delta}(\cdot)$ for all substates of $\tilde{H}_{k}$ via Eq. 3-4, and then prune all unpromising substates if $\underline{\delta}(\cdot)\geq\delta^{*}(\cdot)$ (lines 6-8: Theorem 6). (2) For each $\tilde{H}_{k}$ that was not pruned before, we record all nodes with $f(\cdot,q)>\delta(\tilde{H}_{k})$ from $\tilde{H}_{k}$ in a max-heap $D_{>\delta}$, and prune all unnecessary substates by deleting only nodes from $D_{>\delta}$ (lines 10-11: Theorem 5). (3) Given a $D_{>\delta}\neq\emptyset$, we pop node $v$ with the largest $f(\cdot,q)$ and maintain a new connected $k$-core $\tilde{H}^{i}_{k}$ by deleting $v$ from $\tilde{H}_{k}$. During the $k$-core maintenance, we also record node $v_{m}$ with the largest $f(\cdot,q)$ that is recursively deleted after $v$ is removed (lines 12-13). If $f(v_{m},q)>f(u,q)$, then we can prune this duplicated state according to Theorem 4 (lines 14-15). Otherwise, we update the optimal $\tilde{H}^{*}_{k}$ and $\delta^{*}(\cdot)$ if necessary (lines 17-18), update the previously deleted node $u$ as the current deleted node $v$ (line 19), and keep enumerating from this new $\tilde{H}^{i}_{k}$ (line 20). Finally, we return the optimal community $\tilde{H}^{*}_{k}$ (line 5).
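The enumeration described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: all function and variable names are hypothetical, the graph is an adjacency dictionary of sets, $\delta$ is assumed to be the mean of $f(\cdot,q)$ over the non-query nodes, $v_m$ is taken as the largest distance among all nodes removed in one step (the deleted node, the peeled nodes, and nodes disconnected from $q$), and the duplicate test compares against the node deleted to reach the current state, following the statement of Theorem 4. The toy graph and distances are invented for illustration.

```python
import heapq
from math import inf


def delta(state, q, f):
    """Mean f(v, q) over the non-query nodes (assumed form of the distance)."""
    others = [f[v] for v in state if v != q]
    return sum(others) / len(others)


def maintain_k_core(state, adj, f, q, k, v):
    """Delete v, peel nodes whose degree drops below k, keep q's component.

    Returns (new_state, largest f among removed nodes), or (None, None)
    when q itself cannot remain in a k-core.
    """
    alive = set(state) - {v}
    deg = {x: len(adj[x] & alive) for x in alive}
    stack = [x for x in alive if deg[x] < k]
    while stack:
        x = stack.pop()
        if x not in alive:
            continue
        alive.remove(x)
        for y in adj[x] & alive:
            deg[y] -= 1
            if deg[y] < k:
                stack.append(y)
    if q not in alive:
        return None, None
    comp, todo = {q}, [q]
    while todo:  # keep only the connected component that contains q
        x = todo.pop()
        for y in adj[x] & alive:
            if y not in comp:
                comp.add(y)
                todo.append(y)
    removed = set(state) - comp
    return frozenset(comp), max(f[x] for x in removed)


def exact_search(root, adj, f, q, k):
    """Enumerate substates of the maximal connected k-core `root` with pruning."""
    best = {"state": frozenset(root), "delta": delta(root, q, f)}

    def enumerate_states(state, f_u):
        # Unpromising states: the k smallest distances bound every substate.
        bound = sum(heapq.nsmallest(k, (f[x] for x in state if x != q))) / k
        if bound >= best["delta"]:
            return
        # Unnecessary states: only delete nodes farther than the average.
        d = delta(state, q, f)
        cands = sorted((x for x in state if x != q and f[x] > d),
                       key=lambda x: f[x], reverse=True)
        for v in cands:
            child, f_vm = maintain_k_core(state, adj, f, q, k, v)
            # Duplicated states: a larger removed distance was handled earlier.
            if child is None or f_vm > f_u:
                continue
            child_delta = delta(child, q, f)
            if child_delta < best["delta"]:
                best["state"], best["delta"] = child, child_delta
            enumerate_states(child, f[v])

    enumerate_states(frozenset(root), inf)
    return best["state"], best["delta"]


# Toy 2-core (hypothetical): triangle q-a-b plus two distant nodes c and d.
adj = {"q": {"a", "b", "c", "d"}, "a": {"q", "b", "c"}, "b": {"q", "a", "d"},
       "c": {"q", "a"}, "d": {"q", "b"}}
f = {"q": 0.0, "a": 0.1, "b": 0.2, "c": 0.8, "d": 0.9}
community, dist = exact_search(adj.keys(), adj, f, q="q", k=2)
print(sorted(community), round(dist, 2))
```

On this toy input the search first drops `d`, then `c`, and stops at the triangle `{q, a, b}`, after which the lower-bound test cuts off the remaining branch.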

IV-C Complexity Analysis

For the exact baseline without pruning strategies, the worst-case time complexity is $O(|E_G| + |V_{\tilde{H}_k}|! \cdot |E_{\tilde{H}_k}|)$. The first term $|E_G|$ is the time for finding the maximal connected $k$-core $\tilde{H}_k$ from the entire graph $G$ via core decomposition [39]. The second term $|V_{\tilde{H}_k}|! \cdot |E_{\tilde{H}_k}|$ is the time complexity of enumeration over $\tilde{H}_k$, where $|V_{\tilde{H}_k}|!$ is the total number of states in the search tree and $|E_{\tilde{H}_k}|$ is the time for core maintenance on each state.
Since the first pruning strategy prunes the duplicated states, the second term can be reduced to $2^{|V_{\tilde{H}_k}|} \cdot |E_{\tilde{H}_k}|$. The last two pruning strategies further reduce the number of states, so we introduce a constant $C$ ($\in [2, 200]$ in practice) and express the time as $2^{|V_{\tilde{H}_k}|}/C \cdot |E_{\tilde{H}_k}|$.

Example 4

In DBLP, considering community search with $k=16$, the average number of states over 200 random queries when using only the first pruning strategy is nearly $7.82 \times 10^{10}$. After applying the last two pruning strategies, the number of states is reduced to $4.17 \times 10^{8}$. In this case, we have $C = 187.3$ and the running time is reduced from 694882s to 3964s.
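As a quick sanity check on Example 4, the ratio of the two reported state counts can be recomputed from the rounded figures. The variable names below are illustrative; the inputs are the numbers quoted in the example, so the ratio only matches the reported $C$ up to rounding.

```python
# Figures quoted in Example 4 (rounded as reported).
states_first_pruning = 7.82e10  # average # states, first pruning strategy only
states_all_pruning = 4.17e8     # average # states, all three pruning strategies
time_before_s, time_after_s = 694882, 3964

state_ratio = states_first_pruning / states_all_pruning
speedup = time_before_s / time_after_s
print(f"state ratio ~ {state_ratio:.1f}, running-time speedup ~ {speedup:.1f}x")
```

The state ratio comes out near 187.5, in line with the reported $C = 187.3$, and the running time shrinks by a similar factor.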

V Sampling-Estimation Solution

We next present a sampling-estimation-based approximate method that improves Exact’s efficiency in two ways: reducing the size of the maximal $\tilde{H}_k$ through sampling, and terminating the enumeration early once a reliable accuracy (in the form of a confidence interval) is obtained through estimation.

Figure 4 shows the pipeline with three steps: (1) Sampling-based maximal $\tilde{H}_k$ finding (§V-A). We determine the minimum population for sampling through Hoeffding Inequality [40], collect a set of samples (nodes) $S$ via an attribute-aware sampling, and take the maximal $\tilde{H}_k$ from the induced graph of $S$ as the input for estimation. (2) Estimation with accuracy guarantee (§V-B). We estimate the $\delta(\cdot)$ of each candidate community through Bag of Little Bootstrap [41] and terminate early when an accurate enough result is obtained. Otherwise, we iteratively find and estimate another candidate by greedy search. (3) Error-based incremental sampling (§V-C). If we cannot find a satisfactory community, then we enlarge $S$ via an error-based incremental sampling and repeat steps 1-2.

Figure 4: The pipeline of our sampling-estimation method

V-A Sampling-based Maximal $\tilde{H}_k$ Finding

A straightforward idea is to collect a set of nodes (as samples $S$) that are similar to the query node $q$ from the entire graph $G$. Given a node $v \in V_G$, we can compute $v$’s sampling probability $P_s(v)$ as a normalized value (Eq. 5). The smaller the attribute distance $f(v,q)$, the greater the $P_s(v)$.

$$P_s(v) = \frac{1 - f(v,q)}{\sum_{u \in V_G} \left(1 - f(u,q)\right)} \tag{5}$$
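Eq. 5 can be illustrated with a few lines of Python. This is a minimal sketch: the function name and the dict `dist_to_q` are hypothetical, and it assumes every distance $f(\cdot,q)$ lies in $[0,1]$, which the form of Eq. 5 appears to require.

```python
def sampling_probabilities(dist_to_q):
    """Normalise 1 - f(v, q) over all nodes, as in Eq. 5.

    dist_to_q: dict node -> f(node, q), assumed to lie in [0, 1].
    Returns a dict node -> P_s(node) whose values sum to 1.
    """
    total = sum(1.0 - d for d in dist_to_q.values())
    if total <= 0:
        raise ValueError("every node is at distance 1 from q; Eq. 5 is undefined")
    return {v: (1.0 - d) / total for v, d in dist_to_q.items()}


p_s = sampling_probabilities({"a": 0.2, "b": 0.5, "c": 0.8})
# The node closest to q ("a") receives the largest sampling probability.
```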

The biggest issue is that the sampling population (i.e., $G$) is too large to guarantee the sample quality. According to small world theory [42, 43], two nodes in the same cohesive community exhibit strong access locality [44]. This implies that nodes from the neighborhood of $q$ are more likely to belong to the community of $q$. With this in mind, we prefer to collect samples from $q$’s neighborhood (denoted by $G_q \subseteq G$) instead of the entire $G$. The size of $G_q$ (i.e., # nodes in $G_q$) is important: a $G_q$ that is too large may contain more irrelevant nodes w.r.t. $q$, while one that is too small may contain fewer relevant nodes, and either case affects the sample quality [45, 27, 28]. A critical problem is how to set an appropriate size of $G_q$ in order to bound the sample quality. A simple option is to define $G_q$ as the $h$-hop neighborhood of $q$. However, this is problematic, as an empirically chosen $h$ cannot adapt to the various values of $k$ of the $k$-core over different datasets. Hence, we resort to Hoeffding Inequality to determine the minimum size of $G_q$.

Minimum size of $G_q$. Given a graph $G$ and a query node $q$, we first introduce the existence probability of a node $u \in V_G$ that belongs to $q$’s neighborhood $G_q$, denoted by $P(u)$ (discussed later), which is computed based on the node’s sampling probability $P_s(\cdot)$. The larger the $P(u)$, the higher the probability that $u$ is included in a specific $G_q$. Suppose the ground-truth community of CS-AG is $\tilde{H}_k$. Let $v \in V_{\tilde{H}_k}$ be the node with the smallest existence probability $P(v)$. We expect to find the minimum size of $G_q$ such that every node $u \in V_{\tilde{H}_k}$ with $P(u) \geq P(v)$ would be included in $G_q$ with a large probability.
In this way, sampling from $G_q$ can be viewed as a good approximation to sampling from $G$, because $G_q$ includes sufficient relevant nodes w.r.t. $q$ (i.e., nodes from $\tilde{H}_k$ with $P(u) \geq P(v)$) and the fewest irrelevant nodes w.r.t. $q$ (i.e., minimizing $G_q$). In the following, we leverage possible world semantics [46, 6, 47] to compute a node’s existence probability, then we apply Hoeffding Inequality to the existence probability to find the minimum size of $G_q$.

Possible worlds w.r.t. $G_q$. Given a graph $G$ and a query node $q$, we can easily obtain various neighborhoods of $q$ (e.g., different $G_q$ with different sizes) by collecting nodes based on their normalized sampling probabilities $P_s(v)$ (Eq. 5). According to possible world semantics, each $G_q$ can be seen as a possible world. Thus, we can measure an edge’s existence probability $P(e)$ in a possible world as Eq. 6, which implies that an edge $e_{uv}$ exists in a $G_q$ only if its two end nodes $u$ and $v$ are sampled simultaneously.

$$P(e_{uv}) = P_s(u) \times P_s(v) \tag{6}$$

Given the edge’s existence probability above, we can measure a specific $G_q$’s existence probability as follows.

$$P(G_q) = \prod_{e \in G_q} P(e) \prod_{e \in G \setminus G_q} \left(1 - P(e)\right) \tag{7}$$

Given the aforementioned analysis, the probability that a node $v$ belongs to $q$’s neighborhood equals the aggregate probability over all possible worlds [45] (Eq. 8). $\mathcal{G}_q$ represents all possible worlds, and $I_{G_q}(v)$ is an indicator function denoting whether $v$ belongs to $G_q$ ($I_{G_q}(v) = 1$) or not.

$$P(v) = 1 - \prod_{u \in N(v)} \left(1 - P(e_{uv})\right) = \sum_{G_q \in \mathcal{G}_q} P(G_q) \times I_{G_q}(v) \tag{8}$$
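The two forms of Eq. 8 can be compared on a toy graph. The sketch below is illustrative only: the function names and the toy probabilities are hypothetical, and it assumes edges exist independently with the probability of Eq. 6, as the product form of Eq. 7 suggests. The brute-force version enumerates every edge subset as a possible world, so it is only usable on very small graphs.

```python
from itertools import product


def node_existence_closed_form(edges, p_s, v):
    """Left-hand form of Eq. 8: 1 - prod over incident edges of (1 - P(e_uv))."""
    prob_no_incident_edge = 1.0
    for a, b in edges:
        if v in (a, b):
            prob_no_incident_edge *= 1.0 - p_s[a] * p_s[b]  # Eq. 6
    return 1.0 - prob_no_incident_edge


def node_existence_brute_force(edges, p_s, v):
    """Right-hand form of Eq. 8: sum of P(G_q) * I_{G_q}(v) over all worlds."""
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        p_world = 1.0          # Eq. 7 for this edge subset
        contains_v = False
        for present, (a, b) in zip(mask, edges):
            p_e = p_s[a] * p_s[b]
            p_world *= p_e if present else 1.0 - p_e
            contains_v = contains_v or (present and v in (a, b))
        if contains_v:
            total += p_world
    return total


toy_edges = [("q", "a"), ("q", "b"), ("a", "b")]
toy_p_s = {"q": 0.5, "a": 0.3, "b": 0.2}
```

On this toy input both functions return the same value for each node, which is the equality stated in Eq. 8 under the independence assumption.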

Hoeffding Inequality. We next leverage $P(v)$ to determine the minimum size of $G_q$, ensuring that all nodes of the ground-truth community are included in $G_q$ with a large probability.

Theorem 7

(Hoeffding Inequality [40] w.r.t. $P(v)$) Given a set of possible worlds $\{G^{1}_q, \dots, G^{t}_q\}$ and an estimation error $\epsilon > 0$, let $\hat{P}(v)$ be the estimation of $P(v)$, where $\hat{P}(v) = \sum_{i=1}^{t} P(G^{i}_q) \times I_{G^{i}_q}(v)$ and $\hat{P}(v) \in [a,b]$ ($a = 0$, $b = 1$). Then, we have the following inequality.

$${\rm Pr}[\hat{P}(v) - P(v) \geq \epsilon] \leq \exp\left(-\frac{2t^{2}\epsilon^{2}}{\sum_{i=1}^{t}(a-b)^{2}}\right) \tag{9}$$

According to Hoeffding Inequality, this theorem provides an upper bound on the probability that $\hat{P}(v)$ has a large estimation error w.r.t. $P(v)$. The larger the $t$ (# possible worlds), the smaller the upper bound, showing that $\hat{P}(v) - P(v) < \epsilon$ holds with a larger probability. Based on Theorem 7, we have the following theorem.

Theorem 8

Given a set of possible worlds $\{G^{1}_q, \dots, G^{t}_q\}$, $\epsilon > 0$, and $u, v \in V_G$, if $P(u) - P(v) \geq \epsilon$, then we have

$${\rm Pr}[\hat{P}(v) - \hat{P}(u) > 0] \leq \exp(-t\epsilon^{2}/2). \tag{10}$$
Proof:

We consider $\hat{P}(v) - \hat{P}(u)$ as the estimator of $P(v) - P(u)$ with $\hat{P}(v) - \hat{P}(u) \in [-1, 1]$. Since $P(u) - P(v) \geq \epsilon$, the event $\hat{P}(v) - \hat{P}(u) > 0$ implies $\hat{P}(v) - \hat{P}(u) - (P(v) - P(u)) \geq \epsilon$. Substituting $\hat{P}(v) - \hat{P}(u)$ into Eq. 9 then gives the following derivation.

$$\begin{aligned}
{\rm Pr}[\hat{P}(v) - \hat{P}(u) > 0] &\leq {\rm Pr}[\hat{P}(v) - \hat{P}(u) - (P(v) - P(u)) \geq \epsilon] \\
&\leq \exp\left(-\frac{2t^{2}\epsilon^{2}}{\sum_{i=1}^{t}(1-(-1))^{2}}\right) \\
&= \exp(-t\epsilon^{2}/2)
\end{aligned}$$
∎

Theorem 8 bounds the probability that the order of a pair of nodes is preserved. That is, if we have $P(v) < P(u)$ for two nodes, then their estimated existence probabilities would satisfy $\hat{P}(v) \leq \hat{P}(u)$ with a probability $\geq 1 - \exp(-t\epsilon^{2}/2)$. More precisely, given the assumption of $P(v) \leq P(u) - \epsilon < P(u)$, the probability of $\hat{P}(v) > \hat{P}(u)$ is upper bounded by $\exp(-t\epsilon^{2}/2)$. The smaller the upper bound, the higher the probability of $\hat{P}(v) \leq \hat{P}(u)$, showing that $u$ is more likely to be included in $G_q = \bigcup_{i=1}^{t} G^{i}_q$ than $v$, given $P(v) < P(u)$.

Given the ground-truth community $\tilde{H}_k$ of CS-AG, let $v \in V_{\tilde{H}_k}$ be the node with the smallest existence probability $P(v)$ and suppose there are $m$ nodes in $\tilde{H}_k$ with existence probabilities $\geq P(v)$. Then, we can use Theorem 8 to derive the minimum number of possible worlds (i.e., $t$) w.r.t. $G_q$ that ensures all $m$ nodes can be contained in $G_q$ with at least $1-\beta$ probability, as Theorem 9 shows. Specifically, Theorem 9 bounds the order of $m(n-m)$ pairs of nodes by applying Union Bound and Theorem 8, where $n = |V_G|$ is the number of nodes in $G$.

Theorem 9

Given a desired probability of $1-\beta$ and $\epsilon > 0$, it requires $t \geq \frac{2}{\epsilon^{2}} \ln \frac{m(n-m)}{\beta}$ possible worlds to ensure that $G_q$ contains all $m$ nodes (from $\tilde{H}_k$) with existence probability $\geq P(v)$, where $v$ is the node with the smallest existence probability in the ground-truth $\tilde{H}_k$.

Proof:

Since we expect all $m$ nodes with $P(\cdot) \geq P(v)$ to be contained in $G_q$ with at least $1-\beta$ probability, we need to ensure that every node $u$ among these $m$ nodes has a larger $\hat{P}(u)$ than each of the other $n-m$ nodes, i.e., we need to bound the order of $m(n-m)$ pairs of nodes. We have the following derivation through Union Bound and Theorem 8.

$$\begin{aligned}
1-\beta &\leq 1 - m(n-m)\exp(-t\epsilon^{2}/2) \\
\Rightarrow \beta &\geq m(n-m)\exp(-t\epsilon^{2}/2) \Rightarrow t \geq \frac{2}{\epsilon^{2}} \ln \frac{m(n-m)}{\beta}
\end{aligned}$$
∎

We next show how to compute the minimum size of $G_q$ (Theorem 10) based on the minimum $t$ given in Theorem 9.

Theorem 10

Given the ground-truth community $\tilde{H}_k$ of CS-AG, in the worst case, we require at least $\frac{2}{\epsilon^{2}} \ln \frac{(k+1)(n-k-1)}{\beta} + 1$ nodes from the original $G$ to form $G_q$, so that all nodes in $\tilde{H}_k$ would be contained in $G_q$ with a probability of $1-\beta$.

Proof:

A $k$-core has at least $k+1$ nodes, which means that at least $k+1$ nodes should be contained in $G_q$. Thus, we need to compare at least $(k+1)(n-k-1)$ pairs of nodes (i.e., $m = k+1$), which indicates that we need at least $\frac{2}{\epsilon^{2}} \ln \frac{(k+1)(n-k-1)}{\beta}$ possible worlds w.r.t. $G_q$ (Theorem 9). In the worst case, each possible world w.r.t. $G_q$ can be an individual edge between $q$ and another node. Thus, we require at least $\frac{2}{\epsilon^{2}} \ln \frac{(k+1)(n-k-1)}{\beta} + 1$ nodes to form the final $G_q$. ∎

Example 5

Given the DBLP with $682819$ nodes, $k = 30$, $\epsilon = 0.05$, and $1-\beta = 98\%$, it requires at least $\frac{2}{0.05^{2}} \ln \frac{(31)(682819-31)}{0.02} + 1 \approx 16625$ nodes to form a $G_q$.
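The bounds of Theorems 9 and 10 are easy to evaluate numerically. The sketch below plugs in the parameters of Example 5; the function names are hypothetical, and rounding the node count up to an integer is an assumption made here for illustration.

```python
import math


def min_possible_worlds(m, n, eps, beta):
    """Lower bound on t from Theorem 9: (2 / eps^2) * ln(m (n - m) / beta)."""
    return 2.0 / eps**2 * math.log(m * (n - m) / beta)


def min_neighborhood_size(k, n, eps, beta):
    """Worst-case node count from Theorem 10, with m = k + 1, rounded up."""
    return math.ceil(min_possible_worlds(k + 1, n, eps, beta) + 1)


# Parameters of Example 5 (DBLP): n = 682819, k = 30, eps = 0.05, beta = 0.02.
example5_size = min_neighborhood_size(k=30, n=682819, eps=0.05, beta=0.02)
print(example5_size)  # 16625
```

Because the bound grows only logarithmically in $n$ and $k$ but quadratically in $1/\epsilon$, the choice of $\epsilon$ dominates the resulting size of $G_q$.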

We next conduct a BFS starting from the query node $q$ to form $G_q$. In this BFS, we preferentially expand the search from those nodes having smaller composite attribute distances to $q$, until the minimum size of $G_q$ is reached. In §VII-G, we investigate the effect of $\epsilon$ and $1-\beta$ on CS’s performance.
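One plausible reading of this attribute-prioritized BFS is a best-first expansion driven by a min-heap on $f(\cdot,q)$. The sketch below follows that reading; the function name, the adjacency-dict input, and the tie-breaking by node identifier are assumptions made for illustration rather than details given in the paper.

```python
import heapq


def build_neighborhood(adj, f, q, min_size):
    """Collect nodes around q, always expanding the frontier node closest to q.

    adj: dict node -> set of neighbours.
    f:   dict node -> composite attribute distance f(node, q).
    Returns the visited nodes in visiting order; stops once min_size nodes
    are collected or the connected component of q is exhausted.
    """
    order = [q]
    queued = {q} | set(adj[q])
    frontier = [(f[u], u) for u in adj[q]]  # ties are broken by node id
    heapq.heapify(frontier)
    while frontier and len(order) < min_size:
        _, u = heapq.heappop(frontier)
        order.append(u)
        for w in adj[u]:
            if w not in queued:
                queued.add(w)
                heapq.heappush(frontier, (f[w], w))
    return order
```

Here `min_size` would be the bound of Theorem 10, and $G_q$ is the subgraph induced by the returned nodes.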

Attribute-aware sampling over $G_q$. We perform attribute-aware sampling over the population $G_q$ as follows. We compute the sampling probabilities $P_s(\cdot)$ of all nodes in $G_q$ based on their composite attribute distances to $q$ by replacing $G$ with $G_q$ in Eq. 5. We randomly collect $|S|$ samples (i.e., nodes) from $G_q$ according to their $P_s(\cdot)$. We initialize the sample size as a fraction $\lambda$ of nodes in $G_q$, i.e., $|S| = \lambda \cdot |V_{G_q}|$, and update $|S|$ with an appropriate $|\Delta S|$ if necessary (discussed in §V-C). In §VII-G, we show the parameter sensitivity of $\lambda$.
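A minimal sketch of this sampling step is given below. It assumes sampling without replacement (the text does not say which variant is used), draws nodes one at a time in proportion to $1 - f(\cdot,q)$, and uses hypothetical names throughout. It also assumes enough nodes have $f(\cdot,q) < 1$, since `random.choices` rejects weights that sum to zero.

```python
import random


def attribute_aware_sample(dist_to_q, lam, rng=None):
    """Draw round(lam * |V_Gq|) distinct nodes, favouring small f(., q).

    dist_to_q: dict node -> f(node, q) for the nodes of G_q, values in [0, 1).
    lam:       sampling fraction lambda in (0, 1].
    rng:       optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    nodes = list(dist_to_q)
    weights = [1.0 - dist_to_q[v] for v in nodes]  # unnormalised Eq. 5 weights
    size = min(len(nodes), max(1, round(lam * len(nodes))))
    sample = set()
    while len(sample) < size:
        # Pick one remaining node in proportion to its weight, then remove it.
        i = rng.choices(range(len(nodes)), weights=weights, k=1)[0]
        sample.add(nodes.pop(i))
        weights.pop(i)
    return sample
```

Normalising the weights is unnecessary here because `random.choices` only uses their relative magnitudes.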

Find the maximal connected $k$-core. We maintain a maximal connected $k$-core from the induced graph $G_q[S]$ of samples $S$ and take it as the input of the estimation step (§V-B) to find an approximate community of $q$ for the Approx-CS-AG problem.
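The standard way to obtain such a $k$-core is to peel nodes of degree below $k$ and then keep the connected component containing $q$. The sketch below follows that textbook procedure; the function name and the adjacency-dict representation are hypothetical, and the paper may implement this step differently.

```python
from collections import deque


def maximal_connected_k_core(adj, sample, q, k):
    """Node set of the maximal connected k-core of G_q[S] that contains q.

    adj:    dict node -> set of neighbours in G_q.
    sample: iterable of sampled nodes S.
    Returns an empty set if q is not sampled or does not survive the peeling.
    """
    nodes = set(sample)
    if q not in nodes:
        return set()
    sub = {u: adj[u] & nodes for u in nodes}  # induced subgraph G_q[S]
    # Peel: repeatedly delete nodes whose degree is below k.
    queue = deque(u for u in sub if len(sub[u]) < k)
    removed = set()
    while queue:
        u = queue.popleft()
        if u in removed:
            continue
        removed.add(u)
        for w in sub[u]:
            sub[w].discard(u)
            if len(sub[w]) < k and w not in removed:
                queue.append(w)
    if q in removed:
        return set()
    # Keep only the surviving nodes reachable from q.
    component = {q}
    queue = deque([q])
    while queue:
        u = queue.popleft()
        for w in sub[u]:
            if w not in component:
                component.add(w)
                queue.append(w)
    return component
```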

V-B Estimation with Accuracy Guarantee

Given the maximal $\tilde{H}_k \subseteq G_q[S]$, a user-input error bound $e$, and a confidence level $1-\alpha$, the Approx-CS-AG problem aims to find an approximate community $\tilde{H}^{\star}_k$ with an attribute distance $\delta^{\star}$ satisfying $|\delta^{\star} - \delta|/\delta \leq e$ with a probability of $1-\alpha$ (Eq. 1-2). (1) We provide a confidence interval CI $= \delta^{\star} \pm \varepsilon$ at the $1-\alpha$ level to quantify the quality of $\delta^{\star}$ based on the Central Limit Theorem (CLT) [27, 28], and apply Bag of Little Bootstrap (BLB) to compute the Margin of Error (MoE) $\varepsilon$ of the CI. (2) We return $\tilde{H}^{\star}_k$ when a tight CI (i.e., $\varepsilon \leq \frac{\delta^{\star} \cdot e}{1+e}$, Theorem 11) is obtained.
(3) Otherwise, we greedily remove the most dissimilar node from $\tilde{H}^{\star}_k$ to get a new candidate $\tilde{H}^{i}_k \subseteq \tilde{H}^{\star}_k$ and repeat the above. If we cannot find a good $\tilde{H}^{\star}_k$, we repeat steps (1)-(3) with enlarged samples $S = S \cup \Delta S$ (§V-C).

Confidence interval calculation. Recall that the attribute distance $\delta^{\star}$ is the average composite attribute distance $f(\cdot,q)$ of all nodes in $V_{\tilde{H}^{\star}_k} \setminus q$, where $f(v,q)$ of $\forall v \in V_{\tilde{H}^{\star}_k}$ is considered as a random variable with the sampling probability $P_s(v)$. From the CLT, we know that a mean-like point estimator approximately follows a normal distribution [48]. So, we have $\delta^{\star} \sim N(\mu_{\delta^{\star}}, \sigma^2_{\delta^{\star}})$, and the MoE $\varepsilon$ of CI $= \delta^{\star} \pm \varepsilon$ at the $1-\alpha$ level can be calculated based on the CLT as $\varepsilon = z_{\alpha/2}\cdot\sigma_{\delta^{\star}}$, where $z_{\alpha/2}$ is the normal critical value with right-tail probability $\alpha/2$ (obtained from a standard normal table). We next use BLB to estimate $\sigma_{\delta^{\star}}$.
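As a minimal sketch of the CLT-based interval described above, the following Python snippet computes $z_{\alpha/2}$ with the standard library instead of a normal table and forms CI $= \delta^{\star} \pm \varepsilon$. The function names `margin_of_error` and `confidence_interval` are illustrative and not taken from the paper, and `sigma_hat` stands for whatever estimate of $\sigma_{\delta^{\star}}$ is available.

```python
from statistics import NormalDist


def margin_of_error(sigma_hat: float, alpha: float) -> float:
    """MoE = z_{alpha/2} * sigma for a two-sided CI at level 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    # Critical value with right-tail probability alpha / 2.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * sigma_hat


def confidence_interval(delta_star: float, sigma_hat: float, alpha: float):
    """Return (lower, upper) of CI = delta_star +/- MoE."""
    eps = margin_of_error(sigma_hat, alpha)
    return delta_star - eps, delta_star + eps
```

For $\alpha = 0.05$ the critical value is about 1.96, so the interval half-width is roughly 1.96 times the estimated standard deviation.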

Bag of little bootstraps. Bootstrap [49] provides an automatic and widely applicable means of quantifying estimator quality [50]. Although it is simple and powerful, it requires computing the estimator on resamples whose size is comparable to that of the original data. If the original data are large, bootstrap is costly. Thus, we resort to BLB [50], which incorporates features of both bootstrap and subsampling, for high-quality estimation with a much smaller sample. Given the approximate community $\tilde{H}^{\star}_k$, we perform BLB as follows: (1) We collect $s$ small subsamples $\{S_1,\dots,S_s\}$ from $V_{\tilde{H}^{\star}_k}$; each $S_i$ has a size of $|V_{\tilde{H}^{\star}_k}|^m$, where $m \in [0.5,1)$ is the scale factor used in [50] to ensure $s\cdot|V_{\tilde{H}^{\star}_k}|^m \leq |V_{\tilde{H}^{\star}_k}|$. We use $S_{\rm blb} = \bigcup S_i$ to denote all subsamples for BLB estimation. (2) For each $S_i$, BLB estimates $\sigma_{\delta^{\star}}$ by a standard bootstrap (given below) and computes an MoE $\varepsilon_i = z_{\alpha/2}\cdot\sigma_{\delta^{\star}}$. (3) Given the $s$ MoEs $\{\varepsilon_1,\dots,\varepsilon_s\}$, BLB computes the final $\varepsilon = \sum \varepsilon_i / s$.

Bootstrap. Given a subsample $S_i$, a standard bootstrap first collects $r$ resamples of size $|S_i|$ from $S_i$ with replacement. Then, it computes $\delta^{\star}$ for each resample as $\{\delta^{\star}_1,\dots,\delta^{\star}_r\}$. Next, it takes the empirical distribution of $\{\delta^{\star}_1,\dots,\delta^{\star}_r\}$ as an approximation to $N(\mu_{\delta^{\star}}, \sigma^2_{\delta^{\star}})$, so we estimate $\sigma_{\delta^{\star}}$ by Eq. 11.

$$\mu_{\delta^{\star}} = \frac{1}{r}\sum_{i=1}^{r}\delta^{\star}_i\ , \quad \sigma_{\delta^{\star}} = \sqrt{\frac{1}{r-1}\sum_{i=1}^{r}\left(\delta^{\star}_i - \mu_{\delta^{\star}}\right)^2} \tag{11}$$
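The bootstrap and BLB steps can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation: `distances` is assumed to hold the values $f(v,q)$ of the community's nodes, the $s$ subsamples are taken as disjoint chunks of a shuffled copy (one way to respect $s\cdot|V|^m \leq |V|$), each resample has size $|S_i|$ as stated in the text, and the names `bootstrap_sigma` and `blb_margin_of_error` are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def bootstrap_sigma(subsample, r, rng):
    """Standard bootstrap: std. dev. of the mean over r resamples of size |S_i|."""
    means = [mean(rng.choices(subsample, k=len(subsample))) for _ in range(r)]
    return stdev(means)  # sample standard deviation, divisor r - 1


def blb_margin_of_error(distances, s, m, r, alpha, seed=0):
    """Average the per-subsample MoEs z_{alpha/2} * sigma over s disjoint subsamples."""
    n = len(distances)
    b = int(n ** m)  # subsample size |V|^m
    if b < 1 or r < 2 or s < 1 or s * b > n:
        raise ValueError("need b >= 1, r >= 2, s >= 1 and s * |V|^m <= |V|")
    rng = random.Random(seed)
    shuffled = list(distances)
    rng.shuffle(shuffled)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    moes = []
    for i in range(s):
        sub = shuffled[i * b:(i + 1) * b]  # i-th disjoint subsample S_i
        moes.append(z * bootstrap_sigma(sub, r, rng))
    return mean(moes)  # final MoE: average of the s per-subsample MoEs
```

A fixed `seed` keeps the sketch reproducible; in practice the number of resamples `r` trades estimation noise against cost.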

Accuracy guarantee. Given a CI $= \delta^{\star} \pm \varepsilon$, we ensure that the relative error of $\delta^{\star}$ is bounded by a user-input error bound $e$.

Theorem 11

If the MoE $\varepsilon$ satisfies $\varepsilon \leq \frac{\delta^{\star}\cdot e}{1+e}$, then the relative error is upper bounded by $e$ with a probability of $1-\alpha$.

Proof:

We prove this theorem in the following two steps.

Step 1. Suppose that the exact $\delta$ lies in the CI's right half-width, i.e., $\delta^{\star} \leq \delta \leq \delta^{\star}+\varepsilon$. Then, by the following derivation, $(\delta-\delta^{\star})/\delta \leq e$ holds if $\varepsilon/\delta^{\star} \leq e$ (i.e., $\varepsilon \leq \delta^{\star}\cdot e$).

$$(\delta-\delta^{\star})/\delta \;\leq\; (\delta-\delta^{\star})/\delta^{\star} \;\leq\; \varepsilon/\delta^{\star}$$

Step 2. Suppose that the exact $\delta$ lies in the CI's left half-width, i.e., $\delta^{\star}-\varepsilon \leq \delta \leq \delta^{\star}$. Then, by the following derivation, $(\delta^{\star}-\delta)/\delta \leq e$ holds if $\varepsilon/(\delta^{\star}-\varepsilon) \leq e$ (i.e., $\varepsilon \leq \frac{\delta^{\star}\cdot e}{1+e}$).

$$(\delta^{\star}-\delta)/\delta \;\leq\; \varepsilon/\delta \;\leq\; \varepsilon/(\delta^{\star}-\varepsilon)$$

In summary, $|\delta^{\star}-\delta|/\delta \leq e$ holds if $\varepsilon \leq \frac{\delta^{\star}\cdot e}{1+e}$, as $\frac{\delta^{\star}\cdot e}{1+e}$ is the tighter bound ($\leq \delta^{\star}\cdot e$). Since our CI of $\delta^{\star}$ has a confidence level $1-\alpha$, the above holds with a probability of $1-\alpha$. ∎
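The equivalence used in Step 2 can be spelled out as a short worked derivation. It assumes $e > 0$ and $\delta^{\star} > \varepsilon \geq 0$, so that all denominators are positive; these positivity assumptions are implicit in the proof rather than stated in it.

```latex
\begin{align*}
\frac{\varepsilon}{\delta^{\star}-\varepsilon} \le e
  &\iff \varepsilon \le e\,(\delta^{\star}-\varepsilon) \\
  &\iff \varepsilon\,(1+e) \le e\,\delta^{\star} \\
  &\iff \varepsilon \le \frac{\delta^{\star}\cdot e}{1+e},
\qquad\text{and}\qquad
\frac{\delta^{\star}\cdot e}{1+e} \le \delta^{\star}\cdot e
\ \text{ since } 1+e \ge 1 .
\end{align*}
```

The last inequality is why the Step 2 condition also implies the Step 1 condition, so a single threshold covers both half-widths.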

Greedy search of candidate communities. If the accuracy guarantee (Theorem 11) is not satisfied, we keep enumerating from $\tilde{H}^{\star}_k$ to get another candidate $\tilde{H}^{i}_k \subseteq \tilde{H}^{\star}_k$ for further estimation. A straightforward method is to apply our enumeration with prunings (§IV-B) to enumerate a new candidate and perform BLB estimation until Theorem 11 holds. Since we only seek an approximate community, we can simplify it to a greedy candidate search (without backtracking) that deletes the most dissimilar node at each state. Specifically, given the current $\tilde{H}^{i}_k$, we delete the node $v$ whose composite attribute distance $f(v,q)$ to $q$ is the largest; we then maintain the maximal $\tilde{H}_k$ of the remaining nodes as the next candidate and perform BLB estimation for it. We terminate once Theorem 11 holds.
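The greedy loop can be illustrated with the sketch below. It is written under several assumptions: the graph is a dict `adj` mapping each node to a set of neighbors, `f` is a dict of precomputed distances $f(v,q)$, and `estimate_moe` is a hypothetical callback returning the MoE for a list of distances (for instance a BLB estimator). The core maintenance is done by plain peeling, which may differ from the maintenance procedure the authors use.

```python
from collections import deque


def connected_k_core(adj, nodes, q, k):
    """Nodes of the connected k-core containing q inside `nodes` (empty set if none)."""
    alive = set(nodes)
    deg = {v: sum(1 for u in adj[v] if u in alive) for v in alive}
    queue = deque(v for v in alive if deg[v] < k)
    while queue:  # peel nodes whose degree drops below k
        v = queue.popleft()
        if v not in alive:
            continue
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    if q not in alive:
        return set()
    comp, frontier = {q}, deque([q])  # keep only the component of q
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u in alive and u not in comp:
                comp.add(u)
                frontier.append(u)
    return comp


def greedy_search(adj, q, k, f, estimate_moe, e):
    """Drop the most dissimilar node until eps <= delta* . e / (1 + e) holds."""
    cand = connected_k_core(adj, adj.keys(), q, k)
    while len(cand) > 1:
        members = [v for v in cand if v != q]
        delta_star = sum(f[v] for v in members) / len(members)
        eps = estimate_moe([f[v] for v in members])
        if eps <= delta_star * e / (1 + e):  # termination test of Theorem 11
            return cand, delta_star, eps
        worst = max(members, key=lambda v: f[v])  # largest distance to q
        cand = connected_k_core(adj, cand - {worst}, q, k)
    return None  # no candidate met the bound; the caller would enlarge S
```

Returning `None` corresponds to the case where no good community is found and the samples have to be enlarged.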

V-C Error-based Incremental Sampling

If we cannot find a good $\tilde{H}^{\star}_k$, then we should enlarge the samples $S = S \cup \Delta S$. Intuitively, we need a large $|\Delta S|$ when the MoE $\varepsilon$ is large. Otherwise, a small $|\Delta S|$ is sufficient. So, we present a method to automatically set $|\Delta S|$ based on $\varepsilon$.

Consider an MoE $\varepsilon > \frac{\delta^{\star}\cdot e}{1+e}$. We use the ratio $\varepsilon / \frac{\delta^{\star}\cdot e}{1+e}$ to denote how far $\varepsilon$ is from the desired value $\frac{\delta^{\star}\cdot e}{1+e}$. The larger this ratio is, the more nodes $\Delta S$ requires. Ideally, if we can reduce $\varepsilon$ to a new $\varepsilon'$ by a factor of at least $\varepsilon / \frac{\delta^{\star}\cdot e}{1+e}$, we satisfy $\varepsilon' \leq \frac{\delta^{\star}\cdot e}{1+e}$. Since $\varepsilon = z_{\alpha/2}\cdot\sigma_{\delta^{\star}}$, reducing $\varepsilon$ by this factor is equivalent to reducing $\sigma_{\delta^{\star}}$ by the same factor. Since $\sigma_{\delta^{\star}} = \sigma_{\delta}/\sqrt{|V_{\tilde{H}_k}|}$ according to the CLT, where $\sigma_{\delta}$ is the standard deviation of the population, reducing $\sigma_{\delta^{\star}}$ by this factor is equivalent to increasing $|V_{\tilde{H}_k}|$ by a factor of $(\varepsilon / \frac{\delta^{\star}\cdot e}{1+e})^{2}$. Because each BLB subsample has size $|V_{\tilde{H}_k}|^{m}$, this scales $|S_{\rm blb}|$ by a factor of $(\varepsilon / \frac{\delta^{\star}\cdot e}{1+e})^{2m}$ for a fixed number of subsamples $s$. Hence, we derive $|\Delta S|$ as follows.

$$|\Delta S| = |S_{\rm blb}| \cdot \left[\left(\varepsilon \Big/ \frac{\delta^{\star}\cdot e}{1+e}\right)^{2m} - 1\right] \tag{12}$$
Example 6

Consider a CI $= \delta^{\star} \pm \varepsilon$ with $\delta^{\star} = 0.3$, $\varepsilon = 3.5\times10^{-3}$, and $|S_{\rm blb}| = 1000$. If we set the scale factor $m = 0.6$ and the error bound $e = 0.01$, then we need $|\Delta S| = 1000\cdot\left(\left(3.5\times10^{-3} \big/ \frac{0.3\cdot 0.01}{1.01}\right)^{2\cdot 0.6}-1\right) \approx 218$ nodes to update $S$. For a larger $\varepsilon = 8\times10^{-3}$, we require $|\Delta S| \approx 2284$ nodes to update $S$.
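The arithmetic of Eq. 12 can be checked with a few lines of Python. The helper name `delta_sample_size` is illustrative, and rounding up to an integer is an assumption of this sketch rather than something the paper specifies.

```python
import math


def delta_sample_size(eps, delta_star, e, m, s_blb):
    """|Delta S| from Eq. 12; returns 0 when the MoE already meets the bound."""
    target = delta_star * e / (1 + e)  # desired MoE threshold (Theorem 11)
    if eps <= target:
        return 0
    ratio = eps / target  # how far eps is from the threshold
    return math.ceil(s_blb * (ratio ** (2 * m) - 1))


small = delta_sample_size(3.5e-3, 0.3, 0.01, 0.6, 1000)
large = delta_sample_size(8e-3, 0.3, 0.01, 0.6, 1000)
```

Evaluated at full floating-point precision, the two settings of Example 6 give 218 and 2284 additional samples. Intermediate rounding matters here: rounding the threshold $\frac{0.3\cdot 0.01}{1.01}$ to 0.0029 before dividing would raise the first value to roughly 253.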

V-D Complexity Analysis

The total time of our approximate solution consists of the sampling time ($T_s$) and the estimation time ($T_e$). For sampling, we first require a BFS to get a neighborhood graph $G_q$ of $q$ with a time of $|V_{G_q}| + |E_{G_q}|$; then we collect $|S| = \lambda\cdot|V_{G_q}|$ samples from $G_q$ and find the maximal connected $k$-core $\tilde{H}_k$ from the induced graph $G_q[S]$ with a time of $|S| + |V_{G_q[S]}|$. So, we get $T_s = O((1+\lambda)\cdot|V_{G_q}| + |E_{G_q}| + |V_{G_q[S]}|)$. For estimation, we introduce a constant $N_e$ to indicate the number of iterations of accuracy estimation until the termination condition (Theorem 11) is reached ($N_e \leq 5$ in practice). In each iteration, we greedily enumerate from $\tilde{H}_k \subseteq G_q[S]$ by deleting the node with the most dissimilar attribute distance; for each explored state, we perform BLB estimation over the $|S_{\rm blb}|$ subsampled nodes. So, the total time of BLB estimation is $|V_{\tilde{H}_k}|\cdot|S_{\rm blb}|$. If we cannot find a good community, we include $|\Delta S|$ additional samples for the next iteration of accuracy estimation. Thus, the total estimation time is $T_e = O(N_e\cdot(|V_{\tilde{H}_k}|\cdot|S_{\rm blb}| + |\Delta S|))$.

VI Extensions

We extend our sampling-estimation solution to three more general scenarios: (1) CS on heterogeneous graphs, (2) Size-bounded CS, and (3) CS with different community models.

VI-A Extension to Heterogeneous Graphs

A heterogeneous graph $G$ consists of a node set $V_G$ and an edge set $E_G$ with multiple node types $\mathcal{T}$ and edge types $\mathcal{R}$. Each node $v \in V_G$ has a node type given by $\phi: V_G \rightarrow \mathcal{T}$, and each edge $e \in E_G$ has an edge type given by $\psi: E_G \rightarrow \mathcal{R}$. A meta-path $\mathcal{P}$ is often used to indicate a specific relationship between two nodes of the same type. For example, in DBLP, $\mathcal{P} = \texttt{A-P-A}$ shows the co-authorship w.r.t. a paper between two authors. We call the nodes of the type linked by $\mathcal{P}$ target nodes (e.g., authors for A-P-A), and we aim to find an approximate $(k,\mathcal{P})$-core community [7, 27, 12] of target nodes from $G$ satisfying the constraints of the Approx-CS-AG problem. We refer readers to [7, 11, 12] for more details of heterogeneous graphs and meta-paths. We extend our method to this scenario with three modifications: (1) We replace the number of nodes $n = |V_G|$ in Theorem 10 with the number of target nodes of $G$ to compute the minimum size of $G_q$. (2) We construct $G_q$ by a $\mathcal{P}$-neighbor-oriented BFS from the query node $q$, which expands the search by exploring a node's $\mathcal{P}$-neighbors. We say two target nodes are $\mathcal{P}$-neighbors if they are connected by a path instance of $\mathcal{P}$. (3) We perform BLB estimation on the community of target nodes, using the attribute distance $\delta(\cdot)$ computed by the target nodes' $f(\cdot,q)$ to $q$.
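A $\mathcal{P}$-neighbor-oriented BFS for the meta-path A-P-A might look like the sketch below. The two dictionaries `writes` (author to papers) and `written_by` (paper to authors) are an assumed representation of the DBLP-style graph, and `max_nodes` is a hypothetical stand-in for the minimum size of $G_q$ obtained from Theorem 10.

```python
from collections import deque


def p_neighbors(author, writes, written_by):
    """Authors connected to `author` by one path instance of A-P-A."""
    out = set()
    for paper in writes.get(author, ()):
        out.update(written_by.get(paper, ()))
    out.discard(author)  # a node is not its own P-neighbor
    return out


def p_neighbor_bfs(q, writes, written_by, max_nodes):
    """Collect target nodes around q by expanding P-neighbors level by level."""
    seen, queue = {q}, deque([q])
    while queue and len(seen) < max_nodes:
        a = queue.popleft()
        for b in sorted(p_neighbors(a, writes, written_by)):
            if b not in seen:
                seen.add(b)
                queue.append(b)
                if len(seen) >= max_nodes:
                    break
    return seen
```

Only target nodes (authors) enter the result; papers are traversed implicitly, which matches the idea that the community consists of target nodes.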

Besides the basic $(k,\mathcal{P})$-core, there are several variants of $(k,\mathcal{P})$-core. For example, (1) $(k,\mathcal{P})$-truss [51] is an extension of $(k,\mathcal{P})$-core; we can support it by the same method as above, replacing core maintenance with truss maintenance during the BLB estimation. (2) The heterogeneous influential community (HIC) proposed in [52] aims to identify a $(k,\mathcal{P})$-core community $H$ such that no other community $H'$ has an influence vector $f(H')$ that dominates the influence vector $f(H)$ of $H$. The dominance relationship is defined as in skyline queries: $f(H')$ dominates $f(H)$ if every element $f_i(H')$ of $f(H')$ satisfies $f_i(H') \geq f_i(H)$. We may support HIC with a modification to the BLB estimation, i.e., estimating the MAX value of each element in the influence vector of an approximate community $H$. More precisely, we may resort to Extreme Value Theory (EVT) [53] to conduct EVT-based MAX value estimation [27] for each element in the influence vector of an approximate community.

VI-B Extension to Size-bounded CS

The community's size is critical to some applications [29, 54]. Many applications naturally require that the number of members in a community fall within a certain range, e.g., organizing a workshop with at least $l$ and no more than $h$ attendees. This motivates size-bounded CS [29], which finds a community with size $\in [l,h]$. We extend our sampling-estimation solution to size-bounded CS with three modifications: (1) We require at least $\frac{2}{\epsilon^{2}}\ln\frac{l(n-l)}{\beta}+1$ nodes to construct $G_q$, because the desired community's size is lower-bounded by $l$. That is, we replace $k+1$ in Theorem 10 with $l$ to get the new minimum size of $G_q$. (2) We ignore the candidate communities with size $> h$ during the estimation and stop the greedy search of candidates when the size is $\leq l$. (3) We terminate the estimation early when we get a community with size $\in [l,h]$ and an MoE $\varepsilon \leq \frac{\delta^{\star}\cdot e}{1+e}$ (Theorem 11).
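The minimum size of $G_q$ in modification (1) can be evaluated directly. In this sketch $\epsilon$ and $\beta$ are treated as plain numeric inputs; they are the parameters of Theorem 10, whose exact meaning is not restated in this section. The function name `min_neighborhood_size` and the rounding up to an integer are assumptions.

```python
import math


def min_neighborhood_size(n, lower, epsilon, beta):
    """Evaluate (2 / epsilon^2) * ln(lower * (n - lower) / beta) + 1, rounded up."""
    if not 0 < lower < n:
        raise ValueError("lower bound must satisfy 0 < lower < n")
    if epsilon <= 0 or beta <= 0:
        raise ValueError("epsilon and beta must be positive")
    value = 2.0 / epsilon ** 2 * math.log(lower * (n - lower) / beta) + 1
    return math.ceil(value)
```

Passing `lower = k + 1` evaluates the same expression with the original $k+1$ in place of $l$.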

Figure 5: Effectiveness (a)-(b) and efficiency (c)-(d) results over homogeneous graphs. Panels: (a) attribute distance $\delta$; (b) relative error (%) of $\delta$; (c) response time (ms); (d) response time (ms) of three steps.

VI-C Extension to Various Community Models

Besides $k$-core, $k$-truss [16] is another popular model for measuring a community's structure cohesiveness. According to [21], it is widely recognized that $k$-truss yields higher structure cohesiveness but is less efficient to compute than $k$-core. Users may choose an appropriate model based on their actual demands. We adapt our sampling-estimation solution to the $k$-truss model with three modifications: (1) In a $k$-truss, every node must have at least $k-1$ neighbors, so a $k$-truss is also a $(k-1)$-core. Hence, it has at least $k$ nodes, and we update the minimum size of $G_q$ to $\frac{2}{\epsilon^{2}}\ln\frac{k(n-k)}{\beta}+1$ (Theorem 10). (2) Given the induced graph $G_q[S]$ of $S$, we find the maximal connected $k$-truss in it, instead of the connected $k$-core, as the input of BLB estimation. (3) During the estimation, we maintain a connected $k$-truss as a candidate community instead of a $k$-core.
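To make modification (2) concrete, the following is a naive sketch of extracting the connected $k$-truss that contains a query node. It assumes the common definition in which every edge of a $k$-truss lies in at least $k-2$ triangles, uses plain connectivity, and repeats a simple peeling pass until nothing changes. It is not the truss-maintenance procedure of our implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def k_truss_edges(edges, k):
    """Return the edges of the maximal k-truss (naive fixed-point peeling).

    Assumes integer (comparable) node ids and an undirected simple graph.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # Support of edge (u, v) = number of common neighbours.
                if u < v and len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}


def connected_k_truss(edges, k, q):
    """Nodes of the k-truss component containing q (empty set if none)."""
    adj = defaultdict(set)
    for u, v in k_truss_edges(edges, k):
        adj[u].add(v)
        adj[v].add(u)
    if q not in adj:
        return set()
    seen, stack = {q}, [q]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


if __name__ == "__main__":
    # A 4-clique on nodes 0..3 plus one pendant edge (3, 4).
    g = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(connected_k_truss(g, 4, 0))
```

In the 4-clique example each surviving node keeps $k-1=3$ neighbors, matching the $(k-1)$-core property stated in modification (1).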

TABLE I: Statistics of datasets

| Dataset | #Nodes | #Edges | #N-types | #E-types | $d_{\rm max}$ | $d_{\rm avg}$ | $k_{\rm max}$ | $k_{\rm avg}$ |
|---|---|---|---|---|---|---|---|---|
| Facebook | 4,039 | 88,234 | 1 | 1 | 1,045 | 43.69 | 117 | 22.44 |
| GitHub | 37,700 | 289,003 | 1 | 1 | 9,458 | 15.33 | 36 | 7.12 |
| Twitch | 168,114 | 6,797,557 | 1 | 1 | 35,259 | 80.86 | 151 | 36.72 |
| LiveJournal | 3,997,962 | 34,681,189 | 1 | 1 | 14,815 | 17.34 | 362 | 7.84 |
| Twitter-2010 | 21,297,772 | 265,025,810 | 1 | 1 | 698,112 | 24.88 | 1,695 | 12.71 |
| DBLP | 682,819 | 1,951,209 | 4 | 6 | 345 | 3.75 | 28 | 2.64 |
| IMDB | 2,875,685 | 9,705,602 | 4 | 24 | 591 | 4.42 | 552 | 4.37 |
| DBpedia | 4,521,912 | 15,045,801 | 359 | 676 | 6,760 | 289.79 | 422 | 149 |
| Freebase | 5,706,539 | 48,724,743 | 11,666 | 5,118 | 467 | 5.64 | 60 | 2.75 |
| YAGO | 7,308,072 | 36,624,106 | 6,543 | 101 | 285 | 5.31 | 44 | 2.61 |

VII Experiments

We conduct an experimental study on ten real-world datasets. Our code [55] was implemented in Java 1.8 and run on a Linux server with a 2.1 GHz AMD-6272 CPU and 64 GB of memory. Our evaluation seeks to answer the following questions.

Q1: How do our Exact and approximate solutions (§IV-V) perform in effectiveness and efficiency? (§VII-B-VII-C)

Q2: What is the effect of pruning strategies? (§VII-D)

Q3: How scalable is the approximate method? (§VII-E)

Q4: How does our method find an approximate community iteratively on real-world datasets? (a case study in §VII-F)

Q5: How do parameters (discussed in §V) affect the approximate method’s effectiveness and efficiency? (§VII-G)

VII-A Experimental Setup

Datasets. Table I summarizes statistics of 5 homogeneous and 5 heterogeneous graphs, e.g., the maximum (average) coreness $k_{\rm max}$ ($k_{\rm avg}$) and degree $d_{\rm max}$ ($d_{\rm avg}$). (1) Facebook [56], (2) GitHub [57], (3) Twitch [58], (4) LiveJournal [44], and (5) Twitter-2010 [59] are social networks. (6) DBLP [60] provides relationships among authors, papers, venues, etc. Each author has several attributes, e.g., research interests, # publications, $h$-index, and # citations. (7) IMDB [61] provides relationships among actors, directors, and movies, with attributes such as category and # movies for actors, and genres and ratings for movies. (8) DBpedia [62], (9) Freebase [63], and (10) YAGO [64] are well-known knowledge graphs. Similar to [27], we add attributes for several types of nodes via web crawling.

Queries. We generated 200 queries for each graph. For homogeneous graphs, we follow [22] to form a query with a random query node. For heterogeneous graphs, we generate queries in the same way as [7]. We first obtain the top-10 meta-paths with the highest frequencies, where the frequency of a meta-path $\mathcal{P}$ is measured by its # path instances: the more path instances, the higher the frequency of $\mathcal{P}$. We then form a query with a randomly selected $\mathcal{P}$ and a query node of the type linked by $\mathcal{P}$.
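The query generation for heterogeneous graphs can be sketched as follows. This is an illustrative outline, not the released code: the function name, the representation of a meta-path as a tuple of node types, and the choice of its first type as the query node type are all assumptions.

```python
import random
from collections import Counter


def make_query(path_instances, nodes_by_type, top_n=10, rng=None):
    """Pick a frequent meta-path and a query node of the type it links.

    path_instances: one meta-path label (a tuple of node types) per
        observed path instance, so counting labels yields frequencies.
    nodes_by_type: hypothetical mapping from node type to node ids.
    """
    rng = rng or random.Random()
    freq = Counter(path_instances)
    top_paths = [p for p, _ in freq.most_common(top_n)]
    meta_path = rng.choice(top_paths)
    # Assumption: the endpoint type of the meta-path is the query type.
    query_node = rng.choice(nodes_by_type[meta_path[0]])
    return query_node, meta_path


if __name__ == "__main__":
    instances = [("A", "P", "A")] * 5 + [("A", "P", "V", "P", "A")] * 3
    nodes = {"A": ["a1", "a2"], "P": ["p1"], "V": ["v1"]}
    print(make_query(instances, nodes, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the generated query set reproducible across runs.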

Metrics. We use the attribute distance $\delta(\cdot)$ and the relative error of $\delta(\cdot)$ w.r.t. the ground truth (obtained by Exact) to evaluate the effectiveness. We evaluate the efficiency by response time. We show the average result of 200 queries in each test.
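As a small illustration of the effectiveness metric, the sketch below computes the relative error of an approximate $\delta(\cdot)$ against the exact one and averages it over queries. The standard definition $|\delta_{\rm approx}-\delta_{\rm exact}|/\delta_{\rm exact}$ is assumed, and the function names are hypothetical.

```python
def relative_error(delta_approx: float, delta_exact: float) -> float:
    """|approx - exact| / exact, the assumed relative-error definition."""
    if delta_exact == 0:
        raise ValueError("exact attribute distance must be non-zero")
    return abs(delta_approx - delta_exact) / delta_exact


def mean_relative_error(pairs) -> float:
    """Average relative error over (approx, exact) pairs, one per query."""
    errors = [relative_error(a, x) for a, x in pairs]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Illustrative values only, not measurements from the experiments.
    print(mean_relative_error([(0.102, 0.1), (0.21, 0.2)]))
```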

Methods. We implemented the Exact and Sampling-Estimation-based Approximate (SEA) solutions for $k$-core (default) and $k$-truss: (1) Exact, (2) Exact-Truss, (3) SEA, and (4) SEA-Truss. We compare ours with representative CS methods that use various attribute cohesiveness metrics: (5) LocATC-Core and (6) LocATC-Truss, the fastest local versions of ATC [3] atop $k$-core and $k$-truss, which are two approximate methods. (7) ACQ-Core [22] is an exact core-based method. (8) VAC-Core is the core-based version of the truss-based (9) VAC-Truss [5]; both are approximate methods. (10) E-VAC-Core and (11) E-VAC-Truss are the corresponding exact VAC methods, also from [5]. Ours (1)-(4) support both types of graphs, while (5)-(11) are designed for homogeneous graphs. To find communities on heterogeneous graphs with (5)-(11), we first convert a heterogeneous graph to a homogeneous one given a meta-path and then invoke these methods. Besides, we note that we only provide the results of E-VAC-Core for small graphs, i.e., Facebook and GitHub, because it cannot finish within one week on large graphs [5].

Parameters. The default parameters are: $k\geq 4$; $1-\beta=95\%$ and $\epsilon=0.05$ for the Hoeffding Inequality; initial sampling fraction $\lambda=0.2$; and $1-\alpha=95\%$ and $e=0.02$ for the accuracy guarantee. We show the parameter sensitivity in §VII-G.

Remark. Since some datasets provide human-annotated ground-truth (HA-GT) communities, e.g., Facebook, LiveJournal, Orkut [65], and Amazon [66], we also evaluate the effectiveness w.r.t. HA-GT using the $F_1$-score as the metric (the same as [3]). The higher the $F_1$-score, the more similar a community is to the HA-GT, showing that a community with strict structure and attribute cohesiveness constraints can reflect the characteristics of real communities to some extent.
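For reference, the $F_1$-score between a returned community and an HA-GT community can be computed on their node sets as in the sketch below. The set-based precision and recall are the standard definitions and are assumed here; the function name is hypothetical.

```python
def f1_score(found, truth) -> float:
    """F1 between the returned node set and the ground-truth node set."""
    found, truth = set(found), set(truth)
    tp = len(found & truth)          # nodes correctly returned
    if tp == 0:
        return 0.0
    precision = tp / len(found)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative node ids only.
    print(f1_score({1, 2, 3, 4}, {3, 4, 5, 6}))
```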

TABLE II: Various attribute cohesiveness (Facebook)

| Methods | Min-max (VAC) | Attribute coverage (ATC) | #Shared attributes (ACQ) | $\delta(\cdot)$ (Ours) | Total rank |
|---|---|---|---|---|---|
| SEA (Ours) | 0.486 (2) | 161.84 (4) | 0.06 (2) | 0.304 (2) | 10 |
| LocATC-Core | 0.491 (6) | 209.39 (1) | 0.04 (6) | 0.331 (6) | 19 |
| ACQ-Core | 0.489 (5) | 196.79 (2) | 0.08 (1) | 0.328 (5) | 13 |
| VAC-Core | 0.486 (2) | 178.46 (3) | 0.06 (2) | 0.325 (4) | 11 |
| Exact (Ours) | 0.486 (2) | 155.13 (6) | 0.06 (2) | 0.297 (1) | 11 |
| E-VAC-Core | 0.475 (1) | 158.45 (5) | 0.06 (2) | 0.314 (3) | 11 |

VII-B Effectiveness Evaluation

Figure 5(a) shows the results on homogeneous graphs. SEA has a smaller $\delta(\cdot)$ than the other methods, and it is quite close to that of Exact. In terms of the relative error of $\delta(\cdot)$ (Figure 5(b)), ours is at least one order of magnitude smaller than the others and is bounded by the default error bound $e=2\%$. This is because we apply BLB estimation with a reliable accuracy guarantee (Theorem 11). Moreover, we measure each method's attribute cohesiveness w.r.t. various metrics. Table II shows the results on Facebook (each associated with a rank in parentheses). We highlight the best results in bold and indicate suboptimal values with underlines. Each method performs the best on its own metric. From the macro perspective (see total rank), SEA is the best across all metrics. We also apply the same method as [3] to evaluate the $F_1$-score w.r.t. the HA-GT community (Table III). Here, we use '-' to indicate that a method cannot finish within a sufficiently long time. SEA and Exact have higher $F_1$-scores than the others, indicating that our communities are more similar to HA-GT. We also provide the $F_1$-score over 10 ego-networks of Facebook in Figure 6, and ours has the best $F_1$-score on eight of them.

TABLE III: $F_1$-score w.r.t. HA-GT

| Methods | Facebook | LiveJournal | Orkut | Amazon |
|---|---|---|---|---|
| SEA (Ours) | 0.61 | 0.86 | 0.56 | 0.91 |
| LocATC-Core | 0.54 | 0.76 | 0.45 | 0.73 |
| ACQ-Core | 0.31 | 0.31 | 0.28 | 0.45 |
| VAC-Core | 0.47 | 0.79 | 0.40 | 0.76 |
| Exact (Ours) | 0.64 | 0.88 | - | - |
| E-VAC-Core | 0.51 | - | - | - |
Figure 6: $F_1$-score on each ego-network provided in Facebook

VII-C Efficiency Evaluation

Figure 5(c) shows the results on homogeneous graphs. We show the speedup of SEA w.r.t. the compared methods on top of the bars. Ours outperforms the others, and the improvement becomes more pronounced as the graph size increases. For the ten-million-scale Twitter graph, ours is at least 28.2$\times$ faster than the others. Over all datasets, ours is at least 1.54$\times$ (41.1$\times$ on average) faster than the others. This is because our method can terminate early once an acceptable $\delta(\cdot)$ is obtained. Figure 5(d) shows the runtime of SEA's three steps. S1: Sampling-based maximal $\tilde{H}_k$ finding (§V-A). S2: BLB estimation (§V-B). S3: Error-based incremental sampling (§V-C). S2 is the most time-consuming step, as a greedy search is used to find candidate communities for BLB estimation. S3 is the most efficient step because, in most cases, we can find a good community within 2 iterations.

TABLE IV: Effect of prunings on Exact's efficiency (runtime in seconds and explored # states during the enumeration)

| Methods | Facebook Time | Facebook # States | GitHub Time | GitHub # States | Twitch Time | Twitch # States | LiveJournal Time | LiveJournal # States |
|---|---|---|---|---|---|---|---|---|
| Exact | 77 | $3.24\times 10^{6}$ | 1210 | $8.02\times 10^{7}$ | 14721 | $1.07\times 10^{9}$ | 59292 | $4.13\times 10^{9}$ |
| Exact$\setminus$P3 | 80 | $3.35\times 10^{6}$ | 1890 | $1.24\times 10^{8}$ | 15770 | $1.16\times 10^{9}$ | 59315 | $4.29\times 10^{9}$ |
| Exact$\setminus$P3+P2 | 388 | $8.23\times 10^{7}$ | 174483 | $4.21\times 10^{10}$ | 753701 | $5.48\times 10^{10}$ | $>$8 days | $1.02\times 10^{12}$ |
| Exact w/o P | $>$8 days | $6.87\times 10^{10}$ | $>$8 days | $8.79\times 10^{12}$ | $>$8 days | $2.81\times 10^{12}$ | $>$8 days | $4.51\times 10^{15}$ |

VII-D Effect of Pruning Strategies for Exact Method

In §IV, we propose Exact with three pruning strategies to prune duplicated states (P1), unnecessary states (P2), and unpromising states (P3). Table IV shows the effect of P1-P3 on efficiency. Exact is the version with P1-P3, $\texttt{Exact}\setminus$P3 is the one with P1+P2, $\texttt{Exact}\setminus$P3+P2 is the one with P1 only, and Exact w/o P is the one without any pruning. All strategies are effective and improve the runtime. Among them, P1 is the most effective, as it significantly reduces the # states explored during the search. For example, on Facebook, P1 prunes 99.8% of the states compared with Exact w/o P.
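The pruning percentage quoted for Facebook can be re-derived from the # States column of Table IV. The short sketch below does this arithmetic; the dictionary keys and the function name are labels introduced here for illustration.

```python
# Explored states on Facebook, copied from Table IV.
facebook_states = {
    "P1+P2+P3": 3.24e6,   # Exact
    "P1+P2": 3.35e6,      # Exact \ P3
    "P1": 8.23e7,         # Exact \ P3+P2
    "none": 6.87e10,      # Exact w/o P
}


def pruned_fraction(states_with: float, states_without: float) -> float:
    """Fraction of states removed relative to a less-pruned variant."""
    return 1.0 - states_with / states_without


if __name__ == "__main__":
    p1_only = pruned_fraction(facebook_states["P1"], facebook_states["none"])
    print(f"P1 alone prunes {p1_only:.2%} of the states")
```

The same function applied to consecutive rows gives the additional share pruned by P2 and P3.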

TABLE V: Response time (ms) and relative error of $\delta$ (%) for core- and truss-based methods on heterogeneous graphs.

| Methods | DBLP Time | DBLP Error | IMDB Time | IMDB Error | DBpedia Time | DBpedia Error | Yago Time | Yago Error | Freebase Time | Freebase Error |
|---|---|---|---|---|---|---|---|---|---|---|
| SEA (Ours) | 187.01 | 1.58 | 72.89 | 1.56 | 59.64 | 0.0082 | 76.57 | 1.26 | 51.97 | 1.43 |
| ACQ-Core | 799.34 | 13.45 | 850.26 | 41.57 | - | - | - | - | - | - |
| LocATC-Core | 431.84 | 14.58 | 891.54 | 47.83 | 102.85 | 37.58 | 178.57 | 20.10 | 109.82 | 24.81 |
| VAC-Core | 1453.82 | 12.45 | 2700.96 | 23.87 | 397.70 | 25.58 | 562.72 | 18.99 | 447.73 | 19.28 |
| SEA-Truss | 334.57 | 0.21 | 89.67 | 1.15 | 72.99 | 1.23 | 93.59 | 1.17 | 64.26 | 1.81 |
| LocATC-Truss | 812.93 | 4.89 | 947.19 | 21.29 | 211.28 | 19.99 | 297.14 | 15.28 | 191.34 | 15.17 |
| VAC-Truss | 1857.71 | 6.25 | 2938.27 | 9.04 | 791.85 | 2.37 | 839.47 | 5.31 | 621.54 | 5.87 |

VII-E Scalability Analysis

Extension to heterogeneous graphs. We provide the results of core-based methods on five heterogeneous graphs in Table V (rows 1-4). Since we only have numerical attributes for DBpedia, Yago, and Freebase, the equality-matching-based method ACQ-Core cannot return any community whose nodes share at least one numerical attribute. On all datasets, ours has a relative error at least one order of magnitude smaller than the others, and it is bounded by the default $e=2\%$ (Theorem 11). Besides, ours is at least 1.72$\times$ faster than the others, as we can terminate early once Theorem 11 holds.

Extension to $k$-truss model. Table V (rows 5-7) shows the results for truss-based methods on heterogeneous graphs. Ours outperforms the others for the same reasons as above.

Extension to size-bounded CS. Figure 7 shows the results of size-bounded CS on DBLP and GitHub. The runtime decreases as the size bound increases, because the larger the desired community, the less time is required for the greedy search of candidate communities. Besides, the relative error is bounded by the default $e=2\%$, showing that BLB estimation is effective.

Figure 7: Results of size-bounded CS (SEA). Panels: (a) DBLP; (b) GitHub.

Figure 8: Parameter sensitivity (DBLP and Twitter): efficiency (left Y axis) and effectiveness (right Y axis). Panels: (a) DBLP-$\lambda$; (b) Twitter-$\lambda$; (c) DBLP-$\epsilon$; (d) Twitter-$\epsilon$; (e) DBLP-$(1-\beta)$; (f) Twitter-$(1-\beta)$; (g) DBLP-$e$; (h) Twitter-$e$; (i) DBLP-$(1-\alpha)$; (j) Twitter-$(1-\alpha)$; (k) DBLP-$k$; (l) Twitter-$k$.

Figure 9: Case study of SEA with different size-bound
TABLE VI: Case study: detailed runtime information of SEA (approximate result round by round)

| Methods | Round | $\delta^{\star}$ | MoE $\varepsilon$ | $\Delta S$ | Time (ms) | Error (%) |
|---|---|---|---|---|---|---|
| SEA w/ size-bound $[10,30]$ | 1 | $4.39\times 10^{-3}$ | $9.23\times 10^{-5}$ | 5967 | 48.29 | 2.34 |
| SEA w/ size-bound $[10,30]$ | 2 | $4.31\times 10^{-3}$ | $1.79\times 10^{-5}$ | 615 | 3.85 | 0.39 |
| SEA w/ size-bound $[30,50]$ | 1 | $5.43\times 10^{-3}$ | $1.48\times 10^{-4}$ | 6743 | 52.45 | 4.66 |
| SEA w/ size-bound $[30,50]$ | 2 | $5.17\times 10^{-3}$ | $1.05\times 10^{-4}$ | 3989 | 18.84 | 0.41 |

VII-F Case Study

We performed a case study on IMDB with $q=$ Robert De Niro. Figure 9 illustrates two communities returned by SEA with different size bounds, $[10,30]$ and $[30,50]$. We use different colors to distinguish persons with different levels of attribute similarity to Robert De Niro (Red $\succeq$ Blue $\succeq$ Green). The community in (a) includes a set of top-tier Hollywood actors who are as famous as Robert De Niro, showing greater attribute cohesiveness w.r.t. $q$ than that of (b), whereas (b) has to include a few less similar persons, e.g., 50 Cent and Dom Deluise, in order to satisfy the enlarged size bound. Table VI shows the detailed runtime information of SEA, where the community is refined iteratively (i.e., $\delta^{\star}$ and $\varepsilon$ decrease) and the relative error is finally bounded by the default $e=2\%$. Besides, since we apply error-based incremental sampling, we require a smaller $\Delta S$ than the initial $S$ to update the result.

VII-G Parameter Sensitivity

Effect of $\lambda$. Figure 8(a)-(b) show that the runtime increases as $\lambda$ increases, e.g., from 180 ms to 290 ms for DBLP. The more samples there are, the more time is required to greedily search communities from a larger $\tilde{H}_k$ for estimation. Besides, $\lambda$ has little effect on the attribute cohesiveness, as the effectiveness is mainly determined by the estimation with accuracy guarantee.

Effect of $\epsilon$ and $1-\beta$ for Hoeffding Inequality. Figure 8(c)-(f) show that the response time increases as $\epsilon$ decreases or $1-\beta$ increases. The stricter $\epsilon$ and $1-\beta$ are, the more nodes are required to form $G_q$ (Theorem 10), leading to more time for estimation over a larger $\tilde{H}_k\subseteq G_q$. Since $\epsilon$ and $1-\beta$ control the probability that $G_q$ contains all nodes of the ground-truth community, stricter $\epsilon$ and $1-\beta$ make it more likely to find a better community.

Effect of $e$ and $1-\alpha$ for BLB estimation. Figure 8(g)-(j) show that the stricter $e$ and $1-\alpha$ are, the more response time is required to achieve a smaller relative error. The relative error is bounded by $e$ in almost all cases, the exception being $e=2\%$ on Twitter. This is because Theorem 11 holds with a probability of $1-\alpha$, and such a violation rarely happens for a large $1-\alpha$.
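To give a feel for how $e$ and $1-\alpha$ interact with a bootstrap-based margin of error, the following is a generic Bag of Little Bootstraps (BLB) sketch for the mean of a numeric sample. It is not the estimator of §V-B: the function names, the subset-size exponent `scale_exp`, the numbers of subsets `s` and resamples `r`, and the normal-quantile interval (`z = 1.96` for $1-\alpha=95\%$) are all assumptions of this illustration.

```python
import random
import statistics


def blb_margin_of_error(data, s=5, r=50, scale_exp=0.6, z=1.96, rng=None):
    """Generic BLB margin of error (MoE) for the mean of `data`.

    Each of the s little subsets has b = n**scale_exp points; every
    resample draws n points with replacement from one subset, which
    mimics a full-size bootstrap at the cost of b distinct values.
    """
    rng = rng or random.Random()
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    b = max(2, int(n ** scale_exp))
    margins = []
    for _ in range(s):
        subset = rng.sample(data, b)
        means = [statistics.fmean(rng.choices(subset, k=n)) for _ in range(r)]
        margins.append(z * statistics.stdev(means))
    # Average the per-subset margins to obtain the final MoE.
    return statistics.fmean(margins)


def error_bound_met(estimate: float, moe: float, e: float) -> bool:
    """Stopping test in the form MoE <= estimate * e / (1 + e)."""
    return moe <= estimate * e / (1 + e)


if __name__ == "__main__":
    rng = random.Random(7)
    sample = [rng.random() for _ in range(2000)]
    moe = blb_margin_of_error(sample, rng=random.Random(1))
    print(moe, error_bound_met(statistics.fmean(sample), moe, e=0.02))
```

A smaller $e$ tightens the stopping test and a larger $1-\alpha$ widens the interval through `z`, so both require more samples before the test passes, which is in line with the trend reported above.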

Effect of $k$. Figure 8(k)-(l) show that $\delta(\cdot)$ increases as $k$ increases. A large $k$ usually indicates a small community in which many nodes are important to the structure cohesiveness, so the $k$-core would collapse if we deleted such a node. Hence, the returned community may include some nodes that are dissimilar but contribute a lot to the structure cohesiveness, leading to a larger $\delta(\cdot)$. The runtime for a small $k$ is usually longer than that for a large $k$, as a small $k$ often means that we need more time for the greedy search over a large $\tilde{H}_k$ for estimation.

Figure 10: Effect of $\gamma$ on independent attribute cohesiveness. Panels: (a) DBLP; (b) Twitter.

Effect of $\gamma$. Since $\gamma$ is a balance factor that adjusts user preferences between the two types of attribute cohesiveness, we varied $\gamma$ to study its effect on the textual and numerical attribute cohesiveness separately in Figure 10. We observed that when $\gamma=1$ ($\gamma=0$), our method tends to identify communities with the highest (lowest) cohesion in textual attributes (i.e., Jaccard distance) but the lowest (highest) cohesion in numerical attributes (i.e., Manhattan distance). A balance is achieved when $\gamma$ is close to 0.5, indicating that communities with good cohesion in both types of attributes can be identified.
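The role of $\gamma$ can be illustrated with a minimal sketch of a composite distance between the query node and another node. The convex combination below, the averaging of the Manhattan distance over dimensions, and the assumption that numerical attributes are pre-scaled to $[0,1]$ are simplifications made here; the formal metric is the one defined earlier in the paper, and the function names are hypothetical.

```python
def jaccard_distance(a, b) -> float:
    """1 - |A ∩ B| / |A ∪ B| on textual attribute sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def manhattan_distance(x, y) -> float:
    """Mean absolute difference; assumes values pre-scaled to [0, 1]."""
    if len(x) != len(y) or not x:
        raise ValueError("numerical vectors must be non-empty and aligned")
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)


def composite_distance(text_q, num_q, text_v, num_v, gamma=0.5) -> float:
    """gamma weights the textual part, (1 - gamma) the numerical part."""
    return (gamma * jaccard_distance(text_q, text_v)
            + (1.0 - gamma) * manhattan_distance(num_q, num_v))


if __name__ == "__main__":
    # Illustrative attributes only.
    q_text, q_num = {"crime", "drama"}, [0.2, 0.8]
    v_text, v_num = {"drama", "comedy"}, [0.4, 0.5]
    for g in (0.0, 0.5, 1.0):
        print(g, composite_distance(q_text, q_num, v_text, v_num, gamma=g))
```

With $\gamma=1$ only the Jaccard term matters and with $\gamma=0$ only the Manhattan term matters, which mirrors the two extremes discussed above.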

VIII Related Work

Community search (CS) was first studied in [8]. Existing work can be divided into two categories according to graph type.

CS on homogeneous graphs. Many works focus on modeling the cohesive community based on minimum degree [8, 15, 67], $k$-core [39, 30, 67, 14], $k$-truss [18, 17, 16, 68], $k$-clique [19, 20, 69], $k$-edge [70, 71], and the query-biased density model [72]. These works greatly advance the study of CS but ignore CS on attributed graphs. Thus, many works define different metrics of attribute cohesiveness and integrate them with structure cohesiveness for CS [73, 74, 2, 75, 22, 3, 5, 1]. Although many metrics have been proposed, they are not strict enough to reflect a community's attribute cohesiveness. For example, [3] measures a community's attribute cohesiveness as the weighted sum of each attribute's coverage, where coverage is computed as the ratio of nodes with an exactly matched attribute. Similarly, [22] uses # shared attributes to measure cohesiveness, which also relies on equality matching. Due to the constraints of equality matching, they are not well suited for numerical attributes; for instance, it is more reasonable to seek movies with similar ratings rather than identical ones. [5] aims to minimize the maximum attribute distance (i.e., optimize the worst case) in a community but overlooks the similarity of nodes to the query node $q$. This motivates us to present a CS based on a $q$-centric attribute distance that considers both textual and numerical attributes.

CS on heterogeneous graphs. Recently, CS on heterogeneous graphs has emerged. The meta-path $\mathcal{P}$ is often used to indicate the relation between two node types. Several community models have been proposed, e.g., $(k,\mathcal{P})$-core [7], $(k,\mathcal{P})$-Btruss, and $(k,\mathcal{P})$-Ctruss [51]. Many follow-up works use them for various downstream applications, e.g., expert finding in [11, 12] and influential community search via skyline influence vectors in [52]. [76] presents a keyword-centric CS, which takes a set of keywords as input rather than a query node and cannot support numerical attributes. It ensures that any node in a community can reach a keyword via a short path, rather than ensuring that all nodes in a community are similar in their attributes.

None of the above methods provides an efficient approximate solution with a reliable evaluation of a community's quality based on a metric that can better distinguish a community's attribute cohesiveness, which inspires our study in this paper.

IX Conclusions

We study an NP-hard CS-AG problem atop a strict $q$-centric attribute cohesiveness metric for the $k$-core model on homogeneous graphs. We first propose an exact method with three pruning strategies, serving as a baseline. Then, we propose a sampling-estimation-based method to quickly return an approximate community with an accuracy guarantee (given as an error-bounded confidence interval). We extend our method to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Experimental studies on ten real-world datasets demonstrate our method's superiority in both effectiveness and efficiency.

References

  • [1] Z. Zhang, X. Huang, J. Xu, B. Choi, and Z. Shang, “Keyword-centric community search,” in ICDE, 2019, pp. 422–433.
  • [2] Y. Fang, R. Cheng, X. Li, S. Luo, and J. Hu, “Effective Community Search over Large Spatial Graphs,” PVLDB, vol. 10, no. 6, pp. 709–720, 2017.
  • [3] X. Huang and L. V. S. Lakshmanan, “Attribute-Driven Community Search,” PVLDB, vol. 10, no. 9, pp. 949–960, 2017.
  • [4] L. Sun, X. Huang, R. Li, B. Choi, and J. Xu, “Index-based intimate-core community search in large weighted graphs,” IEEE Trans. Knowl. Data Eng., 2020.
  • [5] Q. Liu, Y. Zhu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “VAC: vertex-centric attributed community search,” in ICDE, 2020, pp. 937–948.
  • [6] X. Miao, Y. Liu, L. Chen, Y. Gao, and J. Yin, “Reliable community search on uncertain graphs,” in ICDE, 2022, pp. 1166–1179.
  • [7] Y. Fang, Y. Yang, W. Zhang, X. Lin, and X. Cao, “Effective and efficient community search over large heterogeneous information networks,” PVLDB, vol. 13, no. 6, pp. 854–867, 2020.
  • [8] M. Sozio and A. Gionis, “The community-search problem and how to plan a successful cocktail party,” in KDD, 2010, pp. 939–948.
  • [9] J. Dudley, T. Deshpande, and A. J. Butte, “Exploiting drug-disease relationships for computational drug repositioning,” Briefings Bioinform., vol. 12, no. 4, pp. 303–311, 2011.
  • [10] P. Pesantez-Cabrera and A. Kalyanaraman, “Efficient detection of communities in biological bipartite networks,” IEEE ACM Trans. Comput. Biol. Bioinform., vol. 16, no. 1, pp. 258–271, 2019.
  • [11] X. Xu, J. Liu, Y. Wang, and X. Ke, “Academic Expert Finding via (k,p)-core based Embedding over Heterogeneous Graphs,” in ICDE, 2022, pp. 338–351.
  • [12] Y. Wang, J. Liu, X. Xu, X. Ke, T. Wu, and X. Gou, “Efficient and effective academic expert finding on heterogeneous graphs through (k,p)-core based embedding,” ACM Trans. Knowl. Discov. Data, vol. 17, no. 6, mar 2023.
  • [13] E. Clough and T. Barrett, The Gene Expression Omnibus Database.   New York, NY: Springer New York, 2016, pp. 93–110. [Online]. Available: https://doi.org/10.1007/978-1-4939-3578-9_5
  • [14] N. Barbieri, F. Bonchi, E. Galimberti, and F. Gullo, “Efficient and effective community search,” Data Min. Knowl. Discov., vol. 29, no. 5, pp. 1406–1433, 2015.
  • [15] W. Cui, Y. Xiao, H. Wang, and W. Wang, “Local Search of Communities in Large Graphs,” in SIGMOD, 2014, pp. 991–1002.
  • [16] X. Huang, L. V. S. Lakshmanan, J. X. Yu, and H. Cheng, “Approximate Closest Community Search in Networks,” PVLDB, vol. 9, no. 4, pp. 276–287, 2015.
  • [17] X. Huang, H. Cheng, L. Qin, W. Tian, and J. X. Yu, “Querying k-truss community in large and dynamic graphs,” in SIGMOD, 2014, pp. 1311–1322.
  • [18] E. Akbas and P. Zhao, “Truss-based community search: A truss-equivalence based indexing approach,” PVLDB, vol. 10, no. 11, pp. 1298–1309, 2017.
  • [19] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang, “Online Search of Overlapping Communities,” in SIGMOD, 2013, pp. 277–288.
  • [20] C. E. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. A. Tsiarli, “Denser than the densest subgraph: Extracting optimal quasi-cliques with quality guarantees,” in KDD, 2013, pp. 104–112.
  • [21] Y. Fang, X. Huang, L. Qin, Y. Zhang, W. Zhang, R. Cheng, and X. Lin, “A survey of community search over big graphs,” VLDBJ, vol. 29, no. 1, pp. 353–392, 2020.
  • [22] Y. Fang, R. Cheng, S. Luo, and J. Hu, “Effective community search for large attributed graphs,” PVLDB, vol. 9, no. 12, pp. 1233–1244, 2016.
  • [23] Y. Zhu, J. He, J. Ye, L. Qin, X. Huang, and J. X. Yu, “When structure meets keywords: Cohesive attributed community search,” in CIKM, 2020, pp. 1913–1922.
  • [24] S. Kosub, “A note on the triangle inequality for the jaccard distance,” Pattern Recognit. Lett., vol. 120, pp. 36–38, 2019.
  • [25] N. Laptev, K. Zeng, and C. Zaniolo, “Early accurate results for advanced analytics on mapreduce,” PVLDB, vol. 5, no. 10, pp. 1028–1039, 2012.
  • [26] S. Chaudhuri, B. Ding, and S. Kandula, “Approximate query processing: No silver bullet,” in SIGMOD, S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, and D. Suciu, Eds., 2017, pp. 511–519.
  • [27] Y. Wang, A. Khan, X. Xu, J. Jin, Q. Hong, and T. Fu, “Aggregate Queries on Knowledge Graphs: Fast Approximation with Semantic-aware Sampling,” in ICDE, 2022.
  • [28] Y. Wang, J. Luo, A. Song, and F. Dong, “A sampling-based hybrid approximate query processing system in the cloud,” in ICPP, 2014, pp. 291–300.
  • [29] K. Yao and L. Chang, “Efficient size-bounded community search over large networks,” Proc. VLDB Endow., vol. 14, no. 8, pp. 1441–1453, 2021.
  • [30] F. Bonchi, A. Khan, and L. Severini, “Distance-generalized core decomposition,” in SIGMOD, 2019, pp. 1006–1023.
  • [31] J. Hu, X. Wu, R. Cheng, S. Luo, and Y. Fang, “Querying Minimal Steiner Maximum-connected Subgraphs in Large Graphs,” in CIKM, 2016, pp. 1241–1250.
  • [32] R. Li, L. Qin, J. X. Yu, and R. Mao, “Influential Community Search in Large Networks,” PVLDB, vol. 8, no. 5, pp. 509–520, 2015.
  • [33] R. Li, L. Qin, F. Ye, J. X. Yu, X. Xiao, N. Xiao, and Z. Zheng, “Skyline community search in multi-valued networks,” in SIGMOD, 2018, pp. 457–472.
  • [34] M. Wang, L. Lv, X. Xu, Y. Wang, Q. Yue, and J. Ni, “An efficient and robust framework for approximate nearest neighbor search with attribute constraint,” in NeurIPS, 2024.
  • [35] M. El-Kebir and G. W. Klau, “Solving the maximum-weight connected subgraph problem to optimality,” arXiv, vol. abs/1409.5308, 2014.
  • [36] J. M. Kleinberg and É. Tardos, Algorithm Design.   Addison-Wesley, 2006.
  • [37] A. Santuari, “Steiner tree np-completeness proof,” University of Trento, Tech. Rep., 2003.
  • [38] J. Byrka, F. Grandoni, T. Rothvoß, and L. Sanità, “An improved lp-based approximation for steiner tree,” in STOC, L. J. Schulman, Ed., 2010, pp. 583–592.
  • [39] V. Batagelj and M. Zaversnik, “An o (m) algorithm for cores decomposition of networks,” arXiv, vol. cs.DS/0310049, 2003.
  • [40] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, pp. 409–426, 1994.
  • [41] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “A Scalable Bootstrap for Massive Data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 4, pp. 795–816, 2014.
  • [42] J. M. Kleinberg, “Navigation in a small world,” Nature, vol. 406, p. 845, 2000.
  • [43] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020.
  • [44] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in ICDM, 2012, pp. 745–754.
  • [45] D. Cheng, C. Chen, X. Wang, and S. Xiang, “Efficient top-k vulnerable nodes detection in uncertain graphs,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 2, pp. 1460–1472, 2023.
  • [46] F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich, “Core decomposition of uncertain graphs,” in SIGKDD, 2014, pp. 1316–1325.
  • [47] Y. Peng, Y. Zhang, W. Zhang, X. Lin, and L. Qin, “Efficient probabilistic k-core computation on uncertain graphs,” in ICDE, 2018, pp. 1192–1203.
  • [48] J. Gao, X. Li, Y. E. Xu, B. Sisman, X. L. Dong, and J. Yang, “Efficient knowledge graph accuracy evaluation,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1679–1691, 2019.
  • [49] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Springer, 1993.
  • [50] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “The big data bootstrap,” in ICML, 2012.
  • [51] Y. Yang, Y. Fang, X. Lin, and W. Zhang, “Effective and Efficient Truss Computation over Large Heterogeneous Information Networks,” in ICDE, 2020, pp. 901–912.
  • [52] Y. Zhou, Y. Fang, W. Luo, and Y. Ye, “Influential community search over large heterogeneous information networks,” Proc. VLDB Endow., vol. 16, no. 8, pp. 2047–2060, 2023.
  • [53] S. Coles, J. Bawa, L. Trenner, and P. Dorazio, An Introduction to Statistical Modeling of Extreme Values. Springer, 2001, vol. 208.
  • [54] Y. Ma, Y. Yuan, F. Zhu, G. Wang, J. Xiao, and J. Wang, “Who should be invited to my party: A size-constrained k-core problem in social networks,” J. Comput. Sci. Technol., vol. 34, no. 1, pp. 170–184, 2019.
  • [55] “Code and datasets,” https://anonymous.4open.science/r/SEA-Update-D18E/README.md, 2023.
  • [56] J. J. McAuley and J. Leskovec, “Learning to discover social circles in ego networks,” in NIPS, 2012, pp. 548–556.
  • [57] B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” J. Complex Networks, vol. 9, no. 2, 2021.
  • [58] B. Rozemberczki and R. Sarkar, “Twitch gamers: a dataset for evaluating proximity preserving and structural role-based node embeddings,” arXiv, vol. abs/2101.03091, 2021.
  • [59] R. A. Rossi and N. K. Ahmed, “The network data repository with interactive graph analytics and visualization,” in AAAI, 2015.
  • [60] “DBLP,” http://dblp.uni-trier.de/xml/, 2023.
  • [61] “IMDB,” https://www.imdb.com/interfaces/, 2023.
  • [62] “DBpedia,” https://wiki.dbpedia.org/Datasets, 2023.
  • [63] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in SIGMOD, 2008, pp. 1247–1250.
  • [64] T. Rebele, F. M. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, “YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames,” in ISWC, 2016, pp. 177–185.
  • [65] “Orkut dataset,” https://www.comp.hkbu.edu.hk/~db/book/communitysearch.html, 2023.
  • [66] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowl. Inf. Syst., vol. 42, no. 1, pp. 181–213, 2015.
  • [67] Y. Fang, Z. Wang, R. Cheng, H. Wang, and J. Hu, “Effective and efficient community search over large directed graphs,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 11, pp. 2093–2107, 2019.
  • [68] Q. Liu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “Truss-based community search over large directed graphs,” in SIGMOD, 2020, pp. 2183–2197.
  • [69] L. Yuan, L. Qin, W. Zhang, L. Chang, and J. Yang, “Index-based densest clique percolation community search in networks,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 5, pp. 922–935, 2018.
  • [70] L. Chang, X. Lin, L. Qin, J. X. Yu, and W. Zhang, “Index-based optimal algorithms for computing Steiner components with maximum connectivity,” in SIGMOD, 2015, pp. 459–474.
  • [71] J. Hu, X. Wu, R. Cheng, S. Luo, and Y. Fang, “On minimal Steiner maximum-connected subgraph queries,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 11, pp. 2455–2469, 2017.
  • [72] Y. Wu, R. Jin, J. Li, and X. Zhang, “Robust local community detection: On free rider effect and its elimination,” Proc. VLDB Endow., vol. 8, no. 7, pp. 798–809, 2015.
  • [73] L. Chen, C. Liu, R. Zhou, J. Li, X. Yang, and B. Wang, “Maximum co-located community search in large scale social networks,” Proc. VLDB Endow., vol. 11, no. 10, pp. 1233–1246, 2018.
  • [74] L. Chen, C. Liu, K. Liao, J. Li, and R. Zhou, “Contextual community search over large social networks,” in ICDE, 2019, pp. 88–99.
  • [75] Y. Fang, R. Cheng, Y. Chen, S. Luo, and J. Hu, “Effective and efficient attributed community search,” VLDBJ, vol. 26, no. 6, pp. 803–828, 2017.
  • [76] L. Qiao, Z. Zhang, Y. Yuan, C. Chen, and G. Wang, “Keyword-centric community search over large heterogeneous information networks,” in DASFAA, vol. 12681, 2021, pp. 158–173.