Scalable Community Search with Accuracy Guarantee on Attributed Graphs
Abstract
Given an attributed graph $G$ and a query node $q$, Community Search over Attributed Graphs (CS-AG) aims to find a structure- and attribute-cohesive subgraph of $G$ that contains $q$. Although CS-AG has been widely studied, existing methods still face three challenges. (1) Exact methods based on graph traversal are time-consuming, especially for large graphs. Some tailored indices can improve efficiency, but introduce non-negligible storage and maintenance overhead. (2) Approximate methods with a loose approximation ratio only provide a coarse-grained evaluation of a community’s quality, rather than a reliable evaluation with an accuracy guarantee at runtime. (3) Attribute cohesiveness metrics often ignore the important correlation with the query node $q$. We formally define our CS-AG problem atop a $q$-centric attribute cohesiveness metric considering both textual and numerical attributes, for the $k$-core model on homogeneous graphs. We show the problem is NP-hard. To solve it, we first propose an exact baseline with three pruning strategies. Then, we propose an index-free sampling-estimation-based method to quickly return an approximate community with an accuracy guarantee, in the form of a confidence interval. Once a good result satisfying a user-desired error bound is reached, we terminate it early. We extend it to heterogeneous graphs, the $k$-truss model, and size-bounded CS. Comprehensive experimental studies on ten real-world datasets show its superiority, e.g., it is at least 1.54× (41.1× on average) faster in response time and achieves a reliable relative error of attribute cohesiveness (within a user-specified error bound).
I Introduction
Large-scale, real-world attributed graphs have recently emerged in various domains, e.g., social networks, collaboration networks, and knowledge graphs [1, 2, 3, 4, 5]. In such graphs, nodes represent entities with attributes and edges represent the relationships between entities. Given an attributed graph $G$ and a query node $q$, community search (CS) aims to find a cohesive community that contains $q$. It is widely recognized that CS is important for many real-life applications [6, 7], such as event planning [8], biological data analysis [9, 10], and recommendation [11, 12, 7]. For example, one can input a specific disease-related gene to find a community of similar genes from a biological network (e.g., GEO [13]) that helps reveal the hidden causes of diseases; a system can recommend movies for a user by taking one of her favorite movies as a query and returning a cohesive community of similar movies from IMDB.
Cohesiveness is an important metric for measuring a community’s quality. It is two-fold, covering structure cohesiveness and attribute cohesiveness. Many models like $k$-core [8, 14, 15], $k$-truss [16, 17, 18, 5], and $k$-clique [19, 20] have been proposed to measure structure cohesiveness. In terms of the ability to measure structure cohesiveness, [21] ranks them from least to most cohesive as $k$-core, $k$-truss, $k$-clique. As the structure cohesiveness increases, the computational efficiency decreases in the same order [21]. Users may choose an appropriate model according to their actual demands. Attribute cohesiveness is another metric for enhancing a community’s quality. Given an attribute cohesiveness metric, one can form an optimization problem to find the most attribute-cohesive community. Essentially, existing CS on attributed graphs requires users first to define an appropriate community model and an attribute cohesiveness metric, and then to design an exact or approximate CS algorithm that returns a community of interest [22, 23, 5, 3]. Figure 1(a) shows an example on IMDB. Each node represents an audiovisual work with attributes provided at the bottom (i.e., type, genres, average rating, # ratings). Each edge shows that two works have a common actor. Suppose a user likes The Godfather; by running a CS algorithm with this movie as the query node, it is easy to explore a high-rating community of crime drama movies for her.
![Refer to caption](x1.png)
Challenges and solutions. We next use the example of movie community search mentioned above to illustrate the challenges faced by existing methods and our solutions.
Challenge I: How to design a metric to better measure a community’s attribute cohesiveness? Figure 1(b)-(d) shows the results of three representative CS methods on IMDB given the query node (The Godfather) and the $k$-core as the community model (a $k$-core is a subgraph in which each node has degree at least $k$). ATC [3] uses the weighted sum of the contribution of ’s attributes as the metric. An attribute ’s contribution is defined as , where is a set of nodes that have attribute and is the node set of a community . It excludes and with different type (TV series) and genres and returns an with the largest weighted sum of in Figure 1(b), where the contributions of movie, crime, drama are , , and . Since this metric is tailored for textual attributes based on equality matching, it involves six movies whose numerical attributes are dissimilar to those of the query node (colored nodes), e.g., two low-rating action movies and . Figure 1(c) shows the result of ACQ [22], which uses the # shared attributes as the metric. It increases the # shared attributes from 1 (movie) to 3 (movie, crime, drama) by deleting and . It also includes four movies dissimilar to the query node, as the # shared attributes is only valid for textual attributes. VAC [5] is the state-of-the-art work that aims to minimize the maximum attribute distance of any two nodes in a community. Since the maximum attribute distance is dominated by the most dissimilar pair of nodes, VAC optimizes only the worst case in a community and overlooks the similarity of other nodes to the query node. In Figure 1(d), is the most dissimilar node to ; however, deleting it would collapse the $k$-core of . So, VAC halts at this step, as the worst case cannot be further improved, and the result still involves , that are numerically dissimilar to the query node.
In a nutshell, none of them supports CS on attributed graphs well, either because they cannot simultaneously handle textual and numerical attributes, or because they rely on metrics designed from a global perspective (e.g., optimizing the worst case), neglecting the crucial cohesiveness w.r.t. the query node $q$.
Therefore, we define a $q$-centric attribute distance of a community that considers both textual and numerical attributes (§II), based on which we define our CS problem on attributed graphs (CS-AG) and prove it is NP-hard (§III). We present an exact algorithm with three pruning strategies (§IV) that serves as the exact baseline in our experimental study.
Challenge II: How to design an efficient index-free approximate CS algorithm with a reliable accuracy guarantee? The conventional solution for CS is to design algorithms based on graph traversal, which is time-consuming for large graphs [7, 6, 5, 3, 22]. To improve efficiency, researchers resort to community-aware indices or approximate algorithms. Index-based solutions trade space overhead for efficiency. However, they usually require an index whose size is comparable to that of the graph [22]. In dynamic scenarios, index updates or reconstruction introduce non-negligible maintenance overhead. Approximate solutions trade effectiveness for efficiency, e.g., [16, 5] provide approximation guarantees w.r.t. the exact results via the triangle inequality [24]. However, the approximation ratio only provides an upper (often loose) bound of attribute cohesiveness and fails to offer a reliable evaluation of how good a community is at runtime, i.e., what is the relative error of a community’s attribute distance w.r.t. that of the ground-truth community?
Compared with a tardy exact result or an approximate result with a loose approximation ratio, it is more desirable for a method to quickly return an approximate result with a reliable accuracy guarantee [25, 26, 27, 28]. Thus, we present an index-free sampling-estimation-based approximate algorithm that first collects a set of high-quality samples (nodes) based on Hoeffding Inequality to form a candidate community. We then estimate its attribute distance as with a level confidence interval CI = using Bag of Little Bootstraps, where is the half width of a CI. Given a user-desired error bound , we guarantee that the relative error of w.r.t. the exact is bounded by (i.e., ) if is small enough to satisfy Theorem 11. We terminate the query early once such a good is reached. Otherwise, we enlarge the sample and repeat the above until Theorem 11 holds. Figure 1(e) shows the result of = and = . It costs only 66 ms (at least 12× faster than others, see the table at the top) to return the same community as that of our exact algorithm (§IV), indicating a relative error of 0% ( = = 0.123). It involves seven crime drama movies similar to the query node, with higher ratings and more ratings. By varying , we found that the results of = and are the same as (d) and (c) with bounded relative errors. For example, given the CIs provided above Figure 1(e), we have a relative error for = as . A larger and smaller usually result in less time for estimation, e.g., 66 ms, 55 ms, and 49 ms for , , and . Users may configure them based on their preferences for accuracy and efficiency.
Challenge III: How to enable a CS algorithm to be scalable for different scenarios? A CS algorithm is usually tailored for a specific graph type (homogeneous or heterogeneous) and a specific community model ($k$-core, $k$-truss, etc.), with or without a size constraint (size-bounded CS [29]), showing a strong dependence on a specific scenario. This implies that adapting a CS algorithm from one scenario to another usually requires redesigning it partially or even completely.
Since our approximate CS algorithm consists of two standard steps, sampling and estimation, it can easily be extended to more general scenarios with lightweight modifications, including heterogeneous graphs (§VI-A), size-bounded CS (§VI-B), and other community models (§VI-C). For example, we can simply add a constraint on the sample size in the sampling step to support size-bounded CS with an accuracy guarantee.
Contributions. The main contributions are as follows.
- We present an exact method for CS-AG with three pruning strategies (§IV) to avoid redundant searches.
- We propose an index-free sampling-estimation-based approximate CS method for Approx-CS-AG in §V, with a reliable runtime accuracy guarantee on attribute cohesiveness.
- We extend our approximate CS method to more general scenarios such as CS on heterogeneous graphs, size-bounded CS, and CS with different community models in §VI.
II Preliminaries and Problems
II-A Preliminaries
Definition 1
Attributed Graph. An attributed graph is defined as , where () is the node (edge) set and is the set of attributes associated with nodes. For each node , it has a set of attributes , where and indicates the -th attribute of a node . In this paper, we consider both the textual and numerical attributes of a node, denoted by and , respectively.
![Refer to caption](x2.png)
Structure cohesiveness. We first employ the $k$-core model [30] to measure the structure cohesiveness of a community, because it is the most efficient model compared to other models [21], i.e., $k$-truss [16], $k$-ECC [31], $k$-clique [19], and it performs well on cohesiveness evaluation [7, 21, 32, 15, 2]. In §VI, we show how to extend it to other community models.
Definition 2
$k$-core. Given a graph $G$ and a non-negative integer $k$, a $k$-core of $G$ is the largest subgraph of $G$ in which every node has degree at least $k$.
In Figure 2(a), the 1-core is the entire graph as every node has at least one neighbor; the 2-core is the subgraph excluding ; the 3-core consists of two components and each node has at least three neighbors. The larger $k$ is, the more cohesive the $k$-core is.
Definition 3
Connected $k$-core [4]. Given a graph $G$ and a non-negative integer $k$, a connected $k$-core of $G$ is a connected subgraph of $G$ in which every node has degree at least $k$.
To incorporate connectivity into a community, similar to [4, 33, 14, 22], we use the connected $k$-core as the community model. Given a query node $q$, we aim to find a connected $k$-core containing $q$ as the desired community of $q$. Figure 2(b-c) shows two such communities containing $q$.
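The community model above can be illustrated with a minimal sketch that peels low-degree nodes and then keeps the component of the query node. The adjacency-dictionary input and the function name `connected_k_core` are assumptions made for illustration, not part of the original method description.

```python
from collections import deque


def connected_k_core(adj, q, k):
    """Return the node set of the maximal connected k-core containing q.

    adj maps each node to the set of its neighbors (undirected graph).
    An empty set is returned when q does not belong to any k-core.
    """
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    # Peel: repeatedly remove nodes whose remaining degree is below k.
    queue = deque(v for v in adj if deg[v] < k)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue  # already removed via an earlier queue entry
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    if q not in alive:
        return set()
    # Keep only the connected component of q inside the k-core.
    component, frontier = {q}, deque([q])
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u in alive and u not in component:
                component.add(u)
                frontier.append(u)
    return component
```

On a toy graph made of a triangle with one pendant node, the 2-core around a triangle node is the triangle itself, while no 3-core exists.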
Attribute cohesiveness. We consider two types of attributes: textual and numerical. Given two nodes , they are similar in textual attributes if their textual attributes have substantial overlap [5]. We use Jaccard distance to measure their textual attribute distance, denoted by . Jaccard distance is essentially based on equality matching, so the higher the ratio of equally matched textual attributes, the smaller the . Unlike textual attributes, equality matching is often meaningless for numerical attributes. Many alternative choices are available, such as Manhattan distance and Euclidean distance. In this paper, we compute the numerical attribute distance based on Manhattan distance , where we use to normalize to [0,1], thereby eliminating the dimensional influence. We next compute the composite attribute distance between two nodes as , where the parameter is configured as a balance factor to adjust user preferences for different types of attribute cohesiveness. It is worth mentioning that other options exist for combining the two types of attribute distances. For example, if the two types of attributes are vectorized as a unified high-dimensional vector (i.e., embeddings), we can simply measure the attribute distance by computing the cosine similarity of attribute vectors [34]. Using the composite distance, we can measure how similar a node is to a query node $q$. Intuitively, the more nodes of a community are similar to $q$, the more attribute-cohesive the community is w.r.t. $q$. We next define the $q$-centric attribute distance of a community as follows.
Definition 4
Attribute distance of a community. Given a community containing $q$, its $q$-centric attribute distance is defined as the average composite attribute distance to $q$ over all nodes in the community’s node set.
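The distances described above can be sketched as follows. Several details are assumptions made for illustration because the exact formulas are not reproduced here: numerical attributes are normalized by a per-dimension range and averaged over dimensions, the composite distance is a convex combination weighted by a balance factor called `gamma`, and the community average includes $q$ itself with distance 0. All function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Textual distance: 1 - |a ∩ b| / |a ∪ b| (0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def manhattan_distance(x, y, ranges):
    """Numerical distance in [0, 1]; each dimension is scaled by its value range.

    Assumption: the per-dimension terms are averaged so the total stays in [0, 1].
    """
    terms = [abs(xi - yi) / r if r > 0 else 0.0 for xi, yi, r in zip(x, y, ranges)]
    return sum(terms) / len(terms) if terms else 0.0


def composite_distance(u, v, ranges, gamma=0.5):
    """Assumed form: gamma * textual + (1 - gamma) * numerical.

    u and v are (textual_set, numerical_tuple) pairs.
    """
    textual = jaccard_distance(u[0], v[0])
    numerical = manhattan_distance(u[1], v[1], ranges)
    return gamma * textual + (1.0 - gamma) * numerical


def community_distance(nodes, q, attrs, ranges, gamma=0.5):
    """q-centric attribute distance: average composite distance to q.

    Assumption: the average runs over all nodes, including q (distance 0).
    """
    total = sum(composite_distance(attrs[v], attrs[q], ranges, gamma) for v in nodes)
    return total / len(nodes)
```

With `gamma` close to 1 the metric behaves like a purely textual one, and with `gamma` close to 0 it is driven by numerical attributes such as ratings.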
II-B Problem Definition
Problem 1. (CS-AG problem) Given an attributed graph $G$, a query node $q$, and a parameter $k$, the Community Search over Attributed Graphs (CS-AG) problem returns a subgraph satisfying the following properties:
- Query participation. The subgraph contains $q$;
- Structure cohesiveness. The subgraph is a connected $k$-core;
- Attribute cohesiveness. The subgraph has the smallest $q$-centric attribute distance.
We prove that the CS-AG problem is NP-hard in §III and provide an exact baseline used in our experimental study (§VII). Since the exact baseline is time-consuming for large graphs, we define an approximate version, Approx-CS-AG, and present a sampling-estimation-based method with an accuracy guarantee in §V.
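Problem 2. (Approx-CS-AG problem) Given an attributed graph $G$, a query node $q$, a parameter $k$, a user-specified error bound $e$, and a confidence level $1-\alpha$, Approx-CS-AG returns an approximate community containing $q$ that satisfies: (1) the attribute distance of the exact community (shortened to $\delta^{*}$) is covered by a confidence interval of the attribute distance of the approximate community (shortened to $\tilde{\delta}$), where $\epsilon$ is the half width of the interval (Eq. 1), and (2) the relative error of $\tilde{\delta}$ w.r.t. $\delta^{*}$ is bounded by the user-specified error bound $e$ (Eq. 2).

$$\Pr\left(\tilde{\delta} - \epsilon \le \delta^{*} \le \tilde{\delta} + \epsilon\right) = 1 - \alpha \qquad (1)$$

$$\frac{|\tilde{\delta} - \delta^{*}|}{\delta^{*}} \le e \qquad (2)$$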
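The two conditions of Problem 2 interact in a simple way, which the short derivation below makes explicit. The symbols are assumptions chosen for illustration: $\tilde{\delta}$ is the estimated attribute distance, $\delta^{*}$ the exact one, $\epsilon$ the half width of the confidence interval, and $e$ the user-specified error bound. The resulting sufficient condition is a sketch and need not coincide in form with the condition stated later in the paper.

```latex
% Assume the interval covers the exact value and lies strictly above zero:
%   |\tilde{\delta} - \delta^{*}| \le \epsilon  and  \tilde{\delta} > \epsilon .
\begin{align*}
\delta^{*} \ge \tilde{\delta} - \epsilon > 0
  \;&\Longrightarrow\;
  \frac{|\tilde{\delta} - \delta^{*}|}{\delta^{*}}
  \le \frac{\epsilon}{\tilde{\delta} - \epsilon}, \\
\frac{\epsilon}{\tilde{\delta} - \epsilon} \le e
  \;&\Longleftrightarrow\;
  \epsilon\,(1 + e) \le e\,\tilde{\delta}
  \;\Longleftrightarrow\;
  \epsilon \le \frac{e\,\tilde{\delta}}{1 + e}.
\end{align*}
```

Under these assumptions, a narrow enough interval is sufficient to certify the relative-error bound without knowing $\delta^{*}$, which is what makes early termination possible.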
III Hardness Analysis
We first define the decision version of CS-AG. Next, we outline the idea of the reduction from the NP-hard R-set Maximum-Weight Connected Subgraph (R-MWCS) problem [35] to this decision version, and show the proof in Theorem 2.
Problem 3. (Decision version of CS-AG) Given an attributed graph $G$, a query node $q$, a parameter $k$, and a distance threshold, the problem checks whether there exists a connected $k$-core containing $q$ whose attribute distance is at most the threshold.
Generally, the decision version of a problem is easier than (or the same as) its optimization version [36]. Thus, if the decision version of CS-AG is NP-hard, then CS-AG is NP-hard. To achieve this, we first prove that the decision version of R-MWCS (defined below) is NP-hard. Then, we reduce it to the decision version of CS-AG to complete the proof.
Problem 4. (Decision version of R-MWCS) Given a graph , a node set , and node weights ( indicates the weight of , which could be any real number), the problem checks whether there exists a connected subgraph that contains , satisfying the graph weight .
Theorem 1
The R-MWCS problem is NP-hard.
Proof:
We reduce the NP-hard Steiner tree problem (decision version, shortened to STP) [37, 38] to R-MWCS. Given a graph , a node set , and a number , STP checks if there is a tree that contains and includes at most edges. We complete the proof following logic similar to that of [35]. First, we construct an instance graph via a polynomial-time transformation from . For each edge , we introduce a middle node to split into two edges , , and assign weights on as . We next prove that the instance of STP is a Yes-instance iff the instance of R-MWCS is a Yes-instance.
If a tree is a solution of STP, then it has . Since we split each edge in into two edges with a new middle node having a weight of 1 and other two original nodes having a weight of 0, the corresponding tree must have . is also a connected subgraph of , so that is a Yes-instance of R-MWCS.
If a subgraph is a solution of R-MWCS, then it has . Since is connected, a spanning tree still satisfies . For each node with weight of 1 in , it corresponds to an edge . This implies that the edges in the corresponding are at most . Thus, is a Yes-instance of STP. ∎
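The graph transformation used in this proof can be sketched in a few lines. The weights follow the description in the proof (weight 1 for each new middle node, weight 0 for original nodes); the edge-list representation and the function name `subdivide_edges` are assumptions for illustration.

```python
def subdivide_edges(edges):
    """Split every edge (u, v) into (u, m) and (m, v) with a fresh middle node m.

    Returns (new_edges, weights): original nodes get weight 0 and each middle
    node gets weight 1, so the weight of a connected subgraph of the new graph
    counts the middle nodes (i.e., original edges) it uses.
    """
    new_edges, weights = [], {}
    for u, v in edges:
        middle = ("mid", u, v)  # hypothetical naming scheme for middle nodes
        weights.setdefault(u, 0)
        weights.setdefault(v, 0)
        weights[middle] = 1
        new_edges.append((u, middle))
        new_edges.append((middle, v))
    return new_edges, weights
```

For instance, a path with two edges becomes a path with four edges and total weight 2, matching the correspondence between tree edges and middle nodes used in both directions of the proof.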
Theorem 2
The CS-AG problem is NP-hard.
Proof:
We reduce R-MWCS problem to CS-AG. Given a graph , we first construct an instance graph via a polynomial time transformation from . First, we add all nodes and edges of into . Second, we add additional nodes into and assign a unique node ID for all the nodes: (1) nodes from have IDs and (2) additional nodes have IDs . Third, we add additional edges for pair-wise nodes and , if ID ID (e.g., nodes with IDs of have edges). Hence, we ensure that each node in has at least neighbors. Finally, given a node set with only one query node , we assign weights on both and as follows: (1) has , where is the composite attribute distance between and . (2) with ID has , which equals to ’s original composite attribute distance to . (3) with ID , we assign with a same weight as its . In summary, considering , those nodes inherited from have weights as their original and others are configured with the same weight . Given with the query node and , and with and , we show that the instance of R-MWCS is a Yes-instance iff the instance of CS-AG is a Yes-instance.
If is a solution to the R-MWCS problem that satisfies , then the induced graph formed by nodes of is a connected -core , where includes nodes with IDs that have edges with nodes in . Since , we have (as ) and . Thus, . So, is a Yes-instance of CS-AG if is a Yes-instance of the R-MWCS.
Assume a connected -core containing is a solution to the CS-AG problem. involves part of nodes with IDs (i.e., ) and the rest of nodes with IDs (denoted by ). Given , we have , i.e., . Thus, we have (as ). So, is a Yes-instance of R-MWCS if is a Yes-instance of CS-AG. ∎
IV Exact Baseline
Given an attributed graph , we present an Exact method to solve CS-AG. Since the ground-truth community must be included in the maximal connected -core , we first find the maximal containing (§IV-A). Then, we enumerate all the containing with three pruning strategies (§IV-B), and return the one with the smallest .
IV-A Find the Maximal Connected $k$-core
We have two straightforward ways to obtain the maximal : one is the classic core decomposition [39] and the other is the search with expansion [7]. For the latter one, we start the search from $q$ and maintain up to neighbors for each explored node ; if does not have neighbors, then we delete and maintain all previously explored nodes’ degree up to if their degree is reduced to be after removing . We repeat this until all nodes have been explored. Whichever method is adopted, the essence is the same: recursively remove nodes with degree less than $k$ from the connected component of $q$. We implement the former one in Exact.
IV-B Enumeration with Pruning Strategies
Given the maximal , we continuously peel nodes from it to form a new candidate community and check its attribute distance. This enumeration is represented as a search tree, where each state indicates a and the root is initialized as . If includes nodes, then we can delete nodes iteratively (except ) to generate substates. Figure 3 illustrates a search tree starting from the in Figure 2 (c), given . For example, is obtained by deleting . A full enumeration is computationally expensive for a large , but not all should be visited and some of them can be pruned safely to improve the efficiency.
Prune for duplicate states. In Figure 3, we show duplicate states with the same color, e.g., the green states are generated from by deleting . A simple way to avoid duplicate states is to record all the visited states and check if each new state has been visited before. The main drawback is the extra memory required to maintain all visited states. To handle this, we present our first pruning strategy based on priority enumeration. More precisely, we enumerate states in a partial order w.r.t. every node’s composite attribute distance to $q$ (i.e., ), so that we can quickly decide to prune a state by simply evaluating between the node to be deleted and , without maintaining visited states.
(1) Priority enumeration. Given an arbitrary state, we enumerate its substates in a DFS manner by deleting nodes in descending order of their composite attribute distance to $q$. In Figure 3, we first enumerate the state by deleting as is larger than other nodes’ distance to $q$ (distance information is provided at the top of Figure 3).
![Refer to caption](x3.png)
Lemma 1
Given a state that includes two nodes and with , there must exist a substate generated by successively deleting and from .
This lemma holds as we enumerate states by deleting nodes in descending order of (i.e., priority enumeration). For example, (in Figure 3) has a substate with nodes after deleting , in turn ().
(2) Prune based on . Consider a state that is generated from its parent state by deleting a node . Suppose we try to enumerate a substate by deleting a current node ; then we can prune in the following cases.
Case 1. Let us consider a simple case, where deleting does not result in other nodes from having a degree less than , i.e., no other nodes would be removed after deleting . In this case, we can prune according to Theorem 3.
Theorem 3
Given a state that is obtained by deleting a node and a substate obtained by deleting only one node . If , then we can prune and its subsequent sub-states.
Proof:
According to Lemma 1, a previously visited state must exist that is caused by successively deleting and from because . This state is duplicated with generated by successively deleting and . Thus, we can directly prune if . ∎
Example 1
Given the yellow state (in Figure 3) including nodes that is obtained from by deleting , is the next node to be deleted to get a substate. Since , there exists a duplicate state that has been visited before, i.e., the one obtained by successively deleting , from . So, we can prune this substate.
Case 2. Let us consider a more general case where deleting results in other nodes from having a degree less than , i.e., additional nodes would be removed when is deleted. In this case, we can prune according to Theorem 4.
Theorem 4
Given a state that is obtained by deleting a node and a substate obtained by deleting multiple nodes besides , i.e., , where is the node with the maximal among all deleted nodes. If , then we can prune and its subsequent substates.
Proof:
Similar to Theorem 3, a previously visited state must exist that is caused by successively deleting and other nodes from , as . It is duplicated with the generated by successively deleting from . So, we can prune it when . ∎
Since Case 1 is a special case of Case 2 when , we only apply Theorem 4 in our implementation of Exact.
Prune for unnecessary states. We say a state is unnecessary to visit if the optimal community is guaranteed to be visited before this state, so that we can directly prune this state.
Theorem 5
Given a state with an attribute distance , the substates that are generated by deleting nodes with are unnecessary to visit and can be pruned.
Proof:
Suppose we enumerate a new substate of by deleting a node with . If ’s deletion only causes nodes with to be deleted, then this new substate’s attribute distance must , indicating it is not the optimal community. This implies that if the optimal community exists, then at least one node with would be deleted recursively after is deleted. According to Lemma 1, a state generated by successively deleting and must be visited previously as . So, we can prune based on Theorem 3-4. In summary, substates generated by deleting nodes with can be pruned. ∎
According to Theorem 5, we only need to enumerate substates of by deleting those nodes with .
Example 2
Prune for unpromising states. Another case where we can prune is when the lower bound of for all substates of is larger than the optimal so far. This implies that we cannot find a better community by digging deeper from .
(1) Compute the lower bound . Since a $(k+1)$-clique is the smallest $k$-core, each state (representing a connected $k$-core) in a search tree must have at least $k+1$ nodes ($q$ and $k$ other nodes). So, we obtain the $k$ nodes with the smallest composite attribute distance to $q$ from the current state (Eq. 3) and use the average over these $k+1$ nodes as the lower bound of the attribute distance (Eq. 4).
(3) |
(4) |
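The lower bound described above can be sketched as follows. The sketch assumes that the average is taken over $k+1$ nodes, namely $q$ (whose distance to itself is 0) and the $k$ other nodes closest to $q$; the function name and the dictionary `dist_to_q` are hypothetical.

```python
import heapq


def attribute_distance_lower_bound(dist_to_q, state_nodes, q, k):
    """Lower bound on the attribute distance of any substate of the given state.

    dist_to_q[v] is the composite attribute distance of node v to q.
    Assumption: the bound averages over q (distance 0) plus the k closest nodes.
    Returns None when the state has fewer than k nodes besides q.
    """
    others = [dist_to_q[v] for v in state_nodes if v != q]
    if len(others) < k:
        return None
    closest = heapq.nsmallest(k, others)
    return sum(closest) / (k + 1)
```

Since every connected $k$-core containing $q$ keeps at least $k$ other nodes, no substate can average below the value obtained from the $k$ closest ones under this assumption.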
Lemma 2
Given an arbitrary substate , its attribute distance must be lower bounded by , i.e., .
This naturally holds as any containing at least one node with would increase .
(2) Prune based on . For each visited state , we record its to update the optimal so far, compute the lower bound , and decide to prune according to Theorem 6.
Theorem 6
Given a state , we say it is unpromising to find a better state with a smaller than the optimal and can be safely pruned, if .
Proof:
Given and Lemma 2, we have . So, we can prune safely. ∎
Example 3
In Figure 3, suppose we are now at state and the current optimal community is formed by nodes with . The lower bound for all substates of is , so we can prune safely.
Combine three pruning strategies together. Algorithm 1 shows the whole procedure of enumeration with three pruning strategies. We study the effect of prunings on Exact’s efficiency in §VII-C. Given the maximal connected -core and a query node , we first initialize the optimal community and attribute distance as and (line 1). Since is the root of enumeration, it doesn’t have a previously deleted node ( is used to indicate that the current is obtained from its parent state by deleting ). Thus we configure as positive infinity (lines 2-3) and use it in the prune for duplicated states. We next enumerate from as follows: (1) We compute the lower bound of attribute distance for all substates of via Eq. 3-4, and then prune all unpromising substates if (lines 6-8: Theorem 6). (2) For each that was not pruned before, we record all nodes with from in a max-heap , and prune all unnecessary substates by deleting nodes from (lines 10-11: Theorem 5). (3) Given a , we pop node with the largest and maintain a new connected -core by deleting from . During the -core maintenance, we also record node with the largest that is recursively deleted after is removed (lines 12-13). If , then we can prune this duplicated state according to Theorem 4 (lines 14-15). Otherwise, we update the optimal and if necessary (lines 17-18), update the previously deleted node as the current deleted node (line 19), keep enumerating from this new (line 20), and finally return the optimal community (line 5).
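A simplified sketch of the enumeration is given below. It is not Algorithm 1: duplicates are avoided with the simple visited-set approach mentioned earlier rather than with priority-based pruning, and only the lower-bound pruning for unpromising states is applied. It also assumes that the attribute distance averages over all nodes of a state, with $q$ contributing 0. All names are hypothetical.

```python
import heapq


def _k_core_with_q(adj, nodes, q, k):
    """Connected k-core containing q inside the subgraph induced by nodes."""
    alive = set(nodes)
    changed = True
    while changed:  # peel nodes whose degree inside `alive` is below k
        changed = False
        for v in list(alive):
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    if q not in alive:
        return frozenset()
    component, stack = {q}, [q]
    while stack:  # keep the connected component of q
        v = stack.pop()
        for u in adj[v]:
            if u in alive and u not in component:
                component.add(u)
                stack.append(u)
    return frozenset(component)


def exact_search(adj, dist, q, k):
    """Return (best_nodes, best_distance); dist[v] is v's distance to q."""
    def delta(state):
        return sum(dist[v] for v in state if v != q) / len(state)

    root = _k_core_with_q(adj, adj.keys(), q, k)
    if not root:
        return frozenset(), float("inf")
    best, best_delta = root, delta(root)
    visited, stack = {root}, [root]
    while stack:
        state = stack.pop()
        d = delta(state)
        if d < best_delta:
            best, best_delta = state, d
        # Unpromising state: no substate can beat the best distance so far.
        others = [dist[v] for v in state if v != q]
        lower_bound = sum(heapq.nsmallest(k, others)) / (k + 1)
        if lower_bound >= best_delta:
            continue
        for v in state - {q}:
            sub = _k_core_with_q(adj, state - {v}, q, k)
            if sub and sub not in visited:
                visited.add(sub)
                stack.append(sub)
    return best, best_delta
```

The visited set trades memory for simplicity, which is exactly the overhead that the priority-based duplicate pruning is designed to avoid.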
IV-C Complexity Analysis
For the exact baseline without pruning strategies, the time complexity in the worst case is . The first term is the time of finding the maximal connected $k$-core from the entire graph via core decomposition [39]. The second term is the time complexity of enumeration over , where is the total number of states in the search tree and is the time for core maintenance on each state. Since the first pruning strategy prunes the duplicated states, the second term can be reduced to . The last two pruning strategies further reduce the number of states, so we introduce a constant ( in practice) to represent the time as .
Example 4
In DBLP, considering the community search with , the average # states for 200 random queries when using the first pruning strategy is nearly . After applying the last two pruning strategies, the # states is reduced to . In this case, we have and the running time is reduced from 694882s to 3964s.
V Sampling-Estimation Solution
We next present a sampling-estimation-based approximate method to improve Exact’s efficiency from two aspects: reducing the size of the maximal based on sampling and terminating the enumeration early if a reliable accuracy (in the form of a confidence interval) is obtained based on estimation.
Figure 4 shows the pipeline with three steps: (1) Sampling-based maximal finding (§V-A). We determine the minimum population for sampling through Hoeffding Inequality [40], collect a set of samples (nodes) via attribute-aware sampling, and take the maximal from the induced graph of as the input for estimation. (2) Estimation with accuracy guarantee (§V-B). We estimate the of each candidate community through Bag of Little Bootstraps [41] and terminate early when an accurate enough result is obtained. Otherwise, we iteratively find and estimate another candidate by greedy search. (3) Error-based incremental sampling (§V-C). If we cannot find a satisfactory community, then we enlarge via error-based incremental sampling and repeat steps 1-2.
![Refer to caption](x4.png)
V-A Sampling-based Maximal Finding
A straightforward idea is to collect a set of nodes (as samples ) that are similar to the query node from the entire graph . Given a node , we can compute ’s sampling probability as a normalized value (Eq. 5). The smaller the attribute distance , the greater the .
(5) |
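A minimal sketch of one possible normalization is shown below. The scoring rule (similarity taken as 1 minus the distance, then divided by the total) is an assumption chosen only because it matches the stated property that a smaller distance yields a larger probability; the actual form of Eq. 5 may differ. The function name is hypothetical.

```python
def sampling_probabilities(dist_to_q):
    """Map each node to a sampling probability that decreases with its distance to q.

    Assumption: distances lie in [0, 1] and the unnormalized score is 1 - distance.
    """
    scores = {v: 1.0 - d for v, d in dist_to_q.items()}
    total = sum(scores.values())
    if total == 0:
        # All nodes are maximally distant: fall back to a uniform distribution.
        return {v: 1.0 / len(scores) for v in scores}
    return {v: s / total for v, s in scores.items()}
```

Any other monotonically decreasing score could be substituted without changing the rest of the sampling step.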
The biggest issue is that the sampling population (i.e., ) is too large to guarantee the sample quality. According to small world theory [42, 43], two nodes in the same cohesive community exhibit strong access locality [44]. This implies that nodes from the neighborhood of $q$ are more likely to belong to the community of $q$. With this in mind, we prefer to collect samples from $q$’s neighborhood (denoted by ) instead of the entire graph. The size of (i.e., # nodes in ) is important: a large (small) may contain more (fewer) irrelevant (relevant) nodes w.r.t. $q$, thereby affecting the sample quality [45, 27, 28]. A critical problem is how to set an appropriate size of in order to bound the sample quality. A straightforward choice is to define an -hop of . However, this is problematic, as an empirical cannot adapt to the various of the $k$-core over different datasets. Hence, we resort to Hoeffding Inequality to determine the minimum size of .
Minimum size of . Given a graph and a query node , we first introduce the existence probability of a node that belongs to ’s neighborhood , denoted by (discussed later), which is computed based on the node’s sampling probability . The larger the , the higher the probability that is included in a specific . Suppose the ground-truth community of CS-AG is . Let be the node with the smallest existence probability . We expect to find the minimum size of that every node with would be included in with a large probability. In this way, sampling from can be viewed as a good approximation to sampling from , because includes sufficient relevant nodes w.r.t. (i.e., nodes from with ) and the least irrelevant nodes w.r.t. (i.e., minimizing ). In the following, we leverage possible world semantics [46, 6, 47] to compute a node’s existence probability, then we apply Hoeffding Inequality to the existence probability to find the minimum size of .
Possible worlds w.r.t. . Given a graph and a query node $q$, we can easily obtain various neighborhoods of $q$ (e.g., different with different sizes) by collecting nodes based on their normalized sampling probabilities (Eq. 5). According to possible world semantics, each can be seen as a possible world. Thus, we can measure an edge’s existence probability in a possible world as Eq. 6, which implies that an edge exists in a only if its two end nodes and are sampled simultaneously.
(6) |
Given the edge’s existence probability above, we can measure a specific ’s existence probability as follows.
(7) |
Given the aforementioned analysis, we show the probability of a node that belongs to ’s neighborhood is equal to the aggregate probability over all possible worlds [45] (Eq. 8). represents all possible worlds, and is an indicator function denoting if belongs to () or not.
(8) |
Hoeffding Inequality. We next leverage to determine the minimum size of , ensuring that all nodes of the ground-truth community are included in with a large probability.
Theorem 7
(Hoeffding Inequality [40] w.r.t. ) Given a set of possible worlds and an estimation error , let be the estimation of , where and (,). Then, we have the following inequality.
(9) |
This theorem provides an upper bound on the probability that has a large estimation error w.r.t. . The larger the (# possible worlds), the smaller the upper bound, showing that holds with a larger probability. Based on Theorem 7, we have the following theorem.
Theorem 8
Given a set of possible worlds , , and , , if , then we have
(10) |
Proof:
We consider as the estimator of with . Then we have the following derivation by applying Eq. 9 to . ∎
Theorem 8 bounds the order of a pair of nodes. That is, if we have for two nodes, then their estimated existence probabilities would satisfy with a probability . More precisely, given the assumption of , the probability of is upper bounded by . The smaller the upper bound, the higher the probability of , showing that is more likely to be included in than , given .
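To make the role of the number of possible worlds concrete, the following sketch evaluates the textbook two-sided Hoeffding bound for the mean of i.i.d. variables in [0, 1] and inverts it to obtain a sample count. This is only the generic form of the inequality; the constants in Eq. 9 may differ, and the function and parameter names (`hoeffding_upper_bound`, `min_worlds`, `eps`, `delta`) are hypothetical.

```python
import math

def hoeffding_upper_bound(num_worlds: int, eps: float) -> float:
    """Textbook bound on P(|estimate - truth| >= eps) for [0, 1]-valued i.i.d. samples."""
    return min(1.0, 2.0 * math.exp(-2.0 * num_worlds * eps ** 2))

def min_worlds(eps: float, delta: float) -> int:
    """Smallest N such that 2 * exp(-2 * N * eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

if __name__ == "__main__":
    n = min_worlds(eps=0.05, delta=0.05)
    print(n, hoeffding_upper_bound(n, 0.05))
```

The bound decays exponentially in the number of worlds, which is why a moderate number of possible worlds already yields a high-probability guarantee.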
Given the ground-truth community of CS-AG, let be the node with the smallest existence probability and there are nodes in with existence probabilities . Then, we can use Theorem 8 to derive the minimum number of possible worlds (i.e., ) w.r.t. that ensures all nodes can be contained in with at least probability, as Theorem 9 shows. Specifically, Theorem 9 bounds the order of pairs of nodes by applying Union Bound and Theorem 8, where is the number of nodes in .
Theorem 9
Given a desired probability of and , it requires possible worlds to ensure that contains all nodes (from ) with existence probability , where is the node with the smallest existence probability in the ground-truth .
Proof:
Since we expect all nodes with would be contained in with at least probability, we need to ensure that every node from such nodes has a larger than that of other nodes, i.e., we need to bound the order of pairs of nodes. We have the following derivation through the Union Bound and Theorem 8. ∎
We next show how to compute the minimum size of (Theorem 10) based on the minimum given in Theorem 9.
Theorem 10
Given the ground-truth community of CS-AG, in the worst case, we require at least nodes from the original to form , so that all nodes in would be contained in with a probability of .
Proof:
A -core has at least nodes, which means that at least nodes should be contained in . Thus, we need to compare at least pairs of nodes (i.e., ), which indicates that we need at least possible worlds w.r.t. (Theorem 9). In the worst case, each possible world w.r.t. can be an individual edge between and another node. Thus, we require at least nodes to form the final . ∎
Example 5
Given the DBLP with nodes, , , and , it requires at least nodes to form a .
We next conduct a BFS starting from the query node to form . In this BFS, we preferentially expand the search from those nodes having smaller composite attribute distances to , until the minimum size of is reached. In §VII-G, we investigate the effect of and on CS’s performance.
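The prioritized expansion described above can be sketched as a best-first search driven by a min-heap. This is a minimal illustration rather than the authors' implementation: the adjacency dictionary `adj`, the precomputed distance map `dist_to_q`, and the function name `build_neighborhood` are all assumptions.

```python
import heapq

def build_neighborhood(adj, dist_to_q, q, min_size):
    """Expand from q, always taking the frontier node closest to q in attribute
    distance, until min_size nodes are collected or the component is exhausted."""
    collected = [q]
    seen = {q}
    heap = []

    def push_neighbors(u):
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                heapq.heappush(heap, (dist_to_q[v], v))

    push_neighbors(q)
    while heap and len(collected) < min_size:
        _, u = heapq.heappop(heap)
        collected.append(u)
        push_neighbors(u)
    return collected

if __name__ == "__main__":
    adj = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
    dist = {0: 0.0, 1: 0.5, 2: 0.1, 3: 0.05, 4: 0.3}
    print(build_neighborhood(adj, dist, q=0, min_size=3))
```

Unlike a plain hop-by-hop BFS, this ordering may reach a similar node two hops away before a dissimilar direct neighbor.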
Attribute-aware sampling over . We perform attribute-aware sampling over the population as follows. We compute the sampling probabilities of all nodes in based on their composite attribute distances to by replacing by in Eq. 5. We randomly collect samples (i.e., nodes) from according to their . We initialize the sample size as a fraction of nodes in , i.e., , and update with an appropriate if necessary (discussed in §V-C). In §VII-G, we show the parameter sensitivity of .
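A minimal sketch of distance-weighted sampling without replacement follows. Eq. 5 is not reproduced in this section, so the weight `1 / (smoothing + distance)` is purely an illustrative assumption, as are the names `attribute_aware_sample`, `fraction`, and `smoothing`. The selection uses the Efraimidis-Spirakis key trick, a standard technique for weighted sampling without replacement.

```python
import math
import random

def attribute_aware_sample(dist_to_q, fraction, seed=0, smoothing=1e-3):
    """Draw ceil(fraction * population) distinct nodes, favoring small distances."""
    rng = random.Random(seed)
    # Assumed weighting: closer nodes receive larger weights (not the paper's Eq. 5).
    weights = {v: 1.0 / (smoothing + d) for v, d in dist_to_q.items()}
    total = sum(weights.values())
    probs = {v: w / total for v, w in weights.items()}
    size = max(1, math.ceil(fraction * len(probs)))
    # Efraimidis-Spirakis: give each node the key u^(1/p) and keep the largest keys.
    keys = {v: rng.random() ** (1.0 / p) for v, p in probs.items()}
    ranked = sorted(keys, key=keys.get, reverse=True)
    return ranked[:size], probs

if __name__ == "__main__":
    dist = {v: v / 10 for v in range(10)}
    sample, probs = attribute_aware_sample(dist, fraction=0.5, seed=42)
    print(sample)
```

Fixing the seed makes a run reproducible, which is convenient when comparing different sampling fractions.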
Find the maximal connected -core. We maintain a maximal connected -core from the induced graph of samples and take it as input of the estimation step (§V-B) to find an approximate community of for Approx-CS-AG problem.
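Finding the maximal connected -core on the sampled nodes amounts to standard degree peeling followed by a connectivity check from the query node. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets; the function name `connected_k_core` is hypothetical.

```python
from collections import deque

def connected_k_core(adj, nodes, q, k):
    """Return the connected k-core containing q in the subgraph induced by nodes,
    or an empty set if q does not survive the peeling."""
    alive = set(nodes)
    deg = {v: sum(1 for u in adj[v] if u in alive) for v in alive}
    queue = deque(v for v in alive if deg[v] < k)
    while queue:
        v = queue.popleft()
        if v not in alive:          # duplicates in the queue are skipped here
            continue
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    if q not in alive:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in alive and u not in component:
                component.add(u)
                stack.append(u)
    return component

if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (3, 6),
             (4, 5), (4, 6), (5, 6), (0, 7)]
    adj = {v: set() for v in range(8)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    print(connected_k_core(adj, adj.keys(), q=3, k=3))
```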
V-B Estimation with Accuracy Guarantee
Given the maximal , a user-input error bound , and a confidence level , Approx-CS-AG problem aims to find an approximate community with an attribute distance satisfying with a probability of (Eq. 1-2). (1) We provide a confidence interval CI at level to quantify the quality of based on Central Limit Theorem (CLT) [27, 28], and apply Bag of Little Bootstraps (BLB) to compute the Margin of Error (MoE) of CI. (2) We return when a tight CI (i.e., , Theorem 11) is obtained. (3) Otherwise, we greedily remove the most dissimilar node from to get a new candidate and repeat the above. If we cannot find a good , we repeat steps (1)-(3) with enlarged samples (§V-C).
Confidence interval calculation. Recall the attribute distance is the average composite attribute distance of all nodes from , where of is considered as a random variable with the sampling probability of . From CLT, we know that a mean-like point estimator follows a normal distribution [48]. So, we have , and the MoE of CI at level can be calculated based on CLT as , where is the normal critical value with right-tail probability (obtained from a standard normal table). We next use BLB to estimate .
Bag of little bootstraps. Bootstrap [49] provides an automatic and widely applicable means of quantifying estimator quality [50]. Though it is simple and powerful, it requires computing the estimators on resamples having a size comparable to the original data. If the original data are large, then bootstrap is costly. Thus, we resort to BLB [50], which incorporates features of both bootstrap and subsampling, for high-quality estimation with a quite small sample. Given the approximate community , we do BLB as follows: (1) We collect small subsamples from ; each has a size of , where is the scale factor used in [50] to ensure . We use to indicate all subsamples for BLB estimation. (2) For each , BLB estimates by a standard bootstrap (given below) and computes an MoE . (3) Given MoEs , BLB computes the final .
Bootstrap. Given a subsample , a standard bootstrap first collects resamples having size from with replacement. Then, it computes for each resample as . Next, it takes the empirical distribution of as an approximation to , so we estimate by Eq. 11.
(11) |
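The BLB procedure above can be sketched for a plain (unweighted) mean as follows. This is a simplified illustration under stated assumptions: nodes are weighted uniformly rather than by their sampling probabilities, the spread of the estimator is taken as the standard deviation of the resampled means (which may differ in detail from Eq. 11), and the names `blb_margin_of_error`, `gamma`, `num_subsamples`, and `num_resamples` are hypothetical.

```python
import random
import statistics
from statistics import NormalDist

def blb_margin_of_error(values, gamma=0.6, num_subsamples=5,
                        num_resamples=50, confidence=0.95, seed=0):
    """Estimate the margin of error of the mean of values with BLB."""
    rng = random.Random(seed)
    n = len(values)
    b = max(2, min(n, int(n ** gamma)))          # small subsample size n^gamma
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    moes = []
    for _ in range(num_subsamples):
        subsample = rng.sample(values, b)        # drawn without replacement
        means = []
        for _ in range(num_resamples):
            # Each resample has the full size n but only b distinct values.
            resample = rng.choices(subsample, k=n)
            means.append(statistics.fmean(resample))
        moes.append(z * statistics.stdev(means))
    return statistics.fmean(moes)                # average the per-subsample MoEs

if __name__ == "__main__":
    data = [((i * 37) % 100) / 100 for i in range(200)]
    print(blb_margin_of_error(data))
```

Because every resample contains at most `b` distinct values, an efficient implementation can store multiplicity counts instead of materializing all `n` points, which is the main cost advantage of BLB over a standard bootstrap.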
Accuracy guarantee. Given a CI = , we ensure that the relative error of is bounded by a user-input error bound .
Theorem 11
If the MoE satisfies , then the relative error is upper bounded by with a probability of .
Proof:
We prove this theorem in the following two steps.
Step 1. Suppose that the exact lies in the CI’s right half-width, i.e., . Then we have the following derivation and holds if (i.e., ).
Step 2. Suppose that the exact lies in the CI’s left half-width, i.e., . Then holds if (i.e., ) by the following derivation.
In summary, holds if , as is a tighter bound (). Since our CI of has a confidence level , the above holds with a probability of . ∎
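The two cases in the proof suggest a simple termination test. The sketch below assumes the condition has the shape MoE ≤ estimate · e / (1 + e), which is the tighter of the two bounds that a two-sided argument of this kind produces; the exact expression should be taken from Theorem 11 itself. The names `moe_is_tight` and `worst_case_relative_error` are hypothetical.

```python
def moe_is_tight(estimate: float, moe: float, error_bound: float) -> bool:
    """Assumed termination test: moe <= estimate * e / (1 + e)."""
    return moe <= estimate * error_bound / (1.0 + error_bound)

def worst_case_relative_error(estimate: float, moe: float) -> float:
    """Largest |estimate - f| / f over all true values f in
    [estimate - moe, estimate + moe]; it is attained at the left endpoint."""
    if moe >= estimate:
        raise ValueError("the interval must stay strictly positive")
    return moe / (estimate - moe)

if __name__ == "__main__":
    est, e = 0.3, 0.05
    for moe in (0.014, 0.015):
        print(moe, moe_is_tight(est, moe, e), worst_case_relative_error(est, moe))
```

Whenever the test passes, the worst-case relative error over the whole interval stays within the error bound, which is the property the theorem relies on.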
Greedy search of candidate communities. If the accuracy guarantee (Theorem 11) is not satisfied, we keep enumerating from to get another candidate for further estimation. A straightforward method is to apply our enumeration with prunings (§IV-B) to enumerate a new candidate and do BLB estimation until Theorem 11 holds. Given the premise of finding an approximate community, we can simplify it to a greedy candidate search (without backtracking) by deleting the most dissimilar node at each state. Specifically, given a current , we delete the node with the largest composite attribute distance to , then we maintain the maximal of the remaining nodes as the next candidate and do BLB estimation for it. We terminate the search once Theorem 11 holds.
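The greedy loop can be sketched as below. It is an illustration only: the accuracy test is abstracted as a caller-supplied predicate `is_good` (standing in for the BLB estimation plus the Theorem 11 check), core maintenance is done by simple repeated peeling rather than an incremental algorithm, and all function names are hypothetical.

```python
def k_core_component(adj, nodes, q, k):
    """Connected k-core containing q inside the induced subgraph (empty if none)."""
    alive = set(nodes)
    changed = True
    while changed:                      # naive peeling, adequate for a sketch
        changed = False
        for v in list(alive):
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.discard(v)
                changed = True
    if q not in alive:
        return set()
    component, stack = {q}, [q]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in alive and u not in component:
                component.add(u)
                stack.append(u)
    return component

def greedy_search(adj, dist_to_q, q, k, start_nodes, is_good):
    """Repeatedly drop the node farthest from q (never q itself), re-establish the
    connected k-core, and stop at the first candidate accepted by is_good."""
    candidate = k_core_component(adj, start_nodes, q, k)
    while candidate:
        if is_good(candidate):
            return candidate
        removable = [v for v in candidate if v != q]
        if not removable:
            break
        worst = max(removable, key=lambda v: dist_to_q[v])
        candidate = k_core_component(adj, candidate - {worst}, q, k)
    return None

if __name__ == "__main__":
    adj = {v: {u for u in range(5) if u != v} for v in range(5)}   # a 5-clique
    dist = {0: 0.0, 1: 0.1, 2: 0.2, 3: 0.8, 4: 0.9}
    good = lambda c: sum(dist[v] for v in c) / len(c) <= 0.15
    print(greedy_search(adj, dist, q=0, k=2, start_nodes=adj.keys(), is_good=good))
```

Since there is no backtracking, the loop performs at most one removal per node, which keeps the number of BLB estimations linear in the size of the initial candidate.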
V-C Error-based Incremental Sampling
If we cannot find a good , then we should enlarge the samples . Intuitively, we need a large when the MoE is large. Otherwise, a small is sufficient. So, we present a method to automatically set based on .
Consider an MoE . We use to denote how far is away from the desired value . The larger is, the more nodes requires. Ideally, if we can reduce to a new by at least times, we can satisfy . Since , reducing by times is equivalent to reducing by times. Since according to CLT, where is the standard deviation of the population, reducing by times is equivalent to increasing by times. In summary, we can increase by times to reduce by times. Hence, we derive as follows.
(12) |
Example 6
Given a CI = with = , = , and = . If we set the scale factor = and error bound = , then we need = to update . While for a large , we then require to update .
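The scaling argument above (the MoE shrinks with the square root of the sample size) can be turned into a small helper. This sketch captures only the basic CLT relation and ignores the BLB scale factor that Eq. 12 also accounts for, so it should be read as an approximation of the idea rather than the paper's formula; `additional_samples` and its parameters are hypothetical names.

```python
import math

def additional_samples(current_size: int, moe: float, target_moe: float) -> int:
    """Extra samples needed so that an MoE proportional to 1/sqrt(size)
    drops from moe to target_moe."""
    if moe <= target_moe:
        return 0
    ratio = moe / target_moe            # how many times the MoE must shrink
    return math.ceil(current_size * (ratio ** 2 - 1.0))

if __name__ == "__main__":
    print(additional_samples(1000, moe=0.02, target_moe=0.01))
```

Halving the MoE therefore requires roughly four times the current sample, i.e., three times as many additional samples.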
V-D Complexity Analysis
The total time of our approximate solution consists of sampling time () and estimation time (). For sampling, we first require a BFS to get a neighborhood graph of with a time of , then we collect samples from and find the maximal connected -core from the induced graph with a time of . So, we get . For estimation, we introduce a constant to indicate the number of iterations of accuracy estimation until the termination condition (Theorem 11) is reached ( in practice). In each iteration, we greedily enumerate from by deleting one node with the most dissimilar attribute distance; for each explored state, we perform BLB estimation over subsamples. So, the total time of BLB estimation is . If we cannot find a good community, then we should include additional samples for the next iteration of accuracy estimation. Thus, the total time is .
VI Extensions
We extend our sampling-estimation solution to three more general scenarios: (1) CS on heterogeneous graphs, (2) Size-bounded CS, and (3) CS with different community models.
VI-A Extension to Heterogeneous Graphs
A heterogeneous graph consists of a node (edge) set () with multiple node types (). For , it has a node type . For , it has an edge type . The meta-path is often used to indicate a specific relationship between two nodes with the same type. For example, in DBLP, A-P-A shows the co-authorship w.r.t. a paper between two authors. We call nodes with the type linked by as target nodes (e.g., authors for A-P-A) and we aim to find an approximate -core community [7, 27, 12] of target nodes from satisfying the constraints of Approx-CS-AG problem. We refer readers to [7, 11, 12] for more details of heterogeneous graphs and meta-paths. We extend our method to this scenario with three modifications: (1) We replace # nodes in Theorem 10 with # target nodes of to compute the minimum size of . (2) We construct by a -neighbor-oriented BFS from the query node , which expands the search by exploring a node’s -neighbors. We say two target nodes are -neighbors if they are connected by a path instance of . (3) We perform BLB estimation on the community of target nodes, using the attribute distance computed by target nodes’ to .
Besides the basic -core, there are several variants of -core. For example, (1) -truss [51] is an extension of -core; we can support it by the same method as above but change the core-maintenance to truss-maintenance during the BLB estimation. (2) Heterogeneous influential community (HIC), proposed in [52], aims to identify a -core community such that there is no other community whose influence vector dominates the influence vector of . The dominance relationship is defined the same as in skyline queries, that is, for each element in , it must hold that . We may support HIC with a modification to the BLB estimation, i.e., estimating the MAX value of each element in the influence vector of an approximate community . More precisely, we may resort to Extreme Value Theory (EVT) [53] to conduct EVT-based MAX value estimation [27] for each element in the influence vector of an approximate community.
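For the A-P-A meta-path, the neighbor relation used by the modified BFS can be sketched as a two-step lookup. The two dictionaries `author_papers` and `paper_authors` and the function name `apa_neighbors` are assumptions made for illustration; other meta-paths would chain more lookups in the same way.

```python
def apa_neighbors(author_papers, paper_authors, author):
    """Authors connected to the given author by at least one A-P-A path instance."""
    neighbors = set()
    for paper in author_papers.get(author, ()):
        neighbors.update(paper_authors.get(paper, ()))
    neighbors.discard(author)           # an author is not their own neighbor
    return neighbors

if __name__ == "__main__":
    author_papers = {"a1": ["p1", "p2"], "a2": ["p1"], "a3": ["p2"], "a4": ["p3"]}
    paper_authors = {"p1": ["a1", "a2"], "p2": ["a1", "a3"], "p3": ["a4"]}
    print(apa_neighbors(author_papers, paper_authors, "a1"))
```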
VI-B Extension to Size-bounded CS
The community’s size is critical to some applications [29, 54]. Many applications naturally require that the number of members in a community should fall within a certain range, e.g., organizing a workshop with at least attendees and no more than attendees. This motivates a size-bounded CS [29] to find a community with size . We extend our sampling-estimation solution to size-bounded CS with three modifications: (1) We require at least nodes to construct , because the desired community’s size is lower-bounded by . Thus, we replace in Theorem 10 with to get the new minimum size of . (2) We ignore the candidate communities with size during the estimation and stop the greedy search of candidates when we have a size . (3) We terminate the estimation early when we get a community having the size and the MoE (Theorem 11).
VI-C Extension to Various Community Models
Besides -core, -truss [16] is another popular model to measure a community’s structure cohesiveness. According to [21], it is widely-recognized that -truss shows a higher structure cohesiveness but is less efficient than -core. Users may choose an appropriate model based on their actual demands. We extend our sampling-estimation solution to be adaptive to -truss model with three modifications: (1) For a -truss, every node must have at least neighbors, indicating it is a -1-core. So, it has at least nodes and we update the minimum size of as (Theorem 10). (2) Given the induced graph of , we find the maximal connected -truss from it instead of connected -core, as input of BLB estimation. (3) During the estimation, we maintain a connected -truss as a candidate community instead of -core.
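For reference, a -truss can be computed by repeatedly deleting edges that participate in fewer than k − 2 triangles. The sketch below uses naive repeated passes for clarity (practical implementations maintain edge supports incrementally); the function name `k_truss_edges` is hypothetical and node identifiers are assumed to be mutually comparable.

```python
def k_truss_edges(adj, k):
    """Edges of the k-truss: every remaining edge lies in at least k - 2 triangles."""
    graph = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(graph):
            for v in list(graph[u]):
                # The support of edge (u, v) is the number of common neighbors.
                if u < v and len(graph[u] & graph[v]) < k - 2:
                    graph[u].discard(v)
                    graph[v].discard(u)
                    changed = True
    return {(u, v) for u in graph for v in graph[u] if u < v}

if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
    adj = {v: set() for v in range(6)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    print(sorted(k_truss_edges(adj, 4)))
```

In the example, the attached triangle survives as a 3-truss but is removed for k = 4, leaving the 4-clique in which every node has k − 1 = 3 neighbors.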
Datasets | #Nodes | #Edges | #N-types | #E-types | Max degree | Avg degree | Max coreness | Avg coreness |
Facebook | 4,039 | 88,234 | 1 | 1 | 1,045 | 43.69 | 117 | 22.44 |
GitHub | 37,700 | 289,003 | 1 | 1 | 9,458 | 15.33 | 36 | 7.12 |
Twitch | 168,114 | 6,797,557 | 1 | 1 | 35,259 | 80.86 | 151 | 36.72 |
LiveJournal | 3,997,962 | 34,681,189 | 1 | 1 | 14,815 | 17.34 | 362 | 7.84 |
Twitter-2010 | 21,297,772 | 265,025,810 | 1 | 1 | 698,112 | 24.88 | 1,695 | 12.71 |
DBLP | 682,819 | 1,951,209 | 4 | 6 | 345 | 3.75 | 28 | 2.64 |
IMDB | 2,875,685 | 9,705,602 | 4 | 24 | 591 | 4.42 | 552 | 4.37 |
DBpedia | 4,521,912 | 15,045,801 | 359 | 676 | 6,760 | 289.79 | 422 | 149 |
Freebase | 5,706,539 | 48,724,743 | 11,666 | 5,118 | 467 | 5.64 | 60 | 2.75 |
YAGO | 7,308,072 | 36,624,106 | 6,543 | 101 | 285 | 5.31 | 44 | 2.61 |
VII Experiments
We provide the experimental study on ten real-world datasets. Our code [55] was implemented in Java 1.8 and run on a 2.1 GHz, 64 GB memory AMD-6272 Linux server. Our evaluation seeks to answer the following questions.
Q1: How do our Exact and approximate solutions (§IV-V) perform in effectiveness and efficiency? (§VII-B-VII-C)
Q2: What is the effect of pruning strategies? (§VII-D)
Q3: How scalable is the approximate method? (§VII-E)
Q4: How does our method find an approximate community iteratively on real-world datasets? (a case study in §VII-F)
Q5: How do parameters (discussed in §V) affect the approximate method’s effectiveness and efficiency? (§VII-G)
VII-A Experimental Setup
Datasets. Table I summarizes some statistics, e.g., the maximum (average) coreness () and degree (), of 5 homogeneous and 5 heterogeneous graphs. (1) Facebook [56], (2) GitHub [57], (3) Twitch [58], (4) LiveJournal [44], and (5) Twitter-2010 [59] are social networks. (6) DBLP [60] provides relationships among authors, papers, venues, etc. Each author has several attributes, e.g., research interests, # publications, h-index, and # citations. (7) IMDB [61] provides relationships among actors, directors, and movies, with attributes such as category and # movies for actors, and genres and ratings for movies. (8) DBpedia [62], (9) Freebase [63], and (10) YAGO [64] are well-known knowledge graphs. Similar to [27], we add attributes for several types of nodes via web crawling.
Queries. We generated 200 queries for each graph. For homogeneous graphs, we follow [22] to form a query with a random query node. For heterogeneous graphs, we generate a query the same as [7]. We first obtain the top-10 meta-paths with the highest frequencies. A meta-path ’s frequency is measured by its # path instances. The more the path instances, the higher the frequency of . We next form a query with a randomly selected and a query node with the type linked by .
Metrics. We use the attribute distance and the relative error of w.r.t. the ground-truth (obtained by Exact) to evaluate the effectiveness. We evaluate the efficiency by response time. We report the average result of 200 queries in each test.
Methods. We implemented Exact and Sampling-Estimation-based Approximate solution (SEA) for -core (default) and -truss: (1) Exact, (2) Exact-Truss, (3) SEA, and (4) SEA-Truss. We compare ours with representative CS methods using various attribute cohesiveness metrics: (5) LocATC-Core and (6) LocATC-Truss, the fastest local version of ATC [3] atop -core and -truss, which are two approximate methods. (7) ACQ-Core [22] is an exact core-based method. (8) VAC-Core is the core-based version of the truss-based (9) VAC-Truss [5]; both are approximate methods. (10) E-VAC-Core and (11) E-VAC-Truss are the corresponding exact VAC methods, also from [5]. Ours (1)-(4) support both types of graphs, while (5)-(11) are designed for homogeneous graphs. We convert a graph from heterogeneous to homogeneous given a meta-path, then invoke (5)-(11) to find communities for heterogeneous graphs. Note that we only report the results of E-VAC-Core for small graphs, i.e., Facebook and GitHub, because it cannot finish within one week for large graphs [5].
Parameters. The default parameters are: , and for Hoeffding Inequality, is the initial sampling fraction, and for accuracy guarantee. We show the parameter sensitivity in §VII-G.
Remark. Since some datasets provide human-annotated ground-truth (HA-GT) communities, e.g., Facebook, LiveJournal, Orkut [65], and Amazon [66], we also evaluate the effectiveness w.r.t. HA-GT using F1-score as the metric (the same as [3]). The higher the F1-score, the more similar a community is to the HA-GT, showing that a community with strict structure and attribute cohesiveness constraints can reflect the characteristics of real communities to some extent.
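For clarity, the set-based score used in this comparison can be computed as in the sketch below, which treats a returned community and an HA-GT community as node sets and uses the standard harmonic mean of precision and recall. The function name `f1_score` is hypothetical, and the evaluation protocol of [3] may aggregate scores over several ground-truth communities.

```python
def f1_score(found, truth):
    """Harmonic mean of precision and recall between two node sets."""
    found, truth = set(found), set(truth)
    true_positives = len(found & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(f1_score({1, 2, 3, 4}, {3, 4, 5, 6}))
```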
Methods | Min-max (VAC) | Attribute coverage (ATC) | #Shared attributes (ACQ) | (Ours) | Total rank |
SEA (Ours) | 0.486 (2) | 161.84 (4) | 0.06 (2) | 0.304 (2) | 10 |
LocATC-Core | 0.491 (6) | 209.39 (1) | 0.04 (6) | 0.331 (6) | 19 |
ACQ-Core | 0.489 (5) | 196.79 (2) | 0.08 (1) | 0.328 (5) | 13 |
VAC-Core | 0.486 (2) | 178.46 (3) | 0.06 (2) | 0.325 (4) | 11 |
Exact (Ours) | 0.486 (2) | 155.13 (6) | 0.06 (2) | 0.297 (1) | 11 |
E-VAC-Core | 0.475 (1) | 158.45 (5) | 0.06 (2) | 0.314 (3) | 11 |
VII-B Effectiveness Evaluation
Figure 5(a) shows the results of homogeneous graphs. SEA has smaller than others and it is quite close to that of Exact. From the perspective of relative error of (Figure 5(b)), ours is at least one order of magnitude less than others and is bounded by the default error bound . This is because we apply BLB estimation with a reliable accuracy guarantee (Theorem 11). Moreover, we measure each method’s attribute cohesiveness w.r.t. various metrics. Table II shows the results on Facebook (associated with a rank in parentheses). We highlight the best results in bold and indicate suboptimal values with underlines. Each method performs the best on its own metric. From the macro perspective (see total rank), SEA ranks best overall across the metrics. We also apply the same method as [3] to evaluate the F1-score w.r.t. HA-GT community (Table III). Here, we use ‘-’ to indicate that a method cannot finish within a sufficiently long time. SEA and Exact have higher F1-scores than others, indicating that our communities are more similar to HA-GT than those of others. We also provide the F1-score over 10 ego-networks of Facebook in Figure 6 and find that ours has the best F1-score on eight of them.
Methods | Facebook | LiveJournal | Orkut | Amazon |
SEA (Ours) | 0.61 | 0.86 | 0.56 | 0.91 |
LocATC-Core | 0.54 | 0.76 | 0.45 | 0.73 |
ACQ-Core | 0.31 | 0.31 | 0.28 | 0.45 |
VAC-Core | 0.47 | 0.79 | 0.40 | 0.76 |
Exact (Ours) | 0.64 | 0.88 | - | - |
E-VAC-Core | 0.51 | - | - | - |
VII-C Efficiency Evaluation
Figure 5(c) shows the results of homogeneous graphs. We provide the speedup of SEA w.r.t. the compared methods on top of the bars. Ours outperforms others and the improvement becomes more pronounced as the graph size increases. For the ten-million-scale Twitter, ours is at least 28.2× faster than others. For all datasets, ours is at least 1.54× (41.1× on average) faster than others. This is because our method can terminate early once an acceptable is obtained. Figure 5(d) shows the runtime of SEA’s three steps. S1: Sampling-based maximal finding (§V-A). S2: BLB estimation (§V-B). S3: Error-based incremental sampling (§V-C). S2 is the most time-consuming step, as a greedy search is used to find candidate communities for BLB estimation. S3 is the most efficient step because in most cases, we can find a good community within 2 iterations.
Methods | Facebook Time | Facebook # States | GitHub Time | GitHub # States | Twitch Time | Twitch # States | LiveJournal Time | LiveJournal # States |
Exact | 77 | 3.24 | 1210 | 8.02 | 14721 | 1.07 | 59292 | 4.13 |
ExactP3 | 80 | 3.35 | 1890 | 1.24 | 15770 | 1.16 | 59315 | 4.29 |
ExactP3+P2 | 388 | 8.23 | 174483 | 4.21 | 753701 | 5.48 | 8 days | 1.02 |
Exact w/o P | 8 days | 6.87 | 8 days | 8.79 | 8 days | 2.81 | 8 days | 4.51 |
VII-D Effect of Pruning Strategies for Exact Method
In §IV, we propose Exact with three pruning strategies to prune duplicated states (P1), unnecessary states (P2), and unpromising states (P3). Table IV shows the effect of P1-P3 on the efficiency. Exact is the one with P1-P3, ExactP3 (Exact without P3) is the one with P1+P2, ExactP3+P2 (Exact without P3 and P2) is the one with P1 only, and Exact w/o P is the one without prunings. All strategies are effective and improve the runtime. Among them, P1 is the most effective one, as it significantly reduces # states explored in the search. For example, for Facebook, P1 prunes 99.8% of the states compared with Exact w/o P.
Methods | DBLP Time | DBLP Error | IMDB Time | IMDB Error | DBpedia Time | DBpedia Error | Yago Time | Yago Error | Freebase Time | Freebase Error |
SEA (Ours) | 187.01 | 1.58 | 72.89 | 1.56 | 59.64 | 0.0082 | 76.57 | 1.26 | 51.97 | 1.43 |
ACQ-Core | 799.34 | 13.45 | 850.26 | 41.57 | - | - | - | - | - | - |
LocATC-Core | 431.84 | 14.58 | 891.54 | 47.83 | 102.85 | 37.58 | 178.57 | 20.10 | 109.82 | 24.81 |
VAC-Core | 1453.82 | 12.45 | 2700.96 | 23.87 | 397.70 | 25.58 | 562.72 | 18.99 | 447.73 | 19.28 |
SEA-Truss | 334.57 | 0.21 | 89.67 | 1.15 | 72.99 | 1.23 | 93.59 | 1.17 | 64.26 | 1.81 |
LocATC-Truss | 812.93 | 4.89 | 947.19 | 21.29 | 211.28 | 19.99 | 297.14 | 15.28 | 191.34 | 15.17 |
VAC-Truss | 1857.71 | 6.25 | 2938.27 | 9.04 | 791.85 | 2.37 | 839.47 | 5.31 | 621.54 | 5.87 |
VII-E Scalability Analysis
Extension to heterogeneous graphs. We provide the results of core-based methods on five heterogeneous graphs in Table V (rows 1-4). Since we only have numerical attributes for DBpedia, Yago, and Freebase, the equality-matching-based method ACQ-Core cannot return any communities that share at least one numerical attribute. For all datasets, ours has at least one order of magnitude less relative error than others and is bounded by the default (Theorem 11). Besides, ours is at least 1.72× faster than others, as we can terminate early when Theorem 11 holds.
Extension to -truss model. Table V (rows 5-7) shows the results for truss-based methods on heterogeneous graphs. Ours outperforms others for the same reasons as above.
Extension to size-bounded CS. Figure 7 shows the results of size-bounded CS on DBLP and GitHub. The runtime decreases as the size increases, because the larger the desired community, the less time is required for the greedy search of candidate communities. Besides, the relative error is bounded by the default , showing that BLB estimation is effective.
Methods | Approximate result round by round | |||||
Round | MoE | Time (ms) |
|
|||
SEA w/ size-bound | 1 | 4.39 | 9.23 | 5967 | 48.29 | 2.34 |
2 | 4.31 | 1.79 | 615 | 3.85 | 0.39 | |
SEA w/ size-bound | 1 | 5.43 | 1.48 | 6743 | 52.45 | 4.66 |
2 | 5.17 | 1.05 | 3989 | 18.84 | 0.41 |
VII-F Case Study
We performed a case study on IMDB with Robert De Niro. Figure 9 illustrates two communities returned by SEA with different size-bound and . We use different colors to distinguish persons with different levels of attribute similarities to Robert De Niro (Red Blue Green). The community in (a) includes a set of top-tier actors in Hollywood who are as famous as Robert De Niro, showing a greater attribute cohesiveness w.r.t. than that of (b). In contrast, (b) has to include a few less similar persons, e.g., 50 Cent and Dom Deluise, in order to satisfy the enlarged size bound. Table VI shows the detailed runtime information of SEA, where the community is refined iteratively (i.e., decreased and ) and finally the relative error is bounded by the default =. Besides, since we apply an error-based incremental sampling, we require a smaller than the initial to update the result.
VII-G Parameter Sensitivity
Effect of . Figure 8(a)-(b) show that the runtime increases as increases, e.g., from 180 ms to 290 ms for DBLP. The more the samples, the more time is required to greedily search communities from a larger for estimation. Besides, has little effect on the attribute cohesiveness, as the effectiveness is mainly dominated by the estimation with accuracy guarantee.
Effect of and for Hoeffding Inequality. Figure 8(c)-(f) show that the response time increases as () decreases (increases). The stricter the and , the more nodes are required to form (Theorem 10), leading to more time for estimation over a larger . Since and are used to control the probability of the event that contains all nodes from the ground-truth community, the stricter the and , the higher the probability of finding a better community.
Effect of and for BLB estimation. Figure 8(g)-(j) show that the stricter the and , the more response time is required to achieve a smaller relative error. The relative error is almost always bounded by , except for the case of = in Twitter. This is because Theorem 11 holds with a probability of and this situation rarely happens for a large .
Effect of . Figure 8(k)-(l) show that increases as increases. A large usually indicates a small community, in which many nodes may be important to the structure cohesiveness, and a -core would collapse if we delete such a node. So, the returned community may include some nodes that are dissimilar but contribute a lot to the structure cohesiveness, leading to a larger . The runtime for a small is usually more than that for a large , as a small often indicates that we need more time for greedy search over a large for estimation.
Effect of . Since is a balance factor to adjust user preferences for two types of attribute cohesiveness, we varied to study its effect on the two independent textual and numerical attribute cohesiveness in Figure 10. We observed that when (), our method tends to identify communities with the highest (lowest) cohesion in textual attributes (i.e., Jaccard distance) but the lowest (highest) cohesion in numerical attributes (i.e., Manhattan distance). A balance is achieved if is close to 0.5, indicating that communities with good cohesion in both types of attributes can be identified.
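The trade-off controlled by the balance factor can be illustrated with a small composite distance. The sketch assumes a convex combination in which the factor weights the textual (Jaccard) part and its complement weights the numerical (Manhattan) part, with numerical attributes pre-normalized to [0, 1]; the exact definition and normalization used in the paper may differ, and all function names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two keyword sets."""
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def manhattan_distance(x, y):
    """Mean absolute difference; assumes each coordinate lies in [0, 1]."""
    return sum(abs(p - r) for p, r in zip(x, y)) / len(x)

def composite_distance(text_a, text_b, num_a, num_b, balance):
    """Assumed form: balance * textual + (1 - balance) * numerical."""
    return (balance * jaccard_distance(text_a, text_b)
            + (1.0 - balance) * manhattan_distance(num_a, num_b))

if __name__ == "__main__":
    for balance in (0.0, 0.5, 1.0):
        print(balance, composite_distance({"db", "ml"}, {"db", "ir"},
                                          [0.2, 0.8], [0.4, 0.5], balance))
```

At the two extremes only one attribute type influences the distance, which matches the observation that balanced cohesion requires an intermediate setting.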
VIII Related Work
Community search (CS) was first studied in [8]. Existing work can be divided into two categories according to graph types.
CS on homogeneous graphs. Many works focus on modeling the cohesive community based on minimum degree [8, 15, 67], -core [39, 30, 67, 14], -truss [18, 17, 16, 68], -clique [19, 20, 69], -edge [70, 71], and query-biased density model [72]. These works greatly boost the study of CS, but ignore CS on attributed graphs. Thus, many works define different metrics of attribute cohesiveness and then integrate them with the structure cohesiveness for CS [73, 74, 2, 75, 22, 3, 5, 1]. Although many metrics have been proposed, they are not strict enough to reflect a community’s attribute cohesiveness. For example, [3] measures a community’s attribute cohesiveness as the weighted sum of each attribute’s coverage, where coverage is computed as the ratio of nodes with an exactly matched attribute. Similarly, [22] uses # shared attributes to measure cohesiveness, relying on equality matching too. Due to the constraints of equality matching, they are not well-suited for numerical attributes; for instance, it is more reasonable to seek similar movies with similar ratings rather than identical ones. [5] aims to minimize the maximum attribute distance (i.e., optimize the worst case) in a community, but overlooks the similarity of nodes to the query node . This motivates us to present a CS based on a -centric attribute distance considering both textual and numerical attributes.
CS on heterogeneous graphs. Recently, CS on heterogeneous graphs has emerged. The meta-path is often used to indicate the relation between two node types. Some community models are proposed, e.g., -core [7], -Btruss and -Ctruss [51]. Many follow-up works use them for various downstream applications, e.g., expert finding in [11, 12] and influential community search via skyline influence vectors in [52]. [76] presents a keyword-centric CS, which takes a set of keywords as input rather than a query node, and it cannot support numerical attributes. It ensures that any node in a community can reach a keyword with a shorter path, rather than that all nodes in a community are similar in their attributes.
None of the above methods provides an efficient approximate solution with a reliable evaluation of a community’s quality based on a metric that can better distinguish a community’s attribute cohesiveness, inspiring our study in this paper.
IX Conclusions
We study an NP-hard CS-AG problem atop a strict -centric attribute cohesiveness metric for -core model on homogeneous graphs. We first propose an exact method with three pruning strategies, serving as a baseline. Then, we propose a sampling-estimation-based method to quickly return an approximate community with an accuracy guarantee (given as an error-bounded confidence interval). We extend our method to heterogeneous graphs, -truss model, and size-bounded CS. Experimental studies on ten real-world datasets demonstrate our method’s superiority in both effectiveness and efficiency.
References
- [1] Z. Zhang, X. Huang, J. Xu, B. Choi, and Z. Shang, “Keyword-centric community search,” in ICDE, 2019, pp. 422–433.
- [2] Y. Fang, R. Cheng, X. Li, S. Luo, and J. Hu, “Effective Community Search over Large Spatial Graphs,” PVLDB, vol. 10, no. 6, pp. 709–720, 2017.
- [3] X. Huang and L. V. S. Lakshmanan, “Attribute-Driven Community Search,” PVLDB, vol. 10, no. 9, pp. 949–960, 2017.
- [4] L. Sun, X. Huang, R. Li, B. Choi, and J. Xu, “Index-based intimate-core community search in large weighted graphs,” IEEE Trans. Knowl. Data Eng., 2020.
- [5] Q. Liu, Y. Zhu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “VAC: vertex-centric attributed community search,” in ICDE, 2020, pp. 937–948.
- [6] X. Miao, Y. Liu, L. Chen, Y. Gao, and J. Yin, “Reliable community search on uncertain graphs,” in ICDE, 2022, pp. 1166–1179.
- [7] Y. Fang, Y. Yang, W. Zhang, X. Lin, and X. Cao, “Effective and efficient community search over large heterogeneous information networks,” PVLDB, vol. 13, no. 6, pp. 854–867, 2020.
- [8] M. Sozio and A. Gionis, “The community-search problem and how to plan a successful cocktail party,” in KDD, 2010, pp. 939–948.
- [9] J. Dudley, T. Deshpande, and A. J. Butte, “Exploiting drug-disease relationships for computational drug repositioning,” Briefings Bioinform., vol. 12, no. 4, pp. 303–311, 2011.
- [10] P. Pesantez-Cabrera and A. Kalyanaraman, “Efficient detection of communities in biological bipartite networks,” IEEE ACM Trans. Comput. Biol. Bioinform., vol. 16, no. 1, pp. 258–271, 2019.
- [11] X. Xu, J. Liu, Y. Wang, and X. Ke, “Academic Expert Finding via (k,p)-core based Embedding over Heterogeneous Graphs,” in ICDE, 2022, pp. 338–351.
- [12] Y. Wang, J. Liu, X. Xu, X. Ke, T. Wu, and X. Gou, “Efficient and effective academic expert finding on heterogeneous graphs through (k,p)-core based embedding,” ACM Trans. Knowl. Discov. Data, vol. 17, no. 6, Mar. 2023.
- [13] E. Clough and T. Barrett, The Gene Expression Omnibus Database. New York, NY: Springer New York, 2016, pp. 93–110. [Online]. Available: https://doi.org/10.1007/978-1-4939-3578-9_5
- [14] N. Barbieri, F. Bonchi, E. Galimberti, and F. Gullo, “Efficient and effective community search,” Data Min. Knowl. Discov., vol. 29, no. 5, pp. 1406–1433, 2015.
- [15] W. Cui, Y. Xiao, H. Wang, and W. Wang, “Local Search of Communities in Large Graphs,” in SIGMOD, 2014, pp. 991–1002.
- [16] X. Huang, L. V. S. Lakshmanan, J. X. Yu, and H. Cheng, “Approximate Closest Community Search in Networks,” PVLDB, vol. 9, no. 4, pp. 276–287, 2015.
- [17] X. Huang, H. Cheng, L. Qin, W. Tian, and J. X. Yu, “Querying k-truss community in large and dynamic graphs,” in SIGMOD, 2014, pp. 1311–1322.
- [18] E. Akbas and P. Zhao, “Truss-based community search: A truss-equivalence based indexing approach,” PVLDB, vol. 10, no. 11, pp. 1298–1309, 2017.
- [19] W. Cui, Y. Xiao, H. Wang, Y. Lu, and W. Wang, “Online Search of Overlapping Communities,” in SIGMOD, 2013, pp. 277–288.
- [20] C. E. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. A. Tsiarli, “Denser than the densest subgraph: Extracting optimal quasi-cliques with quality guarantees,” in KDD, 2013, pp. 104–112.
- [21] Y. Fang, X. Huang, L. Qin, Y. Zhang, W. Zhang, R. Cheng, and X. Lin, “A survey of community search over big graphs,” VLDBJ, vol. 29, no. 1, pp. 353–392, 2020.
- [22] Y. Fang, R. Cheng, S. Luo, and J. Hu, “Effective community search for large attributed graphs,” PVLDB, vol. 9, no. 12, pp. 1233–1244, 2016.
- [23] Y. Zhu, J. He, J. Ye, L. Qin, X. Huang, and J. X. Yu, “When structure meets keywords: Cohesive attributed community search,” in CIKM, 2020, pp. 1913–1922.
- [24] S. Kosub, “A note on the triangle inequality for the Jaccard distance,” Pattern Recognit. Lett., vol. 120, pp. 36–38, 2019.
- [25] N. Laptev, K. Zeng, and C. Zaniolo, “Early accurate results for advanced analytics on MapReduce,” PVLDB, vol. 5, no. 10, pp. 1028–1039, 2012.
- [26] S. Chaudhuri, B. Ding, and S. Kandula, “Approximate query processing: No silver bullet,” in SIGMOD, S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, and D. Suciu, Eds., 2017, pp. 511–519.
- [27] Y. Wang, A. Khan, X. Xu, J. Jin, Q. Hong, and T. Fu, “Aggregate Queries on Knowledge Graphs: Fast Approximation with Semantic-aware Sampling,” in ICDE, 2022.
- [28] Y. Wang, J. Luo, A. Song, and F. Dong, “A sampling-based hybrid approximate query processing system in the cloud,” in ICPP, 2014, pp. 291–300.
- [29] K. Yao and L. Chang, “Efficient size-bounded community search over large networks,” PVLDB, vol. 14, no. 8, pp. 1441–1453, 2021.
- [30] F. Bonchi, A. Khan, and L. Severini, “Distance-generalized core decomposition,” in SIGMOD, 2019, pp. 1006–1023.
- [31] J. Hu, X. Wu, R. Cheng, S. Luo, and Y. Fang, “Querying Minimal Steiner Maximum-connected Subgraphs in Large Graphs,” in CIKM, 2016, pp. 1241–1250.
- [32] R. Li, L. Qin, J. X. Yu, and R. Mao, “Influential Community Search in Large Networks,” PVLDB, vol. 8, no. 5, pp. 509–520, 2015.
- [33] R. Li, L. Qin, F. Ye, J. X. Yu, X. Xiao, N. Xiao, and Z. Zheng, “Skyline community search in multi-valued networks,” in SIGMOD, 2018, pp. 457–472.
- [34] M. Wang, L. Lv, X. Xu, Y. Wang, Q. Yue, and J. Ni, “An efficient and robust framework for approximate nearest neighbor search with attribute constraint,” in NeurIPS, 2024.
- [35] M. El-Kebir and G. W. Klau, “Solving the maximum-weight connected subgraph problem to optimality,” arXiv, vol. abs/1409.5308, 2014.
- [36] J. M. Kleinberg and É. Tardos, Algorithm Design. Addison-Wesley, 2006.
- [37] A. Santuari, “Steiner tree NP-completeness proof,” University of Trento, Tech. Rep., 2003.
- [38] J. Byrka, F. Grandoni, T. Rothvoß, and L. Sanità, “An improved LP-based approximation for Steiner tree,” in STOC, L. J. Schulman, Ed., 2010, pp. 583–592.
- [39] V. Batagelj and M. Zaversnik, “An O(m) algorithm for cores decomposition of networks,” arXiv, vol. cs.DS/0310049, 2003.
- [40] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, pp. 409–426, 1994.
- [41] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “A Scalable Bootstrap for Massive Data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 4, pp. 795–816, 2014.
- [42] J. M. Kleinberg, “Navigation in a small world,” Nature, vol. 406, p. 845, 2000.
- [43] Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, 2020.
- [44] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in ICDM, 2012, pp. 745–754.
- [45] D. Cheng, C. Chen, X. Wang, and S. Xiang, “Efficient top-k vulnerable nodes detection in uncertain graphs,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 2, pp. 1460–1472, 2023.
- [46] F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich, “Core decomposition of uncertain graphs,” in KDD, 2014, pp. 1316–1325.
- [47] Y. Peng, Y. Zhang, W. Zhang, X. Lin, and L. Qin, “Efficient probabilistic k-core computation on uncertain graphs,” in ICDE, 2018, pp. 1192–1203.
- [48] J. Gao, X. Li, Y. E. Xu, B. Sisman, X. L. Dong, and J. Yang, “Efficient knowledge graph accuracy evaluation,” PVLDB, vol. 12, no. 11, pp. 1679–1691, 2019.
- [49] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. Springer, 1993.
- [50] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “The big data bootstrap,” in ICML, 2012.
- [51] Y. Yang, Y. Fang, X. Lin, and W. Zhang, “Effective and Efficient Truss Computation over Large Heterogeneous Information Networks,” in ICDE, 2020, pp. 901–912.
- [52] Y. Zhou, Y. Fang, W. Luo, and Y. Ye, “Influential community search over large heterogeneous information networks,” PVLDB, vol. 16, no. 8, pp. 2047–2060, 2023.
- [53] S. Coles, J. Bawa, L. Trenner, and P. Dorazio, An Introduction to Statistical Modeling of Extreme Values. Springer, 2001, vol. 208.
- [54] Y. Ma, Y. Yuan, F. Zhu, G. Wang, J. Xiao, and J. Wang, “Who should be invited to my party: A size-constrained k-core problem in social networks,” J. Comput. Sci. Technol., vol. 34, no. 1, pp. 170–184, 2019.
- [55] “Code and datasets,” https://anonymous.4open.science/r/SEA-Update-D18E/README.md, 2023.
- [56] J. J. McAuley and J. Leskovec, “Learning to discover social circles in ego networks,” in NIPS, 2012, pp. 548–556.
- [57] B. Rozemberczki, C. Allen, and R. Sarkar, “Multi-scale attributed node embedding,” J. Complex Networks, vol. 9, no. 2, 2021.
- [58] B. Rozemberczki and R. Sarkar, “Twitch gamers: a dataset for evaluating proximity preserving and structural role-based node embeddings,” arXiv, vol. abs/2101.03091, 2021.
- [59] R. A. Rossi and N. K. Ahmed, “The network data repository with interactive graph analytics and visualization,” in AAAI, 2015.
- [60] “DBLP,” http://dblp.uni-trier.de/xml/, 2023.
- [61] “IMDB,” https://www.imdb.com/interfaces/, 2023.
- [62] “DBpedia,” https://wiki.dbpedia.org/Datasets, 2023.
- [63] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in SIGMOD, 2008, pp. 1247–1250.
- [64] T. Rebele, F. M. Suchanek, J. Hoffart, J. Biega, E. Kuzey, and G. Weikum, “YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames,” in ISWC, 2016, pp. 177–185.
- [65] “Orkut dataset,” https://www.comp.hkbu.edu.hk/~db/book/communitysearch.html, 2023.
- [66] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowl. Inf. Syst., vol. 42, no. 1, pp. 181–213, 2015.
- [67] Y. Fang, Z. Wang, R. Cheng, H. Wang, and J. Hu, “Effective and efficient community search over large directed graphs,” IEEE Trans. Knowl. Data Eng., vol. 31, no. 11, pp. 2093–2107, 2019.
- [68] Q. Liu, M. Zhao, X. Huang, J. Xu, and Y. Gao, “Truss-based community search over large directed graphs,” in SIGMOD, 2020, pp. 2183–2197.
- [69] L. Yuan, L. Qin, W. Zhang, L. Chang, and J. Yang, “Index-based densest clique percolation community search in networks,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 5, pp. 922–935, 2018.
- [70] L. Chang, X. Lin, L. Qin, J. X. Yu, and W. Zhang, “Index-based optimal algorithms for computing Steiner components with maximum connectivity,” in SIGMOD, 2015, pp. 459–474.
- [71] J. Hu, X. Wu, R. Cheng, S. Luo, and Y. Fang, “On minimal Steiner maximum-connected subgraph queries,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 11, pp. 2455–2469, 2017.
- [72] Y. Wu, R. Jin, J. Li, and X. Zhang, “Robust local community detection: On free rider effect and its elimination,” PVLDB, vol. 8, no. 7, pp. 798–809, 2015.
- [73] L. Chen, C. Liu, R. Zhou, J. Li, X. Yang, and B. Wang, “Maximum co-located community search in large scale social networks,” PVLDB, vol. 11, no. 10, pp. 1233–1246, 2018.
- [74] L. Chen, C. Liu, K. Liao, J. Li, and R. Zhou, “Contextual community search over large social networks,” in ICDE, 2019, pp. 88–99.
- [75] Y. Fang, R. Cheng, Y. Chen, S. Luo, and J. Hu, “Effective and efficient attributed community search,” VLDBJ, vol. 26, no. 6, pp. 803–828, 2017.
- [76] L. Qiao, Z. Zhang, Y. Yuan, C. Chen, and G. Wang, “Keyword-centric community search over large heterogeneous information networks,” in DASFAA, vol. 12681, 2021, pp. 158–173.