-
Efficient Historical Butterfly Counting in Large Temporal Bipartite Networks via Graph Structure-aware Index
Authors:
Qiuyang Mang,
Jingbang Chen,
Hangrui Zhou,
Yu Gao,
Yingli Zhou,
Richard Peng,
Yixiang Fang,
Chenhao Ma
Abstract:
Bipartite graphs are ubiquitous in many domains, e.g., e-commerce platforms, social networks, and academia, by modeling interactions between distinct entity sets. Within these graphs, the butterfly motif, a complete $2 \times 2$ biclique, represents the simplest yet significant subgraph structure, crucial for analyzing complex network patterns. Counting the butterflies offers significant benefits across various applications, including community analysis and recommender systems. Additionally, the temporal dimension of bipartite graphs, where edges activate within specific time frames, introduces the concept of historical butterfly counting, i.e., counting butterflies within a given time interval. This temporal analysis sheds light on the dynamics and evolution of network interactions, offering new insights into their mechanisms. Despite its importance, no existing algorithm can efficiently solve the historical butterfly counting task. To address this, we design two novel indices whose memory footprints are dependent on #butterflies and #wedges, respectively. Combining these indices, we propose a graph structure-aware indexing approach that significantly reduces memory usage while preserving exceptional query speed. We theoretically prove that our approach is particularly advantageous on power-law graphs, a common characteristic of real-world bipartite graphs, by surpassing traditional complexity barriers for general graphs. Extensive experiments reveal that our query algorithms outperform existing methods by up to five orders of magnitude, effectively balancing speed with manageable memory requirements.
Submitted 1 June, 2024;
originally announced June 2024.
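To make the task in this abstract concrete, the following is a minimal brute-force sketch of historical butterfly counting: it filters edges to a query interval and counts butterflies through wedges. It is not the indexing approach of the paper. The function name, the `(u, v, t)` edge format, the closed interval, and the choice to collapse repeated edges between the same pair into one are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def count_historical_butterflies(edges, t_start, t_end):
    """Count butterflies among edges whose timestamp lies in [t_start, t_end].

    edges: iterable of (u, v, t), with u from the left vertex set and
    v from the right vertex set (hypothetical input format).
    """
    # Right-side neighbours of each left vertex, restricted to the window.
    # Using a set collapses repeated (u, v) edges (an assumption).
    nbrs = defaultdict(set)
    for u, v, t in edges:
        if t_start <= t <= t_end:
            nbrs[u].add(v)

    # A wedge is a pair of right vertices sharing one left vertex.
    wedges = defaultdict(int)
    for vs in nbrs.values():
        for v1, v2 in combinations(sorted(vs), 2):
            wedges[(v1, v2)] += 1

    # Two right vertices with c common left neighbours form C(c, 2) butterflies.
    return sum(c * (c - 1) // 2 for c in wedges.values())


edges = [
    ("a", "x", 1), ("a", "y", 2),
    ("b", "x", 3), ("b", "y", 4),
    ("c", "x", 5), ("c", "y", 9),
]
print(count_historical_butterflies(edges, 1, 4))
print(count_historical_butterflies(edges, 1, 9))
```

Each query here rescans every edge, which is the cost that a precomputed index is meant to avoid.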
-
Scalable Algorithm for Finding Balanced Subgraphs with Tolerance in Signed Networks
Authors:
Jingbang Chen,
Qiuyang Mang,
Hangrui Zhou,
Richard Peng,
Yu Gao,
Chenhao Ma
Abstract:
Signed networks, characterized by edges labeled as either positive or negative, offer nuanced insights into interaction dynamics beyond the capabilities of unsigned graphs. Central to this is the task of identifying the maximum balanced subgraph, crucial for applications like polarized community detection in social networks and portfolio analysis in finance. Traditional models, however, are limited by an assumption of perfect partitioning, which fails to mirror the complexities of real-world data. Addressing this gap, we introduce an innovative generalized balanced subgraph model that incorporates tolerance for irregularities. Our proposed region-based heuristic algorithm, tailored for this NP-hard problem, strikes a balance between low time complexity and high-quality outcomes. Comparative experiments validate its superior performance against leading solutions, delivering enhanced effectiveness (notably larger subgraph sizes) and efficiency (achieving up to 100x speedup) in both traditional and generalized contexts.
Submitted 16 June, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
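For readers unfamiliar with the term, a signed graph is balanced in the classical sense when its vertices can be split into two groups so that every positive edge stays inside a group and every negative edge crosses between groups. The sketch below checks that classical property with a breadth-first two-colouring. It does not implement the tolerance-based model or the region-based heuristic from the abstract, and the function name and the `(u, v, sign)` edge format are assumptions.

```python
from collections import deque


def is_balanced(n, signed_edges):
    """Return True if the signed graph on vertices 0..n-1 is balanced.

    signed_edges: iterable of (u, v, sign) with sign +1 or -1
    (hypothetical input format).
    """
    adj = [[] for _ in range(n)]
    for u, v, s in signed_edges:
        adj[u].append((v, s))
        adj[v].append((u, s))

    side = [None] * n  # group (0 or 1) assigned to each vertex
    for start in range(n):
        if side[start] is not None:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, s in adj[u]:
                # Positive edge: same group. Negative edge: opposite group.
                expected = side[u] if s > 0 else 1 - side[u]
                if side[v] is None:
                    side[v] = expected
                    queue.append(v)
                elif side[v] != expected:
                    return False
    return True


print(is_balanced(3, [(0, 1, 1), (1, 2, -1), (0, 2, -1)]))
print(is_balanced(3, [(0, 1, 1), (1, 2, 1), (0, 2, -1)]))
```

Checking balance is linear in the graph size; the hard part addressed by the paper is finding a large subgraph that satisfies the property.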
-
Nearly Optimal Internal Dictionary Matching
Authors:
Jingbang Chen,
Jiangqi Dai,
Qiuyang Mang,
Qingyu Shi,
Tingqiang Xu
Abstract:
We study the internal dictionary matching (IDM) problem where a dictionary $\mathcal{D}$ containing $d$ substrings of a text $T$ is given, and each query concerns the occurrences of patterns in $\mathcal{D}$ in another substring of $T.$
We propose a novel $O(n)$-sized data structure named Basic Substring Structure (BASS) where $n$ is the length of the text $T.$ With BASS, we are able to handle all types of queries in the IDM problem in nearly optimal query and preprocessing time. Specifically, our results include:
- The first algorithm that answers the *CountDistinct* query in $\tilde{O}(1)$ time with $\tilde{O}(n+d)$ preprocessing, where we need to compute the number of distinct patterns that exist in $T[i..j]$. Previously, the best result was $\tilde{O}(m)$ time per query after $\tilde{O}(n^2/m+d)$ or $\tilde{O}(nd/m+d)$ preprocessing, where $m$ is a chosen parameter.
- Faster algorithms for two other types of internal queries. We improve the runtime for (1) Pattern counting (Count) queries to $O(\log n/\log\log n)$ time per query with $O(n+d\sqrt{\log n})$ preprocessing from $O(\log^2 n/\log\log n)$ time per query with $O(n\log n/\log \log n+d\log^{3/2} n)$ preprocessing; and (2) Distinct pattern reporting (ReportDistinct) queries to $O(1+|\text{output}|)$ time per query from $O(\log n+|\text{output}|)$ per query.
In addition, we match the optimal runtime in the remaining two types of queries, pattern existence (Exist), and pattern reporting (Report). We also show that BASS is more generally applicable to other internal query problems.
Submitted 6 July, 2024; v1 submitted 19 December, 2023;
originally announced December 2023.
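The query types named in this abstract can be pinned down with a naive reference implementation. The sketch below answers CountDistinct and Count by scanning the substring directly; it has none of the stated time bounds and is unrelated to BASS. The function names, 0-based indexing, and the inclusive range $T[i..j]$ are assumptions.

```python
def count_distinct(text, dictionary, i, j):
    """CountDistinct: number of distinct patterns occurring in text[i..j]."""
    window = text[i:j + 1]  # inclusive, 0-based range (an assumption)
    return sum(1 for p in set(dictionary) if p in window)


def count_occurrences(text, dictionary, i, j):
    """Count: total occurrences of all patterns in text[i..j], overlaps included."""
    window = text[i:j + 1]
    total = 0
    for p in dictionary:
        pos = window.find(p)
        while pos != -1:
            total += 1
            pos = window.find(p, pos + 1)  # step by one to catch overlaps
    return total


text = "abracadabra"
patterns = ["abra", "cad", "bra", "ra"]
print(count_distinct(text, patterns, 0, 3))
print(count_occurrences(text, patterns, 0, 10))
```

A brute-force version like this is mainly useful as a correctness oracle when testing a faster data structure.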
-
Retromorphic Testing: A New Approach to the Test Oracle Problem
Authors:
Boxi Yu,
Qiuyang Mang,
Qingshuo Guo,
Pinjia He
Abstract:
A test oracle serves as a criterion or mechanism to assess the correspondence between software output and the anticipated behavior for a given input set. In automated testing, black-box techniques, known for their non-intrusive nature in test oracle construction, are widely used, including notable methodologies like differential testing and metamorphic testing. Inspired by the mathematical concept of inverse function, we present Retromorphic Testing, a novel black-box testing methodology. It leverages an auxiliary program in conjunction with the program under test, which establishes a dual-program structure consisting of a forward program and a backward program. The input data is first processed by the forward program and then its program output is reversed to its original input format using the backward program. In particular, the auxiliary program can operate as either the forward or backward program, leading to different testing modes. The process concludes by examining the relationship between the initial input and the transformed output within the input domain. For example, to test the implementation of the sine function $\sin(x)$, we can employ its inverse function, $\arcsin(x)$, and validate the equation $x = \sin(\arcsin(x)+2k\pi), \forall k \in \mathbb{Z}$. In addition to the high-level concept of Retromorphic Testing, this paper presents its three testing modes with illustrative use cases across diverse programs, including algorithms, traditional software, and AI applications.
Submitted 10 October, 2023;
originally announced October 2023.
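The sine example from this abstract can be written as a small check. In the sketch below, `math.asin` plays the auxiliary forward program and the sine implementation under test plays the backward program; the check reports every input where $\sin(\arcsin(x)+2k\pi)$ drifts from $x$. The function name, the range of $k$, and the tolerance are assumptions chosen for illustration, not values from the paper.

```python
import math


def retromorphic_sin_check(sin_impl, xs, ks=range(-3, 4), tol=1e-9):
    """Return (x, k, y) triples where sin_impl(asin(x) + 2*k*pi) differs from x."""
    failures = []
    for x in xs:
        for k in ks:
            # Forward program: arcsin. Backward program: the sine under test.
            y = sin_impl(math.asin(x) + 2 * k * math.pi)
            if not math.isclose(y, x, rel_tol=0.0, abs_tol=tol):
                failures.append((x, k, y))
    return failures


def truncated_sin(t):
    """A deliberately faulty sine: a two-term Taylor series with no range reduction."""
    return t - t ** 3 / 6


inputs = [-1.0, -0.5, 0.0, 0.3, 1.0]
print(len(retromorphic_sin_check(math.sin, inputs)))
print(len(retromorphic_sin_check(truncated_sin, inputs)) > 0)
```

No reference output for the sine is needed: the relation between the original input and the round-tripped value serves as the oracle.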
-
Automated Testing and Improvement of Named Entity Recognition Systems
Authors:
Boxi Yu,
Yiyan Hu,
Qiuyang Mang,
Wenhan Hu,
Pinjia He
Abstract:
Named entity recognition (NER) systems have seen rapid progress in recent years due to the development of deep neural networks. These systems are widely used in various natural language processing applications, such as information extraction, question answering, and sentiment analysis. However, the complexity and intractability of deep neural networks can make NER systems unreliable in certain circumstances, resulting in incorrect predictions. For example, NER systems may misidentify female names as chemicals or fail to recognize the names of minority groups, leading to user dissatisfaction. To tackle this problem, we introduce TIN, a novel, widely applicable approach for automatically testing and repairing various NER systems. The key idea for automated testing is that the NER predictions of the same named entities under similar contexts should be identical. The core idea for automated repairing is that similar named entities should have the same NER prediction under the same context. We use TIN to test two SOTA NER models and two commercial NER APIs, i.e., Azure NER and AWS NER. We manually verify 784 of the suspicious issues reported by TIN and find that 702 are erroneous issues, leading to high precision (85.0%-93.4%) across four categories of NER errors: omission, over-labeling, incorrect category, and range error. For automated repairing, TIN achieves a high error reduction rate (26.8%-50.6%) over the four systems under test, which successfully repairs 1,056 out of the 1,877 reported NER errors.
Submitted 13 August, 2023;
originally announced August 2023.