Cover Edge-Based Novel Triangle Counting

David A. Bader, Fuhuan Li, Zhihui Du, Palina Pauliuchenka, Oliver Alvarado Rodriguez (New Jersey Institute of Technology, Newark, New Jersey, USA 07102); Anant Gupta (John P. Stevens High School, Edison, USA); Sai Sri Vastav Minnal (Edison Academy Magnet School, Edison, USA); Valmik Nahata (New Providence High School, New Providence, New Jersey, USA); Anya Ganeshan (Bergen County Academies, Hackensack, New Jersey, USA); Ahmet Gundogdu (Paramus High School, Paramus, New Jersey, USA); and Jason Lew (New Jersey Institute of Technology, Newark, New Jersey, USA 07102)

(2024)

Abstract.

Listing and counting triangles in graphs is a key algorithmic kernel for network analyses, including community detection, clustering coefficients, k-trusses, and triangle centrality. In this paper, we propose the novel concept of a cover-edge set that can be used to find triangles more efficiently. Leveraging the breadth-first search (BFS) method, we can quickly generate a compact cover-edge set. Novel sequential and parallel triangle counting algorithms that employ cover-edge sets are presented. The novel sequential algorithm performs competitively with the fastest previous approaches on both real and synthetic graphs, such as those from the Graph500 Benchmark and the MIT/Amazon/IEEE Graph Challenge. We implement 22 sequential algorithms for performance evaluation and comparison. At the same time, we employ OpenMP to parallelize 11 sequential algorithms, presenting an in-depth analysis of their parallel performance. Furthermore, we develop a distributed parallel algorithm that can asymptotically reduce communication on massive graphs. In our estimate from massive-scale Graph500 graphs, our distributed parallel algorithm can reduce the communication on a scale 36 graph by 1156x and on a scale 42 graph by 2368x. Comprehensive experiments are conducted on the recently launched Intel Xeon 8480+ processor and shed light on how graph attributes, such as topology, diameter, and degree distribution, can affect the performance of these algorithms.

Graph Algorithms, High-Performance Data Analytics, Parallel Algorithms


1. Introduction

Triangle listing and counting is a highly-studied problem in computer science and is a key building block in various graph analysis techniques such as clustering coefficients (Watts and Strogatz, 1998), k-truss (Cohen, 2008), and triangle centrality (Burkhardt, 2021), (Li and Bader, 2021). The significance of triangle counting is evident in its application in high-performance computing benchmarks like Graph500 (Graph 500 Steering Committee, 2010) and the MIT/Amazon/IEEE Graph Challenge (Samsi et al., 2018), as well as in the design of future architecture systems (e.g., IARPA AGILE (Harrod, 2020)).

There are at most $\binom{n}{3}=\Theta\left(n^{3}\right)$ triangles in a graph $G=(V,E)$ with $n=|V|$ vertices and $m=|E|$ edges. The naïve approach using triply-nested loops to check if each triple $(u,v,w)$ forms a triangle takes $\mathcal{O}\left(n^{3}\right)$ time and is inefficient for sparse graphs. It is well-known that listing all triangles in $G$ takes $\Omega\left(m^{\frac{3}{2}}\right)$ time in the worst case (Itai and Rodeh, 1978; Latapy, 2008). To enhance the performance of triangle counting, Cohen (Cohen, 2009) introduced a novel map-reduce parallelization technique that generates open wedges between triples of vertices in the graph. It determines whether a closing edge exists to complete a triangle, thus avoiding the redundant counting of the same triangle while maintaining load balancing. Many parallel approaches for triangle counting (Pearce, 2017; Ghosh and Halappanavar, 2020) partition the sparse graph data structure across multiple compute nodes and adopt the strategy of generating open wedges, which are sent to other compute nodes to determine the presence of a closing edge. Consequently, the communication time for these open wedges often dominates the running time of parallel triangle counting.

In this paper, we propose a novel approach that efficiently identifies all triangles using a reduced set of edges known as a cover-edge set. By leveraging the cover-edge-based triangle counting method, unnecessary edge checks can be skipped while ensuring that no triangles are missed. This significantly reduces the number of computational operations compared to existing methods.

The main contributions of this paper are:

•

A novel triangle counting algorithm, Cover-Edge Triangle Counting (CETC), is proposed based on a new concept Cover-Edge Set. The essential idea is that we can identify all triangles from a significantly reduced cover-edge set instead of the complete edge set. A simple breadth-first search (BFS) is used to orient the graph’s vertices into levels and to generate the cover-edge set.
•

Various sequential variants of CETC that combine the techniques of cover-edge, the forward algorithm, and hashing are developed. Furthermore, parallel implementations of CETC for both shared-memory (CETC-SM) and distributed-memory (CETC-DM) systems are introduced.
•

Freely-available, open-source software for more than 22 sequential triangle counting algorithms and 11 OpenMP parallel algorithms in the C programming language.
•

A comprehensive experimental study of implementations of the proposed novel triangle counting algorithms on real and synthetic graphs with the comparison against other existing algorithms.

2. Notations and Definitions

Let $G=(V,E)$ be an undirected graph with $n=|V|$ vertices and $m=|E|$ edges. A triangle in the graph is a set of three vertices $\{v_{a},v_{b},v_{c}\}\subseteq V$ such that $\{(v_{a},v_{b}),(v_{a},v_{c}),(v_{b},v_{c})\}\subseteq E$ . We will use $N(v)=\{u|u\in V\wedge((u,v)\in E)\}$ to denote the neighbor set of vertex $v\in V$ . The degree of vertex $v\in V$ is $d(v)=|N(v)|$ , and $d_{\text{max}}$ is the maximal degree of a vertex in graph $G$ .

With these notations, the total number of triangles in graph $G$ is denoted as $|\Delta(G)|$ . Specifically, $\Delta(G)=\{(u,v,w)|u,v,w$ are different vertices of $V$ and $(u,v),(v,w),(w,u)$ are edges of $E\}$ .

The triangle counting problem can be expressed in two ways based on edges and vertices:

•

For any edge $(u,v)\in E$, the number of triangles including $(u,v)$ is $|\Delta(u,v)|$, where $\Delta(u,v)=N(u)\cap N(v)$. Since each of a triangle's three edges counts the same triangle, and we count both $\Delta(u,v)$ and $\Delta(v,u)$, the total number of triangles is computed as $|\Delta(G)|=\frac{\sum_{(u,v)\in E}|\Delta(u,v)|}{6}$, using the edge-iteration-based method.
•

For any vertex $v\in V$ , the number of triangles including $v$ is $|\Delta(v)|$ , where $\Delta(v)=\{(u,w)\,|\,u,w\in N(v)\land(u,w)\in E\}$ . The total triangles are computed as $|\Delta(G)|=\frac{\sum_{v\in V}|\Delta(v)|}{6}$ , using the vertex-iteration-based method.
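The two formulations above can be checked against each other on a small graph. The following is a minimal sketch, not the paper's implementation; the helper names (`build_adjacency`, `count_by_edges`, `count_by_vertices`) and the example graph are assumptions introduced here for illustration.

```python
from itertools import combinations

def build_adjacency(n, edges):
    """Return the neighbor sets N(v) of an undirected simple graph on vertices 0..n-1."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def count_by_edges(adj):
    """Edge iteration: sum |N(u) & N(v)| over both directions of every edge, then divide by 6."""
    total = 0
    for u in range(len(adj)):
        for v in adj[u]:  # visits (u, v) here and (v, u) later
            total += len(adj[u] & adj[v])
    return total // 6

def count_by_vertices(adj):
    """Vertex iteration: count ordered pairs (u, w) of adjacent neighbors of v, then divide by 6."""
    total = 0
    for v in range(len(adj)):
        for u in adj[v]:
            for w in adj[v]:
                if u != w and w in adj[u]:
                    total += 1
    return total // 6

# Hypothetical example: a 4-clique plus one pendant vertex has C(4,3) = 4 triangles.
example_edges = list(combinations(range(4), 2)) + [(3, 4)]
example_adj = build_adjacency(5, example_edges)
print(count_by_edges(example_adj), count_by_vertices(example_adj))
```

Both functions divide by 6 because every triangle is seen once per ordered pair of its vertices, which matches the two formulas above.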

3. Related Work

3.1. Existing Sequential Algorithms

For triangle counting, the obvious algorithm is brute-force search (see Alg. 1), enumerating all $\Theta\left(n^{3}\right)$ triples of distinct vertices and checking how many of these triples are triangles. There are faster algorithms that require an adjacency matrix for the input graph representation and use fast matrix multiplication, such as the work of Alon, Yuster, and Zwick (Alon et al., 1997). Indeed, if $A$ is the adjacency matrix of $G$, then for any vertex $v$, the value $A_{vv}^{3}$ on the diagonal of $A^{3}$ is twice the number of triangles to which $v$ belongs. So the number of triangles is $\frac{1}{6}\mathrm{tr}(A^{3})$. Triangle counting can therefore be solved in $\mathcal{O}\left(n^{\omega}\right)$ time, where $\omega<2.372$ is the fast matrix product exponent (Alman and Williams, 2021) (Williams et al., 2023). Alon et al. (Alon et al., 1997) also show that it is possible to solve the triangle counting problem in $\mathcal{O}\left(m^{\frac{2\omega}{\omega+1}}\right)\subset\mathcal{O}\left(m^{1.41}\right)$ time. However, the implementation is infeasible for large, sparse graphs, and certain matrix multiplication methods fall short of listing all the triangles. For these reasons, despite their evident theoretical strength, these algorithms have limited practical impact.
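The trace identity can be illustrated with a few lines of code. This is a sketch only: it uses naive cubic-time matrix multiplication on a dense matrix rather than a fast matrix product, so it demonstrates the identity and not the asymptotic bound. The function names are hypothetical.

```python
def matmul(a, b):
    """Naive dense matrix product of two square integer matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def triangles_from_trace(n, edges):
    """Count triangles as tr(A^3) / 6 for an undirected simple graph."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1
    a3 = matmul(matmul(a, a), a)
    # Each diagonal entry of A^3 is twice the number of triangles through that vertex.
    return sum(a3[v][v] for v in range(n)) // 6

# Hypothetical example: one triangle {0, 1, 2} with a tail edge (2, 3).
print(triangles_from_trace(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

The dense $n\times n$ matrix is also why this route is impractical for large sparse graphs: the storage alone is quadratic in $n$.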

Algorithm 1 Triples

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall u\in V$
5:   $\forall v\in V$
6:     $\forall w\in V$
7:       if $(u,v)\in E\land(v,w)\in E\land(u,w)\in E$
8:         $T\leftarrow T+1$
9: return $T/6$

Another category of fundamental problem formulation is called subgraph query, which aims to identify instances of a triangle subgraph within the input graph. It is important to note that determining whether a general pattern subgraph occurs in a graph is an NP-hard problem. While various methods, including the backtracking strategy (Ullmann, 1976), have been introduced, they are not preferred choices for the triangle counting problem, particularly for large-scale graphs.

Latapy (Latapy, 2008) provides a survey on triangle counting algorithms for very large, sparse graphs. One of the earliest algorithms, tree-listing, published in 1978 by Itai and Rodeh (Itai and Rodeh, 1978) first finds a rooted spanning tree of the graph. After iterating through the non-tree edges and using criteria to identify triangles, the tree edges are removed and the algorithm repeats until no edges are remaining (see Alg. 2). This approach takes $\mathcal{O}\left(m^{\frac{3}{2}}\right)$ time (or $\mathcal{O}\left(n\right)$ for planar graphs).

Algorithm 2 Tree-listing (IR) (Itai and Rodeh, 1978)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: while $E$ is not empty
5:   $K\leftarrow$ Covering tree($G$)
6:   $\forall(u,v)\in E\ \land(u,v)\notin K$
7:     if $(\text{parent}(u),v)\in E$
8:       $T\leftarrow T+1$
9:     elif $(\text{parent}(v),u)\in E$
10:       $T\leftarrow T+1$
11:   $E\leftarrow E-K$
12: return $T/2$

The most common triangle counting algorithms in the literature include the vertex-iterator (Itai and Rodeh, 1978), (Latapy, 2008) and edge-iterator (Itai and Rodeh, 1978), (Latapy, 2008) approaches, which run in $\mathcal{O}\left(m\cdot d_{max}\right)$ time.

Algorithm 3 Vertex-Iterator (Itai and Rodeh, 1978), (Latapy, 2008)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall u\in V$
5:   $\forall v\in N(u)$
6:     $X=Intersection(N(u),N(v))$
7:     $T\leftarrow T+X$
8: return $T/6$

In vertex-iterator (see Alg. 3), for each vertex $u\in V$ , the algorithm examines the adjacency list $N(v)$ of each vertex $v\in N(u)$ . If there is vertex $w$ in the intersection of $N(u)$ and $N(v)$ , then the triplet $(u,v,w)$ forms a triangle. Arifuzzaman et al. (Arifuzzaman et al., 2019) study modifications of the vertex-iterator algorithm based on various methods for vertex ordering.

Algorithm 4 Edge-Iterator (Itai and Rodeh, 1978), (Latapy, 2008)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall(u,v)\in E$
5:   $X=Intersection(N(u),N(v))$
6:   $T\leftarrow T+X$
7: return $T/6$

In edge-iterator (see Alg. 4), each edge $(u,v)$ in the graph is examined, and the intersection of $N(u)$ and $N(v)$ is computed to find triangles. A common optimization is to use a direction-oriented approach that only considers edges $(u,v)$ where $u<v$. The variants of edge-iterator are often distinguished by the algorithm used to perform $Intersection(N(u),N(v))$. When the two adjacency lists are sorted, MergePath and BinarySearch can be used. MergePath performs a linear scan through both lists, counting the common elements. Makkar, Bader, and Green (Makkar et al., 2017) give an efficient MergePath algorithm for GPUs. Mailthody et al. (Mailthody et al., 2018) use an optimized two-pointer intersection (MergePath) for set intersection. BinarySearch, as the name implies, uses a binary search to determine whether each element of the smaller list is found in the larger list. Hash is another method for performing the intersection of two sets, and it does not require the adjacency lists to be sorted. A typical implementation of Hash initializes a Boolean array of size $m$ to all false. Then, the positions in Hash corresponding to the vertex values in $N(u)$ are set to true, and $N(v)$ is scanned, looking up in $\Theta\left(1\right)$ time whether or not there is a match for each vertex. Chiba and Nishizeki published one of the earliest edge-iterator algorithms with hashing for triangle finding in 1985 (Chiba and Nishizeki, 1985). The running time is $\mathcal{O}\left(a(G)m\right)$, where $a(G)$ is the arboricity of $G$, which is upper-bounded by $a(G)\leq\lceil(2m+n)^{\frac{1}{2}}/2\rceil$ (Chiba and Nishizeki, 1985). In 2018, Davis rediscovered this method, which he calls tri_simple in his comparison with SuiteSparse GraphBLAS (Davis, 2018). Mowlaei (Mowlaei, 2017) gave a variant of the edge-iterator algorithm that uses vectorized sorted set intersection and reorders the vertices using the reverse Cuthill-McKee heuristic.
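The three intersection strategies described above (MergePath, BinarySearch, and Hash) can be sketched as follows. This is an illustrative sketch under the assumption that adjacency lists are Python lists of integer vertex IDs; the function names are hypothetical, and the Boolean array passed to the hash variant is assumed to be large enough to be indexed by every vertex ID.

```python
from bisect import bisect_left

def intersect_merge_path(a, b):
    """Count common elements of two sorted lists with a two-pointer linear scan."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def intersect_binary_search(a, b):
    """Look up each element of the smaller sorted list in the larger sorted list."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    count = 0
    for x in small:
        pos = bisect_left(large, x)
        if pos < len(large) and large[pos] == x:
            count += 1
    return count

def intersect_hash(a, b, marks):
    """Mark the entries of a in a Boolean array, scan b, then reset the marks.

    The lists need not be sorted; marks must be all False on entry."""
    for x in a:
        marks[x] = True
    count = sum(1 for y in b if marks[y])
    for x in a:
        marks[x] = False
    return count

# Hypothetical adjacency lists sharing the vertices 3, 5, and 7.
n_u = [1, 3, 5, 7]
n_v = [3, 4, 5, 6, 7, 9]
marks = [False] * 10
print(intersect_merge_path(n_u, n_v),
      intersect_binary_search(n_u, n_v),
      intersect_hash(n_u, n_v, marks))
```

Resetting only the marked positions, rather than clearing the whole array, keeps the cost of the hash variant proportional to the two list lengths.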

In 2005, Schank and Wagner (Schank and Wagner, 2005; Schank, 2007) designed a fast triangle counting algorithm called forward (see Alg. 5) that is a refinement of the edge-iterator approach. Instead of intersections of the full adjacency lists, the forward algorithm uses a dynamic data structure $A(v)$ to store a subset of the neighborhood $N(v)$ for $v\in V$. Initially, each set $A()$ is empty, and after computing the intersection of the sets $A(u)$ and $A(v)$ for each edge $(u,v)$ (with $u<v$), $u$ is added to $A(v)$. This significantly reduces the size of the intersections needed to find triangles. The running time is $\mathcal{O}\left(m\cdot d_{\text{max}}\right)$. However, if one reorders the vertices in decreasing order of their degrees as a $\Theta\left(n\log n\right)$ time pre-processing step, the forward algorithm’s running time reduces to $\mathcal{O}\left(m^{\frac{3}{2}}\right)$. Ortmann and Brandes (Ortmann and Brandes, 2014) survey triangle counting algorithms, create a unifying framework for parsimonious implementations, and conclude that nearly every triangle listing variant is in $\mathcal{O}\left(m\cdot a(G)\right)$.

Algorithm 5 Forward Triangle Counting (F) (Schank and Wagner, 2005; Schank, 2007)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall v\in V$
5:   $A(v)\leftarrow\emptyset$
6: $\forall(u,v)\in E$
7:   if $(u<v)$ then
8:     $\forall w\in A(u)\cap A(v)$
9:       $T\leftarrow T+1$
10:     $A(v)\leftarrow A(v)\cup\{u\}$
11: return $T$

The forward-hashed algorithm (Schank and Wagner, 2005; Schank, 2007) (also called compact-forward (Latapy, 2008)) is a variant of the forward algorithm that uses the hashing described previously for the intersections of the $A()$ sets, see Algorithm 6. Low et al. (Low et al., 2017) derive a linear algebra method for triangle counting that does not use matrix multiplication. Their algorithm results in the forward-hashed algorithm.

Algorithm 6 Forward-Hashed Triangle Counting (FH) (Schank and Wagner, 2005; Schank, 2007)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall v\in V$
5:   $A(v)\leftarrow\emptyset$
6: $\forall(u,v)\in E$
7:   if $(u<v)$ then
8:     $\forall w\in A(u)$
9:       Hash[$w$] $\leftarrow$ true
10:     $\forall w\in A(v)$
11:       if Hash[$w$] then
12:         $T\leftarrow T+1$
13:     $\forall w\in A(u)$
14:       Hash[$w$] $\leftarrow$ false
15:     $A(v)\leftarrow A(v)\cup\{u\}$
16: return $T$
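The forward-hashed scheme, together with the optional degree-ordering pre-processing step, can be sketched as below. This is a hedged illustration rather than the authors' C code: the function name and the `reorder_by_degree` flag are assumptions, edges are visited in increasing order of their lower-ranked endpoint, and the marks for $A(u)$ are set once per vertex $u$ instead of once per edge (the set $A(u)$ no longer changes once $u$ is reached).

```python
def forward_hashed(n, edges, reorder_by_degree=True):
    """Count triangles with the forward algorithm, using a Boolean array for intersections."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Optional pre-processing: rank vertices in decreasing order of degree.
    if reorder_by_degree:
        order = sorted(range(n), key=lambda x: -len(adj[x]))
    else:
        order = list(range(n))
    rank = [0] * n
    for r, x in enumerate(order):
        rank[x] = r
    a = [[] for _ in range(n)]   # a[r] plays the role of A(v), indexed by rank
    marks = [False] * n
    count = 0
    for ru, u in enumerate(order):       # lower-ranked endpoint first
        for w in a[ru]:
            marks[w] = True
        for v in adj[u]:
            rv = rank[v]
            if ru < rv:
                # Every marked w in A(v) closes a triangle (w, u, v) with w < u < v in rank.
                count += sum(1 for w in a[rv] if marks[w])
                a[rv].append(ru)
        for w in a[ru]:
            marks[w] = False
    return count

# Hypothetical example: the complete graph on 4 vertices has 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(forward_hashed(4, k4))
```

Each triangle is found exactly once, at the edge joining its two highest-ranked vertices, so no final division is needed.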

3.2. Existing Parallel Algorithms

Although most of the sequential algorithms run fast on graphs that fit in main memory, the growing size of graphs, driven by ongoing technological advancements, poses a challenge. Parallel algorithms are therefore needed to further accelerate triangle counting. Alg. 7, Alg. 8, and Alg. 9 are the parallel versions of the three most common intersection-based triangle counting methods.

Algorithm 7 Parallel Edge Iterator with Merge Path (EMP)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall(u,v)\in E$ do in parallel
5:   $A\leftarrow N(u)$; $B\leftarrow N(v)$
6:   $x\leftarrow 0$; $y\leftarrow 0$
7:   while $x<|A|\ \land\ y<|B|$
8:     if $A[x]==B[y]$
9:       $T\leftarrow T+1$;
10:       $x\leftarrow x+1$; $y\leftarrow y+1$
11:     else
12:       if $A[x]<B[y]$
13:         $x\leftarrow x+1$
14:       else
15:         $y\leftarrow y+1$
16: return $T/6$

Algorithm 8 Parallel Edge Iterator with Binary Search (EBP)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall(u,v)\in E$ do in parallel
5:   for $l\in N(u)$
6:     $K\leftarrow N(v)$
7:     $bottom\leftarrow 0$; $top\leftarrow|K|$
8:     while $bottom<top$
9:       $mid\leftarrow bottom+(top-bottom)/2$
10:       if $K[mid]==l$
11:         $T\leftarrow T+1$
12:         break
13:       elif $K[mid]<l$
14:         $bottom\leftarrow mid+1$
15:       else
16:         $top\leftarrow mid$
17: return $T/6$

Algorithm 9 Parallel Edge Iterator with Hash (EHP)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall(u,v)\in E$ do in parallel
5:   for $w\in N(u)$
6:     hash($w$) = True
7:   for $w\in N(v)$
8:     if hash($w$)
9:       $T\leftarrow T+1$
10:   for $w\in N(u)$
11:     hash($w$) = False
12: return $T/6$

Besides the intersection-based methods, there are several optimized parallel algorithms in the literature. Shun and Tangwongsan (Shun and Tangwongsan, 2015) give a multi-core parallel algorithm for shared-memory machines. The algorithm has two steps: in the first step, each vertex is ranked based on degree and a ranked adjacency list is generated for each vertex, containing only vertices ranked higher than the current vertex; the second step counts triangles from the ranked adjacency list of each vertex using merge-path or hash. Parimalarangan et al. (Parimalarangan et al., 2017) present variations of triangle counting algorithms and how they relate to performance on shared-memory platforms. TriCore (Hu et al., 2018) partitions the graph held in a compressed sparse row (CSR) data structure for multiple GPUs, uses stream buffers to load edge lists from CPU memory to GPU memory on the fly, and then uses binary search to find the intersection. Hu et al. (Hu et al., 2021) employ a “copy-synchronize-search” pattern to improve the efficiency of parallel GPU threads and mix compute-intensive and memory-intensive workloads to improve resource efficiency. Zeng et al. (Zeng et al., 2022) present a triangle counting algorithm that adaptively selects between the vertex-parallel and edge-parallel paradigms.

4. Cover-Edge Based Triangle Counting Algorithms

4.1. Cover-Edge Set

Definition 1 (Cover-Edge, Cover-Edge Set and Covering Ratio).

For any edge $e$ of a triangle $\Delta$ in graph $G$ , $e$ is referred to as a cover-edge of $\Delta$ . For a given graph $G$ , an edge set $S\subseteq E$ is called a cover-edge set if it contains at least one cover-edge for every triangle in $G$ . $c=|S|/|E|$ is called the covering ratio.

Based on the given definition, it is evident that the entire edge set $E$ can serve as a cover-edge set $S$ for graph $G$ . However, our proposed method aims to efficiently count all triangles using a smaller subset of edges instead of $E$ . Thus, the primary challenge lies in generating a compact cover-edge set, which forms the initial problem to be addressed in our approach. Our goal is to identify a cover-edge set with the smallest $c$ . In this paper, we propose using breadth-first search (BFS) to generate a compact cover-edge set.

Definition 2 (BFS-Edge).

Let $r$ be the root vertex of an undirected graph $G$ . The level $L(v)$ of a vertex $v$ is defined as the shortest distance from $r$ to $v$ obtained through a breadth-first search (BFS). From the BFS, we classify the edges into three types:

•

Tree-Edges: These edges belong to the BFS tree.
•

Strut-Edges: These are non-tree edges with endpoints on two adjacent levels in the BFS traversal.
•

Horizontal-Edges: These are non-tree edges with endpoints on the same level in the BFS traversal.

Figure 1. An example of marking the different edge types based on a BFS spanning tree. The tree-edges are black, strut-edges are blue, and horizontal-edges are red.

Fig. 1 gives an example of these different edge types.

Lemma 3.

Each triangle $\{u,v,w\}$ in a graph contains at least one horizontal-edge in an arbitrarily rooted BFS tree.

Proof.

(Proof by contradiction) A triangle is a path of length 3 that starts and ends at the same vertex. Suppose there are no horizontal-edges in the triangle. In that case, every edge in the path (i.e., a tree-edge or strut-edge) either increases or decreases the level by one.

Since the path must end on the same level as the starting vertex, the number of edges in the path that decrease the level must be equal to the number of edges that increase the level. Consequently, the length of the path must be even to maintain level parity. However, this contradicts the fact that a triangle has an odd path length of 3.

Therefore, we conclude that there must be at least one horizontal-edge in every triangle. ∎
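The parity argument can also be written as a short worked equation. The symbols $\delta_{i}$ are introduced here purely for illustration and do not appear in the original proof.

```latex
% Level changes along the closed walk u -> v -> w -> u, assuming the triangle has no horizontal-edge:
\delta_{1}=L(v)-L(u),\quad \delta_{2}=L(w)-L(v),\quad \delta_{3}=L(u)-L(w),\qquad \delta_{i}\in\{-1,+1\}
% The walk returns to its starting level, so the changes must cancel:
\delta_{1}+\delta_{2}+\delta_{3}=0
% But a sum of three odd integers is odd, which is a contradiction:
\delta_{1}+\delta_{2}+\delta_{3}\in\{-3,-1,+1,+3\}
```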

Theorem 4 (Cover-Edge Set Generation).

All horizontal-edges in an arbitrarily rooted BFS tree form a valid cover-edge set.

Proof.

According to Lemma 3, every triangle $\Delta$ in graph $G$ contains at least one horizontal-edge, and by Definition 1 this edge serves as a cover-edge for $\Delta$. Thus, the set of all horizontal-edges constitutes a cover-edge set. ∎

Therefore, we can construct a cover-edge set, denoted as BFS-CES, by selecting all the horizontal-edges obtained during a breadth-first search (BFS). It is evident that BFS-CES is a subset of $E$ and is typically much smaller than the complete edge set $E$ .
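Constructing BFS-CES can be sketched in a few lines. This is a minimal illustration under stated assumptions: the graph is given as adjacency lists over vertices `0..n-1`, and the names `bfs_levels`, `horizontal_edges`, and the example graph are hypothetical.

```python
from collections import deque

def bfs_levels(adj):
    """Assign L(v) by BFS, restarting from any unvisited vertex (one root per component)."""
    level = [-1] * len(adj)
    for root in range(len(adj)):
        if level[root] != -1:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if level[v] == -1:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level

def horizontal_edges(adj, level):
    """Return BFS-CES: edges whose endpoints share a level, each listed once with u < v."""
    return [(u, v) for u in range(len(adj)) for v in adj[u]
            if u < v and level[u] == level[v]]

# Hypothetical example graph with triangles {0, 1, 2} and {1, 2, 3}.
example_adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2, 4], [3]]
example_level = bfs_levels(example_adj)
example_ces = horizontal_edges(example_adj, example_level)
num_edges = sum(len(nbrs) for nbrs in example_adj) // 2
print(example_level, example_ces, len(example_ces) / num_edges)
```

In this example a single horizontal-edge, $(1,2)$, covers both triangles, so the covering ratio is $c=1/6$.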

4.2. Cover-Edge Triangle Counting: CETC

In this subsection, we provide a comprehensive description of the algorithm to identify all triangles using a cover-edge set generated through a breadth-first search.

Lemma 5.

Each triangle $\{u,v,w\}$ must contain either one or three horizontal-edges.

Proof.

By referring to the proof of Lemma 3, we know that the path corresponding to the triangle’s three edges consists of an even number of tree-edges and strut-edges. This implies that there can be either 0 or 2 tree- or strut-edges within each triangle.

In the case where there are 0 tree- or strut-edges, all three edges of the triangle must be horizontal-edges. This is because the absence of tree- or strut-edges implies that the entire path is composed of horizontal-edges.

In the case where there are 2 tree- or strut-edges, the triangle contains exactly one horizontal-edge. This is because having two tree- or strut-edges in the path means that there is one horizontal-edge connecting the remaining two vertices.

Therefore, we conclude that each triangle $\{u,v,w\}$ must contain either one or three horizontal-edges. ∎

Our sequential triangle counting approach (CETC-Seq), described in Alg. 10, efficiently counts triangles using a cover-edge set. In line 3, we initialize the counter $T$ to 0, which will store the total number of triangles. To generate the cover-edge set, we perform a breadth-first search (BFS) starting from any unvisited vertex, identifying the level $L(v)$ of each vertex $v$ in its respective component, as shown in lines 4 to 5. In lines 6 to 10 the algorithm iterates over each edge, selecting the cover-set of horizontal edges $(u,v)$ in a direction-oriented fashion in line 7. For each vertex $w$ in the intersection of $u$ and $v$ ’s neighborhoods (line 8), we check the following two conditions to determine if $(u,v,w)$ is a unique triangle to be counted (line 9). If $L(u)\neq L(w)$ then the edge $(u,v)$ is the only horizontal-edge in the triangle $(u,v,w)$ . If $L(u)\equiv L(w)$ , then the edge $(u,v)$ is one of three horizontal-edges in the triangle $(u,v,w)$ . To ensure uniqueness, the algorithm then checks the added constraint that $v<w$ . If the constraints are satisfied, we increment the triangle counter $T$ in line 10.

This approach effectively counts the triangles in the graph while avoiding redundant counting.

Algorithm 10 CETC: Cover-Edge Triangle Counting (CETC-Seq)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall v\in V$
5:   if $v$ unvisited, then BFS($G$, $v$)
6: $\forall(u,v)\in E$
7:   if $(L(u)\equiv L(v))\land(u<v)$  $\triangleright$ $(u,v)$ is horizontal
8:     $\forall w\in N(u)\cap N(v)$
9:       if $(L(u)\neq L(w))\lor\left((L(u)\equiv L(w))\land(v<w)\right)$ then
10:         $T\leftarrow T+1$
11: return $T$
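The steps of Alg. 10 can be sketched as follows. This is an illustrative sketch, not the authors' C implementation: the function name `cetc_seq` and the edge-list input format are assumptions, and Python set intersection stands in for whichever intersection routine an implementation would use.

```python
from collections import deque

def cetc_seq(n, edges):
    """Count triangles by intersecting neighborhoods only at horizontal-edges."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # BFS from every unvisited vertex to obtain the level L(v) of each vertex.
    level = [-1] * n
    for root in range(n):
        if level[root] != -1:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if level[y] == -1:
                    level[y] = level[x] + 1
                    queue.append(y)
    count = 0
    for u in range(n):
        for v in adj[u]:
            if u < v and level[u] == level[v]:        # (u, v) is horizontal
                for w in adj[u] & adj[v]:
                    # Apex on another level: (u, v) is the only horizontal-edge.
                    # Apex on the same level: count only when u < v < w.
                    if level[w] != level[u] or v < w:
                        count += 1
    return count

# Hypothetical example: triangles {0, 1, 2} and {1, 2, 3} share the horizontal-edge (1, 2).
print(cetc_seq(5, [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]))
```

No final division is required, because the two conditions in the innermost test guarantee that each triangle is counted at exactly one of its horizontal-edges.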

Theorem 6 (Correctness).

Alg. 10 can accurately count all triangles in a graph $G$ .

Proof.

Lemma 5 establishes that a triangle in the graph falls into one of two cases: 1) the two endpoint vertices of the horizontal-edge are on the same level while the apex vertex is on a different level, or 2) all three vertices of the triangle are at the same level.

Consider a triangle $\{v_{a},v_{b},v_{c}\}$ in $G$ . Without loss of generality, assume that $(v_{a},v_{b})$ is a horizontal-edge, implying $L(v_{a})\equiv L(v_{b})$ . Let $v_{c}$ be the apex vertex. The two cases can be distinguished as follows:

For the first case, each triangle is uniquely defined by a horizontal-edge and an apex vertex from the common neighbors of the horizontal-edge’s endpoint vertices. Whenever Alg. 10 identifies such a triangle $\{v_{a},v_{b},v_{c}\}$ , it increments the total triangle count $T$ by 1.

In the second case, where all three vertices are at the same level ( $L(v_{c})\equiv L(v_{a})\equiv L(v_{b})$ ), Alg. 10 ensures that $T$ is increased by 1 only when $v_{a}<v_{b}<v_{c}$ . This condition ensures that triangle $\{v_{a},v_{b},v_{c}\}$ is counted only once, preventing triple-counting and ensuring the correctness of the triangle count.

Hence, Alg. 10 is proven to accurately count all triangles in the graph $G$ . ∎

The time complexity of Alg. 10 can be analyzed as follows. The computation of breadth-first search, including determining the level of each vertex and marking horizontal edges, requires $\mathcal{O}(n+m)$ time.

Since there are at most $\mathcal{O}(m)$ horizontal edges, finding the common neighbors of each horizontal edge individually can be done in $\mathcal{O}(d_{\text{max}})$ time. Here, $d_{\text{max}}$ represents the maximal degree of a vertex in the graph.

Therefore, the overall time complexity of CETC-Seq is $\mathcal{O}(m\cdot d_{\text{max}})$ .

4.3. Variants of CETC-Seq

4.3.1. CETC Forward Exchanging Triangle Counting Algorithm (CETC-Seq-FE)

The overall performance of CETC-Seq is closely related to the covering ratio $c$. A higher covering ratio means that fewer edges are eliminated, which increases the actual runtime of the algorithm. Therefore, after completing the BFS, a suitable algorithm can be selected based on $c$. Alg. 11 presents the variant of CETC-Seq that dynamically selects the most suitable approach based on $c$, called CETC-Seq-FE. Initially, it calculates $c$ using the BFS results. If the value of $c$ is below a specified threshold, we continue using Alg. 10; otherwise, Alg. 5 is chosen. The value of $c$ should at least be less than $\frac{m-n+1}{m}$; after comparing the performance of Alg. 10 and Alg. 5, we set the threshold to 0.7 in our experiments. Considering the analyses presented for Alg. 5 and Alg. 10, Alg. 11 maintains a time complexity of $\mathcal{O}\left(m^{1.5}\right)$.

Algorithm 11 CETC Forward Exchanging (CETC-Seq-FE)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall v\in V$
5:   if $v$ unvisited, then BFS($G$, $v$)
6: Calculate $c$ based on the BFS results
7: If $(c<threshold)$
8:   $T\leftarrow$ CETC-Seq($G$)  $\triangleright$ Alg. 10
9: Else
10:   $T\leftarrow$ TC_forward($G$)  $\triangleright$ Alg. 5
11: return $T$
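The selection step of Alg. 11 can be sketched as below. This is a hedged illustration only: the names `covering_ratio` and `select_counter` are hypothetical, the edge list is assumed to be non-empty and free of duplicates and self-loops, and the function merely reports which counter would be chosen instead of running it. The default threshold of 0.7 is the value quoted in the text.

```python
from collections import deque
from itertools import combinations

def covering_ratio(n, edges):
    """Return c = |horizontal-edges| / |E| after a BFS from every unvisited vertex."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    level = [-1] * n
    for root in range(n):
        if level[root] != -1:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if level[y] == -1:
                    level[y] = level[x] + 1
                    queue.append(y)
    horizontal = sum(1 for u, v in edges if level[u] == level[v])
    return horizontal / len(edges)

def select_counter(c, threshold=0.7):
    """Pick cover-edge counting for a compact cover-edge set, the forward algorithm otherwise."""
    return "CETC-Seq" if c < threshold else "forward"

# Hypothetical examples: complete graphs on 4 and 7 vertices.
k4 = list(combinations(range(4), 2))
k7 = list(combinations(range(7), 2))
print(select_counter(covering_ratio(4, k4)), select_counter(covering_ratio(7, k7)))
```

For the complete graph on 7 vertices, every edge not incident to the BFS root is horizontal, so $c=15/21$ equals $\frac{m-n+1}{m}$ and the forward algorithm is selected.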

4.3.2. CETC Split Triangle Counting Algorithm (CETC-Seq-S)

Algorithm 12 CETC Split Triangle Counting (CETC-Seq-S)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall v\in V$
5:   if $v$ unvisited, then BFS($G$, $v$)
6: $\forall(u,v)\in E$
7:   if $(L(u)\equiv L(v))$ then  $\triangleright$ $(u,v)$ is horizontal
8:     Add $(u,v)$ to $G_{0}$
9:   else
10:     Add $(u,v)$ to $G_{1}$
11: $T\leftarrow$ TC_forward-hashed($G_{0}$)  $\triangleright$ Alg. 6
12: $\forall u\in V_{G_{1}}$
13:   $\forall v\in N_{G_{1}}(u)$
14:     Hash[$v$] $\leftarrow$ true
15:   $\forall v\in N_{G_{0}}(u)$
16:     if $(u<v)$ then
17:       $\forall w\in N_{G_{1}}(v)$
18:         if Hash[$w$] then
19:           $T\leftarrow T+1$
20:   $\forall v\in N_{G_{1}}(u)$
21:     Hash[$v$] $\leftarrow$ false
22: return $T$

Alg. 12 is another variant of CETC-Seq, called CETC-Seq-S. This variant is similar to cover-edge triangle counting in Alg. 10 and uses BFS to assign a level to each vertex in lines 4 and 5. Next, in lines 6 to 10, the edges $E$ of the graph are partitioned into two sets: $E_{0}$, the horizontal edges whose endpoints are both on the same level, and $E_{1}$, the remaining tree and non-tree edges that span a level. Thus, we now have two graphs, $G_{0}=(V,E_{0})$ and $G_{1}=(V,E_{1})$, where $E=E_{0}\cup E_{1}$ and $E_{0}\cap E_{1}=\emptyset$. Triangles that are fully in $G_{0}$ are counted with one method, and triangles not fully in $G_{0}$ are counted with another method. For $G_{0}$, the graph with horizontal edges, we count the triangles efficiently using the forward-hashed method (line 11). For triangles not fully in $G_{0}$, we use $G_{1}$, the graph that contains the edges that span levels, with a hashed intersection approach in lines 12 to 21. As in cover-edge triangle counting, we need to find the intersections of the adjacency lists of the endpoints of horizontal edges. Thus, we use $G_{0}$ to select the edges and perform the hash-based intersections on the adjacency lists in graph $G_{1}$.

The proof of correctness for cover-edge triangle counting is given in Section 4.2. Alg. 12 is a hybrid version of this algorithm that partitions the edge set and uses two different methods to count these two types of triangles. The proof of correctness remains valid with these refinements to the algorithm.

The running time of Alg. 12 is the maximum of the running times of forward-hashing and Alg. 10. Alg. 12 uses hashing for the set intersections. For vertices $u$ and $v$, the cost is $\min(d(u),d(v))$, since the algorithm can check whether the neighbors of the lower-degree endpoint are in the hash set of the higher-degree endpoint. Over all edges $(u,v)$ in $E$, these intersections take $\mathcal{O}\left(m\cdot a(G)\right)$ expected time. Hence, Alg. 12 takes $\mathcal{O}\left(m\cdot a(G)\right)$ expected time.
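The split strategy of Alg. 12 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `cetc_split` is hypothetical, no degree reordering is applied, and the forward-hashed pass on $G_{0}$ marks $A(u)$ once per vertex rather than once per edge.

```python
from collections import deque

def cetc_split(n, edges):
    """Count triangles by splitting E into horizontal edges (G0) and level-spanning edges (G1)."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    level = [-1] * n
    for root in range(n):
        if level[root] != -1:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if level[y] == -1:
                    level[y] = level[x] + 1
                    queue.append(y)
    # Partition every adjacency list into its G0 part and its G1 part.
    g0 = [[v for v in adj[u] if level[v] == level[u]] for u in range(n)]
    g1 = [[v for v in adj[u] if level[v] != level[u]] for u in range(n)]
    marks = [False] * n
    count = 0
    # Triangles fully inside G0: forward-hashed counting.
    a = [[] for _ in range(n)]
    for u in range(n):
        for w in a[u]:
            marks[w] = True
        for v in g0[u]:
            if u < v:
                count += sum(1 for w in a[v] if marks[w])
                a[v].append(u)
        for w in a[u]:
            marks[w] = False
    # Triangles with exactly one horizontal-edge: pick the edge from G0, intersect within G1.
    for u in range(n):
        for v in g1[u]:
            marks[v] = True
        for v in g0[u]:
            if u < v:
                count += sum(1 for w in g1[v] if marks[w])
        for v in g1[u]:
            marks[v] = False
    return count

# Hypothetical example: in the complete graph on 4 vertices rooted at 0,
# one triangle lies in G0 and three have their apex at the root.
print(cetc_split(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))
```

The two passes count disjoint sets of triangles (three horizontal-edges versus exactly one), so their sum is the total.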

Similar to the forward-hashed method, pre-processing the graph by re-ordering the vertices in decreasing order of degree in $\Theta\left(n\log n\right)$ time often leads to a faster triangle counting algorithm in practice.

4.3.3. CETC-Split Recursive Triangle Counting Algorithm (CETC-Seq-SR)

Algorithm 13 CETC-Split Recursive Triangle Counting (CETC-Seq-SR)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T\leftarrow 0$
4: $\forall v\in V$
5:   if $v$ unvisited, then BFS($G$, $v$)
6: $\forall(u,v)\in E$
7:   if $(L(u)\equiv L(v))$ then  $\triangleright$ $(u,v)$ is horizontal
8:     Add $(u,v)$ to $G_{0}$
9:   else
10:     Add $(u,v)$ to $G_{1}$
11: if (size of $G_{0}>$ threshold) then
12:   $T\leftarrow$ CESR($G_{0}$)
13: else
14:   $T\leftarrow$ TC_forward-hashed($G_{0}$)  $\triangleright$ Alg. 6
15: $\forall u\in V_{G_{1}}$
16:   $\forall v\in N_{G_{1}}(u)$
17:     Hash[$v$] $\leftarrow$ true
18:   $\forall v\in N_{G_{0}}(u)$
19:     if $(u<v)$ then
20:       $\forall w\in N_{G_{1}}(v)$
21:         if Hash[$w$] then
22:           $T\leftarrow T+1$
23:   $\forall v\in N_{G_{1}}(u)$
24:     Hash[$v$] $\leftarrow$ false
25: return $T$

Alg. 13 is similar to Alg. 12. The only difference lies in the handling of the subgraph $G_{0}$ consisting of the horizontal edges. If its size is larger than the given threshold value, we recursively call the algorithm to further reduce the graph size (line 12). Otherwise, we directly call Alg. 6 to get the total number of triangles in $G_{0}$ (line 14). We use the same threshold value of 0.7 in the experiments as outlined for Alg. 11. The idea behind the recursive call is that we can quickly count the triangles containing edges across both $G_{0}$ and $G_{1}$, and then we can safely remove all the edges in $G_{1}$ to reduce the graph size. Finally, Alg. 6 will focus on a smaller graph whose edges may contain multiple triangles.

4.4. Parallel CETC Algorithm on Shared-Memory (CETC-SM)

In Section 3, we introduced three commonly employed intersection-based methods: merge-path, binary search, and hash, alongside their corresponding parallel versions as outlined in Alg. 7, Alg. 8, and Alg. 9.

The fundamental idea behind the proposed parallel algorithm is to compute, in parallel, the intersections of the neighbor lists of the two endpoints of each edge $(u,v)$, which significantly improves performance.

Algorithm 14 Shared Memory Parallel Cover-Edge Triangle Counting (CETC-SM)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $c_{1},c_{2}\leftarrow 0$
4: Run Parallel BFS on $G$ and mark the level.
5: $\forall(u,v)\in E$ do in parallel
6:   if $(L(u)\equiv L(v))\land(u<v)$ $\triangleright$ $(u,v)$ is horizontal
7:     $\forall w\in N(u)\cap N(v)$
8:       if ($L(w)\neq L(u)$) then
9:         $c_{1}\leftarrow c_{1}+1$
10:       else
11:         $c_{2}\leftarrow c_{2}+1$
12: $T\leftarrow c_{1}+c_{2}/3$
13: return $T$ $\triangleright$ See Alg. 10

Alg. 14 shows the parallelization of the Cover-Edge triangle counting algorithm for shared memory. In the context of the PRAM (Parallel Random Access Machine) model, both parallel BFS and parallel sorting have been shown to achieve scalable performance (Cormen et al., 2022). Let $p$ denote the total number of processors. For the set intersection on a single edge, the computation must remain well below $m^{0.5}$ (Schank, 2007), particularly when dealing with large input graphs. Consequently, the total work, which is $\mathcal{O}(m^{1.5})$, can be evenly distributed among the $p$ processors. As a result, CETC-SM exhibits a parallel time complexity of $\mathcal{O}(\frac{m^{1.5}}{p})$, ensuring scalability as the number of parallel processors increases.
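The structure of the shared-memory algorithm (independent work per horizontal edge, followed by a reduction of the two counters) can be illustrated with a short Python sketch. It assumes a dictionary-of-sets graph and uses a thread pool only to mirror the shape of the parallel loop; the names `cetc_sm` and `count_edge` are hypothetical, the BFS here is sequential, and Python threads will not reproduce the speedups of an OpenMP implementation.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def bfs_levels(adj):
    """Label every vertex with its BFS level, restarting in each component."""
    level = {}
    for s in adj:
        if s in level:
            continue
        level[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level

def cetc_sm(adj, workers=4):
    """Cover-edge counting with one independent task per horizontal edge."""
    level = bfs_levels(adj)
    horizontal = [(u, v) for u in adj for v in adj[u]
                  if u < v and level[u] == level[v]]

    def count_edge(edge):
        u, v = edge
        c1 = c2 = 0
        for w in adj[u] & adj[v]:
            if level[w] != level[u]:
                c1 += 1   # apex on another level: triangle seen once
            else:
                c2 += 1   # all three vertices on one level: seen three times
        return c1, c2

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(count_edge, horizontal))
    c1 = sum(a for a, _ in partial)
    c2 = sum(b for _, b in partial)
    return c1 + c2 // 3
```

Returning per-edge partial counts and summing them afterwards avoids shared mutable counters, which corresponds to a reduction on $c_{1}$ and $c_{2}$ in a threaded implementation.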

4.5. Communication-Efficient Parallel CETC Algorithm on Distributed-Memory (CETC-DM)

This subsection presents our communication-efficient parallel algorithm for counting triangles in massive graphs on a $p$ -processor distributed-memory parallel computer. We will take advantage of the concept of Cover-Edge Set to significantly improve the communication performance of our triangle counting method. Since distributed triangle counting is communication-bound (Pearce, 2017), this algorithm is expected to improve the overall running time. The input graph $G$ is stored in a compressed sparse row (CSR) format. The vertices are partitioned non-uniformly to the $p$ processors such that each processor stores approximately $2m/p$ edge endpoints. This graph input follows the format used by the majority of parallel graph algorithm implementations and benchmarks such as Graph500 and Graph Challenge.

Our communication-efficient parallel algorithm CETC-DM (see Alg. 15) is based on the same cover-edge approach proposed in Section 4.2. The binary operator $\oplus$ used in line 11 is bitwise exclusive OR (XOR).

Algorithm 15 CETC Communication Efficient Triangle Counting (CETC-DM)

1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: Run parallel BFS($G$) and build partial cover-edge set $S_{i}$ on $p_{i}$
4: For all $p_{i},i\in\{0\ldots p-1\}$ in parallel do:
5:   $t_{i}\leftarrow 0$
6:   $\forall(u,v)\in S_{i}$ with $u<v$ on $p_{i}$
7:     $\forall w\in V_{i}$ such that $w\in N(u),N(v)$
8:       if $(L(u)\neq L(w))\lor\left((L(u)\equiv L(w))\land(v<w)\right)$ then
9:         $t_{i}=t_{i}+1$
10:   For $j\leftarrow 1$ to $p-1$ do:
11:     Processors $i$ and $i\oplus j$ swap edge sets $S_{i}$ and $S_{j}$
12:     $\forall(u,v)\in S_{j}$ with $u<v$ on $p_{i}$
13:       $\forall w\in V_{i}$ such that $w\in N(u),N(v)$
14:         if $(L(u)\neq L(w))\lor\left((L(u)\equiv L(w))\land(v<w)\right)$ then
15:           $t_{i}=t_{i}+1$
16: $T\leftarrow$ Reduce$(t_{i},+)$

Similar to the sequential CETC-Seq algorithm, the cover-edge set $S=\cup_{i=0}^{p-1}S_{i}$ is determined in line 3 by labeling the horizontal edges from a parallel BFS.

Each processor runs lines 4 to 15 in parallel, which consist of two main substeps. Local triangles are counted in lines 6 to 9, and a total exchange of cover-edges between each pair of processors to count the remaining triangles is performed in lines 10 to 15. Note that at the end of each iteration of the for loop, processor $p_{i}$ can discard the cover-edge set $S_{j}$. In lines 7 and 13, processor $p_{i}$ determines, for each cover edge $(u,v)$, all the apex vertices $w$ held locally that are adjacent to both $u$ and $v$. The logic for counting triangles in lines 8 and 14 is similar to Alg. 10 and ensures that each triangle is counted only once. Finally, a reduction operation in line 16 calculates the total number of triangles by accumulating the $p$ triangle counters, i.e., $T=\sum_{i=0}^{p-1}t_{i}$.
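The exchange schedule can be made concrete with a single-process simulation. The sketch below is an illustration under several assumptions rather than a distributed implementation: vertices are integers assigned to processors cyclically by the hypothetical `owner` function (the paper partitions by edge endpoints instead), $p$ is a power of two so that $i\oplus j$ is always a valid processor id, and message passing is replaced by direct list access.

```python
from collections import deque

def bfs_levels(adj):
    """Label every vertex with its BFS level, restarting in each component."""
    level = {}
    for s in adj:
        if s in level:
            continue
        level[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level

def cetc_dm_simulated(adj, p=4):
    """Simulate the p-processor cover-edge exchange in one process."""
    assert p > 0 and p & (p - 1) == 0, "XOR pairing assumes p is a power of two"
    level = bfs_levels(adj)

    def owner(x):
        return x % p  # hypothetical cyclic vertex partition

    local_vertices = [[w for w in adj if owner(w) == i] for i in range(p)]
    cover = [[] for _ in range(p)]  # cover[i] plays the role of S_i
    for u in adj:
        for v in adj[u]:
            if u < v and level[u] == level[v]:
                cover[owner(u)].append((u, v))

    t = [0] * p
    for j in range(p):          # j == 0 is the local step, j >= 1 the exchange
        for i in range(p):      # every processor works on the set from i XOR j
            for u, v in cover[i ^ j]:
                for w in local_vertices[i]:
                    # Processor i only inspects adjacency lists it owns.
                    if u in adj[w] and v in adj[w]:
                        if level[u] != level[w] or v < w:
                            t[i] += 1
    return sum(t)               # the final Reduce(t_i, +)
```

Every pair of a cover edge and a candidate apex is examined exactly once, by the processor that owns the apex, which is why the per-processor counters can simply be summed.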

4.5.1. Cost Analysis

Space

In addition to the input graph data structure, one extra bit is needed per edge (for marking a horizontal edge) and $\mathcal{O}\left(\lceil\log D\rceil\right)$ bits per vertex to store its level, where $D$ is the diameter of the graph. This is a total of at most $m+n\lceil\log D\rceil$ bits across the $p$ processors. Preserving the input graph itself requires $\mathcal{O}\left(n+m\right)$ space.

Compute

The BFS costs $\mathcal{O}\left((n+m)/p\right)$ (Cormen et al., 2022). The search corresponding to one cover-edge in a vertex’s adjacency list takes at most $\mathcal{O}\left(\log(d_{max})\right)$ time using binary search, and only $\mathcal{O}\left(1\right)$ expected time using a hash table. Let $d_{i}$ be the degree of vertex $v_{i}$ where $0\leq i<n$. Searching $km$ edges in all vertices’ adjacency lists takes $\mathcal{O}(km\sum_{i=0}^{n-1}\log(d_{i}))=\mathcal{O}(km\log(\Pi_{i=0}^{n-1}d_{i}))$ time. Since $\sum_{i=0}^{n-1}d_{i}=2m$, the inequality of arithmetic and geometric means implies that $\log(\Pi_{i=0}^{n-1}d_{i})$ reaches its maximum value when $d_{i}=2m/n$ for $0\leq i<n$. Thus, using $m\leq n^{2}$ and $k\leq 1$, $\mathcal{O}(km\log(\Pi_{i=0}^{n-1}d_{i}))\leq\mathcal{O}(km\log((2m/n)^{n}))\leq\mathcal{O}(kmn\log(2n^{2}/n))=\mathcal{O}(mn\log(n))$.

Total Communication

In our analysis of communication cost for BFS, we measure the total communication volume independent of the number of processors. Thus, this is a conservative overestimate of communication since a fraction (e.g., $1/p$ ) of accesses will be on the same compute node versus message traffic between nodes. At the same time, we do not consider the savings from overlapping communication with computation.

The cost of the breadth-first search is $m$ edge traversals with $\lceil\log D\rceil+3\lceil\log n\rceil$ bits communicated per edge traversal for the level information, pair of vertex ids, and vertex degree, yielding $m\cdot(\lceil\log D\rceil+3\lceil\log n\rceil)$ bits for the BFS. Transferring $km$ horizontal-edges requires $kmp\lceil\log n\rceil$ bits, where $p$ is the number of processors. The final reduction to find the total number of triangles requires $(p-1)\lceil\log n\rceil$ bits.

Hence, the total communication volume is $m\cdot(\lceil\log D\rceil+3\lceil\log n\rceil)+kmp\lceil\log n\rceil+(p-1)\lceil\log n\rceil=m\cdot(\lceil\log D\rceil+(kp+3)\lceil\log n\rceil)+(p-1)\lceil\log n\rceil$ bits. Hence, since the word size is $\Theta\left(\log n\right)$ and $D\leq n$, the communication of CETC-DM is $\mathcal{O}\left(pm\right)$ words.
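The three terms of the communication volume can be evaluated directly. The following sketch assumes base-2 logarithms (since the quantities are bit counts) and takes the number of horizontal edges $km$ as an integer argument; the function names are illustrative.

```python
def ceil_log2(x):
    """Return ceil(log2(x)) for an integer x >= 1, using integer arithmetic."""
    return (x - 1).bit_length()

def cetc_dm_comm_bits(n, m, diameter, horizontal_edges, p):
    """Total communication volume in bits: BFS + cover-edge exchange + reduction."""
    log_n = ceil_log2(n)
    log_d = ceil_log2(diameter)
    bfs_bits = m * (log_d + 3 * log_n)            # level, two vertex ids, degree
    exchange_bits = horizontal_edges * p * log_n  # km edges, as charged in the text
    reduce_bits = (p - 1) * log_n                 # final triangle-count reduction
    return bfs_bits + exchange_bits + reduce_bits
```

For example, with $n=1024$ and $m=16384$ (the size of RMAT 10) and hypothetical values $D=8$, $km=8192$, and $p=4$, the call `cetc_dm_comm_bits(1024, 16384, 8, 8192, 4)` evaluates to $540672+327680+30=868382$ bits.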

5. Open Source Evaluation Framework

In the preceding sections, we presented all sequential and shared-memory algorithms from the literature known to the authors, plus our novel approaches. In this section, we introduce our open-source framework designed to integrate comprehensive triangle counting implementations.

No unified framework currently encompasses all of these implementations, although one is important for researchers who want to compare the performance of existing algorithms and to assess them against newly proposed methods. Consequently, we have developed a comprehensive open-source framework to solve this problem. This framework is designed to ensure a thorough evaluation of triangle counting algorithms. It includes implementations of 22 sequential methods and 11 shared-memory parallel methods, which is as complete a set as we could find in the literature.

Each triangle counting routine has a single argument – a pointer to the graph in a compressed sparse row (CSR) format. The input is treated as read-only. Each algorithm is charged the full cost if the implementation needs auxiliary arrays, pre-processing steps, or additional data structures. Each implementation must manage memory and not contain any memory leaks – hence, any dynamically allocated memory must be freed before returning the result.

The output from each implementation is an integer with the number of triangles found. Each algorithm is run ten times, and the mean running time is reported. To reduce variance for random graphs, the same graph instance is used for all of the experiments. For sequential algorithms, the source code is sequential C code without any explicit parallelization. For parallel algorithms, we use OpenMP to parallelize the C code. The same coding style and effort were used for each implementation.
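The measurement protocol described above (read-only CSR input, integer result, mean of ten runs) can be sketched as a small harness. This is an illustrative Python analogue of a C framework, not the framework's actual interface: `count_triangles_csr` is a hypothetical stand-in counter, and the CSR graph is passed as two arrays rather than a single pointer.

```python
import time

def count_triangles_csr(row_ptr, col_idx):
    """Stand-in counter on a symmetric CSR graph; counts each triangle once."""
    n = len(row_ptr) - 1
    total = 0
    for u in range(n):
        neighbors_u = set(col_idx[row_ptr[u]:row_ptr[u + 1]])
        for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:
            if u < v:
                for w in col_idx[row_ptr[v]:row_ptr[v + 1]]:
                    if v < w and w in neighbors_u:
                        total += 1
    return total

def mean_runtime(counter, row_ptr, col_idx, runs=10):
    """Run a counter several times; return its result and mean wall-clock time."""
    elapsed = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        # Auxiliary structures built inside the counter are charged to it.
        result = counter(row_ptr, col_idx)
        elapsed.append(time.perf_counter() - start)
    return result, sum(elapsed) / runs
```

Timing the whole call, including any set or hash-table construction, matches the rule that each algorithm is charged the full cost of its auxiliary data structures.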

Here, we list the algorithms evaluated in the experiments of the next section, including both established methods and the newly proposed algorithms. Algorithm names that end with $P$ indicate that we have also developed a parallel version.

W/WP: Wedge-checking/Parallel version
WD/WDP: Wedge-checking (direction-oriented)/Parallel version
EM/EMP: Edge Iterator with MergePath for set intersection/Parallel version
EMD/EMDP: Edge Iterator with MergePath for set intersection (direction-oriented)/Parallel version
EB/EBP: Edge Iterator with BinarySearch for set intersection/Parallel version
EBD/EBDP: Edge Iterator with BinarySearch for set intersection (direction-oriented)/Parallel version
ET/ETP: Edge Iterator with partitioning for set intersection/Parallel version
ETD/ETDP: Edge Iterator with partitioning for set intersection (direction-oriented)/Parallel version
EH/EHP: Edge Iterator with Hashing for set intersection/Parallel version
EHD/EHDP: Edge Iterator with Hashing for set intersection (direction-oriented)/Parallel version
F: Forward
FH: Forward with Hashing
FHD: Forward with Hashing and degree-ordering
TS: Tri_simple (Davis, 2018)
LA: Linear Algebra (CMU; Low et al., 2017)
IR: Treelist from Itai-Rodeh (Itai and Rodeh, 1978)
CETC-Seq/CETC-SM: Cover Edge Triangle Counting (Bader et al., 2023)/Parallel version on shared-memory
CETC-Seq-D: Cover Edge Triangle Counting with degree-ordering (Bader et al., 2023)
CETC-Seq-FE: Cover Edge Forward Exchanging Triangle Counting
CETC-Seq-S: Cover Edge Split Triangle Counting (Bader, 2023)
CETC-Seq-SD: Cover Edge Split Triangle Counting with degree-ordering (Bader, 2023)
CETC-Seq-SR: Cover Edge-Split Recursive Triangle Counting

6. Experimental Results

6.1. Platform Configuration

We use the Intel Development Cloud for benchmarking our results on a GNU/Linux node. The compiler is Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320), and ‘-O2’ is used as the compiler optimization flag. We use a recently launched Intel Xeon processor (Sapphire Rapids, launched Q1’23) with DDR5 memory for both sequential and parallel implementations. The node is a dedicated 2.00 GHz 56-core (112 thread) Intel(R) Xeon(R) Platinum 8480+ processor (formerly known as Sapphire Rapids) with 105M cache and 1024GB of DDR5 RAM.

6.2. Data Sets

We employ a diverse collection of graphs. The real-world datasets are from SNAP. For the synthetic graphs, we use large Graph500 RMAT graphs (Chakrabarti et al., 2004) with parameters $a=0.57$ , $b=0.19$ , $c=0.19$ , and $d=0.05$ , similar to the IARPA AGILE benchmark graphs. An overview of all 24 graphs in our dataset is presented in Table 1.

The values of $c$ exhibit substantial variation across different graphs, ranging from 93.8% (RMAT 6) down to 14% (roadNet-TX), as listed in Table 1. Smaller $c$ values signify a higher potential for avoiding fruitless searches, thereby enhancing the efficiency of our approach.

Table 1. Data Sets for the Experiments

Graph Name	Graph ID	n	m	# triangles	$c$ (%)
RMAT 6	1	64	1024	9100	93.8
RMAT 7	2	128	2048	18855	90.9
RMAT 8	3	256	4096	39602	87.6
RMAT 9	4	512	8192	86470	87.2
RMAT 10	5	1024	16384	187855	82.8
RMAT 11	6	2048	32768	408876	81.1
RMAT 12	7	4096	65536	896224	77.5
RMAT 13	8	8192	131072	1988410	74.9
RMAT 14	9	16384	262144	4355418	70.5
RMAT 15	10	32768	524288	9576800	68.4
RMAT 16	11	65536	1048576	21133772	65.5
RMAT 17	12	131072	2097152	46439638	62.8
karate	13	34	78	45	35.9
amazon0302	14	262111	899792	717719	44.2
amazon0312	15	400727	2349869	3686467	52.4
amazon0505	16	410236	2439437	3951063	52.7
amazon0601	17	403394	2443408	3986507	52.8
loc-Brightkite	18	58228	214078	494728	43.2
loc-Gowalla	19	196591	950327	2273138	50.8
roadNet-CA	20	1971281	2766607	120676	14.5
roadNet-PA	21	1090920	1541898	67150	14.6
roadNet-TX	22	1393383	1921660	82869	14
soc-Epinions1	23	75888	405740	1624481	53.3
wiki-Vote	24	8297	100762	608389	54.3

6.3. Results and Analysis of Sequential Algorithms

The execution times of the sequential algorithms (in seconds) are presented in Table 2.

Table 2. Execution time (in seconds) for sequential algorithms.

Graph	W	WD	EM	EMD	EB	EBD	ET	ETD	EH	EHD	F
RMAT 6	0.0023	0.000435	0.000608	0.000301	0.001534	0.000752	0.001722	0.000896	0.000118	0.000055	0.000055
RMAT 7	0.005482	0.001603	0.001973	0.000992	0.004567	0.002257	0.005145	0.002765	0.000354	0.000166	0.000219
RMAT 8	0.016084	0.004622	0.005455	0.002726	0.012873	0.006365	0.01379	0.007551	0.000866	0.000429	0.000642
RMAT 9	0.04884	0.014317	0.014643	0.007337	0.031095	0.014825	0.018992	0.010409	0.001124	0.000567	0.000912
RMAT 10	0.081409	0.023978	0.020342	0.01015	0.046255	0.023038	0.049716	0.02734	0.00289	0.001477	0.002433
RMAT 11	0.260367	0.077086	0.053735	0.026881	0.109204	0.054332	0.128502	0.071314	0.006977	0.00352	0.006196
RMAT 12	0.896607	0.262051	0.141176	0.070548	0.293548	0.146626	0.331856	0.185648	0.017128	0.008613	0.015863
RMAT 13	2.975701	0.876912	0.372609	0.186514	0.73989	0.369528	0.849023	0.476537	0.044211	0.022125	0.040797
RMAT 14	10.520327	3.108799	0.987748	0.492118	1.829937	0.914373	2.192774	1.241524	0.114199	0.056226	0.104752
RMAT 15	35.785918	10.461789	2.626837	1.307338	4.834125	2.397788	5.607495	3.185066	0.31823	0.152634	0.27252
RMAT 16	122.100925	35.690483	6.931398	3.452957	12.020692	5.942763	14.38633	8.202219	1.072639	0.51392	0.714004
RMAT 17	426.945522	124.153096	18.328512	9.153039	31.596123	15.577652	37.014029	21.238411	3.249189	1.582771	1.865376
karate	0.000015	0.000007	0.000012	0.000006	0.000019	0.000009	0.000033	0.000014	0.000009	0.000005	0.000004
amazon0302	0.298977	0.052727	0.143474	0.066293	0.193383	0.090165	0.295645	0.152895	0.06663	0.03324	0.024458
amazon0312	1.293403	0.410326	0.720808	0.352071	1.007608	0.474253	1.587602	0.895909	0.286009	0.135694	0.108818
amazon0505	1.4014	0.458928	0.75572	0.370004	1.071849	0.505645	1.687541	0.950925	0.296272	0.140679	0.114741
amazon0601	1.400537	0.458847	0.762124	0.372543	1.085706	0.511053	1.694514	0.952832	0.300645	0.142758	0.118401
loc-Brightkite	0.473063	0.158358	0.115803	0.058309	0.1789	0.089439	0.262685	0.153408	0.025772	0.013417	0.011734
loc-Gowalla	9.018113	4.425076	2.083049	1.03675	1.247894	0.610038	2.850531	2.066343	0.354619	0.173588	0.085867
roadNet-CA	0.089097	0.032726	0.102766	0.064406	0.106291	0.06895	0.168963	0.096467	0.073492	0.051154	0.035571
roadNet-PA	0.074498	0.035727	0.07553	0.036011	0.060924	0.039204	0.097066	0.05476	0.041419	0.028734	0.019805
roadNet-TX	0.088466	0.023446	0.070821	0.044442	0.071961	0.047007	0.117474	0.066475	0.050618	0.035551	0.024401
soc-Epinions1	4.892297	1.853538	0.615327	0.306373	1.144934	0.569414	1.540422	0.876375	0.09952	0.047979	0.062959
wiki-Vote	0.830841	0.210642	0.12436	0.062224	0.290031	0.145106	0.355414	0.190386	0.019756	0.009999	0.01816

Graph	FH	FHD	TS	LA	IR	CETC-Seq	CETC-Seq-D	CETC-Seq-FE	CETC-Seq-S	CETC-Seq-SD	CETC-Seq-SR
RMAT 6	0.000028	0.00003	0.000045	0.000049	0.001057	0.000096	0.00009	0.000036	0.000042	0.000041	0.000034
RMAT 7	0.000076	0.000079	0.000118	0.0002	0.00343	0.00045	0.000337	0.000083	0.000117	0.000105	0.000112
RMAT 8	0.00021	0.000213	0.000341	0.000601	0.010451	0.00118	0.001008	0.000242	0.000316	0.000326	0.000292
RMAT 9	0.000279	0.000284	0.000474	0.000844	0.017542	0.001573	0.001312	0.000309	0.000402	0.000415	0.000393
RMAT 10	0.000633	0.000623	0.001238	0.002231	0.056125	0.003793	0.00329	0.000721	0.000929	0.000907	0.000905
RMAT 11	0.001469	0.001386	0.003257	0.005701	0.175472	0.008988	0.007996	0.001635	0.001938	0.002022	0.001902
RMAT 12	0.0034	0.003114	0.008204	0.014481	0.541587	0.020948	0.018846	0.003683	0.004291	0.004306	0.004719
RMAT 13	0.007899	0.006939	0.021101	0.036664	1.762489	0.048973	0.043984	0.008508	0.009416	0.009234	0.010386
RMAT 14	0.018383	0.015386	0.05573	0.091965	5.926348	0.112335	0.100385	0.019555	0.02067	0.019464	0.020786
RMAT 15	0.045298	0.038136	0.212061	0.23366	19.623668	0.266065	0.233845	0.982818	0.047584	0.043019	0.046379
RMAT 16	0.120033	0.095672	0.502867	0.595802	63.575078	0.630521	0.539548	2.500813	0.117718	0.098857	0.111664
RMAT 17	0.326685	0.25219	1.509844	1.513653	209.558816	1.491119	1.245597	6.385706	0.325907	0.241169	0.284544
karate	0.000004	0.000008	0.000005	0.000002	0.000093	0.000006	0.00001	0.000006	0.000009	0.000011	0.000009
amazon0302	0.01929	0.03971	0.038784	0.0228	0.375845	0.04485	0.066	0.05224	0.044427	0.061649	0.064679
amazon0312	0.067849	0.120262	0.164739	0.100246	2.223462	0.145356	0.195258	0.216578	0.121874	0.165797	0.207789
amazon0505	0.070928	0.125915	0.169891	0.10574	2.194572	0.151568	0.204355	0.225309	0.130232	0.174099	0.200114
amazon0601	0.072972	0.127608	0.173651	0.107426	2.112184	0.153916	0.207247	0.229265	0.132083	0.176369	0.202132
loc-Brightkite	0.006487	0.00982	0.013752	0.011592	0.58484	0.011902	0.015295	0.028136	0.008623	0.012866	0.011192
loc-Gowalla	0.038467	0.054073	0.188503	0.079424	6.433136	0.074779	0.088801	0.276325	0.045462	0.063348	0.061056
roadNet-CA	0.03906	0.170009	0.05232	0.038795	0.656398	0.083151	0.209056	0.100744	0.125747	0.235934	0.141022
roadNet-PA	0.021824	0.083848	0.029642	0.021815	0.366677	0.044418	0.109172	0.054114	0.061127	0.123917	0.068803
roadNet-TX	0.027021	0.106699	0.03607	0.026994	0.506742	0.054871	0.138155	0.067203	0.075937	0.163585	0.08579
soc-Epinions1	0.019544	0.022428	0.051883	0.063396	4.939747	0.043198	0.041572	0.168057	0.021562	0.023076	0.026793
wiki-Vote	0.004964	0.00498	0.009605	0.019463	0.545068	0.013	0.014294	0.034383	0.005118	0.00535	0.005741

6.3.1. Effect of Direction-Oriented on Sequential Algorithms

The DO performance optimization is a pivotal strategy in triangle counting, designed to mitigate redundant calculations. In this section, we explore five distinct duplicate counting algorithms, each accompanied by its corresponding DO variant. The results presented in Fig. 2 vividly demonstrate the speedup achieved by the DO counterparts compared to their duplicate counting versions.

Across all scenarios, the majority of DO algorithms yield a speedup of at least two-fold. In particular, the WD algorithm stands out with a higher average speedup of $3.637\times$ , surpassing the performance gains of the other algorithms. EBD exhibits a speedup of $2.015\times$ , closely followed by EMD at $2.005\times$ , EHD at $1.965\times$ , and ETD at $1.784\times$ .

DO optimization primarily constitutes an algorithmic enhancement, resulting in a reduction in the overall number of operations. So, for any graph, it can improve the performance and our experimental results also confirm its efficiency. However, the practical performance gains can be impacted by various factors, including memory access patterns and cache utilization. Our comprehensive experiments, conducted on diverse graphs using a range of algorithms, underscore the substantial performance enhancements achievable through DO optimization.

In summary, DO optimization is efficient for eliminating duplicate triangle counting and significantly improving overall performance.
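The roughly two-fold gain can be illustrated with a pair of edge-iterator counters. The sketch assumes that the DO variant simply restricts the iterator to edges with $u<v$, which may differ in detail from the implementations evaluated here; the function names are illustrative.

```python
def count_duplicate(adj):
    """Visit both directions of every edge; each triangle is found 6 times."""
    hits = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u])
    return hits // 6

def count_direction_oriented(adj):
    """Visit every edge once (u < v); each triangle is found 3 times."""
    hits = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return hits // 3
```

Both functions return the same count, but the second performs half as many set intersections, which is consistent with speedups close to $2\times$ for the edge-iterator variants.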

6.3.2. Effect of Hash Method on Sequential Algorithms

Similar to the DO optimization, the hash-based optimization proves highly efficient in most scenarios. In Fig. 3, we illustrate the speedup achieved by Hash methods compared to non-hash implementations. The first comparison showcases the speedup of the Hash set intersection (EH) compared with the non-hashed method (EM), while the second presents the speedup of (FH) compared with (F).

The average speedup of EH is $5.4\times$ , and for FH, it is $3.0\times$ , underscoring the effectiveness of the hash-based optimization. Notably, the results reveal that, for more efficient algorithms, like F, the speedup is slightly lower than that of less efficient algorithms, such as EM.

However, we observe several exceptions. For roadNet-CA (Graph ID=20), roadNet-PA (Graph ID=21), and roadNet-TX (Graph ID=22), the Hash algorithm FH performs worse than the non-hashed algorithm F. This is attributed to the unique topologies of these graphs, characterized by relatively long diameters and very few neighbors for each vertex. As the intersection sets are relatively small, the MergePath operation on small sets proves more efficient than the Hash method, given the relatively high hash table overhead for very small sets. Therefore, the Hash optimization is efficient in general, but not for graphs with such topologies and diameters, where the hash table overhead is not offset by the small intersection sets.
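The trade-off between the two intersection kernels can be seen in a minimal sketch. The code assumes sorted adjacency lists for the merge-path variant and uses a Python set as the hash table; the function names are illustrative.

```python
def intersect_merge_path(a, b):
    """Count common elements of two sorted lists in O(len(a) + len(b)) steps."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def intersect_hash(a, b):
    """Count common elements by probing a hash table built from one list."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    table = set(large)  # building the table is the fixed overhead
    return sum(1 for x in small if x in table)
```

When both lists hold only a handful of neighbors, as in the road networks, the table construction dominates and the simple two-pointer scan is cheaper.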

6.3.3. Effect of Forward Algorithm and Its Variants

Our experimental results underscore the effectiveness of the Forward algorithm and its variants in enhancing the performance of triangle counting. In Fig. 4, we present the speedup achieved by three algorithms, namely the forward algorithm (F), the hashed forward algorithm (FH), and the hashed forward algorithm with degree ordering (FHD), in comparison with the traditional MergePath algorithm.

The observed performance improvement is significant. Specifically, F achieves an $8.6\times$ speedup, while FH and FHD achieve even more substantial speedups at $28.7\times$ and $29.1\times$ , respectively. These results indicate that reducing the sizes of intersection sets and employing hash functions and degree ordering can collectively contribute to performance enhancements.

Similar to the hash method, degree ordering demonstrates substantial performance improvements across various scenarios. However, for roadNet-CA (Graph ID=20), roadNet-PA (Graph ID=21), and roadNet-TX (Graph ID=22), the hash-based algorithm FH performs worse than the non-hashed algorithm F, and the performance of the degree-ordered variant FHD is inferior to that of MergePath. This arises from the fact that most vertices in these graphs have similar, small degrees. Consequently, reordering the vertices has minimal impact on intersection performance and introduces additional overhead. Despite these exceptions, the combined approach of reducing intersection set sizes, hash functions, and degree ordering enhances performance in a wide range of cases.
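A common formulation of the forward algorithm with degree ordering and hashed intersection is sketched below. It is an illustration under assumptions (dictionary-of-sets graph, Python sets as hash tables, ties in degree broken by vertex id) and is not claimed to match the evaluated FHD implementation line by line.

```python
def forward_hashed_degree_ordered(adj):
    """Forward algorithm: intersect dynamically grown sets of earlier neighbors."""
    # Process vertices in decreasing order of degree.
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    rank = {v: r for r, v in enumerate(order)}
    seen = {v: set() for v in adj}  # neighbors of v that were processed earlier
    triangles = 0
    for u in order:
        for v in adj[u]:
            if rank[u] < rank[v]:
                # Common earlier neighbors of u and v each close one triangle.
                triangles += len(seen[u] & seen[v])
                seen[v].add(u)
    return triangles
```

The sets in `seen` only ever contain higher-degree neighbors, so they stay much smaller than the full adjacency lists on skewed graphs; on graphs with near-uniform small degrees, such as road networks, the sort adds cost without shrinking them much.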

6.3.4. Effect of CETC-Seq Algorithm and Its Variants

The fundamental principle underlying the cover-edge method is minimizing unnecessary set intersection operations. In Fig. 5, we illustrate the impact of the CETC-Seq algorithm and its variants, namely CETC-Seq-D, CETC-Seq-FE, CETC-Seq-S, CETC-Seq-SD, CETC-Seq-SR.

Compared to the MergePath method, CETC-Seq demonstrates an average speedup of $7.4\times$ . CETC-Seq-D achieves a slightly lower average speedup of $7.39\times$ , mainly due to its low performance on the road networks.

CETC-Seq-FE combines CETC-Seq and F in a unique manner. It employs CETC-Seq for large graphs or when the $c$ value is small; otherwise, it uses F. This switching approach yields an average speedup of $14.6\times$ . The rationale behind CETC-Seq-FE lies in dynamically selecting the most suitable algorithm based on its compatibility with the characteristics of the graphs.

CETC-Seq-S splits a graph into two parts based on the vertex levels marked by a BFS pre-processing and applies CETC-Seq and F on each part. The performance of CETC-Seq-S achieves a speedup of $23.4\times$ . This represents a more efficient combination method. Additionally, when we integrate degree ordering into CETC-Seq-S, the resulting CETC-Seq-SD algorithm performs slightly better than CETC-Seq-S, achieving a speedup of $24.0\times$ . This result highlights that degree ordering works well with the F algorithm. The reason is that degree ordering can further reduce the size of intersecting sets of the F algorithm. CETC-Seq-SR employs the recursive method to simplify the problem. For a large graph, it recursively applies CETC-Seq to minimize set intersections, counting only triangles including non-horizontal edges, and finally applies F to the smaller graph consisting of all horizontal edges that are known to include all the other triangles. CETC-Seq-SR achieves an average speedup of $22.7\times$ .

Notably, CETC-Seq exhibits low performance on certain graphs compared to other methods. The relatively high overhead of BFS preprocessing in CETC-Seq, compared with set intersection, contributes to the low efficiency. A breakdown of the execution time reveals that BFS processing accounts for 60% of the total. In the case of a long-diameter graph where each vertex has a small number of neighbors, the overhead of BFS becomes large even though its time complexity of $\mathcal{O}(m)$ is lower than the $\mathcal{O}(m^{1.5})$ time complexity of the set intersections. For road networks characterized by very small vertex degrees, where degree ordering introduces additional overhead without providing any significant benefit, CETC-Seq-D experiences further performance degradation.

6.3.5. Comprehensive Sequential Algorithms Comparison

In Fig. 6, we present the relative execution time for twenty-two sequential triangle counting algorithms. We call triangle searching operations that fail to find any triangle fruitless operations.

While optimal in time complexity, the IR spanning tree-based triangle counting algorithm exhibits nearly the slowest performance among all the compared algorithms. This is due to the involvement of spanning tree generation, removal of tree edges, and regeneration of a smaller graph in each iteration. Although these operations can be completed in $\mathcal{O}(m)$ time, the cost is relatively high in terms of practical performance.

The W wedge-checking-based triangle counting algorithm often performs poorly. This is primarily because most graphs are sparse, so most wedge-checking operations are fruitless, i.e., most wedges do not form a triangle. For example, for the RMAT 6 graph, the percentage of wedges/triangles is 0.53%. For the RMAT 14 graph, the percentage reduces to 0.009%. This makes most of the checks useless for counting triangles. W can demonstrate better performance only when most of the graph’s wedges form triangles, a scenario that is uncommon in practical applications.

The algorithmic structures of EM (Edge Merge Path), EB (Edge Binary Search), ET (Edge Partitioning), and EH (Edge Hash) are very similar to each other, differing primarily in the set intersection methods they employ. Merge path requires pre-sorted adjacency lists, enabling it to compare the two adjacency lists of a given edge $(u,v)$ in $d(u)+d(v)$ time. This is optimal because we have to check every neighbor. The binary search method EB searches for each vertex of the smaller adjacency list (e.g., $N(u)$ ) in the larger adjacency list (e.g., $N(v)$ ) in $d(u)\times\log(d(v))$ time. ET is a specific case of EB and involves additional operations to find the midpoint of the two adjacency lists. Thus, from an algorithmic analysis perspective, ET’s performance will always be worse than EB’s. However, EB and ET can leverage parallelism effectively to improve performance. Our parallel results demonstrate that they may outperform EM. EH takes $\min(d(u),d(v))<d(u)+d(v)$ operations to find triangles, and the Hash method does not require pre-sorted adjacency lists, making it better than EM and often the best performer among the four methods.

TS (Triangle Summation) and LA (Linear Algebra) are two linear algebra-based methods. They can count the total number of triangles but cannot list all the triangles. Their performance improvements depend on optimizing formulas and architecture-related methods. The advantage of such methods lies in their ability to directly apply results from linear algebra theory and leverage highly optimized numerical techniques integrated into linear algebra libraries. Their performance is often superior to that of the EM method.

F (Forward) demonstrates excellent performance in most scenarios but is inherently sequential. As we discussed earlier, F dynamically generates two sets that are much smaller than the original adjacency lists. It is based on the DO method, which further reduces the fruitless checks in triangle counting operations. Additionally, pre-sorting vertices in non-increasing order of degree enhances memory access locality and cache hit ratios. As one can observe, F effectively reduces the operations that cannot find new triangles. The results of FH and FHD show that the performance is further improved when Hash is used.

CETC-Seq and its variants introduce another perspective for eliminating fruitless searches in triangle counting. First, CETC-Seq skips unnecessary edge searches based on a quick BFS operation that can be completed in $\mathcal{O}(m+n)$ time. By leveraging the direction-oriented technique, CETC-Seq achieves a further significant reduction in the fruitless searches during triangle counting. It is competitive with the fastest approaches and is most useful when the BFS preprocessing overhead is negligible. CETC-Seq-S and its variants further optimize the performance with Hash, degree ordering, and the recursive method.

We assigned rank values to each test case and calculated the average rank value. Ranked from best to worst, the algorithms are FH, CETC-Seq-S, FHD, CETC-Seq, LA, F, EHD, TS, CETC-Seq-SR, CETC-Seq-SD, CETC-Seq-FE, EH, CETC-Seq-D, EMD, EBD, ET, WD, EB, EM, ETD, W, IR.

The top four set intersection-based triangle counting algorithms thus include our novel CETC-Seq-S and CETC-Seq algorithms. The performance of CETC-Seq-S, with an average rank of 2.80, is slightly worse than that of FH, with an average rank of 2.0. The average rank of FHD is 4.6 and that of CETC-Seq is 6.4.

6.3.6. Influence of the $c$ Value on the Performance of the Novel Algorithm

Building upon the definition of our novel algorithm, its performance should be highly related to the covering ratio $c$ .

A noteworthy trend emerges when evaluating the results, particularly for the RMAT graphs. Our findings reveal that the forward algorithm and its variants tend to perform the fastest. As the scale of the RMAT graph increases, the parameter $c$ decreases, indicating a more substantial removal of fruitless checks after BFS. Under these conditions, our novel method demonstrates greater efficiency compared to the F algorithms.

These observations validate our hypothesis that the performance of our new algorithm is significantly correlated with the covering ratio $c$ . As $c$ decreases, performance improves.

At the same time, an analysis of the road network graphs (roadNet-CA, roadNet-PA, roadNet-TX) reveals that they diverge from the other graphs. Road networks, unlike social networks, often have only low-degree vertices (for instance, many degree-four vertices) and large diameters. Although the covering ratio of these road networks is under 15%, we see less benefit from the new approach on them. So, a lower $c$ value does not always yield higher performance.
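To relate these observations to a concrete quantity, the following sketch computes the fraction of edges that are horizontal after a BFS. It assumes that this fraction corresponds to the covering ratio $c$ reported in Table 1; the value depends on the BFS roots chosen (here, the first unvisited vertex in dictionary order), and the function names are illustrative.

```python
from collections import deque

def bfs_levels(adj):
    """Label every vertex with its BFS level, restarting in each component."""
    level = {}
    for s in adj:
        if s in level:
            continue
        level[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level

def covering_ratio(adj):
    """Fraction of undirected edges whose endpoints share a BFS level."""
    level = bfs_levels(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    horizontal = sum(1 for u in adj for v in adj[u]
                     if u < v and level[u] == level[v])
    return horizontal / m if m else 0.0
```

A star graph has no horizontal edges and therefore a ratio of 0, while a complete graph on four vertices has three horizontal edges out of six, giving 0.5.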

6.4. Results and Analysis of Parallel Algorithms on Shared-Memory

The execution times of the parallel algorithms (in seconds) are presented in Table 3 for 32 threads and in Table 4 for 224 threads.

Table 3. Execution time (in seconds) for shared-memory parallel algorithms. (32 threads)

| Graph | WP | WDP | EMP | EMDP | EBP | EBDP | ETP | ETDP | EHP | EHDP | CETC-SM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RMAT 6 | 0.000449 | 0.000268 | 0.000226 | 0.000137 | 0.000251 | 0.000149 | 0.000236 | 0.000163 | 0.00018 | 0.000102 | 0.000143 |
| RMAT 7 | 0.001223 | 0.000446 | 0.000218 | 0.000154 | 0.000298 | 0.000159 | 0.000284 | 0.000232 | 0.000153 | 0.000088 | 0.000112 |
| RMAT 8 | 0.004417 | 0.001139 | 0.000413 | 0.000259 | 0.000607 | 0.000314 | 0.000627 | 0.000486 | 0.000368 | 0.000207 | 0.00033 |
| RMAT 9 | 0.006898 | 0.002485 | 0.000661 | 0.000604 | 0.001263 | 0.000684 | 0.001419 | 0.001355 | 0.000492 | 0.000273 | 0.00043 |
| RMAT 10 | 0.018563 | 0.006674 | 0.001527 | 0.001515 | 0.003311 | 0.001667 | 0.003505 | 0.003399 | 0.000902 | 0.000511 | 0.000935 |
| RMAT 11 | 0.043502 | 0.020521 | 0.003952 | 0.003951 | 0.007708 | 0.0041 | 0.008889 | 0.008896 | 0.002429 | 0.001303 | 0.002447 |
| RMAT 12 | 0.117014 | 0.049646 | 0.005842 | 0.005817 | 0.011434 | 0.005798 | 0.012514 | 0.011956 | 0.00309 | 0.001837 | 0.003849 |
| RMAT 13 | 0.279936 | 0.107251 | 0.014796 | 0.013801 | 0.027184 | 0.013905 | 0.03255 | 0.028344 | 0.006466 | 0.003631 | 0.00932 |
| RMAT 14 | 0.738807 | 0.294576 | 0.035303 | 0.034693 | 0.064022 | 0.031186 | 0.078634 | 0.068282 | 0.015326 | 0.008446 | 0.024843 |
| RMAT 15 | 1.945767 | 0.883749 | 0.093881 | 0.090363 | 0.159664 | 0.080606 | 0.193024 | 0.167142 | 0.034382 | 0.017455 | 0.054523 |
| RMAT 16 | 5.445053 | 2.616904 | 0.232325 | 0.225416 | 0.388562 | 0.190387 | 0.471949 | 0.405698 | 0.075692 | 0.038662 | 0.123288 |
| RMAT 17 | 16.105178 | 7.952496 | 0.600006 | 0.570236 | 1.017046 | 0.495515 | 1.201604 | 1.01826 | 0.173762 | 0.085809 | 0.272745 |
| karate | 0.000047 | 0.000046 | 0.000045 | 0.000052 | 0.000046 | 0.000049 | 0.000047 | 0.000047 | 0.000046 | 0.000052 | 0.00005 |
| amazon0302 | 0.008796 | 0.022824 | 0.014149 | 0.036935 | 0.020339 | 0.039651 | 0.018675 | 0.019454 | 0.011521 | 0.029648 | 0.027349 |
| amazon0312 | 0.040122 | 0.072721 | 0.039475 | 0.115382 | 0.058515 | 0.125074 | 0.069724 | 0.086947 | 0.045718 | 0.099768 | 0.095591 |
| amazon0505 | 0.047963 | 0.084478 | 0.045625 | 0.14529 | 0.0728 | 0.147519 | 0.082892 | 0.090064 | 0.046591 | 0.10615 | 0.09897 |
| amazon0601 | 0.048284 | 0.083188 | 0.05004 | 0.134631 | 0.072389 | 0.149232 | 0.082411 | 0.091988 | 0.049378 | 0.107361 | 0.101416 |
| loc-Brightkite | 0.026786 | 0.018883 | 0.01209 | 0.023231 | 0.006994 | 0.012797 | 0.008975 | 0.008192 | 0.004159 | 0.005755 | 0.005673 |
| loc-Gowalla | 1.720423 | 0.544702 | 0.509203 | 0.076514 | 0.048047 | 0.986971 | 0.991823 | 0.111531 | 0.072055 | 0.027526 | 0.027009 |
| roadNet-CA | 0.006707 | 0.099629 | 0.041482 | 0.090839 | 0.054792 | 0.11503 | 0.063274 | 0.095634 | 0.063806 | 0.06235 | 0.059879 |
| roadNet-PA | 0.002327 | 0.045298 | 0.04421 | 0.051884 | 0.027293 | 0.050139 | 0.031387 | 0.046944 | 0.034754 | 0.033182 | 0.02988 |
| roadNet-TX | 0.00303 | 0.061665 | 0.043471 | 0.07115 | 0.040102 | 0.080569 | 0.04338 | 0.060169 | 0.041411 | 0.042741 | 0.040724 |
| soc-Epinions1 | 0.188828 | 0.029691 | 0.024934 | 0.044974 | 0.02401 | 0.065555 | 0.052619 | 0.02325 | 0.011699 | 0.018498 | 0.019213 |
| wiki-Vote | 0.020278 | 0.00745 | 0.004677 | 0.013179 | 0.007166 | 0.018188 | 0.011463 | 0.006211 | 0.003359 | 0.007699 | 0.007689 |

Table 4. Execution time (in seconds) for shared-memory parallel algorithms. (224 threads)

| Graph | WP | WDP | EMP | EMDP | EBP | EBDP | ETP | ETDP | EHP | EHDP | CETC-SM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RMAT 6 | 0.00262 | 0.001665 | 0.00159 | 0.002007 | 0.001779 | 0.002791 | 0.001626 | 0.002978 | 0.002411 | 0.001657 | 0.001823 |
| RMAT 7 | 0.005946 | 0.0012 | 0.001177 | 0.000863 | 0.001092 | 0.001765 | 0.001126 | 0.000937 | 0.000799 | 0.000756 | 0.000913 |
| RMAT 8 | 0.01391 | 0.004556 | 0.002609 | 0.002877 | 0.002534 | 0.002047 | 0.003943 | 0.002736 | 0.002399 | 0.002099 | 0.002797 |
| RMAT 9 | 0.032362 | 0.008507 | 0.003102 | 0.002114 | 0.002848 | 0.004165 | 0.005741 | 0.00493 | 0.002538 | 0.002196 | 0.002278 |
| RMAT 10 | 0.096783 | 0.023608 | 0.006077 | 0.004446 | 0.006682 | 0.007183 | 0.009634 | 0.009532 | 0.002827 | 0.002962 | 0.00238 |
| RMAT 11 | 0.244285 | 0.034017 | 0.005143 | 0.005314 | 0.008019 | 0.006563 | 0.012168 | 0.011372 | 0.002077 | 0.001958 | 0.002647 |
| RMAT 12 | 0.603094 | 0.054447 | 0.007437 | 0.007343 | 0.005859 | 0.00551 | 0.019386 | 0.019021 | 0.002241 | 0.001818 | 0.003157 |
| RMAT 13 | 1.767242 | 0.191662 | 0.017554 | 0.017532 | 0.013061 | 0.012225 | 0.048984 | 0.048837 | 0.004743 | 0.003995 | 0.005822 |
| RMAT 14 | 5.805617 | 0.6103 | 0.043436 | 0.043317 | 0.032064 | 0.025181 | 0.119665 | 0.124235 | 0.008927 | 0.00792 | 0.011054 |
| RMAT 15 | 19.502675 | 1.075846 | 0.1124 | 0.112662 | 0.06781 | 0.050401 | 0.284255 | 0.274718 | 0.021382 | 0.01982 | 0.026385 |
| RMAT 16 | 4.564897 | 2.724765 | 0.27826 | 0.273954 | 0.125003 | 0.10814 | 0.566664 | 0.548104 | 0.052475 | 0.046825 | 0.046742 |
| RMAT 17 | 14.404249 | 8.292626 | 0.636928 | 0.623498 | 0.359324 | 0.209283 | 1.255369 | 1.193338 | 0.126567 | 0.11914 | 0.10871 |
| karate | 0.000246 | 0.001053 | 0.001054 | 0.00139 | 0.001052 | 0.001067 | 0.001843 | 0.002898 | 0.002982 | 0.001061 | 0.001648 |
| amazon0302 | 0.028943 | 0.007874 | 0.016644 | 0.00981 | 0.011362 | 0.006467 | 0.013704 | 0.007236 | 0.035765 | 0.024087 | 0.039814 |
| amazon0312 | 0.10007 | 0.046193 | 0.029063 | 0.020546 | 0.030613 | 0.016588 | 0.067293 | 0.053589 | 0.065335 | 0.038084 | 0.084911 |
| amazon0505 | 0.134932 | 0.043264 | 0.03243 | 0.021458 | 0.032284 | 0.017594 | 0.070966 | 0.055361 | 0.071815 | 0.039119 | 0.090892 |
| amazon0601 | 0.111001 | 0.050005 | 0.031588 | 0.02097 | 0.032657 | 0.017569 | 0.071926 | 0.053841 | 0.063443 | 0.035649 | 0.092445 |
| loc-Brightkite | 0.298194 | 0.037024 | 0.007619 | 0.005149 | 0.005186 | 0.002977 | 0.009518 | 0.009214 | 0.008017 | 0.003441 | 0.005172 |
| loc-Gowalla | 4.642555 | 1.850222 | 0.551523 | 0.597855 | 0.042251 | 0.035543 | 0.99552 | 1.153058 | 0.142559 | 0.135726 | 0.02852 |
| roadNet-CA | 0.018777 | 0.008275 | 0.034299 | 0.018575 | 0.029132 | 0.016644 | 0.029072 | 0.016026 | 0.033069 | 0.021434 | 0.094693 |
| roadNet-PA | 0.012264 | 0.005497 | 0.02696 | 0.015148 | 0.016778 | 0.013736 | 0.016576 | 0.010023 | 0.020221 | 0.012846 | 0.048229 |
| roadNet-TX | 0.013831 | 0.005999 | 0.026138 | 0.01613 | 0.024868 | 0.011356 | 0.019374 | 0.011121 | 0.024195 | 0.014831 | 0.058367 |
| soc-Epinions1 | 2.622935 | 0.322065 | 0.032298 | 0.029065 | 0.022567 | 0.01809 | 0.087574 | 0.08822 | 0.014144 | 0.00968 | 0.010896 |
| wiki-Vote | 0.527084 | 0.028837 | 0.005353 | 0.00425 | 0.007596 | 0.005707 | 0.0129 | 0.008829 | 0.002952 | 0.001597 | 0.003614 |

6.4.1. Performance Utilizing 32 Threads

While F and its variants excel as sequential algorithms, they are inherently sequential and cannot be parallelized. In this section, we focus on algorithms conducive to parallelization to showcase the speedups achieved with parallel methods. Fig. 7 illustrates the speedups of various parallel algorithms compared to their corresponding sequential counterparts, employing 32 threads.

The average speedups are as follows: WP is $10.5\times$ ; WDP is $7.5\times$ ; EMP is $13.6\times$ ; EMDP is $8.3\times$ ; EBP is $23.3\times$ ; EBDP is $19.3\times$ ; ETP is $16.2\times$ ; ETDP is $10.8\times$ ; EHP is $6.6\times$ ; EHDP is $5.0\times$ ; CETC-SM is $3.9\times$ . The results affirm that parallel optimization significantly improves performance.

However, certain scenarios highlight limitations. For instance, in the case of the small-sized graph “karate” (Graph ID=13), all parallel algorithms fail to exhibit performance improvements. This can be attributed to the inherent overhead of the OpenMP parallel method, which outweighs the benefits for very small graphs. A similar pattern is observed for the graph RMAT 6 (Graph ID=1), where three parallel methods—EHP, EHDP, and CETC-SM—show no performance improvement. As previously mentioned, the baseline algorithms EH, EHD, and CETC-Seq have already demonstrated high performance, and the parallel overhead for small graphs nullifies the potential benefits of parallelization.

6.4.2. Performance Utilizing All System Threads

When we harness the full parallel capacity of our experimental system, we can execute our OpenMP parallel programs with 224 threads. Fig. 8 shows the execution time percentages of the algorithms when all 224 system threads are used. We assigned a rank to each algorithm on each test case and calculated its average rank. From best to worst, the order is EHDP, EBDP, EMDP, EHP, EBP, CETC-SM, EMP, ETDP, WDP, ETP, WP.

The results show that hashing and binary search deliver good parallel performance by minimizing the operations per thread, and that degree ordering further improves the performance of individual threads.

6.4.3. Scalable Performance

This subsection examines how the performance of these algorithms responds to varying thread counts. We use RMAT 15 as an example of a synthetic graph and amazon0312 as a representative real graph. We increase the number of threads through 2, 4, 8, 16, 32, 64, 128, and 224 and observe how the speedup changes.

Fig. 9 illustrates the change in speedup with the increasing number of threads on RMAT 15. For most algorithms, a bottleneck emerges at 64 threads, with no discernible speedup from further increases in thread count. Notably, the WP algorithm degrades in performance as parallel threads are added. The only algorithm demonstrating notable scalability is EBP, which improves consistently as the number of threads increases. A similar observation holds for the real graph (see Fig. 10), where most algorithms encounter a bottleneck at 64 threads. However, EBP and EBDP exhibit good scalability, indicating that binary search-based methods scale better than the other approaches.

6.4.4. Best Performance on Different Graphs

In this section, we use EM as the performance baseline to evaluate the best speedup achieved by different algorithms. The results are summarized in Table 5. The number following a specific algorithm name indicates how many parallel threads are used.

Integrated optimization methods demonstrate a substantial speedup, averaging $75.8\times$ over the baseline. Examining the algorithms on different graphs offers insight into the optimization methods.
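As a quick arithmetic check, the speedups in Table 5 are the baseline time divided by the best time, and the reported average is the plain mean over the 24 graphs. The short Python sketch below reproduces both from values copied out of Table 5; the variable and function names are chosen for this example only.

```python
# Speedup values copied from Table 5 (Graph IDs 1 to 24).
speedups = [21.7, 26.0, 26.4, 67.5, 53.1, 62.7, 99.3, 125.8, 147.6, 174.8, 224.4, 233.7,
            6.0, 22.2, 43.5, 43.0, 43.4, 40.9, 104.0, 35.8, 45.2, 23.4, 69.8, 79.9]

def speedup(baseline_seconds, best_seconds):
    """Speedup of the best configuration over the EM baseline."""
    return baseline_seconds / best_seconds

mean_speedup = sum(speedups) / len(speedups)

print(round(speedup(0.0006080, 0.0000280), 1))  # Graph 1: prints 21.7
print(round(mean_speedup, 1))                   # prints 75.8
```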

Firstly, for small graphs like RMAT 6 (Graph ID=1), RMAT 7 (Graph ID=2), and karate (Graph ID=13), parallel optimization techniques fail to outperform the sequential FH and linear algebra LA methods. Practical performance considerations suggest that employing multiple parallel threads might introduce overhead for small graphs, making sequential methods more efficient.

Secondly, as graph size increases, optimal performance often requires more parallel resources. However, beyond a critical point, additional parallel resources may decrease performance. For example, RMAT 8 (Graph ID=3) and roadNet-TX (Graph ID=22) achieve peak performance with 32 threads. In contrast, RMAT 9 to RMAT 10 (Graph ID 4-5), RMAT 12 (Graph ID=7), RMAT 14 to RMAT 17 (Graph ID 9-12), loc-Brightkite (Graph ID=18), and roadNet-PA (Graph ID=21) require 64 threads. Certain graphs, such as RMAT 11 (Graph ID=6), RMAT 13 (Graph ID=8), loc-Gowalla (Graph ID=19), roadNet-CA (Graph ID=20), soc-Epinions1 (Graph ID=23), and wiki-Vote (Graph ID=24), demand 128 threads. Larger graphs such as amazon0302, amazon0312, amazon0505, and amazon0601 (Graph ID 14-17) leverage the full parallel resources of the system (224 threads). Notably, graph size alone does not determine parallel resource needs, as topology plays a crucial role in parallel performance. The parallel algorithm that achieves the best performance also differs from graph to graph: EHDP is the best 13 times, EBDP four times, WDP three times, and CETC-SM once.

Thirdly, the optimizations needed for the best performance can differ between the sequential and parallel settings. For instance, WD might not be ideal in a sequential scenario because it checks numerous wedges, many of which are not fruitful for sparse graphs. However, in a parallel scenario, WDP64 excels with 64 threads on roadNet-PA (Graph ID=21), surpassing the other algorithms. The efficiency arises from the smaller number of wedges when vertex degrees are low, coupled with the direction-oriented (DO) optimization that reduces fruitless searches. Another case is EBD, which may not be favorable as a sequential algorithm because it performs more total operations than EMD. However, in the parallel setting, EBDP could outperform MergePath by distributing work more efficiently through parallel binary searches.
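The contrast between merge-based and binary search-based set intersection can be sketched in Python as follows. This is an illustration of the two general techniques, not the EMD or EBD code evaluated in this paper; the function names are hypothetical, and both functions assume sorted adjacency lists without duplicates.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Count common elements of two sorted lists with one linear merge pass."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count

def intersect_binary(a, b):
    """Count common elements by binary-searching the longer list."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    count = 0
    for x in short:  # each lookup is independent of all the others
        k = bisect_left(long_, x)
        if k < len(long_) and long_[k] == x:
            count += 1
    return count

neighbors_u = [1, 3, 5, 7, 9]
neighbors_v = [3, 4, 5, 6, 9, 11]
print(intersect_merge(neighbors_u, neighbors_v), intersect_binary(neighbors_u, neighbors_v))
```

The merge pass carries its two cursors from one step to the next, whereas each binary search lookup stands alone. That independence is one reason the lookups are easy to divide among threads, even though they can add up to more total operations than a single merge.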

In conclusion, our results highlight that different algorithms find their optimal scenarios based on specific graph topology and hardware configurations. Graph topology and available hardware resources are pivotal factors in selecting the most efficient triangle counting algorithm.

Table 5. Best Performance and Algorithms for Different Graphs (times in seconds).

| Graph | Baseline Time | Best Time | Algorithm | Speedup |
|---|---|---|---|---|
| 1 | 0.0006080 | 0.0000280 | FH | 21.7 |
| 2 | 0.0019730 | 0.0000760 | FH | 26.0 |
| 3 | 0.0054550 | 0.0002070 | EHDP32 | 26.4 |
| 4 | 0.0146430 | 0.0002170 | EHDP64 | 67.5 |
| 5 | 0.0203420 | 0.0003830 | EHDP64 | 53.1 |
| 6 | 0.0537350 | 0.0008570 | EHDP128 | 62.7 |
| 7 | 0.1411760 | 0.0014210 | EHDP64 | 99.3 |
| 8 | 0.3726090 | 0.0029630 | EHDP128 | 125.8 |
| 9 | 0.9877480 | 0.0066920 | EHDP64 | 147.6 |
| 10 | 2.6268370 | 0.0150250 | EHDP64 | 174.8 |
| 11 | 6.9313980 | 0.030892 | EHDP64 | 224.4 |
| 12 | 18.3285120 | 0.078430 | EHDP64 | 233.7 |
| 13 | 0.0000120 | 0.0000020 | LA | 6.0 |
| 14 | 0.1434740 | 0.0064670 | EBDP224 | 22.2 |
| 15 | 0.7208080 | 0.0165880 | EBDP224 | 43.5 |
| 16 | 0.7557200 | 0.0175940 | EBDP224 | 43.0 |
| 17 | 0.7621240 | 0.0175690 | EBDP224 | 43.4 |
| 18 | 0.1158030 | 0.0028280 | EHDP64 | 40.9 |
| 19 | 2.0830490 | 0.0200290 | CETC-SM128 | 104.0 |
| 20 | 0.1027660 | 0.002870 | WDP128 | 35.8 |
| 21 | 0.0755300 | 0.001671 | WDP64 | 45.2 |
| 22 | 0.0708210 | 0.003030 | WDP32 | 23.4 |
| 23 | 0.6153270 | 0.0088180 | EHDP128 | 69.8 |
| 24 | 0.1243600 | 0.0015570 | EHDP128 | 79.9 |

6.5. Communication Analysis of CETC-DM

Table 6. Communication costs of CETC-DM for real and synthetic graphs. The synthetic graphs are Graph500 RMAT graphs of scale 36 and 42. The column ‘Previous’ gives the communication volume of the best prior parallel algorithms (Dolev et al., 2012; Pearce and Sanders, 2018; Sanders and Uhl, 2023), which use wedge checking, and ‘This paper’ gives the communication cost of our new approach, CETC-DM. ‘Reduction’ is the ratio between these two and thus the expected speedup of the parallel algorithm. Entries in italics are estimated values.

| Graph | n | m | # Triangles | # Wedges | $c$ | $p$ | Previous | This paper | Reduction |
|---|---|---|---|---|---|---|---|---|---|
| ca-GrQc | 5242 | 14484 | 48260 | 165798 | 0.522 | 4 | 526KB | 122KB | 4.31 |
| ca-HepTh | 9877 | 25973 | 28339 | 277389 | 0.423 | 4 | 948KB | 218KB | 4.35 |
| as-caida20071105 | 26475 | 53381 | 36365 | 776895 | 0.225 | 4 | 2.78MB | 401KB | 7.10 |
| facebook_combined | 4039 | 88234 | 1612010 | 17051688 | 0.914 | 4 | 48.8MB | 893KB | 56.0 |
| ca-CondMat | 23133 | 93439 | 173361 | 1567373 | 0.511 | 4 | 5.61MB | 897KB | 6.40 |
| ca-HepPh | 12008 | 118489 | 3358499 | 5081984 | 0.621 | 4 | 17.0MB | 1.13MB | 15.1 |
| email-Enron | 36692 | 183831 | 727044 | 5933045 | 0.478 | 4 | 22.6MB | 1.79MB | 12.7 |
| ca-AstroPh | 18772 | 198050 | 1351441 | 8451765 | 0.667 | 4 | 30.2MB | 2.08MB | 14.6 |
| loc-brightkite_edges | 58228 | 214078 | 494728 | 6956250 | 0.441 | 4 | 26.5MB | 2.02MB | 20.4 |
| soc-Epinions1 | 75879 | 405740 | 1624481 | 21377935 | 0.498 | 4 | 86.7MB | 4.25MB | 10.7 |
| amazon0601 | 403394 | 2443408 | 3986507 | 96348699 | 0.529 | 8 | 436MB | 40.9MB | 10.7 |
| com-Youtube | 1134890 | 2987624 | 3056386 | 209811585 | 0.347 | 8 | 1.03GB | 44.3MB | 23.7 |
| RMAT-36 | 68719476736 | 1099511627776 | 1.2E+14 | 2.73E+16 | 0.311 | 128 | 218PB | 192TB | 1156 |
| RMAT-42 | 4398046511104 | 70368744177664 | 1.3E+16 | 5.79E+18 | 0.260 | 256 | 52.8EB | 22.8PB | 2368 |

In this section, we analyze the performance of the parallel triangle counting algorithm CETC-DM (Alg. 15) on both real and synthetic graphs. We implemented our new triangle counting algorithm in Python to compute the exact communication volume and to derive an analytic model based on the size of the graph, the number of processors, and the covering ratio ( $c$ ) from the BFS. The results in Table 6 are exact communication volumes from our new algorithm for all graphs except the two large RMAT graphs, for which we compute the communication volume from the validated analytic model. For the comparison with prior approaches (Dolev et al., 2012; Pearce and Sanders, 2018; Sanders and Uhl, 2023), we estimate the communication volume from the number of wedges, which is exact for all graphs other than the two large RMAT graphs, where we estimate the number of wedges using graph theory.

For the real graphs, we find the actual value of $c$ , the percentage of graph edges that are cover-edges, for an arbitrary breadth-first search, and set the number $p$ of processors to a reasonable number given the size of the graph. For the synthetic graphs, we use large Graph500 RMAT graphs (Chakrabarti et al., 2004) with parameters $a=0.57$ , $b=0.19$ , $c=0.19$ , and $d=0.05$ , for scale 36 and 42 with $n=2^{\mbox{scale}}$ and $m=16n$ , similar to the IARPA AGILE benchmark graphs, and set $p$ according to estimates of potential system sizes with sufficient memory to hold these large instances.

For comparison, most prior parallel algorithms for triangle counting operate on the graph as follows. A parallel loop over the vertices $v\in V$ produces all 2-paths (wedges) where $(v,v_{1}),(v,v_{2})\in E$ and (w.l.o.g.) $v_{1}<v_{2}$ . The processor that produces this wedge will send an open wedge query message containing the vertex IDs of $v_{1}$ and $v_{2}$ to the processor that owns vertex $v_{1}$ . If the consumer processor that receives this query message finds an edge $(v_{1},v_{2})\in E$ , then a local triangle counter is incremented. After producers and consumers complete all work, a global reduction over the $p$ triangle counts computes the total number of triangles in $G$ .
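The wedge-checking scheme described above can be simulated sequentially in a few lines of Python. Several details are assumptions made for this sketch and are not stated in the text: vertices are assigned to processors by `v % p`, a query counts as a message only when producer and consumer differ, and the final count divides the closed wedges by three because every triangle is discovered once from each of its three vertices.

```python
from itertools import combinations

def wedge_query_count(adj, p):
    """Simulate wedge checking on p processors; vertex v is owned by v % p."""
    closed = [0] * p   # per-processor counters of closed wedges
    wedges = 0         # total number of 2-paths produced
    messages = 0       # queries that cross a processor boundary
    for v in adj:                                          # producer loop
        for v1, v2 in combinations(sorted(adj[v]), 2):     # v1 < v2
            wedges += 1
            owner = v1 % p                                 # consumer owning v1
            if owner != v % p:
                messages += 1
            if v2 in adj[v1]:                              # (v1, v2) closes it
                closed[owner] += 1
    # Global reduction; each triangle closes three wedges.
    return sum(closed) // 3, wedges, messages

# K4: four triangles and twelve wedges.
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(wedge_query_count(k4, p=2))
```

The point of the simulation is that the number of queries grows with the number of wedges rather than with the number of edges, which is the quantity the ‘Previous’ column of Table 6 is estimated from.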

6.5.1. Graph500 RMAT Graphs

For the large Graph500 RMAT graphs, the number of triangles is estimated from our model based on the number of triangles found in RMAT graphs up to scale 29 in the literature (Hoang et al., 2019; Giechaskiel et al., 2015; Chakrabarti et al., 2004; Burkhardt, 2017). The fitting equation is $\mbox{\#Triangles}=77.422n^{1.125}$ with $R^{2}=1.0$ , where $n$ is the total number of vertices. The estimated numbers of triangles for the scale 36 and 42 RMAT graphs are $1.20\times 10^{14}$ and $1.30\times 10^{16}$ , respectively.

We estimate the number of wedges for the scale 36 and 42 Graph500 RMAT graphs based on the theorem given by Seshadhri et al. (Seshadhri et al., 2011). Their formula gives the expected number of vertices $N(d)$ with a given out-degree $d$ . The number of wedges formed by vertices with that degree is $\binom{d}{2}\times N(d)$ , where $\binom{d}{2}$ is the number of ways of choosing two from $d$ .
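The fitted model can be evaluated directly. The following Python sketch simply plugs $n=2^{\mbox{scale}}$ into the equation above; the function name is chosen for this example.

```python
def estimated_triangles(scale):
    """Evaluate the fitted model #Triangles = 77.422 * n**1.125, n = 2**scale."""
    n = 2 ** scale
    return 77.422 * n ** 1.125

for scale in (36, 42):
    print(scale, f"{estimated_triangles(scale):.2e}")
# 36 1.20e+14
# 42 1.30e+16
```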

We approximate the total number of wedges in the graph by summing these wedges from the minimum degree ( $e\ln n$ ) to the maximum degree ( $\sqrt{n}$ ), the range assumed by the formula, where $n$ is the total number of vertices. This is a conservative estimate because it considers only the out-degree instead of the sum of out- and in-degrees. Employing the formula, we calculate the number of wedges to be $2.73\times 10^{16}$ for scale 36 and $5.8\times 10^{18}$ for scale 42. With $2\log n$ bits/wedge, the total volume of wedge checks is 218PB and 52.8EB for RMAT graphs of scales 36 and 42, respectively. (Throughout this paper, a petabyte (PB) is $2^{50}$ bytes and an exabyte (EB) is $2^{60}$ bytes.)
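The conversion from wedge counts to communication volume can be checked with a few lines of Python. The sketch assumes that each wedge query carries two vertex IDs of $\lceil\log_{2}n\rceil$ bits (identical to $2\log n$ when $n$ is a power of two) and uses the binary units defined in the text; the function and constant names are chosen for this example.

```python
from math import ceil, log2

def wedge_check_bytes(num_wedges, n):
    """Bytes sent when every wedge query carries two ceil(log2 n)-bit vertex IDs."""
    return num_wedges * 2 * ceil(log2(n)) / 8

KB, PB, EB = 2 ** 10, 2 ** 50, 2 ** 60  # binary units

print(round(wedge_check_bytes(2.73e16, 2 ** 36) / PB))     # scale 36: prints 218
print(round(wedge_check_bytes(5.8e18, 2 ** 42) / EB, 1))   # scale 42: prints 52.8
print(round(wedge_check_bytes(165798, 5242) / KB))         # ca-GrQc: prints 526
```

The last line applies the same rule to the exact wedge count of ca-GrQc and matches the 526KB listed in the ‘Previous’ column of Table 6.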

Beamer et al. (Beamer et al., 2011) find a typical BFS on a scale 27 Graph500 RMAT graph has 7 levels, so 4 bits is a reasonable estimate for $\log D$ in our analyses of scale 36 and 42 graphs.

The covering ratio $c$ for RMAT graphs is estimated as follows. RMAT graphs from scale 6 to 23 are generated, and the exact value of $c$ is determined for each by counting the horizontal-edges after a breadth-first search. The data fit an exponential model $c=1.1773e^{-0.036\cdot\mbox{scale}}$ with a very high $R^{2}=0.9956$ (see Fig. 11). For scale 36, $c$ is estimated to be 0.311 and for scale 42, $c$ is estimated to be 0.260.
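Measuring the exact value of $c$ on a generated graph amounts to one BFS followed by a pass over the edges. A minimal Python sketch is given below, assuming an undirected graph stored as a dict of neighbor sets with at least one edge; the BFS root is simply the first vertex in each component, so a different root may give a different ratio.

```python
from collections import deque

def covering_ratio(adj):
    """Fraction of edges whose endpoints share a BFS level (horizontal edges)."""
    level = {}
    for root in adj:                      # restart the BFS in every component
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in level:
                    level[w] = level[u] + 1
                    queue.append(w)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    horizontal = sum(1 for u, v in edges if level[u] == level[v])
    return horizontal / len(edges)

# A triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(covering_ratio(toy))
```

With the BFS rooted at vertex 0, only the edge (1, 2) is horizontal, so one of the four edges is a cover-edge.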

In our new distributed-memory approach CETC-DM, the communication cost is $m\cdot(\lceil\log D\rceil+(kp+3)\lceil\log n\rceil)+(p-1)\lceil\log n\rceil$ bits. For scale 36, with $\lceil\log D\rceil=4$ and assuming $p=128$ processors, we have a total communication volume of 192TB, for a communication reduction of $1156\times$ . For scale 42, assuming $p=256$ processors, we estimate the communication of CETC-DM as 22.8PB, for a communication reduction of $2368\times$ .
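The cost expression can be written as a small Python helper. The factor $k$ comes from the cost analysis in Section 4.5.1 and its value is not restated in this section, so the sketch leaves it as an input; the function name, the parameter names, and the toy numbers in the usage line are all hypothetical and are not values from Table 6.

```python
from math import ceil, log2

def cetc_dm_bits(n, m, p, k, diameter_bits):
    """Evaluate m*(ceil(log D) + (k*p + 3)*ceil(log n)) + (p - 1)*ceil(log n).

    diameter_bits stands for ceil(log D); k is taken as given by the caller.
    """
    log_n = ceil(log2(n))
    return m * (diameter_bits + (k * p + 3) * log_n) + (p - 1) * log_n

# Toy values only: 16 vertices, 10 edges, 2 processors, k = 1, ceil(log D) = 2.
print(cetc_dm_bits(n=16, m=10, p=2, k=1, diameter_bits=2))  # prints 224
```

The dominant term is proportional to the number of edges $m$, in contrast to the wedge-based volume of the prior approaches.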

7. Conclusion

In this paper, we design and implement CETC, a novel, fast triangle counting algorithm that uses a new technique, the cover-edge set, to improve performance. It is the first algorithm in decades to shine a new light on triangle counting, using the wholly new cover-edge method to reduce the work of set intersections rather than being another variant of the well-known vertex-iterator and edge-iterator methods. We provide extensive performance results for sequential triangle counting algorithms on sparse graphs in a uniform manner. Furthermore, we employ OpenMP to parallelize most of the sequential algorithms we implemented and investigate their performance. The results use Intel’s latest processor family, Sapphire Rapids (Xeon Platinum 8480+), launched in the first quarter of 2023. The new triangle counting algorithm can benefit when the results of a BFS are already available, which is often the case in network science. Additionally, we expect this work to inspire interest within the Graph Challenge community in implementing versions of the presented algorithms for large shared-memory, distributed-memory, GPU, or multi-GPU frameworks.

8. Reproducibility

The triangle counting source code is open source and available on GitHub at https://github.com/Bader-Research/triangle-counting. The input graphs are from the Stanford Network Analysis Project (SNAP) available from http://snap.stanford.edu/.

Acknowledgements.

This research was funded in part by NSF grant number CCF-2109988.

References

Alman and Williams (2021) Josh Alman and Virginia Vassilevska Williams. 2021. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 522–539.
Alon et al. (1997) Noga Alon, Raphael Yuster, and Uri Zwick. 1997. Finding and counting given length cycles. Algorithmica 17, 3 (1997), 209–223.
Arifuzzaman et al. (2019) Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. 2019. Fast Parallel Algorithms for Counting and Listing Triangles in Big Graphs. ACM Trans. Knowl. Discov. Data 14, 1, Article 5 (Dec 2019), 34 pages. https://doi.org/10.1145/3365676
Bader (2023) David A. Bader. 2023. Fast Triangle Counting. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC58863.2023.10363539
Bader et al. (2023) David A. Bader, Fuhuan Li, Anya Ganeshan, Ahmet Gundogdu, Jason Lew, Oliver Alvarado Rodriguez, and Zhihui Du. 2023. Triangle Counting Through Cover-Edges. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC58863.2023.10363465
Beamer et al. (2011) Scott Beamer, Krste Asanovic, and David Patterson. 2011. Searching for a parent instead of fighting over children: A fast breadth-first search implementation for Graph500. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2011-117 (2011).
Burkhardt (2017) Paul Burkhardt. 2017. Graphing trillions of triangles. Information Visualization 16, 3 (2017), 157–166.
Burkhardt (2021) Paul Burkhardt. 2021. Triangle Centrality. CoRR abs/2105.00110 (2021). arXiv:2105.00110 https://arxiv.org/abs/2105.00110
Chakrabarti et al. (2004) Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 442–446.
Chiba and Nishizeki (1985) Norishige Chiba and Takao Nishizeki. 1985. Arboricity and Subgraph Listing Algorithms. SIAM J. Comput. 14, 1 (1985), 210–223. https://doi.org/10.1137/0214017
Cohen (2008) Jonathan Cohen. 2008. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report 16, 3.1 (2008).
Cohen (2009) Jonathan Cohen. 2009. Graph twiddling in a MapReduce world. Computing in Science & Engineering 11, 4 (2009), 29–41.
Cormen et al. (2022) Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2022. Introduction to algorithms. MIT press.
Davis (2018) Timothy A. Davis. 2018. Graph algorithms via SuiteSparse: GraphBLAS: triangle counting and K-truss. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC.2018.8547538
Dolev et al. (2012) Danny Dolev, Christoph Lenzen, and Shir Peled. 2012. ’Tri, tri again’: finding triangles and small subgraphs in a distributed setting. In International Symposium on Distributed Computing. Springer, 195–209.
Ghosh and Halappanavar (2020) Sayan Ghosh and Mahantesh Halappanavar. 2020. TriC: Distributed-memory Triangle Counting by Exploiting the Graph Structure. In IEEE High Performance Extreme Computing Conference (HPEC). 1–6.
Giechaskiel et al. (2015) Ilias Giechaskiel, George Panagopoulos, and Eiko Yoneki. 2015. PDTL: Parallel and distributed triangle listing for massive graphs. In 2015 44th International Conference on Parallel Processing. IEEE, 370–379.
Graph 500 Steering Committee (2010) Graph 500 Steering Committee. 2010. The Graph500 benchmark. https://www.graph500.org
Harrod (2020) William Harrod. 2020. Advanced Graphical Intelligence Logical Computing Environment (AGILE). https://www.iarpa.gov/images/PropsersDayPDFs/AGILE/AGILE_Proposers_Day_sm.pdf. Accessed: 2022-05-24.
Hoang et al. (2019) Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2019. DistTC: High performance distributed triangle counting. In IEEE High Performance Extreme Computing Conference (HPEC). 1–7.
Hu et al. (2021) Lin Hu, Lei Zou, and Yu Liu. 2021. Accelerating triangle counting on GPU. In Proceedings of the 2021 International Conference on Management of Data. 736–748.
Hu et al. (2018) Yang Hu, Hang Liu, and H Howie Huang. 2018. TriCore: Parallel triangle counting on GPUs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 171–182.
Itai and Rodeh (1978) Alon Itai and Michael Rodeh. 1978. Finding a minimum circuit in a graph. SIAM J. Comput. 7, 4 (1978), 413–423.
Latapy (2008) Matthieu Latapy. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoretical Computer Science 407, 1 (2008), 458–473. https://doi.org/10.1016/j.tcs.2008.07.017
Li and Bader (2021) Fuhuan Li and David A Bader. 2021. A graphblas implementation of triangle centrality. In 2021 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–2.
Low et al. (2017) Tze Meng Low, Varun Nagaraj Rao, Matthew Lee, Doru Popovici, Franz Franchetti, and Scott McMillan. 2017. First look: Linear algebra-based triangle counting without matrix multiplication. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC.2017.8091046
Mailthody et al. (2018) Vikram S. Mailthody, Ketan Date, Zaid Qureshi, Carl Pearson, Rakesh Nagi, Jinjun Xiong, and Wen-mei Hwu. 2018. Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC.2018.8547517
Makkar et al. (2017) Devavret Makkar, David A. Bader, and Oded Green. 2017. Exact and Parallel Triangle Counting in Dynamic Graphs. In 24th IEEE International Conference on High Performance Computing, HiPC 2017, Jaipur, India, December 18-21, 2017. IEEE Computer Society, Los Alamitos, CA, 2–12. https://doi.org/10.1109/HiPC.2017.00011
Mowlaei (2017) Shahir Mowlaei. 2017. Triangle counting via vectorized set intersection. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). 1–5. https://doi.org/10.1109/HPEC.2017.8091053
Ortmann and Brandes (2014) Mark Ortmann and Ulrik Brandes. 2014. Triangle Listing Algorithms: Back from the Diversion. In Proceedings of the Meeting on Algorithm Engineering & Experiments (Portland, Oregon). Society for Industrial and Applied Mathematics, USA, 1–8.
Parimalarangan et al. (2017) Sindhuja Parimalarangan, George M Slota, and Kamesh Madduri. 2017. Fast parallel graph triad census and triangle counting on shared-memory platforms. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1500–1509.
Pearce (2017) Roger Pearce. 2017. Triangle counting for scale-free graphs at scale in distributed memory. In IEEE High Performance Extreme Computing Conference (HPEC). 1–4.
Pearce and Sanders (2018) Roger Pearce and Geoffrey Sanders. 2018. K-truss decomposition for scale-free graphs at scale in distributed memory. In 2018 IEEE high performance extreme computing conference (HPEC). IEEE, 1–6.
Samsi et al. (2018) Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. 2018. GraphChallenge.org: Raising the Bar on Graph Analytic Performance. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC.2018.8547527
Sanders and Uhl (2023) Peter Sanders and Tim Niklas Uhl. 2023. Engineering a Distributed-Memory Triangle Counting Algorithm. arXiv preprint arXiv:2302.11443 (2023).
Schank (2007) Thomas Schank. 2007. Algorithmic Aspects of Triangle-Based Network Analysis. Ph. D. Dissertation. Karlsruhe Institute of Technology.
Schank and Wagner (2005) Thomas Schank and Dorothea Wagner. 2005. Finding, Counting and Listing All Triangles in Large Graphs, an Experimental Study. In Proceedings of the 4th International Conference on Experimental and Efficient Algorithms (Santorini Island, Greece) (WEA’05). Springer-Verlag, Berlin, Heidelberg, 606–609. https://doi.org/10.1007/11427186_54
Seshadhri et al. (2011) C Seshadhri, Ali Pinar, and Tamara G Kolda. 2011. A hitchhiker’s guide to choosing parameters of stochastic Kronecker graphs. CoRR, abs/1102.5046 1 (2011).
Shun and Tangwongsan (2015) Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations without tuning. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 149–160.
Ullmann (1976) Julian R Ullmann. 1976. An algorithm for subgraph isomorphism. Journal of the ACM (JACM) 23, 1 (1976), 31–42.
Watts and Strogatz (1998) Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 6684 (1998), 440–442.
Williams et al. (2023) Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. 2023. New bounds for matrix multiplication: from alpha to omega. arXiv preprint arXiv:2307.07970 (2023).
Zeng et al. (2022) Li Zeng, Kang Yang, Haoran Cai, Jinhua Zhou, Rongqing Zhao, and Xin Chen. 2022. HTC: Hybrid vertex-parallel and edge-parallel Triangle Counting. In IEEE High Performance Extreme Computing Conference (HPEC). 1–7.
