License: CC BY 4.0
arXiv:2403.02997v1 [cs.DS] 05 Mar 2024

Cover Edge-Based Novel Triangle Counting

David A. Bader, Fuhuan Li, Zhihui Du, Palina Pauliuchenka, and Oliver Alvarado Rodriguez (New Jersey Institute of Technology, Newark, New Jersey, USA 07102); Anant Gupta (John P. Stevens High School, Edison, USA); Sai Sri Vastav Minnal (Edison Academy Magnet School, Edison, USA); Valmik Nahata (New Providence High School, New Providence, New Jersey, USA); Anya Ganeshan (Bergen County Academies, Hackensack, New Jersey, USA); Ahmet Gundogdu (Paramus High School, Paramus, New Jersey, USA); and Jason Lew (New Jersey Institute of Technology, Newark, New Jersey, USA 07102)
(2024)
Abstract.

Listing and counting triangles in graphs is a key algorithmic kernel for network analyses, including community detection, clustering coefficients, k-trusses, and triangle centrality. In this paper, we propose the novel concept of a cover-edge set that can be used to find triangles more efficiently. Leveraging the breadth-first search (BFS) method, we can quickly generate a compact cover-edge set. Novel sequential and parallel triangle counting algorithms that employ cover-edge sets are presented. The novel sequential algorithm performs competitively with the fastest previous approaches on both real and synthetic graphs, such as those from the Graph500 Benchmark and the MIT/Amazon/IEEE Graph Challenge. We implement 22 sequential algorithms for performance evaluation and comparison. At the same time, we employ OpenMP to parallelize 11 sequential algorithms, presenting an in-depth analysis of their parallel performance. Furthermore, we develop a distributed parallel algorithm that can asymptotically reduce communication on massive graphs. In our estimate from massive-scale Graph500 graphs, our distributed parallel algorithm can reduce the communication on a scale 36 graph by 1156x and on a scale 42 graph by 2368x. Comprehensive experiments are conducted on the recently launched Intel Xeon 8480+ processor and shed light on how graph attributes, such as topology, diameter, and degree distribution, can affect the performance of these algorithms.

Graph Algorithms, High-Performance Data Analytics, Parallel Algorithms

1. Introduction

Triangle listing and counting is a highly-studied problem in computer science and is a key building block in various graph analysis techniques such as clustering coefficients (Watts and Strogatz, 1998), k-truss (Cohen, 2008), and triangle centrality (Burkhardt, 2021), (Li and Bader, 2021). The significance of triangle counting is evident in its application in high-performance computing benchmarks like Graph500 (Graph 500 Steering Committee, 2010) and the MIT/Amazon/IEEE Graph Challenge (Samsi et al., 2018), as well as in the design of future architecture systems (e.g., IARPA AGILE (Harrod, 2020)).

There are at most $\binom{n}{3} = \Theta(n^3)$ triangles in a graph $G = (V, E)$ with $n = |V|$ vertices and $m = |E|$ edges. The naïve approach of using triply-nested loops to check whether each triple $(u, v, w)$ forms a triangle takes $\mathcal{O}(n^3)$ time and is inefficient for sparse graphs. It is well known that listing all triangles in $G$ takes $\Omega(m^{3/2})$ time (Itai and Rodeh, 1978; Latapy, 2008). To enhance the performance of triangle counting, Cohen (Cohen, 2009) introduced a novel map-reduce parallelization technique that generates open wedges between triples of vertices in the graph. It determines whether a closing edge exists to complete a triangle, thus avoiding the redundant counting of the same triangle while maintaining load balancing. Many parallel approaches for triangle counting (Pearce, 2017; Ghosh and Halappanavar, 2020) partition the sparse graph data structure across multiple compute nodes and adopt the strategy of generating open wedges, which are sent to other compute nodes to determine the presence of a closing edge. Consequently, the communication time for these open wedges often dominates the running time of parallel triangle counting.

In this paper, we propose a novel approach that efficiently identifies all triangles using a reduced set of edges known as a cover-edge set. By leveraging the cover-edge-based triangle counting method, unnecessary edge checks can be skipped while ensuring that no triangles are missed. This significantly reduces the number of computational operations compared to existing methods.

The main contributions of this paper are:

  • A novel triangle counting algorithm, Cover-Edge Triangle Counting (CETC), is proposed based on the new concept of a Cover-Edge Set. The essential idea is that we can identify all triangles from a significantly reduced cover-edge set instead of the complete edge set. A simple breadth-first search (BFS) is used to orient the graph’s vertices into levels and to generate the cover-edge set.

  • Various sequential variants of CETC that combine the techniques of cover-edge, forward algorithm, and hashing are developed. Furthermore, parallel implementations of CETC for both shared-memory (CETC-SM) and distributed-memory (CETC-DM) systems are introduced.

  • Freely-available, open-source software for more than 22 sequential triangle counting algorithms and 11 OpenMP parallel algorithms in the C programming language.

  • A comprehensive experimental study of implementations of the proposed novel triangle counting algorithms on real and synthetic graphs, with a comparison against other existing algorithms.

2. Notations and Definitions

Let $G = (V, E)$ be an undirected graph with $n = |V|$ vertices and $m = |E|$ edges. A triangle in the graph is a set of three vertices $\{v_a, v_b, v_c\} \subseteq V$ such that $\{(v_a, v_b), (v_a, v_c), (v_b, v_c)\} \subseteq E$. We will use $N(v) = \{u \mid u \in V \wedge (u, v) \in E\}$ to denote the neighbor set of vertex $v \in V$. The degree of vertex $v \in V$ is $d(v) = |N(v)|$, and $d_{\text{max}}$ is the maximal degree of a vertex in graph $G$.

With these notations, the total number of triangles in graph $G$ is denoted as $|\Delta(G)|$. Specifically, $\Delta(G) = \{(u, v, w) \mid u, v, w \text{ are different vertices of } V \text{ and } (u, v), (v, w), (w, u) \text{ are edges of } E\}$.

The triangle counting problem can be expressed in two ways based on edges and vertices:

  • For any edge $(u, v) \in E$, the number of triangles including $(u, v)$ is $|\Delta(u, v)|$, where $\Delta(u, v) = N(u) \cap N(v)$. Since each of the three edges of a triangle counts that same triangle, and we count both $\Delta(u, v)$ and $\Delta(v, u)$, the total number of triangles is computed as $|\Delta(G)| = \frac{\sum_{(u,v) \in E} |\Delta(u, v)|}{6}$, using the edge-iteration-based method.

  • For any vertex $v \in V$, the number of triangles including $v$ is $|\Delta(v)|$, where $\Delta(v) = \{(u, w) \mid u, w \in N(v) \land (u, w) \in E\}$. The total number of triangles is computed as $|\Delta(G)| = \frac{\sum_{v \in V} |\Delta(v)|}{6}$, using the vertex-iteration-based method.
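
As a small illustration of the two formulations above, the following Python sketch evaluates both sums on a toy graph. The helper names (`neighbors`, `count_by_edges`, `count_by_vertices`) and the example edge list are assumptions made for this illustration and are not taken from the paper's software.

```python
def neighbors(edges):
    """Build N(v) as a dict of sets from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_by_edges(adj):
    # Sum |N(u) & N(v)| over both orientations (u, v) and (v, u) of every edge.
    total = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u])
    return total // 6


def count_by_vertices(adj):
    # For each v, count ordered pairs (u, w) of neighbors of v that are adjacent.
    total = sum(1 for v in adj for u in adj[v] for w in adj[v] if w in adj[u])
    return total // 6


# Toy graph (an assumption for the example) with triangles {0,1,2} and {1,2,3}.
example_edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
example_adj = neighbors(example_edges)
```

Both functions visit every triangle six times, which is why each sum is divided by 6.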

3. Related Work

3.1. Existing Sequential Algorithms

For triangle counting, the obvious algorithm is brute-force search (see Alg. 1), enumerating over all $\Theta(n^3)$ triples of distinct vertices, and checking how many of these triples are triangles. There are faster algorithms that require an adjacency matrix for the input graph representation and use fast matrix multiplication, such as the work of Alon, Yuster, and Zwick (Alon et al., 1997). Indeed, if $A$ is the adjacency matrix of $G$, for any vertex $v$, the value $A^3_{vv}$ on the diagonal of $A^3$ is twice the number of triangles to which $v$ belongs. So the number of triangles is $\frac{1}{6}\operatorname{tr}(A^3)$. Triangle counting problems can therefore be solved in $\mathcal{O}(n^{\omega})$ time, where $\omega < 2.372$ is the fast matrix product exponent (Alman and Williams, 2021; Williams et al., 2023). Alon et al. (Alon et al., 1997) also show that it is possible to solve the triangle counting problem in $\mathcal{O}\left(m^{\frac{2\omega}{\omega+1}}\right) \subset \mathcal{O}(m^{1.41})$ time. 
However, the implementation is infeasible for large, sparse graphs, and certain matrix multiplication methods fall short of listing all the triangles. For these reasons, despite their evident theoretical strength, these algorithms have limited practical impact.
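
To make the trace identity concrete, here is a minimal Python sketch that builds a dense adjacency matrix and reads the triangle counts off the diagonal of $A^3$. It uses a plain cubic-time matrix product, not a fast matrix multiplication algorithm, and the function name `triangles_via_trace` and the assumption that vertices are labelled $0, \dots, n-1$ are choices made only for this example.

```python
def triangles_via_trace(n, edges):
    """Return (total triangles, triangles through each vertex) via tr(A^3)."""
    # Dense 0/1 adjacency matrix; assumes vertices are labelled 0..n-1.
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1

    def matmul(X, Y):
        # Schoolbook O(n^3) product, used here only for clarity.
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    A3 = matmul(matmul(A, A), A)
    # The diagonal entry for v is twice the number of triangles containing v.
    per_vertex = [A3[v][v] // 2 for v in range(n)]
    total = sum(A3[v][v] for v in range(n)) // 6
    return total, per_vertex
```

The dense matrix needs $\Theta(n^2)$ memory regardless of $m$, which illustrates why this route is unattractive for large, sparse graphs.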

Algorithm 1 Triples
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall u \in V$
5:     $\forall v \in V$
6:         $\forall w \in V$
7:             if $(u, v) \in E \land (v, w) \in E \land (u, w) \in E$
8:                 $T \leftarrow T + 1$
9: return $T/6$

Another fundamental problem formulation is the subgraph query, which aims to identify instances of a triangle subgraph within the input graph. It is crucial to emphasize that determining the presence of a specific subgraph in a graph is an NP-hard problem. While various methods, including the backtracking strategy (Ullmann, 1976), have been introduced, they are not preferred choices for the triangle counting problem, particularly for large-scale graphs.

Latapy (Latapy, 2008) provides a survey on triangle counting algorithms for very large, sparse graphs. One of the earliest algorithms, tree-listing, published in 1978 by Itai and Rodeh (Itai and Rodeh, 1978), first finds a rooted spanning tree of the graph. After iterating through the non-tree edges and using criteria to identify triangles, the tree edges are removed and the algorithm repeats until no edges remain (see Alg. 2). This approach takes $\mathcal{O}(m^{3/2})$ time (or $\mathcal{O}(n)$ for planar graphs).

Algorithm 2 Tree-listing (IR) (Itai and Rodeh, 1978)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: while $E$ is not empty
5:     $K \leftarrow$ Covering tree($G$)
6:     $\forall (u, v) \in E \land (u, v) \notin K$
7:         if $(\text{parent}(u), v) \in E$
8:             $T \leftarrow T + 1$
9:         elif $(\text{parent}(v), u) \in E$
10:             $T \leftarrow T + 1$
11:     $E \leftarrow E - K$
12: return $T/2$

The most common triangle counting algorithms in the literature include vertex-iterator (Itai and Rodeh, 1978), (Latapy, 2008) and edge-iterator (Itai and Rodeh, 1978), (Latapy, 2008) approaches that run in $\mathcal{O}(m \cdot d_{\text{max}})$ time.

Algorithm 3 Vertex-Iterator (Itai and Rodeh, 1978), (Latapy, 2008)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall u \in V$
5:     $\forall v \in N(u)$
6:         $X \leftarrow Intersection(N(u), N(v))$
7:         $T \leftarrow T + |X|$
8: return $T/6$

In vertex-iterator (see Alg. 3), for each vertex $u \in V$, the algorithm examines the adjacency list $N(v)$ of each vertex $v \in N(u)$. If there is a vertex $w$ in the intersection of $N(u)$ and $N(v)$, then the triplet $(u, v, w)$ forms a triangle. Arifuzzaman et al. (Arifuzzaman et al., 2019) study modifications of the vertex-iterator algorithm based on various methods for vertex ordering.

Algorithm 4 Edge-Iterator (Itai and Rodeh, 1978), (Latapy, 2008)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall (u, v) \in E$
5:     $X \leftarrow Intersection(N(u), N(v))$
6:     $T \leftarrow T + |X|$
7: return $T/6$

In edge-iterator (see Alg. 4), each edge $(u, v)$ in the graph is examined, and the intersection of $N(u)$ and $N(v)$ is computed to find triangles. A common optimization is to use a direction-oriented approach that only considers edges $(u, v)$ where $u < v$. The variants of edge-iterator are often based on the algorithm used to perform $Intersection(N(u), N(v))$. When the two adjacency lists are sorted, MergePath and BinarySearch can be used. MergePath performs a linear scan through both lists, counting the common elements. Makkar, Bader, and Green (Makkar et al., 2017) give an efficient MergePath algorithm for GPUs. Mailthody et al. (Mailthody et al., 2018) use an optimized two-pointer intersection (MergePath) for set intersection. BinarySearch, as the name implies, uses a binary search to determine if each element of the smaller list is found in the larger list. Hash is another method for performing the intersection of two sets, and it does not require the adjacency lists to be sorted. A typical implementation of Hash initializes a Boolean array of size $m$ to all false. Then, positions in Hash corresponding to the vertex values in $N(u)$ are set to true. Then $N(v)$ is scanned, looking up in $\Theta(1)$ time whether or not there is a match for each vertex. Chiba and Nishizeki published one of the earliest edge-iterator algorithms with hashing for triangle finding in 1985 (Chiba and Nishizeki, 1985). 
The running time is $\mathcal{O}(a(G) \cdot m)$, where $a(G)$ is defined as the arboricity of $G$, which is upper-bounded by $a(G) \leq \lceil (2m + n)^{1/2} / 2 \rceil$ (Chiba and Nishizeki, 1985). In 2018, Davis rediscovered this method, which he calls tri_simple in his comparison with SuiteSparse GraphBLAS (Davis, 2018). Mowlaei (Mowlaei, 2017) gave a variant of the edge-iterator algorithm that uses vectorized sorted set intersection and reorders the vertices using the reverse Cuthill-McKee heuristic.
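
The three intersection strategies described above can be sketched as small Python functions. This is an illustrative sketch only: the function names are hypothetical, the sorted inputs are assumed to contain no duplicates, and `marks` is assumed to be a Boolean list long enough to be indexed by every vertex id.

```python
from bisect import bisect_left


def merge_path(a, b):
    """Count common elements of two sorted lists with a linear two-pointer scan."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def binary_search(a, b):
    """Look up each element of the shorter sorted list in the longer one."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    count = 0
    for x in small:
        k = bisect_left(large, x)
        if k < len(large) and large[k] == x:
            count += 1
    return count


def hash_intersect(a, b, marks):
    """Mark the vertices of a, probe with b, then reset the marks for reuse."""
    for x in a:
        marks[x] = True
    count = sum(1 for y in b if marks[y])
    for x in a:
        marks[x] = False
    return count
```

Resetting only the positions that were set keeps the cost of `hash_intersect` proportional to the two list lengths rather than to the size of the Boolean array.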

In 2005, Schank and Wagner (Schank and Wagner, 2005; Schank, 2007) designed a fast triangle counting algorithm called forward (see Alg. 5) that is a refinement of the edge-iterator approach. Instead of intersections of the full adjacency lists, the forward algorithm uses a dynamic data structure $A(v)$ to store a subset of the neighborhood $N(v)$ for $v \in V$. Initially, each set $A()$ is empty, and after computing the intersection of the sets $A(u)$ and $A(v)$ for each edge $(u, v)$ (with $u < v$), $u$ is added to $A(v)$. This significantly reduces the size of the intersections needed to find triangles. The running time is $\mathcal{O}(m \cdot d_{\text{max}})$. However, if one reorders the vertices in decreasing order of their degrees as a $\Theta(n \log n)$ time pre-processing step, the forward algorithm’s running time reduces to $\mathcal{O}(m^{3/2})$. Ortmann and Brandes (Ortmann and Brandes, 2014) survey triangle counting algorithms, create a unifying framework for parsimonious implementations, and conclude that nearly every triangle listing variant is in $\mathcal{O}(m \cdot a(G))$.

Algorithm 5 Forward Triangle Counting (F) (Schank and Wagner, 2005; Schank, 2007)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall v \in V$
5:     $A(v) \leftarrow \emptyset$
6: $\forall (u, v) \in E$
7:     if $(u < v)$ then
8:         $\forall w \in A(u) \cap A(v)$
9:             $T \leftarrow T + 1$
10:         $A(v) \leftarrow A(v) \cup \{u\}$
11: return $T$
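
A minimal Python sketch of the forward algorithm follows, including the degree-ordering pre-processing step mentioned in the text. It assumes that edges are processed grouped by their lower-ranked endpoint, in increasing rank order, so that $A(u)$ is complete when vertex $u$ is visited; the function name `forward_count` and the edge-list input are assumptions for this illustration.

```python
def forward_count(edges):
    """Count triangles with a sketch of the forward algorithm."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Pre-processing: rank vertices by decreasing degree (rank 0 = largest degree).
    order = sorted(adj, key=lambda x: -len(adj[x]))
    rank = {x: i for i, x in enumerate(order)}

    A = {x: set() for x in adj}  # A(v) starts empty for every vertex
    T = 0
    for u in order:  # visit vertices in increasing rank
        for v in adj[u]:
            if rank[u] < rank[v]:
                # Common neighbors that precede both u and v in the ordering.
                T += len(A[u] & A[v])
                A[v].add(u)
    return T
```

Each triangle is found exactly once, at the edge joining its two highest-ranked vertices, so no final division is needed.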

The forward-hashed algorithm (Schank and Wagner, 2005; Schank, 2007) (also called compact-forward (Latapy, 2008)) is a variant of the forward algorithm that uses the hashing described previously for the intersections of the $A()$ sets; see Alg. 6. Low et al. (Low et al., 2017) derive a linear algebra method for triangle counting that does not use matrix multiplication. Their algorithm results in the forward-hashed algorithm.

Algorithm 6 Forward-Hashed Triangle Counting (FH) (Schank and Wagner, 2005; Schank, 2007)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall v \in V$
5:     $A(v) \leftarrow \emptyset$
6: $\forall (u, v) \in E$
7:     if $(u < v)$ then
8:         $\forall w \in A(u)$
9:             Hash[$w$] $\leftarrow$ true
10:         $\forall w \in A(v)$
11:             if Hash[$w$] then
12:                 $T \leftarrow T + 1$
13:         $\forall w \in A(u)$
14:             Hash[$w$] $\leftarrow$ false
15:         $A(v) \leftarrow A(v) \cup \{u\}$
16: return $T$
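
The hashed variant can be sketched in Python with a Boolean list standing in for the Hash array. One deliberate deviation from the listing above, made for this example only: the marks for $A(u)$ are set once per vertex $u$ rather than once per edge, which is valid because $A(u)$ does not change while the edges $(u, v)$ with $u < v$ are processed. The function name and the assumption that vertices are labelled $0, \dots, n-1$ are hypothetical.

```python
def forward_hashed_count(n, edges):
    """Forward counting where A(u) is intersected with A(v) via a Boolean array."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    A = [[] for _ in range(n)]
    marks = [False] * n  # plays the role of the Hash array, indexed by vertex id
    T = 0
    for u in range(n):  # increasing u, so A[u] is final when u is visited
        for w in A[u]:
            marks[w] = True
        for v in adj[u]:
            if u < v:
                T += sum(1 for w in A[v] if marks[w])
                A[v].append(u)
        for w in A[u]:
            marks[w] = False  # reset only the positions that were set
    return T
```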

3.2. Existing Parallel Algorithms

Although most of the sequential algorithms tend to run fast on graphs that fit in main memory, the expanding size of graphs, driven by ongoing technology advancements, poses a challenge. Parallel algorithms are therefore needed to accelerate triangle counting further. Alg. 7, Alg. 8, and Alg. 9 are the parallel versions of the three most common intersection-based triangle counting methods.

Algorithm 7 Parallel Edge Iterator with Merge Path (EMP)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall (u, v) \in E$ do in parallel
5:     $A \leftarrow N(u)$, $B \leftarrow N(v)$
6:     $x \leftarrow 0$, $y \leftarrow 0$
7:     while $x < |A| \land y < |B|$
8:         if $A[x] == B[y]$
9:             $T \leftarrow T + 1$
10:             $x \leftarrow x + 1$, $y \leftarrow y + 1$
11:         else
12:             if $A[x] < B[y]$
13:                 $x \leftarrow x + 1$
14:             else
15:                 $y \leftarrow y + 1$
16: return $T/6$

Algorithm 8 Parallel Edge Iterator with Binary Search (EBP)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall (u, v) \in E$ do in parallel
5:     for $l \in N(u)$ do
6:         $K \leftarrow N(v)$
7:         $bottom \leftarrow 0$, $top \leftarrow |K|$
8:         while $bottom < top$
9:             $mid \leftarrow bottom + (top - bottom)/2$
10:             if $K[mid] == l$
11:                 $T \leftarrow T + 1$
12:                 break
13:             elif $K[mid] < l$
14:                 $bottom \leftarrow mid + 1$
15:             else
16:                 $top \leftarrow mid$
17: return $T/6$

Algorithm 9 Parallel Edge Iterator with Hash (EHP)
1: Input: Graph $G = (V, E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall (u, v) \in E$ do in parallel
5:     for $w \in N(u)$
6:         hash($w$) $\leftarrow$ true
7:     for $w \in N(v)$
8:         if hash($w$)
9:             $T \leftarrow T + 1$
10:     for $w \in N(u)$
11:         hash($w$) $\leftarrow$ false
12: return $T/6$
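
In a real parallel implementation, the shared counter $T$ in these listings must not be updated by several workers at once without synchronization. One common remedy, sketched below in Python, is to give each worker a private partial count and combine the partial counts at the end, which is analogous in spirit to a sum reduction in OpenMP. The sketch is an assumption-laden illustration rather than the paper's implementation: the names `count_chunk` and `parallel_edge_iterator` are hypothetical, and Python threads are used only to show the structure, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def count_chunk(adj, chunk):
    # Private partial count for one chunk of directed edges; no shared counter.
    return sum(len(adj[u] & adj[v]) for u, v in chunk)


def parallel_edge_iterator(adj, workers=4):
    """adj maps each vertex to the set of its neighbors (undirected graph)."""
    # Both orientations of every edge, matching the final division by 6.
    directed = [(u, v) for u in adj for v in adj[u]]
    # Round-robin split so that each worker receives a similar number of edges.
    chunks = [directed[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_chunk, [adj] * workers, chunks))
    return sum(partial_counts) // 6
```

A round-robin split balances the number of edges per worker but not the work per edge, which depends on the neighborhood sizes; skewed degree distributions may call for a different partitioning.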

Besides the intersection-based methods, there are several optimized parallel algorithms in the literature. Shun and Tangwongsan (Shun and Tangwongsan, 2015) give a multi-core parallel algorithm for shared memory machines. The algorithm has two steps: in the first step, each vertex is ranked based on degree and a ranked adjacency list of each vertex is generated, which contains only higher-ranked vertices than the current vertex; the second step counts triangles from the ranked adjacency list for each vertex using merge-path or hash. Parimalarangan et al. (Parimalarangan et al., 2017) present variations of triangle counting algorithms and how they relate to performance on shared-memory platforms. TriCore (Hu et al., 2018) partitions the graph held in a compressed sparse row (CSR) data structure for multiple GPUs, uses stream buffers to load edge lists from CPU memory to GPU memory on the fly, and then uses binary search to find the intersection. Hu et al. (Hu et al., 2021) employ a “copy-synchronize-search” pattern to improve the efficiency of parallel GPU threads and mix compute-intensive and memory-intensive workloads together to improve resource efficiency. Zeng et al. (Zeng et al., 2022) present a triangle counting algorithm that adaptively selects between the vertex-parallel and edge-parallel paradigms.

4. Cover-Edge Based Triangle Counting Algorithms

4.1. Cover-Edge Set

Definition 1 (Cover-Edge, Cover-Edge Set and Covering Ratio).

For any edge $e$ of a triangle $\Delta$ in graph $G$, $e$ is referred to as a cover-edge of $\Delta$. For a given graph $G$, an edge set $S \subseteq E$ is called a cover-edge set if it contains at least one cover-edge for every triangle in $G$. The ratio $c = |S| / |E|$ is called the covering ratio.

Based on the given definition, it is evident that the entire edge set $E$ can serve as a cover-edge set $S$ for graph $G$. However, our proposed method aims to efficiently count all triangles using a smaller subset of edges instead of $E$. Thus, the primary challenge lies in generating a compact cover-edge set, which forms the initial problem to be addressed in our approach. Our goal is to identify a cover-edge set with the smallest $c$. In this paper, we propose using breadth-first search (BFS) to generate a compact cover-edge set.

Definition 2 (BFS-Edge).

Let $r$ be the root vertex of an undirected graph $G$. The level $L(v)$ of a vertex $v$ is defined as the shortest distance from $r$ to $v$ obtained through a breadth-first search (BFS). From the BFS, we classify the edges into three types:

  • Tree-Edges: These edges belong to the BFS tree.

  • Strut-Edges: These are non-tree edges with endpoints on two adjacent levels in the BFS traversal.

  • Horizontal-Edges: These are non-tree edges with endpoints on the same level in the BFS traversal.

Figure 1. An example to mark different edges based on a BFS spanning tree. The tree-edges are black, strut-edges are blue, and horizontal-edges are red.

Fig. 1 gives an example of these different edge types.

Lemma 3.

Each triangle $\{u, v, w\}$ in a graph contains at least one horizontal-edge in an arbitrarily rooted BFS tree.

Proof.

(Proof by contradiction) A triangle is a path of length 3 that starts and ends at the same vertex. Suppose there are no horizontal-edges in the triangle. In that case, every edge in the path (i.e., a tree-edge or strut-edge) either increases or decreases the level by one.

Since the path must end on the same level as the starting vertex, the number of edges in the path that decrease the level must be equal to the number of edges that increase the level. Consequently, the length of the path must be even to maintain level parity. However, this contradicts the fact that a triangle has an odd path length of 3.

Therefore, we conclude that there must be at least one horizontal-edge in every triangle. ∎

Theorem 4 (Cover-Edge Set Generation).

All horizontal-edges in an arbitrarily rooted BFS tree form a valid cover-edge set.

Proof.

By Lemma 3, every triangle $\Delta$ in graph $G$ contains at least one horizontal-edge, and by Definition 1 this horizontal-edge is a cover-edge of $\Delta$. Thus, the set of all horizontal-edges constitutes a cover-edge set. ∎

Therefore, we can construct a cover-edge set, denoted as BFS-CES, by selecting all the horizontal-edges obtained during a breadth-first search (BFS). It is evident that BFS-CES is a subset of $E$ and is typically much smaller than the complete edge set $E$.
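
The construction of BFS-CES can be sketched in Python as follows. The sketch runs a BFS from every unvisited vertex, classifies each undirected edge as a tree-, strut-, or horizontal-edge, and reports the covering ratio $c$ of the horizontal-edge set. The function names, the dictionary-based adjacency representation, and the requirement that vertex labels be comparable are assumptions made for this illustration.

```python
from collections import deque


def bfs_levels(adj):
    """Return (level, parent) dicts, running BFS from each unvisited root."""
    level, parent = {}, {}
    for root in adj:
        if root in level:
            continue
        level[root], parent[root] = 0, None
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v], parent[v] = level[u] + 1, u
                    queue.append(v)
    return level, parent


def classify_edges(adj):
    """Split the undirected edges into tree-, strut-, and horizontal-edges."""
    level, parent = bfs_levels(adj)
    kinds = {"tree": [], "strut": [], "horizontal": []}
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                if parent[u] == v or parent[v] == u:
                    kinds["tree"].append((u, v))
                elif level[u] == level[v]:
                    kinds["horizontal"].append((u, v))
                else:
                    kinds["strut"].append((u, v))
    return kinds


def covering_ratio(adj):
    """Covering ratio c = |S| / |E| with S the set of horizontal-edges."""
    kinds = classify_edges(adj)
    m = sum(len(edge_list) for edge_list in kinds.values())
    return len(kinds["horizontal"]) / m if m else 0.0
```

On a graph with adjacency lists `{0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}`, a BFS rooted at vertex 0 leaves `(1, 2)` as the only horizontal-edge, and that single edge belongs to both triangles of the graph.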

4.2. Cover-Edge Triangle Counting: CETC

In this subsection, we provide a comprehensive description of the algorithm to identify all triangles using a cover-edge set generated through a breadth-first search.

Lemma 5.

Each triangle $\{u,v,w\}$ must contain either one or three horizontal-edges.

Proof.

By referring to the proof of Lemma 3, we know that the path corresponding to the triangle’s three edges consists of an even number of tree-edges and strut-edges. This implies that there can be either 0 or 2 tree- or strut-edges within each triangle.

In the case where there are 0 tree- or strut-edges, all three edges of the triangle must be horizontal-edges. This is because the absence of tree- or strut-edges implies that the entire path is composed of horizontal-edges.

In the case where there are 2 tree- or strut-edges, the triangle contains exactly one horizontal-edge. This is because having two tree- or strut-edges in the path means that there is one horizontal-edge connecting the remaining two vertices.

Therefore, we conclude that each triangle $\{u,v,w\}$ must contain either one or three horizontal-edges. ∎

Our sequential triangle counting approach (CETC-Seq), described in Alg. 10, efficiently counts triangles using a cover-edge set. In line 3, we initialize the counter $T$ to 0, which will store the total number of triangles. To generate the cover-edge set, we perform a breadth-first search (BFS) starting from any unvisited vertex, identifying the level $L(v)$ of each vertex $v$ in its respective component, as shown in lines 4 to 5. In lines 6 to 10, the algorithm iterates over each edge, selecting the cover-set of horizontal edges $(u,v)$ in a direction-oriented fashion in line 7. For each vertex $w$ in the intersection of the neighborhoods of $u$ and $v$ (line 8), we check the following two conditions to determine if $(u,v,w)$ is a unique triangle to be counted (line 9). If $L(u) \neq L(w)$, then the edge $(u,v)$ is the only horizontal-edge in the triangle $(u,v,w)$. If $L(u) \equiv L(w)$, then the edge $(u,v)$ is one of three horizontal-edges in the triangle $(u,v,w)$. To ensure uniqueness in this second case, the algorithm checks the added constraint that $v < w$. If the constraints are satisfied, we increment the triangle counter $T$ in line 10.

This approach effectively counts the triangles in the graph while avoiding redundant counting.

Algorithm 10 CETC: Cover-Edge Triangle Counting (CETC-Seq)
1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall v \in V$
5:     if $v$ unvisited, then BFS($G$, $v$)
6: $\forall (u,v) \in E$
7:     if $(L(u) \equiv L(v)) \land (u < v)$ then  $\triangleright$ $(u,v)$ is horizontal
8:        $\forall w \in N(u) \cap N(v)$
9:           if $(L(u) \neq L(w)) \lor \left((L(u) \equiv L(w)) \land (v < w)\right)$ then
10:              $T \leftarrow T + 1$
11: return $T$
Theorem 6 (Correctness).

Alg. 10 can accurately count all triangles in a graph $G$.

Proof.

Lemma 5 establishes that a triangle in the graph falls into one of two cases: 1) the two endpoint vertices of the horizontal-edge are on the same level while the apex vertex is on a different level, or 2) all three vertices of the triangle are at the same level.

Consider a triangle $\{v_a, v_b, v_c\}$ in $G$. Without loss of generality, assume that $(v_a, v_b)$ is a horizontal-edge, implying $L(v_a) \equiv L(v_b)$. Let $v_c$ be the apex vertex. The two cases can be distinguished as follows:

For the first case, each triangle is uniquely defined by a horizontal-edge and an apex vertex from the common neighbors of the horizontal-edge's endpoint vertices. Whenever Alg. 10 identifies such a triangle $\{v_a, v_b, v_c\}$, it increments the total triangle count $T$ by 1.

In the second case, where all three vertices are at the same level ($L(v_c) \equiv L(v_a) \equiv L(v_b)$), Alg. 10 ensures that $T$ is increased by 1 only when $v_a < v_b < v_c$. This condition ensures that triangle $\{v_a, v_b, v_c\}$ is counted only once, preventing triple-counting and ensuring the correctness of the triangle count.

Hence, Alg. 10 is proven to accurately count all triangles in the graph $G$. ∎

The time complexity of Alg. 10 can be analyzed as follows. The computation of breadth-first search, including determining the level of each vertex and marking horizontal edges, requires $\mathcal{O}(n+m)$ time.

There are at most $\mathcal{O}(m)$ horizontal edges, and finding the common neighbors of a single horizontal edge can be done in $\mathcal{O}(d_{\text{max}})$ time. Here, $d_{\text{max}}$ represents the maximal degree of a vertex in the graph.

Therefore, the overall time complexity of CETC-Seq is $\mathcal{O}(m \cdot d_{\text{max}})$.
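As an illustration of the counting rule, the following Python sketch mirrors the loop structure of Alg. 10 and compares its result with a brute-force count. It is a minimal sketch under stated assumptions: the graph is an adjacency-list dictionary with integer vertex ids, neighborhoods are intersected with Python sets, and the names `cetc_seq`, `brute_force`, and the example graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_levels(adj):
    """Assign a BFS level to every vertex, restarting from each unvisited vertex."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def cetc_seq(adj):
    """Count triangles by intersecting neighborhoods of horizontal edges only."""
    level = bfs_levels(adj)
    nbrs = {u: set(vs) for u, vs in adj.items()}
    t = 0
    for u in adj:
        for v in adj[u]:
            if level[u] == level[v] and u < v:        # (u, v) is horizontal
                for w in nbrs[u] & nbrs[v]:
                    # apex on another level, or all three on one level with v < w
                    if level[u] != level[w] or v < w:
                        t += 1
    return t


def brute_force(adj):
    """Reference count over all vertex triples (for checking small inputs only)."""
    nbrs = {u: set(vs) for u, vs in adj.items()}
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in nbrs[a] and c in nbrs[a] and c in nbrs[b])


# Hypothetical example: K4 on vertices 0..3 plus a pendant vertex 4.
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
```

With BFS rooted at vertex 0, vertices 1, 2, and 3 share a level, so the example exercises both cases: three triangles with one horizontal-edge and one triangle with three horizontal-edges.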

4.3. Variants of CETC-Seq

4.3.1. CETC Forward Exchanging Triangle Counting Algorithm (CETC-Seq-FE)

The overall performance of CETC-Seq is closely related to the covering ratio $c$. A higher covering ratio means that fewer edges are eliminated, which increases the actual runtime of the algorithm. Therefore, after completing the BFS, an appropriate algorithm can be selected based on $c$. Alg. 11 presents the variant of CETC-Seq that dynamically selects the most suitable approach based on $c$, called CETC-Seq-FE. Initially, it calculates $c$ using the BFS results. If $c$ is below a specified threshold, we continue using Alg. 10; otherwise, Alg. 5 is chosen. (The value of $c$ should be at least less than $\frac{m-n+1}{m}$. After comparing the performance of Alg. 10 and Alg. 5, we set this threshold to 0.7 in our experiments.) Considering the analyses presented for Alg. 5 and Alg. 10, Alg. 11 maintains a time complexity of $\mathcal{O}\left(m^{1.5}\right)$.

Algorithm 11 CETC Forward Exchanging (CETC-Seq-FE)
1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall v \in V$
5:     if $v$ unvisited, then BFS($G$, $v$)
6: Calculate $c$ based on the BFS results
7: If $(c < threshold)$
8:     $T \leftarrow \mbox{CETC-Seq}(G)$  $\triangleright$ Alg. 10
9: Else
10:     $T \leftarrow \mbox{TC\_forward}(G)$  $\triangleright$ Alg. 5
11: return $T$
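The selection step in Alg. 11 can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes that $c$ is the number of horizontal edges divided by $m$, it reuses the BFS levels instead of recomputing them, and `forward_stand_in` is a simple degree-ordered intersection used here only as a placeholder for Alg. 5, which is not reproduced in this section. All function names are hypothetical.

```python
from collections import deque


def bfs_levels(adj):
    """Assign a BFS level to every vertex, restarting from each unvisited vertex."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def cetc_count(adj, level):
    """Cover-edge counting rule applied with precomputed levels."""
    nbrs = {u: set(vs) for u, vs in adj.items()}
    return sum(1 for u in adj for v in adj[u]
               if u < v and level[u] == level[v]
               for w in nbrs[u] & nbrs[v]
               if level[u] != level[w] or v < w)


def forward_stand_in(adj):
    """Placeholder for Alg. 5 (assumption): orient edges by (degree, id) rank."""
    order = sorted(adj, key=lambda x: (len(adj[x]), x))
    rank = {v: i for i, v in enumerate(order)}
    out = {u: {v for v in adj[u] if rank[u] < rank[v]} for u in adj}
    return sum(len(out[u] & out[v]) for u in adj for v in out[u])


def cetc_seq_fe(adj, threshold=0.7):
    """Return (triangle count, covering ratio c), choosing the method by c."""
    level = bfs_levels(adj)
    m = sum(len(vs) for vs in adj.values()) // 2
    horizontal = sum(1 for u in adj for v in adj[u]
                     if u < v and level[u] == level[v])
    c = horizontal / m if m else 0.0          # assumed definition of c
    if c < threshold:
        return cetc_count(adj, level), c
    return forward_stand_in(adj), c


# Hypothetical example: K4 on vertices 0..3 plus a pendant vertex 4.
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
```

For the example graph, three of the seven edges are horizontal, so the default threshold selects the cover-edge path; lowering the threshold forces the fallback, and both paths return the same count.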

4.3.2. CETC Split Triangle Counting Algorithm (CETC-Seq-S)

Algorithm 12 CETC Split Triangle Counting (CETC-Seq-S)
1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall v \in V$
5:     if $v$ unvisited, then BFS($G$, $v$)
6: $\forall (u,v) \in E$
7:     if $(L(u) \equiv L(v))$ then  $\triangleright$ $(u,v)$ is horizontal
8:        Add $(u,v)$ to $G_0$
9:     else
10:        Add $(u,v)$ to $G_1$
11: $T \leftarrow \mbox{TC\_forward-hashed}(G_0)$  $\triangleright$ Alg. 6
12: $\forall u \in V_{G_1}$
13:     $\forall v \in N_{G_1}(u)$
14:        Hash[$v$] $\leftarrow$ true
15:     $\forall v \in N_{G_0}(u)$
16:        if $(u < v)$ then
17:           $\forall w \in N_{G_1}(v)$
18:              if Hash[$w$] then
19:                 $T \leftarrow T + 1$
20:     $\forall v \in N_{G_1}(u)$
21:        Hash[$v$] $\leftarrow$ false
22: return $T$

Alg. 12 is another variant of CETC-Seq, called CETC-Seq-S. This variant is similar to cover-edge triangle counting in Alg. 10 and uses BFS to assign a level to each vertex in lines 4 and 5. Next, in lines 6 to 10, the edges $E$ of the graph are partitioned into two sets: $E_0$, the horizontal edges where both endpoints are on the same level, and $E_1$, the remaining tree and non-tree edges that span a level. Thus, we now have two graphs, $G_0=(V,E_0)$ and $G_1=(V,E_1)$, where $E = E_0 \cup E_1$ and $E_0 \cap E_1 = \emptyset$. Triangles that are fully in $G_0$ are counted with one method, and triangles not fully in $G_0$ are counted with another method. For $G_0$, the graph with horizontal edges, we count the triangles efficiently using the forward-hashed method (line 11). For triangles not fully in $G_0$, the algorithm uses $G_1$, the graph that contains the edges that span levels, with a hashed intersection approach in lines 12 to 21. As per the cover-edge triangle counting, we need to find the intersections of the adjacency lists from the endpoints of horizontal edges. Thus, we use $G_0$ to select the edges and perform the hash-based intersections from the adjacency lists in graph $G_1$.

The proof of correctness for cover-edge triangle counting is given in Section 4.2. Alg. 12 is a hybrid version of this algorithm that partitions the edge set and uses two different methods to count these two types of triangles. The proof of correctness is still valid with these refinements to the algorithm. The running time of Alg. 12 is the maximum of the running time of forward-hashing and Alg. 10. Alg. 12 uses hashing for the set intersections. For vertices $u$ and $v$, the cost is $\min(d(u),d(v))$ since the algorithm can check if the neighbors of the lower-degree endpoint are in the hash set of the higher-degree endpoint. Over all $(u,v)$ edges in $E$, these intersections take $\mathcal{O}\left(m \cdot a(G)\right)$ expected time. Hence, Alg. 12 takes $\mathcal{O}\left(m \cdot a(G)\right)$ expected time.

Similar to the forward-hashed method, pre-processing the graph by re-ordering the vertices in decreasing order of degree in $\Theta\left(n \log n\right)$ time often leads to a faster triangle counting algorithm in practice.
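The split strategy of Alg. 12 can be sketched in Python as shown here. The sketch assumes an adjacency-list dictionary with integer vertex ids; `oriented_count` is a simple id-ordered intersection used only as a placeholder for the forward-hashed method (Alg. 6), which is not reproduced in this section, and all names are hypothetical.

```python
from collections import deque


def bfs_levels(adj):
    """Assign a BFS level to every vertex, restarting from each unvisited vertex."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def oriented_count(adj):
    """Placeholder for Alg. 6 (assumption): intersect higher-id neighbor sets."""
    out = {u: {v for v in adj[u] if u < v} for u in adj}
    return sum(len(out[u] & out[v]) for u in adj for v in out[u])


def cetc_seq_s(adj):
    level = bfs_levels(adj)
    # Partition the edges: G0 holds horizontal edges, G1 holds level-spanning edges.
    g0 = {u: [v for v in adj[u] if level[v] == level[u]] for u in adj}
    g1 = {u: [v for v in adj[u] if level[v] != level[u]] for u in adj}
    t = oriented_count(g0)                 # triangles with three horizontal edges
    hashed = dict.fromkeys(adj, False)
    for u in adj:
        for v in g1[u]:                    # mark the G1 neighbors of u
            hashed[v] = True
        for v in g0[u]:
            if u < v:                      # each horizontal edge is used once
                for w in g1[v]:
                    if hashed[w]:          # w is adjacent to both u and v in G1
                        t += 1
        for v in g1[u]:                    # reset the marks for the next vertex
            hashed[v] = False
    return t


# Hypothetical example: K4 on vertices 0..3 plus a pendant vertex 4.
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
```

In the example, $G_0$ is the triangle on vertices 1, 2, and 3, which contributes one triangle, and the hashed pass over $G_1$ finds the three triangles whose apex is vertex 0.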

4.3.3. CETC-Split Recursive Triangle Counting Algorithm (CETC-Seq-SR)

Algorithm 13 CETC-Split Recursive Triangle Counting (CETC-Seq-SR)
1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $T \leftarrow 0$
4: $\forall v \in V$
5:     if $v$ unvisited, then BFS($G$, $v$)
6: $\forall (u,v) \in E$
7:     if $(L(u) \equiv L(v))$ then  $\triangleright$ $(u,v)$ is horizontal
8:        Add $(u,v)$ to $G_0$
9:     else
10:        Add $(u,v)$ to $G_1$
11: if (size of $G_0 >$ threshold) then
12:     $T \leftarrow \mbox{CESR}(G_0)$
13: else
14:     $T \leftarrow \mbox{TC\_forward-hashed}(G_0)$  $\triangleright$ Alg. 6
15: $\forall u \in V_{G_1}$
16:     $\forall v \in N_{G_1}(u)$
17:        Hash[$v$] $\leftarrow$ true
18:     $\forall v \in N_{G_0}(u)$
19:        if $(u < v)$ then
20:           $\forall w \in N_{G_1}(v)$
21:              if Hash[$w$] then
22:                 $T \leftarrow T + 1$
23:     $\forall v \in N_{G_1}(u)$
24:        Hash[$v$] $\leftarrow$ false
25: return $T$

Alg. 13 is similar to Alg. 12. The only difference lies in how the subgraph $G_0$, consisting of the horizontal edges, is handled. If its size is larger than the given threshold value, we recursively call the algorithm to further reduce the graph size (line 12). If the size of $G_0$ is not larger than the given threshold value, we directly call Alg. 6 to get the total number of triangles in $G_0$ (line 14). We use the same threshold value of 0.7 in the experiment as outlined in Alg. 11. The idea behind the recursive call is that we can quickly count the triangles containing edges across both $G_0$ and $G_1$, and then we can safely remove all the edges in $G_1$ to reduce the graph size. Finally, Alg. 6 will focus on a smaller graph whose edges may contain multiple triangles.
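A possible reading of the recursive variant is sketched in Python as follows. Several points are assumptions rather than statements from the paper: the "size of $G_0$" is interpreted as the ratio $|E_0|/|E|$ compared against the threshold, the cross-graph triangles are counted with set intersections instead of the hash array, and `oriented_count` stands in for Alg. 6. The recursion terminates because every BFS tree edge is non-horizontal, so $G_0$ always has strictly fewer edges than its parent graph.

```python
from collections import deque


def bfs_levels(adj):
    """Assign a BFS level to every vertex, restarting from each unvisited vertex."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def oriented_count(adj):
    """Placeholder for Alg. 6 (assumption): intersect higher-id neighbor sets."""
    out = {u: {v for v in adj[u] if u < v} for u in adj}
    return sum(len(out[u] & out[v]) for u in adj for v in out[u])


def cetc_seq_sr(adj, threshold=0.7):
    m = sum(len(vs) for vs in adj.values()) // 2
    if m == 0:
        return 0
    level = bfs_levels(adj)
    g0 = {u: [v for v in adj[u] if level[v] == level[u]] for u in adj}
    g1 = {u: {v for v in adj[u] if level[v] != level[u]} for u in adj}
    # Triangles with exactly one horizontal edge: apex adjacent via G1 to both ends.
    t = sum(len(g1[u] & g1[v]) for u in adj for v in g0[u] if u < v)
    m0 = sum(len(vs) for vs in g0.values()) // 2
    if m0 / m > threshold:        # assumed meaning of "size of G0 > threshold"
        t += cetc_seq_sr(g0, threshold)
    else:
        t += oriented_count(g0)
    return t


# Hypothetical example: K4 on vertices 0..3 plus a pendant vertex 4.
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
```

Setting `threshold=0.0` forces the recursion to continue until no horizontal edges remain, which is a convenient way to exercise the recursive path on small inputs.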

4.4. Parallel CETC Algorithm on Shared-Memory (CETC-SM)

In Section 3, we introduced three commonly employed intersection-based methods: merge-path, binary search, and hash, alongside their corresponding parallel versions as outlined in Alg. 7, Alg. 8, and Alg. 9.

The fundamental concept behind the proposed parallel algorithms is to calculate the intersection of the neighbor lists of the two endpoints of any edge $(u,v)$ in parallel, which can significantly improve performance.

Algorithm 14 Shared Memory Parallel Cover-Edge Triangle Counting (CETC-SM)
1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: $c_1, c_2 \leftarrow 0$
4: Run Parallel BFS on $G$ and mark the level.
5: $\forall (u,v) \in E$ do in parallel
6:     if $(L(u) \equiv L(v)) \land (u < v)$ then  $\triangleright$ $(u,v)$ is horizontal
7:       $\forall w \in N(u) \cap N(v)$
8:          if $(L(w) \neq L(u))$ then
9:            $c_1 \leftarrow c_1 + 1$
10:         else
11:           $c_2 \leftarrow c_2 + 1$
12: $T \leftarrow c_1 + c_2/3$
13: return $T$  $\triangleright$ See Alg. 10

Alg. 14 demonstrates the parallelization of the Cover-Edge triangle counting algorithm for shared memory. In the context of the PRAM (Parallel Random Access Machine) model, both parallel BFS and parallel sorting have been shown to achieve scalable performance (Cormen et al., 2022). For set intersection operations on a single edge, it is imperative that the computation remains well below $m^{0.5}$ (Schank, 2007), particularly when dealing with large input graphs. Consequently, the total work, which is $\mathcal{O}(m^{1.5})$, can be evenly distributed among $p$ processors, where $p$ represents the total number of processors. As a result, CETC-SM exhibits a parallel time complexity of $\mathcal{O}(\frac{m^{1.5}}{p})$, ensuring scalability as the number of parallel processors increases.
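The two-counter scheme of Alg. 14 can be illustrated with the Python sketch below. It is a structural sketch only, under several assumptions: the BFS is sequential rather than parallel, horizontal edges are filtered before being distributed, and a `ThreadPoolExecutor` stands in for the OpenMP threads (CPython threads will not give a real speedup here). The names `count_chunk` and `cetc_sm` are hypothetical.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def bfs_levels(adj):
    """Sequential stand-in for the parallel BFS of line 4."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def count_chunk(edges, nbrs, level):
    """Per-worker counters: c1 for apexes on another level, c2 for the same level."""
    c1 = c2 = 0
    for u, v in edges:
        for w in nbrs[u] & nbrs[v]:
            if level[w] != level[u]:
                c1 += 1
            else:
                c2 += 1
    return c1, c2


def cetc_sm(adj, workers=4):
    level = bfs_levels(adj)
    nbrs = {u: set(vs) for u, vs in adj.items()}
    horizontal = [(u, v) for u in adj for v in adj[u]
                  if u < v and level[u] == level[v]]
    chunks = [horizontal[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda ch: count_chunk(ch, nbrs, level), chunks))
    c1 = sum(p[0] for p in partial)
    c2 = sum(p[1] for p in partial)
    # A triangle with three horizontal edges is seen once per edge, hence c2 / 3.
    return c1 + c2 // 3


# Hypothetical example: K4 on vertices 0..3 plus a pendant vertex 4.
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
```

Keeping private counters per worker and combining them at the end avoids contention on shared counters, and dividing `c2` by three removes the need for the `v < w` test inside the parallel loop.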

4.5. Communication-Efficient Parallel CETC Algorithm on Distributed-Memory (CETC-DM)

This subsection presents our communication-efficient parallel algorithm for counting triangles in massive graphs on a $p$-processor distributed-memory parallel computer. We will take advantage of the concept of Cover-Edge Set to significantly improve the communication performance of our triangle counting method. Since distributed triangle counting is communication-bound (Pearce, 2017), this algorithm is expected to improve the overall running time. The input graph $G$ is stored in a compressed sparse row (CSR) format. The vertices are partitioned non-uniformly to the $p$ processors such that each processor stores approximately $2m/p$ edge endpoints. This graph input follows the format used by the majority of parallel graph algorithm implementations and benchmarks such as Graph500 and Graph Challenge.

Our communication-efficient parallel algorithm CETC-DM (see Alg. 15) is based on the same cover-edge approach proposed in Section 4.2. The binary operator $\oplus$ used in line 11 is bitwise exclusive OR (XOR).

Algorithm 15 CETC Communication Efficient Triangle Counting (CETC-DM)
1: Input: Graph $G=(V,E)$
2: Output: Triangle Count $T$
3: Run parallel BFS($G$) and build partial cover-edge set $S_i$ on $p_i$
4: For all $p_i$, $i \in \{0 \ldots p-1\}$ in parallel do:
5:     $t_i \leftarrow 0$
6:     $\forall (u,v) \in S_i$ with $u < v$ on $p_i$
7:        $\forall w \in V_i$ such that $w \in N(u), N(v)$
8:           if $(L(u) \neq L(w)) \lor \left((L(u) \equiv L(w)) \land (v < w)\right)$ then
9:              $t_i = t_i + 1$
10:     For $j \leftarrow 1$ to $p-1$ do:
11:        Processors $i$ and $i \oplus j$ swap edge sets $S_i$ and $S_j$.
12:        $\forall (u,v) \in S_j$ with $u < v$ on $p_i$
13:           $\forall w \in V_i$ such that $w \in N(u), N(v)$
14:              if $(L(u) \neq L(w)) \lor \left((L(u) \equiv L(w)) \land (v < w)\right)$ then
15:                 $t_i = t_i + 1$
16: $T \leftarrow$ Reduce$(t_i, +)$

Similar to the sequential CETC-Seq algorithm, the cover-edge set $S = \cup_{i=0}^{p-1} S_i$ is determined in line 3 by labeling the horizontal edges from a parallel BFS.

Each processor runs lines 4 to 15 in parallel, which consist of two main substeps. Local triangles are counted in lines 6 to 9, and a total exchange of cover-edges between each pair of processors to count triangles is performed in lines 10 to 15. Note that at the end of each iteration of the for loop, processor $p_i$ can discard the cover-edge set $S_j$. In lines 7 and 13, processor $p_i$ determines for each cover edge $(u,v)$ all the apex vertices $w$ held locally that are adjacent to both $u$ and $v$. The logic for counting triangles in lines 8 and 14 is similar to Alg. 10 so as to count only unique triangles. Finally, a reduction operation in line 16 calculates the total number of triangles by accumulating the $p$ triangle counters, i.e., $T = \sum_{i=0}^{p-1} t_i$.
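The exchange pattern can be illustrated with a single-process simulation in Python. This sketch makes several simplifying assumptions that are not from the paper: processors are simulated by list indices, $p$ is a power of two so that the XOR pairing is well defined, vertices are assigned round-robin instead of by edge-endpoint balance, each cover edge is owned by the processor holding its smaller endpoint, and vertex levels are read from one shared dictionary rather than communicated. The name `cetc_dm_sim` is hypothetical.

```python
from collections import deque


def bfs_levels(adj):
    """Sequential stand-in for the parallel BFS of line 3."""
    level = {}
    for root in adj:
        if root in level:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in level:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def cetc_dm_sim(adj, p=4):
    assert p > 0 and p & (p - 1) == 0, "this sketch assumes p is a power of two"
    level = bfs_levels(adj)
    owner = {v: idx % p for idx, v in enumerate(sorted(adj))}  # assumed partition
    # Each simulated processor keeps only the adjacency sets of its own vertices.
    local_nbrs = [{w: set(adj[w]) for w in adj if owner[w] == i} for i in range(p)]
    cover = [[] for _ in range(p)]                             # partial sets S_i
    for u in adj:
        for v in adj[u]:
            if u < v and level[u] == level[v]:
                cover[owner[u]].append((u, v))

    def count_local(i, edges):
        """Count triangles whose apex w is stored on simulated processor i."""
        t = 0
        for u, v in edges:
            for w, nb in local_nbrs[i].items():
                if u in nb and v in nb and (level[u] != level[w] or v < w):
                    t += 1
        return t

    t = [count_local(i, cover[i]) for i in range(p)]           # lines 6 to 9
    for j in range(1, p):                                      # lines 10 to 15
        for i in range(p):
            # processor i receives the cover-edge set held by partner i XOR j
            t[i] += count_local(i, cover[i ^ j])
    return sum(t)                                              # line 16: reduction


# Hypothetical example: K4 on vertices 0..3 plus a pendant vertex 4.
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4], 4: [3]}
```

Because every processor sees each partial cover-edge set exactly once (its own set plus one set per round), each triangle is counted exactly once, on the processor that stores its apex vertex. Only cover edges move between processors; adjacency lists stay in place.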

4.5.1. Cost Analysis

Space

In addition to the input graph data structure, an additional bit is needed per edge (for marking a horizontal-edge) and $\mathcal{O}\left(\lceil \log D \rceil\right)$ bits per vertex to store its level, where $D$ is the diameter of the graph. This is a total of at most $m + n\lceil \log D \rceil$ bits across the $p$ processors. Preserving the graph requires an additional $\mathcal{O}\left(n+m\right)$ space.

Compute

The BFS costs $\mathcal{O}\left((n+m)/p\right)$ (Cormen et al., 2022). The search corresponding to one cover-edge in a vertex's adjacency list takes at most $\mathcal{O}\left(\log(d_{max})\right)$ time using binary search, and only $\mathcal{O}\left(1\right)$ expected time using a hash table. Let $d_i$ be the degree of vertex $v_i$ where $0 \leq i < n$. Searching $km$ edges in all vertices' adjacency lists takes $\mathcal{O}(km \sum_{i=0}^{n-1} \log(d_i)) = \mathcal{O}(km \log(\prod_{i=0}^{n-1} d_i))$ time.
Since $\sum_{i=0}^{n-1} d_i = 2m$, we know that $\log(\prod_{i=0}^{n-1} d_i)$ reaches its maximum value when $d_i = 2m/n$ for $0 \leq i < n$. Thus, $\mathcal{O}(km \log(\prod_{i=0}^{n-1} d_i)) \leq \mathcal{O}(km \log((2m/n)^{n})) \leq \mathcal{O}(kmn \log(2n^{2}/n)) = \mathcal{O}(mn \log(n))$.
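The maximization step can be justified with the inequality of arithmetic and geometric means. The derivation below is a sketch of that reasoning, using only the symbols already defined in the text together with the standard bound $m \leq n^2$ for simple graphs.

```latex
\prod_{i=0}^{n-1} d_i
  \;\le\; \left(\frac{1}{n}\sum_{i=0}^{n-1} d_i\right)^{n}
  \;=\; \left(\frac{2m}{n}\right)^{n}
\quad\Longrightarrow\quad
\log\prod_{i=0}^{n-1} d_i
  \;\le\; n\log\frac{2m}{n}
  \;\le\; n\log\frac{2n^{2}}{n}
  \;=\; n\log(2n).
```

Equality in the first step holds exactly when all degrees are equal, which is why the maximum is attained at $d_i = 2m/n$.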

Total Communication

In our analysis of communication cost for BFS, we measure the total communication volume independent of the number of processors. Thus, this is a conservative overestimate of communication since a fraction (e.g., $1/p$) of accesses will be on the same compute node versus message traffic between nodes. At the same time, we do not consider the savings from overlapping with the computation cost.

The cost of the breadth-first search is $m$ edge traversals with $\lceil \log D \rceil + 3\lceil \log n \rceil$ bits communicated per edge traversal for the level information, pair of vertex ids, and vertex degree, yielding $m \cdot (\lceil \log D \rceil + 3\lceil \log n \rceil)$ bits for the BFS. Transferring $km$ horizontal-edges requires $kmp\lceil \log n \rceil$ bits, where $p$ is the number of processors. The final reduction to find the total number of triangles requires $(p-1)\lceil \log n \rceil$ bits.

Hence, the total communication volume is $m\cdot(\lceil\log D\rceil + 3\lceil\log n\rceil) + kmp\lceil\log n\rceil + (p-1)\lceil\log n\rceil = m\cdot(\lceil\log D\rceil + (kp+3)\lceil\log n\rceil) + (p-1)\lceil\log n\rceil$ bits. Since the word size is $\Theta(\log n)$ and $D \leq n$, the communication of CETC-DM is $\mathcal{O}(pm)$ words.
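The three terms of the communication volume can be evaluated directly. The following Python sketch is illustrative only: the function and parameter names are hypothetical, logarithms are assumed to be base 2 because the quantities are bit counts, and `horizontal_edges` stands in for the product $km$ so that the arithmetic stays in integers.

```python
def ceil_log2(x: int) -> int:
    """Return ceil(log2(x)) for an integer x >= 1 using exact integer arithmetic."""
    if x < 1:
        raise ValueError("x must be >= 1")
    return (x - 1).bit_length()


def cetc_dm_comm_bits(n: int, m: int, diameter: int,
                      horizontal_edges: int, p: int) -> int:
    """Total communication volume in bits, summing the three terms from the text.

    horizontal_edges plays the role of k*m (hypothetical parameter name).
    """
    log_d = ceil_log2(diameter)
    log_n = ceil_log2(n)
    # BFS: level, two vertex ids, and a vertex degree per edge traversal.
    bfs_bits = m * (log_d + 3 * log_n)
    # Horizontal-edge transfer: k*m*p*ceil(log n) bits, as stated in the text.
    transfer_bits = horizontal_edges * p * log_n
    # Final reduction of the partial triangle counts.
    reduction_bits = (p - 1) * log_n
    return bfs_bits + transfer_bits + reduction_bits
```

For instance, with $n = 8$, $m = 8$, $D = 4$, four horizontal edges ($k = 0.5$), and $p = 2$, the three terms are 88, 24, and 3 bits, which agrees with the closed form $m\cdot(\lceil\log D\rceil + (kp+3)\lceil\log n\rceil) + (p-1)\lceil\log n\rceil = 8\cdot 14 + 3$.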

5. Open Source Evaluation Framework

In the preceding sections, we presented all sequential and shared-memory algorithms from literature known to the authors plus our novel approaches. In this section, we introduce our open-source framework designed to integrate comprehensive triangle counting implementations.

No unified framework currently encompasses all of these implementations, yet such a framework is important for researchers who need to compare existing algorithms with one another and against newly proposed methods. Consequently, we have developed a comprehensive open-source framework to fill this gap. The framework is designed to ensure a thorough evaluation of triangle counting algorithms. It includes implementations of 22 sequential methods and 11 shared-memory parallel methods, as complete a set as we could find in the literature.

Each triangle counting routine has a single argument – a pointer to the graph in a compressed sparse row (CSR) format. The input is treated as read-only. Each algorithm is charged the full cost if the implementation needs auxiliary arrays, pre-processing steps, or additional data structures. Each implementation must manage memory and not contain any memory leaks – hence, any dynamically allocated memory must be freed before returning the result.
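To make the calling convention concrete, here is a minimal Python sketch of a CSR graph and a single-argument counting routine. It is not the framework's C code: the names `CSRGraph`, `csr_from_edges`, and `count_triangles` are hypothetical, and the counting loop is a plain direction-oriented edge iterator chosen only for brevity.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CSRGraph:
    """Read-only compressed sparse row graph (undirected, both directions stored)."""
    n: int                      # number of vertices
    row_ptr: Tuple[int, ...]    # length n + 1; row v spans row_ptr[v]:row_ptr[v + 1]
    col_idx: Tuple[int, ...]    # concatenated, sorted adjacency lists

    def neighbors(self, v: int) -> Tuple[int, ...]:
        return self.col_idx[self.row_ptr[v]:self.row_ptr[v + 1]]


def csr_from_edges(n: int, edges: Iterable[Tuple[int, int]]) -> CSRGraph:
    """Build a CSR graph from an undirected edge list, dropping loops and duplicates."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    row_ptr = [0]
    col_idx = []
    for v in range(n):
        col_idx.extend(sorted(adj[v]))
        row_ptr.append(len(col_idx))
    return CSRGraph(n, tuple(row_ptr), tuple(col_idx))


def count_triangles(graph: CSRGraph) -> int:
    """Single-argument routine: count each triangle u < v < w exactly once."""
    total = 0
    for u in range(graph.n):
        # Auxiliary structure built inside the routine, so its cost is charged to it.
        higher = {w for w in graph.neighbors(u) if w > u}
        for v in higher:
            total += sum(1 for w in graph.neighbors(v) if w > v and w in higher)
    return total
```

The frozen dataclass with tuple fields mirrors the read-only treatment of the input, and the temporary set is created and discarded within the call, in the spirit of the memory-management rule above.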

The output from each implementation is an integer with the number of triangles found. Each algorithm is run ten times, and the mean running time is reported. To reduce variance for random graphs, the same graph instance is used for all of the experiments. For sequential algorithms, the source code is sequential C code without any explicit parallelization. For parallel algorithms, we use OpenMP to parallelize the C code. The same coding style and effort were used for each implementation.
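The measurement protocol described above (ten runs, mean time, same graph instance) can be sketched as a small harness. This is an illustrative Python version with a hypothetical function name; the framework itself is written in C, and its timing code is not shown in this section.

```python
import time
from statistics import mean


def mean_runtime(routine, graph, runs: int = 10):
    """Call routine(graph) `runs` times and return (last result, mean seconds)."""
    timings = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = routine(graph)          # the same graph instance is reused every run
        timings.append(time.perf_counter() - start)
    return result, mean(timings)
```

Timing the whole call, rather than only the inner counting loop, is what charges pre-processing and auxiliary data structures to the algorithm.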

Here, we list the algorithms evaluated in the experiments of the next section, including both established methods and the newly proposed algorithms. Algorithm names that end with P indicate that we have also developed parallel versions.

W/WP: Wedge-checking/Parallel version
WD/WDP: Wedge-checking (direction-oriented)/Parallel version
EM/EMP: Edge Iterator with MergePath for set intersection/Parallel version
EMD/EMDP: Edge Iterator with MergePath for set intersection (direction-oriented)/Parallel version
EB/EBP: Edge Iterator with BinarySearch for set intersection/Parallel version
EBD/EBDP: Edge Iterator with BinarySearch for set intersection (direction-oriented)/Parallel version
ET/ETP: Edge Iterator with partitioning for set intersection/Parallel version
ETD/ETDP: Edge Iterator with partitioning for set intersection (direction-oriented)/Parallel version
EH/EHP: Edge Iterator with Hashing for set intersection/Parallel version
EHD/EHDP: Edge Iterator with Hashing for set intersection (direction-oriented)/Parallel version
F: Forward
FH: Forward with Hashing
FHD: Forward with Hashing and degree-ordering
TS: Tri_simple (Davis, 2018)
LA: Linear Algebra (CMU) (Low et al., 2017)
IR: Treelist from Itai-Rodeh (Itai and Rodeh, 1978)
CETC-Seq/CETC-SM: Cover Edge Triangle Counting (Bader et al., 2023)/Parallel version on shared-memory
CETC-Seq-D: Cover Edge Triangle Counting with degree-ordering (Bader et al., 2023)
CETC-Seq-FE: Cover Edge Forward Exchanging Triangle Counting
CETC-Seq-S: Cover Edge Split Triangle Counting (Bader, 2023)
CETC-Seq-SD: Cover Edge Split Triangle Counting with degree-ordering (Bader, 2023)
CETC-Seq-SR: Cover Edge-Split Recursive Triangle Counting

6. Experimental Results

6.1. Platform Configuration

We use the Intel Development Cloud for benchmarking our results on a GNU/Linux node. The compiler is Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320), and '-O2' is used as the compiler optimization flag. We use a recently launched Intel Xeon processor (Sapphire Rapids, launched Q1'23) with DDR5 memory for both sequential and parallel implementations. The node is a dedicated 2.00 GHz 56-core (112 thread) Intel(R) Xeon(R) Platinum 8480+ processor (formerly known as Sapphire Rapids) with 105M cache and 1024GB of DDR5 RAM.

6.2. Data Sets

We employ a diverse collection of graphs. The real-world datasets are from SNAP. For the synthetic graphs, we use large Graph500 RMAT graphs (Chakrabarti et al., 2004) with parameters $a = 0.57$, $b = 0.19$, $c = 0.19$, and $d = 0.05$, similar to the IARPA AGILE benchmark graphs. An overview of all 24 graphs in our dataset is presented in Table 1.

The values of the covering ratio $c$ exhibit substantial variation across different graphs, ranging from about 0.94 down to 0.14 (see Table 1). Smaller $c$ values signify a higher potential for avoiding fruitless searches, thereby enhancing the efficiency of our approach.
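As an illustration of how such a ratio could be measured, the Python sketch below assumes that the covering ratio is the fraction of edges whose two endpoints lie on the same BFS level (the horizontal edges used by the cover-edge construction). That reading, the function names, and the choice of the lowest-numbered vertex as BFS root in each component are assumptions; the resulting value can depend on the root.

```python
from collections import deque


def bfs_levels(adj):
    """Return the BFS level of every vertex, restarting in each connected component."""
    level = [-1] * len(adj)
    for root in range(len(adj)):
        if level[root] != -1:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if level[v] == -1:
                    level[v] = level[u] + 1
                    queue.append(v)
    return level


def covering_ratio(adj):
    """Fraction of undirected edges that are horizontal (same level at both ends)."""
    level = bfs_levels(adj)
    edges = 0
    horizontal = 0
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            if u < v:                      # visit each undirected edge once
                edges += 1
                if level[u] == level[v]:
                    horizontal += 1
    return horizontal / edges if edges else 0.0
```

On a single triangle, one of the three edges is horizontal; on a 4-cycle none is, so no set intersections would be needed at all.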

Table 1. Data Sets for the Experiments
Graph Name Graph ID $n$ $m$ # triangles $c$ (%)
RMAT 6 1 64 1024 9100 93.8
RMAT 7 2 128 2048 18855 90.9
RMAT 8 3 256 4096 39602 87.6
RMAT 9 4 512 8192 86470 87.2
RMAT 10 5 1024 16384 187855 82.8
RMAT 11 6 2048 32768 408876 81.1
RMAT 12 7 4096 65536 896224 77.5
RMAT 13 8 8192 131072 1988410 74.9
RMAT 14 9 16384 262144 4355418 70.5
RMAT 15 10 32768 524288 9576800 68.4
RMAT 16 11 65536 1048576 21133772 65.5
RMAT 17 12 131072 2097152 46439638 62.8
karate 13 34 78 45 35.9
amazon0302 14 262111 899792 717719 44.2
amazon0312 15 400727 2349869 3686467 52.4
amazon0505 16 410236 2439437 3951063 52.7
amazon0601 17 403394 2443408 3986507 52.8
loc-Brightkite 18 58228 214078 494728 43.2
loc-Gowalla 19 196591 950327 2273138 50.8
roadNet-CA 20 1971281 2766607 120676 14.5
roadNet-PA 21 1090920 1541898 67150 14.6
roadNet-TX 22 1393383 1921660 82869 14
soc-Epinions1 23 75888 405740 1624481 53.3
wiki-Vote 24 8297 100762 608389 54.3

6.3. Results and Analysis of Sequential Algorithms

The execution times of the sequential algorithms (in seconds) are presented in Table 2.

Table 2. Execution time (in seconds) for sequential algorithms.

Graph W WD EM EMD EB EBD ET ETD EH EHD F
RMAT 6 0.0023 0.000435 0.000608 0.000301 0.001534 0.000752 0.001722 0.000896 0.000118 0.000055 0.000055
RMAT 7 0.005482 0.001603 0.001973 0.000992 0.004567 0.002257 0.005145 0.002765 0.000354 0.000166 0.000219
RMAT 8 0.016084 0.004622 0.005455 0.002726 0.012873 0.006365 0.01379 0.007551 0.000866 0.000429 0.000642
RMAT 9 0.04884 0.014317 0.014643 0.007337 0.031095 0.014825 0.018992 0.010409 0.001124 0.000567 0.000912
RMAT 10 0.081409 0.023978 0.020342 0.01015 0.046255 0.023038 0.049716 0.02734 0.00289 0.001477 0.002433
RMAT 11 0.260367 0.077086 0.053735 0.026881 0.109204 0.054332 0.128502 0.071314 0.006977 0.00352 0.006196
RMAT 12 0.896607 0.262051 0.141176 0.070548 0.293548 0.146626 0.331856 0.185648 0.017128 0.008613 0.015863
RMAT 13 2.975701 0.876912 0.372609 0.186514 0.73989 0.369528 0.849023 0.476537 0.044211 0.022125 0.040797
RMAT 14 10.520327 3.108799 0.987748 0.492118 1.829937 0.914373 2.192774 1.241524 0.114199 0.056226 0.104752
RMAT 15 35.785918 10.461789 2.626837 1.307338 4.834125 2.397788 5.607495 3.185066 0.31823 0.152634 0.27252
RMAT 16 122.100925 35.690483 6.931398 3.452957 12.020692 5.942763 14.38633 8.202219 1.072639 0.51392 0.714004
RMAT 17 426.945522 124.153096 18.328512 9.153039 31.596123 15.577652 37.014029 21.238411 3.249189 1.582771 1.865376
karate 0.000015 0.000007 0.000012 0.000006 0.000019 0.000009 0.000033 0.000014 0.000009 0.000005 0.000004
amazon0302 0.298977 0.052727 0.143474 0.066293 0.193383 0.090165 0.295645 0.152895 0.06663 0.03324 0.024458
amazon0312 1.293403 0.410326 0.720808 0.352071 1.007608 0.474253 1.587602 0.895909 0.286009 0.135694 0.108818
amazon0505 1.4014 0.458928 0.75572 0.370004 1.071849 0.505645 1.687541 0.950925 0.296272 0.140679 0.114741
amazon0601 1.400537 0.458847 0.762124 0.372543 1.085706 0.511053 1.694514 0.952832 0.300645 0.142758 0.118401
loc-Brightkite 0.473063 0.158358 0.115803 0.058309 0.1789 0.089439 0.262685 0.153408 0.025772 0.013417 0.011734
loc-Gowalla 9.018113 4.425076 2.083049 1.03675 1.247894 0.610038 2.850531 2.066343 0.354619 0.173588 0.085867
roadNet-CA 0.089097 0.032726 0.102766 0.064406 0.106291 0.06895 0.168963 0.096467 0.073492 0.051154 0.035571
roadNet-PA 0.074498 0.035727 0.07553 0.036011 0.060924 0.039204 0.097066 0.05476 0.041419 0.028734 0.019805
roadNet-TX 0.088466 0.023446 0.070821 0.044442 0.071961 0.047007 0.117474 0.066475 0.050618 0.035551 0.024401
soc-Epinions1 4.892297 1.853538 0.615327 0.306373 1.144934 0.569414 1.540422 0.876375 0.09952 0.047979 0.062959
wiki-Vote 0.830841 0.210642 0.12436 0.062224 0.290031 0.145106 0.355414 0.190386 0.019756 0.009999 0.01816

Graph FH FHD TS LA IR CETC-Seq CETC-Seq-D CETC-Seq-FE CETC-Seq-S CETC-Seq-SD CETC-Seq-SR
RMAT 6 0.000028 0.00003 0.000045 0.000049 0.001057 0.000096 0.00009 0.000036 0.000042 0.000041 0.000034
RMAT 7 0.000076 0.000079 0.000118 0.0002 0.00343 0.00045 0.000337 0.000083 0.000117 0.000105 0.000112
RMAT 8 0.00021 0.000213 0.000341 0.000601 0.010451 0.00118 0.001008 0.000242 0.000316 0.000326 0.000292
RMAT 9 0.000279 0.000284 0.000474 0.000844 0.017542 0.001573 0.001312 0.000309 0.000402 0.000415 0.000393
RMAT 10 0.000633 0.000623 0.001238 0.002231 0.056125 0.003793 0.00329 0.000721 0.000929 0.000907 0.000905
RMAT 11 0.001469 0.001386 0.003257 0.005701 0.175472 0.008988 0.007996 0.001635 0.001938 0.002022 0.001902
RMAT 12 0.0034 0.003114 0.008204 0.014481 0.541587 0.020948 0.018846 0.003683 0.004291 0.004306 0.004719
RMAT 13 0.007899 0.006939 0.021101 0.036664 1.762489 0.048973 0.043984 0.008508 0.009416 0.009234 0.010386
RMAT 14 0.018383 0.015386 0.05573 0.091965 5.926348 0.112335 0.100385 0.019555 0.02067 0.019464 0.020786
RMAT 15 0.045298 0.038136 0.212061 0.23366 19.623668 0.266065 0.233845 0.982818 0.047584 0.043019 0.046379
RMAT 16 0.120033 0.095672 0.502867 0.595802 63.575078 0.630521 0.539548 2.500813 0.117718 0.098857 0.111664
RMAT 17 0.326685 0.25219 1.509844 1.513653 209.558816 1.491119 1.245597 6.385706 0.325907 0.241169 0.284544
karate 0.000004 0.000008 0.000005 0.000002 0.000093 0.000006 0.00001 0.000006 0.000009 0.000011 0.000009
amazon0302 0.01929 0.03971 0.038784 0.0228 0.375845 0.04485 0.066 0.05224 0.044427 0.061649 0.064679
amazon0312 0.067849 0.120262 0.164739 0.100246 2.223462 0.145356 0.195258 0.216578 0.121874 0.165797 0.207789
amazon0505 0.070928 0.125915 0.169891 0.10574 2.194572 0.151568 0.204355 0.225309 0.130232 0.174099 0.200114
amazon0601 0.072972 0.127608 0.173651 0.107426 2.112184 0.153916 0.207247 0.229265 0.132083 0.176369 0.202132
loc-Brightkite 0.006487 0.00982 0.013752 0.011592 0.58484 0.011902 0.015295 0.028136 0.008623 0.012866 0.011192
loc-Gowalla 0.038467 0.054073 0.188503 0.079424 6.433136 0.074779 0.088801 0.276325 0.045462 0.063348 0.061056
roadNet-CA 0.03906 0.170009 0.05232 0.038795 0.656398 0.083151 0.209056 0.100744 0.125747 0.235934 0.141022
roadNet-PA 0.021824 0.083848 0.029642 0.021815 0.366677 0.044418 0.109172 0.054114 0.061127 0.123917 0.068803
roadNet-TX 0.027021 0.106699 0.03607 0.026994 0.506742 0.054871 0.138155 0.067203 0.075937 0.163585 0.08579
soc-Epinions1 0.019544 0.022428 0.051883 0.063396 4.939747 0.043198 0.041572 0.168057 0.021562 0.023076 0.026793
wiki-Vote 0.004964 0.00498 0.009605 0.019463 0.545068 0.013 0.014294 0.034383 0.005118 0.00535 0.005741

6.3.1. Effect of Direction-Oriented on Sequential Algorithms

The direction-oriented (DO) optimization is a pivotal strategy in triangle counting, designed to eliminate redundant calculations. In this section, we explore five distinct duplicate counting algorithms, each accompanied by its corresponding DO variant. The results presented in Fig. 2 show the speedup achieved by the DO variants over their duplicate counting versions.

Evidently, across all scenarios, the majority of DO algorithms yield a speedup of at least two-fold. In particular, the WD algorithm stands out with a higher average speedup of $3.637\times$, surpassing the performance gains of the other algorithms. EBD exhibits a speedup of $2.015\times$, closely followed by EMD at $2.005\times$, EHD at $1.965\times$, and ETD at $1.784\times$.

DO optimization primarily constitutes an algorithmic enhancement, resulting in a reduction in the overall number of operations. So, for any graph, it can improve the performance and our experimental results also confirm its efficiency. However, the practical performance gains can be impacted by various factors, including memory access patterns and cache utilization. Our comprehensive experiments, conducted on diverse graphs using a range of algorithms, underscore the substantial performance enhancements achievable through DO optimization.

In summary, DO optimization is efficient for eliminating duplicate triangle counting and significantly improving overall performance.
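The difference between duplicate counting and its direction-oriented variant can be seen in a small edge-iterator sketch. The Python code below is an illustration under assumed names, using sets for the intersections; it is not the framework's implementation, and the divide-by-six correction is one common way to handle duplicates rather than necessarily the one used in the measured codes.

```python
def triangles_duplicate(adj):
    """Edge iterator over both directions of every edge; each triangle is found 6 times."""
    neighbor_sets = [set(nbrs) for nbrs in adj]
    found = 0
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            found += len(neighbor_sets[u] & neighbor_sets[v])
    return found // 6


def triangles_direction_oriented(adj):
    """Keep only neighbors with a larger id, so each triangle u < v < w is found once."""
    forward = [{v for v in nbrs if v > u} for u, nbrs in enumerate(adj)]
    found = 0
    for u in range(len(adj)):
        for v in forward[u]:
            found += len(forward[u] & forward[v])
    return found
```

Both functions return the same count, but the direction-oriented version visits half as many edges and intersects smaller sets, which is the reduction in operations discussed above.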

Figure 2. The speedups of direction-oriented optimization compared with the duplicate counting counterparts.

6.3.2. Effect of Hash Method on Sequential Algorithms

Similar to the DO optimization, the hash-based optimization proves highly efficient in most scenarios. In Fig. 3, we illustrate the speedup achieved by Hash methods compared to non-hash implementations. The first comparison showcases the speedup of the Hash set intersection (EH) compared with the non-hashed method (EM), while the second presents the speedup of (FH) compared with (F).

The average speedup of EH is $5.4\times$, and for FH, it is $3.0\times$, underscoring the effectiveness of the hash-based optimization. Notably, the results reveal that, for more efficient algorithms, like F, the speedup is slightly lower than that of less efficient algorithms, such as EM.

However, we observe several exceptions. For roadNet-CA (Graph ID=20), roadNet-PA (Graph ID=21), and roadNet-TX (Graph ID=22), the Hash algorithm FH performs worse than the non-hashed algorithm F. This is attributed to the unique topologies of these graphs, characterized by relatively long diameters and very few neighbors for each vertex. As the intersection sets are relatively small, the MergePath operation on small sets proves more efficient than the Hash method, given the relatively high hash table overhead for very small sets. Therefore, the Hash optimization remains efficient in general, but not for graphs with such topologies, where the hash table overhead is not offset by the small intersection sets.
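The trade-off described here can be illustrated with two intersection routines. This Python sketch uses a plain two-pointer merge as a stand-in for MergePath and a built-in set as the hash table; both function names are hypothetical, and the sketch shows where the setup cost arises rather than reproducing the measured implementations.

```python
def intersect_merge(a, b):
    """Count common elements of two sorted lists with a two-pointer merge."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def intersect_hash(a, b):
    """Count common elements by hashing the larger list and probing with the smaller."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    table = set(large)          # setup cost is paid before any lookup happens
    return sum(1 for x in small if x in table)
```

For adjacency lists of only a few entries, as in the road networks, building `table` can cost more than the handful of comparisons the merge needs, which is consistent with the exceptions noted above.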

Figure 3. The speedups of hash-based optimization compared with the MergePath method.

6.3.3. Effect of Forward Algorithm and Its Variants

Our experimental results underscore the effectiveness of the Forward algorithm and its variants as robust approaches for enhancing the performance of triangle counting. In Fig. 4, we present the speedup achieved by three algorithms, namely the forward algorithm (F), the hashed forward algorithm (FH), and the hashed forward algorithm with degree ordering (FHD), in comparison with the traditional MergePath algorithm.

The observed performance improvement is remarkably significant. Specifically, F achieves an $8.6\times$ speedup, while FH and FHD achieve even more substantial speedups at $28.7\times$ and $29.1\times$, respectively. These results indicate that reducing the sizes of intersection sets and employing hash functions and degree ordering can collectively contribute to performance enhancements.

Similar to the hash method, degree ordering demonstrates substantial performance improvements across various scenarios. However, for roadNet-CA (Graph ID=20), roadNet-PA (Graph ID=21), and roadNet-TX (Graph ID=22), the hash-based algorithm FH performs worse than the non-hashed algorithm F, and the performance of the degree-ordered variant FHD is inferior to that of MergePath. This arises from the fact that most vertices in these graphs have similar, small degrees. Consequently, reordering the vertices has minimal impact on intersection performance and introduces additional overhead. Despite these exceptions, the combined approach of reducing intersection set sizes, hash functions, and degree ordering enhances performance for a wide range of cases.
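The way the Forward approach shrinks the intersected sets can be sketched as follows. This Python version follows the commonly described forward scheme (process vertices in non-increasing degree order and intersect only the lists of already-processed neighbors); it uses sets for those lists, so it is closer in spirit to FH than to an array-based F, and all names are hypothetical.

```python
def forward_triangle_count(adj):
    """Count triangles with a forward-style scheme over a degree-ordered vertex sequence."""
    n = len(adj)
    # Non-increasing degree order, ties broken by vertex id.
    order = sorted(range(n), key=lambda v: (-len(adj[v]), v))
    rank = [0] * n
    for position, v in enumerate(order):
        rank[v] = position
    # dynamic[v] holds the neighbors of v that were processed before v's later neighbors.
    dynamic = [set() for _ in range(n)]
    triangles = 0
    for s in order:
        for t in adj[s]:
            if rank[s] < rank[t]:
                # Common earlier-ranked neighbors of s and t each close one triangle.
                triangles += len(dynamic[s] & dynamic[t])
                dynamic[t].add(s)
    return triangles
```

Each triangle is counted once, at its two lowest-ranked (by degree) vertices' edge, and the sets being intersected contain only earlier-ranked neighbors. When all degrees are small and similar, as in the road networks, the ordering step adds work without shrinking those sets much.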

Figure 4. The speedups of Forward Algorithm and its variants compared with the MergePath method.

6.3.4. Effect of CETC-Seq Algorithm and Its Variants

The fundamental principle underlying the cover-edge method is minimizing unnecessary set intersection operations. In Fig. 5, we illustrate the impact of the CETC-Seq algorithm and its variants, namely CETC-Seq-D, CETC-Seq-FE, CETC-Seq-S, CETC-Seq-SD, CETC-Seq-SR.
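A compact way to see how cover edges avoid intersections is the sketch below. It is a Python illustration under stated assumptions, not the authors' C code: it takes the horizontal edges of a BFS as the cover-edge set, intersects neighbor lists only for those edges, and uses an id comparison to avoid counting a same-level triangle three times. Function and variable names are hypothetical.

```python
from collections import deque


def cover_edge_triangle_count(adj):
    """Count triangles by intersecting neighbor sets only across horizontal BFS edges."""
    n = len(adj)
    level = [-1] * n
    for root in range(n):                  # restart BFS in every component
        if level[root] != -1:
            continue
        level[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if level[v] == -1:
                    level[v] = level[u] + 1
                    queue.append(v)

    neighbor_sets = [set(nbrs) for nbrs in adj]
    triangles = 0
    for u in range(n):
        for v in adj[u]:
            if u < v and level[u] == level[v]:          # horizontal edge, taken once
                for w in neighbor_sets[u] & neighbor_sets[v]:
                    # Apex on another level: this is the triangle's only horizontal edge.
                    # Apex on the same level: count only when u < v < w.
                    if level[w] != level[u] or v < w:
                        triangles += 1
    return triangles
```

The sketch relies on the observation that the three vertices of a triangle span at most two adjacent BFS levels, so every triangle contains at least one horizontal edge. Edges that join different levels never trigger an intersection.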

Figure 5. The speedups of CETC-Seq and its variants compared with the MergePath method.

Compared to the MergePath method, CETC-Seq demonstrates an average speedup of $7.4\times$. CETC-Seq-D achieves a slightly lower average speedup of $7.39\times$, mainly due to its low performance on the road networks.

CETC-Seq-FE combines CETC-Seq and F in a unique manner. It employs CETC-Seq for large graphs or when the $c$ value is small; otherwise, it uses F. This switching approach yields an average speedup of $14.6\times$. The rationale behind CETC-Seq-FE lies in dynamically selecting the most suitable algorithm based on its compatibility with the characteristics of the graphs.

CETC-Seq-S splits a graph into two parts based on the vertex levels marked by a BFS pre-processing and applies CETC-Seq and F on each part. CETC-Seq-S achieves a speedup of $23.4\times$. This represents a more efficient combination method. Additionally, when we integrate degree ordering into CETC-Seq-S, the resulting CETC-Seq-SD algorithm performs slightly better than CETC-Seq-S, achieving a speedup of $24.0\times$. This result highlights that degree ordering works well with the F algorithm. The reason is that degree ordering can further reduce the size of intersecting sets of the F algorithm. CETC-Seq-SR employs the recursive method to simplify the problem. For a large graph, it recursively applies CETC-Seq to minimize set intersections, counting only triangles including non-horizontal edges, and finally applies F to the smaller graph consisting of all horizontal edges that are known to include all the other triangles. CETC-Seq-SR achieves an average speedup of $22.7\times$.

Notably, CETC-Seq exhibits low performance on certain graphs compared to other methods. The relatively high overhead of BFS preprocessing in CETC-Seq, compared with set intersection, contributes to the low efficiency. A breakdown of the execution time reveals that BFS processing accounts for 60% of the total execution time. In the case of a long-diameter graph where each vertex has a small number of neighbors, the overhead of BFS becomes large despite its time complexity of $\mathcal{O}(m)$ compared to the time complexity of total set intersections at $\mathcal{O}(m^{1.5})$. This overhead becomes particularly impactful when the neighbors of each vertex are limited and the graph diameter is large. For road networks characterized by very small vertex degrees, where degree ordering introduces additional overhead without providing any significant benefit, CETC-Seq-D experiences further performance degradation.

6.3.5. Comprehensive Sequential Algorithms Comparison

In Fig. 6, we present the relative execution time for twenty-two sequential triangle counting algorithms. If some triangle searching operations cannot find any triangle, we name them as fruitless operations here.

While optimal in time complexity, the IR spanning tree-based triangle counting algorithm exhibits nearly the slowest performance among all the compared algorithms. This is due to the involvement of spanning tree generation, removal of tree edges, and regeneration of a smaller graph in each iteration. Although these operations can be completed in $\mathcal{O}(m)$ time, the cost is relatively high in terms of practical performance.

The W wedge-checking-based triangle counting algorithm often performs poorly. This is primarily because most graphs are sparse, so most wedge-checking operations are fruitless; that is, most wedges cannot form a triangle. For example, for the RMAT 6 graph, the percentage of wedges/triangles is 0.53%. For the RMAT 14 graph, the percentage reduces to 0.009%. This makes most of the checks useless for counting triangles. W can demonstrate better performance only when most of the graph's wedges can form triangles. This scenario is not common in most practical applications.
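The notion of a fruitless wedge check can be made concrete with a short sketch. The Python function below (a hypothetical helper, not the W implementation) enumerates every wedge, meaning a pair of distinct neighbors of a common center vertex, and tests whether the two endpoints are adjacent. It assumes adjacency lists without duplicate entries or self-loops, and it makes no claim about how the percentages quoted above were computed.

```python
def wedge_statistics(adj):
    """Return (wedges, closed_wedges, triangles) for an undirected graph."""
    neighbor_sets = [set(nbrs) for nbrs in adj]
    wedges = 0
    closed = 0
    for nbrs in adj:                       # nbrs are the neighbors of one center vertex
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                wedges += 1                # one adjacency check per wedge
                if nbrs[j] in neighbor_sets[nbrs[i]]:
                    closed += 1
    # Every triangle closes three wedges, one centered at each of its vertices.
    return wedges, closed, closed // 3
```

The gap between `wedges` and `closed` is exactly the number of checks that find nothing. On a star graph every check is fruitless, while on a complete graph none is.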

The algorithmic structures of EM (Edge Merge Path), EB (Edge Binary Search), ET (Edge Partitioning), and EH (Edge Hash) are very similar to each other, differing primarily in the set intersection methods they employ. Merge path requires pre-sorted adjacency lists, enabling it to compare the two adjacency lists of a given edge $(u,v)$ in $d(u)+d(v)$ time. This is optimal because we have to check every neighbor. Binary search method EB searches each vertex in a small adjacency list (e.g., $N(u)$) in a larger adjacency list (e.g., $N(v)$) in $d(u)\times\log(d(v))$ time. ET is a specific case of EB and involves additional operations to find the midpoint of the two adjacency lists. Thus, from an algorithmic analysis perspective, ET's performance will always be worse than EB's. However, EB and ET can leverage parallelism effectively to improve performance. Our parallel results demonstrate that they may outperform EM. EH takes $\min(d(u),d(v)) < d(u)+d(v)$ operations to find triangles, and the Hash method doesn't require pre-sorting adjacency lists, making it better than EM and often the best performer among the four methods.
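The binary-search intersection used by EB can be sketched in a few lines. This Python illustration (with a hypothetical function name) probes the longer sorted list once per element of the shorter one, which gives the $d(u)\times\log(d(v))$ behaviour described above when $N(u)$ is the smaller list.

```python
from bisect import bisect_left


def intersect_binary_search(a, b):
    """Count common elements of two sorted lists by binary-searching the longer one."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    count = 0
    for x in small:
        pos = bisect_left(large, x)        # O(log len(large)) per probe
        if pos < len(large) and large[pos] == x:
            count += 1
    return count
```

Because each probe is independent of the others, the loop over `small` is easy to split across threads, which is one plausible reason the parallel EB variants scale well.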

Figure 6. Percentage of execution time for different sequential triangle counting algorithms.

TS (Triangle Summation) and LA (Linear Algebra) are two linear algebra-based methods. They can count the total number of triangles but cannot list all the triangles. Their performance improvements depend on optimizing formulas and architecture-related methods. The advantage of such methods lies in their ability to directly apply results from linear algebra theory and leverage highly optimized numerical techniques integrated into linear algebra libraries. Their performance is often superior to that of the EM method.

F (Forward) demonstrates excellent performance in most scenarios but is inherently sequential. As we discussed earlier, F dynamically generates two sets that are much smaller than the original adjacency lists. It is based on the DO method, which further reduces the fruitless checks in triangle counting operations. Additionally, pre-sorting vertices in non-increasing order of degree enhances memory access locality and cache hit ratios. As one can observe, F effectively reduces the operations that cannot find new triangles. The results of FH and FHD show that the performance is further improved when Hash is used.

CETC-Seq and its variants introduce another perspective for eliminating fruitless searches in triangle counting. First, CETC-Seq skips unnecessary edge searches based on a quick BFS operation that can be completed in $\mathcal{O}(m+n)$ time. By leveraging the direction-oriented technique, CETC-Seq achieves a further significant reduction in the fruitless searches during triangle counting. It is competitive with the fastest approaches and may be useful when the BFS preprocessing overhead is negligible. CETC-Seq-S and its variants further optimize the performance with Hash, degree ordering, and the recursive method.

We assigned rank values to the algorithms in each test case and calculated the average rank of each algorithm. Ordered from best to worst average rank, the algorithms are FH, CETC-Seq-S, FHD, CETC-Seq, LA, F, EHD, TS, CETC-Seq-SR, CETC-Seq-SD, CETC-Seq-FE, EH, CETC-Seq-D, EMD, EBD, ET, WD, EB, EM, ETD, W, IR.

We can say the top 4 set intersection-based triangle counting algorithms include our novel CETC-Seq-S and CETC-Seq algorithms. The performance of CETC-Seq-S with an average rank of 2.80 is slightly worse than that of FH with an average rank of 2.0. The average rank of FHD is 4.6 and CETC-Seq is 6.4.
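The rank-averaging step can be sketched as follows. This Python helper is hypothetical: it assumes every algorithm was timed on every graph and breaks ties by input order, since the tie-handling rule is not stated in the text.

```python
def average_ranks(times_by_graph):
    """Map {graph: {algorithm: seconds}} to {algorithm: mean rank}, where 1 is fastest."""
    totals = {}
    for timings in times_by_graph.values():
        ordered = sorted(timings, key=timings.get)      # fastest first
        for position, algorithm in enumerate(ordered, start=1):
            totals[algorithm] = totals.get(algorithm, 0) + position
    graphs = len(times_by_graph)
    return {algorithm: total / graphs for algorithm, total in totals.items()}


# Three algorithms on two graphs, with times copied from Table 2.
sample = {
    "RMAT 17": {"F": 1.865376, "FH": 0.326685, "CETC-Seq-S": 0.325907},
    "wiki-Vote": {"F": 0.01816, "FH": 0.004964, "CETC-Seq-S": 0.005118},
}
sample_ranks = average_ranks(sample)
```

On this small excerpt FH and CETC-Seq-S each win one graph and place second on the other, so they tie, while F is third on both. The averages reported in the text are taken over all 24 graphs and all 22 algorithms.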

6.3.6. Influence of the $c$ Value on the Performance of the Novel Algorithm

Building upon the definition of our novel algorithm, its performance should be highly related to the covering ratio $c$.

A noteworthy trend emerges when evaluating the results, particularly for the RMAT graphs. Our findings reveal that the forward algorithm and its variants tend to perform the fastest. As the scale of the RMAT graph increases, the parameter $c$ decreases, indicating a more substantial removal of fruitless checks after BFS. Under these conditions, our novel method demonstrates greater efficiency compared to the F algorithms.

These observations validate our hypothesis that the performance of our new algorithm is significantly correlated with the covering ratio $c$. As $c$ decreases, performance improves.

Concurrently, an analysis of the performance on the road network graphs (roadNet-CA, roadNet-PA, roadNet-TX) reveals their divergence from the other graphs. Road networks, unlike social networks, often have only low-degree vertices (for instance, many degree-four vertices) and large diameters. Although the covering ratio of these road networks is under 15%, we see less benefit from the new approach than this low value of $c$ would suggest. So, a lower $c$ value does not always yield high performance.

6.4. Results and Analysis of Parallel Algorithms on Shared-Memory

The execution times of the parallel algorithms (in seconds) are presented in Table 3 for 32 threads and in Table 4 for 224 threads.

Table 3. Execution time (in seconds) for shared-memory parallel algorithms. (32 threads)

Graph WP WDP EMP EMDP EBP EBDP ETP ETDP EHP EHDP CETC-SM
RMAT 6 0.000449 0.000268 0.000226 0.000137 0.000251 0.000149 0.000236 0.000163 0.00018 0.000102 0.000143
RMAT 7 0.001223 0.000446 0.000218 0.000154 0.000298 0.000159 0.000284 0.000232 0.000153 0.000088 0.000112
RMAT 8 0.004417 0.001139 0.000413 0.000259 0.000607 0.000314 0.000627 0.000486 0.000368 0.000207 0.00033
RMAT 9 0.006898 0.002485 0.000661 0.000604 0.001263 0.000684 0.001419 0.001355 0.000492 0.000273 0.00043
RMAT 10 0.018563 0.006674 0.001527 0.001515 0.003311 0.001667 0.003505 0.003399 0.000902 0.000511 0.000935
RMAT 11 0.043502 0.020521 0.003952 0.003951 0.007708 0.0041 0.008889 0.008896 0.002429 0.001303 0.002447
RMAT 12 0.117014 0.049646 0.005842 0.005817 0.011434 0.005798 0.012514 0.011956 0.00309 0.001837 0.003849
RMAT 13 0.279936 0.107251 0.014796 0.013801 0.027184 0.013905 0.03255 0.028344 0.006466 0.003631 0.00932
RMAT 14 0.738807 0.294576 0.035303 0.034693 0.064022 0.031186 0.078634 0.068282 0.015326 0.008446 0.024843
RMAT 15 1.945767 0.883749 0.093881 0.090363 0.159664 0.080606 0.193024 0.167142 0.034382 0.017455 0.054523
RMAT 16 5.445053 2.616904 0.232325 0.225416 0.388562 0.190387 0.471949 0.405698 0.075692 0.038662 0.123288
RMAT 17 16.105178 7.952496 0.600006 0.570236 1.017046 0.495515 1.201604 1.01826 0.173762 0.085809 0.272745
karate 0.000047 0.000046 0.000045 0.000052 0.000046 0.000049 0.000047 0.000047 0.000046 0.000052 0.00005
amazon0302 0.008796 0.022824 0.014149 0.036935 0.020339 0.039651 0.018675 0.019454 0.011521 0.029648 0.027349
amazon0312 0.040122 0.072721 0.039475 0.115382 0.058515 0.125074 0.069724 0.086947 0.045718 0.099768 0.095591
amazon0505 0.047963 0.084478 0.045625 0.14529 0.0728 0.147519 0.082892 0.090064 0.046591 0.10615 0.09897
amazon0601 0.048284 0.083188 0.05004 0.134631 0.072389 0.149232 0.082411 0.091988 0.049378 0.107361 0.101416
loc-Brightkite 0.026786 0.018883 0.01209 0.023231 0.006994 0.012797 0.008975 0.008192 0.004159 0.005755 0.005673
loc-Gowalla 1.720423 0.544702 0.509203 0.076514 0.048047 0.986971 0.991823 0.111531 0.072055 0.027526 0.027009
roadNet-CA 0.006707 0.099629 0.041482 0.090839 0.054792 0.11503 0.063274 0.095634 0.063806 0.06235 0.059879
roadNet-PA 0.002327 0.045298 0.04421 0.051884 0.027293 0.050139 0.031387 0.046944 0.034754 0.033182 0.02988
roadNet-TX 0.00303 0.061665 0.043471 0.07115 0.040102 0.080569 0.04338 0.060169 0.041411 0.042741 0.040724
soc-Epinions1 0.188828 0.029691 0.024934 0.044974 0.02401 0.065555 0.052619 0.02325 0.011699 0.018498 0.019213
wiki-Vote 0.020278 0.00745 0.004677 0.013179 0.007166 0.018188 0.011463 0.006211 0.003359 0.007699 0.007689

Table 4. Execution time (in seconds) for shared-memory parallel algorithms. (224 threads)

Graph WP WDP EMP EMDP EBP EBDP ETP ETDP EHP EHDP CETC-SM
RMAT 6 0.00262 0.001665 0.00159 0.002007 0.001779 0.002791 0.001626 0.002978 0.002411 0.001657 0.001823
RMAT 7 0.005946 0.0012 0.001177 0.000863 0.001092 0.001765 0.001126 0.000937 0.000799 0.000756 0.000913
RMAT 8 0.01391 0.004556 0.002609 0.002877 0.002534 0.002047 0.003943 0.002736 0.002399 0.002099 0.002797
RMAT 9 0.032362 0.008507 0.003102 0.002114 0.002848 0.004165 0.005741 0.00493 0.002538 0.002196 0.002278
RMAT 10 0.096783 0.023608 0.006077 0.004446 0.006682 0.007183 0.009634 0.009532 0.002827 0.002962 0.00238
RMAT 11 0.244285 0.034017 0.005143 0.005314 0.008019 0.006563 0.012168 0.011372 0.002077 0.001958 0.002647
RMAT 12 0.603094 0.054447 0.007437 0.007343 0.005859 0.00551 0.019386 0.019021 0.002241 0.001818 0.003157
RMAT 13 1.767242 0.191662 0.017554 0.017532 0.013061 0.012225 0.048984 0.048837 0.004743 0.003995 0.005822
RMAT 14 5.805617 0.6103 0.043436 0.043317 0.032064 0.025181 0.119665 0.124235 0.008927 0.00792 0.011054
RMAT 15 19.502675 1.075846 0.1124 0.112662 0.06781 0.050401 0.284255 0.274718 0.021382 0.01982 0.026385
RMAT 16 4.564897 2.724765 0.27826 0.273954 0.125003 0.10814 0.566664 0.548104 0.052475 0.046825 0.046742
RMAT 17 14.404249 8.292626 0.636928 0.623498 0.359324 0.209283 1.255369 1.193338 0.126567 0.11914 0.10871
karate 0.000246 0.001053 0.001054 0.00139 0.001052 0.001067 0.001843 0.002898 0.002982 0.001061 0.001648
amazon0302 0.028943 0.007874 0.016644 0.00981 0.011362 0.006467 0.013704 0.007236 0.035765 0.024087 0.039814
amazon0312 0.10007 0.046193 0.029063 0.020546 0.030613 0.016588 0.067293 0.053589 0.065335 0.038084 0.084911
amazon0505 0.134932 0.043264 0.03243 0.021458 0.032284 0.017594 0.070966 0.055361 0.071815 0.039119 0.090892
amazon0601 0.111001 0.050005 0.031588 0.02097 0.032657 0.017569 0.071926 0.053841 0.063443 0.035649 0.092445
loc-Brightkite 0.298194 0.037024 0.007619 0.005149 0.005186 0.002977 0.009518 0.009214 0.008017 0.003441 0.005172
loc-Gowalla 4.642555 1.850222 0.551523 0.597855 0.042251 0.035543 0.99552 1.153058 0.142559 0.135726 0.02852
roadNet-CA 0.018777 0.008275 0.034299 0.018575 0.029132 0.016644 0.029072 0.016026 0.033069 0.021434 0.094693
roadNet-PA 0.012264 0.005497 0.02696 0.015148 0.016778 0.013736 0.016576 0.010023 0.020221 0.012846 0.048229
roadNet-TX 0.013831 0.005999 0.026138 0.01613 0.024868 0.011356 0.019374 0.011121 0.024195 0.014831 0.058367
soc-Epinions1 2.622935 0.322065 0.032298 0.029065 0.022567 0.01809 0.087574 0.08822 0.014144 0.00968 0.010896
wiki-Vote 0.527084 0.028837 0.005353 0.00425 0.007596 0.005707 0.0129 0.008829 0.002952 0.001597 0.003614

6.4.1. Performance Utilizing 32 Threads

While F and its variants excel as sequential algorithms, they are inherently sequential and cannot be parallelized. In this section, we focus on algorithms conducive to parallelization to showcase the speedups achieved with parallel methods. Fig. 7 illustrates the speedups of various parallel algorithms compared to their corresponding sequential counterparts, employing 32 threads.

The average speedups are as follows: WP, $10.5\times$; WDP, $7.5\times$; EMP, $13.6\times$; EMDP, $8.3\times$; EBP, $23.3\times$; EBDP, $19.3\times$; ETP, $16.2\times$; ETDP, $10.8\times$; EHP, $6.6\times$; EHDP, $5.0\times$; and CETC-SM, $3.9\times$. These results confirm that parallelization significantly improves performance.

However, certain scenarios highlight limitations. For instance, in the case of the small-sized graph “karate” (Graph ID=13), all parallel algorithms fail to exhibit performance improvements. This can be attributed to the inherent overhead of the OpenMP parallel method, which outweighs the benefits for very small graphs. A similar pattern is observed for the graph RMAT 6 (Graph ID=1), where three parallel methods—EHP, EHDP, and CETC-SM—show no performance improvement. As previously mentioned, the baseline algorithms EH, EHD, and CETC-Seq have already demonstrated high performance, and the parallel overhead for small graphs nullifies the potential benefits of parallelization.

Figure 7. The speedups of parallel optimization methods compared with their sequential counterparts using 32 threads.

6.4.2. Performance Utilizing All System Threads

When we harness the full parallel processing capacity of our experimental system, we can execute our OpenMP parallel programs with 224 threads. Fig. 8 shows the execution time percentages of the algorithms when all 224 system threads are employed. We assigned a rank to each algorithm on each test case and calculated its average rank. Ordered from best to worst average rank, the algorithms are EHDP, EBDP, EMDP, EHP, EBP, CETC-SM, EMP, ETDP, WDP, ETP, and WP.

The presented results highlight that not only can hash and binary search deliver commendable parallel performance by minimizing operations per parallel thread but also the application of degree ordering proves effective in improving the performance of individual threads.
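The average-rank summary described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name `average_ranks` is hypothetical, only two rows of Table 4 are used, and ties are not treated specially (the sample rows contain none), so the resulting order need not match the order obtained from all 24 graphs.

```python
def average_ranks(times_by_graph):
    """Rank algorithms on each graph (1 = fastest) and average the ranks.

    times_by_graph maps a graph name to a dict {algorithm: seconds}.
    """
    totals = {}
    for times in times_by_graph.values():
        ordered = sorted(times, key=times.get)  # fastest algorithm first
        for rank, alg in enumerate(ordered, start=1):
            totals[alg] = totals.get(alg, 0) + rank
    n_graphs = len(times_by_graph)
    return {alg: total / n_graphs for alg, total in totals.items()}


ALGS = ["WP", "WDP", "EMP", "EMDP", "EBP", "EBDP",
        "ETP", "ETDP", "EHP", "EHDP", "CETC-SM"]

# Two rows copied from Table 4 (224 threads); a subset chosen for illustration.
SAMPLE = {
    "RMAT 17": dict(zip(ALGS, [14.404249, 8.292626, 0.636928, 0.623498,
                               0.359324, 0.209283, 1.255369, 1.193338,
                               0.126567, 0.11914, 0.10871])),
    "wiki-Vote": dict(zip(ALGS, [0.527084, 0.028837, 0.005353, 0.00425,
                                 0.007596, 0.005707, 0.0129, 0.008829,
                                 0.002952, 0.001597, 0.003614])),
}

avg = average_ranks(SAMPLE)
for alg in sorted(avg, key=avg.get):
    print(f"{alg:8s} {avg[alg]:.1f}")
```

On these two rows EHDP has the best average rank and WP the worst, which agrees with the two ends of the full ordering reported above.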

Figure 8. The execution time percentage of different parallel triangle counting algorithms with all 224 parallel threads/cores.

6.4.3. Scalable Performance

This subsection delves into the performance of these algorithms in response to varying thread counts. We use RMAT 15 as an illustrative example of a synthetic graph and Amazon0312 as a representative instance of a real graph. By progressively increasing the number of threads to 2, 4, 8, 16, 32, 64, 128, and 224, we seek to identify changes in speedup corresponding to increasing thread counts.

Fig. 9 illustrates the change in speedup with the increasing number of threads on RMAT 15. For most algorithms, a bottleneck emerges starting from 64 threads, with no discernible speedup observed as the thread count continues to increase. Notably, the WP algorithm exhibits a degradation in performance as additional parallel threads are added. The only algorithm demonstrating notable scalability is EBP, which improves consistently with the increasing number of threads. A similar observation is made for the real graph (see Fig. 10), where most algorithms encounter a bottleneck at 64 threads. However, EBP and EBDP exhibit good scalability, indicating that binary search-based methods scale better than the other approaches.

Figure 9. Speedups of various algorithms on RMAT15 compared with a single-thread setup
Figure 10. Speedups of various algorithms on Amazon0312 compared with a single-thread setup

6.4.4. Best Performance on Different Graphs

In this section, we use EM as the performance baseline to evaluate the best speedup achieved by different algorithms. The results are summarized in Table 5. The number following a specific algorithm name indicates how many parallel threads are used.

Integrated optimization methods demonstrate a substantial speedup over the baseline, averaging $75.8\times$ across the 24 graphs. Examining various algorithms on different graphs unveils insights into optimization methods.
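As a quick check of how the per-graph and average speedups relate, the following sketch recomputes them from the values printed in Table 5. The helper name `speedup` is an assumption introduced here; the numbers are copied from the table.

```python
from statistics import mean

# Speedup row of Table 5, graph IDs 1 through 24.
SPEEDUPS = [21.7, 26.0, 26.4, 67.5, 53.1, 62.7, 99.3, 125.8,
            147.6, 174.8, 224.4, 233.7, 6.0, 22.2, 43.5, 43.0,
            43.4, 40.9, 104.0, 35.8, 45.2, 23.4, 69.8, 79.9]


def speedup(baseline_seconds, best_seconds):
    """Speedup of the best configuration relative to the EM baseline."""
    return baseline_seconds / best_seconds


average_speedup = mean(SPEEDUPS)
print(round(average_speedup, 1))                  # arithmetic mean over 24 graphs
print(round(speedup(18.3285120, 0.078430), 1))    # graph 12 (RMAT 17)
```

The mean is an unweighted arithmetic mean over graphs, so the large RMAT instances with speedups above $200\times$ contribute strongly to it.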

Firstly, for small graphs like RMAT 6 (Graph ID=1), RMAT 7 (Graph ID=2), and karate (Graph ID=13), parallel optimization techniques fail to outperform the sequential FH and linear algebra LA methods. Practical performance considerations suggest that employing multiple parallel threads might introduce overhead for small graphs, making sequential methods more efficient.

Secondly, as graph size increases, optimal performance often requires more parallel resources. However, beyond a critical point, additional parallel resources may decrease performance. For example, RMAT 8 (Graph ID=3) and roadNet-TX (Graph ID=22) achieve peak performance with 32 threads. In contrast, RMAT 9 to RMAT 10 (Graph ID 4-5), RMAT 12 (Graph ID=7), RMAT 14 to RMAT 17 (Graph ID 9-12), loc-Brightkite (Graph ID=18), and roadNet-PA (Graph ID=21) require 64 threads. Certain graphs, such as RMAT 11 (Graph ID=6), RMAT 13 (Graph ID=8), loc-Gowalla (Graph ID=19), roadNet-CA (Graph ID=20), soc-Epinions1 (Graph ID=23), and wiki-Vote (Graph ID=24), demand 128 threads. Larger graphs like amazon0302, amazon0312, amazon0505, and amazon0601 (Graph ID 14-17) leverage the full system parallel resources (224 threads). Notably, graph size alone does not determine parallel resource needs, as topology plays a crucial role in parallel performance. The parallel algorithm that achieves the best performance also varies from graph to graph: EHDP is the best on 13 graphs, EBDP on four, WDP on three, and CETC-SM on one.

Thirdly, the sequential and parallel optimizations needed for better performance can differ. For instance, WD might not be ideal in a sequential scenario because it checks numerous wedges, many of which are not fruitful for sparse graphs. However, in a parallel scenario, WDP64 excels with 64 threads on roadNet-PA (Graph ID=21), surpassing other algorithms. The efficiency arises from the smaller number of wedges when vertex degrees are low, coupled with the DO optimization method, which reduces fruitless searches. Another case is EBD, which may not be favorable in sequential algorithms because it performs more total operations than EMD. However, in parallel algorithms, EBDP could outperform MergePath by distributing work more efficiently through parallel binary searches.

In conclusion, our results highlight that different algorithms find their optimal scenarios based on specific graph topology and hardware configurations. Graph topology and available hardware resources are pivotal factors in selecting the most efficient triangle counting algorithm.
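The tallies quoted in the second observation (which algorithm wins how often, and at which thread count) can be recomputed from the Algorithm row of Table 5. The sketch below assumes that a trailing integer in a label such as `EHDP64` is the thread count and that labels without one (`FH`, `LA`) are sequential; the helper `split_label` is hypothetical.

```python
import re
from collections import Counter

# Algorithm row of Table 5, graph IDs 1 through 24.
BEST = ["FH", "FH", "EHDP32", "EHDP64", "EHDP64", "EHDP128",
        "EHDP64", "EHDP128", "EHDP64", "EHDP64", "EHDP64", "EHDP64",
        "LA", "EBDP224", "EBDP224", "EBDP224", "EBDP224", "EHDP64",
        "CETC-SM128", "WDP128", "WDP64", "WDP32", "EHDP128", "EHDP128"]


def split_label(label):
    """Split 'EHDP64' into ('EHDP', 64); sequential labels give (name, None)."""
    match = re.fullmatch(r"(.*?)(\d*)", label)
    name, threads = match.group(1), match.group(2)
    return name, (int(threads) if threads else None)


parsed = [split_label(label) for label in BEST]
wins = Counter(name for name, _ in parsed)
threads = Counter(t for _, t in parsed if t is not None)

print(wins.most_common())
print(sorted(threads.items()))
```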

Table 5. Best Performance and Algorithms for Different Graphs (times in seconds).

Graph | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Baseline Time | 0.0006080 | 0.0019730 | 0.0054550 | 0.0146430 | 0.0203420 | 0.0537350 | 0.1411760 | 0.3726090 | 0.9877480 | 2.6268370 | 6.9313980 | 18.3285120
Best Time | 0.0000280 | 0.0000760 | 0.0002070 | 0.0002170 | 0.0003830 | 0.0008570 | 0.0014210 | 0.0029630 | 0.0066920 | 0.0150250 | 0.030892 | 0.078430
Algorithm | FH | FH | EHDP32 | EHDP64 | EHDP64 | EHDP128 | EHDP64 | EHDP128 | EHDP64 | EHDP64 | EHDP64 | EHDP64
Speedup | 21.7 | 26.0 | 26.4 | 67.5 | 53.1 | 62.7 | 99.3 | 125.8 | 147.6 | 174.8 | 224.4 | 233.7

Graph | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24
Baseline Time | 0.0000120 | 0.1434740 | 0.7208080 | 0.7557200 | 0.7621240 | 0.1158030 | 2.0830490 | 0.1027660 | 0.0755300 | 0.0708210 | 0.6153270 | 0.1243600
Best Time | 0.0000020 | 0.0064670 | 0.0165880 | 0.0175940 | 0.0175690 | 0.0028280 | 0.0200290 | 0.002870 | 0.001671 | 0.003030 | 0.0088180 | 0.0015570
Algorithm | LA | EBDP224 | EBDP224 | EBDP224 | EBDP224 | EHDP64 | CETC-SM128 | WDP128 | WDP64 | WDP32 | EHDP128 | EHDP128
Speedup | 6.0 | 22.2 | 43.5 | 43.0 | 43.4 | 40.9 | 104.0 | 35.8 | 45.2 | 23.4 | 69.8 | 79.9

6.5. Communication Analysis of CETC-DM

Table 6. Communication costs of CETC-DM for real and synthetic graphs. The synthetic graphs are Graph500 RMAT graphs of scale 36 and 42. The column ‘Previous’ gives the communication volume of the best prior parallel algorithms (Dolev et al., 2012; Pearce and Sanders, 2018; Sanders and Uhl, 2023), which are wedge-checking based, and ‘This paper’ gives the communication cost of our new approach CETC-DM. ‘Reduction’ is the ratio between these two and thus the expected speedup of the parallel algorithm. Entries in italics are estimated values.

Graph | n | m | # Triangles | # Wedges | c | p | Previous | This paper | Reduction
ca-GrQc | 5242 | 14484 | 48260 | 165798 | 0.522 | 4 | 526KB | 122KB | 4.31
ca-HepTh | 9877 | 25973 | 28339 | 277389 | 0.423 | 4 | 948KB | 218KB | 4.35
as-caida20071105 | 26475 | 53381 | 36365 | 776895 | 0.225 | 4 | 2.78MB | 401KB | 7.10
facebook_combined | 4039 | 88234 | 1612010 | 17051688 | 0.914 | 4 | 48.8MB | 893KB | 56.0
ca-CondMat | 23133 | 93439 | 173361 | 1567373 | 0.511 | 4 | 5.61MB | 897KB | 6.40
ca-HepPh | 12008 | 118489 | 3358499 | 5081984 | 0.621 | 4 | 17.0MB | 1.13MB | 15.1
email-Enron | 36692 | 183831 | 727044 | 5933045 | 0.478 | 4 | 22.6MB | 1.79MB | 12.7
ca-AstroPh | 18772 | 198050 | 1351441 | 8451765 | 0.667 | 4 | 30.2MB | 2.08MB | 14.6
loc-brightkite_edges | 58228 | 214078 | 494728 | 6956250 | 0.441 | 4 | 26.5MB | 2.02MB | 20.4
soc-Epinions1 | 75879 | 405740 | 1624481 | 21377935 | 0.498 | 4 | 86.7MB | 4.25MB | 10.7
amazon0601 | 403394 | 2443408 | 3986507 | 96348699 | 0.529 | 8 | 436MB | 40.9MB | 10.7
com-Youtube | 1134890 | 2987624 | 3056386 | 209811585 | 0.347 | 8 | 1.03GB | 44.3MB | 23.7
RMAT-36 | 68719476736 | 1099511627776 | 1.2E+14 | 2.73E+16 | 0.311 | 128 | 218PB | 192TB | 1156
RMAT-42 | 4398046511104 | 70368744177664 | 1.3E+16 | 5.79E+18 | 0.260 | 256 | 52.8EB | 22.8PB | 2368

In this section, we analyze the performance of the parallel triangle counting algorithm CETC-DM (Alg. 15) on both real and synthetic graphs. We implemented our new triangle counting algorithm in Python to compute the exact communication volume and to determine an analytic model based on the size of the graph, the number of processors, and the covering ratio ($c$) from the BFS. The results given in Table 6 are exact communication volumes from our new algorithm on all of the graphs except the two large RMAT graphs, where we compute the communication volume from the validated analytic model. For the comparison with prior approaches (Dolev et al., 2012; Pearce and Sanders, 2018; Sanders and Uhl, 2023), we estimate the communication volume from the number of wedges, which is exact for all graphs other than the two large RMAT graphs, where we estimate the number of wedges using graph theory.

For the real graphs, we find the actual value of $c$, the percentage of graph edges that are cover-edges, for an arbitrary breadth-first search, and set the number $p$ of processors to a reasonable number given the size of the graph. For the synthetic graphs, we use large Graph500 RMAT graphs (Chakrabarti et al., 2004) with parameters $a=0.57$, $b=0.19$, $c=0.19$, and $d=0.05$, for scale 36 and 42 with $n=2^{\text{scale}}$ and $m=16n$, similar to the IARPA AGILE benchmark graphs, and set $p$ according to estimates of potential system sizes with sufficient memory to hold these large instances.

For comparison, most prior parallel algorithms for triangle counting operate on the graph as follows. A parallel loop over the vertices $v \in V$ produces all 2-paths (wedges) where $(v,v_1),(v,v_2) \in E$ and (w.l.o.g.) $v_1 < v_2$. The processor that produces this wedge will send an open wedge query message containing the vertex IDs of $v_1$ and $v_2$ to the processor that owns vertex $v_1$. If the consumer processor that receives this query message finds an edge $(v_1,v_2) \in E$, then a local triangle counter is incremented. After producers and consumers complete all work, a global reduction over the $p$ triangle counts computes the total number of triangles in $G$.
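The wedge-checking scheme just described can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the implementation of any of the cited systems: the function name, the integer vertex IDs $0..n-1$, the cyclic owner rule `v % p`, and the final division by three (when every vertex enumerates all of its wedges, each triangle closes one wedge at each of its three vertices) are all introduced here for illustration.

```python
from itertools import combinations
from math import ceil, log2


def wedge_query_baseline(adj, p):
    """Simulate wedge-checking triangle counting on an undirected graph.

    adj: dict mapping each vertex (int in 0..n-1) to the set of its neighbors.
    p:   number of simulated processors.
    Returns (wedges, triangles, remote_bits), where remote_bits counts the
    bits of query messages that cross processor boundaries.
    """
    n = len(adj)
    id_bits = max(1, ceil(log2(n)))       # bits needed for one vertex ID

    def owner(v):
        return v % p                      # hypothetical cyclic vertex ownership

    wedges = closed = remote_bits = 0
    for v in adj:                         # the "parallel loop" over vertices
        for v1, v2 in combinations(sorted(adj[v]), 2):   # guarantees v1 < v2
            wedges += 1
            if owner(v) != owner(v1):     # query (v1, v2) leaves the producer
                remote_bits += 2 * id_bits
            if v2 in adj[v1]:             # consumer-side edge lookup
                closed += 1
    return wedges, closed // 3, remote_bits


# Complete graph on four vertices: 12 wedges, 4 triangles.
K4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(wedge_query_baseline(K4, p=2))
```

The point of the sketch is that the message count grows with the number of wedges rather than the number of edges, which is the quantity the ‘Previous’ column of Table 6 is derived from.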

6.5.1. Graph500 RMAT Graphs

Figure 11. Estimate of $c$ using an exponential model, based on observations of $c$ for RMAT graphs of scale 6 to 23.

For the large Graph500 RMAT graphs, the number of triangles is estimated from our model based on the number of triangles found in RMAT graphs up to scale 29 in the literature (Hoang et al., 2019; Giechaskiel et al., 2015; Chakrabarti et al., 2004; Burkhardt, 2017). The fitting equation is $\text{\#Triangles} = 77.422\, n^{1.125}$ with $R^2 = 1.0$, where $n$ is the total number of vertices. The numbers of triangles estimated for scale 36 and 42 RMAT graphs are $1.20 \times 10^{14}$ and $1.30 \times 10^{16}$, respectively.
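Evaluating the fitted power law at the two target scales is a one-line computation; the sketch below uses the coefficients quoted above, with `estimated_triangles` as a hypothetical helper name.

```python
def estimated_triangles(scale):
    """Evaluate #Triangles = 77.422 * n**1.125 for n = 2**scale vertices."""
    n = 2 ** scale
    return 77.422 * n ** 1.125


for scale in (36, 42):
    print(scale, f"{estimated_triangles(scale):.2e}")
```

With these coefficients the two values print as `1.20e+14` and `1.30e+16`, matching the estimates stated in the text.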

We estimate the number of wedges for the scale 36 and 42 Graph500 RMAT graphs based on the theorem given by Seshadhri et al. (Seshadhri et al., 2011). According to their formula, we can estimate the expected number of vertices $N(d)$ for a given out-degree $d$. The number of wedges that can be formed by vertices with such a degree is calculated as $\binom{d}{2} \times N(d)$, where $\binom{d}{2}$ is the number of ways to choose two from $d$.

By summing all such wedges generated from the minimum degree ($e \ln n$) to the maximum degree ($\sqrt{n}$), which is the degree range assumed by the formula, we can approximate the total number of wedges in the given graph, where $n$ is the total number of vertices. This is a conservative estimate because it only considers the out-degree instead of the sum of out- and in-degrees. Employing the formula, we calculate the number of wedges to be $2.73 \times 10^{16}$ for scale 36 and $5.8 \times 10^{18}$ for scale 42. With $2 \log n$ bits/wedge, the total volume of wedge checks is 218PB and 52.8EB for RMAT graphs of scales 36 and 42, respectively. (Throughout this paper, a petabyte (PB) is $2^{50}$ bytes and an exabyte (EB) is $2^{60}$ bytes.)
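The conversion from a wedge count to a communication volume follows directly from the $2 \log n$ bits per wedge and the binary unit definitions above. The sketch below reproduces that arithmetic; `wedge_check_bytes` is a hypothetical helper name, and $\log n$ is taken as the scale because $n = 2^{\text{scale}}$.

```python
PB = 2 ** 50   # petabyte as defined in the paper
EB = 2 ** 60   # exabyte as defined in the paper


def wedge_check_bytes(num_wedges, scale):
    """Bytes sent if every wedge triggers one query of 2*log2(n) bits."""
    bits_per_wedge = 2 * scale          # two vertex IDs, log2(n) = scale bits each
    return num_wedges * bits_per_wedge / 8


print(round(wedge_check_bytes(2.73e16, 36) / PB))      # scale 36, in PB
print(round(wedge_check_bytes(5.8e18, 42) / EB, 1))    # scale 42, in EB
```

These evaluate to 218 (PB) and 52.8 (EB), the values listed in the ‘Previous’ column of Table 6.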

Beamer et al. (Beamer et al., 2011) find that a typical BFS on a scale 27 Graph500 RMAT graph has 7 levels, so 4 bits is a reasonable estimate for $\log D$ in our analyses of scale 36 and 42 graphs.

The methodology for estimating the value of the covering ratio $c$ for RMAT graphs is as follows. RMAT graphs from scale 6 to 23 are generated, and the exact value of $c$ is determined for each by counting the horizontal-edges after a breadth-first search. The data fit an exponential model $c = 1.1773\, e^{-0.036 \cdot \text{scale}}$ with a very high $R^2 = 0.9956$ (see Fig. 11). For scale 36, $c$ is estimated to be 0.311, and for scale 42, $c$ is estimated to be 0.260.

In our new distributed-memory approach CETC-DM, the communication cost is $m \cdot (\lceil \log D \rceil + (kp+3)\lceil \log n \rceil) + (p-1)\lceil \log n \rceil$ bits. For scale 36, with $\lceil \log D \rceil = 4$ and assuming $p=128$ processors, we have a total communication volume of 192TB, for a communication reduction of $1156\times$. For scale 42, assuming $p=256$ processors, we estimate the communication of our new distributed-memory triangle counting algorithm CETC-DM as 22.8PB, for a communication reduction of $2368\times$.
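The cost expression can be evaluated numerically as a rough check. This section does not define $k$; the sketch below assumes that $k$ plays the role of the covering ratio $c$ (0.311 and 0.260 for the two scales), which is an assumption made here and not something the text states. The function name `cetc_dm_bits` is likewise hypothetical.

```python
from math import ceil, log2

TB = 2 ** 40
PB = 2 ** 50


def cetc_dm_bits(n, m, p, k, diameter_bits):
    """m*(ceil(log D) + (k*p + 3)*ceil(log n)) + (p - 1)*ceil(log n), in bits.

    k is ASSUMED here to be the covering ratio c; diameter_bits is ceil(log D).
    """
    log_n = ceil(log2(n))
    return m * (diameter_bits + (k * p + 3) * log_n) + (p - 1) * log_n


# Graph500 RMAT instances with n = 2**scale and m = 16 n, as in the text.
tb36 = cetc_dm_bits(2 ** 36, 16 * 2 ** 36, p=128, k=0.311, diameter_bits=4) / 8 / TB
pb42 = cetc_dm_bits(2 ** 42, 16 * 2 ** 42, p=256, k=0.260, diameter_bits=4) / 8 / PB

print(f"scale 36: {tb36:.1f} TB")
print(f"scale 42: {pb42:.2f} PB")
```

Under this assumption the evaluation gives roughly 193 TB and 22.9 PB, close to the 192TB and 22.8PB reported above; the small differences may come from rounding of $c$ or from a different reading of $k$.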

7. Conclusion

In this paper, we design and implement CETC, a novel, fast triangle counting algorithm that uses a new technique, the cover-edge set, to improve performance. It is the first algorithm in decades to shine a new light on triangle counting: it uses the wholly new cover-edge method to reduce the work of set intersections, rather than being another variant of the well-known vertex-iterator and edge-iterator methods. We provide extensive performance results for sequential triangle counting algorithms on sparse graphs in a uniform manner. Furthermore, we employ OpenMP to parallelize most of the sequential algorithms we implemented and investigate their performance. The results use Intel’s latest processor family, the Intel Sapphire Rapids (Platinum 8480+), launched in the first quarter of 2023. The new triangle counting algorithm benefits when the results of a BFS are already available, which is often the case in network science. Additionally, we expect this work to inspire interest within the Graph Challenge community in implementing versions of the presented algorithms for large shared-memory, distributed-memory, GPU, or multi-GPU frameworks.

8. Reproducibility

The triangle counting source code is open source and available on GitHub at https://github.com/Bader-Research/triangle-counting. The input graphs are from the Stanford Network Analysis Project (SNAP) available from http://snap.stanford.edu/.

Acknowledgements.
This research was funded in part by NSF grant number CCF-2109988.

References

  • Alman and Williams (2021) Josh Alman and Virginia Vassilevska Williams. 2021. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA). SIAM, 522–539.
  • Alon et al. (1997) Noga Alon, Raphael Yuster, and Uri Zwick. 1997. Finding and counting given length cycles. Algorithmica 17, 3 (1997), 209–223.
  • Arifuzzaman et al. (2019) Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. 2019. Fast Parallel Algorithms for Counting and Listing Triangles in Big Graphs. ACM Trans. Knowl. Discov. Data 14, 1, Article 5 (Dec 2019), 34 pages. https://doi.org/10.1145/3365676
  • Bader (2023) David A. Bader. 2023. Fast Triangle Counting. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC58863.2023.10363539
  • Bader et al. (2023) David A. Bader, Fuhuan Li, Anya Ganeshan, Ahmet Gundogdu, Jason Lew, Oliver Alvarado Rodriguez, and Zhihui Du. 2023. Triangle Counting Through Cover-Edges. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC58863.2023.10363465
  • Beamer et al. (2011) Scott Beamer, Krste Asanovic, and David Patterson. 2011. Searching for a parent instead of fighting over children: A fast breadth-first search implementation for Graph500. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2011-117 (2011).
  • Burkhardt (2017) Paul Burkhardt. 2017. Graphing trillions of triangles. Information Visualization 16, 3 (2017), 157–166.
  • Burkhardt (2021) Paul Burkhardt. 2021. Triangle Centrality. CoRR abs/2105.00110 (2021). arXiv:2105.00110 https://arxiv.org/abs/2105.00110
  • Chakrabarti et al. (2004) Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 442–446.
  • Chiba and Nishizeki (1985) Norishige Chiba and Takao Nishizeki. 1985. Arboricity and Subgraph Listing Algorithms. SIAM J. Comput. 14, 1 (1985), 210–223. https://doi.org/10.1137/0214017
  • Cohen (2008) Jonathan Cohen. 2008. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report 16, 3.1 (2008).
  • Cohen (2009) Jonathan Cohen. 2009. Graph twiddling in a MapReduce world. Computing in Science & Engineering 11, 4 (2009), 29–41.
  • Cormen et al. (2022) Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2022. Introduction to algorithms. MIT press.
  • Davis (2018) Timothy A. Davis. 2018. Graph algorithms via SuiteSparse: GraphBLAS: triangle counting and K-truss. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC.2018.8547538
  • Dolev et al. (2012) Danny Dolev, Christoph Lenzen, and Shir Peled. 2012. ’Tri, tri again’: finding triangles and small subgraphs in a distributed setting. In International Symposium on Distributed Computing. Springer, 195–209.
  • Ghosh and Halappanavar (2020) Sayan Ghosh and Mahantesh Halappanavar. 2020. TriC: Distributed-memory Triangle Counting by Exploiting the Graph Structure. In IEEE High Performance Extreme Computing Conference (HPEC). 1–6.
  • Giechaskiel et al. (2015) Ilias Giechaskiel, George Panagopoulos, and Eiko Yoneki. 2015. PDTL: Parallel and distributed triangle listing for massive graphs. In 2015 44th International Conference on Parallel Processing. IEEE, 370–379.
  • Graph 500 Steering Committee (2010) Graph 500 Steering Committee. 2010. The Graph500 benchmark. https://www.graph500.org
  • Harrod (2020) William Harrod. 2020. Advanced Graphical Intelligence Logical Computing Environment (AGILE). https://www.iarpa.gov/images/PropsersDayPDFs/AGILE/AGILE_Proposers_Day_sm.pdf. Accessed: 2022-05-24.
  • Hoang et al. (2019) Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. 2019. DistTC: High performance distributed triangle counting. In IEEE High Performance Extreme Computing Conference (HPEC). 1–7.
  • Hu et al. (2021) Lin Hu, Lei Zou, and Yu Liu. 2021. Accelerating triangle counting on GPU. In Proceedings of the 2021 International Conference on Management of Data. 736–748.
  • Hu et al. (2018) Yang Hu, Hang Liu, and H Howie Huang. 2018. TriCore: Parallel triangle counting on GPUs. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 171–182.
  • Itai and Rodeh (1978) Alon Itai and Michael Rodeh. 1978. Finding a minimum circuit in a graph. SIAM J. Comput. 7, 4 (1978), 413–423.
  • Latapy (2008) Matthieu Latapy. 2008. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoretical Computer Science 407, 1 (2008), 458–473. https://doi.org/10.1016/j.tcs.2008.07.017
  • Li and Bader (2021) Fuhuan Li and David A Bader. 2021. A graphblas implementation of triangle centrality. In 2021 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–2.
  • Low et al. (2017) Tze Meng Low, Varun Nagaraj Rao, Matthew Lee, Doru Popovici, Franz Franchetti, and Scott McMillan. 2017. First look: Linear algebra-based triangle counting without matrix multiplication. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). 1–6. https://doi.org/10.1109/HPEC.2017.8091046
  • Mailthody et al. (2018) Vikram S. Mailthody, Ketan Date, Zaid Qureshi, Carl Pearson, Rakesh Nagi, Jinjun Xiong, and Wen-mei Hwu. 2018. Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC.2018.8547517
  • Makkar et al. (2017) Devavret Makkar, David A. Bader, and Oded Green. 2017. Exact and Parallel Triangle Counting in Dynamic Graphs. In 24th IEEE International Conference on High Performance Computing, HiPC 2017, Jaipur, India, December 18-21, 2017. IEEE Computer Society, Los Alamitos, CA, 2–12. https://doi.org/10.1109/HiPC.2017.00011
  • Mowlaei (2017) Shahir Mowlaei. 2017. Triangle counting via vectorized set intersection. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). 1–5. https://doi.org/10.1109/HPEC.2017.8091053
  • Ortmann and Brandes (2014) Mark Ortmann and Ulrik Brandes. 2014. Triangle Listing Algorithms: Back from the Diversion. In Proceedings of the Meeting on Algorithm Engineering & Experiments (Portland, Oregon). Society for Industrial and Applied Mathematics, USA, 1–8.
  • Parimalarangan et al. (2017) Sindhuja Parimalarangan, George M Slota, and Kamesh Madduri. 2017. Fast parallel graph triad census and triangle counting on shared-memory platforms. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1500–1509.
  • Pearce (2017) Roger Pearce. 2017. Triangle counting for scale-free graphs at scale in distributed memory. In IEEE High Performance Extreme Computing Conference (HPEC). 1–4.
  • Pearce and Sanders (2018) Roger Pearce and Geoffrey Sanders. 2018. K-truss decomposition for scale-free graphs at scale in distributed memory. In 2018 IEEE high performance extreme computing conference (HPEC). IEEE, 1–6.
  • Samsi et al. (2018) Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. 2018. GraphChallenge.org: Raising the Bar on Graph Analytic Performance. In 2018 IEEE High Performance Extreme Computing Conference (HPEC). 1–7. https://doi.org/10.1109/HPEC.2018.8547527
  • Sanders and Uhl (2023) Peter Sanders and Tim Niklas Uhl. 2023. Engineering a Distributed-Memory Triangle Counting Algorithm. arXiv preprint arXiv:2302.11443 (2023).
  • Schank (2007) Thomas Schank. 2007. Algorithmic Aspects of Triangle-Based Network Analysis. Ph. D. Dissertation. Karlsruhe Institute of Technology.
  • Schank and Wagner (2005) Thomas Schank and Dorothea Wagner. 2005. Finding, Counting and Listing All Triangles in Large Graphs, an Experimental Study. In Proceedings of the 4th International Conference on Experimental and Efficient Algorithms (Santorini Island, Greece) (WEA’05). Springer-Verlag, Berlin, Heidelberg, 606–609. https://doi.org/10.1007/11427186_54
  • Seshadhri et al. (2011) C Seshadhri, Ali Pinar, and Tamara G Kolda. 2011. A hitchhiker’s guide to choosing parameters of stochastic Kronecker graphs. CoRR, abs/1102.5046 1 (2011).
  • Shun and Tangwongsan (2015) Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations without tuning. In 2015 IEEE 31st International Conference on Data Engineering. IEEE, 149–160.
  • Ullmann (1976) Julian R Ullmann. 1976. An algorithm for subgraph isomorphism. Journal of the ACM (JACM) 23, 1 (1976), 31–42.
  • Watts and Strogatz (1998) Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 6684 (1998), 440–442.
  • Williams et al. (2023) Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. 2023. New bounds for matrix multiplication: from alpha to omega. arXiv preprint arXiv:2307.07970 (2023).
  • Zeng et al. (2022) Li Zeng, Kang Yang, Haoran Cai, Jinhua Zhou, Rongqing Zhao, and Xin Chen. 2022. HTC: Hybrid vertex-parallel and edge-parallel Triangle Counting. In IEEE High Performance Extreme Computing Conference (HPEC). 1–7.