-
Parallelizing Maximal Clique Enumeration on GPUs
Authors:
Mohammad Almasri,
Yen-Hsiang Chang,
Izzat El Hajj,
Rakesh Nagi,
Jinjun Xiong,
Wen-mei Hwu
Abstract:
We present a GPU solution for exact maximal clique enumeration (MCE) that performs a search tree traversal following the Bron-Kerbosch algorithm. Prior works on parallelizing MCE on GPUs perform a breadth-first traversal of the tree, which has limited scalability because of the explosion in the number of tree nodes at deep levels. We propose to parallelize MCE on GPUs by performing depth-first traversal of independent subtrees in parallel. Since MCE suffers from high load imbalance and memory capacity requirements, we propose a worker list for dynamic load balancing, as well as partial induced subgraphs and a compact representation of excluded vertex sets to regulate memory consumption. Our evaluation shows that our GPU implementation on a single GPU outperforms the state-of-the-art parallel CPU implementation by a geometric mean of 4.9x (up to 16.7x), and scales efficiently to multiple GPUs. Our code has been open-sourced to enable further research on accelerating MCE.
Submitted 25 October, 2023; v1 submitted 2 December, 2022;
originally announced December 2022.
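For readers unfamiliar with the Bron-Kerbosch search tree mentioned in the abstract above, the following is a minimal sequential sketch of the textbook algorithm with pivoting. It is not the authors' GPU implementation: the worker list, partial induced subgraphs, and compact excluded-set representation are not shown, and the function name `maximal_cliques` and the adjacency-dictionary input format are assumptions made for illustration.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of an undirected graph.

    adj maps each vertex to the set of its neighbors (no self-loops).
    This input format is an assumption of this sketch.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: excluded vertices
        # (vertices already explored at this level of the search tree).
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex adjacent to the most candidates; only
        # non-neighbors of the pivot need to be branched on.
        pivot = max(p | x, key=lambda u: len(p & adj[u]))
        for v in list(p - adj[pivot]):
            # Each recursive call is the root of an independent subtree.
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques
```

Each recursive call roots a subtree that depends only on its own `r`, `p`, and `x` sets, which is the property that allows independent subtrees to be traversed depth-first in parallel.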
-
Limited-Trust in Diffusion of Competing Alternatives over Social Networks
Authors:
Vincent Leon,
S. Rasoul Etesami,
Rakesh Nagi
Abstract:
We consider the diffusion of two alternatives in social networks using a game-theoretic approach. Each individual plays a coordination game with its neighbors repeatedly and decides which alternative to adopt. As products are used in conjunction with others and through repeated interactions, individuals are more interested in their long-term benefits and tend to show trust to others to maximize their long-term utility by choosing a suboptimal option with respect to instantaneous payoff. To capture such trust behavior, we deploy limited-trust equilibrium (LTE) in the diffusion process. We analyze the convergence of emerging dynamics to equilibrium points using mean-field approximation and study the equilibrium state and the convergence rate of diffusion using absorption probability and expected absorption time of a reduced-size absorbing Markov chain. We also show that the diffusion model on LTE under the best-response strategy can be converted to the well-known linear threshold model. Simulation results show that when agents behave in a trustworthy manner, their long-term utility increases significantly compared to the case when they are solely self-interested. Moreover, the Markov chain analysis provides a good estimate of convergence properties over random networks.
Submitted 2 October, 2023; v1 submitted 13 June, 2022;
originally announced June 2022.
-
Parallel K-Clique Counting on GPUs
Authors:
Mohammad Almasri,
Izzat El Hajj,
Rakesh Nagi,
Jinjun Xiong,
Wen-mei Hwu
Abstract:
Counting k-cliques in a graph is an important problem in graph analysis with many applications such as community detection and graph partitioning. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. Parallelizing k-clique counting has been well-studied on CPUs and many solutions exist. However, there are no performant solutions for k-clique counting on GPUs.
Parallelizing k-clique counting on GPUs comes with numerous challenges such as the need for extracting fine-grain multi-level parallelism, sensitivity to load imbalance, and constrained physical memory capacity. While there has been work on related problems such as finding maximal cliques and generalized sub-graph matching on GPUs, k-clique counting in particular has yet to be explored in depth. In this paper, we present the first parallel GPU solution specialized for the k-clique counting problem. Our solution supports both graph orientation and pivoting for eliminating redundant clique discovery. It incorporates both vertex-centric and edge-centric parallelization schemes for distributing work across thread blocks, and further partitions work within each thread block to extract fine-grain multi-level parallelism while tolerating load imbalance. It also includes optimizations such as binary encoding of induced sub-graphs and sub-warp partitioning to limit memory consumption and improve the utilization of execution resources.
Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean of 12.39x, 6.21x, and 18.99x for k=4, 7, and 10, respectively. We also perform a detailed evaluation of the trade-offs involved in the choice of parallelization scheme, and the incremental speedup of each optimization to provide an in-depth understanding of the optimization space. ...
Submitted 6 June, 2022; v1 submitted 27 April, 2021;
originally announced April 2021.
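The role of graph orientation in eliminating redundant clique discovery, as described in the abstract above, can be sketched sequentially as follows. This is an illustration only, not the authors' GPU code: it shows degree-based orientation but omits pivoting, binary encoding, and any parallelization. The function name `count_k_cliques`, the adjacency-dictionary input, and the (degree, id) ranking rule are assumptions of this sketch.

```python
def count_k_cliques(adj, k):
    """Count k-cliques in an undirected graph given as {vertex: set(neighbors)}."""
    # Orient each edge from lower to higher (degree, id) rank so that every
    # clique is discovered exactly once, starting from its lowest-ranked vertex.
    order = sorted(adj, key=lambda u: (len(adj[u]), u))
    rank = {v: i for i, v in enumerate(order)}
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}

    def extend(candidates, remaining):
        # One level of the search tree: choose the next clique vertex and
        # keep only candidates that are out-neighbors of every chosen vertex.
        if remaining == 0:
            return 1
        total = 0
        for v in candidates:
            total += extend(candidates & out[v], remaining - 1)
        return total

    return extend(set(adj), k)
```

Starting one search tree per vertex, as here, corresponds to a vertex-centric decomposition; an edge-centric variant would instead start one tree per oriented edge.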
-
Limited-Trust in Social Network Games
Authors:
Timothy Murray,
Jugal Garg,
Rakesh Nagi
Abstract:
We consider agents in a social network competing to be selected as partners in collaborative, mutually beneficial activities. We study this through a model in which an agent i can initiate a limited number k_i>0 of games and selects the ideal partners from its one-hop neighborhood. On the flip side, it can accept any number of games offered by its neighbors. Each game signifies a productive joint economic activity, and players attempt to maximize their individual utilities. Unsurprisingly, more trustworthy agents are more desirable as partners. Trustworthiness is measured by the game-theoretic concept of Limited-Trust, which quantifies the maximum cost an agent is willing to incur in order to improve the net utility of all agents. Agents learn about their neighbors' trustworthiness through interactions, and their behaviors evolve in response. Empirical trials performed on realistic social networks show that when given the option, many agents become highly trustworthy; most or all become highly trustworthy when knowledge of their neighbors' trustworthiness is based on past interactions rather than known a priori. This trustworthiness is not the result of altruism; instead, agents are intrinsically motivated to become trustworthy partners by competition. Two insights are presented: first, trustworthy behavior drives an increase in the utility of all agents, where maintaining a relatively modest level of trustworthiness may easily improve net utility by as much as 14.5%. If only one agent exhibits modest trust among self-centered ones, it can increase its average utility by up to 25% in certain cases! Second, and counter-intuitively, when partnership opportunities are abundant, agents become less trustworthy.
Submitted 20 January, 2024; v1 submitted 1 March, 2021;
originally announced March 2021.
-
At-Scale Sparse Deep Neural Network Inference with Efficient GPU Implementation
Authors:
Mert Hidayetoglu,
Carl Pearson,
Vikram Sharma Mailthody,
Eiman Ebrahimi,
Jinjun Xiong,
Rakesh Nagi,
Wen-mei Hwu
Abstract:
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020. Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks. However, there is room for improvement in implementing SpDNN operations on GPUs. This work presents optimized sparse matrix multiplication kernels fused with the ReLU function. The optimized kernels reuse input feature maps from the shared memory and sparse weights from registers. For multi-GPU parallelism, our SpDNN implementation duplicates weights and statically partitions the feature maps across GPUs. Results for the challenge benchmarks show that the proposed kernel design and multi-GPU parallelization achieve up to 180 tera-edges per second inference throughput. These results are up to 4.3x faster for a single GPU and an order of magnitude faster at full scale than those of the champion of the 2019 Sparse Deep Neural Network Graph Challenge for the same generation of NVIDIA V100 GPUs. Using the same implementation, we also show that single-GPU throughput on NVIDIA A100 is 2.37x faster than on V100.
Submitted 2 September, 2020; v1 submitted 28 July, 2020;
originally announced July 2020.
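What "fused with the ReLU function" means can be shown with a small CPU sketch: the activation is applied as each output element is produced, so the pre-activation result is never stored. This is a simplified illustration, not the authors' kernel. It handles a single input vector rather than a batch of feature maps, assumes the weights are stored in CSR form, and ignores shared-memory and register reuse. The function name `spmv_relu` and its parameters are hypothetical.

```python
def spmv_relu(indptr, indices, data, x):
    """Compute ReLU(W @ x) for a CSR matrix W without storing W @ x.

    indptr, indices, data: standard CSR arrays for W (assumed layout).
    x: dense input vector as a list of floats.
    """
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        # Accumulate the dot product of this sparse row with x.
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        # Fusion: apply the activation immediately, so no separate pass
        # over an intermediate output vector is needed.
        y.append(acc if acc > 0.0 else 0.0)
    return y
```

For the 2x3 matrix with rows (1, 0, -2) and (0, 3, 0), the CSR arrays are `indptr=[0, 2, 3]`, `indices=[0, 2, 1]`, `data=[1.0, -2.0, 3.0]`.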
-
Risk-Averse Equilibrium for Autonomous Vehicles in Stochastic Congestion Games
Authors:
Ali Yekkehkhany,
Rakesh Nagi
Abstract:
The fast-growing market of autonomous vehicles, unmanned aerial vehicles, and fleets in general necessitates the design of smart and automatic navigation systems considering the stochastic latency along different paths in the traffic network. The longstanding shortest path problem in a deterministic network, whose counterpart in a congestion game setting is Wardrop equilibrium, has been studied extensively, but it is well known that finding the notion of an optimal path is challenging in a traffic network with stochastic arc delays. In this work, we propose three classes of risk-averse equilibria for an atomic stochastic congestion game in its general form where the arc delay distributions are load dependent and not necessarily independent of each other. The three classes are risk-averse equilibrium (RAE), mean-variance equilibrium (MVE), and conditional value at risk level $α$ equilibrium (CVaR$_α$E) whose notions of risk-averse best responses are based on maximizing the probability of taking the shortest path, minimizing a linear combination of mean and variance of path delay, and minimizing the expected delay at a specified risky quantile of the delay distributions, respectively. We prove that for any finite stochastic atomic congestion game, the risk-averse, mean-variance, and CVaR$_α$ equilibria exist. We show that for risk-averse travelers, the Braess paradox may not occur to the extent presented originally since players do not necessarily travel along the shortest path in expectation, but they take the uncertainty of travel time into consideration as well. We show through some examples that the price of anarchy can be improved when players are risk-averse and travel according to one of the three classes of risk-averse equilibria rather than the Wardrop equilibrium.
Submitted 19 July, 2020;
originally announced July 2020.
-
Risk-Averse Equilibrium for Games
Authors:
Ali Yekkehkhany,
Timothy Murray,
Rakesh Nagi
Abstract:
The term rational has become synonymous with maximizing expected payoff in the definition of the best response in the Nash setting. In this work, we consider stochastic games in which players engage only once, or at most a limited number of times. In such games, it may not be rational for players to maximize their expected payoff as they cannot wait for the Law of Large Numbers to take effect. We instead define a new notion of a risk-averse best response that results in a risk-averse equilibrium (RAE), in which players choose to play the strategy that maximizes the probability of them being rewarded the most in a single round of the game rather than maximizing the expected received reward, subject to the actions of other players. We prove that the risk-averse equilibrium exists in all finite games and numerically compare its performance to that of the Nash equilibrium in finite-time stochastic games.
Submitted 19 February, 2020;
originally announced February 2020.
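The difference between maximizing expected payoff and maximizing the probability of being rewarded the most, as described in the abstract above, can be seen in a small worked example. The sketch below is a simplification with a single decision maker: the other players' actions are treated as fixed, so each strategy reduces to an independent discrete payoff distribution. The function names, the input format, and the convention that ties count as a win for no one are assumptions of this sketch, not definitions taken from the paper.

```python
from itertools import product


def win_probabilities(strategies):
    """For each strategy, return the probability that it pays strictly more
    than every other strategy in one independent draw.

    strategies: list of distributions, each a list of (payoff, probability).
    """
    wins = [0.0] * len(strategies)
    # Enumerate every joint outcome of the independent payoff draws.
    for outcome in product(*strategies):
        values = [v for v, _ in outcome]
        prob = 1.0
        for _, p in outcome:
            prob *= p
        best = max(values)
        if values.count(best) == 1:  # ties are credited to no strategy
            wins[values.index(best)] += prob
    return wins


def expected_payoffs(strategies):
    """Return the mean payoff of each strategy."""
    return [sum(v * p for v, p in dist) for dist in strategies]
```

With strategy A paying 100 with probability 0.1 and 0 otherwise, and strategy B paying 5 with certainty, A has the higher expected payoff (10 versus 5), but B pays more in a single round with probability 0.9, so the two criteria select different strategies.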
-
Risk-Averse Explore-Then-Commit Algorithms for Finite-Time Bandits
Authors:
Ali Yekkehkhany,
Ebrahim Arian,
Mohammad Hajiesmaili,
Rakesh Nagi
Abstract:
In this paper, we study multi-armed bandit problems in the explore-then-commit setting. In our proposed explore-then-commit setting, the goal is to identify the best arm after a pure experimentation (exploration) phase and exploit it once or for a given finite number of times. We identify that although the arm with the highest expected reward is the most desirable objective for infinite exploitations, it is not necessarily the one that is most probable to have the highest reward in a single exploitation or a finite number of exploitations. Alternatively, we advocate the idea of risk-aversion, where the objective is to compete against the arm with the best risk-return trade-off. Then, we propose two algorithms whose objectives are to select the arm that is most probable to reward the most. Using a new notion of finite-time exploitation regret, we find an upper bound for the minimum number of experiments before commitment, to guarantee an upper bound for the regret. As compared to existing risk-averse bandit algorithms, our algorithms do not rely on hyper-parameters, resulting in a more robust behavior in practice, which is verified by the numerical evaluation.
Submitted 11 September, 2019; v1 submitted 30 April, 2019;
originally announced April 2019.
-
Blind GB-PANDAS: A Blind Throughput-Optimal Load Balancing Algorithm for Affinity Scheduling
Authors:
Ali Yekkehkhany,
Rakesh Nagi
Abstract:
Dynamic affinity load balancing of multi-type tasks on multi-skilled servers, when the service rate of each task type on each of the servers is known and can possibly be different from each other, has been an open problem for over three decades. The goal is to assign tasks to servers in real time so that the system becomes stable, which means that the queue lengths do not diverge to infinity in steady state (throughput optimality), and the mean task completion time is minimized (delay optimality). The fluid model planning, Max-Weight, and c-$μ$-rule algorithms have theoretical guarantees on optimality in some aspects for the affinity problem, but they consider a complicated queueing structure and require the task arrival rates, the service rates of tasks on servers, or both. In many cases that are discussed in the introduction section, both task arrival rates and service rates of different task types on different servers are unknown. In this work, the Blind GB-PANDAS algorithm is proposed, which is completely blind to task arrival rates and service rates. Blind GB-PANDAS uses an exploration-exploitation approach for load balancing. We prove that Blind GB-PANDAS is throughput optimal under arbitrary and unknown distributions for service times of different task types on different servers and unknown task arrival rates. Blind GB-PANDAS aims to route an incoming task to the server with the minimum weighted-workload, but since the service rates are unknown, such routing of incoming tasks is not guaranteed, which makes the throughput optimality analysis more complicated than the case where service rates are known. Our extensive experimental results reveal that Blind GB-PANDAS significantly outperforms existing methods in terms of mean task completion time at high loads.
Submitted 3 March, 2020; v1 submitted 13 January, 2019;
originally announced January 2019.
-
RLT2-based Parallel Algorithms for Solving Large Quadratic Assignment Problems on Graphics Processing Unit Clusters
Authors:
Ketan Date,
Rakesh Nagi
Abstract:
This paper discusses efficient parallel algorithms for obtaining strong lower bounds and exact solutions for large instances of the Quadratic Assignment Problem (QAP). Our parallel architecture is comprised of both multi-core processors and Compute Unified Device Architecture (CUDA) enabled NVIDIA Graphics Processing Units (GPUs) on the Blue Waters Supercomputing Facility at the University of Illinois at Urbana-Champaign. We propose a novel parallelization of the Lagrangian Dual Ascent algorithm on the GPUs, which is used for solving a QAP formulation based on Level-2 Refactorization Linearization Technique (RLT2). The Linear Assignment sub-problems (LAPs) in this procedure are solved using our accelerated Hungarian algorithm [Date, Ketan, Rakesh Nagi. 2016. GPU-accelerated Hungarian algorithms for the Linear Assignment Problem. Parallel Computing 57 52-72]. We embed this accelerated dual ascent algorithm in a parallel branch-and-bound scheme and conduct extensive computational experiments on single and multiple GPUs, using problem instances with up to 42 facilities from the QAPLIB. The experiments suggest that our GPU-based approach is scalable and can be used to obtain tight lower bounds on large QAP instances. Our accelerated branch-and-bound scheme is able to comfortably solve Nugent and Taillard instances (up to 30 facilities) from the QAPLIB, using a modest number of GPUs.
Submitted 10 October, 2017;
originally announced October 2017.