-
Controlling Data Access Load in Distributed Systems
Authors:
Mehmet Aktas,
Emina Soljanin
Abstract:
Distributed systems store data objects redundantly to balance the data access load over multiple nodes. Load balancing performance depends mainly on 1) the level of storage redundancy and 2) the assignment of data objects to storage nodes. We analyze the performance implications of these design choices by considering four practical storage schemes that we refer to as clustering, cyclic, block and…
▽ More
Distributed systems store data objects redundantly to balance the data access load over multiple nodes. Load balancing performance depends mainly on 1) the level of storage redundancy and 2) the assignment of data objects to storage nodes. We analyze the performance implications of these design choices by considering four practical storage schemes that we refer to as clustering, cyclic, block and random design. We formulate the problem of load balancing as maintaining the load on any node below a given threshold. Regarding the level of redundancy, we find that the desired load balance can be achieved in a system of $n$ nodes only if the replication factor $d = Ω(\log(n)^{1/3})$, which is a necessary condition for any storage design. For clustering and cyclic designs, $d = Ω(\log(n))$ is necessary and sufficient. For block and random designs, $d = Ω(\log(n))$ is sufficient but unnecessary. Whether $d = Ω(\log(n)^{1/3})$ is sufficient remains open. The assignment of objects to nodes essentially determines which objects share the access capacity on each node. We refer to the number of nodes jointly shared by a set of objects as the \emph{overlap} between those objects. We find that many consistently slight overlaps between the objects (block, random) are better than few but occasionally significant overlaps (clustering, cyclic). However, when the demand is ''skewed beyond a level'' the impact of overlaps becomes the opposite. We derive our results by connecting the load-balancing problem to mathematical constructs that have been used to study other problems. For a class of storage designs containing the clustering and cyclic design, we express load balance in terms of the maximum of moving sums of i.i.d. random variables, which is known as the scan statistic. For random design, we express load balance by using the occupancy metric for random allocation with complexes.
△ Less
Submitted 16 December, 2023;
originally announced December 2023.
-
Topology-guided Hypergraph Transformer Network: Unveiling Structural Insights for Improved Representation
Authors:
Khaled Mohammed Saifuddin,
Mehmet Emin Aktas,
Esra Akbas
Abstract:
Hypergraphs, with their capacity to depict high-order relationships, have emerged as a significant extension of traditional graphs. Although Graph Neural Networks (GNNs) have remarkable performance in graph representation learning, their extension to hypergraphs encounters challenges due to their intricate structures. Furthermore, current hypergraph transformers, a special variant of GNN, utilize…
▽ More
Hypergraphs, with their capacity to depict high-order relationships, have emerged as a significant extension of traditional graphs. Although Graph Neural Networks (GNNs) have remarkable performance in graph representation learning, their extension to hypergraphs encounters challenges due to their intricate structures. Furthermore, current hypergraph transformers, a special variant of GNN, utilize semantic feature-based self-attention, ignoring topological attributes of nodes and hyperedges. To address these challenges, we propose a Topology-guided Hypergraph Transformer Network (THTN). In this model, we first formulate a hypergraph from a graph while retaining its structural essence to learn higher-order relations within the graph. Then, we design a simple yet effective structural and spatial encoding module to incorporate the topological and spatial information of the nodes into their representation. Further, we present a structure-aware self-attention mechanism that discovers the important nodes and hyperedges from both semantic and structural viewpoints. By leveraging these two modules, THTN crafts an improved node representation, capturing both local and global topological expressions. Extensive experiments conducted on node classification tasks demonstrate that the performance of the proposed model consistently exceeds that of the existing approaches.
△ Less
Submitted 21 May, 2024; v1 submitted 14 October, 2023;
originally announced October 2023.
-
Theory vs. Practice in Modeling Edge Storage Systems
Authors:
Oleg Kolosov,
Mehmet Fatih Aktas,
Emina Soljanin,
Gala Yadgar
Abstract:
Edge systems promise to bring data and computing closer to the users of time-critical applications. Specifically, edge storage systems are emerging as a new system paradigm, where users can retrieve data from small-scale servers inter-operating at the network's edge. The analysis, design, and optimization of such systems require a tractable model that will reflect their costs and bottlenecks. Alas…
▽ More
Edge systems promise to bring data and computing closer to the users of time-critical applications. Specifically, edge storage systems are emerging as a new system paradigm, where users can retrieve data from small-scale servers inter-operating at the network's edge. The analysis, design, and optimization of such systems require a tractable model that will reflect their costs and bottlenecks. Alas, most existing mathematical models for edge systems focus on stateless tasks, network performance, or isolated nodes and are inapplicable for evaluating edge-based storage performance.
We analyze the capacity-region model - the most promising model proposed so far for edge storage systems. The model addresses the system's ability to serve a set of user demands. Our analysis reveals five inherent gaps between this model and reality, demonstrating the significant remaining challenges in modeling storage service at the edge.
△ Less
Submitted 23 August, 2023;
originally announced August 2023.
-
Hypergraph Classification via Persistent Homology
Authors:
Mehmet Emin Aktas,
Thu Nguyen,
Rakin Riza,
Muhammad Ifte Islam,
Esra Akbas
Abstract:
Persistent homology is a mathematical tool used for studying the shape of data by extracting its topological features. It has gained popularity in network science due to its applicability in various network mining problems, including clustering, graph classification, and graph neural networks. The definition of persistent homology for graphs is relatively straightforward, as graphs possess distinc…
▽ More
Persistent homology is a mathematical tool used for studying the shape of data by extracting its topological features. It has gained popularity in network science due to its applicability in various network mining problems, including clustering, graph classification, and graph neural networks. The definition of persistent homology for graphs is relatively straightforward, as graphs possess distinct intrinsic distances and a simplicial complex structure. However, hypergraphs present a challenge in preserving topological information since they may not have a simplicial complex structure. In this paper, we define several topological characterizations of hypergraphs in defining hypergraph persistent homology to prioritize different higher-order structures within hypergraphs. We further use these persistent homology filtrations in classifying four different real-world hypergraphs and compare their performance to the state-of-the-art graph neural network models. Experimental results demonstrate that persistent homology filtrations are effective in classifying hypergraphs and outperform the baseline models. To the best of our knowledge, this study represents the first systematic attempt to tackle the hypergraph classification problem using persistent homology.
△ Less
Submitted 20 June, 2023;
originally announced June 2023.
-
Liars are more influential: Effect of Deception in Influence Maximization on Social Networks
Authors:
Mehmet Emin Aktas,
Esra Akbas,
Ashley Hahn
Abstract:
Detecting influential users, called the influence maximization problem on social networks, is an important graph mining problem with many diverse applications such as information propagation, market advertising, and rumor controlling. There are many studies in the literature for influential users detection problem in social networks. Although the current methods are successfully used in many diffe…
▽ More
Detecting influential users, called the influence maximization problem on social networks, is an important graph mining problem with many diverse applications such as information propagation, market advertising, and rumor controlling. There are many studies in the literature for influential users detection problem in social networks. Although the current methods are successfully used in many different applications, they assume that users are honest with each other and ignore the role of deception on social networks. On the other hand, deception appears to be surprisingly common among humans within social networks. In this paper, we study the effect of deception in influence maximization on social networks. We first model deception in social networks. Then, we model the opinion dynamics on these networks taking the deception into consideration thanks to a recent opinion dynamics model via sheaf Laplacian. We then extend two influential node detection methods, namely Laplacian centrality and DFF centrality, for the sheaf Laplacian to measure the effect of deception in influence maximization. Our experimental results on synthetic and real-world networks suggest that liars are more influential than honest users in social networks.
△ Less
Submitted 21 March, 2022;
originally announced March 2022.
-
Identifying critical higher-order interactions in complex networks
Authors:
Mehmet Emin Aktas,
Thu Nguyen,
Sidra Jawaid,
Rakin Riza,
Esra Akbas
Abstract:
Information diffusion on networks is an important concept in network science observed in many situations such as information spreading and rumor controlling in social networks, disease contagion between individuals, cascading failures in power grids. The critical interactions in networks are the ones that play critical roles in information diffusion and primarily affect network structure and funct…
▽ More
Information diffusion on networks is an important concept in network science observed in many situations such as information spreading and rumor controlling in social networks, disease contagion between individuals, cascading failures in power grids. The critical interactions in networks are the ones that play critical roles in information diffusion and primarily affect network structure and functions. Besides, interactions can occur between not only two nodes as pairwise interactions, i.e., edges, but also three or more nodes, described as higher-order interactions. This report presents a novel method to identify critical higher-order interactions. We propose two new Laplacians that allow redefining classical graph centrality measures for higher-order interactions. We then compare the redefined centrality measures using the Susceptible-Infected-Recovered (SIR) simulation model. Experimental results suggest that the proposed method is promising in identifying critical higher-order interactions.
△ Less
Submitted 7 May, 2021; v1 submitted 6 May, 2021;
originally announced May 2021.
-
Hypergraph Laplacians in Diffusion Framework
Authors:
Mehmet Emin Aktas,
Esra Akbas
Abstract:
Networks are important structures used to model complex systems where interactions take place. In a basic network model, entities are represented as nodes, and interaction and relations among them are represented as edges. However, in a complex system, we cannot describe all relations as pairwise interactions, rather should describe as higher-order interactions. Hypergraphs are successfully used t…
▽ More
Networks are important structures used to model complex systems where interactions take place. In a basic network model, entities are represented as nodes, and interaction and relations among them are represented as edges. However, in a complex system, we cannot describe all relations as pairwise interactions, rather should describe as higher-order interactions. Hypergraphs are successfully used to model higher-order interactions in complex systems. In this paper, we present two new hypergraph Laplacians based on diffusion framework. Our Laplacians take the relations between higher-order interactions into consideration, hence can be used to model diffusion on hypergraphs not only between vertices but also higher-order structures. These Laplacians can be employed in different network mining problems on hypergraphs, such as social contagion models on hypergraphs, influence study on hypergraphs, and hypergraph classification, to list a few.
△ Less
Submitted 17 February, 2021;
originally announced February 2021.
-
Service Rate Region: A New Aspect of Coded Distributed System Design
Authors:
Mehmet Aktas,
Gauri Joshi,
Swanand Kadhe,
Fatemeh Kazemi,
Emina Soljanin
Abstract:
Erasure coding has been recognized as a powerful method to mitigate delays due to slow or straggling nodes in distributed systems. This work shows that erasure coding of data objects can flexibly handle skews in the request rates. Coding can help boost the \emph{service rate region}, that is, increase the overall volume of data access requests that the system can handle. This paper aims to postula…
▽ More
Erasure coding has been recognized as a powerful method to mitigate delays due to slow or straggling nodes in distributed systems. This work shows that erasure coding of data objects can flexibly handle skews in the request rates. Coding can help boost the \emph{service rate region}, that is, increase the overall volume of data access requests that the system can handle. This paper aims to postulate the service rate region as an important consideration in the design of erasure-coded distributed systems. We highlight several open problems that can be grouped into two broad threads: 1) characterizing the service rate region of a given code and finding the optimal request allocation, and2) designing the underlying erasure code for a given service rate region. As contributions along the first thread, we characterize the rate regions of maximum-distance-separable, locally repairable, and Simplex codes. We show the effectiveness of hybrid codes that combine replication and erasure coding in terms of code design. We also discover fundamental connections between multi-set batch codes and the problem of maximizing the service rate region.
△ Less
Submitted 27 June, 2021; v1 submitted 3 September, 2020;
originally announced September 2020.
-
Graph Classification via Heat Diffusion on Simplicial Complexes
Authors:
Mehmet Emin Aktas,
Esra Akbas
Abstract:
In this paper, we study the graph classification problem in vertex-labeled graphs. Our main goal is to classify the graphs comparing their higher-order structures thanks to heat diffusion on their simplices. We first represent vertex-labeled graphs as simplex-weighted super-graphs. We then define the diffusion Frechet function over their simplices to encode the higher-order network topology and fi…
▽ More
In this paper, we study the graph classification problem in vertex-labeled graphs. Our main goal is to classify the graphs comparing their higher-order structures thanks to heat diffusion on their simplices. We first represent vertex-labeled graphs as simplex-weighted super-graphs. We then define the diffusion Frechet function over their simplices to encode the higher-order network topology and finally reach our goal by combining the function values with machine learning algorithms. Our experiments on real-world bioinformatics networks show that using diffusion Fr{é}chet function on simplices is promising in graph classification and more effective than the baseline methods. To the best of our knowledge, this paper is the first paper in the literature using heat diffusion on higher-dimensional simplices in a graph mining problem. We believe that our method can be extended to different graph mining domains, not only the graph classification problem.
△ Less
Submitted 26 June, 2020;
originally announced July 2020.
-
Download Time Analysis for Distributed Storage Codes with Locality and Availability
Authors:
Mehmet Fatih Aktas,
Swanand Kadhe,
Emina Soljanin,
Alex Sprintson
Abstract:
The paper presents techniques for analyzing the expected download time in distributed storage systems that employ systematic availability codes. These codes provide access to hot data through the systematic server containing the object and multiple recovery groups. When a request for an object is received, it can be replicated (forked) to the systematic server and all recovery groups. We first con…
▽ More
The paper presents techniques for analyzing the expected download time in distributed storage systems that employ systematic availability codes. These codes provide access to hot data through the systematic server containing the object and multiple recovery groups. When a request for an object is received, it can be replicated (forked) to the systematic server and all recovery groups. We first consider the low-traffic regime and present the close-form expression for the download time. By comparison across systems with availability, maximum distance separable (MDS), and replication codes, we demonstrate that availability codes can reduce download time in some settings but are not always optimal. In the high-traffic regime, the system consists of multiple inter-dependent Fork-Join queues, making exact analysis intractable. Accordingly, we present upper and lower bounds on the download time, and an M/G/1 queue approximation for several cases of interest. Via extensive numerical simulations, we evaluate our bounds and demonstrate that the M/G/1 queue approximation has a high degree of accuracy.
△ Less
Submitted 10 March, 2021; v1 submitted 20 December, 2019;
originally announced December 2019.
-
Evaluating Load Balancing Performance in Distributed Storage with Redundancy
Authors:
Mehmet Fatih Aktas,
Amir Behrouzi-Far,
Emina Soljanin,
Philip Whiting
Abstract:
To facilitate load balancing, distributed systems store data redundantly. We evaluate the load balancing performance of storage schemes in which each object is stored at $d$ different nodes, and each node stores the same number of objects. In our model, the load offered for the objects is sampled uniformly at random from all the load vectors with a fixed cumulative value. We find that the load bal…
▽ More
To facilitate load balancing, distributed systems store data redundantly. We evaluate the load balancing performance of storage schemes in which each object is stored at $d$ different nodes, and each node stores the same number of objects. In our model, the load offered for the objects is sampled uniformly at random from all the load vectors with a fixed cumulative value. We find that the load balance in a system of $n$ nodes improves multiplicatively with $d$ as long as $d = o\left(\log(n)\right)$, and improves exponentially once $d = Θ\left(\log(n)\right)$. We show that the load balance improves in the same way with $d$ when the service choices are created with XOR's of $r$ objects rather than object replicas. In such redundancy schemes, storage overhead is reduced multiplicatively by $r$. However, recovery of an object requires downloading content from $r$ nodes. At the same time, the load balance increases additively by $r$. We express the system's load balance in terms of the maximal spacing or maximum of $d$ consecutive spacings between the ordered statistics of uniform random variables. Using this connection and the limit results on the maximal $d$-spacings, we derive our main results.
△ Less
Submitted 22 January, 2021; v1 submitted 13 October, 2019;
originally announced October 2019.
-
Anonymity Mixes as (Partial) Assembly Queues: Modeling and Analysis
Authors:
Mehmet Fatih Aktas,
Emina Soljanin
Abstract:
Anonymity platforms route the traffic over a network of special routers that are known as mixes and implement various traffic disruption techniques to hide the communicating users' identities. Batch mixes in particular anonymize communicating peers by allowing message exchange to take place only after a sufficient number of messages (a batch) accumulate, thus introducing delay. We introduce a queu…
▽ More
Anonymity platforms route the traffic over a network of special routers that are known as mixes and implement various traffic disruption techniques to hide the communicating users' identities. Batch mixes in particular anonymize communicating peers by allowing message exchange to take place only after a sufficient number of messages (a batch) accumulate, thus introducing delay. We introduce a queueing model for batch mix and study its delay properties. Our analysis shows that delay of a batch mix grows quickly as the batch size gets close to the number of senders connected to the mix. We then propose a randomized batch mixing strategy and show that it achieves much better delay scaling in terms of the batch size. However, randomization is shown to reduce the anonymity preserving capabilities of the mix. We also observe that queueing models are particularly useful to study anonymity metrics that are more practically relevant such as the time-to-deanonymize metric.
△ Less
Submitted 26 July, 2019;
originally announced July 2019.
-
Persistence Homology of Networks: Methods and Applications
Authors:
Mehmet Emin Aktas,
Esra Akbas,
Ahmed El Fatmaoui
Abstract:
Information networks are becoming increasingly popular to capture complex relationships across various disciplines, such as social networks, citation networks, and biological networks. The primary challenge in this domain is measuring similarity or distance between networks based on topology. However, classical graph-theoretic measures are usually local and mainly based on differences between eith…
▽ More
Information networks are becoming increasingly popular to capture complex relationships across various disciplines, such as social networks, citation networks, and biological networks. The primary challenge in this domain is measuring similarity or distance between networks based on topology. However, classical graph-theoretic measures are usually local and mainly based on differences between either node or edge measurements or correlations without considering the topology of networks such as the connected components or holes. In recent years, mathematical tools and deep learning based methods have become popular to extract the topological features of networks. Persistent homology (PH) is a mathematical tool in computational topology that measures the topological features of data that persist across multiple scales with applications ranging from biological networks to social networks. In this paper, we provide a conceptual review of key advancements in this area of using PH on complex network science. We give a brief mathematical background on PH, review different methods (i.e. filtrations) to define PH on networks and highlight different algorithms and applications where PH is used in solving network mining problems. In doing so, we develop a unified framework to describe these recent approaches and emphasize major conceptual distinctions. We conclude with directions for future work. We focus our review on recent approaches that get significant attention in the mathematics and data mining communities working on network data. We believe our summary of the analysis of PH on networks will provide important insights to researchers in applied network science.
△ Less
Submitted 19 July, 2019;
originally announced July 2019.
-
Network Embedding: on Compression and Learning
Authors:
Esra Akbas,
Mehmet Aktas
Abstract:
Recently, network embedding that encodes structural information of graphs into a vector space has become popular for network analysis. Although recent methods show promising performance for various applications, the huge sizes of graphs may hinder a direct application of existing network embedding method to them. This paper presents NECL, a novel efficient Network Embedding method with two goals.…
▽ More
Recently, network embedding that encodes structural information of graphs into a vector space has become popular for network analysis. Although recent methods show promising performance for various applications, the huge sizes of graphs may hinder a direct application of existing network embedding method to them. This paper presents NECL, a novel efficient Network Embedding method with two goals. 1) Is there an ideal Compression of a network? 2) Will the compression of a network significantly boost the representation Learning of the network? For the first problem, we propose a neighborhood similarity based graph compression method that compresses the input graph to get a smaller graph without losing any/much information about the global structure of the graph and the local proximity of the vertices in the graph. For the second problem, we use the compressed graph for network embedding instead of the original large graph to bring down the embedding cost. NECL is a general meta-strategy to improve the efficiency of all of the state-of-the-art graph embedding algorithms based on random walks, including DeepWalk and Node2vec, without losing their effectiveness. Extensive experiments on large real-world networks validate the efficiency of NECL method that yields an average improvement of 23 - 57% embedding time, including walking and learning time without decreasing classification accuracy as evaluated on single and multi-label classification tasks on real-world graphs such as DBLP, BlogCatalog, Cora and Wiki.
△ Less
Submitted 8 July, 2019; v1 submitted 5 July, 2019;
originally announced July 2019.
-
Straggler Mitigation at Scale
Authors:
Mehmet Fatih Aktas,
Emina Soljanin
Abstract:
Runtime performance variability at the servers has been a major issue, hindering the predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and practice. Systems that employ redundancy has drawn significant attention, and numerous papers have analyzed…
▽ More
Runtime performance variability at the servers has been a major issue, hindering the predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and practice. Systems that employ redundancy has drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure coded redundancy. Tail heaviness of service time variability is decisive on the pain and gain of redundancy and we quantify its effect by deriving expressions for the cost and latency. Specifically, we try to answer four questions: 1) How do replicated and coded redundancy compare in the cost vs. latency tradeoff? 2) Can we introduce redundancy after waiting some time and expect to reduce the cost? 3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together? We validate the answers we found for each of the questions via simulations that use empirical distributions extracted from a Google cluster data.
△ Less
Submitted 9 October, 2019; v1 submitted 25 June, 2019;
originally announced June 2019.
-
Optimizing Redundancy Levels in Master-Worker Compute Clusters for Straggler Mitigation
Authors:
Mehmet Fatih Aktas,
Emina Soljanin
Abstract:
Runtime variability in computing systems causes some tasks to straggle and take much longer than expected to complete. These straggler tasks are known to significantly slowdown distributed computation. Job execution with speculative execution of redundant tasks has been the most widely deployed technique for mitigating the impact of stragglers, and many recent theoretical papers have studied the a…
▽ More
Runtime variability in computing systems causes some tasks to straggle and take much longer than expected to complete. These straggler tasks are known to significantly slowdown distributed computation. Job execution with speculative execution of redundant tasks has been the most widely deployed technique for mitigating the impact of stragglers, and many recent theoretical papers have studied the advantages and disadvantages of using redundancy under various system and service models. However, no clear guidelines could yet be found on when, for which jobs, and how much redundancy should be employed in Master-Worker compute clusters, which is the most widely adopted architecture in modern compute systems. We are concerned with finding a strategy for scheduling jobs with redundancy that works well in practice. This is a complex optimization problem, which we address in stages. We first use Reinforcement Learning (RL) techniques to learn good scheduling principles from realistic experience. Building on these principles, we derive a simple scheduling policy and present an approximate analysis of its performance. Specifically, we derive expressions to decide when and which jobs should be scheduled with how much redundancy. We show that policy that we devise in this way performs as good as the more complex policies that are derived by RL. Finally, we extend our approximate analysis to the case when system employs the other widely deployed remedy for stragglers, which is relaunching straggler tasks after waiting some time. We show that scheduling with redundancy significantly outperforms straggler relaunch policy when the offered load on the system is low or moderate, and performs slightly worse when the offered load is very high.
△ Less
Submitted 12 June, 2019;
originally announced June 2019.
-
Simplex Queues for Hot-Data Download
Authors:
Mehmet Fatih Aktas,
Elie Najm,
Emina Soljanin
Abstract:
In cloud storage systems, hot data is usually replicated over multiple nodes in order to accommodate simultaneous access by multiple users as well as increase the fault tolerance of the system. Recent cloud storage research has proposed using availability codes, which is a special class of erasure codes, as a more storage efficient way to store hot data. These codes enable data recovery from multi…
▽ More
In cloud storage systems, hot data is usually replicated over multiple nodes in order to accommodate simultaneous access by multiple users as well as increase the fault tolerance of the system. Recent cloud storage research has proposed using availability codes, which is a special class of erasure codes, as a more storage efficient way to store hot data. These codes enable data recovery from multiple, small disjoint groups of servers. The number of the recovery groups is referred to as the availability and the size of each group as the locality of the code. Until now, we have very limited knowledge on how code locality and availability affect data access time. Data download from these systems involves multiple fork-join queues operating in-parallel, making the analysis of access time a very challenging problem. In this paper, we present an approximate analysis of data access time in storage systems that employ simplex codes, which are an important and in certain sense optimal class of availability codes. We consider and compare three strategies in assigning download requests to servers; first one aggressively exploits the storage availability for faster download, second one implements only load balancing, and the last one employs storage availability only for hot data download without incurring any negative impact on the cold data download.
△ Less
Submitted 17 April, 2018;
originally announced April 2018.
-
On the Service Capacity Region of Accessing Erasure Coded Content
Authors:
Mehmet Aktas,
Sarah E. Anderson,
Ann Johnston,
Gauri Joshi,
Swanand Kadhe,
Gretchen L. Matthews,
Carolyn Mayer,
Emina Soljanin
Abstract:
Cloud storage systems generally add redundancy in storing content files such that $K$ files are replicated or erasure coded and stored on $N > K$ nodes. In addition to providing reliability against failures, the redundant copies can be used to serve a larger volume of content access requests. A request for one of the files can be either be sent to a systematic node, or one of the repair groups. In…
▽ More
Cloud storage systems generally add redundancy in storing content files such that $K$ files are replicated or erasure coded and stored on $N > K$ nodes. In addition to providing reliability against failures, the redundant copies can be used to serve a larger volume of content access requests. A request for one of the files can be either be sent to a systematic node, or one of the repair groups. In this paper, we seek to maximize the service capacity region, that is, the set of request arrival rates for the $K$ files that can be supported by a coded storage system. We explore two aspects of this problem: 1) for a given erasure code, how to optimally split incoming requests between systematic nodes and repair groups, and 2) choosing an underlying erasure code that maximizes the achievable service capacity region. In particular, we consider MDS and Simplex codes. Our analysis demonstrates that erasure coding makes the system more robust to skews in file popularity than simply replicating a file at multiple servers, and that coding and replication together can make the capacity region larger than either alone.
△ Less
Submitted 9 October, 2017;
originally announced October 2017.
-
Effective Straggler Mitigation: Which Clones Should Attack and When?
Authors:
Mehmet Fatih Aktas,
Pei Peng,
Emina Soljanin
Abstract:
Redundancy for straggler mitigation, originally in data download and more recently in distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention and numerous papers have studied pain and gain of redundancy under various service models and assumptions on the straggler characteristics. We here present…
▽ More
Redundancy for straggler mitigation, originally in data download and more recently in distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention and numerous papers have studied pain and gain of redundancy under various service models and assumptions on the straggler characteristics. We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks. We quantify the effect of the tail of task execution times and discuss tail heaviness as a decisive parameter for the cost and latency of using redundancy. Specifically, we find that coded redundancy achieves better cost vs. latency and allows for greater achievable latency and cost tradeoff region compared to replication and can yield reduction in both cost and latency under less heavy tailed execution times. We show that delaying redundancy is not effective in reducing cost.
△ Less
Submitted 2 October, 2017;
originally announced October 2017.
-
Straggler Mitigation by Delayed Relaunch of Tasks
Authors:
Mehmet Fatih Aktas,
Pei Peng,
Emina Soljanin
Abstract:
Redundancy for straggler mitigation, originally in data download and more recently in distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention and numerous papers have studied pain and gain of redundancy under various service models and assumptions on the straggler characteristics. We here present…
▽ More
Redundancy for straggler mitigation, originally in data download and more recently in distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention and numerous papers have studied pain and gain of redundancy under various service models and assumptions on the straggler characteristics. We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks. We quantify the effect of the tail of task execution times and discuss tail heaviness as a decisive parameter for the cost and latency of using redundancy. Specifically, we find that coded redundancy achieves better cost vs. latency tradeoff than simple replication and can yield reduction in both cost and latency under less heavy tailed execution times. We show that delaying redundancy is not effective in reducing cost and that delayed relaunch of stragglers can yield significant reduction in cost and latency. We validate these observations by comparing with the simulations that use empirical distributions extracted from Google cluster data.
△ Less
Submitted 1 October, 2017;
originally announced October 2017.