-
FlexCast: genuine overlay-based atomic multicast
Authors:
Eliã Batista,
Paulo Coelho,
Eduardo Alchieri,
Fernando Dotti,
Fernando Pedone
Abstract:
Atomic multicast is a communication abstraction where messages are propagated to groups of processes with reliability and order guarantees. Atomic multicast is at the core of strongly consistent storage and transactional systems. This paper presents FlexCast, the first genuine overlay-based atomic multicast protocol. Genuineness captures the essence of atomic multicast in that only the sender of a…
▽ More
Atomic multicast is a communication abstraction where messages are propagated to groups of processes with reliability and order guarantees. Atomic multicast is at the core of strongly consistent storage and transactional systems. This paper presents FlexCast, the first genuine overlay-based atomic multicast protocol. Genuineness captures the essence of atomic multicast in that only the sender of a message and the message's destinations coordinate to order the message, leading to efficient protocols. Overlay-based protocols restrict how process groups can communicate. Limiting communication leads to simpler protocols and reduces the amount of information each process must keep about the rest of the system. FlexCast implements genuine atomic multicast using a complete DAG overlay. We experimentally evaluate FlexCast in a geographically distributed environment using gTPC-C, a variation of the TPC-C benchmark that takes into account geographical distribution and locality. We show that, by exploiting genuineness and workload locality, FlexCast outperforms well-established atomic multicast protocols without the inherent communication overhead of state-of-the-art non-genuine multicast protocols.
△ Less
Submitted 28 September, 2023; v1 submitted 25 September, 2023;
originally announced September 2023.
-
From Byzantine Replication to Blockchain: Consensus is only the Beginning
Authors:
Alysson Bessani,
Eduardo Alchieri,
João Sousa,
André Oliveira,
Fernando Pedone
Abstract:
The popularization of blockchains leads to a resurgence of interest in Byzantine Fault-Tolerant (BFT) state machine replication protocols. However, much of the work on this topic focuses on the underlying consensus protocols, with emphasis on their lack of scalability, leaving other subtle limitations unaddressed. These limitations are related to the effects of maintaining a durable blockchain ins…
▽ More
The popularization of blockchains leads to a resurgence of interest in Byzantine Fault-Tolerant (BFT) state machine replication protocols. However, much of the work on this topic focuses on the underlying consensus protocols, with emphasis on their lack of scalability, leaving other subtle limitations unaddressed. These limitations are related to the effects of maintaining a durable blockchain instead of a write-ahead log and the requirement for reconfiguring the set of replicas in a decentralized way. We demonstrate these limitations using a digital coin blockchain application and BFT-SMaRt, a popular BFT replication library. We show how they can be addressed both at a conceptual level, in a protocol-agnostic way, and by implementing SMaRtChain, a blockchain platform based on BFT-SMaRt. SMaRtChain improves the performance of our digital coin application by a factor of eight when compared with a naive implementation on top of BFT-SMaRt. Moreover, SMaRtChain achieves a throughput $8\times$ and $33\times$ better than Tendermint and Hyperledger Fabric, respectively, when ensuring strong durability on its blockchain.
△ Less
Submitted 29 April, 2020;
originally announced April 2020.
-
Smart Contracts on the Move
Authors:
Enrique Fynn,
Alysson Bessani,
Fernando Pedone
Abstract:
Blockchain systems have received much attention and promise to revolutionize many services. Yet, despite their popularity, current blockchain systems exist in isolation, that is, they cannot share information. While interoperability is crucial for blockchain to reach widespread adoption, it is difficult to achieve due to differences among existing blockchain technologies. This paper presents a tec…
▽ More
Blockchain systems have received much attention and promise to revolutionize many services. Yet, despite their popularity, current blockchain systems exist in isolation, that is, they cannot share information. While interoperability is crucial for blockchain to reach widespread adoption, it is difficult to achieve due to differences among existing blockchain technologies. This paper presents a technique to allow blockchain interoperability. The core idea is to provide a primitive operation to developers so that contracts and objects can switch from one blockchain to another, without breaking consistency and violating key blockchain properties. To validate our ideas, we implemented our protocol in two popular blockchain clients that use the Ethereum virtual machine. We discuss how to build applications using the proposed protocol and show examples of applications based on real use cases that can move across blockchains. To analyze the system performance we use a real trace from one of the most popular Ethereum applications and replay it in a multi-blockchain environment.
△ Less
Submitted 23 April, 2020; v1 submitted 13 April, 2020;
originally announced April 2020.
-
Partitioned Paxos via the Network Data Plane
Authors:
Huynh Tu Dang,
Pietro Bressana,
Han Wang,
Ki Suh Lee,
Noa Zilberman,
Hakim Weatherspoon,
Marco Canini,
Fernando Pedone,
Robert Soulé
Abstract:
Consensus protocols are the foundation for building fault-tolerant, distributed systems, and services. They are also widely acknowledged as performance bottlenecks. Several recent systems have proposed accelerating these protocols using the network data plane. But, while network-accelerated consensus shows great promise, current systems suffer from an important limitation: they assume that the net…
▽ More
Consensus protocols are the foundation for building fault-tolerant, distributed systems, and services. They are also widely acknowledged as performance bottlenecks. Several recent systems have proposed accelerating these protocols using the network data plane. But, while network-accelerated consensus shows great promise, current systems suffer from an important limitation: they assume that the network hardware also accelerates the application itself. Consequently, they provide a specialized replicated service, rather than providing a general-purpose high-performance consensus that fits any off-the-shelf application.
To address this problem, this paper proposes Partitioned Paxos, a novel approach to network-accelerated consensus. The key insight behind Partitioned Paxos is to separate the two aspects of Paxos, agreement, and execution, and optimize them separately. First, Partitioned Paxos uses the network forwarding plane to accelerate agreement. Then, it uses state partitioning and parallelization to accelerate execution at the replicas. Our experiments show that using this combination of data plane acceleration and parallelization, Partitioned Paxos is able to provide at least x3 latency improvement and x11 throughput improvement for a replicated instance of a RocksDB key-value store.
△ Less
Submitted 25 January, 2019;
originally announced January 2019.
-
Early Scheduling in Parallel State Machine Replication
Authors:
Eduardo Alchieri,
Fernando Dotti,
Fernando Pedone
Abstract:
State machine replication is standard approach to fault tolerance. One of the key assumptions of state machine replication is that replicas must execute operations deterministically and thus serially. To benefit from multi-core servers, some techniques allow concurrent execution of operations in state machine replication. Invariably, these techniques exploit the fact that independent operations (t…
▽ More
State machine replication is standard approach to fault tolerance. One of the key assumptions of state machine replication is that replicas must execute operations deterministically and thus serially. To benefit from multi-core servers, some techniques allow concurrent execution of operations in state machine replication. Invariably, these techniques exploit the fact that independent operations (those that do not share any common state or do not update shared state) can execute concurrently. A promising category of solutions trades scheduling freedom for simplicity. This paper generalizes this category of scheduling solutions. In doing so, it proposes an automated mechanism to schedule operations on worker threads at replicas. We integrate our contributions to a popular state machine replication framework and experimentally compare the resulting system to more classic approaches.
△ Less
Submitted 14 May, 2018;
originally announced May 2018.
-
Challenges and pitfalls of partitioning blockchains
Authors:
Enrique Fynn,
Fernando Pedone
Abstract:
Blockchain has received much attention in recent years. This immense popularity has raised a number of concerns, scalability of blockchain systems being a common one. In this paper, we seek to understand how Ethereum, a well-established blockchain system, would respond to sharding. Sharding is a prevalent technique to increase the scalability of distributed systems. To understand how sharding woul…
▽ More
Blockchain has received much attention in recent years. This immense popularity has raised a number of concerns, scalability of blockchain systems being a common one. In this paper, we seek to understand how Ethereum, a well-established blockchain system, would respond to sharding. Sharding is a prevalent technique to increase the scalability of distributed systems. To understand how sharding would affect Ethereum, we model Ethereum blockchain as a graph and evaluate five methods to partition the graph. We analyze the results using three metrics: the balance among shards, the number of transactions that would involve multiple shards, and the amount of data that would be relocated across shards upon a repartitioning of the system.
△ Less
Submitted 9 May, 2018; v1 submitted 19 April, 2018;
originally announced April 2018.
-
Optimistic Aborts for Geo-distributed Transactions
Authors:
Theo Jepsen,
Leandro Pacheco de Sousa,
Huynh Tu Dang,
Fernando Pedone,
Robert Soulé
Abstract:
Network latency can have a significant impact on the performance of transactional storage systems, particularly in wide area or geo-distributed deployments. To reduce latency, systems typically rely on a cache to service read-requests closer to the client. However, caches are not effective for write-heavy workloads, which have to be processed by the storage system in order to maintain serializabil…
▽ More
Network latency can have a significant impact on the performance of transactional storage systems, particularly in wide area or geo-distributed deployments. To reduce latency, systems typically rely on a cache to service read-requests closer to the client. However, caches are not effective for write-heavy workloads, which have to be processed by the storage system in order to maintain serializability.
This paper presents a new technique, called optimistic abort, which reduces network latency for high-contention, write-heavy workloads by identifying transactions that will abort as early as possible, and aborting them before they reach the store. We have implemented optimistic abort in a system called Gotthard, which leverages recent advances in network data plane programmability to execute transaction processing logic directly in network devices. Gotthard examines network traffic to observe and log transaction requests. If Gotthard suspects that a transaction is likely to be aborted at the store, it aborts the transaction early by re-writing the packet header, and routing the packets back to the client. Gotthard significantly reduces the overall latency and improves the throughput for high-contention workloads.
△ Less
Submitted 24 October, 2016;
originally announced October 2016.
-
Network Hardware-Accelerated Consensus
Authors:
Huynh Tu Dang,
Pietro Bressana,
Han Wang,
Ki Suh Lee,
Hakim Weatherspoon,
Marco Canini,
Fernando Pedone,
Robert Soulé
Abstract:
Consensus protocols are the foundation for building many fault-tolerant distributed systems and services. This paper posits that there are significant performance benefits to be gained by offering consensus as a network service (CAANS). CAANS leverages recent advances in commodity networking hardware design and programmability to implement consensus protocol logic in network devices. CAANS provide…
▽ More
Consensus protocols are the foundation for building many fault-tolerant distributed systems and services. This paper posits that there are significant performance benefits to be gained by offering consensus as a network service (CAANS). CAANS leverages recent advances in commodity networking hardware design and programmability to implement consensus protocol logic in network devices. CAANS provides a complete Paxos protocol, is a drop-in replacement for software-based implementations of Paxos, makes no restrictions on network topologies, and is implemented in a higher-level, data-plane programming language, allowing for portability across a range of target devices. At the same time, CAANS significantly increases throughput and reduces latency for consensus operations. Consensus logic executing in hardware can transmit consensus messages at line speed, with latency only slightly higher than simply forwarding packets.
△ Less
Submitted 18 May, 2016;
originally announced May 2016.
-
Paxos Made Switch-y
Authors:
Huynh Tu Dang,
Marco Canini,
Fernando Pedone,
Robert Soulé
Abstract:
This paper describes an implementation of the well-known consensus protocol, Paxos, in the P4 programming language. P4 is a language for programming the behavior of network forwarding devices (i.e., the network data plane). Moving consensus logic into network devices could significantly improve the performance of the core infrastructure and services in data centers. Moreover, implementing Paxos in…
▽ More
This paper describes an implementation of the well-known consensus protocol, Paxos, in the P4 programming language. P4 is a language for programming the behavior of network forwarding devices (i.e., the network data plane). Moving consensus logic into network devices could significantly improve the performance of the core infrastructure and services in data centers. Moreover, implementing Paxos in P4 provides a critical use case and set of requirements for data plane language designers. In the long term, we imagine that consensus could someday be offered as a network service, just as point-to-point communication is provided today.
△ Less
Submitted 16 November, 2015;
originally announced November 2015.
-
Stretching Multi-Ring Paxos
Authors:
Samuel Benz,
Leandro Pacheco de Sousa,
Fernando Pedone
Abstract:
Internet-scale services rely on data partitioning and replication to provide scalable performance and high availability. Moreover, to reduce user-perceived response times and tolerate disasters (i.e., the failure of a whole datacenter), services are increasingly becoming geographically distributed. Data partitioning and replication, combined with local and geographical distribution, introduce daun…
▽ More
Internet-scale services rely on data partitioning and replication to provide scalable performance and high availability. Moreover, to reduce user-perceived response times and tolerate disasters (i.e., the failure of a whole datacenter), services are increasingly becoming geographically distributed. Data partitioning and replication, combined with local and geographical distribution, introduce daunting challenges, including the need to carefully order requests among replicas and partitions. One way to tackle this problem is to use group communication primitives that encapsulate order requirements. This paper presents a detailed performance evaluation of Multi-Ring Paxos, a scalable group communication primitive. We focus our analysis on "extreme conditions" with deployments including high-end 10 Gbps networks, a large number of combined rings (i.e., independent Paxos instances), a large number of replicas in a ring, and a global deployment. We also report on the performance of recovery under peak load and present two novel extensions to boost Multi-Ring Paxos's performance.
△ Less
Submitted 20 April, 2015;
originally announced April 2015.
-
Merlin: A Language for Provisioning Network Resources
Authors:
Robert Soulé,
Shrutarshi Basu,
Parisa Jalili Marandi,
Fernando Pedone,
Robert Kleinberg,
Emin Gün Sirer,
Nate Foster
Abstract:
This paper presents Merlin, a new framework for managing resources in software-defined networks. With Merlin, administrators express high-level policies using programs in a declarative language. The language includes logical predicates to identify sets of packets, regular expressions to encode forwarding paths, and arithmetic formulas to specify bandwidth constraints. The Merlin compiler uses a co…
▽ More
This paper presents Merlin, a new framework for managing resources in software-defined networks. With Merlin, administrators express high-level policies using programs in a declarative language. The language includes logical predicates to identify sets of packets, regular expressions to encode forwarding paths, and arithmetic formulas to specify bandwidth constraints. The Merlin compiler uses a combination of advanced techniques to translate these policies into code that can be executed on network elements including a constraint solver that allocates bandwidth using parameterizable heuristics. To facilitate dynamic adaptation, Merlin provides mechanisms for delegating control of sub-policies and for verifying that modifications made to sub-policies do not violate global constraints. Experiments demonstrate the expressiveness and scalability of Merlin on real-world topologies and applications. Overall, Merlin simplifies network administration by providing high-level abstractions for specifying network policies and scalable infrastructure for enforcing them.
△ Less
Submitted 4 July, 2014;
originally announced July 2014.
-
Building global and scalable systems with Atomic Multicast
Authors:
Samuel Benz,
Parisa Jalili Marandi,
Fernando Pedone,
Benoît Garbinato
Abstract:
The rise of worldwide Internet-scale services demands large distributed systems. Indeed, when handling several millions of users, it is common to operate thousands of servers spread across the globe. Here, replication plays a central role, as it contributes to improve the user experience by hiding failures and by providing acceptable latency. In this paper, we claim that atomic multicast, with str…
▽ More
The rise of worldwide Internet-scale services demands large distributed systems. Indeed, when handling several millions of users, it is common to operate thousands of servers spread across the globe. Here, replication plays a central role, as it contributes to improve the user experience by hiding failures and by providing acceptable latency. In this paper, we claim that atomic multicast, with strong and well-defined properties, is the appropriate abstraction to efficiently design and implement globally scalable distributed systems. We substantiate our claim with the design of two modern online services atop atomic multicast, a strongly consistent key-value store and a distributed log. In addition to presenting the design of these services, we experimentally assess their performance in a geographically distributed deployment.
△ Less
Submitted 29 June, 2014;
originally announced June 2014.
-
Optimistic Parallel State-Machine Replication
Authors:
Parisa Jalili Marandi,
Fernando Pedone
Abstract:
State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses servers, which are increasingly parallel (i.e., multicore). To narrow the gap between state-machine replication requirements and the characteristics of modern s…
▽ More
State-machine replication, a fundamental approach to fault tolerance, requires replicas to execute commands deterministically, which usually results in sequential execution of commands. Sequential execution limits performance and underuses servers, which are increasingly parallel (i.e., multicore). To narrow the gap between state-machine replication requirements and the characteristics of modern servers, researchers have recently come up with alternative execution models. This paper surveys existing approaches to parallel state-machine replication and proposes a novel optimistic protocol that inherits the scalable features of previous techniques. Using a replicated B+-tree service, we demonstrate in the paper that our protocol outperforms the most efficient techniques by a factor of 2.4 times.
△ Less
Submitted 27 April, 2014;
originally announced April 2014.
-
Practical Experience Report: The Performance of Paxos in the Cloud
Authors:
Parisa Jalili Marandi,
Samuel Benz,
Fernando Pedone,
Ken Birman
Abstract:
This experience report presents the results of an extensive performance evaluation conducted using four open-source implementations of Paxos deployed in Amazon's EC2. Paxos is a fundamental algorithm for building fault-tolerant services, at the core of state-machine replication. Implementations of Paxos are currently used in many prototypes and production systems in both academia and industry. Alt…
▽ More
This experience report presents the results of an extensive performance evaluation conducted using four open-source implementations of Paxos deployed in Amazon's EC2. Paxos is a fundamental algorithm for building fault-tolerant services, at the core of state-machine replication. Implementations of Paxos are currently used in many prototypes and production systems in both academia and industry. Although all protocols surveyed in the paper implement Paxos, they are optimized in a number of different ways, resulting in very different behavior, as we show in the paper. We have considered a variety of configurations and failure-free and faulty executions. In addition to reporting our findings, we propose and assess additional optimizations to existing implementations.
△ Less
Submitted 27 April, 2014;
originally announced April 2014.
-
Ring Paxos: High-Throughput Atomic Broadcast
Authors:
Parisa Jalili Marandi,
Marco Primi,
Nicolas Schiper,
Fernando Pedone
Abstract:
Atomic broadcast is an important communication primitive often used to implement state-machine replication. Despite the large number of atomic broadcast algorithms proposed in the literature, few papers have discussed how to turn these algorithms into efficient executable protocols. This paper focuses on a class of atomic broadcast algorithms based on Paxos, with its corresponding desirable proper…
▽ More
Atomic broadcast is an important communication primitive often used to implement state-machine replication. Despite the large number of atomic broadcast algorithms proposed in the literature, few papers have discussed how to turn these algorithms into efficient executable protocols. This paper focuses on a class of atomic broadcast algorithms based on Paxos, with its corresponding desirable properties: safety under asynchrony assumptions, liveness under weak synchrony assumptions, and resiliency-optimality. The paper presents two protocols, M-Ring Paxos and U-Ring Paxos, derived from Paxos. The protocols inherit the properties of Paxos and can be implemented very efficiently. We report a detailed performance analysis of M-Ring Paxos and U-Ring Paxos and compare them to other atomic broadcast protocols.
△ Less
Submitted 23 January, 2014;
originally announced January 2014.
-
Parallel Deferred Update Replication
Authors:
Leandro Pacheco,
Daniele Sciascia,
Fernando Pedone
Abstract:
Deferred update replication (DUR) is an established approach to implementing highly efficient and available storage. While the throughput of read-only transactions scales linearly with the number of deployed replicas in DUR, the throughput of update transactions experiences limited improvements as replicas are added. This paper presents Parallel Deferred Update Replication (P-DUR), a variation of…
▽ More
Deferred update replication (DUR) is an established approach to implementing highly efficient and available storage. While the throughput of read-only transactions scales linearly with the number of deployed replicas in DUR, the throughput of update transactions experiences limited improvements as replicas are added. This paper presents Parallel Deferred Update Replication (P-DUR), a variation of classical DUR that scales both read-only and update transactions with the number of cores available in a replica. In addition to introducing the new approach, we describe its full implementation and compare its performance to classical DUR and to Berkeley DB, a well-known standalone database.
△ Less
Submitted 3 December, 2013;
originally announced December 2013.
-
Rethinking State-Machine Replication for Parallelism
Authors:
Parisa Jalili Marandi,
Carlos Eduardo Bezerra,
Fernando Pedone
Abstract:
State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the same output upon executing the same sequence of commands. These requirements usually result in single-threaded replicas, which hinders service performance. This pa…
▽ More
State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the same output upon executing the same sequence of commands. These requirements usually result in single-threaded replicas, which hinders service performance. This paper introduces Parallel State-Machine Replication (P-SMR), a new approach to parallelism in state-machine replication. P-SMR scales better than previous proposals since no component plays a centralizing role in the execution of independent commands---those that can be executed concurrently, as defined by the service. The paper introduces P-SMR, describes a "commodified architecture" to implement it, and compares its performance to other proposals using a key-value store and a networked file system.
△ Less
Submitted 24 November, 2013;
originally announced November 2013.