Search | arXiv e-print repository

DSig: Breaking the Barrier of Signatures in Data Centers

Authors: Marcos K. Aguilera, Clément Burgelin, Rachid Guerraoui, Antoine Murat, Athanasios Xygkis, Igor Zablotchi

Abstract: Data centers increasingly host mutually distrustful users on shared infrastructure. A powerful tool to safeguard such users are digital signatures. Digital signatures have revolutionized Internet-scale applications, but current signatures are too slow for the growing genre of microsecond-scale systems in modern data centers. We propose DSig, the first digital signature system to achieve single-dig… ▽ More Data centers increasingly host mutually distrustful users on shared infrastructure. A powerful tool to safeguard such users are digital signatures. Digital signatures have revolutionized Internet-scale applications, but current signatures are too slow for the growing genre of microsecond-scale systems in modern data centers. We propose DSig, the first digital signature system to achieve single-digit microsecond latency to sign, transmit, and verify signatures in data center systems. DSig is based on the observation that, in many data center applications, the signer of a message knows most of the time who will verify its signature. We introduce a new hybrid signature scheme that combines cheap single-use hash-based signatures verified in the foreground with traditional signatures pre-verified in the background. Compared to prior state-of-the-art signatures, DSig reduces signing time from 18.9 to 0.7 us and verification time from 35.6 to 5.1 us, while kee** signature transmission time below 2.5 us. Moreover, DSig achieves 2.5x higher signing throughput and 6.9x higher verification throughput than the state of the art. We use DSig to (a) bring auditability to two key-value stores (HERD and Redis) and a financial trading system (based on Liquibook) for 86% lower added latency than the state of the art, and (b) replace signatures in BFT broadcast and BFT replication, reducing their latency by 73% and 69%, respectively △ Less

Submitted 11 June, 2024; originally announced June 2024.

Comments: To appear in the proceedings of OSDI '24. Authors listed in alphabetical order

arXiv:2403.08374 [pdf, other]

Efficient Signature-Free Validated Agreement

Authors: Pierre Civit, Muhammad Ayaz Dzulfikar, Seth Gilbert, Rachid Guerraoui, Jovan Komatovic, Manuel Vidigueira, Igor Zablotchi

Abstract: Byzantine agreement enables n processes to agree on a common L-bit value, despite up to t > 0 arbitrary failures. A long line of work has been dedicated to improving the bit complexity of Byzantine agreement in synchrony. This has culminated in COOL, an error-free (deterministically secure against a computationally unbounded adversary) solution that achieves O(nL + n^2 logn) worst-case bit complex… ▽ More Byzantine agreement enables n processes to agree on a common L-bit value, despite up to t > 0 arbitrary failures. A long line of work has been dedicated to improving the bit complexity of Byzantine agreement in synchrony. This has culminated in COOL, an error-free (deterministically secure against a computationally unbounded adversary) solution that achieves O(nL + n^2 logn) worst-case bit complexity (which is optimal for L > n logn according to the Dolev-Reischuk lower bound). COOL satisfies strong validity: if all correct processes propose the same value, only that value can be decided. Strong validity is, however, not appropriate for today's state machine replication (SMR) and blockchain protocols. These systems value progress and require a decided value to always be valid, excluding default decisions (such as EMPTY) even in cases where there is no unanimity a priori. Validated Byzantine agreement satisfies this property (called external validity). Yet, the best error-free (or even signature-free) validated agreement solutions achieve only O(n^2L) bit complexity, a far cry from the Omega(nL + n^2) Dolev-Reishcuk lower bound. In this paper, we present two new synchronous algorithms for validated Byzantine agreement, HashExt and ErrorFreeExt, with different trade-offs. Both algorithms are (1) signature-free, (2) optimally resilient (tolerate up to t < n / 3 failures), and (3) early-stop** (terminate in O(f+1) rounds, where f <= t is the actual number of failures). On the one hand, HashExt uses only hashes and achieves O(nL + n^3 kappa) bit complexity, which is optimal for L > n^2 kappa (where kappa is the size of a hash). On the other hand, ErrorFreeExt is error-free, using no cryptography whatsoever, and achieves O( (nL + n^2) logn ) bit complexity, which is near-optimal for any L. △ Less

Submitted 17 May, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.10059 [pdf, other]

Partial Synchrony for Free? New Upper Bounds for Byzantine Agreement

Authors: Pierre Civit, Muhammad Ayaz Dzulfikar, Seth Gilbert, Rachid Guerraoui, Jovan Komatovic, Manuel Vidigueira, Igor Zablotchi

Abstract: Byzantine agreement allows n processes to decide on a common value, in spite of arbitrary failures. The seminal Dolev-Reischuk bound states that any deterministic solution to Byzantine agreement exchanges Omega(n^2) bits. In synchronous networks, solutions with optimal O(n^2) bit complexity, optimal fault tolerance, and no cryptography have been established for over three decades. However, these s… ▽ More Byzantine agreement allows n processes to decide on a common value, in spite of arbitrary failures. The seminal Dolev-Reischuk bound states that any deterministic solution to Byzantine agreement exchanges Omega(n^2) bits. In synchronous networks, solutions with optimal O(n^2) bit complexity, optimal fault tolerance, and no cryptography have been established for over three decades. However, these solutions lack robustness under adverse network conditions. Therefore, research has increasingly focused on Byzantine agreement for partially synchronous networks. Numerous solutions have been proposed for the partially synchronous setting. However, these solutions are notoriously hard to prove correct, and the most efficient cryptography-free algorithms still require O(n^3) exchanged bits in the worst case. In this paper, we introduce Oper, the first generic transformation of deterministic Byzantine agreement algorithms from synchrony to partial synchrony. Oper requires no cryptography, is optimally resilient (n >= 3t+1, where t is the maximum number of failures), and preserves the worst-case per-process bit complexity of the transformed synchronous algorithm. Leveraging Oper, we present the first partially synchronous Byzantine agreement algorithm that (1) achieves optimal O(n^2) bit complexity, (2) requires no cryptography, and (3) is optimally resilient (n >= 3t+1), thus showing that the Dolev-Reischuk bound is tight even in partial synchrony. Moreover, we adapt Oper for long values and obtain several new partially synchronous algorithms with improved complexity and weaker (or completely absent) cryptographic assumptions. △ Less

Submitted 5 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

arXiv:2401.16292 [pdf, other]

Pilotfish: Distributed Transaction Execution for Lazy Blockchains

Authors: Quentin Kniep, Lefteris Kokoris-Kogias, Alberto Sonnino, Igor Zablotchi, Nuda Zhang

Abstract: Pilotfish is the first scale-out blockchain execution engine able to harness any degree of parallelizability existing in its workload. Pilotfish allows each validator to employ multiple machines, named ExecutionWorkers, under its control to scale its execution layer. Given a sufficiently parallelizable and compute-intensive load, the number of transactions that the validator can execute increases… ▽ More Pilotfish is the first scale-out blockchain execution engine able to harness any degree of parallelizability existing in its workload. Pilotfish allows each validator to employ multiple machines, named ExecutionWorkers, under its control to scale its execution layer. Given a sufficiently parallelizable and compute-intensive load, the number of transactions that the validator can execute increases linearly with the number of ExecutionWorkers at its disposal. In addition, Pilotfish maintains the consistency of the state, even when many validators experience simultaneous machine failures. This is possible due to the meticulous co-design of our crash-recovery protocol which leverages the existing fault tolerance in the blockchain's consensus mechanism. Finally, Pilotfish can also be seen as the first distributed deterministic execution engine that provides support for dynamic reads as transactions are not required to provide a fully accurate read and write set. This loosening of requirements would normally reduce the parallelizability available by blocking write-after-write conflicts, but our novel versioned-queues scheduling algorithm circumvents this by exploiting the lazy recovery property of Pilotfish, which only persists consistent state and re-executes any optimistic steps taken before the crash. In order to prove our claims we implemented the common path of Pilotfish with support for the MoveVM and evaluated it against the parallel execution MoveVM of Sui. Our results show that Pilotfish provides good scalability up to 8 ExecutionWorkers for a variety of workloads. In computationally-heavy workloads, Pilotfish's scalability is linear. △ Less

Submitted 16 February, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.08015 [pdf, other]

Parallel $k$-Core Decomposition with Batched Updates and Asynchronous Reads

Authors: Quanquan C. Liu, Julian Shun, Igor Zablotchi

Abstract: Maintaining a dynamic $k$-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for maintaining an approximate $k$-core decomposition. In their solution, both reads and updates need to be batched, and therefore each type of operation can incur high latency waiting… ▽ More Maintaining a dynamic $k$-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for maintaining an approximate $k$-core decomposition. In their solution, both reads and updates need to be batched, and therefore each type of operation can incur high latency waiting for the other type to finish. To tackle most real-world workloads, which are dominated by reads, this paper presents a novel hybrid concurrent-parallel dynamic $k$-core data structure where asynchronous reads can proceed concurrently with batches of updates, leading to significantly lower read latencies. Our approach is based on tracking causal dependencies between updates, so that causally related groups of updates appear atomic to concurrent readers. Our data structure guarantees linearizability and liveness for both reads and updates, and maintains the same approximation guarantees as prior work. Our experimental evaluation on a 30-core machine shows that our approach reduces read latency by orders of magnitude compared to the batch-dynamic algorithm, up to a $\left(4.05 \cdot 10^{5}\right)$-factor. Compared to an unsynchronized (non-linearizable) baseline, our read latency overhead is only up to a $3.21$-factor greater, while improving accuracy of coreness estimates by up to a factor of $52.7$. △ Less

Submitted 15 January, 2024; originally announced January 2024.

Comments: To appear in PPoPP 2024

arXiv:2303.14259 [pdf, other]

Honeycomb: ordered key-value store acceleration on an FPGA-based SmartNIC

Authors: Junyi Liu, Aleksandar Dragojevic, Shane Flemming, Antonios Katsarakis, Dario Korolija, Igor Zablotchi, Ho-cheung Ng, Anuj Kalia, Miguel Castro

Abstract: In-memory ordered key-value stores are an important building block in modern distributed applications. We present Honeycomb, a hybrid software-hardware system for accelerating read-dominated workloads on ordered key-value stores that provides linearizability for all operations including scans. Honeycomb stores a B-Tree in host memory, and executes SCAN and GET on an FPGA-based SmartNIC, and PUT, U… ▽ More In-memory ordered key-value stores are an important building block in modern distributed applications. We present Honeycomb, a hybrid software-hardware system for accelerating read-dominated workloads on ordered key-value stores that provides linearizability for all operations including scans. Honeycomb stores a B-Tree in host memory, and executes SCAN and GET on an FPGA-based SmartNIC, and PUT, UPDATE and DELETE on the CPU. This approach enables large stores and simplifies the FPGA implementation but raises the challenge of data access and synchronization across the slow PCIe bus. We describe how Honeycomb overcomes this challenge with careful data structure design, caching, request parallelism with out-of-order request execution, wait-free read operations, and batching synchronization between the CPU and the FPGA. For read-heavy YCSB workloads, Honeycomb improves the throughput of a state-of-the-art ordered key-value store by at least 1.8x. For scan-heavy workloads inspired by cloud storage, Honeycomb improves throughput by more than 2x. The cost-performance, which is more important for large-scale deployments, is improved by at least 1.5x on these workloads. △ Less

Submitted 6 April, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

arXiv:2302.07348 [pdf, other]

Cliff-Learning

Authors: Tony T. Wang, Igor Zablotchi, Nir Shavit, Jonathan S. Rosenfeld

Abstract: We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model clif… ▽ More We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improves at a faster than power law rate (i.e. regions of concavity on a log-log scaling plot). We conduct an in-depth investigation of foundation-model cliff-learning and study toy models of the phenomenon. We observe that the degree of cliff-learning reflects the degree of compatibility between the priors of a learning algorithm and the task being learned. △ Less

Submitted 6 June, 2023; v1 submitted 14 February, 2023; originally announced February 2023.

Comments: 16 pages; v2 updates: improved layout, added acknowledgements

arXiv:2210.17174 [pdf, other]

uBFT: Microsecond-scale BFT using Disaggregated Memory [Extended Version]

Authors: Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Antoine Murat, Athanasios Xygkis, Igor Zablotchi

Abstract: We propose uBFT, the first State Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only $2f{+}1$ replicas to tolerate $f$ Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pure crashes appear to be a mere illusion with real-life systems reportedly failing in many unexpected ways. uBFT relies on a small non-tail… ▽ More We propose uBFT, the first State Machine Replication (SMR) system to achieve microsecond-scale latency in data centers, while using only $2f{+}1$ replicas to tolerate $f$ Byzantine failures. The Byzantine Fault Tolerance (BFT) provided by uBFT is essential as pure crashes appear to be a mere illusion with real-life systems reportedly failing in many unexpected ways. uBFT relies on a small non-tailored trusted computing base -- disaggregated memory -- and consumes a practically bounded amount of memory. uBFT is based on a novel abstraction called Consistent Tail Broadcast, which we use to prevent equivocation while bounding memory. We implement uBFT using RDMA-based disaggregated memory and obtain an end-to-end latency of as little as 10us. This is at least 50$\times$ faster than MinBFT , a state of the art $2f{+}1$ BFT SMR based on Intel's SGX. We use uBFT to replicate two KV-stores (Memcached and Redis), as well as a financial order matching engine (Liquibook). These applications have low latency (up to 20us) and become Byzantine tolerant with as little as 10us more. The price for uBFT is a small amount of reliable disaggregated memory (less than 1 MiB), which in our prototype consists of a small number of memory servers connected through RDMA and replicated for fault tolerance. △ Less

Submitted 16 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

arXiv:2108.01330 [pdf, other]

Frugal Byzantine Computing

Authors: M. K. Aguilera, N. Ben-David, R. Guerraoui, D. Papuc, A. Xygkis, I. Zablotchi

Abstract: Traditional techniques for handling Byzantine failures are expensive: digital signatures are too costly, while using $3f{+}1$ replicas is uneconomical ($f$ denotes the maximum number of Byzantine processes). We seek algorithms that reduce the number of replicas to $2f{+}1$ and minimize the number of signatures. While the first goal can be achieved in the message-and-memory model, accomplishing the… ▽ More Traditional techniques for handling Byzantine failures are expensive: digital signatures are too costly, while using $3f{+}1$ replicas is uneconomical ($f$ denotes the maximum number of Byzantine processes). We seek algorithms that reduce the number of replicas to $2f{+}1$ and minimize the number of signatures. While the first goal can be achieved in the message-and-memory model, accomplishing the second goal simultaneously is challenging. We first address this challenge for the problem of broadcasting messages reliably. We consider two variants of this problem, Consistent Broadcast and Reliable Broadcast, typically considered very close. Perhaps surprisingly, we establish a separation between them in terms of signatures required. In particular, we show that Consistent Broadcast requires at least 1 signature in some execution, while Reliable Broadcast requires $O(n)$ signatures in some execution. We present matching upper bounds for both primitives within constant factors. We then turn to the problem of consensus and argue that this separation matters for solving consensus with Byzantine failures: we present a practical consensus algorithm that uses Consistent Broadcast as its main communication primitive. This algorithm works for $n=2f{+}1$ and avoids signatures in the common-case -- properties that have not been simultaneously achieved previously. Overall, our work approaches Byzantine computing in a frugal manner and motivates the use of Consistent Broadcast -- rather than Reliable Broadcast -- as a key primitive for reaching agreement. △ Less

Submitted 3 August, 2021; originally announced August 2021.

Comments: This paper is an extended version of the DISC 2021 paper

arXiv:2010.06288 [pdf, other]

Microsecond Consensus for Microsecond Applications

Authors: Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, Athanasios Xygkis, Igor Zablotchi

Abstract: We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memo… ▽ More We consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail-over the system - this cuts the replication and fail-over latencies of the prior systems by at least 61% and 90%. Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more - challenges that we address in this paper through a judicious combination of RDMA permissions and distributed algorithmic design. We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: Full version of OSDI'20 paper

arXiv:2008.02527 [pdf, other]

Efficient Multi-word Compare and Swap

Authors: Rachid Guerraoui, Alex Kogan, Virendra J. Marathe, Igor Zablotchi

Abstract: Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes requi… ▽ More Atomic lock-free multi-word compare-and-swap (MCAS) is a powerful tool for designing concurrent algorithms. Yet, its widespread usage has been limited because lock-free implementations of MCAS make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k+1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS. We first prove in this paper that it is impossible to "pack" the information required to perform a k-word CAS (k-CAS) in less than k locations to be CASed. Then we present the first algorithm that requires k+1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k+1 CASes per k-CAS. △ Less

Submitted 6 August, 2020; originally announced August 2020.

Comments: Full version of DISC '20 paper

arXiv:1905.12143 [pdf, other]

The Impact of RDMA on Agreement

Authors: Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Igor Zablotchi

Abstract: Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the in… ▽ More Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires $n \geq 2f_P + 1$ processes (where $f_P$ is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires $n \geq f_P + 1$ processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions. △ Less

Submitted 25 February, 2021; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: Full version of PODC'19 paper, strengthened broadcast algorithm

Showing 1–12 of 12 results for author: Zablotchi, I