-
Derivative-based regularization for regression
Authors:
Enrico Lopedoto,
Maksim Shekhunov,
Vitaly Aksenov,
Kizito Salako,
Tillman Weyde
Abstract:
In this work, we introduce a novel approach to regularization in multivariable regression problems. Our regularizer, called DLoss, penalises differences between the model's derivatives and derivatives of the data generating function as estimated from the training data. We call these estimated derivatives data derivatives. The goal of our method is to align the model to the data, not only in terms of target values but also in terms of the derivatives involved. To estimate data derivatives, we select (from the training data) 2-tuples of input-value pairs, using either nearest-neighbour or random selection. On synthetic and real datasets, we evaluate the effectiveness of adding DLoss, with different weights, to the standard mean squared error loss. The experimental results show that with DLoss (using nearest-neighbour selection) we obtain, on average, the best rank with respect to MSE on validation data sets, compared to no regularization, L2 regularization, and Dropout.
Submitted 1 May, 2024;
originally announced May 2024.
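As a rough illustration of the idea in the abstract above, the following sketch pairs each training point with its nearest neighbour, estimates a "data derivative" as the finite-difference slope between the paired targets, and penalises the gap to the model's slope over the same pair. The exact formula used in the paper is not given in the abstract, so the finite-difference form, the function names (`nearest_neighbour_pairs`, `dloss`, `total_loss`) and the `weight` parameter are all assumptions made for illustration.

```python
import numpy as np

def nearest_neighbour_pairs(X):
    """Return index arrays (i, j) where j is the nearest other point to i."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.arange(len(X)), dist.argmin(axis=1)

def dloss(model, X, y, eps=1e-12):
    """Mean squared gap between model slopes and data slopes (assumed form)."""
    i, j = nearest_neighbour_pairs(X)
    step = np.linalg.norm(X[j] - X[i], axis=-1) + eps  # eps guards duplicates
    data_deriv = (y[j] - y[i]) / step
    model_deriv = (model(X[j]) - model(X[i])) / step
    return np.mean((model_deriv - data_deriv) ** 2)

def total_loss(model, X, y, weight):
    """Standard MSE plus the weighted derivative penalty."""
    mse = np.mean((model(X) - y) ** 2)
    return mse + weight * dloss(model, X, y)
```

Here `model` is any callable mapping an `(n, d)` array to `n` predictions. A gradient-based trainer would normally use automatic differentiation rather than the finite difference shown for the model side.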
-
Trigram-Based Persistent IDE Indices with Quick Startup
Authors:
Zakhar Iakovlev,
Alexey Chulkov,
Nikita Golikov,
Vyacheslav Lukianov,
Nikita Zinoviev,
Dmitry Ivanov,
Vitaly Aksenov
Abstract:
One common way to speed up the find operation within a set of text files involves a trigram index. This structure is merely a map from a trigram (sequence consisting of three characters) to a set of files which contain it. When searching for a pattern, potential file locations are identified by intersecting the sets related to the trigrams in the pattern. Then, the search proceeds only in these files.
However, in a code repository, the trigram index evolves across different versions. Upon checking out a new version, this index is typically built from scratch, which is a time-consuming task, while we want our index to have almost zero-time startup.
Thus, we explore the persistent version of a trigram index for full-text and keyword pattern search. Our approach reuses the current version of the trigram index and applies only the changes between versions during checkout, significantly enhancing performance. Furthermore, we extend our data structure to accommodate CamelHump search for class and function names.
Submitted 6 March, 2024;
originally announced March 2024.
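The basic (non-persistent) trigram index described in the abstract above can be sketched as follows: a map from each trigram to the set of files containing it, with candidate files found by intersecting the sets for the pattern's trigrams and then verified by an actual search. The function names and the in-memory `files` dictionary are hypothetical. The persistence across versions and the CamelHump extension from the paper are not shown.

```python
from collections import defaultdict

def trigrams(text):
    """All distinct three-character substrings of text."""
    return {text[k:k + 3] for k in range(len(text) - 2)}

def build_index(files):
    """files: dict of name -> content. Returns trigram -> set of file names."""
    index = defaultdict(set)
    for name, content in files.items():
        for tri in trigrams(content):
            index[tri].add(name)
    return index

def search(index, files, pattern):
    """Return the sorted names of files that contain pattern."""
    tris = trigrams(pattern)
    if not tris:
        # Patterns shorter than three characters have no trigrams,
        # so every file remains a candidate.
        candidates = set(files)
    else:
        candidates = set.intersection(*(index.get(t, set()) for t in tris))
    # The index only narrows the candidates; confirm with a real search.
    return sorted(name for name in candidates if pattern in files[name])
```

The final verification step matters because a file can contain every trigram of the pattern without containing the pattern itself.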
-
In the Search of Optimal Tree Networks: Hardness and Heuristics
Authors:
Maxim Buzdalov,
Pavel Martynov,
Sergey Pankratov,
Vitaly Aksenov,
Stefan Schmid
Abstract:
Demand-aware communication networks are networks whose topology is optimized toward the traffic they need to serve. These networks have recently been enabled by novel optical communication technologies and are investigated intensively in the context of datacenters. In this work, we consider networks with one of the most common topologies: a binary tree.
We show that finding an optimal demand-aware binary tree network is NP-hard. Then, we propose optimization algorithms that generate efficient binary tree networks on real-life and synthetic workloads.
Submitted 6 March, 2024;
originally announced March 2024.
-
On the Complexity of the Virtual Network Embedding in Specific Tree Topologies
Authors:
Sergey Pankratov,
Vitaly Aksenov,
Stefan Schmid
Abstract:
Virtual networks are an innovative abstraction that extends cloud computing concepts to the network: by supporting bandwidth reservations between compute nodes (e.g., virtual machines), virtual networks can provide predictable performance to distributed and communication-intensive cloud applications. However, in order to make the most efficient use of the shared resources, the Virtual Network Embedding (VNE) problem has to be solved: a virtual network should be mapped onto the given physical network so that resource reservations are minimized. The problem has already been studied intensively and is known to be NP-hard in general. In this paper, we revisit this problem and consider it on specific topologies, as they often arise in practice. To be more precise, we study the weighted version of the VNE problem: we consider a virtual weighted network of a specific topology which we want to embed onto a weighted network with capacities and a specific topology. As for topologies, we consider the most fundamental and commonly used ones: line, star, $2$-tiered star, oversubscribed $2$-tiered star, and tree, in addition to arbitrary topologies. We show that the VNE problem is typically NP-hard even in more specialized cases; however, sometimes there exists a polynomial algorithm: for example, an embedding of the oversubscribed $2$-tiered star onto the tree is polynomial while an embedding of an arbitrary $2$-tiered star is not.
Submitted 9 November, 2023;
originally announced November 2023.
-
Efficient Self-Adjusting Search Trees via Lazy Updates
Authors:
Alexander Slastin,
Dan Alistarh,
Vitaly Aksenov
Abstract:
Self-adjusting data structures are a classic approach to adapting the complexity of operations to the data access distribution. While several self-adjusting variants are known for both binary search trees and B-Trees, existing constructions come with limitations. For instance, existing works on self-adjusting B-Trees do not provide static-optimality and tend to be complex and inefficient to implement in practice. In this paper, we provide a new approach to build efficient self-adjusting search trees based on state-of-the-art non-adaptive structures. We illustrate our approach to obtain a new efficient self-adjusting Interpolation Search Tree (IST) and B-Tree, as well as a new self-adjusting tree called the Log Tree. Of note, our self-adjusting IST has expected complexity in $O(\log \frac{\log m}{\log ac(x)})$, where $m$ is the total number of requests and $ac(x)$ is the number of requests to key $x$. Our technique leads to simple constructions with a reduced number of pointer manipulations: this improves cache efficiency and even allows an efficient concurrent implementation.
Submitted 8 October, 2023;
originally announced October 2023.
-
Wait-free Trees with Asymptotically-Efficient Range Queries
Authors:
Ilya Kokorin,
Dan Alistarh,
Vitaly Aksenov
Abstract:
Tree data structures, such as red-black trees, quad trees, treaps, or tries, are fundamental tools in computer science. A classical problem in concurrency is to obtain expressive, efficient, and scalable versions of practical tree data structures. We are interested in concurrent trees supporting range queries, i.e., queries that involve multiple consecutive data items. Existing implementations with this capability can list keys in a specific range, but do not support aggregate range queries: for instance, if we want to calculate the number of keys in a range, the only choice is to retrieve a whole list and return its size. This is suboptimal: in the sequential setting, one can augment a balanced search tree with counters and, consequently, perform these aggregate requests in logarithmic rather than linear time.
In this paper, we propose a generic approach to implement a broad class of range queries on concurrent trees in a way that is wait-free, asymptotically efficient, and practically scalable. The key idea is a new mechanism for maintaining metadata concurrently at tree nodes, which can be seen as a wait-free variant of hand-over-hand locking (which we call hand-over-hand helping). We implement, test, and benchmark a balanced binary search tree with wait-free insert, delete, contains, and count operations, where count returns the number of keys in a given range; the benchmarks validate the expected speedups of our method in practice.
Submitted 8 October, 2023;
originally announced October 2023.
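The sequential technique the abstract above refers to, augmenting a search tree with counters so that a range count takes time proportional to the tree height, can be sketched as below. This is a single-threaded illustration only: it omits rebalancing (so the logarithmic bound holds only if the tree happens to be balanced) and does not attempt the wait-free mechanism of the paper. All names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of keys in the subtree rooted here

def size(node):
    return node.size if node else 0

def insert(node, key):
    """Insert key (duplicates ignored) and refresh the size counters."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def count_less(node, key, inclusive=False):
    """Count keys < key (or <= key when inclusive) along one root-to-leaf path."""
    total = 0
    while node:
        if key > node.key or (inclusive and key == node.key):
            total += size(node.left) + 1  # this node and its whole left subtree
            node = node.right
        else:
            node = node.left
    return total

def count_range(root, lo, hi):
    """Number of keys k with lo <= k <= hi."""
    return max(0, count_less(root, hi, inclusive=True) - count_less(root, lo))
```

Each call to `count_less` visits one node per level, so no list of keys in the range is ever materialised.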
-
Parallel-batched Interpolation Search Tree
Authors:
Ilya Kokorin,
Vitaly Aksenov,
Alena Martsenyuk
Abstract:
A sorted set (or map) is one of the most used data types in computer science. In addition to standard set operations, like Insert, Remove, and Contains, it can provide set-set operations such as Union, Intersection, and Difference. Each of these set-set operations is equivalent to some batched operation: the data structure should be able to execute Insert, Remove, and Contains on a batch of keys. It is obvious that we want these "large" operations to be parallelized. These sets are usually implemented with trees of logarithmic height, such as 2-3 trees, treaps, AVL trees, red-black trees, etc. Until now, little attention was devoted to data structures that work asymptotically better under several restrictions on the stored data. In this work, we parallelize the Interpolation Search Tree, which is expected to serve requests from a smooth distribution in doubly-logarithmic time. Our data structure of size n performs a batch of m operations in O(m log log n) work and poly-log span.
Submitted 23 June, 2023;
originally announced June 2023.
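For readers unfamiliar with the underlying idea, the sketch below shows plain interpolation search on a sorted array of integers: instead of probing the middle, it guesses the position of the target from its value, which is what gives doubly-logarithmic expected time on smoothly distributed keys. This is a sequential array version for illustration only, not the tree structure or the parallel batched algorithm from the paper, and the function name is hypothetical.

```python
def interpolation_search(keys, target):
    """Return the index of target in the sorted integer list keys, or -1."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal; avoid dividing by zero.
            return lo if keys[lo] == target else -1
        # Guess the position by linear interpolation between the endpoints.
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On evenly spaced keys the first probe usually lands on or next to the target, while adversarial distributions can degrade the search to linear time.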
-
Benchmark Framework with Skewed Workloads
Authors:
Vitaly Aksenov,
Dmitry Ivanov,
Ravil Galiev
Abstract:
In this work, we present a new benchmarking suite with new real-life inspired skewed workloads to test the performance of concurrent index data structures. We started this project to prepare workloads specifically for self-adjusting data structures, i.e., structures that handle more frequent requests faster and, thus, should perform better than their standard counterparts. Looking for inspiration, we reviewed the suites commonly used to test the performance of concurrent indices (Synchrobench, Setbench, YCSB, and TPC) and found several issues with them.
The major problem is that they are not flexible: it is difficult to introduce new workloads, it is difficult to set the duration of the experiments, and it is difficult to change the parameters. We decided to solve this issue by presenting a new suite based on Synchrobench.
Finally, we highlight the problem of measuring the performance of data structures. We show that the relative performance of data structures depends heavily on the workload: it is not clear which data structure is best. To this end, we take three state-of-the-art concurrent binary search trees and run them on the workloads from our benchmarking suite. As a result, we obtain six experiments covering all possible relative performance orderings of the chosen data structures.
Submitted 18 May, 2023;
originally announced May 2023.
-
Toward Self-Adjusting k-ary Search Tree Networks
Authors:
Evgenii Feder,
Anton Paramonov,
Pavel Mavrin,
Iosif Salem,
Stefan Schmid,
Vitaly Aksenov
Abstract:
Datacenter networks are becoming increasingly flexible with the incorporation of new networking technologies, such as optical circuit switches. These technologies allow for programmable network topologies that can be reconfigured to better serve network traffic, thus enabling a trade-off between the benefits (i.e., shorter routes) and costs of reconfigurations (i.e., overhead). Self-Adjusting Networks (SANs) aim at addressing this trade-off by exploiting patterns in network traffic, both when it is revealed piecewise (online dynamic topologies) and when it is known in advance (offline static topologies). In this paper, we take the first steps toward Self-Adjusting k-ary tree networks. These are more powerful generalizations of existing binary search tree networks (like SplayNets), which have been at the core of SAN designs. k-ary search tree networks are a natural generalization offering nodes of higher degrees, reduced route lengths for a fixed number of nodes, and local routing in spite of reconfigurations. We first compute an offline (optimal) static network for arbitrary traffic patterns in $O(n^3 \cdot k)$ time via dynamic programming, and also improve the bound to $O(n^2 \cdot k)$ for the special case of uniformly distributed traffic. Then, we present a centroid-based topology of the network that can be used both in the offline static and the online setting. In the offline uniform-workload case, we construct this quasi-optimal network in linear time $O(n)$ and, finally, we present online self-adjusting k-ary search tree versions of SplayNet. We experimentally evaluate our new structure for $k=2$ (allowing for a comparison with existing SplayNets) on real and synthetic network traces. Our results show that this approach works better than SplayNet in most of the real network traces and in average-to-low-locality synthetic traces, and is only slightly inferior to SplayNet in all remaining traces.
Submitted 26 June, 2024; v1 submitted 25 February, 2023;
originally announced February 2023.
-
Unexpected Scaling in Path Copying Trees
Authors:
Ilya Kokorin,
Alexander Fedorov,
Trevor Brown,
Vitaly Aksenov
Abstract:
Although a wide variety of handcrafted concurrent data structures have been proposed, there is considerable interest in universal approaches (henceforth called Universal Constructions or UCs) for building concurrent data structures. These approaches (semi-)automatically convert a sequential data structure into a concurrent one. The simplest approach uses locks that protect a sequential data structure and allow only one process to access it at a time. The resulting data structures use locks, and hence are blocking. Most work on UCs instead focuses on obtaining non-blocking progress guarantees such as obstruction-freedom, lock-freedom, or wait-freedom. Many non-blocking UCs have appeared. Key examples include the seminal wait-free UC by Herlihy, a NUMA-aware UC by Yi et al., and an efficient UC for large objects by Fatourou et al.
We borrow ideas from persistent data structures and multi-version concurrency control (MVCC), most notably path copying, and use them to implement concurrent versions of sequential persistent data structures. Despite our expectation that our data structures would not scale under write-heavy workloads, they scale in practice. We confirm this scaling analytically in our model with private per-process caches.
Submitted 2 December, 2022; v1 submitted 1 December, 2022;
originally announced December 2022.
-
Self-Adjusting Linear Networks with Ladder Demand Graph
Authors:
Anton Paramonov,
Iosif Salem,
Stefan Schmid,
Vitaly Aksenov
Abstract:
Self-adjusting networks (SANs) have the ability to adapt to communication demand by dynamically adjusting the workload (or demand) embedding, i.e., the mapping of communication requests into the network topology. SANs can thus reduce routing costs for frequently communicating node pairs by paying a cost for adjusting the embedding. This is particularly beneficial when the demand has structure, which the network can adapt to. Demand can be represented in the form of a demand graph, which is defined by the set of network nodes (vertices) and the set of pairwise communication requests (edges). Thus, adapting to the demand can be interpreted as embedding the demand graph into the network topology. This can be challenging both when the demand graph is known in advance (offline) and when it is revealed edge-by-edge (online). The difficulty also depends on whether we aim at constructing a static topology or a dynamic (self-adjusting) one that improves the embedding as more parts of the demand graph are revealed. Yet very little is known about these self-adjusting embeddings.
In this paper, the network topology is restricted to a line and the demand graph to a ladder graph, i.e., a $2 \times n$ grid, including all possible subgraphs of the ladder. We present an online self-adjusting network that matches the known lower bound asymptotically and is $12$-competitive in terms of request cost. As a warm-up result, we present an asymptotically optimal algorithm for the cycle demand graph. We also present an oracle-based algorithm for an arbitrary demand graph that has a constant overhead.
Submitted 23 February, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Performance Prediction for Coarse-Grained Locking: MCS Case
Authors:
Vitaly Aksenov,
Daniil Bolotov,
Petr Kuznetsov
Abstract:
A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is the alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. It was already shown that simple stochastic analysis can predict the throughput of coarse-grained lock-based algorithms using the CLH lock. In this short paper, we extend this analysis to algorithms based on the popular MCS lock.
Submitted 11 October, 2021;
originally announced October 2021.
-
Parallel Batched Interpolation Search Tree
Authors:
Vitaly Aksenov,
Ilya Kokorin,
Alena Martsenyuk
Abstract:
An ordered set (or map) is one of the most used data types. In addition to standard set operations, like insert, delete, and contains, it can provide set-set operations such as union, intersection, and difference. Each of these set-set operations is equivalent to a batched operation: the data structure should process a batch of insert, delete, and contains operations. It is obvious that we want these "large" operations to be parallelized. Typically, these sets are implemented with trees of logarithmic height, such as 2-3 trees, treaps, AVL trees, red-black trees, etc. Until now, little attention was devoted to data structures that work better, but only under several restrictions on the data. In this work, we parallelize the Interpolation Search Tree, which serves each request from a smooth distribution in doubly-logarithmic time. Our data structure of size $n$ performs a batch of $m$ operations in $O(m \log\log n)$ work and poly-log span.
Submitted 11 October, 2021;
originally announced October 2021.
-
Overview of Bachelors Theses 2021
Authors:
Vitaly Aksenov
Abstract:
In this work, we review Bachelors Theses done under the supervision of Vitaly Aksenov at ITMO University. This overview contains short descriptions of six theses: "Development of a Streaming Algorithm for the Decomposition of Graph Metrics to Tree Metrics" by Oleg Fafurin, "Development of Memory-friendly Concurrent Data Structures" by Roman Smirnov, "Theoretical Analysis of the Performance of Concurrent Data Structures" by Daniil Bolotov, "Parallel Batched Interpolation Search Tree" by Alena Martsenyuk, "Parallel Batched Self-adjusting Data Structures" by Vitalii Krasnov, and "Parallel Batched Persistent Binary Search Trees" by Ildar Zinatulin.
Submitted 26 July, 2021;
originally announced July 2021.
-
Execution of NVRAM Programs with Persistent Stack
Authors:
Vitaly Aksenov,
Ohad Ben-Baruch,
Danny Hendler,
Ilya Kokorin,
Matan Rusanovsky
Abstract:
Non-Volatile Random Access Memory (NVRAM) is a novel type of hardware that combines the benefits of traditional persistent memory (persistency of data over hardware failures) and DRAM (fast random access). In this work, we describe an algorithm that can be used to execute NVRAM programs and recover the system after a hardware failure while taking the architecture of real-world NVRAM systems into account. Moreover, the algorithm can be used to execute NVRAM-destined programs on commodity persistent hardware, such as hard drives. This allows us to test NVRAM algorithms using only cheap hardware, without having access to NVRAM. We report on using our algorithm to implement and test an NVRAM CAS algorithm.
Submitted 25 May, 2021;
originally announced May 2021.
-
Memory Bounds for Concurrent Bounded Queues
Authors:
Vitaly Aksenov,
Nikita Koval,
Petr Kuznetsov,
Anton Paramonov
Abstract:
Concurrent data structures often require additional memory for handling synchronization issues in addition to memory for storing elements. Depending on the amount of this additional memory, implementations can be more or less memory-friendly. A memory-optimal implementation enjoys the minimal possible memory overhead, which, in practice, reduces cache misses and unnecessary memory reclamation.
In this paper, we discuss the memory-optimality of non-blocking bounded queues. Essentially, we investigate the possibility of constructing an implementation that utilizes a pre-allocated array to store elements and constant memory overhead, e.g., two positioning counters for enqueue(..) and dequeue() operations. Such an implementation can be readily constructed when the ABA problem is precluded, e.g., assuming that the hardware supports LL/SC instructions or all inserted elements are distinct. However, in the general case, we show that a memory-optimal non-blocking bounded queue incurs linear overhead in the number of concurrent processes. These results not only provide helpful intuition for concurrent algorithm developers but also open a new research avenue on the memory-optimality phenomenon in concurrent data structures.
Submitted 16 January, 2024; v1 submitted 30 April, 2021;
originally announced April 2021.
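The layout discussed in the abstract above, a pre-allocated array plus two positioning counters, looks as follows in the sequential case. This sketch is single-threaded and makes no attempt at the non-blocking synchronization the paper studies; it only shows why the memory overhead is constant when no concurrency is involved. The class and field names are hypothetical.

```python
class BoundedQueue:
    """Sequential bounded FIFO queue: one fixed array and two counters."""

    def __init__(self, capacity):
        self.items = [None] * capacity  # pre-allocated storage
        self.enq = 0  # total number of successful enqueues so far
        self.deq = 0  # total number of successful dequeues so far

    def enqueue(self, x):
        """Append x; return False if the queue is full."""
        if self.enq - self.deq == len(self.items):
            return False
        self.items[self.enq % len(self.items)] = x
        self.enq += 1
        return True

    def dequeue(self):
        """Remove and return the oldest element, or None if empty."""
        if self.enq == self.deq:
            return None
        x = self.items[self.deq % len(self.items)]
        self.deq += 1
        return x
```

In a concurrent setting, a slot can be reused between a thread's read and its update of a counter, which is the kind of ABA hazard the abstract mentions.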
-
NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Authors:
Ali Ramezani-Kebrya,
Fartash Faghri,
Ilya Markov,
Vitalii Aksenov,
Dan Alistarh,
Daniel M. Roy
Abstract:
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant which we call QSGDinf, which demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has stronger theoretical guarantees than QSGD and matches or exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.
Submitted 1 May, 2021; v1 submitted 28 April, 2021;
originally announced April 2021.
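To make "quantizes gradients" concrete, here is a minimal sketch of unbiased stochastic quantization onto uniformly spaced levels, in the general style of the QSGD baseline mentioned above. It is an illustration under assumptions: the nonuniform levels that NUQSGD proposes and the encoding step are not reproduced, and the function name and `levels` parameter are hypothetical.

```python
import numpy as np

def stochastic_quantize(v, levels, rng):
    """Round each |v_i| / ||v|| to a multiple of 1/levels, unbiased in expectation."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * levels  # each entry lies in [0, levels]
    lower = np.floor(scaled)
    prob_up = scaled - lower  # round up with this probability
    rounded = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * rounded / levels
```

Because the rounding probability equals the fractional part, the expected value of each quantized coordinate equals the original coordinate, so averaging many quantized gradients recovers the true gradient.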
-
The Splay-List: A Distribution-Adaptive Concurrent Skip-List
Authors:
Vitaly Aksenov,
Dan Alistarh,
Alexandra Drozdova,
Amirkeivan Mohtashami
Abstract:
The design and implementation of efficient concurrent data structures have seen significant attention. However, most of this work has focused on concurrent data structures providing good worst-case guarantees. In real workloads, objects are often accessed at different rates, since access distributions may be non-uniform. Efficient distribution-adaptive data structures are known in the sequential case, e.g., splay trees; however, they are often hard to translate efficiently to the concurrent case.
In this paper, we investigate distribution-adaptive concurrent data structures and propose a new design called the splay-list. At a high level, the splay-list is similar to a standard skip-list, with the key distinction that the height of each element adapts dynamically to its access rate: popular elements "move up," whereas rarely-accessed elements decrease in height. We show that the splay-list provides order-optimal amortized complexity bounds for a subset of operations while being amenable to efficient concurrent implementation. Experimental results show that the splay-list can leverage distribution-adaptivity to improve on the performance of classic concurrent designs, and can outperform the only previously-known distribution-adaptive design in certain settings.
Submitted 3 August, 2020;
originally announced August 2020.
-
Application of accelerated fixed-point algorithms to hydrodynamic well-fracture coupling
Authors:
Vitalii Aksenov,
Maxim Chertov,
Konstantin Sinkov
Abstract:
The coupled simulations of dynamic interactions between the well, hydraulic fractures and reservoir have significant importance in some areas of petroleum reservoir engineering. Several approaches to the problem of coupling between the numerical models of these parts of the full system have been developed in the industry in past years. One of the possible approaches allowing formulation of the problem as a fixed-point problem is studied in the present work. Accelerated Anderson's and Aitken's fixed-point algorithms are applied to the coupling problem. Accelerated algorithms are compared with traditional Picard iterations on the representative set of test cases including ones remarkably problematic for coupling. Relative performance is measured, and the robustness of the algorithms is tested. Accelerated algorithms enable a significant (up to two orders of magnitude) performance boost in some cases and convergent solutions in the cases where simple Picard iterations fail. Based on the analysis, we provide recommendations for the choice of the particular algorithm and tunable relaxation parameter depending on anticipated complexity of the problem.
Submitted 1 May, 2020;
originally announced May 2020.
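The abstract above contrasts plain Picard iterations with accelerated fixed-point schemes. The following is only a minimal sketch of that contrast on a scalar toy problem, not the coupling algorithm from the paper: it compares Picard iteration with a Steffensen-style use of Aitken's delta-squared extrapolation. The function names `picard` and `aitken`, the test map `cos`, and the tolerances are assumptions chosen for illustration.

```python
import math


def picard(g, x0, tol=1e-10, max_iter=1000):
    """Plain fixed-point (Picard) iteration x_{k+1} = g(x_k).

    Returns the approximate fixed point and the number of iterations used.
    """
    x = x0
    for k in range(1, max_iter + 1):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("Picard iteration did not converge")


def aitken(g, x0, tol=1e-10, max_iter=1000):
    """Aitken's delta-squared extrapolation, restarted every step.

    Each iteration evaluates g twice and extrapolates from the three
    points x, g(x), g(g(x)) (this restarted form is Steffensen's method).
    """
    x = x0
    for k in range(1, max_iter + 1):
        x1 = g(x)
        x2 = g(x1)
        denom = x2 - 2.0 * x1 + x
        if denom == 0.0:
            # The three iterates coincide to machine precision: converged.
            return x2, k
        x_new = x - (x1 - x) ** 2 / denom
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("Aitken iteration did not converge")


if __name__ == "__main__":
    # Toy problem (an assumption for illustration): solve x = cos(x).
    root_p, iters_p = picard(math.cos, 1.0)
    root_a, iters_a = aitken(math.cos, 1.0)
    print(f"Picard: x = {root_p:.10f} after {iters_p} iterations")
    print(f"Aitken: x = {root_a:.10f} after {iters_a} iterations")
```

On this toy map the extrapolated scheme needs far fewer evaluations of `g` than Picard iteration. In a coupling setting each evaluation of `g` would correspond to a full pass through the coupled solvers, which is where such savings would matter; the vector-valued Anderson and Aitken variants studied in the paper are not reproduced here.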
-
Relaxed Scheduling for Scalable Belief Propagation
Authors:
Vitaly Aksenov,
Dan Alistarh,
Janne H. Korhonen
Abstract:
The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning. Consequently, there has been considerable effort invested into developing efficient parallel variants of classic machine learning algorithms. However, despite the wealth of knowledge on parallelization, some classic machine learning algorithms often prove hard to parallelize efficiently while maintaining convergence.
In this paper, we focus on efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm. We address the challenge of efficiently parallelizing this classic paradigm by showing how to leverage scalable relaxed schedulers in this context. We present an extensive empirical study, showing that our approach outperforms previous parallel belief propagation implementations both in terms of scalability and in terms of wall-clock convergence time, on a range of practical applications.
Submitted 18 January, 2021; v1 submitted 25 February, 2020;
originally announced February 2020.
-
NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization
Authors:
Ali Ramezani-Kebrya,
Fartash Faghri,
Ilya Markov,
Vitalii Aksenov,
Dan Alistarh,
Daniel M. Roy
Abstract:
As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has stronger theoretical guarantees than QSGD while matching or exceeding the empirical performance of the QSGDinf heuristic and of other compression methods.
Submitted 3 May, 2021; v1 submitted 16 August, 2019;
originally announced August 2019.
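The abstract above builds on QSGD-style gradient quantization. The following is a minimal sketch of the baseline idea only, namely unbiased stochastic rounding of each coordinate onto `s` uniformly spaced levels scaled by the vector's Euclidean norm. It is not the nonuniform scheme proposed in the paper, and it omits the encoding step. The function name `quantize_uniform` and the parameter names are assumptions for illustration.

```python
import math
import random


def quantize_uniform(v, s, rng):
    """Stochastically quantize v onto s uniform levels per sign.

    Each coordinate x is mapped to sign(x) * ||v||_2 * (level / s), where
    level is one of the two integers bracketing |x| / ||v||_2 * s, chosen
    at random so that the result equals x in expectation.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    out = []
    for x in v:
        scaled = abs(x) / norm * s          # lies in [0, s]
        lower = math.floor(scaled)
        prob_up = scaled - lower            # probability of rounding up
        level = lower + (1 if rng.random() < prob_up else 0)
        out.append(math.copysign(norm * level / s, x))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    grad = [0.3, -0.4, 1.2, 0.0]            # hypothetical gradient vector
    print(quantize_uniform(grad, s=4, rng=rng))
```

Because each coordinate takes one of at most `2 * s + 1` values once the norm is known, a worker can transmit the norm plus a few bits per coordinate instead of full-precision floats. Averaging many independent quantizations recovers the original vector, which is the unbiasedness property that convergence analyses of such schemes rely on.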
-
Performance Prediction for Coarse-Grained Locking
Authors:
Vitaly Aksenov,
Dan Alistarh,
Petr Kuznetsov
Abstract:
A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is an alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. A lock can be viewed as a queue that arbitrates the order in which the critical sections are executed, and a natural question is whether we can use stochastic analysis to predict the resulting throughput. As preliminary evidence in the affirmative, we describe a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms. We show that our model works well for the CLH lock, and we expect it to work for other popular lock designs such as TTAS, MCS, etc.
Submitted 25 April, 2019;
originally announced April 2019.
-
Parallel Combining: Benefits of Explicit Synchronization
Authors:
Vitaly Aksenov,
Petr Kuznetsov,
Anatoly Shalyto
Abstract:
Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel batched one. The idea is that we explicitly synchronize concurrent operations into batches: one of the processes becomes a combiner which collects concurrent requests and initiates a parallel batched algorithm involving the owners (clients) of the collected requests. Intuitively, the cost of synchronizing the concurrent calls can be compensated by running the parallel batched algorithm.
We validate the intuition via two applications of parallel combining. First, we use our technique to design a concurrent data structure optimized for read-dominated workloads, taking a dynamic graph data structure as an example. Second, we use a novel parallel batched priority queue to build a concurrent one. In both cases, we obtain performance gains with respect to the state-of-the-art algorithms.
Submitted 13 November, 2018; v1 submitted 20 October, 2017;
originally announced October 2017.
-
Flat Parallelization
Authors:
Vitaly Aksenov,
Petr Kuznetsov
Abstract:
There are two intertwined factors that affect performance of concurrent data structures: the ability of processes to access the data in parallel and the cost of synchronization. It has been observed that for a large class of "concurrency-unfriendly" data structures, fine-grained parallelization does not pay off: an implementation based on a single global lock outperforms fine-grained solutions. The flat combining paradigm exploits this by ensuring that a thread holding the global lock sequentially combines requests and then executes the combined requests on behalf of concurrent threads.
In this paper, we propose a synchronization technique that unites flat combining and parallel bulk updates borrowed from parallel algorithms designed for the PRAM model. The idea is that the combiner thread assigns waiting threads to perform concurrent requests in parallel.
We expect the technique to help in implementing efficient "concurrency-ambivalent" data structures, which can benefit from both parallelism and serialization, depending on the operational context. To validate the idea, we considered heap-based implementations of a priority queue. These data structures exhibit two important features: concurrent remove operations are likely to conflict and thus may benefit from combining, while concurrent insert operations can often be at least partly applied in parallel and thus may benefit from parallel batching. We show that the resulting flat parallelization algorithm performs well compared to state-of-the-art priority queue implementations.
Submitted 9 May, 2017; v1 submitted 8 May, 2017;
originally announced May 2017.
-
A Concurrency-Optimal Binary Search Tree
Authors:
Vitaly Aksenov,
Vincent Gramoli,
Petr Kuznetsov,
Anna Malova,
Srivatsan Ravi
Abstract:
The paper presents the first concurrency-optimal implementation of a binary search tree (BST). The implementation, based on a standard sequential implementation of an internal tree, ensures that every schedule, i.e., interleaving of steps of the sequential code, is accepted unless linearizability is violated. To ensure this property, we use a novel read-write locking scheme that protects tree edges in addition to nodes.
Our implementation outperforms the state-of-the-art BSTs on most basic workloads, which suggests that optimizing the set of accepted schedules of the sequential code can be an adequate design principle for efficient concurrent data structures.
Submitted 2 March, 2017; v1 submitted 14 February, 2017;
originally announced February 2017.
-
A Concurrency-Optimal List-Based Set
Authors:
Vitaly Aksenov,
Vincent Gramoli,
Petr Kuznetsov,
Srivatsan Ravi,
Di Shang
Abstract:
Designing an efficient concurrent data structure is an important challenge that is not easy to meet. Intuitively, efficiency of an implementation is defined, in the first place, by its ability to process applied operations in parallel, without using unnecessary synchronization. As we show in this paper, even for a data structure as simple as a linked list used to implement the set type, the most efficient algorithms known so far are not concurrency-optimal: they may reject correct concurrent schedules.
We propose a new algorithm for the list-based set based on a value-aware try-lock that we show to achieve optimal concurrency: it only rejects concurrent schedules that violate correctness of the implemented set type. We show empirically that reaching optimality does not induce a significant overhead. In fact, our implementation of the concurrency-optimal algorithm outperforms both the Lazy Linked List and the Harris-Michael state-of-the-art algorithms.
Submitted 14 January, 2021; v1 submitted 5 February, 2015;
originally announced February 2015.