Search | arXiv e-print repository

Derivative-based regularization for regression

Authors: Enrico Lopedoto, Maksim Shekhunov, Vitaly Aksenov, Kizito Salako, Tillman Weyde

Abstract: In this work, we introduce a novel approach to regularization in multivariable regression problems. Our regularizer, called DLoss, penalises differences between the model's derivatives and derivatives of the data generating function as estimated from the training data. We call these estimated derivatives data derivatives. The goal of our method is to align the model to the data, not only in terms of target values but also in terms of the derivatives involved. To estimate data derivatives, we select (from the training data) 2-tuples of input-value pairs, using either nearest-neighbour or random selection. On synthetic and real datasets, we evaluate the effectiveness of adding DLoss, with different weights, to the standard mean squared error loss. The experimental results show that with DLoss (using nearest-neighbour selection) we obtain, on average, the best rank with respect to MSE on validation data sets, compared to no regularization, L2 regularization, and Dropout.

Submitted 1 May, 2024; originally announced May 2024.
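
The idea in the abstract above can be illustrated with a rough one-dimensional sketch. Everything here is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, data derivatives are estimated as finite-difference slopes between nearest-neighbour pairs, the model's derivative is approximated by the same finite difference, and inputs are assumed to be distinct.

```python
def nearest_neighbour_pairs(xs):
    """Pair each input index with the index of its nearest other input (1-D)."""
    pairs = []
    for i, xi in enumerate(xs):
        j = min((k for k in range(len(xs)) if k != i),
                key=lambda k: abs(xs[k] - xi))
        pairs.append((i, j))
    return pairs


def mse(model, xs, ys):
    """Standard mean squared error on the target values."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def derivative_penalty(model, xs, ys, pairs):
    """Mean squared gap between model slopes and data slopes over the pairs."""
    total = 0.0
    for i, j in pairs:
        dx = xs[j] - xs[i]  # assumed non-zero: inputs are distinct
        data_slope = (ys[j] - ys[i]) / dx
        model_slope = (model(xs[j]) - model(xs[i])) / dx
        total += (model_slope - data_slope) ** 2
    return total / len(pairs)


def total_loss(model, xs, ys, weight):
    """MSE plus a weighted derivative-mismatch term (hypothetical combination)."""
    pairs = nearest_neighbour_pairs(xs)
    return mse(model, xs, ys) + weight * derivative_penalty(model, xs, ys, pairs)
```

In this toy version, a model that is shifted by a constant pays only the MSE term, while a model with the wrong slope pays both terms.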

arXiv:2403.03751 [pdf, other]

Trigram-Based Persistent IDE Indices with Quick Startup

Authors: Zakhar Iakovlev, Alexey Chulkov, Nikita Golikov, Vyacheslav Lukianov, Nikita Zinoviev, Dmitry Ivanov, Vitaly Aksenov

Abstract: One common way to speed up the find operation within a set of text files involves a trigram index. This structure is merely a map from a trigram (sequence consisting of three characters) to a set of files which contain it. When searching for a pattern, potential file locations are identified by intersecting the sets related to the trigrams in the pattern. Then, the search proceeds only in these files. However, in a code repository, the trigram index evolves across different versions. Upon checking out a new version, this index is typically built from scratch, which is a time-consuming task, while we want our index to have almost zero-time startup. Thus, we explore the persistent version of a trigram index for full-text and keyword pattern search. Our approach just uses the current version of the trigram index and applies only the changes between versions during checkout, significantly enhancing performance. Furthermore, we extend our data structure to accommodate CamelHump search for class and function names.

Submitted 6 March, 2024; originally announced March 2024.
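
The basic (non-persistent) trigram index described in the abstract above can be sketched as follows. This is a minimal illustration with hypothetical function names and an in-memory dictionary of file contents. It does not cover the persistent, version-aware index or the CamelHump extension that the paper itself is about.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of file names whose content contains it."""
    index = defaultdict(set)
    for name, content in files.items():
        for tri in trigrams(content):
            index[tri].add(name)
    return index


def search(index, files, pattern):
    """Return the sorted names of files containing pattern.

    The index only narrows down candidates by intersecting the file sets
    of the pattern's trigrams; each candidate is then checked directly.
    """
    candidates = None
    for tri in trigrams(pattern):
        hits = index.get(tri, set())
        candidates = hits if candidates is None else candidates & hits
    if candidates is None:
        # Pattern shorter than three characters: no trigram to filter on.
        candidates = set(files)
    return sorted(name for name in candidates if pattern in files[name])
```

The final substring check is needed because sharing all trigrams with the pattern does not guarantee that a file contains the pattern itself.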

arXiv:2403.03724 [pdf, other]

In the Search of Optimal Tree Networks: Hardness and Heuristics

Authors: Maxim Buzdalov, Pavel Martynov, Sergey Pankratov, Vitaly Aksenov, Stefan Schmid

Abstract: Demand-aware communication networks are networks whose topology is optimized toward the traffic they need to serve. These networks have recently been enabled by novel optical communication technologies and are investigated intensively in the context of datacenters. In this work, we consider networks with one of the most common topologies: a binary tree. We show that finding an optimal demand-aware binary tree network is NP-hard. Then, we propose optimization algorithms that generate efficient binary tree networks on real-life and synthetic workloads.

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2311.05474 [pdf, ps, other]

On the Complexity of the Virtual Network Embedding in Specific Tree Topologies

Authors: Sergey Pankratov, Vitaly Aksenov, Stefan Schmid

Abstract: Virtual networks are an innovative abstraction that extends cloud computing concepts to the network: by supporting bandwidth reservations between compute nodes (e.g., virtual machines), virtual networks can provide a predictable performance to distributed and communication-intensive cloud applications. However, in order to make the most efficient use of the shared resources, the Virtual Network Embedding (VNE) problem has to be solved: a virtual network should be mapped onto the given physical network so that resource reservations are minimized. The problem has been studied intensively already and is known to be NP-hard in general. In this paper, we revisit this problem and consider it on specific topologies, as they often arise in practice. To be more precise, we study the weighted version of the VNE problem: we consider a virtual weighted network of a specific topology which we want to embed onto a weighted network with capacities and a specific topology. As for topologies, we consider the most fundamental and commonly used ones: line, star, $2$-tiered star, oversubscribed $2$-tiered star, and tree, in addition to arbitrary topologies. We show that the VNE problem is typically NP-hard even in more specialized cases; however, sometimes there exists a polynomial algorithm: for example, an embedding of the oversubscribed $2$-tiered star onto the tree is polynomial while an embedding of an arbitrary $2$-tiered star is not.

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2310.05298 [pdf, other]

Efficient Self-Adjusting Search Trees via Lazy Updates

Authors: Alexander Slastin, Dan Alistarh, Vitaly Aksenov

Abstract: Self-adjusting data structures are a classic approach to adapting the complexity of operations to the data access distribution. While several self-adjusting variants are known for both binary search trees and B-Trees, existing constructions come with limitations. For instance, existing works on self-adjusting B-Trees do not provide static-optimality and tend to be complex and inefficient to implement in practice. In this paper, we provide a new approach to build efficient self-adjusting search trees based on state-of-the-art non-adaptive structures. We illustrate our approach to obtain a new efficient self-adjusting Interpolation Search Tree (IST) and B-Tree, as well as a new self-adjusting tree called the Log Tree. Of note, our self-adjusting IST has expected complexity in $O(\log \frac{\log m}{\log ac(x)})$, where $m$ is the total number of requests and $ac(x)$ is the number of requests to key $x$. Our technique leads to simple constructions with a reduced number of pointer manipulations: this improves cache efficiency and even allows an efficient concurrent implementation.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2310.05293 [pdf, other]

Wait-free Trees with Asymptotically-Efficient Range Queries

Authors: Ilya Kokorin, Dan Alistarh, Vitaly Aksenov

Abstract: Tree data structures, such as red-black trees, quad trees, treaps, or tries, are fundamental tools in computer science. A classical problem in concurrency is to obtain expressive, efficient, and scalable versions of practical tree data structures. We are interested in concurrent trees supporting range queries, i.e., queries that involve multiple consecutive data items. Existing implementations with this capability can list keys in a specific range, but do not support aggregate range queries: for instance, if we want to calculate the number of keys in a range, the only choice is to retrieve a whole list and return its size. This is suboptimal: in the sequential setting, one can augment a balanced search tree with counters and, consequently, perform these aggregate requests in logarithmic rather than linear time. In this paper, we propose a generic approach to implement a broad class of range queries on concurrent trees in a way that is wait-free, asymptotically efficient, and practically scalable. The key idea is a new mechanism for maintaining metadata concurrently at tree nodes, which can be seen as a wait-free variant of hand-over-hand locking (which we call hand-over-hand helping). We implement, test, and benchmark a balanced binary search tree with wait-free insert, delete, contains, and count operations (the last returning the number of keys in a given range), which validates in practice the speedups expected from our method.

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2306.13785 [pdf, other]

Parallel-batched Interpolation Search Tree

Authors: Ilya Kokorin, Vitaly Aksenov, Alena Martsenyuk

Abstract: A sorted set (or map) is one of the most used data types in computer science. In addition to standard set operations, like Insert, Remove, and Contains, it can provide set-set operations such as Union, Intersection, and Difference. Each of these set-set operations is equivalent to some batched operation: the data structure should be able to execute Insert, Remove, and Contains on a batch of keys. It is obvious that we want these "large" operations to be parallelized. These sets are usually implemented with the trees of logarithmic height, such as 2-3 trees, treaps, AVL trees, red-black trees, etc. Until now, little attention was devoted to data structures that work asymptotically better under several restrictions on the stored data. In this work, we parallelize Interpolation Search Tree which is expected to serve requests from a smooth distribution in doubly-logarithmic time. Our data structure of size n performs a batch of m operations in O(m log log n) work and poly-log span.

Submitted 23 June, 2023; originally announced June 2023.

Comments: 28 pages, 10 sections, 17 figures, 23 references

ACM Class: D.1.3

arXiv:2305.10872 [pdf, other]

Benchmark Framework with Skewed Workloads

Authors: Vitaly Aksenov, Dmitry Ivanov, Ravil Galiev

Abstract: In this work, we present a new benchmarking suite with new real-life inspired skewed workloads to test the performance of concurrent index data structures. We started this project to prepare workloads specifically for self-adjusting data structures, i.e., structures that handle more frequent requests faster and, thus, should perform better than their standard counterparts. Looking for inspiration, we examined the suites commonly used to test the performance of concurrent indices (Synchrobench, Setbench, YCSB, and TPC) and found several issues with them. The major problem is that they are not flexible: it is difficult to introduce new workloads, it is difficult to set the duration of the experiments, and it is difficult to change the parameters. We decided to solve this issue by presenting a new suite based on Synchrobench. Finally, we highlight the problem of measuring performance of data structures. We show that the relative performance of data structures highly depends on the workload: it is not clear which data structure is best. For that, we take three state-of-the-art concurrent binary search trees and run them on the workloads from our benchmarking suite. As a result, we get six experiments covering all possible relative performance orderings of the chosen data structures.

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2302.13113 [pdf, other]

Toward Self-Adjusting k-ary Search Tree Networks

Authors: Evgenii Feder, Anton Paramonov, Pavel Mavrin, Iosif Salem, Stefan Schmid, Vitaly Aksenov

Abstract: Datacenter networks are becoming increasingly flexible with the incorporation of new networking technologies, such as optical circuit switches. These technologies allow for programmable network topologies that can be reconfigured to better serve network traffic, thus enabling a trade-off between the benefits (i.e., shorter routes) and costs of reconfigurations (i.e., overhead). Self-Adjusting Networks (SANs) aim at addressing this trade-off by exploiting patterns in network traffic, both when it is revealed piecewise (online dynamic topologies) and when it is known in advance (offline static topologies). In this paper, we take the first steps toward Self-Adjusting k-ary tree networks. These are more powerful generalizations of existing binary search tree networks (like SplayNets), which have been at the core of SAN designs. k-ary search tree networks are a natural generalization offering nodes of higher degrees, reduced route lengths for a fixed number of nodes, and local routing in spite of reconfigurations. We first compute an offline (optimal) static network for arbitrary traffic patterns in $O(n^3 \cdot k)$ time via dynamic programming, and also improve the bound to $O(n^2 \cdot k)$ for the special case of uniformly distributed traffic. Then, we present a centroid-based topology of the network that can be used both in the offline static and the online setting. In the offline uniform-workload case, we construct this quasi-optimal network in linear time $O(n)$ and, finally, we present online self-adjusting k-ary search tree versions of SplayNet.
We experimentally evaluate our new structure for $k=2$ (allowing for a comparison with existing SplayNets) on real and synthetic network traces. Our results show that this approach works better than SplayNet in most of the real network traces and in average to low locality synthetic traces, and is only slightly inferior to SplayNet in all remaining traces.

Submitted 26 June, 2024; v1 submitted 25 February, 2023; originally announced February 2023.

arXiv:2212.00521 [pdf, other]

Unexpected Scaling in Path Copying Trees

Authors: Ilya Kokorin, Alexander Fedorov, Trevor Brown, Vitaly Aksenov

Abstract: Although a wide variety of handcrafted concurrent data structures have been proposed, there is considerable interest in universal approaches (henceforth called Universal Constructions or UCs) for building concurrent data structures. These approaches (semi-)automatically convert a sequential data structure into a concurrent one. The simplest approach uses locks that protect a sequential data structure and allow only one process to access it at a time. The resulting data structures use locks, and hence are blocking. Most work on UCs instead focuses on obtaining non-blocking progress guarantees such as obstruction-freedom, lock-freedom, or wait-freedom. Many non-blocking UCs have appeared. Key examples include the seminal wait-free UC by Herlihy, a NUMA-aware UC by Yi et al., and an efficient UC for large objects by Fatourou et al. We borrow ideas from persistent data structures and multi-version concurrency control (MVCC), most notably path copying, and use them to implement concurrent versions of sequential persistent data structures. Despite our expectation that our data structures would not scale under write-heavy workloads, they scale in practice. We confirm this scaling analytically in our model with private per-process caches.

Submitted 2 December, 2022; v1 submitted 1 December, 2022; originally announced December 2022.

arXiv:2207.03948 [pdf, other]

Self-Adjusting Linear Networks with Ladder Demand Graph

Authors: Anton Paramonov, Iosif Salem, Stefan Schmid, Vitaly Aksenov

Abstract: Self-adjusting networks (SANs) have the ability to adapt to communication demand by dynamically adjusting the workload (or demand) embedding, i.e., the mapping of communication requests into the network topology. SANs can thus reduce routing costs for frequently communicating node pairs by paying a cost for adjusting the embedding. This is particularly beneficial when the demand has structure, which the network can adapt to. Demand can be represented in the form of a demand graph, which is defined by the set of network nodes (vertices) and the set of pairwise communication requests (edges). Thus, adapting to the demand can be interpreted as embedding the demand graph into the network topology. This can be challenging both when the demand graph is known in advance (offline) and when it is revealed edge-by-edge (online). The difficulty also depends on whether we aim at constructing a static topology or a dynamic (self-adjusting) one that improves the embedding as more parts of the demand graph are revealed. Yet very little is known about these self-adjusting embeddings. In this paper, the network topology is restricted to a line and the demand graph to a ladder graph, i.e., a $2 \times n$ grid, including all possible subgraphs of the ladder. We present an online self-adjusting network that matches the known lower bound asymptotically and is $12$-competitive in terms of request cost. As a warm-up result, we present an asymptotically optimal algorithm for the cycle demand graph.
We also present an oracle-based algorithm for an arbitrary demand graph that has a constant overhead.

Submitted 23 February, 2023; v1 submitted 8 July, 2022; originally announced July 2022.

arXiv:2110.05545 [pdf, other]

Peformance Prediction for Coarse-Grained Locking: MCS Case

Authors: Vitaly Aksenov, Daniil Bolotov, Petr Kuznetsov

Abstract: A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is the alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. It was already shown that simple stochastic analysis can predict the throughput of coarse-grained lock-based algorithms using the CLH lock. In this short paper, we extend this analysis to algorithms based on the popular MCS lock.

Submitted 11 October, 2021; originally announced October 2021.

arXiv:2110.05540 [pdf, ps, other]

Parallel Batched Interpolation Search Tree

Authors: Vitaly Aksenov, Ilya Kokorin, Alena Martsenyuk

Abstract: The ordered set (or map) is one of the most used data types. In addition to standard set operations, like insert, delete, and contains, it can provide set-set operations such as union, intersection, and difference. Each of these set-set operations is equivalent to a batched operation: the data structure should process a batch of insert, delete, and contains operations. It is obvious that we want these "large" operations to be parallelized. Typically, these sets are implemented with trees of logarithmic height, such as the 2-3 tree, Treap, AVL tree, Red-Black tree, etc. Until now, little attention was devoted to data structures that work better but under several restrictions on the data. In this work, we parallelize the Interpolation Search Tree, which serves each request from a smooth distribution in doubly-logarithmic time. Our data structure of size $n$ performs a batch of $m$ operations in $O(m \log\log n)$ work and poly-log span.

Submitted 11 October, 2021; originally announced October 2021.

arXiv:2107.12332 [pdf, other]

Overview of Bachelors Theses 2021

Authors: Vitaly Aksenov

Abstract: In this work, we review Bachelors Theses done under the supervision of Vitaly Aksenov at ITMO University. This overview contains the short description of six theses: "Development of a Streaming Algorithm for the Decomposition of Graph Metrics to Tree Metrics" by Oleg Fafurin, "Development of Memory-friendly Concurrent Data Structures" by Roman Smirnov, "Theoretical Analysis of the Performance of Concurrent Data Structures" by Daniil Bolotov, "Parallel Batched Interpolation Search Tree" by Alena Martsenyuk, "Parallel Batched Self-adjusting Data Structures" by Vitalii Krasnov, and "Parallel Batched Persistent Binary Search Trees" by Ildar Zinatulin.

Submitted 26 July, 2021; originally announced July 2021.

arXiv:2105.11932 [pdf, other]

Execution of NVRAM Programs with Persistent Stack

Authors: Vitaly Aksenov, Ohad Ben-Baruch, Danny Hendler, Ilya Kokorin, Matan Rusanovsky

Abstract: Non-Volatile Random Access Memory (NVRAM) is a novel type of hardware that combines the benefits of traditional persistent memory (persistency of data over hardware failures) and DRAM (fast random access). In this work, we describe an algorithm that can be used to execute NVRAM programs and recover the system after a hardware failure while taking the architecture of real-world NVRAM systems into account. Moreover, the algorithm can be used to execute NVRAM-destined programs on commodity persistent hardware, such as hard drives. That allows us to test NVRAM algorithms using only cheap hardware, without having access to the NVRAM. We report the usage of our algorithm to implement and test an NVRAM CAS algorithm.

Submitted 25 May, 2021; originally announced May 2021.

arXiv:2104.15003 [pdf, other]

Memory Bounds for Concurrent Bounded Queues

Authors: Vitaly Aksenov, Nikita Koval, Petr Kuznetsov, Anton Paramonov

Abstract: Concurrent data structures often require additional memory for handling synchronization issues in addition to memory for storing elements. Depending on the amount of this additional memory, implementations can be more or less memory-friendly. A memory-optimal implementation enjoys the minimal possible memory overhead, which, in practice, reduces cache misses and unnecessary memory reclamation. In this paper, we discuss the memory-optimality of non-blocking bounded queues. Essentially, we investigate the possibility of constructing an implementation that utilizes a pre-allocated array to store elements and constant memory overhead, e.g., two positioning counters for enqueue(..) and dequeue() operations. Such an implementation can be readily constructed when the ABA problem is precluded, e.g., assuming that the hardware supports LL/SC instructions or all inserted elements are distinct. However, in the general case, we show that a memory-optimal non-blocking bounded queue incurs linear overhead in the number of concurrent processes. These results not only provide helpful intuition for concurrent algorithm developers but also open a new research avenue on the memory-optimality phenomenon in concurrent data structures.

Submitted 16 January, 2024; v1 submitted 30 April, 2021; originally announced April 2021.

arXiv:2104.13818

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

Authors: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

Abstract: As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant which we call QSGDinf, which demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it both has stronger theoretical guarantees than QSGD and matches or exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.

Submitted 1 May, 2021; v1 submitted 28 April, 2021; originally announced April 2021.

Comments: This entry is redundant and was created in error. See arXiv:1908.06077 for the latest version

arXiv:2008.01009 [pdf, other]

The Splay-List: A Distribution-Adaptive Concurrent Skip-List

Authors: Vitaly Aksenov, Dan Alistarh, Alexandra Drozdova, Amirkeivan Mohtashami

Abstract: The design and implementation of efficient concurrent data structures have seen significant attention. However, most of this work has focused on concurrent data structures providing good worst-case guarantees. In real workloads, objects are often accessed at different rates, since access distributions may be non-uniform. Efficient distribution-adaptive data structures are known in the sequential case, e.g., splay trees; however, they are often hard to translate efficiently to the concurrent case. In this paper, we investigate distribution-adaptive concurrent data structures and propose a new design called the splay-list. At a high level, the splay-list is similar to a standard skip-list, with the key distinction that the height of each element adapts dynamically to its access rate: popular elements "move up," whereas rarely-accessed elements decrease in height. We show that the splay-list provides order-optimal amortized complexity bounds for a subset of operations while being amenable to efficient concurrent implementation. Experimental results show that the splay-list can leverage distribution-adaptivity to improve on the performance of classic concurrent designs, and can outperform the only previously-known distribution-adaptive design in certain settings.

Submitted 3 August, 2020; originally announced August 2020.

arXiv:2005.01620 [pdf, other]

Application of accelerated fixed-point algorithms to hydrodynamic well-fracture coupling

Authors: Vitalii Aksenov, Maxim Chertov, Konstantin Sinkov

Abstract: The coupled simulations of dynamic interactions between the well, hydraulic fractures and reservoir have significant importance in some areas of petroleum reservoir engineering. Several approaches to the problem of coupling between the numerical models of these parts of the full system have been developed in the industry in past years. One of the possible approaches, which allows the problem to be formulated as a fixed-point problem, is studied in the present work. Accelerated Anderson's and Aitken's fixed-point algorithms are applied to the coupling problem. Accelerated algorithms are compared with traditional Picard iterations on a representative set of test cases, including ones that are remarkably problematic for coupling. Relative performance is measured, and the robustness of the algorithms is tested. Accelerated algorithms enable a significant (up to two orders of magnitude) performance boost in some cases and convergent solutions in the cases where simple Picard iterations fail. Based on the analysis, we provide recommendations for the choice of the particular algorithm and tunable relaxation parameter depending on the anticipated complexity of the problem.

Submitted 1 May, 2020; originally announced May 2020.
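
To make the contrast between Picard iteration and an accelerated fixed-point scheme concrete, here is a minimal sketch on a toy scalar problem. It is not the well-fracture coupling from the paper: the map `g(x) = cos(x)`, the function names `picard` and `aitken`, and the tolerances are all assumptions chosen for illustration, and the acceleration shown is the textbook scalar Aitken delta-squared extrapolation rather than the exact variant the authors use.

```python
import math


def picard(g, x0, tol=1e-10, max_iter=1000):
    """Plain fixed-point (Picard) iteration x <- g(x)."""
    x = x0
    for k in range(1, max_iter + 1):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iter


def aitken(g, x0, tol=1e-10, max_iter=1000):
    """Fixed-point iteration with scalar Aitken delta-squared extrapolation.

    Each outer step takes two Picard steps and extrapolates from them.
    """
    x = x0
    for k in range(1, max_iter + 1):
        x1 = g(x)
        x2 = g(x1)
        denom = x2 - 2.0 * x1 + x
        if denom == 0.0:
            # The iterates have stopped changing; no extrapolation possible.
            return x2, k
        x_new = x - (x1 - x) ** 2 / denom
        if abs(x_new - x) < tol:
            return x_new, k
        x = x_new
    return x, max_iter


# Toy fixed-point problem (an assumption for illustration): x = cos(x).
root_p, steps_p = picard(math.cos, 1.0)
root_a, steps_a = aitken(math.cos, 1.0)
print("Picard:", root_p, "after", steps_p, "steps")
print("Aitken:", root_a, "after", steps_a, "outer steps")
```

On this toy map both schemes reach the same fixed point, and the extrapolated scheme needs fewer evaluations of `g`; how much acceleration helps on a real coupling problem depends on the problem, which is what the paper measures.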

arXiv:2002.11505 [pdf, other]

Relaxed Scheduling for Scalable Belief Propagation

Authors: Vitaly Aksenov, Dan Alistarh, Janne H. Korhonen

Abstract: The ability to leverage large-scale hardware parallelism has been one of the key enablers of the accelerated recent progress in machine learning. Consequently, there has been considerable effort invested into developing efficient parallel variants of classic machine learning algorithms. However, despite the wealth of knowledge on parallelization, some classic machine learning algorithms often prove hard to parallelize efficiently while maintaining convergence. In this paper, we focus on efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm. We address the challenge of efficiently parallelizing this classic paradigm by showing how to leverage scalable relaxed schedulers in this context. We present an extensive empirical study, showing that our approach outperforms previous parallel belief propagation implementations both in terms of scalability and in terms of wall-clock convergence time, on a range of practical applications.

Submitted 18 January, 2021; v1 submitted 25 February, 2020; originally announced February 2020.

arXiv:1908.06077 [pdf, other]

NUQSGD: Provably Communication-efficient Data-parallel SGD via Nonuniform Quantization

Authors: Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, Daniel M. Roy

Abstract: As the size and complexity of models and datasets grow, so does the need for communication-efficient variants of stochastic gradient descent that can be deployed to perform parallel model training. One popular communication-compression method for data-parallel SGD is QSGD (Alistarh et al., 2017), which quantizes and encodes gradients to reduce communication costs. The baseline variant of QSGD provides strong theoretical guarantees; however, for practical purposes, the authors proposed a heuristic variant, which we call QSGDinf, that demonstrated impressive empirical gains for distributed training of large neural networks. In this paper, we build on this work to propose a new gradient quantization scheme, and show that it has both stronger theoretical guarantees than QSGD, and matches and exceeds the empirical performance of the QSGDinf heuristic and of other compression methods.

Submitted 3 May, 2021; v1 submitted 16 August, 2019; originally announced August 2019.

Comments: 42 pages, 21 figures. To appear in the Journal of Machine Learning Research (JMLR)
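
As background for the gradient quantization the abstract refers to, the following is a minimal sketch of unbiased stochastic quantization onto uniformly spaced levels, in the spirit of the baseline QSGD scheme. It is not the nonuniform scheme proposed in the paper and omits the encoding step; the function name `stochastic_quantize`, the parameter `s` (number of levels), and the example vector are assumptions made for illustration.

```python
import math
import random


def stochastic_quantize(v, s, rng):
    """Quantize each coordinate of v onto the grid {0, 1/s, ..., 1} * ||v||_2.

    Each coordinate is rounded up or down at random so that the
    quantized value equals the original value in expectation.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    out = []
    for x in v:
        # Position of |x| on the level grid, clamped to [0, s].
        scaled = min(abs(x) / norm * s, float(s))
        lower = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        level = lower + (1 if rng.random() < scaled - lower else 0)
        out.append(math.copysign(norm * level / s, x))
    return out


rng = random.Random(0)          # fixed seed, only for reproducibility
grad = [0.3, -1.2, 0.5]         # hypothetical gradient vector
print(stochastic_quantize(grad, s=4, rng=rng))
```

Only the norm, the signs, and the small integer levels need to be communicated, which is where the bandwidth saving comes from; the randomized rounding keeps the estimate unbiased at the cost of added variance.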

arXiv:1904.11323 [pdf, other]

Performance Prediction for Coarse-Grained Locking

Authors: Vitaly Aksenov, Dan Alistarh, Petr Kuznetsov

Abstract: A standard design pattern found in many concurrent data structures, such as hash tables or ordered containers, is an alternation of parallelizable sections that incur no data conflicts and critical sections that must run sequentially and are protected with locks. A lock can be viewed as a queue that arbitrates the order in which the critical sections are executed, and a natural question is whether we can use stochastic analysis to predict the resulting throughput. As preliminary evidence in the affirmative, we describe a simple model that can be used to predict the throughput of coarse-grained lock-based algorithms. We show that our model works well for the CLH lock, and we expect it to work for other popular lock designs such as TTAS, MCS, etc.

Submitted 25 April, 2019; originally announced April 2019.

arXiv:1710.07588 [pdf, other]

Parallel Combining: Benefits of Explicit Synchronization

Authors: Vitaly Aksenov, Petr Kuznetsov, Anatoly Shalyto

Abstract: Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel batched one. The idea is that we explicitly synchronize concurrent operations into batches: one of the processes becomes a combiner which collects concurrent requests and initiates a parallel batched algorithm involving the owners (clients) of the collected requests. Intuitively, the cost of synchronizing the concurrent calls can be compensated by running the parallel batched algorithm. We validate the intuition via two applications of parallel combining. First, we use our technique to design a concurrent data structure optimized for read-dominated workloads, taking a dynamic graph data structure as an example. Second, we use a novel parallel batched priority queue to build a concurrent one. In both cases, we obtain performance gains with respect to the state-of-the-art algorithms.

Submitted 13 November, 2018; v1 submitted 20 October, 2017; originally announced October 2017.

arXiv:1705.02851 [pdf, other]

Flat Parallelization

Authors: Vitaly Aksenov, Petr Kuznetsov

Abstract: There are two intertwined factors that affect the performance of concurrent data structures: the ability of processes to access the data in parallel and the cost of synchronization. It has been observed that for a large class of "concurrency-unfriendly" data structures, fine-grained parallelization does not pay off: an implementation based on a single global lock outperforms fine-grained solutions. The flat combining paradigm exploits this by ensuring that a thread holding the global lock sequentially combines requests and then executes the combined requests on behalf of concurrent threads. In this paper, we propose a synchronization technique that unites flat combining and parallel bulk updates borrowed from parallel algorithms designed for the PRAM model. The idea is that the combiner thread assigns waiting threads to perform concurrent requests in parallel. We foresee the technique helping to implement efficient "concurrency-ambivalent" data structures, which can benefit from both parallelism and serialization, depending on the operational context. To validate the idea, we considered heap-based implementations of a priority queue. These data structures exhibit two important features: concurrent remove operations are likely to conflict and thus may benefit from combining, while concurrent insert operations can often be at least partly applied in parallel and thus may benefit from parallel batching. We show that the resulting flat parallelization algorithm performs well compared to state-of-the-art priority queue implementations.

Submitted 9 May, 2017; v1 submitted 8 May, 2017; originally announced May 2017.

arXiv:1702.04441 [pdf, other]

A Concurrency-Optimal Binary Search Tree

Authors: Vitaly Aksenov, Vincent Gramoli, Petr Kuznetsov, Anna Malova, Srivatsan Ravi

Abstract: The paper presents the first concurrency-optimal implementation of a binary search tree (BST). The implementation, based on a standard sequential implementation of an internal tree, ensures that every schedule, i.e., interleaving of steps of the sequential code, is accepted unless linearizability is violated. To ensure this property, we use a novel read-write locking scheme that protects tree edges in addition to nodes. Our implementation outperforms the state-of-the-art BSTs on most basic workloads, which suggests that optimizing the set of accepted schedules of the sequential code can be an adequate design principle for efficient concurrent data structures.

Submitted 2 March, 2017; v1 submitted 14 February, 2017; originally announced February 2017.

arXiv:1502.01633 [pdf, other]

A Concurrency-Optimal List-Based Set

Authors: Vitaly Aksenov, Vincent Gramoli, Petr Kuznetsov, Srivatsan Ravi, Di Shang

Abstract: Designing an efficient concurrent data structure is an important challenge that is not easy to meet. Intuitively, efficiency of an implementation is defined, in the first place, by its ability to process applied operations in parallel, without using unnecessary synchronization. As we show in this paper, even for a data structure as simple as a linked list used to implement the set type, the most efficient algorithms known so far are not concurrency-optimal: they may reject correct concurrent schedules. We propose a new algorithm for the list-based set based on a value-aware try-lock that we show to achieve optimal concurrency: it only rejects concurrent schedules that violate correctness of the implemented set type. We show empirically that reaching optimality does not induce a significant overhead. In fact, our implementation of the concurrency-optimal algorithm outperforms both the Lazy Linked List and the Harris-Michael state-of-the-art algorithms.

Submitted 14 January, 2021; v1 submitted 5 February, 2015; originally announced February 2015.

Showing 1–26 of 26 results for author: Aksenov, V