Search | arXiv e-print repository

Perturbation-Resilient Trades for Dynamic Service Balancing

Authors: ** Sima, Chao Pan, Olgica Milenkovic

Abstract: A combinatorial trade is a pair of sets of blocks of elements that can be exchanged while preserving relevant subset intersection constraints. The class of balanced and swap-robust minimal trades was proposed in [1] for exchanging blocks of data chunks stored on distributed storage systems in an access- and load-balanced manner. More precisely, data chunks in the trades of interest are labeled by popularity ranks and the blocks are required to have both balanced overall popularity and stability properties with respect to swaps in chunk popularities. The original construction of such trades relied on computer search and paired balanced sets obtained through iterative combining of smaller sets that have provable stability guarantees. To reduce the substantial gap between the results of prior approaches and the known theoretical lower bound, we present new analytical upper and lower bounds on the minimal disbalance of blocks introduced by limited-magnitude popularity ranking swaps. Our constructive and near-optimal approach relies on pairs of graphs whose vertices are two balanced sets with edges/arcs that capture the balance and potential balance changes induced by limited-magnitude popularity swaps. In particular, we show that if we start with carefully selected balanced trades and limit the magnitude of rank swaps to one, the new upper and lower bounds on the maximum block disbalance caused by a swap differ only by a factor of $1.07$. We also extend these results to larger popularity swap magnitudes.

Submitted 21 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2303.12996

arXiv:2402.08751 [pdf, other]

Nearest Neighbor Representations of Neural Circuits

Authors: Kordag Mehmet Kilic, ** Sima, Jehoshua Bruck

Abstract: Neural networks successfully capture the computational power of the human brain for many tasks. Similarly inspired by the brain architecture, Nearest Neighbor (NN) representations are a novel approach to computation. We establish a firmer correspondence between NN representations and neural networks. Although it was known how to represent a single neuron using NN representations, there were no results even for small depth neural networks. Specifically, for depth-2 threshold circuits, we provide explicit constructions for their NN representation with an explicit bound on the number of bits to represent it. Example functions include NN representations of convex polytopes (AND of threshold gates), IP2, OR of threshold gates, and linear or exact decision lists.

Submitted 9 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: This paper is accepted to ISIT 2024. 2nd version has revisions for better clarity, more citations, and more explanation in the proofs. No results are changed

arXiv:2402.08748 [pdf, ps, other]

Nearest Neighbor Representations of Neurons

Authors: Kordag Mehmet Kilic, ** Sima, Jehoshua Bruck

Abstract: The Nearest Neighbor (NN) Representation is an emerging computational model that is inspired by the brain. We study the complexity of representing a neuron (threshold function) using the NN representations. It is known that two anchors (the points to which NN is computed) are sufficient for a NN representation of a threshold function; however, the resolution (the maximum number of bits required for the entries of an anchor) is $O(n\log{n})$. In this work, the trade-off between the number of anchors and the resolution of a NN representation of threshold functions is investigated. We prove that the well-known threshold functions EQUALITY, COMPARISON, and ODD-MAX-BIT, which require 2 or 3 anchors and resolution of $O(n)$, can be represented by a polynomially large number of anchors in $n$ and $O(\log{n})$ resolution. We conjecture that for all threshold functions, there are NN representations with polynomially large size and logarithmic resolution in $n$.

Submitted 9 May, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

Comments: This paper is accepted to ISIT 2024. 2nd version had revisions for better clarity, fixing of typos. No results are changed

arXiv:2402.00315 [pdf, ps, other]

Online Distribution Learning with Local Private Constraints

Authors: ** Sima, Changlong Wu, Olgica Milenkovic, Wojciech Szpankowski

Abstract: We study the problem of online conditional distribution estimation with \emph{unbounded} label sets under local differential privacy. Let $\mathcal{F}$ be a distribution-valued function class with unbounded label set. We aim at estimating an \emph{unknown} function $f\in \mathcal{F}$ in an online fashion so that at time $t$ when the context $\boldsymbol{x}_t$ is provided we can generate an estimate of $f(\boldsymbol{x}_t)$ under KL-divergence knowing only a privatized version of the true labels sampled from $f(\boldsymbol{x}_t)$. The ultimate objective is to minimize the cumulative KL-risk over a finite horizon $T$. We show that under $(\varepsilon,0)$-local differential privacy of the privatized labels, the KL-risk grows as $\tilde{\Theta}(\frac{1}{\varepsilon}\sqrt{KT})$ up to poly-logarithmic factors where $K=|\mathcal{F}|$. This is in stark contrast to the $\tilde{\Theta}(\sqrt{T\log K})$ bound demonstrated by Wu et al. (2023a) for bounded label sets. As a byproduct, our results recover a nearly tight upper bound for the hypothesis selection problem of Gopi et al. (2020) established only for the batch setting.

Submitted 31 January, 2024; originally announced February 2024.

arXiv:2401.15520 [pdf, ps, other]

Oracle-Efficient Hybrid Online Learning with Unknown Distribution

Authors: Changlong Wu, ** Sima, Wojciech Szpankowski

Abstract: We study the problem of oracle-efficient hybrid online learning when the features are generated by an unknown i.i.d. process and the labels are generated adversarially. Assuming access to an (offline) ERM oracle, we show that there exists a computationally efficient online predictor that achieves a regret upper bounded by $\tilde{O}(T^{\frac{3}{4}})$ for a finite-VC class, and upper bounded by $\tilde{O}(T^{\frac{p+1}{p+2}})$ for a class with $\alpha$ fat-shattering dimension $\alpha^{-p}$. This provides the first known oracle-efficient sublinear regret bounds for hybrid online learning with an unknown feature generation process. In particular, it confirms a conjecture of Lazaric and Munos (JCSS 2012). We then extend our result to the scenario of shifting distributions with $K$ changes, yielding a regret of order $\tilde{O}(T^{\frac{4}{5}}K^{\frac{1}{5}})$. Finally, we establish a regret of $\tilde{O}((K^{\frac{2}{3}}(\log|\mathcal{H}|)^{\frac{1}{3}}+K)\cdot T^{\frac{4}{5}})$ for the contextual $K$-armed bandits with a finite policy set $\mathcal{H}$, i.i.d. generated contexts from an unknown distribution, and adversarially generated costs.

Submitted 27 January, 2024; originally announced January 2024.

arXiv:2310.03897 [pdf, other]

Break-Resilient Codes for Forensic 3D Fingerprinting

Authors: Canran Wang, ** Sima, Netanel Raviv

Abstract: 3D printing brings about a revolution in consumption and distribution of goods, but poses a significant risk to public safety. Any individual with internet access and a commodity printer can now produce untraceable firearms, keys, and dangerous counterfeit products. To aid government authorities in combating these new security threats, objects are often tagged with identifying information. This information, also known as fingerprints, is written into the object using various bit embedding techniques, such as varying the width of the molten thermoplastic layers. Yet, due to the adversarial nature of the problem, it is important to devise tamper-resilient fingerprinting techniques, so that the fingerprint can be extracted even if the object is damaged. While fingerprinting various forms of digital media (such as videos, images, etc.) has been studied extensively in the past, 3D printing is a relatively new medium which is exposed to different types of adversarial physical tampering that do not exist in the digital world. This paper focuses on one such type of adversarial tampering, where the adversary breaks the object into at most a certain number of parts. This gives rise to a new adversarial coding problem, which is formulated and investigated herein. We survey the existing technology, present an abstract problem definition, provide lower bounds for the required redundancy, and construct a code which attains it up to asymptotically small factors.
Notably, the problem bears some resemblance to the torn paper channel, which was recently studied for applications in DNA storage.

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2310.01729 [pdf, other]

Error Correction for DNA Storage

Authors: ** Sima, Netanel Raviv, Moshe Schwartz, Jehoshua Bruck

Abstract: DNA-based storage is an emerging storage technology that provides high information density and long duration. Due to the physical constraints in the reading and writing processes, error correction in DNA storage poses several interesting coding theoretic challenges, some of which are new. In this paper, we give a brief introduction to some of the coding challenges for DNA-based storage, including deletion/insertion correcting codes, codes over sliced channels, and duplication correcting codes.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2308.07793 [pdf, ps, other]

Robust Indexing for the Sliced Channel: Almost Optimal Codes for Substitutions and Deletions

Authors: ** Sima, Netanel Raviv, Jehoshua Bruck

Abstract: Encoding data as a set of unordered strings is receiving great attention as it captures one of the basic features of DNA storage systems. However, the challenge of constructing optimal redundancy codes for this channel remained elusive. In this paper, we address this problem and present an order-wise optimal construction of codes that are capable of correcting multiple substitution, deletion, and insertion errors for this channel model. The key ingredient in the code construction is a technique we call robust indexing: simultaneously assigning indices to unordered strings (hence, creating order) and also embedding information in these indices. The encoded indices are resilient to substitution, deletion, and insertion errors, and therefore, so is the entire code.

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.06895 [pdf, other]

Federated Classification in Hyperbolic Spaces via Secure Aggregation of Convex Hulls

Authors: Saurav Prakash, ** Sima, Chao Pan, Eli Chien, Olgica Milenkovic

Abstract: Hierarchical and tree-like data sets arise in many applications, including language processing, graph data mining, phylogeny and genomics. It is known that tree-like data cannot be embedded into Euclidean spaces of finite dimension with small distortion. This problem can be mitigated through the use of hyperbolic spaces. When such data also has to be processed in a distributed and privatized setting, it becomes necessary to work with new federated learning methods tailored to hyperbolic spaces. As an initial step towards the development of the field of federated learning in hyperbolic spaces, we propose the first known approach to federated classification in hyperbolic spaces. Our contributions are as follows. First, we develop distributed versions of convex SVM classifiers for Poincaré discs. In this setting, the information conveyed from clients to the global classifier are convex hulls of clusters present in individual client data. Second, to avoid label switching issues, we introduce a number-theoretic approach for label recovery based on the so-called integer $B_h$ sequences. Third, we compute the complexity of the convex hulls in hyperbolic spaces to assess the extent of data leakage; at the same time, in order to limit communication cost for the hulls, we propose a new quantization method for the Poincaré disc coupled with Reed-Solomon-like encoding. Fourth, at the server level, we introduce a new approach for aggregating convex hulls of the clients based on balanced graph partitioning.
We test our method on a collection of diverse data sets, including hierarchical single-cell RNA-seq data from different patients distributed across different repositories that have stringent privacy constraints. The classification accuracy of our method is up to $\sim 11\%$ better than its Euclidean counterpart, demonstrating the importance of privacy-preserving learning in hyperbolic spaces.

Submitted 16 January, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

Comments: Published in the Transactions on Machine Learning Research (TMLR). Link: https://openreview.net/forum?id=umggDfMHha

arXiv:2305.05808 [pdf, ps, other]

On the Information Capacity of Nearest Neighbor Representations

Authors: Kordag Mehmet Kilic, ** Sima, Jehoshua Bruck

Abstract: The $\textit{von Neumann Computer Architecture}$ has a distinction between computation and memory. In contrast, the brain has an integrated architecture where computation and memory are indistinguishable. Motivated by the architecture of the brain, we propose a model of $\textit{associative computation}$ where memory is defined by a set of vectors in $\mathbb{R}^n$ (that we call $\textit{anchors}$), computation is performed by convergence from an input vector to a nearest neighbor anchor, and the output is a label associated with an anchor. Specifically, in this paper, we study the representation of Boolean functions in the associative computation model, where the inputs are binary vectors and the corresponding outputs are the labels ($0$ or $1$) of the nearest neighbor anchors. The information capacity of a Boolean function in this model is associated with two quantities: $\textit{(i)}$ the number of anchors (called $\textit{Nearest Neighbor (NN) Complexity}$) and $\textit{(ii)}$ the maximal number of bits representing entries of anchors (called $\textit{Resolution}$). We study symmetric Boolean functions and present constructions that have optimal NN complexity and resolution.

Submitted 9 May, 2023; originally announced May 2023.

Comments: The conference version was submitted to and accepted by ISIT 2023
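
The associative computation model described in this abstract can be illustrated with a small sketch. The code below is not taken from the paper: the function names `nn_label`, `two_anchor_representation`, and `represents` are hypothetical, and the anchor placement (two points mirrored across the separating hyperplane) is one standard way to realize a threshold function with two anchors, assuming integer weights so that no input is equidistant from both anchors.

```python
from itertools import product

def nn_label(x, anchors):
    """Return the label of the anchor nearest to x (squared Euclidean distance)."""
    def dist2(a):
        return sum((xi - ai) ** 2 for xi, ai in zip(x, a))
    vector, label = min(anchors, key=lambda pair: dist2(pair[0]))
    return label

def two_anchor_representation(w, b):
    """Two anchors for f(x) = 1[w.x >= b], assuming integer w and b.

    The anchors are mirror images across the hyperplane w.x = b - 0.5,
    so x is closer to the 1-anchor exactly when w.x > b - 0.5.
    """
    norm2 = sum(wi * wi for wi in w)
    center = [(b - 0.5) * wi / norm2 for wi in w]  # a point on the hyperplane
    a1 = [ci + wi for ci, wi in zip(center, w)]
    a0 = [ci - wi for ci, wi in zip(center, w)]
    return [(a0, 0), (a1, 1)]

def represents(w, b):
    """Check the representation against the threshold function on all binary inputs."""
    anchors = two_anchor_representation(w, b)
    return all(
        nn_label(x, anchors) == int(sum(wi * xi for wi, xi in zip(w, x)) >= b)
        for x in product([0, 1], repeat=len(w))
    )
```

For instance, `represents([1, 1, 1], 2)` checks the 3-input majority function. Here the NN complexity is 2; the resolution depends on how many bits the anchor entries need, which this sketch does not try to minimize.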

arXiv:2304.01365 [pdf, ps, other]

Finding a Burst of Positives via Nonadaptive Semiquantitative Group Testing

Authors: Yun-Han Li, Ryan Gabrys, ** Sima, Ilan Shomorony, Olgica Milenkovic

Abstract: Motivated by testing for pathogenic diseases, we consider a new nonadaptive group testing problem for which: (1) positives occur within a burst, capturing the fact that infected test subjects often come in clusters, and (2) the test outcomes arise from semiquantitative measurements that provide coarse information about the number of positives in any tested group. Our model generalizes prior work on detecting a single burst of positives with classical group testing[1] as well as work on semiquantitative group testing (SQGT)[2]. Specifically, we study the setting where the burst-length $\ell$ is known and the semiquantitative tests provide potentially nonuniform estimates on the number of positives in a test group. The estimates represent the index of a quantization bin containing the (exact) total number of positives, for arbitrary thresholds $\eta_1,\dots,\eta_s$. Interestingly, we show that the minimum number of tests needed for burst identification is essentially only a function of the largest threshold $\eta_s$. In this context, our main result is an order-optimal test scheme that can recover any burst of length $\ell$ using roughly $\frac{\ell}{2\eta_s}+\log_{s+1}(n)$ measurements. This suggests that a large saturation level $\eta_s$ is more important than finely quantized information when dealing with bursts. We also provide results for related modeling assumptions and specialized choices of thresholds.

Submitted 3 April, 2023; originally announced April 2023.
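
To make the semiquantitative measurement model concrete, here is a minimal sketch of a single test outcome. It is an illustration only: the function name `sqgt_outcome` is hypothetical, and the bin convention (outcome $j$ when the number of positives lies in $[\eta_j, \eta_{j+1})$, with saturation at $\eta_s$, giving $s+1$ possible outcomes) is an assumption, since the abstract does not spell out the exact boundaries.

```python
from bisect import bisect_right

def sqgt_outcome(pool, positives, thresholds):
    """Return the quantization-bin index for one semiquantitative group test.

    pool       : iterable of subject indices included in the test
    positives  : iterable of indices of positive subjects (e.g. a burst)
    thresholds : increasing list [eta_1, ..., eta_s]

    Assumed convention: the outcome is the number of thresholds that are
    less than or equal to the count of positives in the pool, so the
    outcome ranges over 0..s and saturates once the count reaches eta_s.
    """
    count = len(set(pool) & set(positives))
    return bisect_right(thresholds, count)

# A burst of length 3 at positions 4..6, with thresholds eta = (1, 3).
burst = range(4, 7)
outcomes = [
    sqgt_outcome([0, 1], burst, [1, 3]),        # no positives in the pool
    sqgt_outcome([4, 5], burst, [1, 3]),        # two positives, below eta_2
    sqgt_outcome([4, 5, 6, 7], burst, [1, 3]),  # three positives, saturated
]
```

With $s$ thresholds each test has $s+1$ possible outcomes under this convention, which matches the base of the logarithm in the $\log_{s+1}(n)$ term quoted in the abstract.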

arXiv:2303.12996 [pdf, other]

Perturbation-Resilient Sets for Dynamic Service Balancing

Authors: ** Sima, Chao Pan, Olgica Milenkovic

Abstract: Balanced and swap-robust minimal trades, introduced in [1], are important for studying the balance and stability of server access request protocols under data popularity changes. Constructions of such trades have so far relied on paired sets obtained through iterative combining of smaller sets that have provable stability guarantees, coupled with exhaustive computer search. Currently, there exists a nonnegligible gap between the resulting total dynamic balance discrepancy and the known theoretical lower bound. We present both new upper and lower bounds on the total service requests discrepancy under limited popularity changes. Our constructive near-optimal approach uses a new class of paired graphs whose vertices are two balanced sets with edges (arcs) that capture the balance and potential balance changes induced by limited-magnitude popularity changes (swaps).

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2303.12990 [pdf, ps, other]

On Constant-Weight Binary $B_2$-Sequences

Authors: ** Sima, Yun-Han Li, Ilan Shomorony, Olgica Milenkovic

Abstract: Motivated by applications in polymer-based data storage, we introduce the new problem of characterizing the code rate and designing constant-weight binary $B_2$-sequences. Binary $B_2$-sequences are collections of binary strings of length $n$ with the property that the real-valued sums of all distinct pairs of strings are distinct. In addition to this defining property, constant-weight binary $B_2$-sequences also satisfy the constraint that each string has a fixed, relatively small weight $\omega$ that scales linearly with $n$. The constant-weight constraint ensures low-cost synthesis and uniform processing of the data readout via tandem mass spectrometers. Our main results include upper bounds on the size of the codes formulated as entropy-optimization problems and constructive lower bounds based on Sidon sequences.

Submitted 22 March, 2023; originally announced March 2023.
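
The defining property of a binary $B_2$-sequence can be checked by brute force on small examples. The sketch below is illustrative only: `is_b2` and `is_constant_weight` are hypothetical names, and "distinct pairs" is read as unordered pairs of two different strings (whether a string paired with itself is also included is an assumption the abstract does not settle).

```python
from itertools import combinations

def is_b2(strings):
    """Return True if the real-valued (component-wise integer) sums of all
    unordered pairs of different strings are pairwise distinct."""
    seen = set()
    for u, v in combinations(strings, 2):
        pair_sum = tuple(int(a) + int(b) for a, b in zip(u, v))
        if pair_sum in seen:
            return False
        seen.add(pair_sum)
    return True

def is_constant_weight(strings, weight):
    """Return True if every string contains exactly `weight` ones."""
    return all(s.count("1") == weight for s in strings)

good = ["1100", "1010", "1001"]        # pair sums 2110, 2101, 2011
bad = ["1100", "0011", "1010", "0101"] # 1100+0011 and 1010+0101 both give 1111
```

In this toy case `good` and `bad` both have constant weight 2, but only `good` satisfies the $B_2$ property.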

arXiv:2210.16424 [pdf, other]

Machine Unlearning of Federated Clusters

Authors: Chao Pan, ** Sima, Saurav Prakash, Vishal Rana, Olgica Milenkovic

Abstract: Federated clustering (FC) is an unsupervised learning problem that arises in a number of practical applications, including personalized recommender and healthcare systems. With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning for FC methods has become of significant importance. We introduce, for the first time, the problem of machine unlearning for FC, and propose an efficient unlearning mechanism for a customized secure FC framework. Our FC framework utilizes special initialization procedures that we show are well-suited for unlearning. To protect client data privacy, we develop the secure compressed multiset aggregation (SCMA) framework that addresses sparse secure federated learning (FL) problems encountered during clustering as well as more general problems. To simultaneously facilitate low communication complexity and secret sharing protocols, we integrate Reed-Solomon encoding with special evaluation points into our SCMA pipeline, and prove that the client communication cost is logarithmic in the vector dimension. Additionally, to demonstrate the benefits of our unlearning mechanism over complete retraining, we provide a theoretical analysis for the unlearning performance of our approach. Simulation results show that the new FC framework exhibits superior clustering performance compared to previously reported FC baselines when the cluster sizes are highly imbalanced.
Compared to completely retraining K-means++ locally and globally for each removal request, our unlearning procedure offers an average speed-up of roughly 84x across seven datasets. Our implementation for the proposed method is available at https://github.com/thupchnsky/mufc.

Submitted 30 June, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: 27 pages. ICLR 2023

arXiv:2210.11818 [pdf, ps, other]

Non-binary Codes for Correcting a Burst of at Most t Deletions

Authors: Shuche Wang, Yuanyuan Tang, ** Sima, Ryan Gabrys, Farzad Farnoud

Abstract: The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most $t$ deletions in non-binary sequences. We first propose a non-binary code correcting a burst of at most 2 deletions for $q$-ary alphabets. Afterwards, we extend this result to the case where the length of the burst can be at most $t$ where $t$ is a constant. Finally, we consider the setup where the sequences that are transmitted are permutations. The proposed codes are the largest known for their respective parameter regimes.

Submitted 21 October, 2022; originally announced October 2022.

Comments: 20 pages. The paper has been submitted to IEEE Transactions on Information Theory. Furthermore, the paper was presented in part at ISIT 2021 and Allerton 2022
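
The error model in this abstract, a single run of at most $t$ consecutive deletions, can be explored by brute force on short codes. The following is a sketch under stated assumptions, not the paper's construction: `burst_deletion_ball` and `corrects_burst` are hypothetical helper names, and unique decodability is tested in the usual way, by checking that the sets of possible received words of different codewords do not intersect.

```python
from itertools import combinations

def burst_deletion_ball(seq, t):
    """All words obtainable from seq by deleting one run of 0..t consecutive symbols."""
    results = set()
    for length in range(0, t + 1):
        for start in range(len(seq) - length + 1):
            results.add(seq[:start] + seq[start + length:])
    return results

def corrects_burst(code, t):
    """Return True if no received word is consistent with two different codewords."""
    balls = {c: burst_deletion_ball(c, t) for c in code}
    return all(balls[a].isdisjoint(balls[b]) for a, b in combinations(code, 2))

# Toy ternary examples (alphabet {0, 1, 2}).
ok_code = ["0000", "1111", "2222"]
bad_code = ["0101", "1010"]  # deleting one symbol from either can give "101"
```

This exhaustive check is exponential in practice and only suitable for toy parameters; it says nothing about code size or redundancy, which are the quantities the paper optimizes.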

arXiv:2207.08372 [pdf, ps, other]

Correcting $k$ Deletions and Insertions in Racetrack Memory

Authors: ** Sima, Jehoshua Bruck

Abstract: One of the main challenges in developing racetrack memory systems is the limited precision in controlling the track shifts, which in turn affects the reliability of reading and writing the data. A current proposal for combating deletions in racetrack memories is to use redundant heads per-track resulting in multiple copies (potentially erroneous) and recovering the data by solving a specialized version of a sequence reconstruction problem. Using this approach, $k$-deletion correcting codes of length $n$, with $d \ge 2$ heads per-track, with redundancy $\log \log n + 4$ were constructed. However, the known approach requires that $k \le d$, namely, that the number of heads ($d$) is larger than or equal to the number of correctable deletions ($k$). Here we address the question: What is the best redundancy that can be achieved for a $k$-deletion code ($k$ is a constant) if the number of heads is fixed at $d$ (due to implementation constraints)? One of our key results is an answer to this question, namely, we construct codes that can correct $k$ deletions, for any $k$ beyond the known limit of $d$. The code has $4k \log \log n+o(\log \log n)$ redundancy for $k \le 2d-1$. In addition, when $k \ge 2d$, our codes have $2 \lfloor k/d\rfloor \log n+o(\log n)$ redundancy, which we prove is order-wise optimal; specifically, we prove that the redundancy required for correcting $k$ deletions is at least $\lfloor k/d\rfloor \log n+o(\log n)$. The encoding/decoding complexity of our codes is $O(n\log^{2k}n)$.
Finally, we ask a general question: What is the optimal redundancy for codes correcting a combination of at most $k$ deletions and insertions in a $d$-head racetrack memory? We prove that the redundancy sufficient to correct a combination of $k$ deletion and insertion errors is similar to the case of $k$ deletion errors.

Submitted 18 July, 2022; originally announced July 2022.

arXiv:2205.08032 [pdf, ps, other]

On Algebraic Constructions of Neural Networks with Small Weights

Authors: Kordag Mehmet Kilic, ** Sima, Jehoshua Bruck

Abstract: Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (the number of distinct functions they can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among the weight sizes, circuit size and depth is a well-studied topic both in circuit complexity theory and the practice of neural computation. We propose a new approach for studying these complexity trade-offs by considering a related algebraic framework. Specifically, given a single linear equation with arbitrary coefficients, we would like to express it using a system of linear equations with smaller (even constant) coefficients. The techniques we developed are based on Siegel's Lemma for the bounds, anti-concentration inequalities for the existential results and extensions of Sylvester-type Hadamard matrices for the constructions. We explicitly construct a constant weight, optimal size matrix to compute the EQUALITY function (checking if two integers expressed in binary are equal). Computing EQUALITY with a single linear equation requires exponentially large weights. In addition, we prove the existence of the best-known weight size (linear) matrices to compute the COMPARISON function (comparing between two integers expressed in binary). In the context of the circuit complexity theory, our results improve the upper bounds on the weight sizes for the best-known circuit sizes for EQUALITY and COMPARISON.

Submitted 16 May, 2022; originally announced May 2022.

arXiv:2102.10416 [pdf, ps, other]

Simplest Non-Regular Deterministic Context-Free Language

Authors: Petr Jancar, Jiri Sima

Abstract: We introduce a new notion of C-simple problems for a class C of decision problems (i.e. languages), w.r.t. a particular reduction. A problem is C-simple if it can be reduced to each problem in C. This can be viewed as a conceptual counterpart to C-hard problems to which all problems in C reduce. Our concrete example is the class of non-regular deterministic context-free languages (DCFL'), with a truth-table reduction by Mealy machines (which proves to be a preorder). The main technical result is a proof that the DCFL' language $L=\{0^n1^n; n\geq 1\}$ is DCFL'-simple, which can thus be viewed as the simplest problem in the class DCFL'. This result has already provided an application, to the computational model of neural networks 1ANN at the first level of analog neuron hierarchy. This model was proven not to recognize $L$, by using a specialized technical argument that can hardly be generalized to other languages in DCFL'. By the result that $L$ is DCFL'-simple, w.r.t. the reduction that can be implemented by 1ANN, we immediately obtain that 1ANN cannot accept any language in DCFL'. It thus seems worthwhile to explore if looking for C-simple problems in other classes C under suitable reductions could provide effective tools for expanding the lower-bound results known for single problems to the whole classes of problems.

Submitted 20 February, 2021; originally announced February 2021.

arXiv:2102.05372 [pdf, ps, other]

Trace Reconstruction with Bounded Edit Distance

Authors: ** Sima, Jehoshua Bruck

Abstract: The trace reconstruction problem studies the number of noisy samples needed to recover an unknown string $\boldsymbol{x}\in\{0,1\}^n$ with high probability, where the samples are independently obtained by passing $\boldsymbol{x}$ through a random deletion channel with deletion probability $q$. The problem is receiving significant attention recently due to its applications in DNA sequencing and DNA storage. Yet, there is still an exponential gap between upper and lower bounds for the trace reconstruction problem. In this paper we study the trace reconstruction problem when $\boldsymbol{x}$ is confined to an edit distance ball of radius $k$, which is essentially equivalent to distinguishing two strings with edit distance at most $k$. It is shown that $n^{O(k)}$ samples suffice to achieve this task with high probability.

Submitted 14 April, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

arXiv:2102.01633 [pdf, other]

Stronger Separation of Analog Neuron Hierarchy by Deterministic Context-Free Languages

Authors: Jiří Šíma

Abstract: We analyze the computational power of discrete-time recurrent neural networks (NNs) with the saturated-linear activation function within the Chomsky hierarchy. This model restricted to integer weights coincides with binary-state NNs with the Heaviside activation function, which are equivalent to finite automata (Chomsky level 3) recognizing regular languages (REG), while rational weights make this model Turing-complete even for three analog-state units (Chomsky level 0). For the intermediate model $α$ANN of a binary-state NN that is extended with $α\geq 0$ extra analog-state neurons with rational weights, we have established the analog neuron hierarchy 0ANNs $\subset$ 1ANNs $\subset$ 2ANNs $\subseteq$ 3ANNs. The separation 1ANNs $\subsetneqq$ 2ANNs has been witnessed by the non-regular deterministic context-free language (DCFL) $L_\#=\{0^n1^n\mid n\geq 1\}$ which cannot be recognized by any 1ANN even with real weights, while any DCFL (Chomsky level 2) is accepted by a 2ANN with rational weights. In this paper, we strengthen this separation by showing that any non-regular DCFL cannot be recognized by 1ANNs with real weights, which means (DCFLs $\setminus$ REG) $\subset$ (2ANNs $\setminus$ 1ANNs), implying 1ANNs $\cap$ DCFLs = 0ANNs. For this purpose, we have shown that $L_\#$ is the simplest non-regular DCFL by reducing $L_\#$ to any language in this class, which is by itself an interesting achievement in computability theory.

Submitted 2 February, 2021; originally announced February 2021.

Comments: 30 pages, 4 figures

arXiv:2101.01151 [pdf, other]

doi 10.3233/FI-2021-2101

A polynomial-time construction of a hitting set for read-once branching programs of width 3

Authors: Jiří Šíma, Stanislav Žák

Abstract: Recently, an interest in constructing pseudorandom or hitting set generators for restricted branching programs has increased, which is motivated by the fundamental issue of derandomizing space-bounded computations. Such constructions have been known only in the case of width 2 and in very restricted cases of bounded width. In this paper, we characterize the hitting sets for read-once branching programs of width 3 by a so-called richness condition. Namely, we show that such sets hit the class of read-once conjunctions of DNF and CNF (i.e. the weak richness). Moreover, we prove that any rich set extended with all strings within Hamming distance of 3 is a hitting set for read-once branching programs of width 3. Then, we show that any almost $O(\log n)$-wise independent set satisfies the richness condition. By using such a set due to Alon et al. (1992) our result provides an explicit polynomial-time construction of a hitting set for read-once branching programs of width 3 with acceptance probability $\varepsilon>5/6$. We announced this result at conferences more than ten years ago, including only proof sketches, which motivated a number of subsequent results on pseudorandom generators for restricted read-once branching programs. This paper contains our original detailed proof that has not been published yet.

Submitted 7 February, 2022; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: 48 pages, 10 figures

Journal ref: Fundamenta Informaticae, Volume 184, Issue 4 (March 10, 2022) fi:7043

arXiv:1910.12247 [pdf, ps, other]

Optimal $k$-Deletion Correcting Codes

Authors: ** Sima, Jehoshua Bruck

Abstract: Levenshtein introduced the problem of constructing $k$-deletion correcting codes in 1966, proved that the optimal redundancy of those codes is $O(k\log N)$, and proposed an optimal redundancy single-deletion correcting code (using the so-called VT construction). However, the problem of constructing optimal redundancy $k$-deletion correcting codes remained open. Our key contribution is a solution to this longstanding open problem. We present a $k$-deletion correcting code that has redundancy $8k\log n +o(\log n)$ and encoding/decoding algorithms of complexity $O(n^{2k+1})$ for constant $k$.

Submitted 27 October, 2019; originally announced October 2019.

arXiv:1809.02716 [pdf, ps, other]

On Coding over Sliced Information

Authors: ** Sima, Netanel Raviv, Jehoshua Bruck

Abstract: The interest in channel models in which the data is sent as an unordered set of binary strings has increased lately, due to emerging applications in DNA storage, among others. In this paper we analyze the minimal redundancy of binary codes for this channel under substitution errors, and provide several constructions, some of which are shown to be asymptotically optimal up to constants. The surprising result in this paper is that while the information vector is sliced into a set of unordered strings, the amount of redundant bits that are required to correct errors is order-wise equivalent to the amount required in the classical error correcting paradigm.

Submitted 27 October, 2019; v1 submitted 7 September, 2018; originally announced September 2018.

arXiv:1806.09240 [pdf, ps, other]

Two Deletion Correcting Codes from Indicator Vectors

Authors: ** Sima, Netanel Raviv, Jehoshua Bruck

Abstract: Construction of capacity achieving deletion correcting codes has been a baffling challenge for decades. A recent breakthrough by Brakensiek et al., alongside novel applications in DNA storage, has reignited the interest in this longstanding open problem. In spite of recent advances, the amount of redundancy in existing codes is still orders of magnitude away from being optimal. In this paper, a novel approach for constructing binary two-deletion correcting codes is proposed. By this approach, parity symbols are computed from indicator vectors (i.e., vectors that indicate the positions of certain patterns) of the encoded message, rather than from the message itself. Most interestingly, the parity symbols and the proof of correctness are a direct generalization of their counterparts in the Varshamov-Tenengolts construction. Our techniques require $7\log(n)+o(\log(n))$ redundant bits to encode an $n$-bit message, which is near-optimal.

Submitted 24 June, 2018; originally announced June 2018.

arXiv:1804.00702 [pdf, other]

ROLP: Runtime Object Lifetime Profiling for Big Data Memory Management

Authors: Rodrigo Bruno, Duarte Patrício, José Simão, Luís Veiga, Paulo Ferreira

Abstract: Low latency services such as credit-card fraud detection and website targeted advertisement rely on Big Data platforms (e.g., Lucene, GraphChi, Cassandra) which run on top of memory managed runtimes, such as the JVM. These platforms, however, suffer from unpredictable and unacceptably high pause times due to inadequate memory management decisions (e.g., allocating objects with very different lifetimes next to each other, resulting in memory fragmentation). This leads to long and frequent application pause times, breaking Service Level Agreements (SLAs). This problem has been previously identified and results show that current memory management techniques are ill-suited for applications that hold in memory massive amounts of middle to long-lived objects (which is the case for a wide spectrum of Big Data applications). Previous works try to reduce such application pauses by allocating objects off-heap or in special allocation regions/generations, thus alleviating the pressure on memory management. However, all these solutions require a combination of programmer effort and knowledge, source code access, or off-line profiling, with clear negative impact on programmer productivity and/or application performance. This paper presents ROLP, a runtime object lifetime profiling system. ROLP profiles application code at runtime in order to identify which allocation contexts create objects with middle to long lifetimes, given that such objects need to be handled differently from short-lived ones. This profiling information greatly improves memory management decisions, leading to reductions in long tail latencies of up to 51% for Lucene, 85% for GraphChi, and 60% for Cassandra, with negligible throughput and memory overhead. ROLP is implemented for the OpenJDK 8 HotSpot JVM and it does not require any programmer effort or source code access.

Submitted 9 March, 2018; originally announced April 2018.

arXiv:1704.03324 [pdf, other]

Gang-GC: Locality-aware Parallel Data Placement Optimizations for Key-Value Storages

Authors: Duarte Patrício, José Simão, Luís Veiga

Abstract: Many cloud applications rely on fast and non-relational storage to aid in the processing of large amounts of data. Managed runtimes are now widely used to support the execution of several storage solutions of the NoSQL movement, particularly when dealing with big data key-value store-driven applications. The benefits of these runtimes can however be limited by modern parallel throughput-oriented GC algorithms, where related objects have the potential to be dispersed in memory, either in the same or different generations. In the long run this causes more page faults and degradation of locality on system-level memory caches. We propose Gang-GC, an extension to modern heap layouts and to a parallel GC algorithm to promote locality between groups of related objects. This is done without extensive profiling of the applications and in a way that is transparent to the programmer, without the need to use specialized data structures. The heap layout and algorithmic extensions were implemented over the Parallel Scavenge garbage collector of the HotSpot JVM. Using microbenchmarks that capture the architecture of several key-value store databases, we show negligible overhead in frequent operations such as the allocation of new objects and improvements to the access speed of data, supported by fewer misses in system-level memory caches. Overall, we show a 6% improvement in the average time of read and update operations and an average decrease of 12.4% in page faults.

Submitted 11 April, 2017; originally announced April 2017.

Report number: INESC-ID Tec. Rep. 5/2017, Feb 2017

arXiv:1701.03507 [pdf, other]

Beyond NGS data sharing and towards open science

Authors: Bruno Dantas, Calmenelias Fleitas, Alexandre P. Francisco, José Simão, Cátia Vaz

Abstract: Biosciences have been revolutionized by next generation sequencing (NGS) technologies in recent years, leading to new perspectives in medical, industrial and environmental applications. And although our motivation comes from biosciences, the following is true for many areas of science: published results are usually hard to reproduce either because data is not available or tools are not readily available, which delays the adoption of new methodologies and hinders innovation. Our focus is on tool readiness and pipelines availability. Even though most tools are freely available, pipelines for data analysis are in general barely described and their configuration is far from trivial, with many parameters to be tuned. In this paper we discuss how to effectively build and use pipelines, relying on state of the art computing technologies to execute them without users needing to configure, install and manage tools, servers and complex workflow management systems. We perform an in depth comparative analysis of state of the art frameworks and systems. The NGSPipes framework is proposed showing that we can have public pipelines ready to process and analyse experimental data, produced for instance by high-throughput technologies, but without relying on centralized servers or Web services. The NGSPipes framework and underlying architecture provide a major step towards open science and true collaboration in what concerns tools and pipelines among computational biology researchers and practitioners. We show that it is possible to execute data analysis pipelines in a decentralized and platform independent way. Approaches like the one proposed are crucial for archiving and reusing data analysis pipelines in the medium to long term. The NGSPipes framework is freely available at http://ngspipes.github.io/.

Submitted 11 November, 2016; originally announced January 2017.

Comments: 19 pages, 10 figures

arXiv:1601.05917 [pdf, other]

Polar Codes for Broadcast Channels with Receiver Message Side Information and Noncausal State Available at the Encoder

Authors: ** Sima, Wei Chen

Abstract: In this paper polar codes are proposed for two receiver broadcast channels with receiver message side information (BCSI) and noncausal state available at the encoder, referred to as BCSI with noncausal state for short, where the two receivers know a priori the private messages intended for each other. This channel generalizes BCSI with common message and Gelfand-Pinsker problem and has applications in cellular communication systems. We establish an achievable rate region for BCSI with noncausal state and show that it is strictly larger than the straightforward extension of the Gelfand-Pinsker result. To achieve the established rate region with polar coding, we present polar codes for the general Gelfand-Pinsker problem, which adopts chaining construction and utilizes causal information to pre-transmit the frozen bits. It is also shown that causal information is necessary to pre-transmit the frozen bits. Based on the result of Gelfand-Pinsker problem, we use the chaining construction method to design polar codes for BCSI with noncausal state. The difficulty is that there are multiple chains sharing common information bit indices. To avoid value assignment conflicts, a nontrivial polarization alignment scheme is presented. It is shown that the proposed rate region is tight for degraded BCSI with noncausal state.

Submitted 22 January, 2016; originally announced January 2016.

Comments: 22 pages, 7 figures

arXiv:1407.8409 [pdf, other]

Joint Network and Gelfand-Pinsker Coding for 3-Receiver Gaussian Broadcast Channels with Receiver Message Side Information

Authors: ** Sima, Wei Chen

Abstract: The problem of characterizing the capacity region for Gaussian broadcast channels with receiver message side information appears difficult and remains open for N >= 3 receivers. This paper proposes a joint network and Gelfand-Pinsker coding method for 3-receiver cases. Using the method, we establish a unified inner bound on the capacity region of 3-receiver Gaussian broadcast channels under general message side information configuration. The achievability proof of the inner bound uses an idea of joint interference cancelation, where interference is canceled by using both dirty-paper coding at the encoder and successive decoding at some of the decoders. We show that the inner bound is larger than that achieved by state of the art coding schemes. An outer bound is also established and shown to be tight in 46 out of all 64 possible cases.

Submitted 19 August, 2014; v1 submitted 31 July, 2014; originally announced July 2014.

Comments: Author's final version (presented at the 2014 IEEE International Symposium on Information Theory [ISIT 2014])

arXiv:cs/0506100 [pdf, ps, other]

On the NP-Completeness of Some Graph Cluster Measures

Authors: Jiri Sima, Satu Elisa Schaeffer

Abstract: Graph clustering is the problem of identifying sparsely connected dense subgraphs (clusters) in a given graph. Proposed clustering algorithms usually optimize various fitness functions that measure the quality of a cluster within the graph. Examples of such cluster measures include the conductance, the local and relative densities, and single cluster editing. We prove that the decision problems associated with the optimization tasks of finding the clusters that are optimal with respect to these fitness measures are NP-complete.

Submitted 29 June, 2005; originally announced June 2005.

Comments: 9 pages, no figures

Showing 1–30 of 30 results for author: Sima, J