-
Asymmetric Leaky Private Information Retrieval
Authors:
Islam Samy,
Mohamed A. Attia,
Ravi Tandon,
Loukas Lazos
Abstract:
Information-theoretic formulations of the private information retrieval (PIR) problem have been investigated under a variety of scenarios. Symmetric private information retrieval (SPIR) is a variant where a user is able to privately retrieve one out of $K$ messages from $N$ non-colluding replicated databases without learning anything about the remaining $K-1$ messages. However, the goal of perfect…
▽ More
Information-theoretic formulations of the private information retrieval (PIR) problem have been investigated under a variety of scenarios. Symmetric private information retrieval (SPIR) is a variant where a user is able to privately retrieve one out of $K$ messages from $N$ non-colluding replicated databases without learning anything about the remaining $K-1$ messages. However, the goal of perfect privacy can be too taxing for certain applications. In this paper, we investigate if the information-theoretic capacity of SPIR (equivalently, the inverse of the minimum download cost) can be increased by relaxing both user and DB privacy definitions. Such relaxation is relevant in applications where privacy can be traded for communication efficiency. We introduce and investigate the Asymmetric Leaky PIR (AL-PIR) model with different privacy leakage budgets in each direction. For user privacy leakage, we bound the probability ratios between all possible realizations of DB queries by a function of a non-negative constant $ε$. For DB privacy, we bound the mutual information between the undesired messages, the queries, and the answers, by a function of a non-negative constant $δ$. We propose a general AL-PIR scheme that achieves an upper bound on the optimal download cost for arbitrary $ε$ and $δ$. We show that the optimal download cost of AL-PIR is upper-bounded as $D^{*}(ε,δ)\leq 1+\frac{1}{N-1}-\frac{δe^ε}{N^{K-1}-1}$. Second, we obtain an information-theoretic lower bound on the download cost as $D^{*}(ε,δ)\geq 1+\frac{1}{Ne^ε-1}-\fracδ{(Ne^ε)^{K-1}-1}$. The gap analysis between the two bounds shows that our AL-PIR scheme is optimal when $ε=0$, i.e., under perfect user privacy and it is optimal within a maximum multiplicative gap of $\frac{N-e^{-ε}}{N-1}$ for any $(ε,δ)$.
△ Less
Submitted 4 June, 2020;
originally announced June 2020.
-
Latent-variable Private Information Retrieval
Authors:
Islam Samy,
Mohamed A. Attia,
Ravi Tandon,
Loukas Lazos
Abstract:
In many applications, content accessed by users (movies, videos, news articles, etc.) can leak sensitive latent attributes, such as religious and political views, sexual orientation, ethnicity, gender, and others. To prevent such information leakage, the goal of classical PIR is to hide the identity of the content/message being accessed, which subsequently also hides the latent attributes. This so…
▽ More
In many applications, content accessed by users (movies, videos, news articles, etc.) can leak sensitive latent attributes, such as religious and political views, sexual orientation, ethnicity, gender, and others. To prevent such information leakage, the goal of classical PIR is to hide the identity of the content/message being accessed, which subsequently also hides the latent attributes. This solution, while private, can be too costly, particularly, when perfect (information-theoretic) privacy constraints are imposed. For instance, for a single database holding $K$ messages, privately retrieving one message is possible if and only if the user downloads the entire database of $K$ messages. Retrieving content privately, however, may not be necessary to perfectly hide the latent attributes.
Motivated by the above, we formulate and study the problem of latent-variable private information retrieval (LV-PIR), which aims at allowing the user efficiently retrieve one out of $K$ messages (indexed by $θ$) without revealing any information about the latent variable (modeled by $S$). We focus on the practically relevant setting of a single database and show that one can significantly reduce the download cost of LV-PIR (compared to the classical PIR) based on the correlation between $θ$ and $S$. We present a general scheme for LV-PIR as a function of the statistical relationship between $θ$ and $S$, and also provide new results on the capacity/download cost of LV-PIR. Several open problems and new directions are also discussed.
△ Less
Submitted 14 May, 2020; v1 submitted 16 January, 2020;
originally announced January 2020.
-
The Capacity of Private Information Retrieval from Uncoded Storage Constrained Databases
Authors:
Mohamed Adel Attia,
Deepak Kumar,
Ravi Tandon
Abstract:
Private information retrieval (PIR) allows a user to retrieve a desired message from a set of databases without revealing the identity of the desired message. The replicated databases scenario was considered by Sun and Jafar, 2016, where $N$ databases can store the same $K$ messages completely. A PIR scheme was developed to achieve the optimal download cost given by…
▽ More
Private information retrieval (PIR) allows a user to retrieve a desired message from a set of databases without revealing the identity of the desired message. The replicated databases scenario was considered by Sun and Jafar, 2016, where $N$ databases can store the same $K$ messages completely. A PIR scheme was developed to achieve the optimal download cost given by $\left(1+ \frac{1}{N}+ \frac{1}{N^{2}}+ \cdots + \frac{1}{N^{K-1}}\right)$. In this work, we consider the problem of PIR from storage constrained databases. Each database has a storage capacity of $μKL$ bits, where $L$ is the size of each message in bits, and $μ\in [1/N, 1]$ is the normalized storage. On one extreme, $μ=1$ is the replicated databases case. On the other hand, when $μ= 1/N$, then in order to retrieve a message privately, the user has to download all the messages from the databases achieving a download cost of $1/K$. We aim to characterize the optimal download cost versus storage trade-off for any storage capacity in the range $μ\in [1/N, 1]$. For any $(N,K)$, we show that the optimal trade-off between storage, $μ$, and the download cost, $D(μ)$, is given by the lower convex hull of the $N$ pairs $\left(μ= \frac{t}{N},D(μ) = \left(1+ \frac{1}{t}+ \frac{1}{t^{2}}+ \cdots + \frac{1}{t^{K-1}}\right)\right)$ for $t=1,2,\ldots, N$. To prove this result, we first present the storage constrained PIR scheme for any $(N,K)$. We next obtain a general lower bound on the download cost for PIR, which is valid for the following storage scenarios: replicated or storage constrained, coded or uncoded, and fixed or optimized. We then specialize this bound using the uncoded storage assumption to obtain lower bounds matching the achievable download cost of the storage constrained PIR scheme for any value of the available storage.
△ Less
Submitted 23 October, 2018; v1 submitted 10 May, 2018;
originally announced May 2018.
-
Near Optimal Coded Data Shuffling for Distributed Learning
Authors:
Mohamed A. Attia,
Ravi Tandon
Abstract:
Data shuffling between distributed cluster of nodes is one of the critical steps in implementing large-scale learning algorithms. Randomly shuffling the data-set among a cluster of workers allows different nodes to obtain fresh data assignments at each learning epoch. This process has been shown to provide improvements in the learning process. However, the statistical benefits of distributed data…
▽ More
Data shuffling between distributed cluster of nodes is one of the critical steps in implementing large-scale learning algorithms. Randomly shuffling the data-set among a cluster of workers allows different nodes to obtain fresh data assignments at each learning epoch. This process has been shown to provide improvements in the learning process. However, the statistical benefits of distributed data shuffling come at the cost of extra communication overhead from the master node to worker nodes, and can act as one of the major bottlenecks in the overall time for computation. There has been significant recent interest in devising approaches to minimize this communication overhead. One approach is to provision for extra storage at the computing nodes. The other emerging approach is to leverage coded communication to minimize the overall communication overhead.
The focus of this work is to understand the fundamental trade-off between the amount of storage and the communication overhead for distributed data shuffling. In this work, we first present an information theoretic formulation for the data shuffling problem, accounting for the underlying problem parameters (number of workers, $K$, number of data points, $N$, and the available storage, $S$ per node). We then present an information theoretic lower bound on the communication overhead for data shuffling as a function of these parameters. We next present a novel coded communication scheme and show that the resulting communication overhead of the proposed scheme is within a multiplicative factor of at most $\frac{K}{K-1}$ from the information-theoretic lower bound. Furthermore, we present the aligned coded shuffling scheme for some storage values, which achieves the optimal storage vs communication trade-off for $K<5$, and further reduces the maximum multiplicative gap down to $\frac{K-\frac{1}{3}}{K-1}$, for $K\geq 5$.
△ Less
Submitted 5 January, 2018;
originally announced January 2018.
-
Combating Computational Heterogeneity in Large-Scale Distributed Computing via Work Exchange
Authors:
Mohamed A. Attia,
Ravi Tandon
Abstract:
Owing to data-intensive large-scale applications, distributed computation systems have gained significant recent interest, due to their ability of running such tasks over a large number of commodity nodes in a time efficient manner. One of the major bottlenecks that adversely impacts the time efficiency is the computational heterogeneity of distributed nodes, often limiting the task completion tim…
▽ More
Owing to data-intensive large-scale applications, distributed computation systems have gained significant recent interest, due to their ability of running such tasks over a large number of commodity nodes in a time efficient manner. One of the major bottlenecks that adversely impacts the time efficiency is the computational heterogeneity of distributed nodes, often limiting the task completion time due to the slowest worker.
In this paper, we first present a lower bound on the expected computation time based on the work-conservation principle. We then present our approach of work exchange to combat the latency problem, in which faster workers can be reassigned additional leftover computations that were originally assigned to slower workers. We present two variations of the work exchange approach: a) when the computational heterogeneity knowledge is known a priori; and b) when heterogeneity is unknown and is estimated in an online manner to assign tasks to distributed workers. As a baseline, we also present and analyze the use of an optimized Maximum Distance Separable (MDS) coded distributed computation scheme over heterogeneous nodes. Simulation results also compare the proposed approach of work exchange, the baseline MDS coded scheme and the lower bound obtained via work-conservation principle. We show that the work exchange scheme achieves time for computation which is very close to the lower bound with limited coordination and communication overhead even when the knowledge about heterogeneity levels is not available.
△ Less
Submitted 22 November, 2017;
originally announced November 2017.