
Minimal Communication-Cost Statistical Learning

Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

Abstract: A client device which has access to $n$ training data samples needs to obtain a statistical hypothesis or model $W$ and then to send it to a remote server. The client and the server devices share some common randomness sequence as well as a prior on the hypothesis space. In this problem a suitable hypothesis or model $W$ should meet two distinct design criteria simultaneously: (i) small (population) risk during the inference phase and (ii) small 'complexity' for it to be conveyed to the server with minimum communication cost. In this paper, we propose a joint training and source coding scheme with provable in-expectation guarantees, where the expectation is over the encoder's output message. Specifically, we show that by imposing a constraint on a suitable Kullback-Leibler divergence between the conditional distribution induced by a compressed learning model $\widehat{W}$ given $W$ and the prior, one guarantees simultaneously small average empirical risk (aka training loss), small average generalization error and small average communication cost. We also consider a one-shot scenario in which the guarantees on the empirical risk and generalization error are obtained for every encoder's output message.

Submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted at ISIT 2024

arXiv:2402.03254

Minimum Description Length and Generalization Guarantees for Representation Learning

Authors: Milad Sefidgaran, Abdellatif Zaidi, Piotr Krasnowski

Abstract: A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While the study of representation learning has spurred much interest, most existing such approaches are heuristic; and very little is known about theoretical generalization guarantees. In this paper, we establish a compressibility framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm in terms of the "Minimum Description Length" (MDL) of the labels or the latent variables (representations). Rather than the mutual information between the encoder's input and the representation, which is often believed to reflect the algorithm's generalization capability in the related literature but in fact, falls short of doing so, our new bounds involve the "multi-letter" relative entropy between the distribution of the representations (or labels) of the training and test sets and a fixed prior. In particular, these new bounds reflect the structure of the encoder and are not vacuous for deterministic algorithms. Our compressibility approach, which is information-theoretic in nature, builds upon that of Blum-Langford for PAC-MDL bounds and introduces two essential ingredients: block-coding and lossy-compression. The latter allows our approach to subsume the so-called geometrical compressibility as a special case.
To the best knowledge of the authors, the established generalization bounds are the first of their kind for Information Bottleneck (IB) type encoders and representation learning. Finally, we partly exploit the theoretical results by introducing a new data-dependent prior. Numerical simulations illustrate the advantages of well-chosen such priors over classical priors used in IB.

Submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted and presented at NeurIPS 2023

arXiv:2306.05862

Lessons from Generalization Error Analysis of Federated Learning: You May Communicate Less Often!

Authors: Milad Sefidgaran, Romain Chor, Abdellatif Zaidi, Yijun Wan

Abstract: We investigate the generalization error of statistical learning models in a Federated Learning (FL) setting. Specifically, we study the evolution of the generalization error with the number of communication rounds $R$ between $K$ clients and a parameter server (PS), i.e., the effect on the generalization error of how often the clients' local models are aggregated at PS. In our setup, the more the clients communicate with PS the less data they use for local training in each round, such that the amount of training data per client is identical for distinct values of $R$. We establish PAC-Bayes and rate-distortion theoretic bounds on the generalization error that account explicitly for the effect of the number of rounds $R$, in addition to the number of participating devices $K$ and individual datasets size $n$. The bounds, which apply to a large class of loss functions and learning algorithms, appear to be the first of their kind for the FL setting. Furthermore, we apply our bounds to FL-type Support Vector Machines (FSVM); and derive (more) explicit bounds in this case. In particular, we show that the generalization bound of FSVM increases with $R$, suggesting that more frequent communication with PS diminishes the generalization power. This implies that the population risk decreases less fast with $R$ than does the empirical risk. Moreover, our bound suggests that the generalization error of FSVM decreases faster than that of centralized learning by a factor of $\mathcal{O}(\sqrt{\log(K)/K})$.
Finally, we provide experimental results obtained using neural networks (ResNet-56) which show evidence that not only may our observations for FSVM hold more generally but also that the population risk may even start to increase beyond some value of $R$.

Submitted 10 June, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: Accepted at ICML 2024

arXiv:2304.12216

More Communication Does Not Result in Smaller Generalization Error in Federated Learning

Authors: Romain Chor, Milad Sefidgaran, Abdellatif Zaidi

Abstract: We study the generalization error of statistical learning models in a Federated Learning (FL) setting. Specifically, there are $K$ devices or clients, each holding an independent own dataset of size $n$. Individual models, learned locally via Stochastic Gradient Descent, are aggregated (averaged) by a central server into a global model and then sent back to the devices. We consider multiple (say $R \in \mathbb N^*$) rounds of model aggregation and study the effect of $R$ on the generalization error of the final aggregated model. We establish an upper bound on the generalization error that accounts explicitly for the effect of $R$ (in addition to the number of participating devices $K$ and dataset size $n$). It is observed that, for fixed $(n, K)$, the bound increases with $R$, suggesting that the generalization of such learning algorithms is negatively affected by more frequent communication with the parameter server. Combined with the fact that the empirical risk, however, generally decreases for larger values of $R$, this indicates that $R$ might be a parameter to optimize to reduce the population risk of FL algorithms. The results of this paper, which extend straightforwardly to the heterogeneous data setting, are also illustrated through numerical examples.

Submitted 11 May, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

Comments: Extended version of paper accepted at ISIT 2023

arXiv:2303.05369

Data-dependent Generalization Bounds via Variable-Size Compressibility

Authors: Milad Sefidgaran, Abdellatif Zaidi

Abstract: In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce newly here. In this framework, the generalization error of an algorithm is linked to a variable-size 'compression rate' of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data at hand, rather than its unknown distribution. Our new generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectations bounds. Moreover, it is shown that our framework also allows to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume and possibly improve over several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds that are recovered as special cases, thus unveiling a unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.

Submitted 11 June, 2024; v1 submitted 9 March, 2023; originally announced March 2023.

Comments: Accepted for publication in IEEE Transactions on Information Theory

arXiv:2206.02604

Rate-Distortion Theoretic Bounds on Generalization Error for Distributed Learning

Authors: Milad Sefidgaran, Romain Chor, Abdellatif Zaidi

Abstract: In this paper, we use tools from rate-distortion theory to establish new upper bounds on the generalization error of statistical distributed learning algorithms. Specifically, there are $K$ clients whose individually chosen models are aggregated by a central server. The bounds depend on the compressibility of each client's algorithm while keeping other clients' algorithms un-compressed, and leverage the fact that small changes in each local model change the aggregated model by a factor of only $1/K$. Adopting a recently proposed approach by Sefidgaran et al., and extending it suitably to the distributed setting, this enables smaller rate-distortion terms which are shown to translate into tighter generalization bounds. The bounds are then applied to the distributed support vector machines (SVM), suggesting that the generalization error of the distributed setting decays faster than that of the centralized one with a factor of $\mathcal{O}(\log(K)/\sqrt{K})$. This finding is validated also experimentally. A similar conclusion is obtained for a multiple-round federated learning setup where each client uses stochastic gradient Langevin dynamics (SGLD).

Submitted 22 November, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: Accepted at NeurIPS 2022

arXiv:2203.02474

Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms

Authors: Milad Sefidgaran, Amin Gohari, Gaël Richard, Umut Şimşekli

Abstract: Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the 'compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the 'lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a 'lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.

Submitted 29 June, 2022; v1 submitted 4 March, 2022; originally announced March 2022.

Comments: Accepted for presentation at the Conference on Learning Theory (COLT) 2022

arXiv:2106.03795

Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

Authors: Melih Barsbey, Milad Sefidgaran, Murat A. Erdogdu, Gaël Richard, Umut Şimşekli

Abstract: Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes the networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm has a key role in obtaining such compressible networks. Focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently, (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. In the case where these two phenomena occur simultaneously, we prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases.
We further prove generalization bounds adapted to our theoretical framework, which indeed confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy-tails, which, in combination with overparametrization, result in compressibility.

Submitted 7 June, 2021; originally announced June 2021.

arXiv:2102.00697

Zero-Error Sum Modulo Two with a Common Observation

Authors: Milad Sefidgaran, Aslan Tchamkerten

Abstract: This paper investigates the classical modulo two sum problem in source coding, but with a common observation: a transmitter observes $(X,Z)$, the other transmitter observes $(Y,Z)$, and the receiver wants to compute $X \oplus Y$ without error. Through a coupling argument, this paper establishes a new lower bound on the sum-rate when $X-Z-Y$ forms a Markov chain.

Submitted 22 March, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

Comments: Accepted for presentation at IEEE ITW 2020

arXiv:2011.11341

Lower Bound on the Capacity of the Continuous-Space SSFM Model of Optical Fiber

Authors: Milad Sefidgaran, Mansoor Yousefi

Abstract: The capacity of a discrete-time model of optical fiber described by the split-step Fourier method (SSFM) as a function of the signal-to-noise ratio $\text{SNR}$ and the number of segments in distance $K$ is considered. It is shown that if $K\geq \text{SNR}^{2/3}$ and $\text{SNR} \rightarrow \infty$, the capacity of the resulting continuous-space lossless model is lower bounded by $\frac{1}{2}\log_2(1+\text{SNR}) - \frac{1}{2}+ o(1)$, where $o(1)$ tends to zero with $\text{SNR}$. As $K\rightarrow \infty$, the inter-symbol interference (ISI) averages out to zero due to the law of large numbers and the SSFM model tends to a diagonal phase noise model. It follows that, in contrast to the discrete-space model where there is only one signal degree-of-freedom (DoF) at high powers, the number of DoFs in the continuous-space model is at least half of the input dimension $n$. Intensity-modulation and direct detection achieves this rate. The pre-log in the lower bound when $K= \sqrt[\delta]{\text{SNR}}$ is generally characterized in terms of $\delta$. It is shown that if the nonlinearity parameter $\gamma\rightarrow \infty$, the capacity of the continuous-space model is $\frac{1}{2}\log_2(1+\text{SNR})+ o(1)$. The SSFM model when the dispersion matrix does not depend on $K$ is considered. It is shown that the capacity of this model when $K= \sqrt[\delta]{\text{SNR}}$, $\delta>3$, and $\text{SNR} \rightarrow \infty$ is $\frac{1}{2n}\log_2(1+\text{SNR})+ O(1)$. Thus, there is only one DoF in this model.
Finally, it is found that the maximum achievable information rates (AIRs) of the SSFM model with back-propagation equalization obtained using numerical simulation follows a double-ascent curve.

Submitted 26 September, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

Comments: Submitted to IEEE Transactions on Information Theory

arXiv:1312.3631

Distributed Function Computation Over a Rooted Directed Tree

Authors: Milad Sefidgaran, Aslan Tchamkerten

Abstract: This paper establishes the rate region for a class of source coding function computation setups where sources of information are available at the nodes of a tree and where a function of these sources must be computed at the root. The rate region holds for any function as long as the sources' joint distribution satisfies a certain Markov criterion. This criterion is met, in particular, when the sources are independent. This result recovers the rate regions of several function computation setups. These include the point-to-point communication setting with arbitrary sources, the noiseless multiple access network with "conditionally independent sources," and the cascade network with Markovian sources.

Submitted 7 April, 2015; v1 submitted 12 December, 2013; originally announced December 2013.

Comments: 36 pages, Submitted to IEEE Transactions on Information Theory

arXiv:1303.0817

On Cooperation in Multi-Terminal Computation and Rate Distortion

Authors: Milad Sefidgaran, Aslan Tchamkerten

Abstract: A receiver wants to compute a function of two correlated sources separately observed by two transmitters. One of the transmitters may send a possibly private message to the other transmitter in a cooperation phase before both transmitters communicate to the receiver. For this network configuration this paper investigates both a function computation setup, wherein the receiver wants to compute a given function of the sources exactly, and a rate distortion setup, wherein the receiver wants to compute a given function within some distortion. For the function computation setup, a general inner bound to the rate region is established and shown to be tight in a number of cases: partially invertible functions, full cooperation between transmitters, one-round point-to-point communication, two-round point-to-point communication, and the cascade setup where the transmitters and the receiver are aligned. In particular it is shown that the ratio of the total number of transmitted bits without cooperation and the total number of transmitted bits with cooperation can be arbitrarily large. Furthermore, one bit of cooperation suffices to arbitrarily reduce the amount of information both transmitters need to convey to the receiver. For the rate distortion version, an inner bound to the rate region is exhibited which always includes, and sometimes strictly, the convex hull of Kaspi-Berger's related inner bounds. The strict inclusion is shown via two examples.

Submitted 7 April, 2015; v1 submitted 4 March, 2013; originally announced March 2013.

Comments: 31 pages, Submitted to IEEE Transactions on Information Theory

arXiv:1107.5806

On Computing a Function of Correlated Sources

Authors: Milad Sefidgaran, Aslan Tchamkerten

Abstract: A receiver wants to compute a function f of two correlated sources X and Y and side information Z. What is the minimum number of bits that needs to be communicated by each transmitter? In this paper, we derive inner and outer bounds to the rate region of this problem which coincide in the cases where f is partially invertible and where the sources are independent given the side information. These rate regions point to an important difference with the single source case. Whereas for the latter it is sufficient to consider independent sets of some suitable characteristic graph, for multiple sources such a restriction is suboptimal and multisets are necessary.

Submitted 11 October, 2012; v1 submitted 28 July, 2011; originally announced July 2011.

Comments: 11 pages, Submitted to IEEE Transactions on Information Theory

Showing 1–13 of 13 results for author: Sefidgaran, M