-
Minimal Communication-Cost Statistical Learning
Authors:
Milad Sefidgaran,
Abdellatif Zaidi,
Piotr Krasnowski
Abstract:
A client device which has access to $n$ training data samples needs to obtain a statistical hypothesis or model $W$ and then to send it to a remote server. The client and the server devices share some common randomness sequence as well as a prior on the hypothesis space. In this problem a suitable hypothesis or model $W$ should meet two distinct design criteria simultaneously: (i) small (population) risk during the inference phase and (ii) small 'complexity' for it to be conveyed to the server with minimum communication cost. In this paper, we propose a joint training and source coding scheme with provable in-expectation guarantees, where the expectation is over the encoder's output message. Specifically, we show that by imposing a constraint on a suitable Kullback-Leibler divergence between the conditional distribution induced by a compressed learning model $\widehat{W}$ given $W$ and the prior, one guarantees simultaneously small average empirical risk (aka training loss), small average generalization error and small average communication cost. We also consider a one-shot scenario in which the guarantees on the empirical risk and generalization error are obtained for every encoder's output message.
Submitted 12 June, 2024;
originally announced June 2024.
-
Minimum Description Length and Generalization Guarantees for Representation Learning
Authors:
Milad Sefidgaran,
Abdellatif Zaidi,
Piotr Krasnowski
Abstract:
A major challenge in designing efficient statistical supervised learning algorithms is finding representations that perform well not only on available training samples but also on unseen data. While the study of representation learning has spurred much interest, most existing approaches are heuristic, and very little is known about theoretical generalization guarantees.
In this paper, we establish a compressibility framework that allows us to derive upper bounds on the generalization error of a representation learning algorithm in terms of the "Minimum Description Length" (MDL) of the labels or the latent variables (representations). Rather than the mutual information between the encoder's input and the representation, which is often believed to reflect the algorithm's generalization capability in the related literature but, in fact, falls short of doing so, our new bounds involve the "multi-letter" relative entropy between the distribution of the representations (or labels) of the training and test sets and a fixed prior. In particular, these new bounds reflect the structure of the encoder and are not vacuous for deterministic algorithms. Our compressibility approach, which is information-theoretic in nature, builds upon that of Blum-Langford for PAC-MDL bounds and introduces two essential ingredients: block-coding and lossy-compression. The latter allows our approach to subsume the so-called geometrical compressibility as a special case. To the best knowledge of the authors, the established generalization bounds are the first of their kind for Information Bottleneck (IB) type encoders and representation learning. Finally, we partly exploit the theoretical results by introducing a new data-dependent prior. Numerical simulations illustrate the advantages of such well-chosen priors over classical priors used in IB.
Submitted 5 February, 2024;
originally announced February 2024.
-
Lessons from Generalization Error Analysis of Federated Learning: You May Communicate Less Often!
Authors:
Milad Sefidgaran,
Romain Chor,
Abdellatif Zaidi,
Yijun Wan
Abstract:
We investigate the generalization error of statistical learning models in a Federated Learning (FL) setting. Specifically, we study the evolution of the generalization error with the number of communication rounds $R$ between $K$ clients and a parameter server (PS), i.e., the effect on the generalization error of how often the clients' local models are aggregated at PS. In our setup, the more the clients communicate with PS the less data they use for local training in each round, such that the amount of training data per client is identical for distinct values of $R$. We establish PAC-Bayes and rate-distortion theoretic bounds on the generalization error that account explicitly for the effect of the number of rounds $R$, in addition to the number of participating devices $K$ and individual dataset size $n$. The bounds, which apply to a large class of loss functions and learning algorithms, appear to be the first of their kind for the FL setting. Furthermore, we apply our bounds to FL-type Support Vector Machines (FSVM); and derive (more) explicit bounds in this case. In particular, we show that the generalization bound of FSVM increases with $R$, suggesting that more frequent communication with PS diminishes the generalization power. This implies that the population risk decreases more slowly with $R$ than the empirical risk does. Moreover, our bound suggests that the generalization error of FSVM decreases faster than that of centralized learning by a factor of $\mathcal{O}(\sqrt{\log(K)/K})$. Finally, we provide experimental results obtained using neural networks (ResNet-56) which show evidence that not only may our observations for FSVM hold more generally but also that the population risk may even start to increase beyond some value of $R$.
Submitted 10 June, 2024; v1 submitted 9 June, 2023;
originally announced June 2023.
-
More Communication Does Not Result in Smaller Generalization Error in Federated Learning
Authors:
Romain Chor,
Milad Sefidgaran,
Abdellatif Zaidi
Abstract:
We study the generalization error of statistical learning models in a Federated Learning (FL) setting. Specifically, there are $K$ devices or clients, each holding its own independent dataset of size $n$. Individual models, learned locally via Stochastic Gradient Descent, are aggregated (averaged) by a central server into a global model and then sent back to the devices. We consider multiple (say $R \in \mathbb N^*$) rounds of model aggregation and study the effect of $R$ on the generalization error of the final aggregated model. We establish an upper bound on the generalization error that accounts explicitly for the effect of $R$ (in addition to the number of participating devices $K$ and dataset size $n$). It is observed that, for fixed $(n, K)$, the bound increases with $R$, suggesting that the generalization of such learning algorithms is negatively affected by more frequent communication with the parameter server. Combined with the fact that the empirical risk, however, generally decreases for larger values of $R$, this indicates that $R$ might be a parameter to optimize to reduce the population risk of FL algorithms. The results of this paper, which extend straightforwardly to the heterogeneous data setting, are also illustrated through numerical examples.
Submitted 11 May, 2023; v1 submitted 24 April, 2023;
originally announced April 2023.
-
Data-dependent Generalization Bounds via Variable-Size Compressibility
Authors:
Milad Sefidgaran,
Abdellatif Zaidi
Abstract:
In this paper, we establish novel data-dependent upper bounds on the generalization error through the lens of a "variable-size compressibility" framework that we introduce here. In this framework, the generalization error of an algorithm is linked to a variable-size 'compression rate' of its input data. This is shown to yield bounds that depend on the empirical measure of the given input data, rather than its unknown distribution. The new generalization bounds that we establish are tail bounds, tail bounds on the expectation, and in-expectation bounds. Moreover, it is shown that our framework also allows one to derive general bounds on any function of the input data and output hypothesis random variables. In particular, these general bounds are shown to subsume and possibly improve over several existing PAC-Bayes and data-dependent intrinsic dimension-based bounds that are recovered as special cases, thus unveiling a unifying character of our approach. For instance, a new data-dependent intrinsic dimension-based bound is established, which connects the generalization error to the optimization trajectories and reveals various interesting connections with the rate-distortion dimension of a process, the Rényi information dimension of a process, and the metric mean dimension.
Submitted 11 June, 2024; v1 submitted 9 March, 2023;
originally announced March 2023.
-
Rate-Distortion Theoretic Bounds on Generalization Error for Distributed Learning
Authors:
Milad Sefidgaran,
Romain Chor,
Abdellatif Zaidi
Abstract:
In this paper, we use tools from rate-distortion theory to establish new upper bounds on the generalization error of statistical distributed learning algorithms. Specifically, there are $K$ clients whose individually chosen models are aggregated by a central server. The bounds depend on the compressibility of each client's algorithm while keeping other clients' algorithms un-compressed, and leverage the fact that small changes in each local model change the aggregated model by a factor of only $1/K$. Adopting a recently proposed approach by Sefidgaran et al., and extending it suitably to the distributed setting, this enables smaller rate-distortion terms which are shown to translate into tighter generalization bounds. The bounds are then applied to the distributed support vector machines (SVM), suggesting that the generalization error of the distributed setting decays faster than that of the centralized one with a factor of $\mathcal{O}(\log(K)/\sqrt{K})$. This finding is validated also experimentally. A similar conclusion is obtained for a multiple-round federated learning setup where each client uses stochastic gradient Langevin dynamics (SGLD).
Submitted 22 November, 2022; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Rate-Distortion Theoretic Generalization Bounds for Stochastic Learning Algorithms
Authors:
Milad Sefidgaran,
Amin Gohari,
Gaël Richard,
Umut Şimşekli
Abstract:
Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, compressibility of the hypothesis space, and the fractal dimension of the hypothesis space. While these bounds have illuminated the problem at hand from different angles, their suggested complexity notions might appear seemingly unrelated, thereby restricting their high-level impact. In this study, we prove novel generalization bounds through the lens of rate-distortion theory, and explicitly relate the concepts of mutual information, compressibility, and fractal dimensions in a single mathematical framework. Our approach consists of (i) defining a generalized notion of compressibility by using source coding concepts, and (ii) showing that the 'compression error rate' can be linked to the generalization error both in expectation and with high probability. We show that in the 'lossless compression' setting, we recover and improve existing mutual information-based bounds, whereas a 'lossy compression' scheme allows us to link generalization to the rate-distortion dimension -- a particular notion of fractal dimension. Our results bring a more unified perspective on generalization and open up several future research directions.
Submitted 29 June, 2022; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks
Authors:
Melih Barsbey,
Milad Sefidgaran,
Murat A. Erdogdu,
Gaël Richard,
Umut Şimşekli
Abstract:
Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes the networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm has a key role in obtaining such compressible networks. Focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently, (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. In the case where these two phenomena occur simultaneously, we prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which indeed confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy-tails, which, in combination with overparametrization, result in compressibility.
Submitted 7 June, 2021;
originally announced June 2021.
-
Zero-Error Sum Modulo Two with a Common Observation
Authors:
Milad Sefidgaran,
Aslan Tchamkerten
Abstract:
This paper investigates the classical modulo two sum problem in source coding, but with a common observation: a transmitter observes $(X,Z)$, the other transmitter observes $(Y,Z)$, and the receiver wants to compute $X \oplus Y$ without error. Through a coupling argument, this paper establishes a new lower bound on the sum-rate when $X-Z-Y$ forms a Markov chain.
Submitted 22 March, 2021; v1 submitted 1 February, 2021;
originally announced February 2021.
-
Lower Bound on the Capacity of the Continuous-Space SSFM Model of Optical Fiber
Authors:
Milad Sefidgaran,
Mansoor Yousefi
Abstract:
The capacity of a discrete-time model of optical fiber described by the split-step Fourier method (SSFM) as a function of the signal-to-noise ratio $\text{SNR}$ and the number of segments in distance $K$ is considered. It is shown that if $K\geq \text{SNR}^{2/3}$ and $\text{SNR} \rightarrow \infty$, the capacity of the resulting continuous-space lossless model is lower bounded by $\frac{1}{2}\log_2(1+\text{SNR}) - \frac{1}{2}+ o(1)$, where $o(1)$ tends to zero with $\text{SNR}$. As $K\rightarrow \infty$, the inter-symbol interference (ISI) averages out to zero due to the law of large numbers and the SSFM model tends to a diagonal phase noise model. It follows that, in contrast to the discrete-space model where there is only one signal degree-of-freedom (DoF) at high powers, the number of DoFs in the continuous-space model is at least half of the input dimension $n$. Intensity-modulation and direct detection achieves this rate. The pre-log in the lower bound when $K= \sqrt[δ]{\text{SNR}}$ is generally characterized in terms of $δ$.
It is shown that if the nonlinearity parameter $γ\rightarrow \infty$, the capacity of the continuous-space model is $\frac{1}{2}\log_2(1+\text{SNR})+ o(1)$.
The SSFM model when the dispersion matrix does not depend on $K$ is considered. It is shown that the capacity of this model when $K= \sqrt[δ]{\text{SNR}}$, $δ>3$, and $\text{SNR} \rightarrow \infty$ is $\frac{1}{2n}\log_2(1+\text{SNR})+ O(1)$. Thus, there is only one DoF in this model.
Finally, it is found that the maximum achievable information rates (AIRs) of the SSFM model with back-propagation equalization, obtained using numerical simulation, follow a double-ascent curve.
Submitted 26 September, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Distributed Function Computation Over a Rooted Directed Tree
Authors:
Milad Sefidgaran,
Aslan Tchamkerten
Abstract:
This paper establishes the rate region for a class of source coding function computation setups where sources of information are available at the nodes of a tree and where a function of these sources must be computed at the root. The rate region holds for any function as long as the sources' joint distribution satisfies a certain Markov criterion. This criterion is met, in particular, when the sources are independent.
This result recovers the rate regions of several function computation setups. These include the point-to-point communication setting with arbitrary sources, the noiseless multiple access network with "conditionally independent sources," and the cascade network with Markovian sources.
Submitted 7 April, 2015; v1 submitted 12 December, 2013;
originally announced December 2013.
-
On Cooperation in Multi-Terminal Computation and Rate Distortion
Authors:
Milad Sefidgaran,
Aslan Tchamkerten
Abstract:
A receiver wants to compute a function of two correlated sources separately observed by two transmitters. One of the transmitters may send a possibly private message to the other transmitter in a cooperation phase before both transmitters communicate to the receiver. For this network configuration this paper investigates both a function computation setup, wherein the receiver wants to compute a given function of the sources exactly, and a rate distortion setup, wherein the receiver wants to compute a given function within some distortion.
For the function computation setup, a general inner bound to the rate region is established and shown to be tight in a number of cases: partially invertible functions, full cooperation between transmitters, one-round point-to-point communication, two-round point-to-point communication, and the cascade setup where the transmitters and the receiver are aligned. In particular it is shown that the ratio of the total number of transmitted bits without cooperation and the total number of transmitted bits with cooperation can be arbitrarily large. Furthermore, one bit of cooperation suffices to arbitrarily reduce the amount of information both transmitters need to convey to the receiver.
For the rate distortion version, an inner bound to the rate region is exhibited which always includes, and sometimes strictly, the convex hull of Kaspi-Berger's related inner bounds. The strict inclusion is shown via two examples.
Submitted 7 April, 2015; v1 submitted 4 March, 2013;
originally announced March 2013.
-
On Computing a Function of Correlated Sources
Authors:
Milad Sefidgaran,
Aslan Tchamkerten
Abstract:
A receiver wants to compute a function f of two correlated sources X and Y and side information Z. What is the minimum number of bits that needs to be communicated by each transmitter?
In this paper, we derive inner and outer bounds to the rate region of this problem which coincide in the cases where f is partially invertible and where the sources are independent given the side information.
These rate regions point to an important difference with the single source case. Whereas for the latter it is sufficient to consider independent sets of some suitable characteristic graph, for multiple sources such a restriction is suboptimal and multisets are necessary.
Submitted 11 October, 2012; v1 submitted 28 July, 2011;
originally announced July 2011.