Abstract
Federated learning (FL), a novel branch of distributed machine learning (ML), develops global models through a private procedure without direct access to local datasets. However, it is still possible to access the model updates (gradient updates of deep neural networks) transferred between clients and servers, potentially revealing sensitive local information to adversaries using model inversion attacks. Differential privacy (DP) offers a promising approach to addressing this issue by adding noise to the parameters. On the other hand, heterogeneities in data structure, storage, communication, and computational capabilities of devices can cause convergence problems and delays in developing the global model. A personalized weighted averaging of local parameters based on the resources of each device can yield a better aggregated model in each round. In this paper, to efficiently preserve privacy, we propose a personalized DP framework that injects noise based on clients’ relative impact factors and aggregates parameters while considering heterogeneities and adjusting properties. To fulfill the DP requirements, we first analyze the convergence boundary of the FL algorithm when impact factors are personalized and fixed throughout the learning process. We then further study the convergence property considering time-varying (adaptive) impact factors.
I Introduction
Smart distributed systems such as smartphones, automated vehicles, multi-agent systems, and wearable devices are growing rapidly in our daily lives. Their underlying mechanisms, which combine sensing and communication, generate an unprecedented amount of data every day. Therefore, using these rich sources of information to enhance the services offered to the people and organizations that own the data, without violating their privacy, matters a great deal. Developments in the computational and communication capabilities of intelligent distributed devices, along with their ability to collect and store large datasets, have opened up effective alternatives for managing and analyzing local databases.
A common traditional practice for developing predictive machine learning (ML) models is to transmit raw data over networks and generate models in a centralized manner. While this method has provided data owners with valuable services over the years, its efficiency for today’s crowdsourced data is called into question. The communication cost of sending large volumes of data on the one hand, and privacy concerns about sharing personal information on the other, have made room for decentralized ML algorithms such as federated learning (FL) [1].
Federated machine learning is a promising solution in settings dealing with large volumes of data as well as privacy concerns about clients’ sensitive information [2]. In this framework, each device builds its model using local datasets, and the essential model parameters, rather than raw data, are transmitted to the cloud server. The server aggregates these parameters and updates the global model throughout a recursive downloading and uploading cycle [3, 4, 5]. Hence, each client benefits from a larger database during the learning process, without direct access to it. While offering great advantages over conventional ML methods, FL has its own challenges. Expensive communications, systems and statistical heterogeneities, and security risks are considered the four main issues when developing FL models [6].
Deep learning (DL) models are widely used in FL, especially for feature extraction in large image, voice, and text datasets. To optimize local DL models inside the clients, stochastic gradient descent (SGD) is generally adopted [7, 8, 9]. Sending frequent gradient updates with the massive number of both parameters and clients in FL leads to an extreme rise in communication costs. Increasing the number of local updates [10, 11] is one natural way to relieve communication bottlenecks at the price of more local computation. On the other hand, quantization [12, 13, 14] and sparsification [15, 16] methods mitigate this challenge by reducing the size of transmitted messages in each round.
Dealing with systems that have different computational capabilities, network capacities, and power resources is an inevitable challenge in FL. Several approaches, including resource-based client selection [17, 18], robust and fault-tolerant algorithms [19], and asynchronous communications [20], address these challenges. On the other hand, heterogeneity in the data distributions of the clients causes problems in the training and convergence of FL algorithms. Using multi-task learning methods [21, 22] and avoiding local minima by adding a proximal term to the objective function [23] help handle unbalanced and non-IID data in FL [6].
Even though FL was first proposed for its strong privacy guarantees, it has been shown that local datasets can still be revealed to adversaries through model inversion attacks on shared updates [24], especially when DL is used in local models [25, 26]. To mitigate this challenge, differential privacy (DP) is one of the most widely used protection mechanisms due to its solid theoretical guarantees [27, 28]. To reduce the risk of data leakage in ML algorithms, DP deliberately adds noise drawn from a Gaussian, Laplace, or exponential distribution to the data. The work in [29] proposes a global DP algorithm in FL and gives a theoretical explanation for the convergence behavior of the suggested scheme.
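To make the noise-injection idea concrete, the following is a minimal sketch of the classic Gaussian mechanism applied to a clipped model update. It is not the noise calibration derived in this paper: the function names, the choice of taking the sensitivity equal to the clipping norm, and the example values are all illustrative assumptions. The calibration `sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon` is the standard textbook one, stated for `0 < epsilon < 1`.

```python
import math
import random
from typing import List, Sequence


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Textbook Gaussian-mechanism noise scale (stated for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def clip(update: Sequence[float], clip_norm: float) -> List[float]:
    """Rescale the update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update: Sequence[float], clip_norm: float, epsilon: float,
              delta: float, rng: random.Random) -> List[float]:
    """Clip the update, then add independent Gaussian noise per coordinate.

    Assumption (illustrative only): the sensitivity is taken to be clip_norm.
    """
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip_norm)
    return [x + rng.gauss(0.0, sigma) for x in clip(update, clip_norm)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_update = privatize([3.0, 4.0], clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
print(noisy_update)
```

A smaller `epsilon` or `delta` yields a larger `sigma`, which is the privacy versus accuracy tension discussed throughout the paper.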
As discussed earlier, nodes in a distributed architecture differ in data structure, dataset size, network condition, reliability, availability, and computation capabilities, which can even be time-varying. A privacy-preserving approach in FL is not effective unless it accounts for these personalized characteristics. Hence, there exist multiple works on DP under the label “adaptive” that aim to compensate for systems and statistical heterogeneities in FL. These works can be divided into two general directions based on the adaptability criterion. One direction injects an adaptive noise distribution into local parameters to enhance local protection. It considers each client separately without involving their heterogeneity. For instance, adaptive clipping [30] finds the best clipping constant for DP in each device based on its local behavior. In [31], noise with a Laplace distribution is added to model updates based on the neurons’ contributions in the clients. The work in [32] achieves a trade-off between privacy and accuracy by adding more noise to less important parameters and less noise to more important ones.
The second direction, however, concentrates on personalized training in heterogeneous networks. The work in [33] trains differentially private models in each client and uploads the local updates to the server. Both directions fail to consider local characteristics in the aggregation process. More specifically, they assume the same impact factor for all devices during aggregation, regardless of their local dissimilarities. This assumption not only simplifies the convergence analysis of the algorithm but also changes the DP requirements [8]. To the best of our knowledge, privacy and convergence analysis in FL with non-identical and time-varying impact factors has not yet been studied in the existing literature.
In this paper, we combine the heterogeneity and privacy concerns in a novel FL scheme. Regardless of multi-task learning algorithms used in FL, each local model possesses a weight or an impact in the global cost function. This impact can be assigned considering many factors by the server or the clients. It can also change (increase or decrease) or even become zero during the learning process. We, therefore, propose a DP algorithm considering the non-identical impact factors, namely, personalized aggregation in differentially private federated learning (PADPFL). We further establish the convergence analysis of the algorithm and the influence of the additive noise on it.
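The role of impact factors in aggregation can be sketched as a weighted average of the clients' parameter vectors. This is a minimal illustration under stated assumptions, not the PADPFL aggregation rule itself: the function name `aggregate`, the normalization step, and the example values are hypothetical, and no DP noise is included here.

```python
from typing import List, Sequence


def aggregate(local_params: Sequence[Sequence[float]],
              impact_factors: Sequence[float]) -> List[float]:
    """Impact-weighted average of client parameter vectors.

    Assumption: impact factors are non-negative and are normalized here to
    sum to one. A factor of zero removes a client from the round.
    """
    if len(local_params) != len(impact_factors):
        raise ValueError("one impact factor per client is required")
    if any(p < 0 for p in impact_factors):
        raise ValueError("impact factors must be non-negative")
    total = sum(impact_factors)
    if total <= 0:
        raise ValueError("at least one impact factor must be positive")
    dim = len(local_params[0])
    global_params = [0.0] * dim
    for params, p in zip(local_params, impact_factors):
        weight = p / total  # relative impact of this client
        for j in range(dim):
            global_params[j] += weight * params[j]
    return global_params


clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# The third client has zero impact in this round.
print(aggregate(clients, [0.5, 0.5, 0.0]))  # → [2.0, 3.0]
```

Setting all impact factors equal recovers plain averaging, which is the identical-weight assumption that the works discussed above rely on.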
In summary, the main contributions of this paper are as follows.
• We propose a noise injection paradigm, PADPFL, that satisfies DP requirements with Gaussian distribution when clients have different impact factors in the aggregation process.
• We perform a convergence analysis of the proposed algorithm for non-IID clients when using fixed non-identical impact factors throughout training the global model.
• We perform a convergence analysis of the proposed algorithm for non-IID clients when using adaptive (time-varying) impact factors throughout training the global model.
• We conduct evaluations on real-world datasets to verify the effectiveness of PADPFL, and observe the trade-off between model accuracy, privacy budget, and impact factors.
The remainder of this paper is organized as follows. In Section II, we review some preliminaries on FL and DP. In Section III, we introduce our approach for differentially private federated learning on the client and server sides. Next, we analyze the convergence bound on the global loss function of the proposed solution for fixed and time-varying impact factors in Sections IV and V, respectively. Simulations and results are presented in Section IV, and the summary and conclusion are given in Section VI.
IV Convergence Analysis of the Personalized DP in FL
In this section, we analyze the convergence properties of the proposed algorithm for personalized DP in FL. Our main purpose is to derive a convergence upper bound for the algorithm when the impact factors are personalized.
The required assumptions for our analysis about the properties of the global and local loss functions, regarding their relation , are as follows:
Assumption 1.
1. is convex.
2. is -Lipschitz smooth, i.e., .
3. ; where and represent the initial and optimal model parameters, respectively.
4. ; where is the divergence measure.
Note that the non-IID distribution of local datasets breaks the general assumption of . Hence, the expectation over clients is not considered equal to the global expectation . The only assumption on relative impact factors is .
As the first step in our convergence analysis, we present the following lemma on the local dissimilarity measure under non-identical impact factors.
Lemma 1 (-local dissimilarity).
For the local loss functions with impact factors in the FL global function , there exists as a measure of dissimilarity at such that
|
|
|
(15) |
Proof.
Due to Assumption 1, we have
|
|
|
(16) |
and
|
|
|
(17) |
Considering (17) and multiplying (16) with yields
|
|
|
(18) |
Considering and , we have
|
|
|
(19) |
Note that when , there exists
|
|
|
(20) |
Therefore, we have
|
|
|
(21) |
where is the upper bound of .
Considering (21), there also exists such that
|
|
|
(22) |
This completes the proof.
∎
Now, the following lemma gives an upper bound on the expected per-iteration increment of the global loss value when DP noise injection is adopted.
Lemma 2 (Per-iteration expected increment).
The expected difference of global loss functions in two consecutive iterations and , or the per-iteration expected increment in the value of the loss function, has the following upper limit:
|
|
|
(23) |
where
|
|
|
|
|
|
and is the aggregated noise of the clients and server in each cycle.
Proof.
Considering the aggregation process with artificial noises of the client and server side in the -th aggregation, we have
|
|
|
(24) |
where
|
|
|
(25) |
Because is -Lipschitz smooth, we have
|
|
|
(26) |
for all . Summation of (26) multiplied with yields
|
|
|
(27) |
Considering the definition of global loss function and , we have
|
|
|
(28) |
and therefore,
|
|
|
(29) |
Defining
|
|
|
(30) |
we have
|
|
|
(31) |
Summation of (31) multiplied with yields
|
|
|
(32) |
and therefore,
|
|
|
(33) |
Substituting (33) into (29), we obtain
|
|
|
(34) |
Now, let us bound . We know
|
|
|
(35) |
where Define , due to the -convexity of we have
|
|
|
(36) |
and
|
|
|
(37) |
where denotes a -inexact solution of [37]. For such a solution, , we have
|
|
|
(38) |
Now we can use (36) and (37) to obtain
|
|
|
(39) |
Therefore,
|
|
|
(40) |
Since is -Lipschitz smooth, we have
|
|
|
(41) |
Using the triangle inequality, (38), (40), and (41), we obtain
|
|
|
(42) |
Then, from (42) and the Cauchy-Schwarz inequality we have
|
|
|
(43) |
Substituting (40) and (43) into (34) yields
|
|
|
(44) |
Then, we obtain
|
|
|
(45) |
where
|
|
|
|
|
|
This completes the proof.
∎
As expected, Lemma 2 indicates the adverse effect of differential privacy on the expected per-iteration increment of the global loss value. ….
As the final step, we use the per-iteration increment to establish the convergence analysis of the proposed algorithm.
Theorem 2 (Convergence upper bound of personalized ….).
The upper limit of the difference between the -th and the optimal loss function values defined as the convergence property is given by
|
|
|
(46) |
where .
Proof.
Considering that the additive noise is independent and identically distributed, we define . Applying (45) recursively for yields
|
|
|
(47) |
Considering and adding to both sides of (47), we have
|
|
|
(48) |
Since we have , we obtain
|
|
|
(49) |
Setting and substituting (49) into (48), we have
|
|
|
(50) |
where .
This completes the proof.
∎
The last two terms on the right-hand side of (46) depend directly on the amount of noise. Lower values strengthen the privacy protection but adversely affect the convergence property. The first two terms, however, are the constant parts depending on the number of iterations. …
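The qualitative effect of noise on the loss gap can be illustrated with a toy experiment. The following is a minimal sketch, not the PADPFL algorithm and not an evaluation of the bound in (46): it assumes scalar quadratic local losses, one gradient step per round, impact factors that sum to one, and Gaussian noise with a hypothetical standard deviation `sigma` added to each upload. All names and constants are illustrative.

```python
import random
from typing import Sequence


def average_loss_gap(centers: Sequence[float],
                     impact_factors: Sequence[float],
                     sigma: float,
                     rounds: int = 200,
                     lr: float = 0.5,
                     seed: int = 0) -> float:
    """Average of F(w_t) - F(w*) over the second half of a toy noisy FL run.

    Hypothetical setup: client i holds F_i(w) = 0.5 * (w - c_i)**2 and the
    global loss is the impact-weighted sum, so the minimizer w* is the
    weighted mean of the centers and the gap equals 0.5 * (w - w*)**2.
    """
    rng = random.Random(seed)
    w_star = sum(p * c for p, c in zip(impact_factors, centers))
    w = 0.0
    gaps = []
    for t in range(rounds):
        # Each client takes one gradient step and perturbs its upload.
        uploads = [w - lr * (w - c) + rng.gauss(0.0, sigma) for c in centers]
        # The server aggregates the uploads with the impact factors.
        w = sum(p * u for p, u in zip(impact_factors, uploads))
        if t >= rounds // 2:
            gaps.append(0.5 * (w - w_star) ** 2)
    return sum(gaps) / len(gaps)


centers = [1.0, 2.0, 4.0]
impacts = [0.5, 0.25, 0.25]
clean = average_loss_gap(centers, impacts, sigma=0.0)
noisy = average_loss_gap(centers, impacts, sigma=0.1)
print(f"no noise: {clean:.3e}, sigma=0.1: {noisy:.3e}")
```

Without noise the gap shrinks geometrically toward zero, whereas with noise it settles at a floor that grows with `sigma`, which mirrors the role of the noise-dependent terms in the bound.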
In the above analysis, we saw that by a wise choice of impact factors, , and we can be confident about the convergence of the FL algorithm while -DP is used. The number of clients involved in learning need not be fixed throughout training in the presented analysis, which enhances the compatibility of the proposed approach. In the next section, we present the analysis of the same algorithm when impact factors change adaptively throughout the learning process.
V Convergence Analysis of DP in FL with adaptive impact factors
In this section, we consider an extension to the previous part in which impact factors are not fixed during training. In fact, the impacts assigned to clients can vary in each iteration based on the devices’ resources or network conditions. The calculated amount of Gaussian noise in section can still be utilized here, since iterations are independent in noise generation. However, the convergence analysis provided in the previous section needs to be generalized.
Here, we change to to represent this adaptability in our equations. Without loss of generality, we assume the relation between two consecutive impact factors to be
|
|
|
(51) |
where is the amount of change that the relative impact factor assigned to -th client undergoes for -th iteration. Hence, and .
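One way to picture time-varying impact factors is as a per-round update that keeps the factors normalized. The sketch below is an illustration under explicit assumptions, not a reconstruction of (51): it assumes the change is additive, that the changes over all clients sum to zero so the factors keep summing to one, and that no factor becomes negative. The function name and values are hypothetical.

```python
from typing import List, Sequence


def update_impact_factors(current: Sequence[float],
                          changes: Sequence[float],
                          tol: float = 1e-9) -> List[float]:
    """Apply per-client changes to the impact factors (assumed additive form).

    Assumptions: the changes sum to zero, so the total impact is preserved,
    and every updated factor stays non-negative.
    """
    if len(current) != len(changes):
        raise ValueError("one change per client is required")
    if abs(sum(changes)) > tol:
        raise ValueError("changes must sum to zero to keep the factors normalized")
    updated = [p + d for p, d in zip(current, changes)]
    if any(p < -tol for p in updated):
        raise ValueError("an impact factor would become negative")
    # Clamp tiny negative rounding errors to zero.
    return [max(p, 0.0) for p in updated]


factors = [0.5, 0.25, 0.25]
# Shift impact from the first client to the second, e.g. after a resource change.
factors = update_impact_factors(factors, [-0.25, 0.25, 0.0])
print(factors)  # → [0.25, 0.5, 0.25]
```

A client whose factor reaches zero simply drops out of that round's aggregation, matching the remark in the introduction that an impact can become zero during learning.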
To analyze the adaptive form, we first present an extension to Lemma 2 and then present the convergence upper bound in Theorem 3.
Lemma 3 (Per-iteration expected increment: Extension).
The per-iteration expected increment in the value of the loss function, when adaptive is adopted, has the following upper limit:
|
|
|
(52) |
where
|
|
|
|
|
|
and is the aggregated noise of the clients and server in each cycle.
Proof.
From (15) we have
|
|
|
(53) |
Adding to both sides of (53) yields
|
|
|
(54) |
Hence, we have
|
|
|
(55) |
where
|
|
|
(56) |
Therefore, we can bound as
|
|
|
(57) |
Summation of (26) multiplied with yields
|
|
|
(58) |
Considering (51), we have
|
|
|
(59) |
Without loss of generality, we assume , and therefore
|
|
|
(60) |
Then, (59) and (60) give
|
|
|
(61) |
Defining as in (30), summation of (31) multiplied with yields
|
|
|
(62) |
and therefore,
|
|
|
(63) |
Substituting (63) into (61), we obtain
|
|
|
(64) |
-Lipschitz continuity of the local loss functions implies a -Lipschitz global loss function. Hence,
|
|
|
(65) |
Therefore, using triangle inequality, (55), and (65) we obtain
|
|
|
(66) |
Substituting (66) and (64) into (34) yields
|
|
|
(67) |
And, we get
|
|
|
(68) |
where
|
|
|
|
|
|
This completes the proof.
∎
Theorem 3 (Convergence upper bound of adaptive personalized ….).
Using adaptive assignment, the upper limit of the difference between the -th and the optimal loss function values defined as the convergence property is given by
|
|
|
(69) |
where .
Proof.
The proof follows by extending the proof of Theorem 2 and using Lemma 3.
∎