Wireless Federated Learning over Resource-Constrained Networks: Digital versus Analog Transmissions

Jiacheng Yao, Wei Xu, Zhaohui Yang, Xiaohu You, Mehdi Bennis, and H. Vincent Poor

Part of this work is presented in IEEE ICC 2024 [1]. J. Yao, W. Xu and X. You are with the National Mobile Communications Research Laboratory (NCRL), Southeast University, Nanjing 210096, China ({jcyao, wxu}@seu.edu.cn). Zhaohui Yang is with the Zhejiang Lab, Hangzhou 311121, China, and also with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, Zhejiang 310027, China ([email protected]). Mehdi Bennis is with the Center for Wireless Communications, Oulu University, Oulu 90014, Finland (e-mail: [email protected]). H. Vincent Poor is with the Department of Electrical and Computer Engineering, Princeton University, NJ 08544 USA (e-mail: [email protected]).

Abstract

To enable wireless federated learning (FL) in communication resource-constrained networks, two communication schemes, i.e., digital and analog ones, are effective solutions. In this paper, we quantitatively compare these two techniques, highlighting their essential differences as well as the scenarios to which each is suited. We first examine both digital and analog transmission schemes, together with a unified and fair comparison framework under imbalanced device sampling, strict latency targets, and transmit power constraints. A universal convergence analysis under various imperfections is established for evaluating the performance of FL over wireless networks. These analytical results reveal that the fundamental difference between the digital and analog communications lies in whether communication and computation are jointly designed or not. The digital scheme decouples the communication design from FL computing tasks, making it difficult to support uplink transmission from massive devices with limited bandwidth, and hence its performance is mainly communication-limited. In contrast, the analog communication allows over-the-air computation (AirComp) and achieves better spectrum utilization. However, the computation-oriented analog transmission reduces power efficiency, and its performance is sensitive to computation errors from imperfect channel state information (CSI). Furthermore, device sampling for both schemes is optimized and the differences in sampling optimization are analyzed. Numerical results verify the theoretical analysis and affirm the superior performance of the sampling optimization.

Index Terms:
Federated learning (FL), digital communication, over-the-air computation (AirComp), convergence analysis.

I Introduction

The dramatic development of data science has catalyzed significant advances in artificial intelligence (AI), which is driving innovation for anticipated sixth-generation (6G) mobile networks. The integration of AI and communication is envisioned to drive the shift from connected things to ubiquitous connected intelligence in wireless networks, supporting a large number of emerging intelligent applications [2, 3, 4, 5]. Nonetheless, traditional centralized learning paradigms depend on extensive data transmission and considerable computational resources at cloud servers, which is challenging to implement in wireless networks. To better embrace AI, edge learning (EL) is viewed as a promising distributed learning technique that harnesses massive data and computational capacity available in edge devices distributed across wireless networks [6, 7, 8]. Distinguishing it from the traditional separate design for computation and communication, EL integrates the two and achieves efficient utilization of resources and improves performance through learning task-oriented communication design.

In particular, a key EL paradigm, namely federated learning (FL), has garnered significant attention from both academic and industrial circles, primarily due to its communication-efficient and privacy-enhancing characteristics [9, 10]. In FL, distributed edge devices utilize local datasets to collaboratively train a shared learning model with the assistance of a central parameter server (PS). By exchanging model parameters instead of raw data, the PS iteratively updates the global model until convergence. The FL scheme minimizes the amount of transmitted data and helps safeguard privacy and security. Recent studies have explored the implementation of FL algorithms at the wireless edge to support emerging AI applications [11, 12, 13, 14]. However, limited communication resources have posed a significant bottleneck to the performance of wireless FL [15, 16]. One particular concern is the uplink transmission process, where numerous participating devices need to transmit local updates to the PS, leading to a substantial increase in communication overhead and transmission latency [17]. Hence, the development of efficient uplink transmission is crucial to enable wireless FL.

To support data transmission in wireless FL, digital communication schemes have been widely considered in recent works, where local updates are quantized into finite bits and then transmitted to the PS via traditional frequency division multiple access (FDMA) and time division multiple access (TDMA) schemes. At the receiver, the PS relies on channel coding for error detection and correction, before model aggregation using the received local updates. In [12] and [18], the authors characterized the impact of packet errors on the convergence of FL, which enabled a task-oriented communication resource allocation scheme. The influence of various finite-precision quantization schemes in uplink and downlink communications was considered in [19]. Building upon convergence analysis of the quantized FL, the quantization bit allocation was optimized in [20] and [21] to adapt to channel diversity and the requirements of the FL tasks. To further alleviate the communication bottleneck, a one-bit quantization technique and a reconfigurable intelligent surface (RIS) were used in [22] to reduce communication overhead and enhance communication reliability, respectively. Apart from resource allocation methods, modifications from the algorithmic perspective have been considered to combat unreliable transmissions. In [23], the authors proposed a user datagram protocol (UDP)-based robust training algorithm, which asymptotically achieved the same convergence rate as that with error-free communications. Moreover, in [24], a global model reusing scheme, namely the GoMORE scheme, was devised to replace erroneous local updates and thereby mitigate the negative impacts of packet loss. Another solution is to further squeeze the communication overhead, thus improving the convergence over resource-constrained networks. The model pruning in [25] was seen to be an effective way to compress a large-scale model into a smaller size, facilitating communication-efficient FL design.

In addition to these digital communication schemes, analog communication is an alternative communication-efficient way for deploying wireless FL. In particular, the local updates are amplitude-modulated and then simultaneously transmitted by reusing the available radio resource. Due to the superposition property of radio channels, the global model can be computed automatically over-the-air, which is therefore referred to as over-the-air computation (AirComp) [26]. Unlike the digital paradigm, analog communication pushes model aggregation from the PS to the air, which integrates the computation and communication not only functionally but also physically. Benefiting from the over-the-air aggregation, the communication latency is substantially reduced and the spectrum utilization is much more efficient, leading to fast-convergent and communication-efficient FL. It was shown in [27] that the convergence rate of centralized learning remains approachable with this analog approach without power control and beamforming. Furthermore, in [28], to combat deep fading, a novel truncated channel inversion scheme was proposed to exclude devices experiencing deep fades from the training process, avoiding excessive energy consumption. Further insights into analog aggregation schemes were also discussed in the context of fundamental trade-offs between communication and learning. Besides, the impact of over-the-air aggregation errors on the optimality gap was analyzed in [29] and [30] with power control optimization. Furthermore, the authors in [31] proposed an AirComp-based adaptive reweighting scheme for the aggregation, and jointly considered the power control and device selection design based on the derived optimality gap. To combat the additive noise, robust FL training methods were proposed in [32] for both the expectation-based and the worst-case noise models. Considering multi-antenna scenarios, the beamforming design at the receiver was optimized by solving a sparse and low-rank optimization problem in [33]. In practice, considering the lack of perfect channel state information (CSI) for accurate power control, the work [34] investigated the impact of CSI uncertainty at the transmitter on FL convergence and revealed that CSI imperfection is a key factor affecting the AirComp performance and convergence.

As mentioned above, by incorporating learning task-oriented resource allocation, both digital and analog transmissions are effective ways to fulfill the communication requirements of wireless FL [35, 36, 37]. In traditional communication for data transmission, digital communication schemes have been proven not only in theory but also in practice to outperform analog communication techniques in almost all cases of interest. In communications for computation tasks, however, analog communication has been shown to be exceptionally effective in some cases of resource-constrained networks [38]. Hence, it is of interest to comprehensively compare digital and analog transmissions for wireless FL. Several recent studies have compared the two communication paradigms from some specific perspectives, including communication latency [28, 39] and convergence performance [40, 41]. However, to the best of our knowledge, there is a lack of literature that presents a comprehensive and quantitative comparison between the two fundamental communication paradigms, especially under practical constraints. Also, there have been few attempts to elucidate the fundamental differences between digital and analog transmissions in the context of FL, which is crucial for its deployment and design.

Against this background, in this paper, we conduct a theoretical comparison between the digital and analog transmission schemes under practical constraints. The main contributions of this paper are summarized as follows.

  • We propose a unified framework for digital and analog transmissions in wireless FL, and characterize the model aggregation distortion caused by wireless transmission schemes. Using this framework, a fair comparison is conducted under a stringent transmission delay target and two types of transmit power budgets. We exploit the optimality gap, defined as the gap between the optimal and the actually achieved loss function values, to characterize the convergence behavior, and establish a stringent upper bound of the optimality gap for precise analysis and optimization in the digital/analog transmission enabled wireless FL. This bound offers a precise closed-form characterization of the influence of wireless transmission imperfections on convergence.

  • Analytical results reveal that the digital transmission struggles to achieve satisfactory performance, especially with limited radio resources, due to orthogonal access and decoupled design. In contrast, the analog scheme exhibits a performance gain in terms of the optimality gap of the order of $\frac{1}{N}$ with the increasing number of participating devices, $N$, thereby achieving a higher level of efficiency in spectrum utilization. However, the introduction of computation goals in the analog communication process results in less efficient transmit power utilization, and the presence of CSI uncertainties inevitably comes with computational distortion, thus enlarging the optimality gap by the order of $\frac{1}{\rho^{2}}$ with a decreasing level of channel estimation accuracy $\rho$.

  • Based on the derived optimality gap, we formulate an inclusion probability optimization problem for effective device sampling in wireless FL. The optimization problems for the digital and analog cases are optimally solved by checking the Karush-Kuhn-Tucker (KKT) conditions and exploiting the Dinkelbach algorithm, respectively. Through the examination of optimal solutions, we identify the essential differences underlying the device sampling optimization for digital and analog transmissions.

Extensive numerical simulations are conducted to validate the derived analytical observations and the proposed sampling optimization. In particular, it is observed that the digital scheme has better power utilization, while the analog transmission is more spectrum-efficient.

The rest of this paper is organized as follows. In Section II, we describe the typical FL algorithm, with details of digital and analog transmissions, and propose a fair comparison framework. Section III provides some preliminaries for the convergence analysis. In Section IV, we analyze the convergence performance under different transmission schemes and offer engineering insights. Then, in Section V, we optimize the inclusion probabilities for both the digital and analog schemes. Simulation results and conclusions are given in Sections VI and VII, respectively.

Notation: Boldface lowercase (uppercase) letters represent vectors (matrices). The set of all real numbers is denoted by $\mathbb{R}$. Superscripts $(\cdot)^{T}$ and $(\cdot)^{\ast}$ stand for the transpose and conjugate operations, respectively. The operator $\Re(\cdot)$ returns the real part of the input complex number. The operator $\left\|\cdot\right\|$ takes the Euclidean norm of vectors. A circularly symmetric complex Gaussian distribution is denoted by $\mathcal{CN}$, and $\mathbb{E}\{\cdot\}$ is the expectation operation.

II System Model and Communication Framework

We consider a typical wireless FL system as shown in Fig. 1, where $K$ distributed devices are coordinated by a central PS to perform FL. The training procedure and transmission model are elaborated in the sequel.

Figure 1: The architecture of a typical wireless FL system.

II-A Federated Learning Model

In FL, the distributed devices collaboratively train a shared machine learning model via local computing based on their local datasets and information exchange with the PS. Let $\mathcal{D}_{k}$ denote the local dataset owned by the $k$-th device, which contains $D_{k}=|\mathcal{D}_{k}|$ training samples. The goal of the FL algorithm is to find the optimal $d$-dimensional model parameter vector, denoted by $\mathbf{w}^{*}\in\mathbb{R}^{d\times 1}$, to minimize the global loss function $F(\mathbf{w})$, i.e.,

$$\mathbf{w}^{*}=\arg\min_{\mathbf{w}}F(\mathbf{w})=\arg\min_{\mathbf{w}}\frac{1}{D}\sum_{k=1}^{K}D_{k}F_{k}(\mathbf{w})=\arg\min_{\mathbf{w}}\sum_{k=1}^{K}\alpha_{k}F_{k}(\mathbf{w}), \tag{1}$$

where $D\triangleq\sum_{k=1}^{K}D_{k}$, $\alpha_{k}\triangleq\frac{D_{k}}{D}$ represents the aggregation weight for the $k$-th user, and $F_{k}(\mathbf{w})$ is the local loss function at device $k$ defined as

$$F_{k}(\mathbf{w})=\frac{1}{D_{k}}\sum_{\mathbf{u}\in\mathcal{D}_{k}}\mathcal{L}(\mathbf{w},\mathbf{u}), \tag{2}$$

where $\mathbf{u}$ denotes a training sample selected from $\mathcal{D}_{k}$, and $\mathcal{L}(\mathbf{w},\mathbf{u})$ represents the sample-wise loss function with respect to $\mathbf{u}$. Due to the heterogeneity of the system, we note that local datasets at distinct devices are usually non-independent and non-identically distributed (non-IID), and the optimal model parameters in (1) are not necessarily optimal for the local datasets. Let $\mathbf{w}_{k}^{*}$ denote the locally optimal model at device $k$, i.e., $\mathbf{w}_{k}^{*}=\arg\min_{\mathbf{w}}F_{k}(\mathbf{w})$. It is usually different from the globally optimal $\mathbf{w}^{*}$ unless the local dataset $\mathcal{D}_{k}$ follows the same distribution as the whole data population.

To effectively handle the optimization problem in (1), an FL algorithm performs the model training in an iterative manner. Specifically, the $m$-th round of the FL algorithm consists of the following steps.

  • 1) Model Broadcasting: The PS broadcasts the latest global model $\mathbf{w}_{m}$ to all devices.

  • 2) Local Computing: After receiving $\mathbf{w}_{m}$, each device exploits its local dataset to compute the local gradient as

    $$\mathbf{g}_{m}^{k}\triangleq\nabla F_{k}(\mathbf{w}_{m})=\frac{1}{D_{k}}\sum_{\mathbf{u}\in\mathcal{D}_{k}}\nabla\mathcal{L}(\mathbf{w}_{m},\mathbf{u}),\enspace\forall k. \tag{3}$$

  • 3) Local Update Uploading: Each device reports its local gradient to the PS.

  • 4) Model Aggregation: Upon receiving all local gradients, the PS updates the global model according to

    $$\mathbf{w}_{m+1}=\mathbf{w}_{m}-\eta\mathbf{g}_{m}, \tag{4}$$

    where $\eta$ is the learning rate and $\mathbf{g}_{m}$ is given by

    $$\mathbf{g}_{m}\triangleq\sum_{k=1}^{K}\alpha_{k}\mathbf{g}_{m}^{k}. \tag{5}$$

The above steps iterate until a convergence condition is met.
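
The four steps above can be sketched in a few lines of Python for the idealized, error-free case. This is only an illustration under stated assumptions: the squared-error sample-wise loss, the synthetic datasets, the learning rate, and the names `local_gradient` and `fl_round` are hypothetical choices, not part of the system model, and full device participation is assumed.

```python
import numpy as np

def local_gradient(w, X, y):
    """Local gradient as in (3), for an assumed loss L(w, u) = 0.5 * (x^T w - y)^2."""
    return X.T @ (X @ w - y) / len(y)

def fl_round(w, datasets, eta):
    """One error-free round: local computing (3), aggregation (5), update (4)."""
    sizes = np.array([len(y) for _, y in datasets], dtype=float)
    alpha = sizes / sizes.sum()                        # alpha_k = D_k / D
    grads = [local_gradient(w, X, y) for X, y in datasets]
    g = sum(a * gk for a, gk in zip(alpha, grads))     # g_m = sum_k alpha_k g_m^k
    return w - eta * g, g                              # w_{m+1} = w_m - eta * g_m

# Synthetic example with imbalanced dataset sizes (all values are assumptions).
rng = np.random.default_rng(0)
d = 5
w_true = np.arange(1.0, 6.0)
datasets = []
for D_k in (20, 50, 80):
    X_k = rng.normal(size=(D_k, d))
    y_k = X_k @ w_true + 0.01 * rng.normal(size=D_k)
    datasets.append((X_k, y_k))

w = np.zeros(d)
for m in range(200):                                   # iterate the rounds
    w, g = fl_round(w, datasets, eta=0.1)
print("distance to w_true:", np.linalg.norm(w - w_true))
```

Because $\alpha_{k}=D_{k}/D$, the aggregated gradient in this sketch coincides with the gradient computed on the pooled dataset, which is what makes (5) the target that the uplink transmission tries to estimate.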

Considering the potentially massive number of devices and limited resources in practice, only a subset of devices can participate in each round of the training. Let $\mathcal{S}_{m}$ denote the set of activated devices selected in the $m$-th communication round and $N=|\mathcal{S}_{m}|$ be the number of participating devices per round. Due to imbalanced dataset sizes and data heterogeneity, we assume that the PS performs non-uniform device sampling without replacement to select the participating devices per round. Specifically, the devices are randomly selected one by one from the remaining unselected device set. Once the number of selected devices reaches $N$, the sampling process terminates. Denote the inclusion probability of device $k$ as $r_{k}$, which represents the probability of device $k$ being sampled per round and satisfies $r_{k}\leq 1$, $\forall k$, and $\sum_{k=1}^{K}r_{k}=N$. Due to the non-IID nature of the data, misaligned inclusion probability may bias the global model away from the local optimum, thereby decelerating the convergence and causing performance loss. Hence, in the following sections, we focus on the performance evaluation under fixed inclusion probabilities and characterize the impact of device sampling for wireless FL.
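
To make the role of the inclusion probabilities concrete, the following Python sketch draws exactly $N$ distinct devices per round such that device $k$ is included with probability $r_{k}$. Note that it does not implement the one-by-one selection described above: it swaps in systematic sampling, a standard fixed-size design that attains prescribed inclusion probabilities exactly when $r_{k}\leq 1$ and $\sum_{k}r_{k}=N$. The function name `systematic_sample` and the example values of $r_{k}$ are assumptions made for illustration.

```python
import numpy as np

def systematic_sample(r, rng):
    """Return N distinct device indices; device k is included w.p. r[k].

    Systematic sampling (an assumed stand-in for the paper's sampler):
    lay the r[k] end to end on [0, N) and pick the devices hit by the
    points u, u+1, ..., u+N-1 with u uniform on [0, 1).
    """
    r = np.asarray(r, dtype=float)
    N = int(round(r.sum()))
    if np.any(r < 0) or np.any(r > 1) or abs(r.sum() - N) > 1e-9:
        raise ValueError("need 0 <= r_k <= 1 and sum(r) equal to an integer N")
    edges = np.cumsum(r)
    edges[-1] = N                          # guard against floating-point drift
    points = rng.uniform() + np.arange(N)  # N points spaced exactly one apart
    return np.searchsorted(edges, points, side="right")

rng = np.random.default_rng(0)
r = np.array([0.9, 0.6, 0.5, 0.4, 0.3, 0.3])   # hypothetical r_k, sum equals N = 3
trials = 20000
counts = np.zeros(len(r))
for _ in range(trials):
    counts[systematic_sample(r, rng)] += 1
freq = counts / trials
print("empirical inclusion frequencies:", np.round(freq, 3))
```

The empirical frequencies should approach the prescribed $r_{k}$ as the number of trials grows. Other fixed-size designs with the same first-order inclusion probabilities exist and differ only in their joint inclusion probabilities.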

Also, in wireless FL, the parameter transmission in Steps 1) and 3) relies on wireless communication between the PS and devices, which introduces additional imperfection into the model training procedure. Considering a sufficient power budget at the PS, the downlink transmission is usually assumed error-free [12]. In contrast, for uplink transmission with limited communication resources, additional errors are inevitable. Efficient transmission and resource allocation schemes need to be designed to alleviate the impact of the wireless environment.

II-B Uplink Transmission Method

We rely on the wireless uplink transmission to provide an estimation of the actual gradient in (5). Assume that the total uplink bandwidth $B$ can be divided into up to $M$ subbands, which supports orthogonal access for $M$ devices. Without loss of generality, a frequency non-selective block fading channel model is adopted, where the wireless channels remain unchanged within a communication round. Let $\bar{h}_{k}=d_{k}^{-\frac{\alpha}{2}}h_{k}$ be the channel between the $k$-th device and the PS, where $d_{k}$ denotes the distance between the PS and device $k$, $\alpha$ represents the large-scale path loss exponent, and $h_{k}$ represents the small-scale fading of the channel. Assume that the channels experience independent Rayleigh fading, i.e., $h_{k}\sim\mathcal{CN}(0,1)$. In practice, perfect estimation of the small-scale fading of the channel is usually not available. Let $\hat{h}_{k}$ denote the estimated channel at device $k$. Then, we model the CSI imperfection of the small-scale fading as

$$h_{k}=\rho\hat{h}_{k}+\sqrt{1-\rho^{2}}v_{k}, \tag{6}$$

where $\rho\in(0,1]$ is the correlation coefficient between $h_{k}$ and $\hat{h}_{k}$ to reflect the level of channel estimation accuracy, and $v_{k}\sim\mathcal{CN}(0,1)$ is the channel estimation error independent of $\hat{h}_{k}$. In the following, we introduce two typical uplink transmission schemes, i.e., digital and analog transmissions.
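
As a quick numerical check of the CSI model in (6), the sketch below draws estimated channels and estimation errors and forms the true small-scale fading. It assumes $\hat{h}_{k}\sim\mathcal{CN}(0,1)$, which is implied by $h_{k}\sim\mathcal{CN}(0,1)$ together with (6) but is not stated explicitly above. The helper `cn` and the values of $\rho$ and the sample size are illustrative.

```python
import numpy as np

def cn(size, rng):
    """Draw i.i.d. circularly symmetric complex Gaussian CN(0, 1) samples."""
    return (rng.normal(size=size) + 1j * rng.normal(size=size)) / np.sqrt(2)

rng = np.random.default_rng(1)
rho, n = 0.9, 200_000                  # assumed estimation accuracy and sample size

h_hat = cn(n, rng)                     # estimated channel (assumed CN(0, 1))
v = cn(n, rng)                         # estimation error, independent of h_hat
h = rho * h_hat + np.sqrt(1 - rho**2) * v   # true small-scale fading, eq. (6)

power = np.mean(np.abs(h) ** 2)              # should be close to 1
corr = np.mean(h * np.conj(h_hat)).real      # should be close to rho
print("E|h|^2 ~", power, " E[h h_hat^*] ~", corr)
```

With unit-variance $h_{k}$ and $\hat{h}_{k}$, the sample correlation recovers $\rho$, which is why $\rho=1$ corresponds to perfect CSI.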

II-B1 Digital Transmission Model

In the digital transmission, the $N$ selected devices first quantize their local updates into a finite number of $b$ bits and then simultaneously transmit the quantized local updates to the PS. Specifically, we assume that the local update $\mathbf{g}_{m}^{k}$ is quantized by the stochastic quantization method in [20]. Denote the maximum and the minimum values of the modulus among all parameters in $\mathbf{g}_{m}^{k}$ by $g_{m,\max}^{k}$ and $g_{m,\min}^{k}$, respectively. Then, the interval $[g_{m,\min}^{k},g_{m,\max}^{k}]$ is divided evenly into $2^{b}-1$ quantization intervals. The uniformly distributed knobs are denoted by $\tau_{i}=g_{m,\min}^{k}+\frac{g_{m,\max}^{k}-g_{m,\min}^{k}}{2^{b}-1}i$ for $i=0,\cdots,2^{b}-1$. Given $|x|\in[\tau_{i},\tau_{i+1})$, the quantization function $\mathcal{Q}(x)$ is expressed as

$$\mathcal{Q}(x)=\left\{\begin{array}{cc}\mathrm{sign}(x)\tau_{i}&\mathrm{w.p.}\enspace\frac{\tau_{i+1}-|x|}{\tau_{i+1}-\tau_{i}},\\ \mathrm{sign}(x)\tau_{i+1}&\mathrm{w.p.}\enspace\frac{|x|-\tau_{i}}{\tau_{i+1}-\tau_{i}},\end{array}\right. \tag{9}$$

where $\mathrm{sign}(\cdot)$ represents the signum function and “w.p.” represents “with probability.” Exploiting the quantization function in (9), the local update $\mathbf{g}_{m}^{k}$ is quantized as $\mathcal{Q}\left(\mathbf{g}_{m}^{k}\right)\triangleq\left[\mathcal{Q}\left(g_{m,1}^{k}\right),\cdots,\mathcal{Q}\left(g_{m,d}^{k}\right)\right]^{T}$, which is transmitted to the PS. Note that the exact values of $g_{m,\max}^{k}$ and $g_{m,\min}^{k}$ need to be transmitted to the PS with sufficient precision to support effective recovery. Hence, the total number of bits to be transmitted amounts to

btotal=d(b+1)+q,subscript𝑏total𝑑𝑏1𝑞\displaystyle b_{\mathrm{total}}=d(b+1)+q,italic_b start_POSTSUBSCRIPT roman_total end_POSTSUBSCRIPT = italic_d ( italic_b + 1 ) + italic_q , (10)

where q𝑞qitalic_q is the number of bits used to represent gm,maxksuperscriptsubscript𝑔𝑚𝑘g_{m,\max}^{k}italic_g start_POSTSUBSCRIPT italic_m , roman_max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT and gm,minksuperscriptsubscript𝑔𝑚𝑘g_{m,\min}^{k}italic_g start_POSTSUBSCRIPT italic_m , roman_min end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT, and the additional one bit is the sign bit.
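As a minimal sketch of the stochastic quantizer in (9) and the bit count in (10), the following Python snippet may help. It assumes $2^{b}$ uniformly spaced knobs $\tau_{i}$ between the smallest and largest gradient magnitudes; this knob placement, the function names, and the value of $q$ used in any call are illustrative assumptions rather than the exact configuration of the paper.

```python
import numpy as np

def stochastic_quantize(g, b, rng):
    """Quantize each entry of g to one of 2**b magnitude knobs, keeping its sign."""
    mag = np.abs(g)
    g_min, g_max = mag.min(), mag.max()
    if g_max == g_min:                      # degenerate range: nothing to quantize
        return g.copy()
    n_knobs = 2 ** b                        # assumed: uniformly spaced knobs
    step = (g_max - g_min) / (n_knobs - 1)
    # Index i of the lower knob tau_i with tau_i <= |x| <= tau_{i+1}
    idx = np.minimum(np.floor((mag - g_min) / step), n_knobs - 2)
    tau_lo = g_min + idx * step
    tau_hi = tau_lo + step
    # Round up w.p. (|x| - tau_i) / (tau_{i+1} - tau_i), as in (9)
    p_up = (mag - tau_lo) / step
    round_up = rng.random(g.shape) < p_up
    return np.sign(g) * np.where(round_up, tau_hi, tau_lo)

def total_bits(d, b, q):
    """Payload in (10): b magnitude bits plus one sign bit per entry, plus q range bits."""
    return d * (b + 1) + q
```

Because each magnitude is rounded up with probability proportional to its distance from the lower knob, the expected quantized value equals the input, which is the property the later unbiasedness result relies on.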

During the uplink FL parameter report, transmission errors are inevitable due to the channel dynamics and limited communication resources. Without loss of generality, we adopt the typical FDMA technique as an example. Assume that $M\geq N$, so that each device can occupy an equal share of distinct subbands and avoid interference with the others. (Footnote 1: We generally assume orthogonal access between different devices and refrain from specifying the particular multiple access design. Hence, the following analysis can be safely extended to orthogonal access scenarios like TDMA and orthogonal frequency division multiple access (OFDMA).) Then, the channel capacity of device $k$ can be evaluated as

$$C_{k}=B_{k}\log_{2}\left(1+\frac{P_{k}|\bar{h}_{k}|^{2}}{B_{k}N_{0}}\right),\tag{11}$$

where $B_{k}$ is the bandwidth allocated to device $k$, set to $\frac{B}{N}$, $P_{k}$ is the transmit power at device $k$, and $N_{0}$ is the noise power density.

The transmission delay under the digital transmission is primarily influenced by stragglers, which refer to devices with poor channel conditions. To avoid the uncontrolled severe delay brought by stragglers, we assume that all the devices transmit the local updates at a fixed rate rather than a dynamic one based on instantaneous signal-to-noise ratio (SNR) levels. Hence, the use of a fixed-rate transmission acts as a truncation mechanism for stragglers. Additionally, for devices experiencing favorable channel conditions, it is more beneficial to transmit at a lower rate with enhanced transmission reliability. The target transmission rate is denoted by $R=\frac{B}{N}\log_{2}(1+\theta)$, where $\theta$ is a chosen constant. According to [12], the transmission is assumed error-free if the transmission rate is no larger than the channel capacity. Hence, the probability of successful transmission at device $k$ is calculated as

$$p_{k}=\Pr\left\{R\leq C_{k}\right\}=\exp\left(-\frac{BN_{0}\theta}{2NP_{k}d_{k}^{-\alpha}}\right).\tag{12}$$

At the PS, a cyclic redundancy check (CRC) mechanism is applied to check the detected data such that erroneous local updates can be excluded from the model aggregation. Finally, the obtained estimate of the desired gradient in (5) is given by

$$\hat{\mathbf{g}}_{m,\text{D}}=\sum_{k=1}^{K}\frac{\chi_{k}\alpha_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k}),\tag{13}$$

where $\chi_{k}$ is an indicator variable for the device selection, and $\xi_{k,\text{D}}$ represents the distortion brought by packet loss. To be concrete, $\chi_{k}$ is 1 if $k\in\mathcal{S}_{m}$ and 0 otherwise. Considering the definition of the inclusion probability, we have $\mathbb{E}\left[\chi_{k}\right]=r_{k}\leq 1$, which shrinks the expected aggregation coefficient below the value required for unbiased gradient estimation. To compensate for the impact of partial participation, we multiply by the coefficient $\frac{1}{r_{k}}$ in (13), such that $\frac{1}{r_{k}}\mathbb{E}\left[\chi_{k}\right]=1$. Analogously, the distortion $\xi_{k,\text{D}}$ is characterized by the probability in (12) as

$$\xi_{k,\text{D}}=\begin{cases}\frac{1}{p_{k}} & \text{w.p.}\enspace p_{k},\\ 0 & \text{w.p.}\enspace 1-p_{k},\end{cases}\tag{16}$$

to ensure $\mathbb{E}\left[\xi_{k,\text{D}}\right]=1$. With the gradient estimate in (13), the global model updated at the $(m+1)$-th round is

$$\tilde{\mathbf{w}}_{m+1}=\tilde{\mathbf{w}}_{m}-\eta\hat{\mathbf{g}}_{m,\text{D}},\tag{17}$$

where $\tilde{\mathbf{w}}_{m}$ denotes the model obtained at the previous round.
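The inverse-probability weighting in (13) and (16) can be illustrated with a short simulation. The sketch below assumes that devices are sampled independently with inclusion probabilities $r_{k}$, that the target gradient in (5) is the weighted sum $\sum_{k}\alpha_{k}\mathbf{g}_{m}^{k}$, and that $\mathcal{Q}(\cdot)$ is the identity for brevity; the function and variable names are hypothetical.

```python
import numpy as np

def digital_aggregate(grads, alpha, r, p, rng):
    """One draw of the estimate in (13), with Q(.) taken as the identity.

    grads: (K, d) local updates; alpha: (K,) aggregation weights;
    r: (K,) inclusion probabilities; p: (K,) success probabilities from (12).
    """
    K = grads.shape[0]
    chi = rng.random(K) < r              # chi_k = 1 if device k is sampled (assumed independent)
    ok = rng.random(K) < p               # packet passes the CRC w.p. p_k
    xi = np.where(ok, 1.0 / p, 0.0)      # xi_{k,D} in (16), so that E[xi] = 1
    coeff = chi * alpha * xi / r         # inverse-probability weighting
    return coeff @ grads

def sgd_step(w, g_hat, eta):
    """Global model update in (17)."""
    return w - eta * g_hat
```

Averaging many draws of `digital_aggregate` should approach `alpha @ grads`, since $\mathbb{E}[\chi_{k}\xi_{k,\text{D}}/r_{k}]=1$ for every device under these assumptions.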

II-B2 Analog Transmission Model

In the analog transmission with AirComp, selected devices simultaneously upload the uncoded analog signals of local gradients to the PS by fully reusing the time-frequency resource. A weighted summation of the local updates in (5) can be achieved by exploiting channel pre-equalization and the waveform superposition nature of the wireless channel. In this study, we consider that the total bandwidth is constrained for fair comparison and all subbands are utilized for the transmission of identical parameters. This is because the uncoded nature of the analog transmission diminishes its robustness, rendering it more vulnerable to interference and even malicious attacks. (Footnote 2: The derived results directly extend to the case of dividing bandwidth for distinct parameter transmission in broadband scenarios [28].) Specifically, the received signal at the PS is expressed as

$$\mathbf{y}=\sum_{k=1}^{K}\chi_{k}\bar{h}_{k}\beta_{k}\mathbf{g}_{m}^{k}+\mathbf{z}_{m},\tag{18}$$

where $\beta_{k}$ is the pre-processing factor at device $k$, and $\mathbf{z}_{m}$ is additive white Gaussian noise following $\mathcal{CN}(\mathbf{0},BN_{0}\mathbf{I})$. To accurately estimate the desired gradient in (5), the pre-processing factor $\beta_{k}$ should be adapted to the channel coefficient $\bar{h}_{k}$. Unlike the digital transmission, CSI is needed at the transmitter for the analog transmission. Channel pre-equalization is performed based on the CSI available at each device. For simplicity, we adopt the typical truncated channel inversion scheme to combat deep fades [28]. It is expressed as

$$\beta_{k}=\begin{cases}\frac{\zeta\lambda\alpha_{k}d_{k}^{\frac{\alpha}{2}}\hat{h}_{k}^{*}}{r_{k}|\hat{h}_{k}|^{2}} & |\hat{h}_{k}|^{2}\geq\gamma_{\mathrm{th}},\\ 0 & |\hat{h}_{k}|^{2}<\gamma_{\mathrm{th}},\end{cases}\tag{21}$$

where $\gamma_{\mathrm{th}}$ is a predetermined power-cutoff threshold, $\zeta$ is a scaling factor for ensuring the transmit power constraint, and the compensation coefficient $\lambda$ is selected to alleviate the impact of imperfect CSI [34]. Through the pre-processing in (21), we aim to eliminate the influence of the uneven channel fading $\bar{h}_{k}$ and of the inclusion probability $r_{k}$, thereby ensuring unbiased gradient estimation.

At the receiver, the PS scales the real part of $\mathbf{y}$ in (18) by $\frac{1}{\zeta}$ and obtains an estimate of the actual gradient in (5). It yields

$$\hat{\mathbf{g}}_{m,\text{A}}=\sum_{k=1}^{K}\frac{\chi_{k}\alpha_{k}\xi_{k,\text{A}}}{r_{k}}\mathbf{g}_{m}^{k}+\bar{\mathbf{z}}_{m},\tag{22}$$

where $\bar{\mathbf{z}}_{m}\triangleq\frac{\Re\{\mathbf{z}_{m}\}}{\zeta}$ is the equivalent noise, and $\xi_{k,\text{A}}$ denotes the distortion brought by the analog transmission with imperfect CSI. It follows

$$\xi_{k,\text{A}}=\begin{cases}\frac{\lambda\Re\{h_{k}^{*}\hat{h}_{k}\}}{|\hat{h}_{k}|^{2}} & \text{w.p.}\enspace\mathrm{e}^{-\gamma_{\mathrm{th}}},\\ 0 & \text{w.p.}\enspace 1-\mathrm{e}^{-\gamma_{\mathrm{th}}}.\end{cases}\tag{25}$$

Similarly, the global model at the $(m+1)$-th round under the analog transmission is updated as

$$\tilde{\mathbf{w}}_{m+1}=\tilde{\mathbf{w}}_{m}-\eta\hat{\mathbf{g}}_{m,\text{A}}.\tag{26}$$
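To make the chain from (18) to (22) concrete, the following sketch simulates one AirComp round. It rests on several simplifying assumptions that are not taken from the paper: perfect CSI ($\hat{h}_{k}=h_{k}$), small-scale fading $h_{k}\sim\mathcal{CN}(0,1)$, a composite channel $\bar{h}_{k}=d_{k}^{-\alpha/2}h_{k}$, independent device sampling, and $\lambda=\mathrm{e}^{\gamma_{\mathrm{th}}}$ so that $\mathbb{E}[\xi_{k,\text{A}}]=1$ in this perfect-CSI case. All function and parameter names are hypothetical.

```python
import numpy as np

def aircomp_round(grads, alpha, r, dist, pl_exp, gamma_th, zeta, noise_var, rng):
    """One draw of the analog estimate in (22) under perfect CSI (assumed)."""
    K, d = grads.shape
    chi = rng.random(K) < r                               # device sampling indicators
    # Small-scale fading h_k ~ CN(0, 1), hence |h_k|^2 ~ Exp(1)
    h = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
    h_bar = dist ** (-pl_exp / 2) * h                     # assumed composite channel
    h_hat = h                                             # perfect CSI assumption
    lam = np.exp(gamma_th)                                # compensates the truncation
    gain2 = np.abs(h_hat) ** 2
    # Truncated channel inversion in (21)
    beta = np.where(
        gain2 >= gamma_th,
        zeta * lam * alpha * dist ** (pl_exp / 2) * np.conj(h_hat) / (r * gain2),
        0.0,
    )
    # Receiver noise z ~ CN(0, noise_var * I), with noise_var playing the role of B*N0
    z = np.sqrt(noise_var / 2) * (rng.standard_normal(d) + 1j * rng.standard_normal(d))
    y = (chi * h_bar * beta) @ grads + z                  # superposed signal in (18)
    return y.real / zeta                                  # PS scaling, giving (22)
```

Under these assumptions the average of many rounds should approach `alpha @ grads`, while a smaller `zeta` (less transmit power) inflates the equivalent noise $\bar{\mathbf{z}}_{m}$.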

II-C A Unified Framework for Wireless FL Comparison

To minimize the optimality gap brought by imperfect uplink transmission, the overall FL task-oriented optimization over the wireless networks can be formulated as

$$\begin{aligned}\text{minimize}\quad & \mathbb{E}\left[F(\mathbf{w}_{m+1})\right]-F(\mathbf{w}^{*})\\ \text{subject to}\quad & \mathrm{C}_{1}:T\leq T_{\max},\\ & \mathrm{C}_{2}:P_{k}\leq P_{\max},\enspace\forall k,\end{aligned}\tag{27}$$

where the expectation is taken over channel dynamics, $T$ represents the uplink transmission delay per round, and $T_{\max}$ and $P_{\max}$ denote the maximum transmission delay target and the maximum transmit power, respectively. Constraints $\mathrm{C}_{1}$ and $\mathrm{C}_{2}$ respectively represent the maximum transmission delay and maximum transmit power constraints in practice. Apart from the maximum power budget, another typical transmit power constraint is the average power budget [28], i.e.,

$$\bar{\mathrm{C}}_{2}:\mathbb{E}[P_{k}]\leq P_{\mathrm{ave}},\enspace\forall k,\tag{28}$$

where $P_{\mathrm{ave}}$ denotes the average power budget and limits the energy consumption during the uplink transmission process.

TABLE I: Main Differences Between the Two Paradigms

| Paradigm | Gradient estimation | Transmission delay | Power budget |
|---|---|---|---|
| Digital | (13) | (29) | (31) |
| Analog | (22) | (32) | (33), (34) |

For a fair comparison between the two transmission paradigms, we measure the achievable objective value of problem (27) under the same transmission delay target and transmit power budget. The specific constraints for the two paradigms are detailed below and summarized in Table I.

For digital transmission, the transmission delay per communication round is calculated as

$$T_{\text{D}}=\frac{b_{\mathrm{total}}}{R}=\frac{Nd(b+1)}{B\log_{2}(1+\theta)},\tag{29}$$

where the second equality holds for a sufficiently large model size $d$, for which the $q$ overhead bits in (10) are negligible. Hence, constraint $\mathrm{C}_{1}$ is reformulated as

$$\frac{Nd(b+1)}{B\log_{2}(1+\theta)}\leq T_{\max}\Rightarrow\theta\geq 2^{\frac{Nd(b+1)}{BT_{\max}}}-1.\tag{30}$$

For constraint $\mathrm{C}_{2}$, since the digital transmission over orthogonal subbands is interference-free, full power transmission is optimal and hence the constraint is reformulated as

$$P_{k}=P_{\max},\enspace\forall k.\tag{31}$$

Also, with the average transmit power budget, we assume invariant transmit power over different communication rounds and have $P_{k}=P_{\mathrm{ave}},\enspace\forall k$.

For analog transmission, according to [39, Eq. (16)], the per-round delay follows

$$T_{\text{A}}=\frac{dM}{B},\tag{32}$$

which is a constant. For feasibility, we assume that the target $T_{\max}$ cannot be smaller than $T_{\text{A}}$. The maximum power constraint $\mathrm{C}_{2}$ is rewritten as

$$\max_{m,k}\left\{\left\|\beta_{k}\mathbf{g}_{m}^{k}\right\|^{2}\right\}\leq P_{\max},\tag{33}$$

for the analog transmission. Unlike the digital transmission, it is impossible to fully utilize the maximum power in analog transmission due to the need for channel pre-equalization. On the other hand, the average power constraint $\bar{\mathrm{C}}_{2}$ follows

$$\mathbb{E}\left[\left\|\beta_{k}\mathbf{g}_{m}^{k}\right\|^{2}\right]\leq P_{\mathrm{ave}},\tag{34}$$

where the expectation is taken over the wireless channel dynamics and different communication rounds.
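The delay constraints above can be evaluated numerically. The sketch below implements (29), (30), (32), and the success probability in (12) as plain functions; the function names and every numerical value in the usage lines are hypothetical and chosen only for illustration.

```python
import math

def min_sinr_threshold(N, d, b, B, T_max):
    """Smallest theta satisfying the delay constraint in (30)."""
    return 2.0 ** (N * d * (b + 1) / (B * T_max)) - 1.0

def digital_delay(N, d, b, B, theta):
    """Per-round delay T_D in (29), neglecting the q overhead bits."""
    return N * d * (b + 1) / (B * math.log2(1.0 + theta))

def success_prob(theta, N, B, N0, P, dist, pl_exp):
    """Probability of successful transmission p_k in (12)."""
    return math.exp(-B * N0 * theta / (2.0 * N * P * dist ** (-pl_exp)))

def analog_delay(d, M, B):
    """Per-round delay T_A in (32)."""
    return d * M / B

# Hypothetical setting: 10^4 parameters, 3-bit quantization, 1 MHz, 0.5 s target
for N in (10, 20):
    theta = min_sinr_threshold(N, 10_000, 3, 1e6, 0.5)
    p = success_prob(theta, N, 1e6, 1e-9, 0.1, 10.0, 3.0)
    print(N, theta, p)
```

With the delay target held fixed, admitting more devices shrinks each subband, so the required threshold $\theta$ in (30) grows exponentially in $N$ and the success probability in (12) drops. The analog delay in (32) does not depend on the number of participating devices.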

III Preliminaries

To pave the way for performance analysis, this section provides necessary assumptions and lemmas about the learning algorithms and the transmission paradigms, which will be useful in the next section.

III-A Assumptions for Learning Algorithms

To begin with, we make several common assumptions on the loss functions, which are widely used in FL studies like [12, 29, 42].

Assumption 1: The local loss functions $F_{k}(\cdot)$ are $\mu$-strongly convex for all devices, that is

$$F_{k}(\mathbf{w})\geq F_{k}(\mathbf{v})+\nabla F_{k}(\mathbf{v})^{T}(\mathbf{w}-\mathbf{v})+\frac{\mu}{2}\|\mathbf{w}-\mathbf{v}\|^{2}.\tag{35}$$

Assumption 2: The local loss functions $F_{k}(\cdot)$ are differentiable and have $L$-Lipschitz gradients, i.e.,

$$\|\nabla F_{k}(\mathbf{w})-\nabla F_{k}(\mathbf{v})\|\leq L\|\mathbf{w}-\mathbf{v}\|,\tag{36}$$

and it is equivalent to

$$F_{k}(\mathbf{w})\leq F_{k}(\mathbf{v})+\nabla F_{k}(\mathbf{v})^{T}(\mathbf{w}-\mathbf{v})+\frac{L}{2}\|\mathbf{w}-\mathbf{v}\|^{2}.\tag{37}$$

Assumption 3: In most practical applications, it is safe to assume that the norm of the sample-wise gradient is always upper bounded by a finite constant $\gamma$, i.e.,

$$\left\|\nabla\mathcal{L}(\mathbf{w},\mathbf{u})\right\|\leq\gamma.\tag{38}$$

Assumption 4: The distance between the locally optimal model, $\mathbf{w}_{k}^{*}$, and the globally optimal model, $\mathbf{w}^{*}$, is uniformly bounded by a finite constant $\delta$, i.e.,

$$\|\mathbf{w}_{k}^{*}-\mathbf{w}^{*}\|\leq\delta.\tag{39}$$

III-B Preliminary Lemmas

We present lemmas regarding the strong convexity and Lipschitz smooth properties of the global loss function.

Lemma 1:

With $\mu$-strongly convex and $L$-smooth local loss functions, the global loss function $F(\cdot)$ is also $\mu$-strongly convex and $L$-smooth.

Proof:

Recalling the definition of $F(\cdot)$ in Section II-A, with Assumptions 1-2, it is easily verified that any linear combination of $\mu$-strongly convex and $L$-smooth local loss functions with nonnegative weights that sum to one also satisfies (35) and (37). This completes the proof. $\square$
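Lemma 1 can be checked numerically on a toy example. The sketch below assumes quadratic local losses $F_{k}(\mathbf{w})=\frac{1}{2}\mathbf{w}^{T}\mathbf{A}_{k}\mathbf{w}$, for which $\mu$-strong convexity and $L$-smoothness are equivalent to the eigenvalues of $\mathbf{A}_{k}$ lying in $[\mu,L]$, and a global loss formed with nonnegative weights that sum to one. The quadratic model and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim = 4, 3
mu, L = 0.5, 4.0

def random_quadratic(rng):
    """Return a symmetric A with eigenvalues in [mu, L] for F_k(w) = 0.5 * w^T A w."""
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))   # random orthogonal basis
    eig = rng.uniform(mu, L, dim)
    return Q @ np.diag(eig) @ Q.T

A_local = [random_quadratic(rng) for _ in range(K)]
alpha = rng.dirichlet(np.ones(K))             # nonnegative weights summing to one (assumed)
A_global = sum(a * A for a, A in zip(alpha, A_local))

# Hessian eigenvalues of the global quadratic loss
eig_global = np.linalg.eigvalsh(A_global)
print(eig_global.min(), eig_global.max())
```

The printed extreme eigenvalues should stay within $[\mu,L]$, matching the claim that the weighted global loss inherits both constants.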

We then provide the following lemma regarding the imperfection in digital and analog transmission paradigms.

Lemma 2:

Under the stochastic quantization and the proposed digital aggregation in (13), $\hat{\mathbf{g}}_{m,\text{D}}$ is an unbiased estimate of the actual gradient in (5). For the considered analog paradigm in (22), by choosing $\lambda=\frac{\mathrm{e}^{\gamma_{\mathrm{th}}}}{\rho}$, the gradient estimate $\hat{\mathbf{g}}_{m,\text{A}}$ is also unbiased.

Proof:

Please refer to Appendix A. $\square$

Although both the digital and analog transmissions achieve unbiased gradient estimation, there are fundamental differences in the distortion between the two paradigms. For the digital transmission, the distortion mainly lies in the gradients themselves, i.e., gradient quantization errors. On the other hand, due to the integration of communication and computation in AirComp, the analog transmission additionally suffers from distortion in the aggregation coefficients, i.e., computation error, which is due to the CSI imperfection. This essential difference further separates the performance of the digital and analog transmissions, as elaborated in the next section.

IV Comparison with Convergence Analysis

In this section, we analyze the convergence performance under the digital and analog transmissions with the practical constraints for wireless FL. Based on the derived results, we further conduct quantitative comparisons between the two paradigms from various perspectives.

IV-A Convergence under the Maximum Power Budget

We characterize the convergence performance under different transmission paradigms in the following theorems.

$$\mathbb{E}\left[F(\tilde{\mathbf{w}}_{m+1})\right]-F(\mathbf{w}^{*})\leq\frac{L}{2}\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{D}}(\mathbf{r},b)\right)^{m+1}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{0}-\mathbf{w}^{*}\right\|^{2}\right]+\frac{\eta(L\phi(b)+2L^{3}\delta^{2})g_{\text{D}}(\mathbf{r},b)}{2\mu-4\eta L^{2}g_{\text{D}}(\mathbf{r},b)}, \tag{40}$$

$$\mathbb{E}\left[F(\tilde{\mathbf{w}}_{m+1})\right]-F(\mathbf{w}^{*})\leq\frac{L}{2}\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\right)^{m+1}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{0}-\mathbf{w}^{*}\right\|^{2}\right]+\frac{\eta\left(L\varphi(\mathbf{r},\gamma_{\mathrm{th}})+2L^{3}\delta^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\right)}{2\mu-4\eta L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})}. \tag{41}$$
Theorem 1 (Digital Transmission):

For a fixed learning rate satisfying $\eta\leq\frac{\mu}{2L^{2}g_{\text{D}}(\mathbf{r},b)}$, the optimality gap of the distributed gradient update in the $(m+1)$-th iteration of the digital transmission is upper bounded as in (40), where $\phi(b)$ is a constant defined in Appendix B regarding the quantization errors, $\mathbf{r}\triangleq[r_{1},\cdots,r_{K}]^{T}$, and $g_{\text{D}}(\mathbf{r},b)\triangleq\sum_{k=1}^{K}\frac{\alpha_{k}}{p_{k}r_{k}}$.

Proof:

Please refer to Appendix B. $\square$
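As a rough numerical illustration of Theorem 1, the following Python sketch evaluates the virtual sum weight $g_{\text{D}}$ and the right-hand side of (40). The function names and all parameter values are hypothetical and chosen only for illustration; $\phi(b)$ is passed in as a plain number because its definition is given in Appendix B. The sketch assumes a strict inequality on $\eta$ so that the denominator of the second term stays positive.

```python
def virtual_sum_weight_digital(alpha, r, p):
    """g_D(r, b) = sum_k alpha_k / (p_k * r_k)."""
    return sum(a / (pk * rk) for a, rk, pk in zip(alpha, r, p))


def digital_gap_bound(m, eta, mu, L, delta, phi_b, g_d, init_dist_sq):
    """Right-hand side of (40) for the (m + 1)-th iteration.

    init_dist_sq stands for E[||w_0 - w*||^2]; phi_b stands for phi(b).
    """
    if not 0 < eta < mu / (2 * L**2 * g_d):
        raise ValueError("need 0 < eta < mu / (2 L^2 g_D)")
    contraction = 1 - eta * mu + 2 * eta**2 * L**2 * g_d
    transient = 0.5 * L * contraction ** (m + 1) * init_dist_sq
    floor = (eta * (L * phi_b + 2 * L**3 * delta**2) * g_d
             / (2 * mu - 4 * eta * L**2 * g_d))
    return transient + floor


# Hypothetical setting: three devices, half of them sampled, 20% outage.
alpha = [0.5, 0.3, 0.2]
g_d = virtual_sum_weight_digital(alpha, r=[0.5] * 3, p=[0.8] * 3)
for m in (0, 100, 1000):
    print(m, digital_gap_bound(m, eta=0.01, mu=1.0, L=2.0, delta=0.1,
                               phi_b=0.05, g_d=g_d, init_dist_sq=10.0))
```

In this sketch the first term shrinks geometrically with $m$, while the second term is a floor that grows with $g_{\text{D}}$, which is the behavior the theorem describes.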

Theorem 2 (Analog Transmission):

For a fixed learning rate satisfying $\eta\leq\frac{\mu}{2L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})}$, the optimality gap of the distributed gradient update in the $(m+1)$-th iteration of the analog transmission is upper bounded as in (41), where $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\triangleq\sum_{k=1}^{K}\frac{\alpha_{k}}{r_{k}}\left(e^{\gamma_{\mathrm{th}}}+\frac{(1-\rho^{2})\mathrm{E}_{1}\left(\gamma_{\mathrm{th}}\right)e^{2\gamma_{\mathrm{th}}}}{2\rho^{2}}\right)-1$, $\mathrm{E}_{1}(x)\triangleq\int_{x}^{\infty}\frac{e^{-t}}{t}\mathrm{d}t$, and $\varphi(\mathbf{r},\gamma_{\mathrm{th}})\triangleq\frac{BN_{0}\gamma^{2}e^{2\gamma_{\mathrm{th}}}}{2P_{\max}\rho^{2}\gamma_{\mathrm{th}}}\max_{k}\left\{\frac{\alpha_{k}^{2}}{r_{k}^{2}}d_{k}^{\alpha}\right\}$.

Proof:

Please refer to Appendix C. $\square$
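To make the analog virtual sum weight in Theorem 2 more concrete, the following Python sketch evaluates $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$ with the standard library only. The exponential integral $\mathrm{E}_{1}$ is approximated here by its power series, which is an implementation choice of this sketch and is adequate only for moderate arguments; a library routine such as SciPy's `scipy.special.exp1` could be used instead. All function names and numerical values are hypothetical, and the noise term $\varphi$ is left out because it depends on quantities defined outside this section.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x, terms=200):
    """E_1(x) from the series -gamma - ln x - sum_k (-x)^k / (k * k!).

    Intended for moderate x (roughly 0 < x <= 10); accuracy degrades beyond that.
    """
    if x <= 0:
        raise ValueError("E_1 is evaluated here for x > 0 only")
    total, term = 0.0, 1.0
    for k in range(1, terms + 1):
        term *= -x / k              # term now equals (-x)^k / k!
        total += term / k
    return -EULER_GAMMA - math.log(x) - total


def truncation_factor(gamma_th, rho):
    """Bracketed factor in g_A: e^g + (1 - rho^2) E_1(g) e^{2g} / (2 rho^2)."""
    return (math.exp(gamma_th)
            + (1 - rho**2) * exp_integral_e1(gamma_th) * math.exp(2 * gamma_th)
            / (2 * rho**2))


def virtual_sum_weight_analog(alpha, r, gamma_th, rho):
    """g_A(r, gamma_th) = sum_k (alpha_k / r_k) * factor - 1."""
    factor = truncation_factor(gamma_th, rho)
    return sum(a / rk for a, rk in zip(alpha, r)) * factor - 1


# Hypothetical setting: K = 4 equal-size devices, N = 2 sampled per round.
for rho in (1.0, 0.95, 0.8):
    print(rho, virtual_sum_weight_analog([0.25] * 4, [0.5] * 4, 0.1, rho))
```

With perfect CSI ($\rho=1$) the second term of the factor disappears, and the printed weight increases as $\rho$ decreases.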

From Theorems 1-2, we find that the convergence rate mainly depends on the choice of the learning rate $\eta$, while the imperfections in transmission also have an impact. We make the following immediate observations on the convergence rate.

Remark 1:

As observed in (40) and (41), the convergence performance of an FL algorithm is negatively related to $g_{\text{D}}(\mathbf{r},b)$ for digital transmission and to $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$ for analog transmission. We refer to $g_{\text{D}}(\mathbf{r},b)$ and $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$ as the virtual sum weights for the digital and analog transmissions, respectively, which reflect the degree to which unequal sampling and vulnerable wireless communication hinder the convergence. In the ideal case, with full device participation and no transmission outage, the virtual sum weight equals 1; otherwise, it is amplified by the imperfections. It is interesting to note that, for devices with more data samples, i.e., larger $\alpha_{k}$, the impact of imperfections is amplified.

Remark 2:

Comparing $g_{\text{D}}(\mathbf{r},b)$ and $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$, it can be seen that the vulnerability of digital transmission introduces additional heterogeneity, i.e., varying $p_{k}$, which does not exist in the analog paradigm. This is because the outage probability in the digital case is determined by the channel conditions and varies across devices. On the other hand, due to the uniform truncation threshold, all participating devices have the same truncation probability in the analog transmission. Hence, in the design of the inclusion probabilities $\mathbf{r}$ for the digital case, we need to adapt them to both the dataset size and the channel condition. By contrast, in the case of analog transmission, only the heterogeneity of the dataset size needs to be considered.

According to Theorems 1-2, we are ready to derive the optimality gap after convergence for further evaluation in the following corollary, which reflects the ultimately achievable performance of the wireless FL.

Corollary 1:

With sufficient iterations, the optimality gap achieved by digital and analog transmissions, respectively, converges to

$$G_{\text{D}}=\frac{\eta(L\phi(b)+2L^{3}\delta^{2})g_{\text{D}}(\mathbf{r},b)}{2\mu-4\eta L^{2}g_{\text{D}}(\mathbf{r},b)}, \tag{42}$$

$$G_{\text{A}}=\frac{\eta\left(L\varphi(\mathbf{r},\gamma_{\mathrm{th}})+2L^{3}\delta^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\right)}{2\mu-4\eta L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})}. \tag{43}$$

Proof:

Consider the digital transmission scenario with a sufficient number of iterations. We have

$$\begin{aligned}
\lim_{m\to\infty}\mathbb{E}\left[F(\tilde{\mathbf{w}}_{m+1})\right]-F(\mathbf{w}^{*})
&\leq\lim_{m\to\infty}\frac{L}{2}\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{D}}(\mathbf{r},b)\right)^{m+1}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{0}-\mathbf{w}^{*}\right\|^{2}\right]\\
&\quad+\frac{\eta(L\phi(b)+2L^{3}\delta^{2})g_{\text{D}}(\mathbf{r},b)}{2\mu-4\eta L^{2}g_{\text{D}}(\mathbf{r},b)}\\
&\overset{\text{(a)}}{=}\frac{\eta(L\phi(b)+2L^{3}\delta^{2})g_{\text{D}}(\mathbf{r},b)}{2\mu-4\eta L^{2}g_{\text{D}}(\mathbf{r},b)}=G_{\text{D}},
\end{aligned} \tag{44}$$

where the inequality follows from Theorem 1 and the equality in (a) holds because $\eta<\frac{\mu}{2L^{2}g_{\text{D}}(\mathbf{r},b)}$, i.e., $\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{D}}(\mathbf{r},b)\right)<1$, so that the first term vanishes as $m\to\infty$. Hence, the achieved optimality gap at convergence is bounded by $G_{\text{D}}$. For the analog transmission, the proof is almost the same and is omitted here for simplicity. $\square$
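The phrase "sufficient iterations" in Corollary 1 can be quantified with a short calculation. Writing $q$ for the contraction factor $1-\eta\mu+2\eta^{2}L^{2}g$, the transient term of the bound drops below a tolerance once $(m+1)\geq\ln(\text{tol}/(\tfrac{L}{2}\mathbb{E}[\|\tilde{\mathbf{w}}_{0}-\mathbf{w}^{*}\|^{2}]))/\ln q$. The Python sketch below implements this; the function name, the tolerance, and all numerical values are hypothetical and serve only as an illustration.

```python
import math


def iterations_until_transient_below(tol, eta, mu, L, g, init_dist_sq):
    """Smallest n = m + 1 such that (L / 2) * q**n * init_dist_sq <= tol.

    g is the virtual sum weight (g_D or g_A); init_dist_sq stands for
    E[||w_0 - w*||^2]. Requires the contraction factor q to lie in (0, 1).
    """
    q = 1 - eta * mu + 2 * eta**2 * L**2 * g
    if not 0 < q < 1:
        raise ValueError("contraction factor must lie in (0, 1)")
    start = 0.5 * L * init_dist_sq
    if start <= tol:
        return 0
    # log(q) is negative, so dividing flips the inequality direction.
    return math.ceil(math.log(tol / start) / math.log(q))


# Hypothetical values: a larger virtual sum weight slows the decay.
for g in (1.0, 2.5, 4.0):
    print(g, iterations_until_transient_below(1e-3, 0.01, 1.0, 2.0, g, 10.0))
```

After that many iterations the bound is within the chosen tolerance of the floor $G_{\text{D}}$ or $G_{\text{A}}$.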

Based on Corollary 1, we further compare the two paradigms from the following perspectives and draw remarks that are instructive for the deployment of FL in wireless networks. As a summary, we list the main comparison results in Table II. For simplicity of analysis and without loss of generality, we ignore the imbalance of the datasets and assume uniform inclusion probabilities, i.e., $\alpha_{k}=\frac{1}{K}$ and $r_{k}=\frac{N}{K}$, $\forall k$, which does not cause any essential changes. We also set $T_{\max}=T_{\text{A}}$. Note that the learning rate is assumed to be sufficiently small and hence the convergence rate remains the same for all cases.

TABLE II: Main Comparison Results with Respect to Optimality Gap

| Paradigm | Transmit power budget $P$ (low SNR) | Transmit power budget $P$ (high SNR) | Device number $N$ | Imperfect CSI $\rho$ |
|---|---|---|---|---|
| Digital | $\mathcal{O}\left(\exp\left(\frac{\varepsilon}{P}\right)\right)$ $\searrow$ | $\rightarrow G_{\text{D}}^{\infty}$ | $\mathcal{O}\left(\frac{1}{N}\exp(\varepsilon_{1}2^{\varepsilon_{2}N}/N)\right)$ $\nearrow$ | / |
| Analog | $\mathcal{O}\left(\frac{1}{P}\right)$ $\searrow$ | $\rightarrow G_{\text{A}}^{\infty}$ | $\mathcal{O}\left(\frac{1}{N}\right)$ $\searrow$ | $\mathcal{O}\left(\frac{1}{\rho^{2}}\right)$ $\nearrow$ |

* The upward arrow indicates amplification at a certain order, while the downward arrow has the opposite meaning. The horizontal arrow indicates that the quantity ultimately tends towards a fixed value.

IV-A1 Impact of Transmit Power

At low SNR levels, the achievable optimality gap under the digital transmission, $G_{\text{D}}$, decays as $\mathcal{O}\left(\exp\left({\varepsilon}/{P_{\max}}\right)\right)$ with the maximum transmit power budget $P_{\max}$, where $\varepsilon\triangleq\max_{k}\left\{\frac{BN_{0}\theta}{2Nd_{k}^{-\alpha}}\right\}$. In the high-SNR regime, i.e., $P_{\max}\to\infty$, the successful transmission probability $p_{k}\to 1$, $\forall k$, and $G_{\text{D}}$ tends to

$$G_{\text{D}}^{\infty}\triangleq\lim_{P_{\max}\to\infty}G_{\text{D}}=\frac{\eta(L\phi(b)+2L^{3}\delta^{2})K}{2\mu N-4\eta L^{2}K}. \tag{45}$$

On the other hand, $G_{\text{A}}$ decays at a rate of $\mathcal{O}\left({1}/{P_{\max}}\right)$ at low SNR, and its high-SNR limiting value is

$$G_{\text{A}}^{\infty}\triangleq\lim_{P_{\max}\to\infty}G_{\text{A}}=\frac{2\eta L^{3}\delta^{2}\left(Kc-N\right)}{2\mu N-4\eta L^{2}\left(Kc-N\right)}, \tag{46}$$

where $c\triangleq e^{\gamma_{\mathrm{th}}}+\frac{(1-\rho^{2})\mathrm{E}_{1}\left(\gamma_{\mathrm{th}}\right)e^{2\gamma_{\mathrm{th}}}}{2\rho^{2}}$.
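The two high-SNR limits in (45) and (46) follow from (42) and (43) by substituting $\alpha_{k}=1/K$, $r_{k}=N/K$, $p_{k}\to 1$, and a vanishing noise term, so that $g_{\text{D}}\to K/N$ and $g_{\text{A}}\to (Kc-N)/N$. The Python sketch below evaluates both limits; the function names and parameter values are hypothetical, and $c$ and $\phi(b)$ are passed in as plain numbers.

```python
def high_snr_gap_digital(eta, mu, L, delta, phi_b, K, N):
    """G_D^infty from (45); phi_b stands for phi(b)."""
    denom = 2 * mu * N - 4 * eta * L**2 * K
    if denom <= 0:
        raise ValueError("learning rate too large for this (K, N)")
    return eta * (L * phi_b + 2 * L**3 * delta**2) * K / denom


def high_snr_gap_analog(eta, mu, L, delta, K, N, c):
    """G_A^infty from (46); c is the truncation/CSI factor defined after (46)."""
    excess = K * c - N
    denom = 2 * mu * N - 4 * eta * L**2 * excess
    if denom <= 0:
        raise ValueError("learning rate too large for this (K, N, c)")
    return 2 * eta * L**3 * delta**2 * excess / denom


# Hypothetical comparison: K = 20 devices, N = 10 sampled per round.
print(high_snr_gap_digital(0.01, 1.0, 2.0, 0.1, 0.05, K=20, N=10))
for c in (1.0, 1.2, 1.5):
    print(c, high_snr_gap_analog(0.01, 1.0, 2.0, 0.1, K=20, N=10, c=c))
```

In this sketch the digital floor keeps a contribution from `phi_b` even at infinite power, whereas the analog floor depends on power-independent quantities only through `c`.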

Remark 3:

As the SNR increases, the optimality gap for the analog case mainly comes from the non-IID datasets, while the impact of the noise asymptotically diminishes. For the digital case, however, quantization errors impose an additional impact. Under the analog transmission, the negative impact of non-IID datasets is enlarged due to imperfect AirComp. Imperfect CSI results in mismatched channel inversion in AirComp, rendering perfect computation of the weighted sum impossible. Moreover, the performance degradation caused by imperfect CSI in the analog transmission cannot be mitigated by occupying more resources. Conversely, in the digital transmission, the convergence performance can be improved by occupying additional resources to increase the number of quantization bits.

IV-A2 Impact of Device Number

As the number of participating devices, $N$, increases, the virtual sum weight for the analog transmission, $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$, decreases at a rate of $\frac{1}{N}$, i.e., a faster convergence rate is achieved. As for the optimality gap, the impact of non-IID datasets asymptotically dominates $G_{\text{A}}$ and the decay rate is equal to $\mathcal{O}(1/N)$. Due to the involvement of more devices, a more accurate global gradient is obtained at the PS, which in turn facilitates the FL convergence and leads to better performance. Meanwhile, since different devices involved in the AirComp share the same time-frequency resource, an increase in access devices causes no deterioration of the AirComp performance, so the performance gain from more participating devices is fully captured.

On the other hand, for the digital case, the convergence performance does not necessarily change monotonically with $N$. Although more participating devices do bring performance gains, they also significantly degrade the transmission performance, since the limited communication resources are divided among more users. The convergence performance is therefore a tradeoff between communication reliability and computation accuracy for wireless FL. Specifically, the optimality gap, $G_{\text{D}}$, grows at a rate of $\mathcal{O}\left(\frac{1}{N}\exp(\varepsilon_{1}2^{\varepsilon_{2}N}/N)\right)$ for sufficiently large $N$, where $\varepsilon_{1}=\frac{BN_{0}}{2Pd_{K}^{-\alpha}}$ and $\varepsilon_{2}=\frac{b+1}{M}$.
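The non-monotonic dependence on $N$ in the digital case can be reproduced with a small numerical sweep. The Python sketch below assumes the uniform setting $\alpha_{k}=1/K$, $r_{k}=N/K$ and the outage model that can be read off from (47), i.e., $1/p_{k}=\exp\left(\frac{BN_{0}\theta}{2NP_{k}d_{k}^{-\alpha}}\right)$ with $\theta=2^{\frac{Nd(b+1)}{BT_{\max}}}-1$. The normalized inputs `snr` and `load`, the function names, and every numerical value are hypothetical and chosen only to keep the example self-contained.

```python
import math


def safe_exp(x):
    """exp(x) that saturates to infinity instead of raising OverflowError."""
    return math.exp(x) if x < 700.0 else math.inf


def digital_weight_uniform(n_sel, snr, load):
    """g_D for alpha_k = 1/K and r_k = N/K, i.e., (1/N) * sum_k 1/p_k.

    snr[k] stands for P_k d_k^{-alpha} / (B N_0);
    load stands for d (b + 1) / (B T_max).
    """
    theta = 2.0 ** (n_sel * load) - 1.0      # threshold grows as bandwidth is shared
    inv_success = [safe_exp(theta / (2.0 * n_sel * s)) for s in snr]
    return sum(inv_success) / n_sel


snr = [1000.0] * 20      # hypothetical: K = 20 devices at 30 dB each
load = 0.9               # hypothetical per-device spectral load
weights = {n: digital_weight_uniform(n, snr, load) for n in range(1, len(snr) + 1)}
best_n = min(weights, key=weights.get)
print("best N:", best_n, "g_D:", weights[best_n])
```

For small $\eta$, $G_{\text{D}}$ is approximately proportional to $g_{\text{D}}$, so the minimizer of this sweep is a reasonable proxy for the $N$ that balances the two effects under these assumptions.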

Remark 4:

Benefiting from the characteristics of AirComp, more participating devices in the analog transmission always lead to performance improvement regardless of other parameters. Hence, allowing all active devices to participate in the FL training is the best choice for analog transmission. By contrast, in the digital transmission, it is necessary to seek a balance between the transmission performance and the diversity gain through an optimization of $N$.

IV-A3 Impact of Imperfect CSI

The imperfect CSI at the transmitter only affects the performance of analog transmission, which deteriorates at the order of $\frac{1}{\rho^{2}}$. Due to imperfect CSI, the aggregation computation and the truncation decision in AirComp are contaminated, thus leading to a mismatch in the model aggregation and to noise amplification.

Remark 5:

After incorporating computation capabilities into the analog case, computation error emerges as a new source of error, which makes computational accuracy a crucial factor affecting the convergence performance. It is concluded that CSI is a key factor affecting the performance gain brought by AirComp. Moreover, the truncation threshold $\gamma_{\mathrm{th}}$ should be optimized to adapt to different levels of channel estimation accuracy. This optimization can be solved effectively via the bisection search in [34].

IV-A4 Impact of the Number of Quantization Bits

In the digital transmission, the number of quantization bits, $b$, also influences the FL performance implicitly. By selecting the minimum feasible $\theta=2^{\frac{Nd(b+1)}{BT_{\max}}}-1$ in (30), the achievable optimality gap $G_{\text{D}}$ is rewritten as

$$\begin{aligned}
G_{\text{D}}&\approx\frac{\eta}{2\mu}\left(\frac{L\Delta^{2}}{(2^{b}-1)^{2}}+2L^{3}\delta^{2}\right)g_{\text{D}}(\mathbf{r},b)\\
&=\frac{\eta}{2\mu}\left(\frac{L\Delta^{2}}{(2^{b}-1)^{2}}+2L^{3}\delta^{2}\right)\\
&\quad\times\left(\sum_{k=1}^{K}\frac{\alpha_{k}}{r_{k}}\exp\left(\frac{BN_{0}\left(2^{\frac{Nd(b+1)}{BT_{\max}}}-1\right)}{2NP_{k}d_{k}^{-\alpha}}\right)\right),
\end{aligned} \tag{47}$$

where the approximation holds in the regime $\eta\ll\frac{\mu}{2L^{2}g_{\text{D}}(\mathbf{r},b)}$. It is found that, as $b$ increases, $G_{\text{D}}$ tends to first decrease and then increase. This is because the quantization error term $\phi(b)$ diminishes with increasing quantization accuracy, until $G_{\text{D}}$ is eventually dominated by the impact of packet loss. Therefore, it is necessary to optimize the integer variable $b$ to pursue better convergence performance, which can be done by a low-complexity exhaustive search.
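The exhaustive search over $b$ mentioned above can be sketched in a few lines of Python by evaluating the approximation (47) for each candidate value. The normalized inputs (`snr` for $P_{k}d_{k}^{-\alpha}/(BN_{0})$ and `load` for $Nd/(BT_{\max})$), the function names, the search range, and all numerical values are assumptions made for illustration and are not taken from the paper's experiments.

```python
import math


def gap_vs_bits(b, alpha_over_r, snr, n_sel, load, L, delta, Delta, eta, mu):
    """Approximate G_D from (47) for b quantization bits.

    alpha_over_r[k] stands for alpha_k / r_k;
    snr[k] stands for P_k d_k^{-alpha} / (B N_0);
    load stands for N d / (B T_max).
    """
    theta = 2.0 ** (load * (b + 1)) - 1.0
    weight = 0.0
    for w, s in zip(alpha_over_r, snr):
        x = theta / (2.0 * n_sel * s)
        weight += w * (math.exp(x) if x < 700.0 else math.inf)  # avoid overflow
    quant = L * Delta**2 / (2**b - 1) ** 2
    return eta / (2.0 * mu) * (quant + 2.0 * L**3 * delta**2) * weight


def best_bits(b_max, **kwargs):
    """Exhaustive search over b = 1, ..., b_max."""
    return min(range(1, b_max + 1), key=lambda b: gap_vs_bits(b, **kwargs))


# Hypothetical setting: K = N = 10 devices with equal data sizes and 30 dB SNR.
params = dict(alpha_over_r=[0.1] * 10, snr=[1000.0] * 10, n_sel=10, load=1.0,
              L=1.0, delta=0.1, Delta=1.0, eta=0.01, mu=1.0)
b_star = best_bits(16, **params)
print("best b:", b_star, "gap:", gap_vs_bits(b_star, **params))
```

Because $b$ is a small integer, the search costs only `b_max` evaluations of (47), which is why a plain scan is sufficient here.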

IV-B Convergence Analysis under the Average Power Budget

We now consider the convergence under the average transmit power budget. For the digital transmission, by replacing $P_{\max}$ with $P_{\mathrm{ave}}$, we obtain results similar to Theorem 1, which are omitted here due to the page limit. As for the analog transmission, we have the following corollary.

Corollary 2:

For a fixed learning rate satisfying $\eta\leq\frac{\mu}{2L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})}$, the optimality gap of the distributed gradient update in the $(m+1)$-th iteration under the analog transmission follows

$$
\begin{aligned}
\mathbb{E}\left[F(\tilde{\mathbf{w}}_{m+1})\right]-F(\mathbf{w}^{*})
&\leq\frac{L}{2}\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\right)^{m+1}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{0}-\mathbf{w}^{*}\right\|^{2}\right] \\
&\quad+\frac{\eta\left(L\varphi_{\mathrm{ave}}(\mathbf{r},\gamma_{\mathrm{th}})+2L^{3}\delta^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\right)}{2\mu-4\eta L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})},
\end{aligned}
\tag{48}
$$

where $\varphi_{\mathrm{ave}}(\mathbf{r},\gamma_{\mathrm{th}})\triangleq\frac{BN_{0}\gamma^{2}e^{2\gamma_{\mathrm{th}}}\mathrm{E}_{1}(\gamma_{\mathrm{th}})}{2P_{\mathrm{ave}}\rho^{2}}\max_{k}\left\{\frac{\alpha_{k}^{2}}{r_{k}^{2}}d_{k}^{\alpha}\right\}$. The optimality gap with sufficient iterations follows

$$G_{\text{A},\mathrm{ave}}=\frac{\eta\left(L\varphi_{\mathrm{ave}}(\mathbf{r},\gamma_{\mathrm{th}})+2L^{3}\delta^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})\right)}{2\mu-4\eta L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})}. \tag{49}$$

Proof:

Please refer to Appendix D. $\square$
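As a numerical illustration of (49), the following minimal Python sketch evaluates the asymptotic gap for given values of $\varphi_{\mathrm{ave}}$ and $g_{\text{A}}$. The values $\eta=0.01$, $L=8$, and $\mu=2$ are those listed in Section VI; the values of $\delta$, $\varphi_{\mathrm{ave}}$, and $g_{\text{A}}$ are hypothetical placeholders, and the function name is an assumption made for this example.

```python
def analog_gap(eta, L, mu, delta, phi_ave, g_A):
    """Asymptotic optimality gap G_{A,ave} in (49)."""
    denom = 2.0 * mu - 4.0 * eta * L**2 * g_A
    if denom <= 0.0:
        # Corresponds to violating eta < mu / (2 L^2 g_A).
        raise ValueError("learning rate too large for the given g_A")
    numer = eta * (L * phi_ave + 2.0 * L**3 * delta**2 * g_A)
    return numer / denom


# eta, L, mu follow Section VI; delta, phi_ave, g_A are assumed toy values.
gap = analog_gap(eta=0.01, L=8.0, mu=2.0, delta=0.1, phi_ave=0.5, g_A=1.0)
```

The sketch makes explicit that the gap grows both with the equivalent noise term $\varphi_{\mathrm{ave}}$ and with the virtual sum weight $g_{\text{A}}$, and that it diverges as $\eta$ approaches the learning-rate bound of the corollary.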

Remark 6:

It is worth noting that $\mathrm{E}_{1}(\gamma_{\mathrm{th}})<\frac{1}{\gamma_{\mathrm{th}}}$ when $\gamma_{\mathrm{th}}>0$. Compared with the maximum transmit power budget, a smaller optimality gap for the analog transmission is achieved with the average power budget. Due to the need for channel alignment in AirComp, the performance is dominantly limited by the device with the worst channel condition. Furthermore, the strict peak power constraint amplifies the impact of worst-case channel conditions, resulting in worse convergence performance than under the long-term constraint.
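The inequality quoted in Remark 6 can be checked in one line, assuming $\mathrm{E}_{1}$ denotes the standard exponential integral $\mathrm{E}_{1}(x)=\int_{x}^{\infty}t^{-1}e^{-t}\,\mathrm{d}t$ (its definition is not restated in this section). Since $t^{-1}<x^{-1}$ for all $t>x>0$,

```latex
\mathrm{E}_{1}(x)=\int_{x}^{\infty}\frac{e^{-t}}{t}\,\mathrm{d}t
<\frac{1}{x}\int_{x}^{\infty}e^{-t}\,\mathrm{d}t
=\frac{e^{-x}}{x}
<\frac{1}{x},\qquad x>0.
```

Setting $x=\gamma_{\mathrm{th}}$ gives the stated bound.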

To summarize, while the analog AirComp improves the spectrum utilization compared to the digital paradigm, it faces challenges in fully utilizing the power resource, particularly with strict peak power constraints. Conversely, orthogonal access in digital transmission is not suitable for scenarios with massive access due to the limitations in spectrum resources.

IV-C Discussions on Scenarios with Advanced System Designs

To facilitate the performance analysis, we have introduced assumptions regarding the system design, including multiple access, parameter quantization, and power control methods. In this subsection, we discuss the implications of more advanced system designs for the FL performance and for the comparison.

In the digital transmission, the FL performance can primarily be improved in two ways, namely by enhancing transmission reliability and by optimizing resource utilization. Specifically, advanced transmission strategies help minimize transmission errors and packet losses due to channel fading. Furthermore, if other resource allocation methods, such as model compression design and device scheduling strategies, are tailored to the FL tasks, they prioritize the transmissions of crucial parameters and devices and thus improve the resource utilization. On the other hand, in the analog transmission, the FL performance through AirComp is primarily influenced by the over-the-air computational accuracy. Optimized transceiver and power control designs help mitigate the negative impact of channel fading on the FL performance.

While further optimization of system designs enhances performance, it is essential to note that the performance limits for the digital and analog transmissions remain unchanged. As observed in the above analytical results, in the digital transmission paradigm, due to the decoupling of the communication and computation processes, the number of bits that can be accurately transmitted with the limited resources is fixed, which places an upper bound on the FL performance. In contrast, in the analog transmission, the receiver does not aim to recover information from individual sources but instead prioritizes the precision of the computation results derived from the over-the-air superimposed signals, thereby making computational accuracy the decisive factor. Hence, the performance limit of the analog transmission is contingent upon the channel estimation accuracy and the additive noise level.

V Device Sampling Optimization

Based on the derived results in Section IV, we are able to further establish an optimization design of the device sampling for the wireless FL to improve the convergence.

V-A Digital Transmission

By direct inspection of (42), the optimality gap $G_{\text{D}}$ monotonically decreases with a decreasing virtual sum weight. Hence, the device sampling optimization problem with the digital transmission is formulated as

$$
\begin{aligned}
\mathop{\text{minimize}}_{\mathbf{r}} &\quad g_{\text{D}}(\mathbf{r},b)=\sum_{k=1}^{K}\frac{\alpha_{k}}{p_{k}r_{k}} \\
\text{subject to} &\quad \sum_{k=1}^{K}r_{k}=N,\enspace r_{k}\leq 1,\enspace k=1,2,\cdots,K,
\end{aligned}
\tag{50}
$$

which is a convex problem. By exploiting the KKT conditions, we obtain the optimal inclusion probability as

$$r_{k}^{*}=\min\left\{\sqrt{\frac{\alpha_{k}}{\nu p_{k}}},\enspace 1\right\}, \tag{51}$$

where $\nu$ is the Lagrangian multiplier, selected to satisfy $\sum_{k=1}^{K}r_{k}^{*}=N$. Note that the value of $\sum_{k=1}^{K}r_{k}^{*}$ varies monotonically with $\nu$, and thus we can rely on a bisection-based search method [13] to obtain the optimal solution of problem (50).
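The bisection over $\nu$ described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the bracketing strategy, and the toy inputs are assumptions, and the sketch presumes strictly positive $\alpha_{k}$ and $p_{k}$ with $0<N\leq K$.

```python
import math


def optimal_inclusion(alpha, p, N, tol=1e-12, max_iter=200):
    """Solve (51) for r_k^* by bisection on the multiplier nu."""
    K = len(alpha)
    if not 0 < N <= K:
        raise ValueError("need 0 < N <= K")

    def r_of(nu):
        return [min(math.sqrt(a / (nu * q)), 1.0) for a, q in zip(alpha, p)]

    # At nu = min_k alpha_k / p_k every r_k is clipped to 1, so the sum is K >= N.
    lo = min(a / q for a, q in zip(alpha, p))
    hi = lo
    # The sum decreases as nu grows; enlarge hi until the sum drops to N or below.
    while sum(r_of(hi)) > N:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if sum(r_of(mid)) > N:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * hi:
            break
    return r_of(hi)


# Toy inputs (assumed): dataset weights alpha_k and success probabilities p_k.
alpha = [0.1, 0.2, 0.3, 0.4]
p = [0.9, 0.8, 0.7, 0.6]
r_star = optimal_inclusion(alpha, p, N=2)
```

In this toy instance, devices with a larger ratio $\alpha_{k}/p_{k}$ receive a larger inclusion probability, which matches the square-root structure of (51).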

Remark 7:

The optimal inclusion probability is positively correlated with the local dataset size and negatively correlated with the successful transmission probability. In other words, a device with a larger dataset is deemed more important for model training, thereby deserving a sampling bias. Conversely, devices with lower successful transmission probabilities contribute less to the model training process, requiring more frequent sampling to compensate. Thus, the goal of our inclusion probability optimization is to address the imbalance in dataset sizes and the heterogeneity introduced by uneven channel fading. It ensures fair and effective participation among diverse devices.

Moreover, note that the influences of the quantization error and the data heterogeneity are equally amplified by $g_{\text{D}}(\mathbf{r},b)$. This indicates that the optimization of the inclusion probabilities $\mathbf{r}$ cannot adequately adapt to varying local data distributions.

Figure 2: Convergence performance under digital transmission: (a) MNIST dataset, (b) CIFAR-10 dataset.
Figure 3: Convergence performance under analog transmission: (a) MNIST dataset, (b) CIFAR-10 dataset.

V-B Analog Transmission

As for the analog transmission, the device sampling optimization is expressed as

$$
\begin{aligned}
\mathop{\text{minimize}}_{\mathbf{r}} &\quad \frac{\varphi(\mathbf{r},\gamma_{\mathrm{th}})+2L^{2}\delta^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})}{2\mu-4\eta L^{2}g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})} \\
\text{subject to} &\quad \sum_{k=1}^{K}r_{k}=N,\enspace r_{k}\leq 1,\enspace k=1,2,\cdots,K.
\end{aligned}
\tag{52}
$$

Note that under the average transmit power budget, (49) only differs from the objective value in the constant term, and hence we do not discuss it separately. Considering the intractable fractional form of the objective function in (52), we rely on the well-known Dinkelbach algorithm for reformulation [43, 44]. According to the definitions of $\varphi(\mathbf{r},\gamma_{\mathrm{th}})$ and $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$ in (43), it is easy to check that the denominator of the objective function in (52) is concave and the numerator is convex. Hence, the iterative Dinkelbach algorithm is guaranteed to converge to the global optimum of (52). Concretely, in the $t$-th iteration, we reformulate the problem in (52) as

$$
\begin{aligned}
\mathop{\text{minimize}}_{\mathbf{r}} &\quad \varphi(\mathbf{r},\gamma_{\mathrm{th}})+\left(2L^{2}\delta^{2}+4\eta L^{2}\varsigma^{(t-1)}\right)g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}}) \\
\text{subject to} &\quad \sum_{k=1}^{K}r_{k}=N,\enspace r_{k}\leq 1,\enspace k=1,2,\cdots,K,
\end{aligned}
\tag{53}
$$

where $\varsigma^{(t-1)}$ is a constant determined in the previous round. Note that the problem in (53) is convex and thus can be solved by numerical convex program solvers, e.g., CVX tools [45]. After obtaining the optimal $\mathbf{r}^{(t)}$ of the $t$-th subproblem in (53), the auxiliary constant is updated as

$$\varsigma^{(t)}=\frac{\varphi(\mathbf{r}^{(t)},\gamma_{\mathrm{th}})+2L^{2}\delta^{2}g_{\text{A}}(\mathbf{r}^{(t)},\gamma_{\mathrm{th}})}{2\mu-4\eta L^{2}g_{\text{A}}(\mathbf{r}^{(t)},\gamma_{\mathrm{th}})}. \tag{54}$$

Iterating the above steps until convergence, we obtain the optimal $\mathbf{r}$ of the problem in (52).
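The alternation between solving a parametric subproblem and updating the auxiliary constant can be illustrated on a toy one-dimensional fractional program. The sketch below is not the sampling problem itself, since $\varphi$ and $g_{\text{A}}$ from (43) are not restated in this section: it minimizes the assumed ratio $(x^{2}+1)/x$ over $x\in[0.5,3]$, which has a convex numerator and a positive affine denominator, and whose subproblem has the closed-form solution used in the code. All names are hypothetical.

```python
def dinkelbach_toy(x_lo=0.5, x_hi=3.0, tol=1e-10, max_iter=100):
    """Dinkelbach iteration for min (x^2 + 1) / x over [x_lo, x_hi]."""

    def numer(x):
        return x * x + 1.0  # convex numerator

    def denom(x):
        return x  # positive affine (hence concave) denominator

    lam = 0.0  # auxiliary constant, playing the role of the update in (54)
    x = x_lo
    for _ in range(max_iter):
        # Subproblem: minimize numer(x) - lam * denom(x) over the interval.
        # Its unconstrained minimizer is lam / 2, clipped to the feasible set.
        x = min(max(lam / 2.0, x_lo), x_hi)
        new_lam = numer(x) / denom(x)
        converged = abs(new_lam - lam) <= tol
        lam = new_lam
        if converged:
            break
    return x, lam


x_opt, ratio_opt = dinkelbach_toy()
```

For this toy problem the iteration approaches the minimizer $x=1$ with ratio $2$. In the sampling problem, the closed-form subproblem step would be replaced by a convex solver applied to (53).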

Remark 8:

Unlike the digital transmission case, the device sampling optimization here seeks a trade-off between the equivalent noise power $\varphi(\mathbf{r},\gamma_{\mathrm{th}})$ and the virtual sum weight $g_{\text{A}}(\mathbf{r},\gamma_{\mathrm{th}})$, and the parameter $\delta$ functions as a weighting factor that sets this trade-off. In high SNR regimes or with extremely uneven local data distributions, the noise term is comparatively negligible and hence the optimality gap is dominated by $g_{\text{A}}$. In this case, the optimization of $\mathbf{r}$ is decoupled from specific channel conditions and only needs to match the sizes of the local datasets.

VI Numerical Results

In this section, we provide simulation results to verify the performance analysis and the inclusion probability optimization. We deploy $K=20$ edge devices uniformly distributed in a square area with radius $500$ m and a PS at the center of the square area. The widely used MNIST and CIFAR-10 datasets are adopted for the FL performance evaluation. The MNIST dataset contains 10 classes of handwritten digits ranging from 0 to 9, and we train a multi-layer perceptron (MLP) with $d=23{,}860$ parameters via the wireless FL algorithm for classification. Moreover, the CIFAR-10 dataset includes 10 classes with labels 0-9, and we train a convolutional neural network (CNN) with $d=60{,}000$ parameters. The CNN contains two convolutional layers and three fully connected layers. A max pooling operation follows each convolutional layer, and the activation function is ReLU. Different edge devices own different data samples, and each local dataset has up to two types of data samples to capture the non-IID characteristic.

Unless otherwise specified, the other parameters are set as follows: the number of participating devices, $N=10$; the bandwidth, $B=1$ MHz; the path loss exponent, $\alpha=3$; the noise power, $N_{0}=-80$ dBm/Hz; the maximum transmit power budget, $P_{\max}=0$ dB; the number of quantization bits, $b=8$; the truncation threshold, $\gamma_{\mathrm{th}}=0.5$; the delay target $T_{\max}$, set equal to $T_{\text{A}}$ in (32); and the learning rate, $\eta=0.01$. We set $L=8$ and $\mu=2$, which fall within the typical range of values in [46, 47]. Additionally, the parameter $\delta$, serving as an upper bound of $\left\|\mathbf{w}_{k}^{*}-\mathbf{w}^{*}\right\|^{2}$, is estimated through simulation tests.

VI-A Convergence Performance

Figure 4: Test accuracy versus transmit power budget.
Figure 5: Test accuracy versus the number of participating devices.

In Figs. 2 and 3, we depict the convergence performance for the digital and analog transmission. As shown in Fig. 2, we observe that the convergence rate and optimality gap under digital transmission exhibit a negative correlation with the virtual sum weight, aligning with our theoretical analysis. Moreover, the convergence behavior remains consistent with the analytical results despite the complexity of the classification task, thereby validating the accuracy of the theoretical analysis.

Figure 6: Test accuracy versus channel estimation accuracy.

For the analog case depicted in Fig. 3, consistent with the analytical findings, we notice that the convergence rate is negatively correlated with the virtual sum weight $g_{\text{A}}$, which is determined by $\rho$ and $\gamma_{\mathrm{th}}$. On the other hand, the transmit power only affects the achievable optimality gap after convergence. This is because changes in transmit power only affect the equivalent power of the additive noise. Additionally, modifications in $\rho$ and $\gamma_{\mathrm{th}}$ affect the distortion of the aggregation coefficient, which in turn influences the computation error. Furthermore, the increased complexity of FL tasks renders the performance curve more sensitive to noise. In the analog communication, the superimposed white Gaussian noise is significantly more severe than the quantization errors observed in the digital transmission, thus leading to more pronounced fluctuations in convergence performance. This implies that for more complex learning tasks, it becomes imperative to further reduce the variance of gradient estimation to mitigate excessive fluctuations and their adverse impact on convergence.

VI-B Impact of Transmit Power Budget

In Fig. 4, we show the test accuracy versus different transmit power budgets. It is observed that the digital transmission scheme outperforms the analog scheme, particularly at high SNR levels. In such cases, employing more quantization bits yields the best performance. Conversely, at low SNR levels, reducing the number of quantization bits leads to only a marginal performance loss, highlighting the flexibility of the digital scheme in selecting different quantization accuracies. On the other hand, the analog scheme faces significant performance limitations, particularly with the maximum transmit power budget and less accurate CSI, due to the stringent requirements of channel inversion. Therefore, in terms of power utilization, the digital scheme is more efficient than its analog counterpart.

Figure 7: Convergence performance with different inclusion probabilities and digital transmission: (a) MNIST dataset, (b) CIFAR-10 dataset.
Figure 8: Convergence performance with different inclusion probabilities and analog transmission: (a) MNIST dataset, (b) CIFAR-10 dataset.

VI-C Impact of Participating Device Numbers

Fig. 5 illustrates the test accuracy versus the number of participating devices. We note that for the analog transmission, the test accuracy gradually increases as $N$ increases. In contrast, although the performance in the digital case may improve initially, it eventually declines rapidly because each device can only occupy a limited amount of resources, making it unable to support high-rate transmission. Consequently, the results suggest that for digital transmission, the selection of $N$ requires further optimization according to the actual conditions, with a preference for fewer devices.

VI-D Impact of Channel Estimation Accuracy

In Fig. 6, we present the impact of channel estimation accuracy on the analog case. It is evident that better performance can be achieved with more accurate CSI. Additionally, we observe that smaller truncation thresholds are more suitable for larger $\rho$, while larger truncation thresholds are preferred for smaller $\rho$. This is because higher CSI uncertainties have a significant impact on truncation choices, necessitating looser truncation conditions to reduce incorrect choices.

VI-E Impact of Different Inclusion Probabilities

In Figs. 7 and 8, we depict the convergence performance with different inclusion probabilities. We consider the following baselines for comparison. For the sake of fairness, all schemes refrain from utilizing specific information on instantaneous CSI and gradients.

  • Uniform [49]: The inclusion probabilities are uniformly assigned the same value, i.e., $p_{k}=\frac{1}{K}$.

  • Learning-oriented [51]: From the perspective of learning algorithms, the probability is set to be proportional to the size of the local datasets, i.e., $p_{k}\propto\alpha_{k}$.

  • Channel-aware: From the perspective of wireless channels, the probability is set to be proportional to the large-scale path loss, i.e., $p_{k}\propto d_{k}^{-\frac{\alpha}{2}}$.

  • Min-distortion [52]: To minimize the communication distortion in the analog transmission, the probability is set to be proportional to $\alpha_{k}d_{k}^{\frac{\alpha}{2}}$ by considering both the local datasets and channel conditions.
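The four baseline rules listed above can be written compactly as follows. This is a minimal sketch under stated assumptions: each proportional rule is normalized so that the probabilities sum to one, matching the uniform case $p_{k}=\frac{1}{K}$; any further rescaling needed to select $N$ devices per round is not shown. The function name and the toy inputs are hypothetical.

```python
def baseline_probabilities(alpha, dist, pl_exp):
    """Baseline sampling rules; alpha: dataset weights, dist: distances d_k."""
    K = len(alpha)

    def normalize(weights):
        total = sum(weights)
        return [w / total for w in weights]

    return {
        # Uniform: p_k = 1 / K.
        "uniform": [1.0 / K] * K,
        # Learning-oriented: p_k proportional to alpha_k.
        "learning_oriented": normalize(alpha),
        # Channel-aware: p_k proportional to d_k^{-alpha/2}.
        "channel_aware": normalize([d ** (-pl_exp / 2.0) for d in dist]),
        # Min-distortion: p_k proportional to alpha_k * d_k^{alpha/2}.
        "min_distortion": normalize(
            [a * d ** (pl_exp / 2.0) for a, d in zip(alpha, dist)]
        ),
    }


# Toy inputs (assumed): four devices with growing dataset size and distance.
probs = baseline_probabilities(
    alpha=[0.1, 0.2, 0.3, 0.4], dist=[100.0, 200.0, 300.0, 400.0], pl_exp=3.0
)
```

In this toy setting the channel-aware rule favors nearby devices, while the min-distortion rule weights devices by both dataset size and distance.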

As shown in Fig. 7, the proposed method consistently outperforms the aforementioned baseline methods across all levels. The first two baselines neglect the influence of the wireless transmission process, resulting in performance degradation. The sampling method based on channel conditions tends to select devices with better channels, effectively reducing packet loss rates and yielding significant performance improvements. However, due to its oversight of imbalanced size of local datasets, its final performance remains inferior to our proposed method. The fourth baseline, tailored for the analog transmission scenarios, partially accounts for the impact of local datasets and wireless channels but lacks optimality, leading to limited performance gains.

As for the analog transmission case in Fig. 8, we note that although the optimized probability performs best, the performance gain over the other baselines is not significant. This limitation arises from the reliance on the constants $L$, $\mu$, and $\delta$ in the optimization problem (52), which are challenging to determine accurately in practice, thus affecting the final performance. As in the digital transmission, the sampling method based on channel conditions effectively mitigates the negative impact of the imperfect wireless transmission. However, its disregard for data characteristics results in suboptimal performance, particularly in the complex classification task on the CIFAR-10 dataset, where it leads to significant performance fluctuations. Furthermore, the baseline method of minimizing computational distortion overlooks the impact of data heterogeneity, which impedes its ability to achieve satisfactory performance.

VII Conclusion

In this paper, we have provided a detailed comparison between digital and analog transmission enabled wireless FL. To this end, we considered general transmission designs for both schemes and conducted a fair comparison between them. Then, we analyzed the convergence behavior of wireless FL in terms of the convergence rate and optimality gap under the digital and analog cases, and compared the convergence performance from multiple perspectives. It was found that digital transmission is more suitable for scenarios with sufficient radio resources and CSI uncertainties. On the other hand, analog transmission is suitable when there are massive numbers of participating devices. Next, we addressed sampling optimization for both cases and developed further insights for optimization, which are useful for practical deployment. Finally, experimental results corroborated the analytical results and the sampling strategies. An explicit and precise characterization of data heterogeneity, together with targeted system designs with theoretical guarantees, is left for future work.

Appendix A Proof of Lemma 2

For the digital case, according to [19, Lemma 5], we first conclude that the quantized gradient $\mathcal{Q}(\mathbf{g}_{m}^{k})$ is unbiased, i.e.,

$$
\mathbb{E}\left[\mathcal{Q}(\mathbf{g}_{m}^{k})\right]=\mathbf{g}_{m}^{k}. \tag{55}
$$

Combining this with the fact that $\mathbb{E}\left[\xi_{k,\text{D}}\right]=1$ in (16), we have

$$
\begin{aligned}
\mathbb{E}\left[\hat{\mathbf{g}}_{m,\text{D}}\right]
&\overset{\text{(a)}}{=}\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\frac{\chi_{k}}{r_{k}}\right]\mathbb{E}\left[\xi_{k,\text{D}}\right]\mathbb{E}\left[\mathcal{Q}(\mathbf{g}_{m}^{k})\right] \\
&=\sum_{k=1}^{K}\alpha_{k}\mathbf{g}_{m}^{k}=\mathbf{g}_{m},
\end{aligned}
\tag{56}
$$

where (a) comes from the definition of $\hat{\mathbf{g}}_{m,\text{D}}$ and the independence among device sampling, small-scale fading, and stochastic quantization.

As for the analog transmission, by exploiting [34, Lemma 1], we have $\mathbb{E}\left[\xi_{k,\text{A}}\right]=1$. Combining this with the statistical characteristics of $\chi_{k}$ and $\bar{\mathbf{z}}_{m}$ and following the same procedure as in (A), we obtain the desired conclusion, i.e., $\mathbb{E}\left[\hat{\mathbf{g}}_{m,\text{A}}\right]=\mathbf{g}_{m}$. This completes the proof.
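As an illustrative sanity check of the digital-case argument in (56), the following minimal sketch enumerates every joint outcome of sampling, transmission, and quantization for a toy system with scalar gradients and confirms exactly that the aggregate is unbiased. It rests on assumptions that are not taken from the paper: $\chi_{k}$ is modeled as Bernoulli with mean $r_{k}$, $\xi_{k,\text{D}}$ is modeled as a success indicator divided by a success probability $p_{k}$ (so that its mean is 1), and the quantizer is a stochastic uniform quantizer on $2^{b}$ levels. All function names and numerical values are hypothetical.

```python
from fractions import Fraction
from itertools import product


def quantizer_outcomes(g, lo, hi, bits):
    """Outcomes [(value, prob)] of a stochastic uniform quantizer (assumed model)."""
    step = (hi - lo) / (2 ** bits - 1)      # spacing between adjacent levels
    idx = (g - lo) // step                  # index of the level at or below g
    lower = lo + idx * step
    p_up = (g - lower) / step               # round up with this probability
    return [(lower, 1 - p_up), (lower + step, p_up)]


def device_outcomes(alpha, r, p, g, lo, hi, bits):
    """All (contribution, prob) pairs of one device to the aggregated gradient."""
    out = []
    for chi, p_chi in ((1, r), (0, 1 - r)):          # device sampled or not
        for ok, p_ok in ((1, p), (0, 1 - p)):        # transmission succeeds or not
            xi = Fraction(ok) / p                    # assumed form, E[xi] = 1
            for q, p_q in quantizer_outcomes(g, lo, hi, bits):
                out.append((alpha * chi / r * xi * q, p_chi * p_ok * p_q))
    return out


def expected_aggregate(devices):
    """Exact expectation of the aggregate over all joint outcomes."""
    total = Fraction(0)
    for joint in product(*(device_outcomes(*d) for d in devices)):
        prob = Fraction(1)
        value = Fraction(0)
        for v, pr in joint:
            prob *= pr
            value += v
        total += prob * value
    return total


# Hypothetical toy setup: (alpha_k, r_k, p_k, g_k, g_min, g_max, bits) per device.
devices = [
    (Fraction(1, 2), Fraction(3, 10), Fraction(9, 10), Fraction(7, 10), Fraction(-1), Fraction(1), 2),
    (Fraction(1, 3), Fraction(1, 2), Fraction(3, 5), Fraction(-1, 4), Fraction(-1), Fraction(1), 2),
    (Fraction(1, 6), Fraction(4, 5), Fraction(1, 2), Fraction(1, 5), Fraction(-1), Fraction(1), 2),
]
true_gradient = sum(d[0] * d[3] for d in devices)
print(expected_aggregate(devices), true_gradient)
```

Exact rational arithmetic is used instead of Monte Carlo sampling so that the equality holds without any statistical tolerance.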

Appendix B Proof of Theorem 1

To begin with, we define an auxiliary variable as

$$
\hat{\mathbf{w}}_{m+1}=\tilde{\mathbf{w}}_{m}-\eta\mathbf{g}_{m}, \tag{57}
$$

which represents the model obtained at the $(m+1)$-th round via ideal communication and full participation. Then, by exploiting Assumption 2 and the fact that $\nabla F(\mathbf{w}^{*})=0$, we have

$$
\begin{aligned}
&\mathbb{E}\left[F(\tilde{\mathbf{w}}_{m+1})\right]-F(\mathbf{w}^{*})\leq\frac{L}{2}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m+1}-\mathbf{w}^{*}\right\|^{2}\right] \\
&\quad\overset{\text{(a)}}{=}\frac{L}{2}\left(\underbrace{\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m+1}-\hat{\mathbf{w}}_{m+1}\right\|^{2}\right]}_{A_{1}}+\underbrace{\mathbb{E}\left[\left\|\hat{\mathbf{w}}_{m+1}-\mathbf{w}^{*}\right\|^{2}\right]}_{A_{2}}\right),
\end{aligned}
\tag{58}
$$

where (a) is due to the fact that $\hat{\mathbf{g}}_{m,\text{D}}$ is an unbiased estimate of $\mathbf{g}_{m}$. The term $A_{1}$ is bounded by

$$
\begin{aligned}
A_{1}
&=\eta^{2}\mathbb{E}\left[\left\|\hat{\mathbf{g}}_{m,\text{D}}-\mathbf{g}_{m}\right\|^{2}\right] \\
&=\eta^{2}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\frac{\chi_{k}\alpha_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})-\sum_{k=1}^{K}\alpha_{k}\mathbf{g}_{m}^{k}\right\|^{2}\right] \\
&\overset{\text{(a)}}{=}\eta^{2}\mathbb{E}\left[\left\|\sum_{k=1}^{K}\alpha_{k}\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})-\sum_{i=1}^{K}\alpha_{i}\mathbf{g}_{m}^{i}\right)\right\|^{2}\right] \\
&\overset{\text{(b)}}{\leq}\eta^{2}\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})-\sum_{i=1}^{K}\alpha_{i}\mathbf{g}_{m}^{i}\right\|^{2}\right] \\
&=\eta^{2}\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})-\mathbf{g}_{m}^{k}\right)+\left(\mathbf{g}_{m}^{k}-\sum_{i=1}^{K}\alpha_{i}\mathbf{g}_{m}^{i}\right)\right\|^{2}\right] \\
&\overset{\text{(c)}}{=}\eta^{2}\underbrace{\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})-\mathbf{g}_{m}^{k}\right\|^{2}\right]}_{B_{1}}
+\eta^{2}\underbrace{\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\mathbf{g}_{m}^{k}-\sum_{i=1}^{K}\alpha_{i}\mathbf{g}_{m}^{i}\right\|^{2}\right]}_{B_{2}},
\end{aligned}
\tag{59}
$$

where (a) is because $\sum_{k=1}^{K}\alpha_{k}=1$, (b) exploits the convexity of $\|\cdot\|^{2}$, and (c) is due to the fact that $\mathbb{E}\left[\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})\right]=\mathbf{g}_{m}^{k}$. According to [20], the variance of the quantization error is bounded as

$$
\begin{aligned}
\mathbb{E}\left[\left\|\mathcal{Q}(\mathbf{g}_{m}^{k})-\mathbf{g}_{m}^{k}\right\|^{2}\right]
&\leq\frac{d}{4}\left(\frac{g_{m,\max}^{k}-g_{m,\min}^{k}}{2^{b}-1}\right)^{2} \\
&\leq\frac{\Delta^{2}}{(2^{b}-1)^{2}}\triangleq\phi(b),
\end{aligned}
\tag{60}
$$

where $\Delta^{2}$ is defined as a uniform upper bound of $\frac{d}{4}\left(g_{m,\max}^{k}-g_{m,\min}^{k}\right)^{2}$, $\forall m,k$. Then, $B_{1}$ is bounded by

$$
\begin{aligned}
B_{1}
&=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathcal{Q}(\mathbf{g}_{m}^{k})-\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathbf{g}_{m}^{k}\right)+\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\mathbf{g}_{m}^{k}-\mathbf{g}_{m}^{k}\right)\right\|^{2}\right] \\
&=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\right)^{2}\right]\mathbb{E}\left[\left\|\mathcal{Q}(\mathbf{g}_{m}^{k})-\mathbf{g}_{m}^{k}\right\|^{2}\right]
+\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}-1\right)^{2}\right]\mathbb{E}\left[\left\|\mathbf{g}_{m}^{k}\right\|^{2}\right] \\
&\overset{\text{(a)}}{\leq}\sum_{k=1}^{K}\frac{\phi(b)\alpha_{k}}{p_{k}r_{k}}+\sum_{k=1}^{K}\alpha_{k}\left(\frac{1}{p_{k}r_{k}}-1\right)\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right],
\end{aligned}
\tag{61}
$$

where (a) uses $\mathbb{E}\left[\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}\right)^{2}\right]=\frac{1}{p_{k}r_{k}}$ and $\mathbb{E}\left[\left(\frac{\chi_{k}\xi_{k,\text{D}}}{r_{k}}-1\right)^{2}\right]=\frac{1}{p_{k}r_{k}}-1$. Next, by expanding the square term, we reformulate $B_{2}$ as

$$
\begin{aligned}
B_{2}
&=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})-\sum_{i=1}^{K}\alpha_{i}\nabla F_{i}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right] \\
&=\sum_{k=1}^{K}\alpha_{k}\left(\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]+\mathbb{E}\left[\left\|\sum_{i=1}^{K}\alpha_{i}\nabla F_{i}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]
-2\mathbb{E}\left[\nabla F_{k}(\tilde{\mathbf{w}}_{m})^{T}\left(\sum_{i=1}^{K}\alpha_{i}\nabla F_{i}(\tilde{\mathbf{w}}_{m})\right)\right]\right) \\
&=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]-\mathbb{E}\left[\left\|\sum_{i=1}^{K}\alpha_{i}\nabla F_{i}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right] \\
&=\sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]-\mathbb{E}\left[\left\|\nabla F(\tilde{\mathbf{w}}_{m})\right\|^{2}\right].
\end{aligned}
\tag{62}
$$
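The two closed-form facts used for $B_{1}$ and $B_{2}$ above can be checked with exact arithmetic. The sketch below is illustrative only and makes assumptions that are not stated in this appendix: $\chi_{k}$ is taken as Bernoulli with mean $r_{k}$ and $\xi_{k,\text{D}}$ as a success indicator divided by its success probability $p_{k}$, which is one model consistent with the moments quoted in step (a) of (61). The function names and inputs are hypothetical.

```python
from fractions import Fraction


def scaled_indicator_moments(r, p):
    """Exact E[X^2] and E[(X - 1)^2] for X = chi * xi / r (assumed model).

    chi ~ Bernoulli(r); xi = 1{success} / p with success probability p,
    so X equals 1 / (r * p) with probability r * p and 0 otherwise.
    """
    prob_on = r * p
    x_on = 1 / prob_on
    second = prob_on * x_on ** 2
    centered = prob_on * (x_on - 1) ** 2 + (1 - prob_on) * 1
    return second, centered


def weighted_spread(alphas, grads):
    """Both sides of the identity in (62) for fixed (deterministic) gradients.

    Returns (sum_k alpha_k ||g_k - g_bar||^2, sum_k alpha_k ||g_k||^2 - ||g_bar||^2),
    where g_bar = sum_k alpha_k g_k and the alphas are assumed to sum to one.
    """
    dim = len(grads[0])
    mean = [sum(a * g[j] for a, g in zip(alphas, grads)) for j in range(dim)]
    lhs = sum(a * sum((g[j] - mean[j]) ** 2 for j in range(dim))
              for a, g in zip(alphas, grads))
    rhs = (sum(a * sum(x * x for x in g) for a, g in zip(alphas, grads))
           - sum(m * m for m in mean))
    return lhs, rhs


# Hypothetical values for illustration.
r_k, p_k = Fraction(3, 10), Fraction(4, 5)
second, centered = scaled_indicator_moments(r_k, p_k)
print(second == 1 / (p_k * r_k), centered == 1 / (p_k * r_k) - 1)

alphas = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
grads = [[Fraction(1), Fraction(-2)], [Fraction(0), Fraction(3)], [Fraction(4), Fraction(1, 2)]]
lhs, rhs = weighted_spread(alphas, grads)
print(lhs == rhs)
```

The second function shows why the cross term in (62) collapses: once the weights sum to one, the weighted spread around the mean equals the weighted second moment minus the squared norm of the mean.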

Then, for $A_{2}$, we have

$$
\begin{aligned}
A_{2}
&=\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}-\eta\nabla F(\tilde{\mathbf{w}}_{m})\right\|^{2}\right] \\
&=\mathbb{E}\left[\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}\|^{2}\right]-2\eta\mathbb{E}\left[(\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*})^{T}\nabla F(\tilde{\mathbf{w}}_{m})\right]+\eta^{2}\mathbb{E}\left[\|\nabla F(\tilde{\mathbf{w}}_{m})\|^{2}\right] \\
&\overset{\text{(a)}}{\leq}(1-\eta\mu)\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}\right\|^{2}\right]+2\eta\mathbb{E}\left[F(\mathbf{w}^{*})-F(\tilde{\mathbf{w}}_{m})\right]+\eta^{2}\mathbb{E}\left[\|\nabla F(\tilde{\mathbf{w}}_{m})\|^{2}\right] \\
&\overset{\text{(b)}}{\leq}\left(1-\eta\mu\right)\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}\right\|^{2}\right]+\eta^{2}\mathbb{E}\left[\|\nabla F(\tilde{\mathbf{w}}_{m})\|^{2}\right],
\end{aligned}
\tag{63}
$$

where the inequality in (a) is due to Assumption 1, and (b) is due to the fact that $F(\mathbf{w}^{*})-F(\mathbf{w})\leq 0$ for all $\mathbf{w}\in\mathbb{R}^{d}$.
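As a quick sanity check of the bound on $A_{2}$, the sketch below evaluates both sides for a hypothetical diagonal quadratic loss $F(\mathbf{w})=\frac{1}{2}\sum_{i}c_{i}w_{i}^{2}$ with minimizer $\mathbf{w}^{*}=\mathbf{0}$ and strong-convexity constant $\mu=\min_{i}c_{i}$. The loss, the function name `a2_bound_terms`, and the numerical values are illustrative assumptions and do not come from the paper. The expectation is dropped because everything is deterministic here.

```python
def a2_bound_terms(w, curvatures, eta):
    """Return (lhs, rhs) of the A_2 bound for F(w) = 0.5 * sum(c_i * w_i^2).

    Assumes the minimizer is w* = 0 and mu = min(c_i); both are properties
    of this toy quadratic, not of the losses considered in the paper.
    """
    mu = min(curvatures)
    grad = [c * x for c, x in zip(curvatures, w)]
    # Left-hand side: ||w - w* - eta * grad F(w)||^2 with w* = 0.
    lhs = sum((x - eta * g) ** 2 for x, g in zip(w, grad))
    # Right-hand side: (1 - eta * mu) * ||w - w*||^2 + eta^2 * ||grad F(w)||^2.
    dist_sq = sum(x * x for x in w)
    grad_sq = sum(g * g for g in grad)
    rhs = (1.0 - eta * mu) * dist_sq + eta ** 2 * grad_sq
    return lhs, rhs


lhs, rhs = a2_bound_terms(w=[1.0, -2.0, 0.5], curvatures=[0.5, 1.0, 4.0], eta=0.1)
print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```

For this toy loss the inequality holds coordinate-wise, since $(1-\eta c_{i})^{2}\leq 1-\eta\mu+\eta^{2}c_{i}^{2}$ whenever $2c_{i}\geq\mu$.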

Combining the above results yields

$$
\begin{aligned}
\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m+1}-\mathbf{w}^{*}\right\|^{2}\right]
&\leq(1-\eta\mu)\,\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}\right\|^{2}\right] \\
&\quad+\sum_{k=1}^{K}\frac{\eta^{2}\alpha_{k}}{p_{k}r_{k}}\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]+\sum_{k=1}^{K}\frac{\eta^{2}\alpha_{k}\phi(b)}{p_{k}r_{k}}.
\end{aligned}
\tag{64}
$$

We further bound the second term on the right-hand side (RHS) of the above inequality as

$$
\begin{aligned}
\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]
&\overset{\text{(a)}}{=}\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})-\nabla F_{k}(\mathbf{w}_{k}^{*})\right\|^{2}\right] \\
&\overset{\text{(b)}}{\leq}L^{2}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}_{k}^{*}\right\|^{2}\right]=L^{2}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}+\mathbf{w}^{*}-\mathbf{w}_{k}^{*}\right\|^{2}\right] \\
&\overset{\text{(c)}}{\leq}2L^{2}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}\right\|^{2}\right]+2L^{2}\delta^{2},
\end{aligned}
\tag{65}
$$

where (a) comes from $\nabla F_{k}(\mathbf{w}_{k}^{*})=\mathbf{0}$, (b) exploits Assumption 2, and (c) uses Assumption 3 and the inequality $\|\mathbf{a}+\mathbf{b}\|^{2}\leq 2\|\mathbf{a}\|^{2}+2\|\mathbf{b}\|^{2}$. By defining $g_{\text{D}}(\mathbf{r},b)=\sum_{k=1}^{K}\frac{\alpha_{k}}{p_{k}r_{k}}$, we conclude that

$$
\begin{aligned}
\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m+1}-\mathbf{w}^{*}\right\|^{2}\right]
&\leq\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{D}}(\mathbf{r},b)\right)\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{m}-\mathbf{w}^{*}\right\|^{2}\right]+\eta^{2}\left(\phi(b)+2L^{2}\delta^{2}\right)g_{\text{D}}(\mathbf{r},b) \\
&\leq\left(1-\eta\mu+2\eta^{2}L^{2}g_{\text{D}}(\mathbf{r},b)\right)^{m+1}\mathbb{E}\left[\left\|\tilde{\mathbf{w}}_{0}-\mathbf{w}^{*}\right\|^{2}\right]+\frac{\eta\left(\phi(b)+2L^{2}\delta^{2}\right)g_{\text{D}}(\mathbf{r},b)}{\mu-2\eta L^{2}g_{\text{D}}(\mathbf{r},b)}.
\end{aligned}
\tag{66}
$$

Plugging (B) into (B), we obtain the convergence result and complete the proof.
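The unrolling step in (66) is a geometric-series argument: a recursion $e_{i+1}\leq\kappa e_{i}+c$ with $0<\kappa<1$ gives $e_{m+1}\leq\kappa^{m+1}e_{0}+c/(1-\kappa)$. The sketch below checks this numerically by iterating the one-step inequality with equality and comparing it against the closed-form bound. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
def contraction_factor(eta, mu, L, g_d):
    """Per-round factor 1 - eta*mu + 2*eta^2*L^2*g_D appearing in (66)."""
    return 1.0 - eta * mu + 2.0 * eta ** 2 * L ** 2 * g_d


def unrolled_bound(eta, mu, L, g_d, phi_b, delta_sq, e0, m):
    """Closed-form bound of (66) on the error after m + 1 rounds."""
    kappa = contraction_factor(eta, mu, L, g_d)
    if not 0.0 < kappa < 1.0:
        raise ValueError("step size does not give a contraction")
    # Error floor: eta*(phi(b) + 2 L^2 delta^2) g_D / (mu - 2 eta L^2 g_D).
    floor = (eta * (phi_b + 2.0 * L ** 2 * delta_sq) * g_d
             / (mu - 2.0 * eta * L ** 2 * g_d))
    return kappa ** (m + 1) * e0 + floor


def iterate_recursion(eta, mu, L, g_d, phi_b, delta_sq, e0, m):
    """Apply the one-step inequality of (66) with equality, m + 1 times."""
    kappa = contraction_factor(eta, mu, L, g_d)
    c = eta ** 2 * (phi_b + 2.0 * L ** 2 * delta_sq) * g_d
    e = e0
    for _ in range(m + 1):
        e = kappa * e + c
    return e


# Hypothetical values, not taken from the paper's simulations.
params = dict(eta=0.01, mu=1.0, L=2.0, g_d=1.5, phi_b=0.1, delta_sq=0.05, e0=4.0)
for m in (0, 10, 1000):
    print(m, iterate_recursion(m=m, **params), unrolled_bound(m=m, **params))
```

The check also makes the step-size requirement visible: the bound is only meaningful when $\eta<\mu/(2L^{2}g_{\text{D}}(\mathbf{r},b))$, so that the contraction factor is below one and the denominator of the error floor is positive.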

Appendix C Proof of Theorem 2

For analog transmission, the main difference from the digital case lies in the term $B_{1}$ in (B). In the analog case, $B_{1}$ is expressed as

$$
\begin{aligned}
B_{1} &= \sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\frac{\chi_{k}\xi_{k,\text{A}}}{r_{k}}\mathbf{g}_{m}^{k}+\bar{\mathbf{z}}_{m}-\mathbf{g}_{m}^{k}\right\|^{2}\right] \\
&= \sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left\|\left(\frac{\chi_{k}\xi_{k,\text{A}}}{r_{k}}-1\right)\mathbf{g}_{m}^{k}\right\|^{2}\right]+\mathbb{E}\left[\left\|\bar{\mathbf{z}}_{m}\right\|^{2}\right] \\
&= \sum_{k=1}^{K}\alpha_{k}\mathbb{E}\left[\left(\frac{\chi_{k}\xi_{k,\text{A}}}{r_{k}}-1\right)^{2}\right]\mathbb{E}\left[\left\|\nabla F_{k}(\tilde{\mathbf{w}}_{m})\right\|^{2}\right]+\mathbb{E}\left[\left\|\bar{\mathbf{z}}_{m}\right\|^{2}\right].
\end{aligned}
\tag{67}
$$

For the equivalent noise term, recalling that $\bar{\mathbf{z}}_{m}=\frac{\Re\left\{\mathbf{z}_{m}\right\}}{\zeta}$, we first derive the scaling factor $\zeta$. Constrained by the transmit power budget in (33), the scaling factor $\zeta$ must satisfy

$$\max_{k\in\mathcal{S}_{m}}\left\{\frac{\zeta^{2}e^{2\gamma_{\mathrm{th}}}\alpha_{k}^{2}d_{k}^{\alpha}}{\rho^{2}r_{k}^{2}|\hat{h}_{k}|^{2}}\mathbb{E}\left[\left\|\mathbf{g}_{m}^{k}\right\|^{2}\right]\right\}\leq P_{\max}. \tag{68}$$

Based on Assumption 3 and the definition in (3), we can conclude that

$$\left\|\mathbf{g}_{m}^{k}\right\|\leq\frac{1}{D_{k}}\sum_{\mathbf{u}\in\mathcal{D}_{k}}\left\|\nabla\mathcal{L}(\mathbf{w}_{m},\mathbf{u})\right\|\leq\gamma. \tag{69}$$

Note that for all the activated devices, we have $|\hat{h}_{k}|^{2}\geq\gamma_{\mathrm{th}}$. Hence, we select a feasible $\zeta$ as

$$\zeta=\frac{\rho\sqrt{P_{\max}\gamma_{\mathrm{th}}}}{\gamma e^{\gamma_{\mathrm{th}}}}\min_{k}\left\{\frac{r_{k}}{\alpha_{k}}d_{k}^{-\frac{\alpha}{2}}\right\}. \tag{70}$$

Then, we have

$$\mathbb{E}\left[\left\|\bar{\mathbf{z}}_{m}\right\|^{2}\right]=\frac{BN_{0}\gamma^{2}e^{2\gamma_{\mathrm{th}}}}{2P_{\max}\rho^{2}\gamma_{\mathrm{th}}}\max_{k}\left\{\frac{\alpha_{k}^{2}}{r_{k}^{2}}d_{k}^{\alpha}\right\}. \tag{71}$$
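To make the chain (68), (70), (71) concrete, the following sketch computes the scaling factor from (70), checks the power constraint (68) in the worst case $|\hat{h}_{k}|^{2}=\gamma_{\mathrm{th}}$ with $\mathbb{E}[\|\mathbf{g}_{m}^{k}\|^{2}]$ replaced by its bound $\gamma^{2}$ from (69), and evaluates the noise power in (71). The function names and numerical values are hypothetical. The relation $\mathbb{E}[\|\bar{\mathbf{z}}_{m}\|^{2}]=BN_{0}/(2\zeta^{2})$ used for comparison is inferred from (70) and (71) rather than quoted from the paper. Note that `weights` denotes the aggregation weights $\alpha_{k}$, while `pl_exp` denotes the path-loss exponent $\alpha$.

```python
import math


def scaling_factor_peak(rho, p_max, gamma_th, grad_bound, r, weights, dist, pl_exp):
    """Feasible zeta from (70) under the peak power budget P_max."""
    worst = min(r_k / a_k * d_k ** (-pl_exp / 2.0)
                for r_k, a_k, d_k in zip(r, weights, dist))
    return (rho * math.sqrt(p_max * gamma_th)
            / (grad_bound * math.exp(gamma_th))) * worst


def peak_power_lhs(zeta, rho, gamma_th, grad_bound, r_k, a_k, d_k, pl_exp, h_hat_sq):
    """Per-device term inside the max of (68), using E||g||^2 <= gamma^2."""
    return (zeta ** 2 * math.exp(2.0 * gamma_th) * a_k ** 2 * d_k ** pl_exp
            / (rho ** 2 * r_k ** 2 * h_hat_sq)) * grad_bound ** 2


def noise_power(bandwidth, n0, rho, p_max, gamma_th, grad_bound, r, weights, dist, pl_exp):
    """Equivalent noise power from (71)."""
    worst = max(a_k ** 2 / r_k ** 2 * d_k ** pl_exp
                for r_k, a_k, d_k in zip(r, weights, dist))
    return (bandwidth * n0 * grad_bound ** 2 * math.exp(2.0 * gamma_th)
            / (2.0 * p_max * rho ** 2 * gamma_th)) * worst


# Hypothetical three-device setup, for illustration only.
cfg = dict(rho=0.95, p_max=0.1, gamma_th=0.2, grad_bound=5.0,
           r=[0.3, 0.5, 0.8], weights=[0.2, 0.3, 0.5],
           dist=[50.0, 100.0, 150.0], pl_exp=3.0)
zeta = scaling_factor_peak(**cfg)
print("zeta =", zeta)
print("noise power =", noise_power(bandwidth=1e6, n0=1e-17, **cfg))
```

The device attaining the minimum in (70) meets the power budget with equality when its estimated channel gain sits exactly at the threshold; all other devices have slack.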

Next, the variance of the coefficient distortion, $\frac{\chi_{k}\xi_{k,\text{A}}}{r_{k}}$, is calculated as

$$
\begin{aligned}
\mathbb{E}\left[\left(\frac{\chi_{k}\xi_{k,\text{A}}}{r_{k}}-1\right)^{2}\right]
&\overset{\text{(a)}}{=}r_{k}\mathbb{E}\left[\left(\frac{\xi_{k,\text{A}}}{r_{k}}-1\right)^{2}\right]+1-r_{k} \\
&=r_{k}\left(\mathbb{E}\left[\left.\left(\frac{\Re\{h_{k}^{*}\hat{h}_{k}\}e^{\gamma_{\mathrm{th}}}}{|\hat{h}_{k}|^{2}\rho r_{k}}-1\right)^{2}\,\right|\,|\hat{h}_{k}|^{2}\geq\gamma_{\mathrm{th}}\right]\Pr\left\{|\hat{h}_{k}|^{2}\geq\gamma_{\mathrm{th}}\right\}+\Pr\left\{|\hat{h}_{k}|^{2}<\gamma_{\mathrm{th}}\right\}\right)+1-r_{k} \\
&=r_{k}e^{-\gamma_{\mathrm{th}}}\mathbb{E}\left[\left.\left(\frac{\Re\{h_{k}^{*}\hat{h}_{k}\}e^{\gamma_{\mathrm{th}}}}{|\hat{h}_{k}|^{2}\rho r_{k}}-1\right)^{2}\,\right|\,|\hat{h}_{k}|^{2}\geq\gamma_{\mathrm{th}}\right]+1-r_{k}e^{-\gamma_{\mathrm{th}}} \\
&\overset{\text{(b)}}{=}\left(e^{\gamma_{\mathrm{th}}}+\frac{(1-\rho^{2})\mathrm{E}_{1}(\gamma_{\mathrm{th}})e^{2\gamma_{\mathrm{th}}}}{2\rho^{2}}\right)\frac{1}{r_{k}}-1,
\end{aligned}
\tag{72}
$$

where (a) exploits the independence of $\chi_{k}$ and $\xi_{k,\text{A}}$, and (b) is due to [34, Eq. (25)]. Substituting (C) into (C) and combining the results in (B) and (B), we complete the proof.
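The closed form in (72) can be cross-checked by simulation. The sketch below is a Monte Carlo estimate under stated assumptions that are not quoted from the paper: $\hat{h}_{k}\sim\mathcal{CN}(0,1)$, the true channel follows $h_{k}=\rho\hat{h}_{k}+\sqrt{1-\rho^{2}}\,n_{k}$ with an independent $n_{k}\sim\mathcal{CN}(0,1)$, $\chi_{k}$ is Bernoulli with success probability $r_{k}$, and $\xi_{k,\text{A}}$ equals $\Re\{h_{k}^{*}\hat{h}_{k}\}e^{\gamma_{\mathrm{th}}}/(|\hat{h}_{k}|^{2}\rho)$ when $|\hat{h}_{k}|^{2}\geq\gamma_{\mathrm{th}}$ and zero otherwise, as suggested by the second line of (72). The exponential integral is evaluated by a simple Simpson rule to keep the example free of external dependencies. All function names are hypothetical.

```python
import math
import random


def exp1(x, upper=40.0, steps=20000):
    """E_1(x) = int_x^inf exp(-t)/t dt for x > 0, via composite Simpson's rule."""
    h = upper / steps  # steps must be even
    f = lambda s: math.exp(-(x + s)) / (x + s)
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0


def distortion_variance(r_k, rho, gamma_th):
    """Closed-form expression on the last line of (72)."""
    bracket = math.exp(gamma_th) + ((1.0 - rho ** 2) * exp1(gamma_th)
                                    * math.exp(2.0 * gamma_th) / (2.0 * rho ** 2))
    return bracket / r_k - 1.0


def cn01(rng):
    """Draw one sample from CN(0, 1)."""
    s = math.sqrt(0.5)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def simulate_distortion(r_k, rho, gamma_th, n, seed=0):
    """Monte Carlo estimate of E[(chi * xi / r_k - 1)^2] under the assumed model."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        h_hat = cn01(rng)
        h = rho * h_hat + math.sqrt(1.0 - rho ** 2) * cn01(rng)
        chi = 1.0 if rng.random() < r_k else 0.0
        gain = abs(h_hat) ** 2
        if gain >= gamma_th:
            xi = (h.conjugate() * h_hat).real * math.exp(gamma_th) / (gain * rho)
        else:
            xi = 0.0  # device is truncated and does not transmit
        acc += (chi * xi / r_k - 1.0) ** 2
    return acc / n


print("closed form:", distortion_variance(0.6, 0.9, 0.2))
print("simulation :", simulate_distortion(0.6, 0.9, 0.2, n=200000))
```

With perfect CSI ($\rho=1$) the second term in the bracket vanishes and the expression reduces to $e^{\gamma_{\mathrm{th}}}/r_{k}-1$, which isolates the cost of random sampling and channel truncation from the cost of CSI errors.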

Appendix D Proof of Corollary 2

To begin with, the expectation of $|\beta_{k}|^{2}$ is calculated as

$$
\begin{aligned}
\mathbb{E}\left[|\beta_{k}|^{2}\right] &= \frac{\zeta^{2}\lambda^{2}\alpha_{k}d_{k}^{\alpha}}{r_{k}^{2}}\mathbb{E}\left[\frac{1}{|\hat{h}_{k}|^{2}}\right] \\
&\overset{\text{(a)}}{=}\frac{\zeta^{2}\lambda^{2}\alpha_{k}d_{k}^{\alpha}}{r_{k}^{2}}\mathrm{E}_{1}(\gamma_{\mathrm{th}}),
\end{aligned}
\tag{73}
$$

where (a) comes from the fact that $|\hat{h}_{k}|^{2}$ follows an exponential distribution and the integral $\int_{\gamma_{\mathrm{th}}}^{\infty}\frac{1}{x}e^{-x}\mathrm{d}x=\mathrm{E}_{1}(\gamma_{\mathrm{th}})$. Substituting (69) into (34), we get a feasible $\zeta$ as

$$\zeta=\frac{\rho\sqrt{P_{\mathrm{ave}}}}{\gamma e^{\gamma_{\mathrm{th}}}\sqrt{\mathrm{E}_{1}(\gamma_{\mathrm{th}})}}\min_{k}\left\{\frac{r_{k}}{\alpha_{k}}d_{k}^{-\frac{\alpha}{2}}\right\}. \tag{74}$$

Then, following the same steps as in Appendix C, we complete the proof.
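Step (a) of (73) and the resulting scaling factor (74) can be checked numerically. The sketch below assumes, as the integral in the text suggests, that $|\hat{h}_{k}|^{2}$ is exponentially distributed with unit mean and that truncated devices ($|\hat{h}_{k}|^{2}<\gamma_{\mathrm{th}}$) contribute zero. The function names and numerical values are hypothetical, and the exponential integral is again computed with a basic Simpson rule rather than a library call. Here `weights` denotes the aggregation weights $\alpha_{k}$ and `pl_exp` the path-loss exponent $\alpha$.

```python
import math
import random


def exp1(x, upper=40.0, steps=20000):
    """E_1(x) = int_x^inf exp(-t)/t dt for x > 0, via composite Simpson's rule."""
    h = upper / steps  # steps must be even
    f = lambda s: math.exp(-(x + s)) / (x + s)
    total = f(0.0) + f(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * f(i * h)
    return total * h / 3.0


def truncated_inverse_moment(gamma_th, n, seed=0):
    """Monte Carlo estimate of E[1/x ; x >= gamma_th] for x ~ Exp(1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(1.0)
        if x >= gamma_th:
            total += 1.0 / x
    return total / n


def scaling_factor_average(rho, p_ave, gamma_th, grad_bound, r, weights, dist, pl_exp):
    """Feasible zeta from (74) under the average power budget P_ave."""
    worst = min(r_k / a_k * d_k ** (-pl_exp / 2.0)
                for r_k, a_k, d_k in zip(r, weights, dist))
    return (rho * math.sqrt(p_ave)
            / (grad_bound * math.exp(gamma_th) * math.sqrt(exp1(gamma_th)))) * worst


print("E_1(0.5)      :", exp1(0.5))
print("MC estimate   :", truncated_inverse_moment(0.5, n=200000))
print("zeta (average):", scaling_factor_average(
    rho=0.95, p_ave=0.1, gamma_th=0.2, grad_bound=5.0,
    r=[0.3, 0.5, 0.8], weights=[0.2, 0.3, 0.5],
    dist=[50.0, 100.0, 150.0], pl_exp=3.0))
```

Comparing (74) with (70), the two scaling factors differ only by the factor $\sqrt{P_{\mathrm{ave}}/(P_{\max}\gamma_{\mathrm{th}}\mathrm{E}_{1}(\gamma_{\mathrm{th}}))}$, so the remaining steps of Appendix C carry over with this substitution.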

References

  • [1] J. Yao, W. Xu, Z. Yang, X. You, M. Bennis, and H. V. Poor, “Digital versus analog transmissions for federated learning over wireless networks,” accepted by Proc. IEEE Int. Conf. Commun. (ICC), 2024. Available: https://arxiv.longhoe.net/abs/2402.09657.
  • [2] Y. C. Eldar, A. Goldsmith, D. Gündüz, and H. V. Poor, Machine Learning and Wireless Communications. Cambridge, UK: Cambridge University Press, 2022.
  • [3] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research problems,” IEEE Netw., vol. 34, no. 3, pp. 134–142, May/Jun. 2020.
  • [4] W. Shi et al., “Intelligent reflection enabling technologies for integrated and green Internet-of-Everything beyond 5G: Communication, sensing, and security,” IEEE Wireless Commun., vol. 30, no. 2, pp. 147–154, Apr. 2023.
  • [5] Z. He et al., “Unlocking potentials of near-field propagation: ELAA-empowered integrated sensing and communication,” 2024, arXiv:2404.18587.
  • [6] W. Xu et al., “Edge learning for B5G networks with distributed signal processing: Semantic communication, edge computing, and wireless sensing,” IEEE J. Sel. Topics Signal Process., vol. 17, no. 1, pp. 9–39, Jan. 2023.
  • [7] G. Zhu et al., “Pushing AI to wireless network edge: An overview on integrated sensing, communication, and computation towards 6G,” Sci. China Inf. Sci., vol. 66, no. pp. 130301:1–19, Feb. 2023.
  • [8] M. Chen et al., “Distributed learning in wireless networks: Recent progress and future challenges,” IEEE J. Sel. Areas Commun., vol. 39, no. 12, pp. 3579–3605, Dec. 2021.
  • [9] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. Int. Conf. Artif. Intell. Statist., 2017, pp. 1273–1282.
  • [10] Y. Yang, Z. Zhang, and Q. Yang, “Communication-efficient federated learning with binary neural networks,” IEEE J. Sel. Areas Commun., vol. 39, no. 12, pp. 3836–3850, Dec. 2021.
  • [11] Y. Guo, R. Zhao, S. Lai, L. Fan, X. Lei, and G. K. Karagiannidis, “Distributed machine learning for multiuser mobile edge computing systems,” IEEE J. Sel. Topics Signal Process., vol. 16, no. 3, pp. 460–473, Apr. 2022.
  • [12] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, Jan. 2021.
  • [13] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, Mar. 2021.
  • [14] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, May 2020.
  • [15] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, May 2020.
  • [16] M. Chen, N. Shlezinger, H. V. Poor, Y. C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proc. Nat. Acad. Sci., vol. 118, no. 17, 2021.
  • [17] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, “UVeQFed: Universal vector quantization for federated learning,” IEEE Trans. Signal Process., vol. 69, pp. 500–514, 2021.
  • [18] M. Salehi and E. Hossain, “Federated learning in unreliable and resource-constrained cellular wireless networks,” IEEE Trans. Commun., vol. 69, no. 8, pp. 5136–5151, Aug. 2021.
  • [19] S. Zheng, C. Shen, and X. Chen, “Design and analysis of uplink and downlink communications for federated learning,” IEEE J. Sel. Areas Commun., vol. 39, no. 7, pp. 2150–2167, Jul. 2021.
  • [20] Y. Wang, Y. Xu, Q. Shi, and T.-H. Chang, “Quantized federated learning under transmission delay and outage constraints,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 323–341, Jan. 2022.
  • [21] M. Lan, Q. Ling, S. Xiao, and W. Zhang, “Quantization bits allocation for wireless federated learning,” IEEE Trans. Wireless Commun., vol. 22, no. 11, pp. 8336–8351, Nov. 2023.
  • [22] H. Li, R. Wang, W. Zhang, and J. Wu, “One bit aggregation for federated edge learning with reconfigurable intelligent surface: Analysis and optimization,” IEEE Trans. Wireless Commun., vol. 22, no. 2, pp. 872–888, Feb. 2023.
  • [23] H. Ye, L. Liang, and G. Y. Li, “Decentralized federated learning with unreliable communications,” IEEE J. Sel. Topics Signal Process., vol. 16, no. 3, pp. 487–500, Apr. 2022.
  • [24] J. Yao, Z. Yang, W. Xu, M. Chen, and D. Niyato, “GoMORE: Global model reuse for resource-constrained wireless federated learning,” IEEE Wireless Commun. Lett., vol. 12, no. 9, pp. 1543–1547, Sept. 2023.
  • [25] S. Liu, G. Yu, R. Yin, J. Yuan, L. Shen, and C. Liu, “Joint model pruning and device selection for communication-efficient federated edge learning,” IEEE Trans. Commun., vol. 70, no. 1, pp. 231–244, Jan. 2022.
  • [26] H. H. Yang, Z. Chen, T. Q. S. Quek, and H. V. Poor, “Revisiting analog over-the-air machine learning: The blessing and curse of interference,” IEEE J. Sel. Topics Signal Process., vol. 16, no. 3, pp. 406–419, Apr. 2022.
  • [27] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” IEEE Trans. Signal Process., vol. 68, pp. 2897–2911, Apr. 2020.
  • [28] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
  • [29] X. Cao, G. Zhu, J. Xu, and S. Cui, “Transmission power control for over-the-air federated averaging at network edge,” IEEE J. Sel. Areas Commun., vol. 40, no. 5, pp. 1571–1586, May 2022.
  • [30] X. Yu, B. Xiao, W. Ni, and X. Wang, “Optimal adaptive power control for over-the-air federated edge learning under fading channels,” IEEE Trans. Commun., vol. 71, no. 9, pp. 5199–5213, Sept. 2023.
  • [31] W. Guo et al., “Joint device selection and power control for wireless federated learning,” IEEE J. Sel. Areas Commun., vol. 40, no. 8, pp. 2395–2410, Aug. 2022.
  • [32] F. Ang et al., “Robust federated learning with noisy communication,” IEEE Trans. Commun., vol. 68, no. 6, pp. 3452–3464, Jun. 2020.
  • [33] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
  • [34] J. Yao, Z. Yang, W. Xu, D. Niyato, and X. You, “Imperfect CSI: A key factor of uncertainty to over-the-air federated learning,” IEEE Wireless Commun. Lett., vol. 12, no. 12, pp. 2273–2277, Dec. 2023.
  • [35] K. Guo, Z. Chen, H. H. Yang, and T. Q. S. Quek, “Dynamic scheduling for heterogeneous federated learning in private 5G edge networks,” IEEE J. Sel. Topics Signal Process., vol. 16, no. 1, pp. 26–40, Jan. 2022.
  • [36] X. Wei and C. Shen, “Federated learning over noisy channels: Convergence analysis and design examples,” IEEE Trans. Cogn. Commun. Netw., vol. 8, no. 2, pp. 1253–1268, Jun. 2022.
  • [37] D. Liu and O. Simeone, “Privacy for free: Wireless federated learning via uncoded transmission with adaptive power control,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 170–185, Jan. 2021.
  • [38] X. Li, G. Zhu, Y. Gong, and K. Huang, “Wirelessly powered data aggregation for IoT via over-the-air function computation: Beamforming and power control,” IEEE Trans. Wireless Commun., vol. 18, no. 7, pp. 3437–3452, Jul. 2019.
  • [39] Z. Lin, X. Li, V. K. N. Lau, Y. Gong, and K. Huang, “Deploying federated learning in large-scale cellular networks: Spatial convergence analysis,” IEEE Trans. Wireless Commun., vol. 21, no. 3, pp. 1542–1556, Mar. 2022.
  • [40] M. Mohammadi Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
  • [41] H. Xing, O. Simeone, and S. Bi, “Federated learning over wireless device-to-device networks: Algorithms and convergence analysis,” IEEE J. Sel. Areas Commun., vol. 39, no. 12, pp. 3723–3741, Dec. 2021.
  • [42] E. Rizk, S. Vlaski, and A. H. Sayed, “Federated learning under importance sampling,” IEEE Trans. Signal Process., vol. 70, pp. 5381–5396, 2022.
  • [43] W. Dinkelbach, “On nonlinear fractional programming,” Manage. Sci., vol. 13, no. 7, pp. 492–498, Mar. 1967.
  • [44] Z. He et al., “Energy efficient beamforming optimization for integrated sensing and communication,” IEEE Wireless Commun. Lett., vol. 11, no. 7, pp. 1374–1378, Jul. 2022.
  • [45] M. Grant and S. Boyd. (2016). CVX: MATLAB Software for Disciplined Convex Programming. [Online]. Available: http://cvxr.com/cvx
  • [46] J. Wang, Y. Mao, T. Wang, and Y. Shi, “Green federated learning over cloud-RAN with limited fronthaul capacity and quantized neural networks,” IEEE Trans. Wireless Commun., vol. 23, no. 5, pp. 4300–4314, May 2024.
  • [47] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, “Convergence of federated learning over a noisy downlink,” IEEE Trans. Wireless Commun., vol. 21, no. 3, pp. 1422–1437, Mar. 2022.
  • [48] J. Yao et al., “Superimposed RIS-phase modulation for MIMO communications: A novel paradigm of information transfer,” IEEE Trans. Wireless Commun., vol. 23, no. 4, pp. 2978–2993, Apr. 2024.
  • [49] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Proc. Int. Conf. Learn. Representations (ICLR), 2019.
  • [50] W. Shi et al., “On secrecy performance of RIS-assisted MISO systems over Rician channels with spatially random eavesdroppers,” IEEE Trans. Wireless Commun., early access, 2024.
  • [51] T. Li et al., “Federated optimization in heterogeneous networks,” in Proc. Mach. Learn. Syst., 2020, pp. 429–450.
  • [52] Y. Sun, Z. Lin, Y. Mao, S. Jin, and J. Zhang, “Channel and gradient-importance aware device scheduling for over-the-air federated learning,” IEEE Trans. Wireless Commun., early access, 2023.