Maverick-Aware Shapley Valuation for
Client Selection in Federated Learning

Mengwei Yang1, Ismat Jarin1, Baturalp Buyukates3, Salman Avestimehr3, Athina Markopoulou1 Emails: {mengwey, ijarin, athina}@uci.edu, {buyukate, avestime}@usc.edu 1Department of EECS and CS, University of California, Irvine, CA, USA 3Ming Hsieh Department of ECE, University of Southern California, CA, USA

Abstract

Federated Learning (FL) allows clients to train a model collaboratively without sharing their private data. One key challenge in practical FL systems is data heterogeneity, particularly in handling clients with rare data, also referred to as Mavericks. These clients own one or more data classes exclusively, and the model performance becomes poor without their participation. Thus, utilizing Mavericks throughout training is crucial. In this paper, we first design a Maverick-aware Shapley valuation that fairly evaluates the contribution of Mavericks. The main idea is to compute the clients’ Shapley values (SV) class-wise, i.e., per label. Next, we propose FedMS, a Maverick-Shapley client selection mechanism for FL that intelligently selects the clients that contribute the most in each round, by employing our Maverick-aware SV-based contribution score. We show that, compared to an extensive list of baselines, FedMS achieves better model performance and fairer Shapley Rewards distribution.

I Introduction

As the pace of legislation on user privacy accelerates, regulations such as the General Data Protection Regulation (GDPR) [1] and the California Consumer Privacy Act (CCPA) [2] have been released to give users more control over their personal information. In this landscape, Federated Learning (FL) has been proposed [3] to facilitate machine learning (ML) over decentralized user data, taking the place of traditional centralized training approaches with significant privacy challenges. In FL, many clients collaboratively train a model by only transmitting their model updates instead of their private data. Despite this increased privacy notion, practical FL systems usually face the challenge of data heterogeneity. Unlike the idealistic data center environments, in FL, participating clients usually have heterogeneous data, which can easily cause poor accuracy and slow convergence. Even though many works have tackled data heterogeneity from model performance, client selection, and rewarding perspectives in FL [4, 5, 6, 7], a prevalent scenario remains largely understudied: clients with rare data. Clients providing rare and previously unseen data are crucial to the success of the trained ML models. Training on diverse data avoids the common bias in algorithms, leading to fairer and trustworthy ML systems.

Refer to caption — Figure 1: Multiple devices participate in FL for a voice AI task. A few devices that exclusively own rare data, i.e., non-native accent data, are the Mavericks and crucial for training.

In [8], the term Mavericks was coined to refer to clients with rare data in FL, and more specifically to clients that exclusively own one or more classes (i.e., labels) of data, whereas the non-Maverick clients have a balanced distribution from the remaining classes. Some examples are shown in Fig. 1. When training an FL model for a disease classification task, most hospitals (i.e., clients) possess data indicating common diseases such as flu or cold. However, very few hospitals possess rare disease datasets such as for leukemia or thyroid cancers, making them Mavericks in this learning task. Another example of Mavericks is people with rare accents in training voice-activated AI systems like Amazon’s Alexa and Google’s Home Assistant. While the majority of these devices contain native accent data, a few of them contain data from users with non-native accents. Recent studies report that these assistant devices struggle to understand non-native accents, with more than 6% performance gap between the Western accents and the minority accents [9]. This performance disparity indicates a biased performance and demonstrates the importance of training with rare (or less common) data, from Mavericks, to create models that “speak" to everyone.

Prior work in FL has not sufficiently addressed this problem. The random sampling of clients at each round adopted by the conventional FL scheme, FedAvg [10], does not fully exploit rare data and can cause slow convergence, low model performance, and degraded fairness [11]. Existing techniques for selecting clients in FL includes contribution-based approaches S-FedAvg [12], GreedyFed [13] and distance-based methods such as FedEMD [8]. In contribution-based client selection methods, Shapley value (SV) [14] is widely applied for measuring clients’ contribution during training. Previous works [8] and [15] have shown that, despite their accurate performance in i.i.d. scenarios, SV-based methods systematically undervalue the Mavericks (although they are paramount for achieving high accuracy on certain classes of data), suffering from unfairness and performance loss due to under-utilization of rare data.

In this paper, we offer a principled way to value and utilize the Mavericks in FL. We design (i) a novel Maverick-aware Shapley valuation and (ii) a corresponding client selection mechanism that can more fairly assess the contribution of Mavericks and can effectively utilize them in each round. Our main contributions can be summarized as follows:

•

We propose a class-wise SV-based contribution score to value the contributions of clients in FL. To compute this score, we define the class difficulty in order to combine the class-wise SVs and fairly evaluate the contribution of the clients (Mavericks and non-Mavericks).
•

We then introduce FedMS, Maverick-Shapley client selection mechanism for FL, to effectively utilize the Mavericks during training based on the contribution scores. We show that FedMS significantly increases the model accuracy compared to an extensive list of baselines.

II Problem Setup and Background

In this section, we first formalize the FL framework [10] and define Mavericks [8]. We then give an overview of the SV-based methods for evaluating the contribution of clients.

Federated Learning (FL). We consider a general FL system with multiple clients and one server. We let $\mathcal{K}$ denote the set of clients such that $\mathcal{K}=\{1,2,...,I\}$ . Each client $i$ owns dataset $\mathcal{D}_{i}$ , where $n_{i}=|\mathcal{D}_{i}|$ . Each data point is a pair $(\bm{x},y)$ , where $\bm{x}$ is the feature vector and $y$ is the corresponding label. We let $\mathcal{M}=\{1,2,...,C\}$ denote the set of class labels. ${\bm{w}}$ is the learnable weights of the global model and each client $i$ has local model ${\bm{w}}_{i}$ . The training objective is defined as

\displaystyle\min_{\bm{w}}\mathcal{L}(\bm{w})=\min\sum_{i\in\mathcal{K}}\frac{% n_{i}}{n}\mathcal{L}_{i}(\bm{w}_{i}),

(1)

where we have $n=\sum_{i\in\mathcal{K}}n_{i}$ and the loss at client $i$ is $\mathcal{L}_{i}(\bm{w}_{i})=\frac{1}{n_{i}}\sum_{d\in\mathcal{D}_{i}}\mathcal{% L}_{d}(\bm{w}_{i})$ . The FL training process includes the following steps: (i) Initialization: The server initializes the global model parameters $\bm{w}$ and broadcasts it to clients. (ii) Client Selection: In round $t$ , the server selects $i\in\mathcal{K}^{t}=\{1,2,...,I^{t}\}$ clients with selection strategy $\pi$ . (iii) Local Update and Model Aggregation: Each selected client $i$ in round $t$ performs local training and sends ${\bm{w}}_{i}^{t}$ to the server. Then, the server updates the global model using the model updates of the clients as ${\bm{w}}^{t+1}=\sum_{i=1}^{I^{t}}\frac{n_{i}^{t}}{\sum_{i=1}^{I_{t}}n_{i}^{t}}% \bm{w}_{i}^{t}$ . Steps (ii) and (iii) are repeated until the convergence of the global model.

2Input:

T

: number of training rounds;

E

: number of local epochs;

\mathcal{K}

: set of clients;

\mathcal{D}_{i}

: dataset of client

i

;

B

: minibatch size;

n_{i}^{t}

: dataset size of the

i

th client in round

t

;

\mathcal{M}

: set of class labels;

\mathcal{D}_{val}

: validation dataset;

\mathcal{V}_{class}(\cdot)

: class-wise accuracy function;

\eta_{i}

: learning rate at client

i

3Server executes:

4 Initialize

\bm{w}^{0}

\beta

\hat{S}

6 for each round t = 0, … $T-1$ do

8 // Compute contribution score

\hat{S}_{i}

\hat{S}_{i}=\sum\limits_{c\in\mathcal{M}}\beta^{c}\bm{\cdot}S_{i}^{c},\forall i% \in\mathcal{K}

10 // Sample clients from

P_{\hat{S},i}

P_{\hat{S},i}=\frac{exp(\hat{S}_{i})}{\sum\limits_{i\in\mathcal{K}}exp(\hat{S}% _{i})},\forall i\in\mathcal{K}

\mathcal{K}^{t}\leftarrow

sample

i

clients

\sim P_{\hat{S},i}

14 for each client $i$ $\in$ $\mathcal{K}^{t}$ in parallel do

\bm{w}_{i}^{t}\leftarrow

UserUpdate

(\bm{w}^{t},i)

17 // Calculate class-wise Shapley value

\phi_{i}

, class difficulty

\beta

and the best clients set

\hat{\mathcal{K}}^{t}

\phi,\beta,\hat{\mathcal{K}}^{t}\leftarrow

Maverick-Shapley

{\hskip 56.9055pt}(\{\bm{w}_{i}^{t}\}_{i\in\mathcal{K}^{t}},\bm{w}^{t},D_{val}% ,\mathcal{V}_{class}(\cdot),\mathcal{M})

19 // Compute the accumulated class-wise Shapley value

S_{i}

S_{i}^{c}=\alpha\bm{\cdot}S_{i}^{c}+(1-\alpha)\bm{\cdot}\phi_{i}^{c},\forall i% \in\mathcal{K}^{t},\forall c\in\mathcal{M}

{\bm{w}}^{t+1}\leftarrow\sum\limits_{i\in\hat{\mathcal{K}}^{t}}\frac{n_{i}^{t}% }{\sum\limits_{i\in\hat{\mathcal{K}}^{t}}n_{i}^{t}}\bm{w}_{i}^{t}

;

23 function UserUpdate (

\bm{w}^{t},i

24 for each local epoch e = 1… $E$ do

\mathcal{D}^{B}_{i}\leftarrow

select a minibatch of size

B\subseteq\mathcal{D}_{i}

\bm{w}_{i}^{t}\leftarrow\bm{w}_{i}^{t}-\eta_{i}\nabla\mathcal{L}_{i}(\mathcal{% D}^{B}_{i},\bm{w}_{i}^{t})

28 return ${\bm{w}}_{i}^{t}$ to server

Algorithm 1 FedMS: a Maverick-Shapley Client Selection Mechanism for FL

Mavericks. A Maverick is a client that owns one or more classes exclusively [8]. Let $\mathcal{M}_{mav}$ denote the set of class labels exclusively owned by Mavericks. If a client is a Maverick, then its dataset satisfies $\mathcal{D}_{i}=\{\{x^{c},y^{c}\}^{i}_{c\in M_{mav}},\{x^{c},y^{c}\}^{i}_{c% \notin M_{mav}}\}$ . Here, $\{x^{c},y^{c}\}^{i}$ denotes the data points in $\mathcal{D}_{i}$ with label $c$ . If a client is not a Maverick, then its dataset satisfies $\mathcal{D}_{i}=\{\{x^{c},y^{c}\}^{i}_{c\notin M_{mav}}\}$ . As in [8], we assume the data samples $\{x^{c},y^{c}\}_{c\notin M_{mav}}$ are evenly distributed among all clients but the data samples $\{x^{c},y^{c}\}_{c\in M_{mav}}$ are exclusively owned by the Mavericks. We note that there can be multiple Mavericks jointly owning the rare labels, which we call the shared Mavericks.

Shapley Value (SV) for Client Valuation in FL. SV [14, 16] of client $i$ is given by

\displaystyle\phi_{i}(\mathcal{K},\mathcal{V})=\sum_{\mathcal{Q}\subseteq% \mathcal{K}\backslash\{i\}}\frac{\mathcal{V}(\mathcal{Q}\cup\{i\})-\mathcal{V}% (\mathcal{Q})}{\tbinom{|\mathcal{K}|-1}{|\mathcal{Q}|}},

(2)

where $\phi_{i}$ is the SV for client $i$ , $\mathcal{Q}$ denotes the subset of participants from $\mathcal{K}$ . The utility function $\mathcal{V}(\cdot)$ can assume any form which can evaluate the utility of the input. The conventional SV in (2) requires retraining the FL model for all subsets of clients, which is computationally prohibitive [17]. For client contribution assessment in FL, gradient-based SV approximation techniques such as MR [18], TMR [19], and GTG [17] are employed (see Appendix -A for an overview).

III Proposed Method: FedMS

In this section, we describe the proposed Maverick-Shapley client selection mechanism, FedMS, that fairly computes the contributions of all clients using a class-wise SV-based contribution scoring and selects the most contributing clients in each round. The steps are outlined in Algorithm 1 and the list of variables is given in Appendix -C.

Maverick-Shapley Contribution Score. When training a model for multi-class tasks, the difficulty of learning each class is different. Particularly in the presence of Mavericks, rare classes are harder to learn than the others. In order to differentiate between classes and accurately compute the contribution of each client (Mavericks and non-Mavericks alike), we propose a class-wise SV-based contribution score. In particular, we use the class-wise accuracy as the utility function in SV computations to better capture the difficulty level of each class. Class-wise accuracy is calculated as

\displaystyle\mathcal{V}_{class}^{c}(w;\mathcal{D}_{\textrm{val}})=\frac{N^{cc% }}{\sum\limits_{j\in\mathcal{M}}N^{cj}},\quad\forall c\in\mathcal{M},

(3)

where $w$ is a given model, $\mathcal{D}_{\textrm{val}}$ is validation dataset at the server, $N^{cj}$ represents the number of validation data points of class $c$ predicted as class $j$ , $\mathcal{M}$ is set of class labels. Our main distinction in SV computation is the fact that we compute it in a class-wise manner to better capture the diverse resources of Mavericks (hence the name Maverick-Shapley).

In each FL round, after receiving model updates from the participating clients, the server computes the SV of a client $i$ for class $c$ , $\phi_{i}^{c}$ , by utilizing a gradient-based SV approximation method of its choice using (3). It then computes the accumulated SVs $S_{i}^{c}$ using a decay factor $\alpha$ as

\displaystyle S_{i}^{c}=\alpha*S_{i}^{c}+(1-\alpha)*\phi_{i}^{c},\quad\forall i% \in\mathcal{K}^{t},\forall c\in\mathcal{M}.

(4)

Finally, the server computes the contribution score of each client $i$ as a weighted sum of its class-wise accumulated SVs¹¹1In the first round, the server initializes contribution scores by calculating the cosine distance between each client model and the aggregate model.

\displaystyle\hat{S}_{i}=\sum\limits_{c\in\mathcal{M}}\beta^{c}\bm{\cdot}S_{i}% ^{c},\quad\forall i\in\mathcal{K},

(5)

where $\beta$ denotes the class difficulty, adaptively adjusting the impact of each class in the contribution scores such that

\displaystyle\beta^{c}=\frac{exp\left(\frac{1-\mathcal{V}_{class}^{c}(w;% \mathcal{D}_{\textrm{val}})}{T}\right)}{\sum\limits_{c\in\mathcal{M}}exp\left(% \frac{1-\mathcal{V}_{class}^{c}(w;\mathcal{D}_{\textrm{val}})}{T}\right)},% \quad\forall c\in\mathcal{M},

(6)

where the temperature $T$ controls the distribution. Since the difficulty of learning each class is dynamically changing, the server updates the class difficulty $\beta$ and the contribution score $\hat{S}$ in each round.

The proposed Maverick-Shapley approach is universally applicable to the existing SV approximation algorithms. Algorithm 2 describes the procedure for the MR [18] technique.²²2Another example is in Appendix -B for the GTG-Shapley [17] technique.

2Input: Updated client models

\{\bm{w}_{i}^{t}\}_{i\in\mathcal{K}^{t}}

; current server model

\bm{w}^{t}

; validation dataset at server

\mathcal{D}_{\textrm{val}}

; class-wise accuracy function

\mathcal{V}_{class}(\cdot)

;

\mathcal{M}

: set of class labels.

3 Hyperparameter: Temperature T

4 Initialize:

\phi_{i}=0,\forall i\in\mathcal{K}^{t}

5 for each subset $Q$ $\subseteq$ $\mathcal{K}^{t}$ do

\widetilde{w}_{Q}=\textnormal{{ModelAverage}}(\{\bm{w}_{i}^{t}\}_{i\in Q},\bm{% w}^{t})

9 for client $i$ $\in$ $\mathcal{K}^{t}$ do

10 for class $c$ $\in$ $\mathcal{M}$ do

\phi_{i}^{c}=\sum\limits_{\mathcal{Q}\subseteq\mathcal{K}^{t}\backslash\{i\}}% \frac{\mathcal{V}_{class}^{c}(\widetilde{w}_{\mathcal{Q}\cup\{i\}};\mathcal{D}% _{\textrm{val}})-\mathcal{V}_{class}^{c}(\widetilde{w}_{\mathcal{Q}};\mathcal{% D}_{\textrm{val}})}{\tbinom{|\mathcal{K}^{t}|-1}{|\mathcal{Q}|}}

15 // Find the best clients set

\hat{\mathcal{K}}^{t}

and its class-wise accuracy

\hat{v}

\hat{\mathcal{K}}^{t},\hat{v}\leftarrow\operatorname*{argmax}_{Q\subseteq% \mathcal{K}^{t}}\sum\limits_{c\in\mathcal{M}}\mathcal{V}_{class}^{c}(% \widetilde{w}_{Q},\mathcal{D}_{\textrm{val}})

17 // Obtain class difficulty

\beta

\beta^{c}=\frac{exp(\frac{1-\hat{v}^{c}}{T})}{\sum\limits_{c\in M}exp(\frac{1-% \hat{v}^{c}}{T})},\forall c\in M

return $\phi,\beta,\hat{K}^{t}$

Algorithm 2 Maverick-Shapley

Client Selection. Based on the contribution scores, the server selects the most contributing clients in each FL training round. To this end, it calculates the selection probability of each client according to their contribution scores $\hat{S}$ as

\displaystyle P_{\hat{S},i}=\frac{exp(\hat{S}_{i})}{\sum\limits_{i\in\mathcal{% K}}exp(\hat{S}_{i})},\quad\forall i\in\mathcal{K},

(7)

and samples clients based on $P_{\hat{S}}$ in each round. As Mavericks exclusively own certain classes, they are the most contributing clients for those rare classes and have higher probability to be selected when the model performs poorly on rare classes.

As the server computes the class-wise accuracy considering multiple client permutations during the contribution score computation (e.g., line 8 in Algorithm 2), it can further refine the client selection. In each training round in FedMS, the server finds the subset of clients leading to the highest total class-wise accuracy increase, i.e., the best client set $\hat{\mathcal{K}}^{t}$ in line 10 in Algorithm 2, and aggregates only their updates.

Shapley Rewards (SR). In each round, the server computes the SV of a selected client $i$ for class $c$ , $\phi_{i}^{c}$ . It then calculates the Shapley Rewards of each client $i$ for round $t$ as a weighted sum of its class-wise SVs using the current class difficulty $\beta$

\displaystyle R_{i}^{t}=\sum\limits_{c\in\mathcal{M}}\beta^{c}\bm{\cdot}\phi_{% i}^{c},\quad\forall i\in\mathcal{K}^{t}.

(8)

IV Evaluation

In this section, we comprehensively evaluate the effectiveness of our algorithm, FedMS, on two datasets against six baselines. We demonstrate an improved accuracy and fairer Shapley rewards for both Mavericks and non-Mavericks.

Datasets and Models. We use two benchmark datasets, (i) MNIST[20] consisting of handwritten digits, with 60,000 samples for training and 10,000 for testing, and (ii) CIFAR-10 [21] consisting of colored images of 10 classes, with 50,000 samples for training and 10,000 for testing. We utilize a lightweight MLP neural network [22] for MNIST and a commonly employed CNN [23] for the CIFAR-10 dataset.

Implementation Details. Both MNIST and CIFAR-10 datasets are uniformly distributed across all 10 class labels. Here, to satisfy our Mavericks setting, we split the dataset into two scenarios: (i) 5 clients (4 non-Mavericks and 1 Maverick) without client selection and (ii) 50 clients (48 non-Mavericks and 2 Mavericks) with 10% selection rate of 50 clients in each round. Each Maverick exclusively owns one class in both scenarios (i and ii). The training process involves 100 global training rounds for MNIST and 200 for CIFAR-10 both with a batch size of 64; the learning rate is 0.05 in both datasets. We employ 1 local training on MNIST and 10 local training on CIFAR-10. We choose $\alpha$ as 0.6 for both MNIST and CIFAR-10. In the proposed FedMS, we use class-wise GTG-Shapley (shown in Appendix -B) for Shapley Rewards computation. We note that our class-wise approach is applicable to other SV approximation methods such as MR and TMR as well (see Tables II and III for performance under different methods).

Evaluation Metrics. To assess the effectiveness of evaluated mechanisms, we consider the test accuracy as the utility metric. In addition, we evaluate different schemes based on their Shapley Rewards (SR) to the Mavericks. A larger SR is associated with higher contributions. While we compute SR as in (8), previous works use SVs of the clients simply as SR.

Baselines. We consider six client selection baselines: FedAvg [10], S-FedAvg [12], FedEMD [8], FedProx [24], GreedyFed[13], and PoC [25]. FedAvg applies random sampling in each round. S-FedAvg and GreedyFed combine SV-based methods with client selection. FedProx and PoC propose mechanisms regarding data heterogeneity in FL. FedEMD combines EMD distance with client selection in the presence of Mavericks.

Fairer Shapley (Reward) Distribution. Fig. 2 illustrates how the test accuracy and SR change during training in the 5 client setting (w/o client selection). We see in Fig. 2 that Mavericks helps increase the model accuracy. Despite this benefit of training with Mavericks, we observe that the average SR of Mavericks is considerably lower than that of non-Mavericks in FedAvg when the rewards are based on the SVs. In contrast, Fig. 2(b) exhibits a fairer SR for Mavericks. In Fig. 3, when considering the scenario with 50 clients (w/ client selection), FedMS assigns higher rewards to Mavericks than all baseline methods. In these experiments, we use the GTG-Shapley [17] for SV computation. We deduce that FedMS shows effectiveness for fairer SR distribution for Mavericks and non-Mavericks in both settings (w/ and w/o client selection), thanks to the class-wise SV-based rewards as in (8).

Improved Model Performance. Figs. 3(a), 3(b) and 3(c) display a similar test accuracy in the settings of All Clients and Without Mavericks for FedAvg, S-FedAvg, and FedEMD, respectively. On the other hand, in Fig. 3(d), our method FedMS demonstrates an elevated accuracy in All Clients setting compared to the Without Mavericks setting, illustrating the effective utilization of Mavericks during FL training under the proposed approach.

Comparisons with Baselines. Our proposed method, FedMS, outperforms the baselines in both SR and utility metrics. In regards to the SR, FedMS computes a fairer SR for all clients by considering class-wise SVs and class difficulties $\beta$ . If rare classes owned by Mavericks perform poorly on the validation dataset, our mechanism increases the $\beta$ associated with these rare classes. Hence, our system boosts fairer SR for the Mavericks. Alternatively, in the baseline methods, only S-FedAvg and GreedyFed adopt SV in their client selection process but none of them considers the Mavericks settings. In those SV-based methods, the low SR of Mavericks decreases their selection probability during training, resulting in under-utilization of Mavericks. Since FedMS can effectively select the most contributing clients, it successfully selects Mavericks and shows an increased model accuracy, as shown in Fig. 3(d). FedEMD applies a decreasing selection probability of Mavericks as iterations progress and our approach differs from FedEMD by not relying on the distance of local & global data distributions. Instead, we prioritize the class-wise contribution of each client during the selection process, thus, our method achieves fairer SR and improved accuracy compared to FedEMD (see more comparisons in Appendix -D).

V Conclusion and Future Directions

The selection of clients plays a pivotal role in achieving success in FL, as it allows for the optimization of the utility derived from diverse model updates, specifically in the presence of Mavericks. In this work, we propose FedMS, a Maverick-aware Shapley valuation mechanism for client selection in FL that not only fairly evaluates the contributions of the Mavericks but also effectively selects the most contributing clients in each training round. Our proposed FedMS achieves better model performance and fairer Shapley Rewards distribution compared to the existing methods.

Future Directions: FedMS does not consider potential attacks in the presence of Mavericks, particularly how attackers might exploit the system (e.g., attackers or outliers can pretend as Mavericks or the Mavericks themselves can take advantage of the system). Future research should focus on investigating potential attacks, such as poisoning and adversarial attacks, that target Maverick-friendly FL systems.

References

[1] Paul Voigt and Axel Von dem Bussche. The EU General data protection regulation (GDPR). volume 10, pages 10–5555. Springer, 2017.
[2] Preston Bukaty. The California Consumer Privacy Act (CCPA): An Implementation Guide. IT Governance Ltd, 2019.
[3] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. 2016.
[4] Bing Luo, Wenli Xiao, Shiqiang Wang, Jianwei Huang, and Leandros Tassiulas. Tackling system and statistical heterogeneity for federated learning with adaptive client sampling. In IEEE INFOCOM- IEEE Conference on Computer Communications, London, United Kingdom, pages 1739–1748. IEEE, 2022.
[5] Fan Xin, **ghui Zhang, Junzhou Luo, and Fang Dong. Federated Learning Client Selection Mechanism Under System and Data Heterogeneity. In 25th IEEE International Conference on Computer Supported Cooperative Work in Design, CSCWD Hangzhou, China,, pages 1239–1244. IEEE, May,2022.
[6] Yutong Dai, Zeyuan Chen, Junnan Li, Shelby Heinecke, Lichao Sun, and Ran Xu. Tackling data heterogeneity in federated learning with class prototypes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7314–7322, 2023.
[7] Jianyi Zhang, Ang Li, Minxue Tang, **gwei Sun, Xiang Chen, Fan Zhang, Changyou Chen, Yiran Chen, and Hai Li. Fed-CBS: A Heterogeneity-Aware Client Sampling Mechanism for Federated Learning via Class-Imbalance Reduction. In International Conference on Machine Learning, ICML ,Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 41354–41381. PMLR, July, 2023.
[8] Jiyue Huang, Chi Hong, Yang Liu, Lydia Y Chen, and Stefanie Roos. Tackling Mavericks in Federated Learning via Adaptive Client Selection Strategy. AAAI, 2022.
[9] The Washington Post. Why Alexa Can’t Understand Some Accents. The Washington Post, 2018.
[10] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
[11] Lei Fu, Huanle Zhang, Ge Gao, Mi Zhang, and Xin Liu. Client selection in federated learning: Principles, challenges, and opportunities. IEEE Internet of Things Journal, 2023.
[12] Lokesh Nagalapatti and Ramasuri Narayanam. Game of Gradients: Mitigating irrelevant clients in federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9046–9054, 2021.
[13] Pranava Singhal, Shashi Raj Pandey, and Petar Popovski. Greedy Shapley Client Selection for Communication-Efficient Federated Learning. IEEE Networking Letters, 2024.
[14] Lloyd S Shapley. Cores of convex games. International journal of game theory, 1:11–26, 1971.
[15] Baturalp Buyukates, Chaoyang He, Shanshan Han, Zhiyong Fang, Yupeng Zhang, Jieyi Long, Ali Farahanchi, and Salman Avestimehr. Proof-of-Contribution-Based Design for Collaborative Machine Learning on Blockchain. In IEEE International Conference on Decentralized Applications and Infrastructures (IEEE DAPPS 2023), July 2023.
[16] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. In International conference on machine learning, pages 2242–2251. PMLR, 2019.
[17] Zelei Liu, Yuanyuan Chen, Han Yu, Yang Liu, and Lizhen Cui. GTG-Shapley: Efficient and accurate participant contribution evaluation in federated learning. ACM Transactions on Intelligent Systems and Technology (TIST), 13(4):1–21, 2022.
[18] Tianshu Song, Yongxin Tong, and Shuyue Wei. Profit allocation for federated learning. In 2019 IEEE International Conference on Big Data (Big Data), pages 2577–2586. IEEE, 2019.
[19] Shuyue Wei, Yongxin Tong, Zimu Zhou, and Tianshu Song. Efficient and fair data valuation for horizontal federated learning. Federated Learning: Privacy and Incentive, pages 139–152, 2020.
[20] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
[21] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). Technical Report 1, Technical Report, 2009.
[22] Marius-Constantin Popescu, Valentina E Balas, Liliana Perescu-Popescu, and Nikos Mastorakis. Multilayer perceptron and neural networks. WSEAS Transactions on Circuits and Systems, 8(7):579–588, 2009.
[23] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi. Understanding of a convolutional neural network. In 2017 international conference on engineering and technology (ICET), pages 1–6. Ieee, 2017.
[24] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. volume 2, pages 429–450, 2020.
[25] Yae Jee Cho, Jianyu Wang, and Gauri Joshi. Towards understanding biased client selection in federated learning. In International Conference on Artificial Intelligence and Statistics, pages 10351–10375. PMLR, 2022.

-A Related Works

Client Selection. In an FL system, clients show different degrees of heterogeneity in data distribution and system resources. The vanilla mechanism FedAvg [10] that randomly samples clients in each round may not fully leverage the diverse local updates from heterogeneous clients [11]. Various selection methods have been proposed to deal with heterogeneity in FL to improve the model performance [12, 8, 13, 25]. S-FedAvg [12] combines FedAvg with SV and empowers the server to select relevant clients with high probability. GreedyFed [13] greedily selects the most contributing clients in each round by employing a fast Shapley approximation algorithm named the GTG-Shapley [17]. In Power-of-Choice (POC) [25], authors propose a client scheduling strategy that selects the client models with the highest loss in each round. Common to all these methods is the fact that none of them considers the Mavericks. Recently, authors in [8] introduced the concept of Mavericks and proposed FedEMD to adaptively select clients based on the Wasserstein distance between the local and global data distributions. Although FedEMD increases the probability of selecting the Mavericks, it does not provide a solution to fairly evaluate the contribution of the Mavericks.

Contribution Evaluation via Gradient Shapley Methods. SV-based methods are widely employed in FL to compute the contributions of the participating clients [19, 18, 17]. Despite its prominence in the game theory literature, in the context of ML, SV [14] is not practical as it requires retraining from scratch considering each client permutation. Gradient Shapley methods aim to eliminate the lengthy retraining of FL models by utilizing gradient updates of the clients to approximate the FL sub-models for various clients permutations in the SV computation. Reference [18] proposes two gradient Shapley methods: one-round (OR) and multi-round (MR). OR calculates the SV once after the training while MR calculates the SV in every FL round. Truncated Multi-Rounds Construction (TMR) [19] eliminates unnecessary FL sub-model reconstructions by adding a decay factor. In Guided Truncation Gradient Shapley (GTG-Shapley) [17], authors design a guided Monte Carlo sampling approach combined with truncation techniques to further improve the computation efficiency. Despite these efforts to efficiently and accurately approximate the SV, previous works [8, 15] showed that the current SV-based methods are unable to fairly assess the contributions of the Mavericks. Motivated by these, one of our goals in this work is to propose a SV-based contribution score that can appreciate the contributions of both Maverick and non-Maverick clients. We then use the accumulated contribution scores of the clients to perform intelligent client selection in each round to better utilize the Mavericks during training and improve the model performance.

-B Maverick GTG-Shapley

The employed class-wise Shapley value computation technique in FedMS, i.e., Maverick-Shapley, is compatible with the existing SV approximation approaches. In Algorithm 3, we describe the class-wise Shapley computation by using the GTG-Shapley [17] technique.

2Input: Updated client models

\{\bm{w}_{i}^{t}\}_{i\in\mathcal{K}^{t}}

; current server model

\bm{w}^{t}

; validation dataset at server

D_{val}

; class-wise accuracy function

\mathcal{V}_{class}(\cdot)

;

\mathcal{M}

: set of class labels.

3 Hyperparameters: Error threshold

\epsilon_{b}

\epsilon_{i}

, temperature T.

4 Initialize:

\phi_{i}=0,\forall i\in\mathcal{K}^{t},r=0

5 Compute

w^{t+1}=\textnormal{{ModelAverage}}(n_{i},\,\{\bm{w}_{i}^{t}\}_{i\in\mathcal{K% }^{t}})

v_{0}=\mathcal{V}_{class}(w^{t};\mathcal{D}_{\textrm{val}}),\,v_{N}=\mathcal{V% }_{class}(w^{t+1};\mathcal{D}_{\textrm{val}}),\,

7 # between round truncation

8 if

|v_{N}-v_{0}|>\epsilon_{b}

then

9 while Convergence criteria not met do

r=r+1

11 for client $i\in\mathcal{K}^{t}$ do

12 permute

\mathcal{K}^{t}\setminus\{i\}:\pi^{r}[0]=i,\pi^{r}[1:n]

v_{0}^{r}=v_{0}

14 # within-round truncation

15 for $j=1,\dots,n$ do

16 if

|v_{N}-v^{r}_{j-1}|\geq\epsilon_{i}

H=\pi^{r}[:j]

\widetilde{w}_{H}=\textnormal{{ModelAverage}}(\{\bm{w}_{i}^{t}\}_{i\in H},\bm{% w}^{t})

v^{r}_{j}=\mathcal{V}_{class}(\widetilde{w}_{H};\mathcal{D}_{\textrm{val}})

20 else

v^{r}_{j}=v^{r}_{j-1}

22 for class $c$ $\in$ $\mathcal{M}$ do

\phi_{\pi^{r}[j]}^{c}=\frac{r-1}{r}\phi_{\pi^{r}[j]}^{c}+\frac{(v_{j}^{r,c}-v_% {j-1}^{r,c})}{r}

27 # Find best clients set

\hat{\mathcal{K}}^{t}

and its class-wise accuracy

\hat{v}

\hat{\mathcal{K}}^{t},\hat{v}\leftarrow\operatorname*{argmax}_{H}\sum\limits_{% c\in\mathcal{M}}\mathcal{V}_{class}^{c}(\widetilde{w}_{H};\mathcal{D}_{\textrm% {val}})

29 # Obtain class difficulty

\beta

\beta^{c}=\frac{exp(\frac{1-\hat{v}^{c}}{T})}{\sum\limits_{c\in M}exp(\frac{1-% \hat{v}^{c}}{T})},\forall c\in M

return $\phi,\beta,\hat{\mathcal{K}}^{t}$

Algorithm 3 Maverick GTG-Shapley

-C FedMS Parameters & Notation

Table I lists the parameters and notation we use in FedMS.

Notation	Description
$(\bm{x},y)$	$\bm{x}$ is the feature vector and $y$ is the corresponding label
$\mathcal{K}$	Set of clients such that $\mathcal{K}=\{1,2,...,I\}$
$\mathcal{M}$	Set of class labels $\mathcal{M}=\{1,2,...,C\}$
$\eta_{i}$	Learning rate at client $i$
$\mathcal{D}_{i}$	Local dataset of client $i$
$\mathcal{D}_{val}$	Validation dataset at the server
${\bm{w}}$	Learnable weights of the global model
${\bm{w}}_{i}$	Local model of client $i$ ${\bm{w}}_{i}$
$\mathcal{V}_{class}(\cdot)$	Class-wise accuracy function
$\phi$	Class-wise Shapley value vector (including all clients)
$\phi_{i}$	Class-wise Shapley value of the $i$ -th client
$\beta$	Class difficulty vector (including all classes)
$\beta^{c}$	Class difficulty of class $c$
$S_{i}^{c}$	Accumulated Shapley value of client $i$ for class $c$
$\alpha$	Decay factor for the accumulated Shapley value $S_{i}^{c}$
$\hat{S}_{i}^{c}$	Contribution score of client $i$
$P_{\hat{S}}$	Selection probability vector for client selection (including all clients)
$P_{\hat{S},i}$	Selection probability of client $i$ for client selection
$T$	Number of FL training rounds
$t$	Index of FL round, $t=0,1,2,...T-1$
$E$	Number of local epochs
$B$	Mini-batch size
${\bm{w}^{t}}$	Learnable weights of the global model in round $t$
${\bm{w}}_{i}^{t}$	Local model of client $i$ in round $t$
$\mathcal{K}^{t}$	Set of selected clients in round $t$ with selection strategy $\pi$
$\hat{\mathcal{K}}^{t}$	Best clients set with the highest class-accuracy in round $t$
$n_{i}^{t}$	Dataset size of the $i$ -th client in round $t$

TABLE I: Main parameters and notation.

-D Additional Experiments

In this section, we present additional experimental evaluations conducted on both MNIST and CIFAR-10 datasets concerning our reward and utility metrics. When examining the reward metrics, focusing on the fairer Shapley Rewards, we can observe from Figs. 4, 6, 8, and 10 that FedMS provides more rewards to Mavericks compared to non-Mavericks for both MNIST and CIFAR-10 dataset, in line with the observed accuracy benefit of training with the Mavericks. In contrast, the state-of-the-art (SOTA) techniques provide lower rewards to Mavericks compared to non-Mavericks (example, FedAvg in Figs. 5(b), 7(b), 9(b) or FedEMD in Figs. 5(d), 7(d), 9(d)). Regarding the utility metric, we notice that FedMS better utilizes Mavericks, resulting in an overall improvement in model accuracy compared to the SOTA methods on both MNIST and CIFAR-10 datasets. From these two observations, we can deduce that our method not only effectively selects Mavericks, thereby enhancing model performance, but also ensures a fairer Shapley Rewards distribution among Mavericks and non-Mavericks. Accuracy performance of the proposed FedMS in comparison with the SOTA baselines considering various SV approximation techniques for MNIST and CIFAR-10 dataset is given in Table II and III.

	FedMS	FedAvg	FedProx	S-FedAvg	GreedyFed	PoC	FedEMD
GTG-Shapley	82.91 $\pm$ 0.5	73.77 $\pm$ 0.1	73.52 $\pm$ 0.4	74.00 $\pm$ 0.5	73.91 $\pm$ 0.2	74.49 $\pm$ 0.7	73.79 $\pm$ 0.2
MR	82.81 $\pm$ 1.8	73.99 $\pm$ 0.4	73.79 $\pm$ 0.4	73.94 $\pm$ 0.2	73.66 $\pm$ 0.2	73.94 $\pm$ 0.1	73.87 $\pm$ 0.1
TMR	80.27 $\pm$ 2.4	73.74 $\pm$ 0.1	74.02 $\pm$ 0.4	73.67 $\pm$ 0.1	73.99 $\pm$ 0.1	73.42 $\pm$ 0.4	75.49 $\pm$ 1.0

TABLE II: Model performance (test accuracy in %) of different client selection algorithms (CS-Alg) including FedMS for MNIST dataset under various Shapley value approximation methods.

	FedMS	FedAvg	FedProx	S-FedAvg	GreedyFed	PoC	FedEMD
GTG-Shapley	64.79 $\pm$ 0.5	60.87 $\pm$ 0.1	60.25 $\pm$ 0.4	61.53 $\pm$ 0.3	59.87 $\pm$ 0.1	61.65 $\pm$ 0.3	62.7 $\pm$ 0.1
MR	64.56 $\pm$ 1.5	61.84 $\pm$ 0.2	60.97 $\pm$ 0.4	61.24 $\pm$ 0.2	58.81 $\pm$ 0.2	62.29 $\pm$ 0.3	62.25 $\pm$ 0.1
TMR	64.5 $\pm$ 0.6	61.84 $\pm$ 0.1	59.9 $\pm$ 0.1	61.5 $\pm$ 0.5	57.7 $\pm$ 0.2	61.25 $\pm$ 0.2	61.85 $\pm$ 0.15

TABLE III: Model performance (test accuracy in %) of different client selection algorithms (CS-Alg) including FedMS for CIFAR-10 dataset under various Shapley value approximation methods.

Maverick-Aware Shapley Valuation for Client Selection in Federated Learning