
\OneAndAHalfSpacedXI\ECRepeatTheorems\EquationsNumberedThrough\MANUSCRIPTNO\RUNAUTHOR

Qi, Zhu

\RUNTITLE

A Pitfall of Shapley Values in Collaborative Federated Learning

\TITLE

Mechanism for Decision-aware Collaborative Federated Learning: A Pitfall of Shapley Values

\ARTICLEAUTHORS\AUTHOR

Meng Qi \AFFSC Johnson College of Business, Cornell University, Ithaca, NY, 14850,
\EMAIL[email protected] \AUTHORMingxi Zhu \AFFScheller College of Business, Georgia Institute of Technology
\EMAIL[email protected]

\ABSTRACT

This paper investigates mechanism design for decision-aware collaboration via federated learning (FL) platforms. Our framework consists of a digital platform and multiple decision-aware agents, each endowed with proprietary data sets. The platform offers an infrastructure that enables access to the proprietary data, creates incentives for collaborative learning aimed at operational decision-making, and conducts federated learning (FL) to avoid direct raw data sharing. The computation and communication efficiency of the FL process is inherently influenced by the agent participation equilibrium induced by the mechanism. Therefore, assessing the collaborative learning system’s efficiency involves two critical factors: the surplus created by coalition formation and the communication costs incurred across the coalition during FL. To evaluate the system efficiency under the intricate interplay between mechanism design, agent participation, operational decision-making, and the performance of FL algorithms, we introduce a multi-action collaborative federated learning (MCFL) framework for decision-aware agents. Under the MCFL framework, we further analyze the equilibrium for the renowned Shapley value based mechanisms. Specifically, we examine the issue of false-name manipulation, a form of dishonest behavior where participating agents create duplicate fake identities to split their original data among these identities. By solving the agent participation decisions in equilibrium, we demonstrate that, while Shapley value effectively maximizes coalition-generated surplus by encouraging full participation, it inadvertently promotes false-name manipulation. This further significantly increases the communication costs when the platform conducts federated learning across the coalition members. 
Thus, we highlight a significant pitfall of Shapley value based mechanisms, which implicitly incentivizes data splitting and identity duplication, ultimately impairing the overall efficiency in federated learning systems.

\KEYWORDS

collaborative federated learning, mechanism design, operational analytics, optimization algorithms

1 Introduction

Digitalization has brought revolutionary changes across traditional sectors such as retail, finance, and healthcare, allowing companies to offer services via online platforms and thus highlighting the importance of data-driven approaches. Moreover, the advances in large-scale machine learning models underscore the necessity for collaboration in developing machine learning methods, where individuals and organizations unite, sharing their data for mutual advancement.

Our framework models the collaborative learning system consisting of a platform and multiple decision-aware agents. The platform offers an infrastructure that enables access to each agent’s proprietary data set and coordinates the collaboration by incentivizing agent participation in collaborative learning through mechanism design. There are two main types of collaborative learning approaches: centralized learning, which involves training on a centralized machine; and federated learning, which is a decentralized approach with local training. Centralized learning involves raw data sharing, which is often less preferred due to privacy concerns (AbdulRahman et al. 2020, Choudhury 2023). In contrast, federated learning (FL) does not require any of the participants to reveal their raw data to the platform, safeguarding the privacy of local agents’ data (McMahan et al. 2017). Therefore, FL techniques become prevailing for the platform to address privacy and safety concerns arising from the agents.

Normally, the decision-aware agents have their specific operational objectives when participating in collaborative federated learning. For example, the primary goal of sellers in e-commerce platforms is to make informed data-driven pricing or inventory decisions, rather than concentrating on purely statistical predictive objectives, as often assumed in prior research (McMahan et al. 2017). Moreover, the decision-aware agents could choose how to participate in collaborative learning by partially contributing a subset of their datasets or splitting their datasets to participate under fake identities. This type of dishonest conduct is known as false-name manipulation (Conitzer and Yokoo 2010). Hence, the platform must carefully design a mechanism to encourage full participation among the decision-aware agents with heterogeneous data volumes, while making the mechanism more robust to false-name manipulation.

It is worth emphasizing that there is an intricate interplay between mechanism design, agent participation, operational decision-making, and the performance of FL algorithms, as captured by the following questions: 1) how should the allocation mechanism be designed based on the performance guarantees of learning algorithms; 2) how does a mechanism influence the agent participation equilibrium; and 3) how does the equilibrium influence the computation and communication of the FL algorithm. This interaction has been overlooked in existing studies: prior research on encouraging local agent participation often neglects its impact on algorithm performance, while studies aimed at improving FL algorithm efficiency and reducing communication costs tend to disregard the crucial role of mechanism design in improving algorithm performance.

1.1 A Motivating Example

One application of this setting is the customer-to-manufacturer (C2M) initiative. On e-commerce platforms such as Amazon, FlipKart, Alibaba, and JD.com, sellers and manufacturers provide products and services to customers. These platforms’ digital infrastructure is the key to accessing a large amount of granular and precise data (e.g., searches and clicks on products, customer reviews and ratings) compared to traditional brick-and-mortar retailers (Qi et al. 2020). There is a clear incentive to leverage this data further up the supply chain, enabling sellers and manufacturers to make more informed decisions in planning and operations. In practice, platforms like Walmart have begun sharing these datasets with their sellers (Masters 2019, Arora and Jain 2023).

The C2M paradigm aims to build digital connections between end consumers and upstream manufacturers, often through online retailing platforms, as has been witnessed at JD.com, Alibaba, and Pinduoduo (PDD) (Mak and Max Shen 2021). Therefore, many operational decisions, such as pricing, product design, and inventory management, depend critically on the resources provided by these digital platforms and cannot be effectively managed by sellers or manufacturers alone. Furthermore, by aggregating the information contained in their individual data, sellers and manufacturers can significantly benefit from more refined and accurate data-driven decisions. To assist sellers, the platform provides collaboration opportunities for them to form a coalition to aggregate their data for more informed decision-making (Masters 2019). Such a collaboration is essential for harnessing the full potential of digital transformation.

1.2 Outline and Main Contributions

In this part, we present an outline of our paper and summarize our main methodological contributions.

In Section 3, we introduce a multi-action collaborative federated learning framework (MCFL) with decision-aware agents. The MCFL framework models the collaborative federated learning system by characterizing the mechanism designed by the platform, agent participation cooperative game, operational decision-making, and the performance of FL algorithms conducted within the formed coalition. This fills the gap in modeling all these key factors and investigating the complex interplay between them.

In Section 4, we investigate the surplus allocation mechanism. Specifically, we study the widely recognized Shapley value-based mechanism, a prominent method in multi-agent collaborative learning (Shapley et al. 1953). We further analyze the agent participation equilibrium while accounting for a possible dishonest behavior, false-name manipulation. By solving the agent participation decisions in equilibrium, we demonstrate that the Shapley value is not robust against false-name manipulation, which further impacts the performance of MCFL.

In Section 5, we analyze the system efficiency, which consists of two critical factors: the surplus generated through collaborative federated learning and the communication costs incurred across the coalition while performing FL algorithms. We focus on the Federated Averaging algorithm (FedAvg), in which the agents perform local training and the platform periodically collects interim results from each agent and synchronizes all agents. The communication cost, which occurs during the synchronization, is affected by the number of agent identities within the coalition. We demonstrate that, while the Shapley value effectively maximizes coalition-generated surplus by encouraging full participation, it inadvertently promotes false-name manipulation. As a result, agents tend to participate with data split among fake identities, which significantly increases the communication costs of performing FL across agents. This pitfall of the Shapley value based mechanism ultimately reduces the system efficiency under the MCFL framework.

Thus, we highlight a significant pitfall of Shapley value based mechanisms, which implicitly incentivizes data splitting and identity duplication, ultimately impairing the overall efficiency of the collaborative federated learning system.

2 Literature Review

Our paper is closely related to collaborations involving multiple agents. We model such collaboration as a cooperative game, which studies the coalitions formed by players and their cooperative actions (Branzei et al. 2008). A well-known allocation rule within this framework is based on the Shapley value (Shapley et al. 1953), which allocates payoffs to players based on their marginal contributions to each coalition they are members of, ensuring a fair and efficient distribution. Although the Shapley value initially considers only binary participation actions of agents, Hsiao and Raghavan (1993) extends the Shapley value to scenarios where agents have multiple actions. Building on these works, we present a multi-action collaborative federated learning (MCFL) framework and summarize the related literature below.

Shapley value in operations management.

Shapley value is extensively used to incentivize collaboration among decision-aware agents with specific operational objectives. This topic is heavily studied in the field of operations management, where agents typically operate within distinct business contexts. Specifically, in supply chain management, Leng and Parlar (2009) explores the Shapley value for distributing surplus generated from shared demand information among a manufacturer, a distributor, and a retailer. Kemahlıoğlu-Ziya and Bartholdi III (2011) applies the Shapley mechanism for allocating surplus generated from inventory pooling among retailers. Gopalakrishnan et al. (2021) use the Shapley value to allocate carbon emission responsibilities among firms in a supply chain. Beyond supply chain management, Anily and Haviv (2010) employs the Shapley mechanism to divide the surplus from pooling service capacities in service systems. Singal et al. (2019) tackles the challenge of allocating eventual conversion among online advertisers through a modified counterfactual adjusted Shapley value. Bergantinos and Moreno-Ternero (2020) considers the equal-split rule, aligning with Shapley values in this specific context of splitting revenues from broadcasting sports events. Leng et al. (2021) investigates a game class with diminishing marginal contributions and analyzes Shapley value mechanism properties. Gopalakrishnan and Sankaranarayanan (2023) considers a Shapley mechanism variant for firm security cost-sharing arrangements. Several works also consider other mechanisms besides Shapley value for decision-aware agents. Gopalakrishnan et al. (2014) provides a summary of commonly used mechanisms in cost-sharing games. Our paper, while also considering decision-aware agents, specifically examines a scenario where the agents engage in informed decision-making through collaborative learning. In our work, each agent possesses a proprietary dataset. 
The collaboration is facilitated by a platform that provides the infrastructure for learning across multiple agents without sharing raw data. We refer to this setup as the multi-action collaborative federated learning (MCFL) framework for decision-aware agents.

Shapley value and machine learning.

Shapley value is also widely used in the machine learning and computer science communities. These works consider a collaborative learning framework where multiple agents jointly minimize a global loss function by contributing individual data sets. Ghorbani and Zou (2019) provides a metric based on Shapley value to evaluate individual data contribution to empirical risk minimization. Jia et al. (2019) uses Shapley value to fairly distribute profits among multiple data contributors in collaborative machine learning. Sim et al. (2020) uses a variant of Shapley value to incentivize collaboration in data sharing for obtaining high-quality machine learning models. Rozemberczki et al. (2022) presents an overview of the applications of Shapley value in machine learning. Our paper differs from this stream of research in the following aspects. First, while our paper also considers mechanisms that incentivize data provision and data sharing among participating agents, the goal of our agents is not to minimize a global loss function, but to make well-informed business decisions. Moreover, we consider a scenario where agents do not have incentives to directly share raw data, and the platform must provide an infrastructure for privacy-preserving collaborative learning, which leads to the analysis under the MCFL framework.

Federated learning.

Federated learning (FL) is a machine learning approach where a model is trained across multiple decentralized agents holding local data samples, without exchanging them, thus enhancing privacy and efficiency. McMahan et al. (2017) introduces a unified framework for federated learning, and Kairouz et al. (2021) offers a comprehensive survey on different aspects of the FL approach. A fundamental algorithm that is widely used in FL is FedAvg (or local SGD), which is based on the parallel stochastic gradient descent method (Zinkevich et al. 2010). Since then, several works have aimed at quantifying the performance of FedAvg (Stich 2018, Yu et al. 2019, Khaled et al. 2019). The majority of the cost of performing FL normally occurs in the communication process, when the platform queries local interim results across decentralized agents (Kairouz et al. 2021). Many studies have focused on reducing communication costs, either by improving the algorithm design for faster convergence (Shamir et al. 2014, Yuan and Ma 2020) or by reducing the communication bandwidth (Konečnỳ et al. 2016, Chraibi et al. 2019, Hamer et al. 2020). However, to our knowledge, no previous work considers how mechanism design affects the communication cost of FL algorithms. In our work, we adopt FL for privacy-preserving collaborative learning and specifically consider the impact of mechanism design on FL algorithm performance.

Incentive design in federated learning.

Lastly, our work is closely related to mechanism design in FL. Zhan et al. (2021) and Zeng et al. (2021) provide surveys of recent works on incentives and mechanism design in FL. Most of the works in this area focus on incentivizing agents to participate in federated learning, where the agents’ participation decisions are binary and the cost of federated learning is not considered. Few works have considered partial provision of data, with the cost occurring either in the FL process or through data provision. Karimireddy et al. (2022) considers a model with partial participation due to the cost of data provision, and constructs a mechanism to incentivize collaborative learning. In our paper, we do not directly consider the cost of data provision, but our main result on the pitfall of the Shapley value would still hold with data provision cost. Gafni and Tennenholtz (2022) considers an FL platform with non-collaborative agents and investigates how to manage the conflicting incentives, whereas we focus on a collaborative environment. Zhang et al. (2022) analyzes the incentive mechanism design while considering partial participation and the cost of computation for FL algorithms. We differ from this paper in the following aspects. Firstly, they assume specific exogenous functional forms of learning benefit and FL cost, while our work models the FL cost as an outcome of equilibrium participation, influenced by the mechanism. Secondly, we account for the decision-aware agents, where coalition surplus is also shaped by the operational decisions. More importantly, we address the potential for dishonest behaviors under the MCFL framework, such as data splitting and fake identity creation, which are not covered in prior research.

To our knowledge, our paper is the first work that considers decision-aware collaborative learning through the FL approach, with an analysis of the impact of mechanism design on both the coalition surplus and FL efficiency. Specifically, we consider that in equilibrium, the agents may engage in false-name manipulation (Iwasaki et al. 2010, Conitzer and Yokoo 2010, Aziz et al. 2011), where an agent creates fake identities and splits her data among two or more identities to participate in collaborative learning. We show that, while a Shapley value based mechanism encourages full agent participation, it inadvertently promotes false-name manipulation, which substantially increases the FL training cost. This eventually leads to a pitfall of the Shapley value under the MCFL framework.

3 The Multi-Action Collaborative Federated Learning (MCFL) Framework

In this section, we describe a multi-action collaborative federated learning (MCFL) framework for decision-aware agents. The system consists of a digital platform, which is the coordinator, and $K$ agents on the digital platform. These agents share a common operational objective and aim to make informed decisions after collaborative learning. The platform provides an infrastructure that enables cross-agent learning and seeks to design a mechanism that incentivizes all agents to form a joint coalition.

Each agent holds a specific quantity of proprietary data samples. We use the vector $\mathbf{m}=[m_{1},\dots,m_{K}]\in\mathbb{R}^{K}$ to denote the amount of data each agent possesses. Specifically, an agent $k$ possesses a data set of the form $S_{k}:=\{\mathbf{y}_{j},\quad\forall j=1,\dots,m_{k}\}$, where each observation $\mathbf{y}_{j}\in\mathbb{R}^{p}$ represents a data sample. All data samples are generated independently from an unknown ground-truth distribution $F_{\BFtheta^{*}}$. We consider a parametric family $\mathcal{F}_{\Theta}:=\{F_{\BFtheta}:\BFtheta\in\BFTheta\}$. In this case, $\BFtheta^{\ast}$ is unknown and must be estimated from data. The goal of each agent is to learn the unknown $\BFtheta^{\ast}$ from the data and make informed decisions, ideally with a guaranteed performance for a decision-aware objective.

Since we assume that all the data samples are i.i.d., each data sample contributes equally to the learning precision of the unknown $\BFtheta^{\ast}$. Moreover, in the absence of the infrastructure provided by the platform, an agent on her own lacks the capability to train the learning model. Therefore, individual agents are naturally incentivized to collaborate towards a common goal, which involves forming a coalition to aggregate data samples from all participants for improved estimation. Unlike most of the previous literature, we assume that agents can form a coalition by contributing only a subset of their data samples, rather than all. Specifically, for an agent $k$, we let $\tau_{k}\in\{0,1,\dots,m_{k}\}$ represent the size of data that agent $k$ contributes to the coalition. We further define the agent participation decision profile, $\BFtau_{\mathcal{A}}=[\tau_{1},\dots,\tau_{K}]$, for the coalition $\mathcal{A}$. When $\tau_{k}=0$, the agent $k$ is not in any coalition. We say that an agent $k$ belongs to coalition $\mathcal{A}$ if and only if $\tau_{k}>0$. Hence $\mathcal{A}=\{k:\tau_{k}>0\}$. Given $\BFtau_{\mathcal{A}}$, we denote by $S_{\BFtau_{\mathcal{A}}}=\{S_{\tau_{k}}:\forall k\in\mathcal{A}\}$ the data samples from coalition $\mathcal{A}$ with participation decision profile $\BFtau_{\mathcal{A}}$, where $S_{\tau_{k}}=\{\mathbf{y}_{i},i=1,\dots,\tau_{k}\}\subseteq S_{k}$, with $S_{\tau_{k}=0}=\emptyset$.

Additional Notation.

For any $\BFtau$, by slightly abusing notation, we let $|\BFtau|:=\sum_{k}\tau_{k}$ denote the total size of data within a coalition that is characterized by $\BFtau$, and $|\BFm|:=\sum_{k}m_{k}$ be the total number of data samples. We let $\mathbb{P}(A)$ denote the probability of any event $A$. We let $2^{\mathcal{N}(K)}$ denote the set of all subsets of $\{1,\dots,K\}$. We let $\BFm_{[2:K]}=[m_{2},\dots,m_{K}]$ be the maximum numbers of data points for agents $k=2,\dots,K$, and $\BFtau_{[2:K]}=[\tau_{2},\dots,\tau_{K}]$ with $\tau_{k}\in\{0,\dots,m_{k}\}$ be the participation decision profile of agents $k=2,\dots,K$. Lastly, let $M_{k}(\BFtau)=\{v|\tau_{v}\neq m_{v},v\neq k\}$ denote the set of agents in profile $\BFtau$, excluding agent $k$, who have not reached their maximum effort level (i.e., have not contributed all of their data).
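As a small illustration of this notation (the numbers below are hypothetical and chosen only for exposition), take $K=3$ agents with $\BFm=[5,3,4]$ and the participation profile $\BFtau=[2,0,4]$. Then

\[
\mathcal{A}=\{1,3\},\qquad |\BFtau|=2+0+4=6,\qquad |\BFm|=12,\qquad M_{1}(\BFtau)=\{2\},\qquad M_{2}(\BFtau)=\{1\},\qquad M_{3}(\BFtau)=\{1,2\}.
\]

Agent 3 contributes all of her data ($\tau_{3}=m_{3}$), so she does not appear in any $M_{k}(\BFtau)$, whereas agent 2 appears in $M_{1}(\BFtau)$ and $M_{3}(\BFtau)$ even though she is outside the coalition.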

3.1 A Decision-Aware Learning Objective

On this platform, all agents face a common decision-making problem under uncertainty, where the objective function is jointly determined by an agent’s decision $\BFw$ and a random parameter $\BFy$ with unknown distribution. Specifically, we assume each agent aims to develop a good data-driven solution for the following problem:

\[
z^{*}(\BFtheta^{\ast}):=\max_{\BFw\in\mathcal{C}}\Big[\pi(\BFw,F_{\BFtheta^{\ast}}):=\mathbb{E}_{\BFy\sim F_{\BFtheta^{\ast}}}[r(\BFw,\BFy)]\Big],
\]

where $F_{\BFtheta^{\ast}}$ is the underlying true distribution of the random parameter $\BFy\in\mathcal{Y}$, $\BFw\in\mathbb{R}^{d}$ is the decision variable restricted to a convex feasible region $\mathcal{C}$, and $r:\mathcal{C}\times\mathcal{Y}\rightarrow\mathbb{R}$ is the objective/reward function. Here, we consider a reward function $r(\BFw,\BFy)$ that is strictly concave in $\BFw$ for all $\BFy\in\mathcal{Y}$. Thus, there exists a unique optimal decision

\[
\BFw^{*}(\BFtheta^{\ast}):=\argmax_{\BFw\in\mathcal{C}}\pi(\BFw,F_{\BFtheta^{\ast}}).
\]

It is worth noting that we do not consider competition among agents in this model for now. Given any estimator $\hat{\BFtheta}$, the agents make informed decisions

\[
\BFw^{\ast}(\hat{\BFtheta}):=\argmax_{\BFw\in\mathcal{C}}\pi(\BFw,\hat{\BFtheta})=\argmax_{\BFw\in\mathcal{C}}\mathbb{E}_{\BFy\sim F_{\hat{\BFtheta}}}[r(\BFw,\BFy)].
\]

With the jointly learned parameter $\hat{\BFtheta}$ , the characteristic function of a coalition with decision profile $\mathbf{\BFtau}_{\mathcal{A}}$ can be defined as

\[
v(\BFtau_{\mathcal{A}})=\pi(\BFw^{*}(\hat{\BFtheta}_{\BFtau_{\mathcal{A}}}),\BFtheta^{\ast}).
\]

Below, we present an example of the decision-making problem along with its associated characteristic function.

Example 3.1 (The Newsvendor Problem.)

In the Newsvendor example, $\BFy\in\mathcal{Y}\subseteq\mathbb{R}$ is the random demand, and $\BFw\in\mathbb{R}$ is the order quantity. The decision-maker aims to minimize the cost $h(\BFw-\BFy)^{+}+b(\BFy-\BFw)^{+}$, that is, to maximize the reward $r(\BFw,\BFy)=-[h(\BFw-\BFy)^{+}+b(\BFy-\BFw)^{+}]$, where $h$ and $b$ represent the unit overstock and understock costs, respectively. Given any estimator $\hat{\BFtheta}$, $\BFw^{*}(\hat{\BFtheta})$ is the $\frac{b}{b+h}$ quantile of $F_{\hat{\BFtheta}}$.
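The quantile rule in the Newsvendor example can be sketched in a few lines. The snippet below is only an illustration under an added assumption that is not part of the model above: demand is taken to follow a Normal family, so that the estimator is $\hat{\BFtheta}=(\hat{\mu},\hat{\sigma})$. The function name `newsvendor_order` and all numerical values are hypothetical.

```python
from statistics import NormalDist


def newsvendor_order(mu_hat: float, sigma_hat: float, h: float, b: float) -> float:
    """Return the b/(b+h) quantile of the fitted demand distribution.

    Assumption (illustrative only): demand ~ Normal(mu_hat, sigma_hat).
    h is the unit overstock cost, b the unit understock cost.
    """
    critical_ratio = b / (b + h)
    return NormalDist(mu=mu_hat, sigma=sigma_hat).inv_cdf(critical_ratio)


# Hypothetical estimates: mean demand 100, standard deviation 20.
balanced_order = newsvendor_order(100.0, 20.0, h=1.0, b=1.0)   # median of the fit
cautious_order = newsvendor_order(100.0, 20.0, h=1.0, b=3.0)   # 0.75 quantile
print(balanced_order, cautious_order)
```

When understocking is costlier than overstocking ($b>h$), the critical ratio exceeds $1/2$ and the order quantity moves above the estimated mean demand.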

Since every data sample equally enhances the learning precision of the unknown $\BFtheta^{\ast}$ , only the total number of samples in the decision profile impacts the learning quality. Therefore, we can further express $v(\BFtau_{\mathcal{A}})$ and $\hat{\BFtheta}_{\BFtau_{\mathcal{A}}}$ as functions that depend on $|\BFtau_{\mathcal{A}}|$ , the total number of data samples in coalition $\mathcal{A}$ ,

\[
v(\BFtau_{\mathcal{A}})=\pi(\BFw^{*}(\hat{\BFtheta}_{|\BFtau_{\mathcal{A}}|}),\BFtheta^{\ast}):=v(|\BFtau_{\mathcal{A}}|).
\]

To encourage agent participation, the platform provides a performance guarantee on the coalition surplus $v(|\BFtau_{\mathcal{A}}|)$ for a coalition $\mathcal{A}$, in the form of a small gap between the oracle surplus $z^{*}$ and the actual coalition surplus $v(|\BFtau_{\mathcal{A}}|)$. Under certain standard assumptions provided in Appendix 8, the surplus gap between $z^{*}$ and $v(|\BFtau_{\mathcal{A}}|)$ can be translated into the estimation error $\|\hat{\BFtheta}_{FL}-\BFtheta^{*}\|$. Given the data size $|\BFtau_{\mathcal{A}}|$, it is often possible to obtain a high-probability statistical performance guarantee of the following form:

\[
\mathbb{P}\big(\|\hat{\BFtheta}_{FL}-\BFtheta^{*}\|\geq\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})\big)\leq\delta_{0},\tag{1}
\]

where $\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})$ depends on the pre-specified probability $\delta_{0}$ and the sample size $|\BFtau_{\mathcal{A}}|$. Typically, $\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})$ decreases as $|\BFtau_{\mathcal{A}}|$ grows. The platform then guarantees that, with probability greater than $1-\delta_{0}$, the surplus gap between $z^{*}$ and $v(|\BFtau_{\mathcal{A}}|)$ is bounded by $L_{r,w}\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})$,

\[
\mathbb{P}\big(z^{*}-v(|\BFtau_{\mathcal{A}}|)\geq L_{r,w}\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})\big)\leq\delta_{0},\tag{2}
\]

where $L_{r,w}$ is a given constant that depends on the Lipschitz constants of the reward and decision functions. We also show in Appendix 8 that the performance guarantees in (1) and (2) are equivalent under standard assumptions.
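To make the shape of $\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})$ concrete, the sketch below uses a textbook special case rather than the bound derived in this paper: it assumes a scalar $\BFtheta^{\ast}$ equal to the mean of samples supported on $[0,1]$, estimated by the sample mean, for which Hoeffding's inequality gives $\varepsilon(n,\delta_{0})=\sqrt{\ln(2/\delta_{0})/(2n)}$. The function names and the value used for $L_{r,w}$ are hypothetical.

```python
import math


def hoeffding_eps(n: int, delta0: float) -> float:
    """Error radius with P(|mean_hat - mean| >= eps) <= delta0.

    Assumption (illustrative only): i.i.d. samples in [0, 1], sample-mean
    estimator, Hoeffding's inequality 2 * exp(-2 * n * eps**2) = delta0.
    """
    return math.sqrt(math.log(2.0 / delta0) / (2.0 * n))


def surplus_gap_bound(n: int, delta0: float, lipschitz: float) -> float:
    """Bound of the form L_{r,w} * eps(n, delta0), as in guarantee (2)."""
    return lipschitz * hoeffding_eps(n, delta0)


# Hypothetical coalition sizes; the Lipschitz constant 2.0 is a placeholder.
for n in (100, 400, 1600):
    print(n, hoeffding_eps(n, 0.05), surplus_gap_bound(n, 0.05, lipschitz=2.0))
```

In this special case, quadrupling the pooled sample size halves the error radius, which illustrates why the guarantee improves as more data is contributed to the coalition.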

3.2 Federated Learning within Agent Coalition

After a coalition is formed, the platform aims to conduct privacy-preserving collaborative learning. One common approach is Federated Learning (FL), whose representative and widely adopted algorithm is Federated Averaging (FedAvg) (Kairouz et al. 2021). In FedAvg, agents keep their raw data locally to preserve privacy and perform local training on their own data. Periodically, the platform collects interim results from each agent and synchronizes all agents by distributing the average of these local outcomes. The details of the algorithm are provided in Section 5.
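The local-training and synchronization loop described above can be sketched as follows. This is a minimal illustration and not the algorithm analyzed in Section 5: it assumes a scalar parameter, the squared loss $\frac{1}{2}(\theta-y)^{2}$ (so the target is the pooled sample mean), full-batch local gradient steps instead of stochastic ones, and averaging weighted by local sample counts. The function name `fedavg_mean` and all parameter values are hypothetical.

```python
import random


def fedavg_mean(local_data, rounds=50, local_steps=5, lr=0.1, theta0=0.0):
    """FedAvg-style sketch: local gradient steps, then a weighted average.

    Each round corresponds to one synchronization between platform and agents.
    Raw samples never leave the agent; only local models are communicated.
    """
    total = sum(len(data) for data in local_data)
    theta = theta0
    for _ in range(rounds):
        local_models = []
        for data in local_data:
            local_theta = theta                      # start from the synced model
            local_mean = sum(data) / len(data)
            for _ in range(local_steps):             # local full-batch gradient steps
                local_theta -= lr * (local_theta - local_mean)
            local_models.append((len(data), local_theta))
        # Platform aggregates: average weighted by local sample counts.
        theta = sum(n * t for n, t in local_models) / total
    return theta


rng = random.Random(0)
local_data = [[rng.gauss(10.0, 2.0) for _ in range(n)] for n in (30, 50, 20)]
pooled_mean = sum(sum(d) for d in local_data) / sum(len(d) for d in local_data)
estimate = fedavg_mean(local_data)
print(estimate, pooled_mean)
```

In this simplified setting the synchronized model approaches the pooled sample mean, and every round requires one message exchange with each participating identity, which is the channel through which the number of identities affects communication cost.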

Given a coalition $\mathcal{A}$ , agent participation is characterized by the participation decision profile $\BFtau_{\mathcal{A}}$ , and $|\BFtau_{\mathcal{A}}|$ denotes the total size of data within the coalition. We let $\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}$ denote the estimator produced by the platform conducting FL with coalition data $S_{\BFtau_{\mathcal{A}}}$ .

FL requires multiple rounds of synchronization to obtain an estimator $\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}$ that satisfies the performance guarantee in (1). The majority of the cost of performing FL lies in this synchronization process, in which the platform must aggregate and communicate results across agents (Kairouz et al. 2021). We measure this cost by the total number of synchronizations needed to obtain an estimator meeting the performance criterion, denoted $N_{sync}(\delta_{0},\mathcal{M},\Phi)$, where $\Phi$ specifies FL algorithm parameters such as the initial point and the step-size choices. The number of synchronizations required for convergence depends intrinsically on the announced mechanism $\mathcal{M}$ through the agent participation profile $\BFtau_{\mathcal{A}}$, which determines how many agents participate and how many local data samples each contributes.

3.3 MCFL System Synergy

The decision timeline of the MCFL system is as follows:

• The platform specifies a guaranteed performance based on the size of the aggregated data from the coalition, together with the revenue surplus division rule. Specifically, the platform specifies the probability bound $\delta_{0}$, the FL parameter set $\Phi$, and the surplus allocation mechanism $\mathcal{M}$.

• The agents participate by deciding how to share and how much to share. The coalition $\mathcal{A}$ is formed with the associated participation profile $\BFtau_{\mathcal{A}}$.

• The platform conducts the learning task through federated learning and shares the learning results with the agents. The result is guaranteed to satisfy the performance guarantee in (1).

• The platform announces $\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}$, and the agents in coalition $\mathcal{A}$ make the informed decision $\BFw^{*}(\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|})$. The actual coalition surplus $v(|\BFtau_{\mathcal{A}}|)=\pi(\BFw^{*}(\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}),\BFtheta^{\ast})$ is realized, and the platform redistributes the coalition surplus according to the mechanism $\mathcal{M}$.

In this work, we investigate the synergy between mechanism design, agent participation, operational decision-making, and the performance of FL algorithms in the MCFL system. The surplus allocation mechanism is designed based on the statistical performance guarantees available for the learning algorithm. Moreover, such a mechanism influences the agent participation equilibrium, which in turn impacts the computation and communication cost of the FL algorithm through the total number of synchronizations required, $N_{sync}$.

4 A Shapley Value Based Mechanism for MCFL

In this section, we investigate the surplus allocation mechanism announced by the platform in the MCFL framework. In particular, we focus on the Shapley-value-based mechanism. We first introduce the MCFL Shapley value, and then discuss the equilibrium induced by the MCFL Shapley value under false-name manipulation.

4.1 The MCFL Shapley Value

A natural candidate for a fair and efficient payoff allocation mechanism in cooperative games is the renowned Shapley value, whose original definition traces back to Shapley et al. (1953). We state the definition as follows.

Definition 4.1 (Shapley value Shapley et al. (1953))

Suppose a cooperative game consists of $K$ agents and the characteristic function is $v:2^{\mathcal{N}(K)}\to\mathbb{R}$ . Then the payoff allocated to agent $k$ is defined as

\[
\psi_{k}(v)=\sum_{\mathcal{S}\subseteq\mathcal{N}(K)\backslash\{k\}}\frac{|\mathcal{S}|!(K-|\mathcal{S}|-1)!}{K!}\big(v(\mathcal{S}\cup\{k\})-v(\mathcal{S})\big).
\]
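Definition 4.1 can be evaluated directly by enumerating coalitions when $K$ is small. The sketch below does so for the classical (binary-participation) Shapley value only, not for the multi-choice MCFL version, and it assumes a hypothetical surplus function $v(n)=\sqrt{n}$ that depends only on the total data size of a coalition. The helper name `shapley` and the identity labels are illustrative.

```python
from itertools import combinations
from math import factorial, sqrt


def shapley(sizes, v):
    """Classical Shapley value when a coalition's worth is v(total data size).

    sizes maps each identity to the number of samples it contributes.
    """
    players = list(sizes)
    K = len(players)
    payoff = {}
    for k in players:
        others = [p for p in players if p != k]
        value = 0.0
        for r in range(K):                      # r = |S|, size of the coalition joined
            weight = factorial(r) * factorial(K - r - 1) / factorial(K)
            for S in combinations(others, r):
                data_S = sum(sizes[p] for p in S)
                value += weight * (v(data_S + sizes[k]) - v(data_S))
        payoff[k] = value
    return payoff


v = sqrt  # hypothetical concave surplus with v(0) = 0

# Two agents with 2 samples each, participating honestly.
honest = shapley({"agent1": 2, "agent2": 2}, v)
# Agent 1 splits her 2 samples across two fake identities.
split = shapley({"agent1_a": 1, "agent1_b": 1, "agent2": 2}, v)
agent1_split_total = split["agent1_a"] + split["agent1_b"]
print(honest["agent1"], agent1_split_total)
```

In this toy instance the payoffs sum to $v(4)=2$ in both cases, but agent 1 collects roughly 1.09 in total after splitting, compared with 1.0 when participating under a single identity.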

Recall that the mechanism specified by the platform announces the allocation rule $\mathbf{\psi}(v)\in\mathbb{R}^{K}$ which specifies the share of the payoff allocated to each player $k=1,\dots,K$ . Agents thus decide the participation profiles $\BFtau$ after observing the allocation rule $\mathbf{\psi}(v)$ .

As mentioned in section 3, we model the decision-aware collaboration problem under MCFL which allows agents to have various levels of participation decisions by selecting the number of samples they would like to contribute to the coalition, as quantified by the participation profile. To incorporate the multiple levels of participation, we adopt the multi-choice Shapley value proposed in Hsiao and Raghavan (1993), Hsiao (2004). Note that to define the multi-choice Shapley value, a weight function must be defined prior to the Shapley value (Hsiao and Raghavan (1993), Hsiao (2004)). Particularly, the weight function maps any possible action to a non-negative number and satisfies $\alpha(0)=0$ , and $\alpha(i)\leq\alpha(i+1)$ for any $i=1,\dots,K-1$ . The weight function is defined as prior knowledge of the power (or importance) of each action. Moreover, given the action space $\{0,1,\dots,m_{k}\}^{K}$ , in order to guarantee fairness in effort exertion, an ideal mechanism $\psi$ should satisfy the following axiom:

Axiom 1

(Axiom 1 in Hsiao (2004), Hsiao and Raghavan (1993)) Given any $\BFtau$, for the unanimity game where the value function is defined as

V^{\BFtau}(\BFtau^{\prime})=\begin{cases}1\qquad&\textrm{if }\BFtau^{\prime}\geq\BFtau\\ 0\qquad&\textrm{otherwise,}\end{cases}

the payoff allocated to agent $k$ is proportional to $\alpha(\tau_{k})$.

We are now ready to define the Shapley value under MCFL, which further defines the platform allocation mechanism based on the MCFL Shapley value.

Definition 4.2 (MCFL Shapley Value)

For any agent participation decision profile $\BFtau$, we define $M_{k}(\BFtau)=\{v|\tau_{v}\neq m_{v},v\neq k\}$ as the set of players other than agent $k$ who do not share all of their data. Let $\BFb(k)=[0,0\dots,1,0,\dots,0]\in R^{N}$ be the vector whose $k^{th}$ element equals 1 and whose other elements equal 0. Then, for an agent $k$ sharing $i$ observations, the allocated payoff is given by

\psi^{\alpha}_{i,k}(v)=\sum^{i}_{j=1}\sum_{\BFtau:\tau_{k}=j,\BFtau\neq 0}\left[\sum_{T\subseteq M_{k}(\BFtau)}(-1)^{|T|}\frac{\alpha(j)}{||\BFtau||_{\alpha}+\sum_{r\in T}[\alpha(\tau_{r}+1)-\alpha(\tau_{r})]}\right][v(\BFtau)-v(\BFtau-\BFb(k))],

where for any $\BFtau\in\{0,1,\dots,m_{k}\}^{K}$ , $\|\BFtau\|_{\alpha}:=\sum_{k=1}^{K}\alpha(\tau_{k})$ .

Remark 1 (Interpreting the weight function)

The weight of the action of sharing a sample of size $i$ can be interpreted as the importance or power of this action. In the context of data sharing, obtaining each data sample often requires a unit effort (Karimireddy et al. 2022). Thus, it is natural to consider linear weights: sharing a sample of size $i$ has a weight that is linear in $i$.

For simplicity, we set $\alpha(\tau_{k})=\tau_{k}$ for all $\tau_{k}\in\mathbb{N}^{+}$. This is without loss of generality, as any proportional function $\alpha(\tau_{k})=\alpha\tau_{k}$ reduces to $\alpha(\tau_{k})=\tau_{k}$. Thus the Shapley value defined in Definition 4.2 simplifies as follows.

Definition 4.3 (MCFL Shapley Value with Linear Weights)

For any agent participation decision profile $\BFtau$ , for an agent $k$ sharing $i$ observations, the allocated payoff is given by

\psi_{i,k}(v)=\sum^{i}_{j=1}\sum_{\BFtau:\tau_{k}=j,\BFtau\neq 0}\left[\sum_{T\subseteq M_{k}(\BFtau)}(-1)^{|T|}\frac{j}{|\BFtau|+|T|}\right][v(\BFtau)-v(\BFtau-\BFb(k))].
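The formula in Definition 4.3 can be evaluated by brute-force enumeration of participation profiles. The sketch below is one possible transcription, not the authors' implementation: the function name `mcfl_shapley`, the 0-based agent indexing, and the square-root surplus are assumptions made for illustration. Passing a different `alpha` gives the general weights of Definition 4.2.

```python
from itertools import combinations, product


def mcfl_shapley(i, k, m, v, alpha=lambda t: t):
    """Payoff of agent k (0-based) when sharing i samples.

    m[r] is the number of samples agent r owns, v maps a participation
    profile (tuple) to the coalition surplus, and alpha is the weight
    function. The default alpha(t) = t gives the linear-weight case.
    """
    K = len(m)
    payoff = 0.0
    for tau in product(*(range(m_r + 1) for m_r in m)):
        j = tau[k]
        if j == 0 or j > i:
            continue  # only profiles with 1 <= tau_k <= i contribute
        # M_k(tau): agents other than k who do not share all their data
        M = [r for r in range(K) if r != k and tau[r] != m[r]]
        norm = sum(alpha(t) for t in tau)
        coeff = 0.0
        for size in range(len(M) + 1):
            for T in combinations(M, size):
                bump = sum(alpha(tau[r] + 1) - alpha(tau[r]) for r in T)
                coeff += (-1) ** size * alpha(j) / (norm + bump)
        lower = tau[:k] + (j - 1,) + tau[k + 1:]  # tau - b(k)
        payoff += coeff * (v(tau) - v(lower))
    return payoff


# Hypothetical surplus: concave in the total number of shared samples.
def surplus(tau):
    return sum(tau) ** 0.5


m = (2, 1)  # agent 0 owns two samples, agent 1 owns one
payoffs = [mcfl_shapley(m[k], k, m, surplus) for k in range(len(m))]
```

Under full participation the two payoffs in this toy game add up to `surplus(m)`, which is the budget-balance property one expects from the multi-choice Shapley value.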

The Shapley value is distinguished by satisfying all the desired properties of an allocation mechanism. An ideal mechanism should (1) prevent free-riding by non-contributing agents; (2) ensure equal treatment of agents who contribute equally; (3) guarantee that the surplus is fully allocated without any waste; and (4) be consistent and scalable across different situations. These desired properties translate into the following axioms.

Axiom 2 (Desired Axioms for MCFL Mechanisms)

Desired axioms for MCFL mechanisms include the following:

A.

Null player. A player that does not add value gets nothing: for any player $k$ with action $\tau_{k}$, if

v([\tau_{1},\dots,\tau_{k},\dots,\tau_{K}])=v([\tau_{1},\dots,0,\dots,\tau_{K}]),

then $\psi_{\tau_{k},k}(v)=0$.

B.

Symmetry. If $v([\tau_{1},\dots,\tau_{i},0,\dots,\tau_{K}])=v([\tau_{1},\dots,0,\tau_{j},\dots,\tau_{K}])$ for $\tau_{i}=\tau_{j}$, then $\psi_{i,k}(v)=\psi_{j,k}(v)$ for all $k$.

C.

Efficiency (Budget balanced). $\sum_{k=1}^{K}\psi_{m_{k},k}(v)=v(\BFm)$.

D.

Additivity. For two characteristic functions $v$ and $u$, $\psi_{i,k}(v+u)=\psi_{i,k}(u)+\psi_{i,k}(v)$.

It is worth noting that the MCFL Shapley value is the only mechanism that satisfies Axiom 1 and Axiom 2 (Hsiao 2004). Hence, if a platform wants to guarantee the desired properties of an allocation mechanism and fairness in effort exertion, the MCFL Shapley value is the only choice among all possible allocation rules.

4.2 False-Name Manipulation

As presented in the previous section, the Shapley value is the only option if the platform requires the desirable properties stated in Axioms 1 and 2. However, when applied in the real world, a Shapley value based mechanism can be vulnerable to dishonest behaviors or manipulations by the participating agents. One such manipulation is false-name manipulation, in which an agent creates fake identities in the game and then splits her data among two or more identities. More specifically, if player $k$ splits into $m$ identities, then the original sample $S_{k}$ is split into $m$ small sub-samples, each containing at least one data sample, and player $k$ pretends that the $m$ small sub-samples come from $m$ different (fake) agents.

We first demonstrate that the allocation mechanism defined by Shapley values is exposed to the risk of false-name manipulation. For any agent $k$, suppose agent $k$ adopts false-name manipulation and splits into two fake identities, agents $k_{1}$ and $k_{2}$. In Theorem 4.1, we compare the payoff that the original agent receives with the total payoff that the two fake agents receive.

Theorem 4.1 (Vulnerability of MCFL Shapley under False-name Manipulation)

For any $T,T^{\prime}\leq i$ such that $T+T^{\prime}=i$ , we have

(i)

$\psi_{i,k}(v)<\psi_{T,k_{1}}(v)+\psi_{T^{\prime},k_{2}}(v)$ if $v(\cdot)$ is a strictly concave function.
(ii)

$\psi_{i,k}(v)=\psi_{T,k_{1}}(v)+\psi_{T^{\prime},k_{2}}(v)$ if $v(\cdot)$ is a linear function.

In other words, with a concave value function, an agent has an incentive to create a duplicate identity and split her original data set to obtain a higher allocation. Moreover, for an agent $k$ with $m_{k}$ samples of data, the equilibrium participation profile under MCFL Shapley is $\tau_{\hat{k}}=1,\ \hat{k}=1,\dots,m_{k}$, where the agent fully participates in the coalition but splits all the data samples, with each identity contributing one sample.
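As a concrete check of Theorem 4.1, the following worked example applies Definition 4.3 to a small hypothetical game. The setup (agent $k$ with two samples, agent $j$ with one sample, and the surplus values $a$, $b$, $c$) is an illustrative assumption and not one of the paper's experiments.

```latex
% Assumed setup: agent k owns 2 samples, agent j owns 1 sample, and v depends
% only on the total number of shared samples: v(0)=0, v(1)=a, v(2)=b, v(3)=c.
\begin{align*}
\text{No split:}\quad
\psi_{2,k}(v) &= \tfrac{1}{2}a + \tfrac{1}{2}(b-a) + \tfrac{1}{3}(b-a) + \tfrac{2}{3}(c-b)
              = \tfrac{2}{3}c + \tfrac{1}{6}b - \tfrac{1}{3}a, \\
\text{Split into } k_{1}, k_{2}:\quad
\psi_{1,k_{1}}(v) + \psi_{1,k_{2}}(v) &= \tfrac{1}{3}c + \tfrac{1}{3}c = \tfrac{2}{3}c, \\
\text{Gain from splitting:}\quad
\psi_{1,k_{1}}(v) + \psi_{1,k_{2}}(v) - \psi_{2,k}(v) &= \tfrac{1}{6}(2a - b).
\end{align*}
```

After the split, the game has three interchangeable single-sample players, so each receives $c/3$. Strict concavity with $v(0)=0$ gives $b-a<a$, so the gain $(2a-b)/6$ is strictly positive, matching part (i); a linear $v$ gives $b=2a$ and zero gain, matching part (ii).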

Normally, $v(|\BFtau_{\mathcal{A}}|)$ is a strictly increasing and concave function for many business or operations decisions. In Appendix 8, we provide a specific example for $v(|\BFtau_{\mathcal{A}}|)$ being a strictly increasing and concave function for pricing under uncertainty. Theorem 4.1 suggests that, while Shapley value satisfies all the desired properties and encourages full participation in providing all data samples, it inherently incentivizes agents to split data with fake identities. While this dishonest behavior does not impact the total coalition surplus $v(|\BFtau_{\mathcal{A}}|)$ , which only depends on the total number of samples in the coalition, it significantly hurts the performance of the FL algorithm. In the next section, we further elaborate on how the false-name manipulation hurts the learning process, and may further decrease the overall system efficiency.

5 System Efficiency

In Section 4, we describe a mechanism based on the Shapley value that incentivizes full participation of the agents. We also discuss the vulnerability of the Shapley value to dishonest behaviors such as false-name manipulation. In this section, we elaborate on how false-name manipulation impacts the performance of FL algorithms. Our work is the first to study the impact of mechanism design on agents’ decisions to provide data, and how these decisions subsequently affect the performance of platform learning algorithms. We examine how an efficient mechanism like Shapley, one that is budget-balanced and promotes complete data provision to enhance estimator quality, is vulnerable to false-name manipulation, which leads agents to participate while splitting their data. Such vulnerability can result in escalated communication costs within federated learning frameworks, presenting a complex challenge that intertwines mechanism design with operational efficiency.

This section is organized as follows. We first introduce the platform federated learning process and define the system efficiency, which combines the agent allocation of MCFL and the communication cost of federated learning. We then analyze the system efficiency of the Shapley value based mechanism. Our findings emphasize that although the Shapley mechanism adheres to the desired axioms of mechanism design, it fails to guard against false-name manipulation, resulting in considerably higher training costs compared to other mechanisms. This is the pitfall of the Shapley mechanism.

5.1 Algorithm for Federated Learning

In this section, we first introduce the general approach of federated learning. We then apply federated learning to our collaborative learning setup and define the system efficiency under federated learning.

Federated learning is a decentralized and privacy-preserving machine learning framework, where multiple clients collaboratively train a model under a central platform without sharing raw data. The platform seeks to minimize a global loss function, which can be represented as the sum of the agents’ local loss functions $L_{k}$:

{\min_{\BFtheta}\ L(\BFtheta)=\sum^{K}_{k=1}L_{k}(\BFtheta)}

(3)

A basic blueprint for designing Federated Learning (FL) training algorithms is the Federated Averaging algorithm (FedAvg) (McMahan et al. 2017). While there are many variants of FedAvg (Kairouz et al. 2021), in this section we consider the following variant, a parallel gradient descent method also known as local gradient descent (local GD) (Mangasarian 1995, Khaled et al. 2019), presented in Algorithm 1. We do not adopt a local stochastic gradient descent (SGD) method because the equilibrium participation profile may involve each identity of an agent providing only one data sample, which makes SGD infeasible. Algorithm 1 is parameterized by the step size $\rho$, the initial point $\BFtheta^{0}$, the synchronization interval $H$, and the total number of iterations $T$, which are all platform decision variables. In local GD, each agent computes gradients on her own machine for an interval of $H$ iterations and then synchronizes with the platform. The platform averages the local results and broadcasts the average back to each agent. Define the set of platform decision variables as $\Phi=\{\rho,\BFtheta^{0},T,H\}$. The details are provided in Algorithm 1.

Input: $t=0$, $\Phi=\{\rho,\BFtheta^{0},T,H\}$, $\BFtheta^{0}_{k}=\BFtheta^{0}$ for all $k$

Output: Training weights $\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}$ as average of $\BFtheta^{t},t\in\{H,2H,\dots,T-1\}$

1 while $t<T$ do

2 for $k=1,\dots,K$ do

3 if $t\in\{H,2H,\dots,T-1\}$ then

4 $\BFtheta^{t+1}=\frac{1}{K}\sum^{K}_{i=1}(\BFtheta^{t}_{i}-\rho\nabla L_{i}(\BFtheta^{t}_{i})),\ \BFtheta^{t+1}_{k}=\BFtheta^{t+1}$ for all $k$; // Synchronization

5 else

6 $\BFtheta^{t+1}_{k}=\BFtheta^{t}_{k}-\rho\nabla L_{k}(\BFtheta^{t}_{k})$; // Local Training

7 $t=t+1$

Algorithm 1 Federated learning approach with Parallel Gradient Descent for solving (3) (e.g. Mangasarian (1995), Khaled et al. (2019))
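The loop of Algorithm 1 can be sketched in a few lines for a scalar parameter. This is a simplified reading under stated assumptions: the function name `local_gd`, the rule "synchronize when $t$ is a positive multiple of $H$", and the two quadratic local losses are illustrative choices, and the sketch returns the synchronized models rather than committing to a particular averaging of them.

```python
def local_gd(grads, theta0, rho, T, H):
    """Sketch of Algorithm 1 for a scalar parameter.

    grads[k] is the gradient of agent k's local loss. Returns the list of
    synchronized models and the number of synchronization rounds.
    """
    K = len(grads)
    thetas = [theta0] * K           # theta_k^0 = theta^0 for all k
    synced = []
    for t in range(T):
        stepped = [thetas[k] - rho * grads[k](thetas[k]) for k in range(K)]
        if t > 0 and t % H == 0:    # t in {H, 2H, ...}: synchronization
            average = sum(stepped) / K
            thetas = [average] * K  # platform broadcasts the average
            synced.append(average)
        else:                       # local training
            thetas = stepped
    return synced, len(synced)


# Hypothetical quadratic local losses L_k(theta) = 0.5 * (theta - c_k)^2.
centers = [0.0, 2.0]
grads = [lambda th, c=c: th - c for c in centers]
synced, n_sync = local_gd(grads, theta0=2.0, rho=0.1, T=55, H=10)
```

With these toy losses the minimizer of the summed loss is $\theta=1$, and the run performs five synchronization rounds (at $t=10,20,30,40,50$). The number of rounds is the quantity that $N_{sync}$ counts in the text.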

In collaborative learning, we consider an example where the platform uses maximum likelihood estimation (MLE) to learn $\BFtheta$ through federated learning. Here, the loss function is the negative log-likelihood. Let $\BFtau=[\tau_{1},\dots,\tau_{K}]$ denote the agents’ participation profile, where each agent $k=1,\dots,K$ provides $\{\BFy_{i},i=1,\dots,\tau_{k}\}$. The associated probability density function of $\BFy_{i}$ is $f(\mathbf{y}_{i};\BFtheta^{*})$, and the objective function is given by

{L(\BFtheta)=-\frac{1}{|\BFtau|}\sum^{K}_{k=1}\sum^{\tau_{k}}_{j=1}\log f(\mathbf{y}_{j};\BFtheta).}

(4)

In practice, $\BFtheta^{*}_{FL}=\textup{argmin}_{\BFtheta}\ L(\BFtheta)$ is not attainable due to optimization/training loss in accuracy, and the platform trains $\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}$ that converges to $\BFtheta^{*}_{FL}$ . The agents’ total surplus is thus given by

{\sum_{k\in\mathcal{A}}\psi_{\tau_{k},k}(v(|\BFtau_{\mathcal{A}}|))=\sum_{k\in\mathcal{A}}\psi_{\tau_{k},k}(\pi(\BFw^{*}(\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}),\BFtheta^{\ast})).}

(5)

The cost of federated learning is incurred in the synchronization step of Algorithm 1. Let the total number of synchronization rounds required to converge to the target performance guarantee be $N_{sync}(\delta_{0},\Phi,\mathcal{M})$. At each synchronization, the platform queries the local training results across the coalition of agents. Let the cost per query round be $c$. We define the system efficiency as

{\Pi(\delta_{0},\Phi,\mathcal{M})=\sum_{k\in\mathcal{A}}\psi_{\tau_{k},k}(\pi(\BFw^{*}(\hat{\BFtheta}^{FL}_{|\BFtau_{\mathcal{A}}|}),\BFtheta^{\ast}))-cN_{sync}(\delta_{0},\Phi,\mathcal{M}),}

(6)

which is the agents’ total surplus minus the communication cost $cN_{sync}$. Note that $c$ here is the cost incurred per synchronization round in which the platform queries all identities of agents in the coalition. If $c$ increased linearly with the number of identities, that is, if the platform incurred a cost per communication with each agent identity, then fake identities would trivially increase the cost of federated learning. We instead allow $c$ to be a constant that does not increase with the number of identities. In the next section, we show that even when the cost per synchronization round is constant, the total cost of federated learning still grows linearly with the number of fake identities created, which eventually leads to a pitfall of the Shapley value based mechanism.

5.2 Analysis of System Efficiency

In Section 4, we show that the Shapley value based mechanism is the only mechanism that satisfies the desired axioms and leads to an efficient allocation with no surplus left on the table. In this section, we focus on how the mechanism impacts the platform training process. Specifically, while false-name participation with data splitting does not degrade the estimator quality, since data splitting does not change the total number of observations, it significantly increases the training cost required to attain the performance guarantee, because it creates redundant communications and operations between the central platform and the fake identities of agents.

Under the following assumption, we can show that $N_{sync}$ strictly increases with the number of participating agents $K$ .

Assumption 5.1 ( $L$ -smoothness, bounded gradient and strong convexity)

$L_{k}(\BFtheta)$ is $L$ -smooth for all $k$ . $\|\nabla L_{k}(\BFtheta)\|\leq\xi$ for all $\BFtheta$ , $k$ . $\BFtheta^{*}_{FL}$ is the unique minimizer of $L(\BFtheta)$ in the interior of the parameter space $\Theta$ , $H(\BFtheta)=\nabla^{2}L(\BFtheta)$ is smooth, and all eigenvalues of $H(\BFtheta^{*}_{FL})$ are strictly positive.

Assumption 5.1 guarantees that the gradient descent algorithm converges under reasonable choices of step size, and that the minimizer of the loss function converges to the ground truth parameter $\BFtheta^{*}$. With Assumption 5.1, we are now ready to state the main result on system efficiency.

Theorem 5.2

Under the MCFL framework, the system efficiency of the Shapley value based mechanism is upper bounded by

{\Pi(\delta_{0},\Phi,\mathcal{M}=\textup{Shapley})\leq v(|\BFm|)-c\left(\left(\frac{64L\|\BFtheta^{0}-\BFtheta^{*}\|^{2}}{\mu}+\frac{12\sigma^{2}}{L\mu}\right)^{1/2}+\frac{\xi}{4L}\right)^{3}(\varepsilon(|\BFm|,\delta_{0}))^{-3}}

(7)

Specifically, with $\varepsilon(|\BFtau_{\mathcal{A}}|,\delta_{0})=\beta_{1}\log(\frac{\beta_{2}}{\delta_{0}})|\BFtau_{\mathcal{A}}|^{-\alpha}$, the system efficiency $\Pi(\delta_{0},\Phi,\mathcal{M})$ is bounded by

{\Pi(\delta_{0},\Phi,\mathcal{M})\leq v(|\BFm|)-c\lambda|\BFm|^{3\alpha},}

(8)

with $\lambda$ being a constant.
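The step from (7) to (8) is a direct substitution, which can be written out as follows. Here $C$ is shorthand (introduced only for this note) for the cubed bracketed factor multiplying $\varepsilon^{-3}$ in (7); it does not depend on $|\BFm|$.

```latex
% C: shorthand for the cubed bracketed constant in (7); independent of |m|.
\begin{align*}
\varepsilon(|\BFm|,\delta_{0})^{-3}
  &= \left(\beta_{1}\log\frac{\beta_{2}}{\delta_{0}}\right)^{-3}|\BFm|^{3\alpha}, \\
c\,C\,\varepsilon(|\BFm|,\delta_{0})^{-3}
  &= c\,\underbrace{C\left(\beta_{1}\log\frac{\beta_{2}}{\delta_{0}}\right)^{-3}}_{\lambda}\,|\BFm|^{3\alpha}.
\end{align*}
```

So $\lambda$ collects the optimization constants of (7) and the concentration constants $\beta_{1}$, $\beta_{2}$, $\delta_{0}$, while the dependence on the number of observations enters only through $|\BFm|^{3\alpha}$.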

In concentration bounds, $\alpha>0$. For example, applying Hoeffding-style bounds to maximum likelihood estimation with i.i.d. Bernoulli random variables with parameter $P(y_{i}=1)=\theta$ gives $\alpha=\frac{1}{2}$, so the communication cost grows on the order of $|\BFm|^{3/2}$ as the number of observations $|\BFm|$ increases. Moreover, by the concavity of $v(|\BFm|)$, as long as $\alpha>\frac{1}{3}$, adding an extra observation hurts the system efficiency beyond a certain point under false-name manipulation, due to the increased communication cost.
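The trade-off described above can be illustrated numerically by evaluating the right-hand side of (8) on a grid. The surplus $v(n)=\log(1+n)$ and the value $c\lambda=0.001$ below are hypothetical choices made only to obtain a concrete curve; they are not taken from the paper.

```python
from math import log


def efficiency_bound(n, c_lambda=0.001, alpha=0.5):
    """Right-hand side of (8) with a hypothetical surplus v(n) = log(1 + n)."""
    return log(1 + n) - c_lambda * n ** (3 * alpha)


candidates = range(1, 501)
best_n = max(candidates, key=efficiency_bound)
```

With $\alpha=\frac{1}{2}$ the cost term grows like $n^{3/2}$ while the assumed surplus grows only logarithmically, so the bound peaks at an interior sample size and declines afterwards.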

Theorem 5.2 underscores the importance of developing new mechanism designs that are more attuned to the practical challenges and operational realities of FL, moving beyond traditional Shapley value based approaches to ensure more effective and efficient collaborative learning. It is worth mentioning that there is no free lunch in designing a mechanism that satisfies all the desired properties while still being robust to false-name manipulation and minimizing computation cost. As the previous result suggests, Shapley is the only mechanism that satisfies all the desired axioms; however, it is not robust enough to prevent data splitting. This implies that no mechanism satisfies the desired properties while also minimizing the communication cost in FL. In practice, the platform needs to carefully balance these trade-offs when designing the mechanism.

In the following subsection, we introduce a simple numerical example to demonstrate the harm of data splitting to FL algorithm.

5.2.1 Numerical Example

The Newsvendor Problem

In this numerical example, we focus on the newsvendor problem presented in Example 3.1. Specifically, we assume the demand data held by each agent independently follows $d_{i}=\BFx^{T}_{i}\BFtheta^{*}+\epsilon_{i}$, with $\BFy_{i}:=(\BFx_{i},d_{i})$, where $\BFx_{i}\in\mathbb{R}^{p}$ is the contextual information and $\epsilon_{i}$ follows the normal distribution with zero mean and variance $\sigma^{2}$. Here, $\BFtheta^{*}\in\mathbb{R}^{p}$ is the unknown parameter we try to estimate through FL. According to the newsvendor problem formulation, $l_{j}(w,d_{j})=h(w-d_{j})^{+}+b(d_{j}-w)^{+}$, where $(w-d_{j})^{+}=\max\{0,w-d_{j}\}$ and $(d_{j}-w)^{+}=\max\{0,d_{j}-w\}$. The platform objective is to obtain an estimator $\hat{\BFtheta}$ that minimizes the following L2-regularized NV objective

{\hat{\BFtheta}=\arg\min_{\BFtheta}\ L(\BFtheta)=\sum^{K}_{k=1}\sum_{\BFy_{j}\in S_{k}}\ l_{j}(\BFx^{T}_{j}\BFtheta,d_{j})+\lambda\|\BFtheta\|^{2}_{2}.}

(9)
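For readers who want to experiment with objective (9), the following minimal sketch evaluates it and one of its subgradients for a scalar feature. It assumes the order quantity is the linear prediction $x_{j}\theta$, which is our reading of (9); the function names are hypothetical, and the defaults $h=0.1$, $b=0.9$, $\lambda=1$ follow the numerical setup described in the text.

```python
def newsvendor_loss(w, d, h=0.1, b=0.9):
    """h penalizes ordering above demand d, b penalizes ordering below it."""
    return h * max(0.0, w - d) + b * max(0.0, d - w)


def regularized_objective(theta, samples, lam=1.0, h=0.1, b=0.9):
    """Objective (9) for a scalar feature: samples is a list of (x, d) pairs."""
    fit = sum(newsvendor_loss(x * theta, d, h, b) for x, d in samples)
    return fit + lam * theta ** 2


def subgradient(theta, samples, lam=1.0, h=0.1, b=0.9):
    """One valid subgradient of the objective with respect to theta."""
    g = 2.0 * lam * theta
    for x, d in samples:
        if x * theta > d:
            g += h * x      # over-ordering branch
        elif x * theta < d:
            g -= b * x      # under-ordering branch
    return g
```

Since the newsvendor loss is piecewise linear, the objective is not differentiable at $x_{j}\theta=d_{j}$; the sketch contributes zero for such samples, which is one admissible subgradient choice. A function like `subgradient` could serve as the local gradient oracle in a local GD loop.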

In the following numerical example, we consider the case where $h=0.1$, $b=0.9$, and $\lambda=1$. We further set $\sigma^{2}=2.25$, so that $\epsilon_{i}\sim N(0,\sigma^{2})$, and let $x_{j}\sim N(0,\sigma^{2}_{x})$ with $\sigma_{x}=2$. We assume there are two agents, denoted by agent $k$ and agent $j$. Agent $k$ possesses samples $\BFy_{1}$ and $\BFy_{2}$, while agent $j$ possesses $\BFy_{3}$ and $\BFy_{4}$. Agent $k$ (or $j$) could potentially participate under identities $k_{1}$ (or $j_{1}$) and $k_{2}$ (or $j_{2}$), with each fake identity contributing one sample. Figure 1 compares the convergence and the number of synchronizations with and without data splitting. It shows that fake identities add noise to the convergence process during the epochs when each agent independently performs local gradient descent without synchronization. In Figure 1(b), when agents split data, convergence is significantly slower than in Figure 1(a), where there are no fake identities, for the same number of synchronizations. Hence, in order to reach the same performance guarantee as in Figure 1(a), the total number of synchronizations required increases from $6$ to $19$ in Figure 1(c), which more than triples the cost of communication.

Figure 1(a). Performance of FL, no data splitting.

We further investigate the newsvendor cost. In Figure 2, we observe that the better estimator quality shown in Figure 1 directly leads to better performance and cost reduction in decision-making under uncertainty for the decision-aware agents. Without data splitting, the newsvendor loss quickly converges to the minimized loss. However, when agents split data, zooming in to epochs 15 to 55 in Figure 2(b) shows that, similar to the estimator performance in Figure 1(b), the loss oscillates around the optimal loss, and within $T=55$ epochs the platform cannot provide an estimator that satisfies the performance guarantee. Hence, the platform must increase the number of synchronizations from $6$ to $19$ in order to deliver the guaranteed surplus to the decision-aware agents, which substantially increases the communication cost.

Portfolio Optimization

We consider a risk-averse portfolio optimization problem as an example of the operational decision among agents. We let a random vector $\BFxi\in\mathbb{R}^{d}$ denote the random return, and the agent makes investment decisions $\BFw\in\mathbb{R}^{d}$ and $w_{0}\in\mathbb{R}$ to optimize the allocation of assets. The objective is formulated as $c(\BFw,w_{0},\BFxi):=\alpha(\sum_{l=1}^{d}w^{l}\xi^{l}-w_{0})^{2}-\sum_{l=1}^{d}w^{l}\xi^{l}$, where $w^{l}$ and $\xi^{l}$ denote the $l$-th components of $\BFw$ and $\BFxi$, respectively. Moreover, we assume that $\xi_{i}=\BFx^{T}_{i}\BFtheta^{*}+N(0,\sigma)$, where $\BFx_{i}\in R^{1\times p}$ is fixed local feature data, and $\BFy_{i}:=(\BFx_{i},\xi_{i})$. Here, $\BFtheta^{*}\in\mathbb{R}^{p\times 1}$ is the unknown parameter we try to estimate through FL. In the MCFL framework, the agents obtain a maximum likelihood estimator $\hat{\BFtheta}$ by minimizing the mean square error (MSE) loss function $l_{MSE}(\BFxi,\BFxi^{\prime}):=\|\BFxi-\BFxi^{\prime}\|^{2}$ for any $\BFxi$, $\BFxi^{\prime}$, through FL:

\hat{\BFtheta}=\arg\min_{\BFtheta}L(\BFtheta)=\sum_{k=1}^{K}\sum_{\BFy_{j}\in S_{k}}l_{MSE}(\BFx^{T}_{j}\BFtheta,\BFxi_{j}).
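The two ingredients of this example, the decision objective $c$ and the MSE training loss, can be sketched as plain functions. The names below are hypothetical, the feature is taken to be scalar for brevity, and the default $\alpha=\frac{1}{2}$ matches the value used in the numerical setup.

```python
def portfolio_cost(w, w0, xi, alpha=0.5):
    """c(w, w0, xi) = alpha * (w . xi - w0)^2 - w . xi for one return draw xi."""
    portfolio_return = sum(w_l * xi_l for w_l, xi_l in zip(w, xi))
    return alpha * (portfolio_return - w0) ** 2 - portfolio_return


def mse_loss(theta, samples):
    """Sum over samples of (x * theta - xi)^2 for a scalar feature x."""
    return sum((x * theta - xi) ** 2 for x, xi in samples)


def mse_gradient(theta, samples):
    """Gradient of mse_loss with respect to theta."""
    return sum(2.0 * x * (x * theta - xi) for x, xi in samples)
```

In a sketch of the FL step, each identity would apply `mse_gradient` to its own samples during local training, and `portfolio_cost` would then be used to evaluate decisions made from the learned estimator.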

Figure 3 illustrates how data splitting hurts the performance of FL under portfolio optimization. In this numerical example, we consider the case where two agents may split their samples and participate as four agents under fake identities. The details of the numerical setup are as follows. There are two agents, denoted by agent $k$ and agent $j$. Agent $k$ possesses $m_{k}=8$ samples of $[\BFx_{k}]\in R^{m_{k}\times 1}$ and $[\BFxi_{k}]\in R^{m_{k}}$, while agent $j$ possesses $m_{j}=8$ samples of $[\BFx_{j}]\in R^{m_{j}\times 1}$ and $[\BFxi_{j}]\in R^{m_{j}}$. Observations of $x_{k}$ and $x_{j}$ are i.i.d. standard normal, $x_{k},x_{j}\sim N(0,\ 1)$, and $\xi_{k}=x^{T}_{k}\theta^{*}+N(0,\sigma)$, with $\theta^{*}=1$ and $\sigma=0.01$. Furthermore, we set $\alpha=\frac{1}{2}$ in the objective function. The platform adopts Algorithm 1 to obtain $\hat{\theta}^{FL}_{|\BFtau_{\mathcal{A}}|}$. The FL algorithm parameters are given by $\Phi=\{\rho=0.1,\theta^{0}=2,T=55,H=10\}$. Agent $k$ (or $j$) could potentially participate under identities $k_{1}$ (or $j_{1}$) and $k_{2}$ (or $j_{2}$), with each fake identity contributing 4 samples.

Similarly, once an estimator $\hat{\BFtheta}$ is obtained, we look into how the quality of the estimator translates to the profits of decision-aware agents. To be more specific, we evaluate the out-of-sample profit, evaluated by the objective function $c$ . From figure 4, we observe that similar to the case of newsvendor, the quality of the estimator nicely translates into the profits for decision-aware agents. Here, unlike the newsvendor problem where we try to minimize loss, in portfolio optimization, the platform tries to maximize the profit, and data splitting directly leads to a potential decrease in the profits that the agents would gain.

6 Conclusions

In conclusion, our study has shed light on the intricate dynamics of collaborative learning in multi-agent systems through the lens of Federated Learning (FL) technology and Shapley value-based mechanisms. By establishing a comprehensive framework, we have underscored the critical role of platform-facilitated collaboration among decision-aware agents and delved into the nuanced impacts of mechanism design on both decision quality and FL algorithm efficiency.

Our investigation reveals that while Shapley value based mechanisms ensure fair allocation and guarantee quality decisions among agents by encouraging full participation, they inadvertently introduce significant communication costs during the FL process due to the agents’ dishonest behavior of false-name manipulation, highlighting a crucial trade-off between decision quality and operational efficiency. This discovery not only addresses a gap in existing research but also opens new avenues for exploring mechanism design that balances decision quality with the practicalities of implementation in FL environments.

Moreover, our work is an early effort to systematically explore the interplay between mechanism design and FL performance, offering insights for both theoreticians and practitioners interested in optimizing collaborative learning settings. The identification of the limitations of Shapley value mechanisms further enriches the study of collaborative learning, prompting a reevaluation of widely accepted practices and encouraging the development of more efficient, cost-effective solutions. Promising future directions include investigating mechanisms beyond the Shapley value and analyzing this framework in more specific business or operations contexts, such as pricing and inventory management.

References

AbdulRahman et al. (2020) AbdulRahman S, Tout H, Ould-Slimane H, Mourad A, Talhi C, Guizani M (2020) A survey on federated learning: The journey from centralized to distributed on-site learning and beyond. IEEE Internet of Things Journal 8(7):5476–5497.
Anily and Haviv (2010) Anily S, Haviv M (2010) Cooperation in service systems. Operations Research 58(3):660–673.
Arora and Jain (2023) Arora A, Jain T (2023) Data sharing between platform and seller: An analysis of contracts, privacy, and regulation. European Journal of Operational Research.
Aziz et al. (2011) Aziz H, Bachrach Y, Elkind E, Paterson M (2011) False-name manipulations in weighted voting games. Journal of Artificial Intelligence Research 40:57–93.
Bergantinos and Moreno-Ternero (2020) Bergantinos G, Moreno-Ternero JD (2020) Sharing the revenues from broadcasting sport events. Management Science 66(6):2417–2431.
Branzei et al. (2008) Branzei R, Dimitrov D, Tijs S (2008) Models in cooperative game theory, volume 556 (Springer Science & Business Media).
Choudhury (2023) Choudhury O (2023) Federated learning on aws with fedml: Health analytics without sharing sensitive data. AWS Machine Learning Blog, URL https://aws.amazon.com/blogs/machine-learning/federated-learning-on-aws-with-fedml-health-analytics-without-sharing-sensitive-data/.
Chraibi et al. (2019) Chraibi S, Khaled A, Kovalev D, Richtárik P, Salim A, Takáč M (2019) Distributed fixed point methods with compressed iterates. arXiv preprint arXiv:1912.09925.
Conitzer and Yokoo (2010) Conitzer V, Yokoo M (2010) Using mechanism design to prevent false-name manipulations. AI magazine 31(4):65–78.
Gafni and Tennenholtz (2022) Gafni Y, Tennenholtz M (2022) Long-term data sharing under exclusivity attacks. Proceedings of the 23rd ACM Conference on Economics and Computation, 739–759.
Ghorbani and Zou (2019) Ghorbani A, Zou J (2019) Data shapley: Equitable valuation of data for machine learning. International conference on machine learning, 2242–2251 (PMLR).
Gopalakrishnan et al. (2014) Gopalakrishnan R, Marden JR, Wierman A (2014) Potential games are necessary to ensure pure nash equilibria in cost sharing games. Mathematics of Operations Research 39(4):1252–1296.
Gopalakrishnan et al. (2021) Gopalakrishnan S, Granot D, Granot F, Sošić G, Cui H (2021) Incentives and emission responsibility allocation in supply chains. Management Science 67(7):4172–4190.
Gopalakrishnan and Sankaranarayanan (2023) Gopalakrishnan S, Sankaranarayanan S (2023) Cooperative security against interdependent risks. Production and Operations Management 32(11):3504–3520.
Hamer et al. (2020) Hamer J, Mohri M, Suresh AT (2020) Fedboost: A communication-efficient algorithm for federated learning. International Conference on Machine Learning, 3973–3983 (PMLR).
Hsiao (2004) Hsiao CR (2004) The power indices for multi-choice multi-valued games. Taiwanese Journal of Mathematics 8(2):259–270.
Hsiao and Raghavan (1993) Hsiao CR, Raghavan T (1993) Shapley value for multichoice cooperative games, i. Games and economic behavior 5(2):240–256.
Iwasaki et al. (2010) Iwasaki A, Conitzer V, Omori Y, Sakurai Y, Todo T, Guo M, Yokoo M (2010) Worst-case efficiency ratio in false-name-proof combinatorial auction mechanisms. Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, 633–640.
Jia et al. (2019) Jia R, Dao D, Wang B, Hubis FA, Hynes N, Gürel NM, Li B, Zhang C, Song D, Spanos CJ (2019) Towards efficient data valuation based on the shapley value. The 22nd International Conference on Artificial Intelligence and Statistics, 1167–1176 (PMLR).
Kairouz et al. (2021) Kairouz P, McMahan HB, Avent B, Bellet A, Bennis M, Bhagoji AN, Bonawitz K, Charles Z, Cormode G, Cummings R, et al. (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14(1–2):1–210.
Karimireddy et al. (2022) Karimireddy SP, Guo W, Jordan MI (2022) Mechanisms that incentivize data sharing in federated learning. arXiv preprint arXiv:2207.04557.
Kemahlıoğlu-Ziya and Bartholdi III (2011) Kemahlıoğlu-Ziya E, Bartholdi III JJ (2011) Centralizing inventory in supply chains by using shapley value to allocate the profits. Manufacturing & Service Operations Management 13(2):146–162.
Khaled et al. (2019) Khaled A, Mishchenko K, Richtárik P (2019) First analysis of local gd on heterogeneous data. arXiv preprint arXiv:1909.04715.
Konečnỳ et al. (2016) Konečnỳ J, McMahan HB, Yu FX, Richtárik P, Suresh AT, Bacon D (2016) Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
Leng et al. (2021) Leng M, Luo C, Liang L (2021) Multiplayer allocations in the presence of diminishing marginal contributions: Cooperative game analysis and applications in management science. Management Science 67(5):2891–2903.
Leng and Parlar (2009) Leng M, Parlar M (2009) Allocation of cost savings in a three-level supply chain with demand information sharing: A cooperative-game approach. Operations Research 57(1):200–213.
Mak and Max Shen (2021) Mak HY, Max Shen ZJ (2021) When triple-a supply chains meet digitalization: The case of jd.com’s c2m model. Production and Operations Management 30(3):656–665.
Mangasarian (1995) Mangasarian L (1995) Parallel gradient distribution in unconstrained optimization. SIAM Journal on Control and Optimization 33(6):1916–1925.
Masters (2019) Masters K (2019) Amazon releases free analytics that brands previously paid $30,000-plus per year for. Forbes URL https://www.forbes.com/sites/kirimasters/2019/02/12/amazon-releases-free-analytics-that-brands-previously-paid-30000-per-year-for/?sh=f48b36c44b4e, last Accessed on 19 August 2023.
McMahan et al. (2017) McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA (2017) Communication-efficient learning of deep networks from decentralized data. Artificial intelligence and statistics, 1273–1282 (PMLR).
Qi et al. (2021) Qi M, Grigas P, Shen ZJM (2021) Integrated conditional estimation-optimization. arXiv preprint arXiv:2110.12351 .
Qi et al. (2020) Qi M, Mak HY, Shen ZJM (2020) Data-driven research in retail operations—a review. Naval Research Logistics (NRL) 67(8):595–616.
Rozemberczki et al. (2022) Rozemberczki B, Watson L, Bayer P, Yang HT, Kiss O, Nilsson S, Sarkar R (2022) The shapley value in machine learning. arXiv preprint arXiv:2202.05594 .
Shamir et al. (2014) Shamir O, Srebro N, Zhang T (2014) Communication-efficient distributed optimization using an approximate newton-type method. International conference on machine learning, 1000–1008 (PMLR).
Shapley et al. (1953) Shapley LS, et al. (1953) A value for n-person games .
Sim et al. (2020) Sim RHL, Zhang Y, Chan MC, Low BKH (2020) Collaborative machine learning with incentive-aware model rewards. International conference on machine learning, 8927–8936 (PMLR).
Singal et al. (2019) Singal R, Besbes O, Desir A, Goyal V, Iyengar G (2019) Shapley meets uniform: An axiomatic framework for attribution in online advertising. The World Wide Web Conference, 1713–1723.
Stich (2018) Stich SU (2018) Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767 .
Yu et al. (2019) Yu H, Yang S, Zhu S (2019) Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5693–5700.
Yuan and Ma (2020) Yuan H, Ma T (2020) Federated accelerated stochastic gradient descent. Advances in Neural Information Processing Systems, volume 33, 5332–5344.
Zeng et al. (2021) Zeng R, Zeng C, Wang X, Li B, Chu X (2021) A comprehensive survey of incentive mechanism for federated learning. arXiv preprint arXiv:2106.15406 .
Zhan et al. (2021) Zhan Y, Zhang J, Hong Z, Wu L, Li P, Guo S (2021) A survey of incentive mechanism design for federated learning. IEEE Transactions on Emerging Topics in Computing 10(2):1035–1044.
Zhang et al. (2022) Zhang Z, Liu G, Wu J, Tan Y (2022) Data and algorithm pricing: Incentive mechanisms design for federated learning. Available at SSRN 4061980 .
Zinkevich et al. (2010) Zinkevich M, et al. (2010) Parallelized stochastic gradient descent. Advances in neural information processing systems, volume 23.

{APPENDICES}

7 Supplementary materials for Theorem 1

In this part, we let

\mathcal{T}(\BFm_{[2:K]}):=\{\BFtau:\BFtau_{k}\in\{0,\dots,m_{k}\},\forall k=2% ,\dots,K\}

which denotes the set of all possible profiles given $\BFm_{[2:K]}$ .

We also define

\nabla v^{t}_{|\BFtau|}=v(|\BFtau|+t)-v(|\BFtau|+t-1),

and

c^{k}_{t}(\BFtau)=\sum^{|M_{k}(\BFtau)|}_{l=0}C^{l}_{|M_{k}(\BFtau)|}(-1)^{l}% \frac{1}{|\BFtau|+l+t}.

For notational simplicity, when we analyze the incentive for data splitting for a specific agent $k$ in the proof, we let $k=1$ without loss of generality, omit the dependency on the agent index $k$ , and write $c^{k}_{t}(\BFtau)=c_{t}(\BFtau)$ . We first introduce the following lemma.

Lemma 7.1

For any $T$ and $T^{\prime}\in\{1,\dots,i\}$ , and any maximum data vector $\BFm_{[2:K]}$ , we have

\psi_{T,k_{1}}=\sum^{T}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(% \BFtau)\nabla v^{t}_{|\BFtau|}+\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}% \sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+% t}_{|\BFtau|}-\nabla v^{t_{1}+t-1}_{|\BFtau|})\right].

Similarly,

\psi_{T^{\prime},k_{2}}=\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm% _{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{|\BFtau|}+\sum^{T^{\prime}}_{t=1}\left[% \sum^{T}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau% )(\nabla v^{t_{1}+t}_{|\BFtau|}-\nabla v^{t_{1}+t-1}_{|\BFtau|})\right].

Proof.

Proof of Lemma 7.1

	$\displaystyle\psi_{T,k_{1}}$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(% \BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+t+t_{1}}\right)\nabla v^{t_{1}+t}_{\|% \BFtau\|}\right.$
	$\displaystyle\quad\left.+\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}t\left(\sum^% {\|M_{k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+t+T^% {\prime}}\right)\nabla v^{T^{\prime}+t}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(% \BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+t+t_{1}}+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{% 1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+t_{1}}\right)\right.$
	$\displaystyle\quad\left.\nabla v^{t_{1}+t}_{\|\BFtau\|}+\sum_{\BFtau\in\mathcal{% T}(\BFm_{[2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+T^{\prime}}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|}_{l=0}[C^{l-1}_{\|M_{k}(% \BFtau)\|}+C^{l}_{\|M_{k}(\BFtau)\|}](-1)^{l}\frac{1}{\|\BFtau\|+l+t+t_{1}}\right.\right.$
	$\displaystyle\quad\left.\left.+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{% k}(\BFtau)\|+1)+t+t_{1}}\right)\nabla v^{t_{1}+t}_{\|\BFtau\|}+\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+T^{\prime}}_{\|% \BFtau\|}\right],$

where we use the recurrence relation of the binomial coefficients. The previous expression equals

	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)\nabla v^{t_{1}+t}_{\|\BFtau\|}+% \sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}t\left(% \sum^{\|M_{k}(\BFtau)\|}_{l=1}C^{l-1}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|% +l+t+t_{1}}\right.\right.$
	$\displaystyle\quad\left.\left.+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{% k}(\BFtau)\|+1)+t+t_{1}}\right)\nabla v^{t_{1}+t}_{\|\BFtau\|}+\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+T^{\prime}}_{\|% \BFtau\|}\right].$

This equality follows from the definition of $c_{t+t_{1}}(\BFtau)$ . Changing the index from $l=1$ to $l=0$ , we have

	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)\nabla v^{t_{1}+t}_{\|\BFtau\|}+% \sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+% T^{\prime}}_{\|\BFtau\|}\right.$
	$\displaystyle\quad\left.-\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in\mathcal{% T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|}% \frac{(-1)^{l}}{\|\BFtau\|+l+1+t+t_{1}}\right)\nabla v^{t_{1}+t}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)\nabla v^{t_{1}+t}_{\|\BFtau\|}-% \sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_% {1}+1}(\BFtau)\nabla v^{t_{1}+t}_{\|\BFtau\|}+\sum_{\BFtau\in\mathcal{T}(\BFm_{[% 2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+T^{\prime}}_{\|\BFtau\|}\right].$

Similarly, the previous equality follows from the definitions of $c_{t+T^{\prime}}(\BFtau)$ and $c_{t+t_{1}+1}(\BFtau)$ . Shifting the index from $t_{1}=0$ to $t_{1}=1$ , we obtain

	$\displaystyle\psi_{T,k_{1}}$
		$\displaystyle=\sum^{T}_{t=1}\left[\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_% {t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)\nabla v^{t_{1}+t}_{\|\BFtau\|}-% \sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1% }}(\BFtau)\nabla v^{t_{1}+t-1}_{\|\BFtau\|}\right]$
		$\displaystyle=\sum^{T}_{t=1}\left[\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_% {t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{\|\BFtau\|}-% \nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right].$

Similarly, we have

\psi_{T^{\prime},k_{2}}=\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm% _{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{|\BFtau|}+\sum^{T^{\prime}}_{t=1}\left[% \sum^{T}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau% )(\nabla v^{t_{1}+t}_{|\BFtau|}-\nabla v^{t_{1}+t-1}_{|\BFtau|})\right].

∎

We further introduce the following lemma for our final proof.

Lemma 7.2

For any $K$ and $\BFm$ ,

\sum_{\BFtau\in\mathcal{T}(\BFm)}c_{t}(\BFtau)=\frac{1}{t}

Proof.

Proof of Lemma 7.2 We proceed by induction on $K$ . For the base case $K=1$ with $\BFm_{1}=[m_{1}]$ , we have

	$\displaystyle\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{1})}c_{t}(\BFtau)$	$\displaystyle=\sum^{m_{1}-1}_{j=0}\left[\sum^{1}_{l=0}C^{l}_{1}(-1)^{l}\frac{1% }{j+l+t}\right]+(-1)^{0}\frac{1}{m_{1}+t}$
		$\displaystyle=\sum^{m_{1}-1}_{j=0}\left[\frac{1}{j+t}-\frac{1}{j+t+1}\right]+% \frac{1}{m_{1}+t}=\frac{1}{t}.$

For the inductive step, suppose that the claim holds for $\mathbf{m}_{K}=[m_{1},\dots,m_{K}]$ , that is,

\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}c_{t}(\BFtau)=\frac{1}{t}.

Then with $\mathbf{m}_{K+1}=[m_{1},\dots,m_{K},m_{K+1}]$ , we have

	$\displaystyle\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K+1})}c_{t}(\BFtau)$	$\displaystyle=\sum^{m_{K+1}-1}_{j=0}\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})% }\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|% \BFtau\|+l+(t+j)}$
		$\displaystyle\quad\quad+\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\sum^{\|M_{k% }(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+m_{K+1})}$
		$\displaystyle=\sum^{m_{K+1}-1}_{j=0}\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})% }\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|% \BFtau\|+l+(t+j)}+\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}c_{t+m_{K+1}}(\BFtau)$
		$\displaystyle=\sum^{m_{K+1}-1}_{j=0}\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})% }\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|% \BFtau\|+l+(t+j)}+\frac{1}{t+m_{K+1}},$

where the last equality holds by the induction hypothesis. Defining $C^{-1}_{M}=0$ for all $M$ , we have

	$\displaystyle\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\sum^{\|M_{k}(\BFtau)\|+% 1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}$
	$\displaystyle=\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(% \BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}+(-1% )^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+j}\right]$
	$\displaystyle=\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(% \BFtau)\|}_{l=0}[C^{l-1}_{\|M_{k}(\BFtau)\|}+C^{l}_{\|M_{k}(\BFtau)\|}](-1)^{l}% \frac{1}{\|\BFtau\|+l+(t+j)}+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(% \BFtau)\|+1)+t+j}\right].$

Again, for the previous equations, we apply the recurrence relation of binomial coefficient. Moreover, for the above equation we have

	$\displaystyle=\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(% \BFtau)\|}_{l=1}C^{l-1}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}+(-1% )^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+j}\right]$
	$\displaystyle\quad+\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{% k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}\right]$

	$\displaystyle=\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(% \BFtau)\|}_{l=1}C^{l-1}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}+(-1% )^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+j}\right]$
	$\displaystyle\quad+\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}c_{t+j}(\BFtau)$
	$\displaystyle=\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(% \BFtau)\|}_{l=1}C^{l-1}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}+(-1% )^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+j}\right]+\frac{1% }{t+j}$

	$\displaystyle=\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(% \BFtau)\|-1}_{i^{\prime}=0}C^{i^{\prime}}_{\|M_{k}(\BFtau)\|}(-1)^{i^{\prime}+1}% \frac{1}{\|\BFtau\|+(i^{\prime}+1)+(t+j)}+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|% \BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+j}\right]$
	$\displaystyle\quad+\frac{1}{t+j}.$

The above relations hold by the definition of $c_{t+j}(\BFtau)$ and the induction hypothesis. Factoring out the common coefficient $(-1)$ , we have

	$\displaystyle=(-1)\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(\BFtau)\|-1}_{i^{\prime}=0}C^{i^{\prime}}_{\|M_{k}(\BFtau)\|}(-1)^{i^{\prime}}\frac{1}{\|\BFtau\|+(i^{\prime}+1)+(t+j)}+(-1)^{\|M_{k}(\BFtau)\|}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+j}\right]$
	$\displaystyle\quad+\frac{1}{t+j}$
	$\displaystyle=(-1)\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\left[\sum^{\|M_{k}(\BFtau)\|}_{i^{\prime}=0}C^{i^{\prime}}_{\|M_{k}(\BFtau)\|}(-1)^{i^{\prime}}\frac{1}{\|\BFtau\|+i^{\prime}+(t+j+1)}\right]+\frac{1}{t+j}$
	$\displaystyle=(-1)\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}c_{t+j+1}(\BFtau)+\frac{1}{t+j}=\frac{1}{t+j}-\frac{1}{t+j+1}.$

Hence

\displaystyle\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K+1})}c_{t}(\BFtau)=\sum^{% m_{K+1}-1}_{j=0}\left[\frac{1}{t+j}-\frac{1}{t+j+1}\right]+\frac{1}{t+m_{K+1}}% =\frac{1}{t}.

And the proof is complete. ∎
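Lemma 7.2 can be sanity-checked numerically on small instances. The sketch below uses exact rational arithmetic and enumerates every profile in $\mathcal{T}(\BFm)$ . It assumes that $M_{k}(\BFtau)$ is the set of agents $j$ with $\tau_{j}<m_{j}$ , which is how the set is used in the base case of the proof; the function names `c` and `total_weight` are illustrative only.

```python
from fractions import Fraction
from itertools import product
from math import comb


def c(t, tau, m):
    """c_t(tau) for a profile tau with per-agent maxima m.

    Assumption: |M_k(tau)| counts the agents j with tau_j < m_j.
    """
    size = sum(tau)  # |tau|
    free = sum(1 for tj, mj in zip(tau, m) if tj < mj)  # |M_k(tau)|
    return sum(
        Fraction((-1) ** l * comb(free, l), size + l + t)
        for l in range(free + 1)
    )


def total_weight(t, m):
    """Sum of c_t(tau) over all profiles tau in T(m)."""
    profiles = product(*(range(mj + 1) for mj in m))
    return sum(c(t, tau, m) for tau in profiles)


if __name__ == "__main__":
    for m in ([1], [1, 1], [2, 1, 3]):
        for t in (1, 2, 5):
            print(m, t, total_weight(t, m))
```

Under the stated assumption, each printed total should equal $1/t$ , independently of $\BFm$ .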

With Lemma 7.2, we can show that the following Lemma 7.3 holds.

Lemma 7.3

For any $K,T,T^{\prime}$ , and any maximum data vector $\BFm_{[2:K]}$ , we have

	$\displaystyle\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}% \left[tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}-(t+T)c_{t+T}(\BFtau)\nabla v^{t+T}% _{\|\BFtau\|}\right]$
	$\displaystyle=\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)\nabla v^{t_{1}+t}% _{\|\BFtau\|}-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\nabla v^{t_{1}+t+1}_{\|\BFtau\|}% \right].$

Proof.

Proof of Lemma 7.3 For any fixed $t_{1}\in\{1,\dots,T^{\prime}\}$ and $\BFtau\in\mathcal{T}(\BFm_{[2:K]})$ ,

	$\displaystyle t_{1}c_{t_{1}}(\BFtau)\nabla v^{t_{1}}_{\|\BFtau\|}-(t_{1}+T)c_{t_% {1}+T}(\BFtau)\nabla v^{t_{1}+T}_{\|\BFtau\|}$
	$\displaystyle=\sum_{t^{\prime}=t_{1}}^{t_{1}+T-1}\left[t^{\prime}c_{t^{\prime}% }(\BFtau)\nabla v^{t^{\prime}}_{\|\BFtau\|}-(t^{\prime}+1)c_{t^{\prime}+1}(% \BFtau)\nabla v^{t^{\prime}+1}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum_{t=0}^{T-1}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)\nabla v^{t_{1% }+t}_{\|\BFtau\|}-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\nabla v^{t_{1}+t+1}_{\|\BFtau\|% }\right].$

Thus the desired result follows. ∎

We introduce the final lemma before the main proof of the theorem.

Lemma 7.4

For any $x>0$ and $n\in\mathbb{N}^{+}$ , we have

\sum_{k=0}^{n}C_{n}^{k}\frac{(-1)^{k}}{x+k}=\frac{n!}{\Pi_{k=0}^{n}(x+k)}.

Proof.

Proof of Lemma 7.4 We consider the partial fraction expansion of

H(x):=\frac{1}{\Pi_{k=0}^{n}(x+k)}.

Since $x=0,-1,\dots,-n$ are simple poles of $H(x)$ , there exists a decomposition

H(x)=\sum_{k=0}^{n}\frac{a_{k}}{x+k},

and

	$\displaystyle a_{k}$	$\displaystyle=\lim_{x\rightarrow-k}(x+k)H(x)$
		$\displaystyle=\frac{1}{(-k)(-k+1)\cdots(-1)}\cdot\frac{1}{(1)\cdots(-k+n)}$
		$\displaystyle=\frac{(-1)^{k}}{k!(n-k)!}.$

Therefore, we have

	$\displaystyle n!H(x)$	$\displaystyle=\sum_{k=0}^{n}\frac{(-1)^{k}n!}{k!(n-k)!}\frac{1}{x+k}$
		$\displaystyle=\sum_{k=0}^{n}C_{n}^{k}\frac{(-1)^{k}}{x+k}.$

∎
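The identity in Lemma 7.4 is easy to confirm on concrete values. The following minimal sketch compares both sides with exact fractions; the helper names `alternating_sum` and `closed_form` are our own and are not part of the paper.

```python
from fractions import Fraction
from math import comb, factorial


def alternating_sum(n, x):
    """Left-hand side: sum_{k=0}^{n} C(n, k) (-1)^k / (x + k)."""
    return sum(
        Fraction((-1) ** k * comb(n, k)) / (x + k) for k in range(n + 1)
    )


def closed_form(n, x):
    """Right-hand side: n! / prod_{k=0}^{n} (x + k)."""
    denom = Fraction(1)
    for k in range(n + 1):
        denom *= x + k
    return Fraction(factorial(n)) / denom


if __name__ == "__main__":
    for n in range(1, 6):
        for x in (1, Fraction(3, 2), 7):
            assert alternating_sum(n, x) == closed_form(n, x)
    print("identity holds on all tested (n, x)")
```

In the main proof the lemma is applied with $x=|\BFtau|+t\geq 1$ , so the tested range of positive $x$ covers the relevant case.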

With all the above lemmas, we are now ready to prove the main theorem.

Proof.

Proof of Theorem 4.1

By Lemma 7.1, we have

	$\displaystyle\psi_{T,k_{1}}+\psi_{T^{\prime},k_{2}}$	$\displaystyle=\sum^{T}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(% \BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}% \sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+% t}_{\|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right]$
		$\displaystyle\quad+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:% K]})}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t=1}\left[\sum^{% T}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)(% \nabla v^{t_{1}+t}_{\|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right]$
		$\displaystyle=\sum^{T}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(% \BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{% T}(\BFm_{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}$
		$\displaystyle\quad+\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau% \in\mathcal{T}(\BFm_{[2:K]})}(t+t_{1})c_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{% \|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right].$

Noting that

	$\displaystyle\psi_{T+T^{\prime},k}$	$\displaystyle=\sum^{T+T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}% )}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}$
		$\displaystyle=\sum^{T}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(% \BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{% T}(\BFm_{[2:K]})}(t+T)c_{t+T}(\BFtau)\nabla v^{t+T}_{\|\BFtau\|}.$

Thus, we have

$\displaystyle\psi_{T,k_{1}}+\psi_{T^{\prime},k_{2}}-\psi_{T+T^{\prime},k}$	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}(t+t_{1})c_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{\|% \BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right]$
	$\displaystyle\quad+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:% K]})}\left[tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}-(t+T)c_{t+T}(\BFtau)\nabla v^% {t+T}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T-1}_{t=0}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}(t+t_{1}+1)c_{t+t_{1}+1}(\BFtau)(\nabla v^{t_{1}+t+1% }_{\|\BFtau\|}-\nabla v^{t_{1}+t}_{\|\BFtau\|})\right]$
	$\displaystyle\quad+\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)\nabla v^{t_{1}+t}% _{\|\BFtau\|}-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\nabla v^{t_{1}+t+1}_{\|\BFtau\|}\right]$	(10)
	$\displaystyle=\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in% \mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{% 1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}.$	(11)

where (10) holds according to Lemma 7.3.

We first prove (ii). If $v(\cdot)$ is a linear function, there exists a constant $v_{0}$ such that $\nabla v^{t}_{|\BFtau|}=v_{0}$ for any $t$ and $\BFtau$ . Therefore, we have

$\displaystyle(11)$	$\displaystyle=\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]v_{0}$
	$\displaystyle=\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\left[(t_{1}+t)\sum_{% \BFtau\in\mathcal{T}(\BFm_{[2:K]})}c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)\sum_{\BFtau% \in\mathcal{T}(\BFm_{[2:K]})}c_{t_{1}+t+1}(\BFtau)\right]v_{0}$
	$\displaystyle=\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\left[(t_{1}+t)\frac{% 1}{t_{1}+t}-(t_{1}+t+1)\frac{1}{t_{1}+t+1}\right]v_{0}$	(12)
	$\displaystyle=0.$

where (12) holds due to Lemma 7.2.

Now we prove (i). If $v(\cdot)$ is concave, then according to Lemma 7.4, one knows that, for any $t$ , $c_{t}(\BFtau)=\sum^{|M_{k}(\BFtau)|}_{l=0}C^{l}_{|M_{k}(\BFtau)|}(-1)^{l}\frac% {1}{|\BFtau|+l+t}=\frac{|M_{k}(\BFtau)|!}{\Pi_{k=0}^{|M_{k}(\BFtau)|}(|\BFtau|% +t+k)}$ . Therefore, we have

\displaystyle\frac{tc_{t}(\BFtau)}{(t+1)c_{t+1}(\BFtau)}=\frac{t\Pi_{k=0}^{|M_% {k}(\BFtau)|}(|\BFtau|+t+k+1)}{(t+1)\Pi_{k=0}^{|M_{k}(\BFtau)|}(|\BFtau|+t+k)}.

Note that if $|\BFtau|=0$ , then

\frac{t\Pi_{k=0}^{|M_{k}(\BFtau)|}(|\BFtau|+t+k+1)}{(t+1)\Pi_{k=0}^{|M_{k}(% \BFtau)|}(|\BFtau|+t+k)}=\frac{\Pi_{k=1}^{|M_{k}(\BFtau)|}(|\BFtau|+t+k+1)}{% \Pi_{k=1}^{|M_{k}(\BFtau)|}(|\BFtau|+t+k)}>1.

Moreover, $\frac{t\Pi_{k=0}^{|M_{k}(\BFtau)|}(|\BFtau|+t+k+1)}{(t+1)\Pi_{k=0}^{|M_{k}(% \BFtau)|}(|\BFtau|+t+k)}$ decreases as $|\BFtau|$ grows larger. We let $\bar{|\tau|}_{t}\in\mathbb{R}$ denote the constant that satisfies

\frac{t\Pi_{k=0}^{|M_{k}(\BFtau)|}(\bar{|\tau|}_{t}+t+k+1)}{(t+1)\Pi_{k=0}^{|M% _{k}(\BFtau)|}(\bar{|\tau|}_{t}+t+k)}=1.

Let $T^{+}:=\{(t,t_{1}):t=0,\dots,T-1;\ t_{1}=1,\dots,T^{\prime};\ \max_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}|\BFtau|<\bar{|\tau|}_{t+t_{1}}\}$ and let $T^{-}=\{(t,t_{1}):t=0,\dots,T-1;\ t_{1}=1,\dots,T^{\prime}\}\setminus T^{+}$ .

Therefore, we have

	$\displaystyle(11)=$	$\displaystyle\sum_{t,t_{1}\in T^{+}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
		$\displaystyle+\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}% \left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]% \nabla v^{t_{1}+t}_{\|\BFtau\|}$
	$\displaystyle\geq$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}% \left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]% \nabla v^{t_{1}+t}_{\|\BFtau\|}$
	$\displaystyle=$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),\|% \BFtau\|\leq\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+% 1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
		$\displaystyle+\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),% \|\BFtau\|>\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)% c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
	$\displaystyle>$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),\|% \BFtau\|\leq\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+% 1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\bar{\|\tau\|}_{t+t_{1}}}$
		$\displaystyle+\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),% \|\BFtau\|>\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)% c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\bar{\|\tau\|}_{t+t_{1}}}$
	$\displaystyle=$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}% \left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]% \nabla v^{t_{1}+t}_{\bar{\|\tau\|}_{t+t_{1}}}=0.$

It is straightforward to show that $\psi_{i,k}(v)>0$ for any $i>0$ . Hence, recursively applying Theorem 4.1 to each fake identity, the equilibrium participation profile for an agent $k$ with $m_{k}$ data samples takes the form $\tau_{\hat{k}}=1,\ \hat{k}=1,\dots,m_{k}$ . In other words, the agent fully participates in the coalition but splits all of its data samples across $m_{k}$ fake identities, each contributing one sample.

∎
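To make the two cases of the theorem concrete, the sketch below evaluates the payoff expression $\psi_{T}=\sum^{T}_{t=1}\sum_{\BFtau}tc_{t}(\BFtau)\nabla v^{t}_{|\BFtau|}$ used in the proof and compares an agent's payoff before and after splitting. It is an illustration under two assumptions that we state explicitly: $M_{k}(\BFtau)$ is taken to be the set of other agents with $\tau_{j}<m_{j}$ , and a split identity is modeled by appending the other fake identity's sample count to the vector of the remaining agents. The names `psi` and `split_gain` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def c(t, tau, m):
    """c_t(tau); assumes |M_k(tau)| counts the agents j with tau_j < m_j."""
    size = sum(tau)
    free = sum(1 for tj, mj in zip(tau, m) if tj < mj)
    return sum(
        Fraction((-1) ** l * comb(free, l), size + l + t)
        for l in range(free + 1)
    )


def psi(T, others, v):
    """Payoff of an identity holding T samples; `others` lists the maxima
    of all remaining identities, and v maps a sample count to a value."""
    total = Fraction(0)
    for tau in product(*(range(mj + 1) for mj in others)):
        s = sum(tau)
        for t in range(1, T + 1):
            total += t * c(t, tau, others) * (v(s + t) - v(s + t - 1))
    return total


def split_gain(T, T2, others, v):
    """Payoff change when T + T2 samples are split into identities T and T2."""
    together = psi(T + T2, others, v)
    apart = psi(T, others + [T2], v) + psi(T2, others + [T], v)
    return apart - together


if __name__ == "__main__":
    concave = lambda n: Fraction(n, n + 1)  # illustrative concave v
    linear = lambda n: Fraction(2 * n)      # illustrative linear v
    print(split_gain(1, 1, [1], concave))
    print(split_gain(1, 1, [1], linear))
```

With $v(n)=n/(n+1)$ and one other agent holding a single sample, splitting two samples into two identities raises the payoff from $4/9$ to $1/2$ , a gain of $1/18$ ; with a linear $v$ the gain is zero, in line with case (ii).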

8 Supplementary material for Section 3

In this section, we provide more details on the characteristic function $v(|\BFtau_{\mathcal{A}}|)$ as a function of the sample size $|\BFtau_{\mathcal{A}}|$ alone. We first present an analytical form of $v(|\BFtau_{\mathcal{A}}|)$ for a pricing example where a closed-form solution of $w^{*}$ is available. For the general case where $w^{*}$ does not have a closed-form solution, we offer an approach based on Lipschitz assumptions on the reward and decision functions.

Example 8.1 (Pricing under uncertainty)

Consider agents who try to predict the unknown consumer willingness-to-pay $y$ , with $y\sim F_{\theta^{*}}$ and $\theta^{*}$ an unknown distribution parameter. Specifically, we assume $F_{\theta^{*}}\sim\exp(\theta^{*})$ . Each agent’s decision is to set a price that maximizes the expected surplus: $\max_{p}\ \mathbb{E}_{y\sim F_{\theta^{*}}}\big[I(y\geq p)p\big]=\max_{p}\ p(1-F_{\theta^{*}}(p))$ . For $|\BFtau_{\mathcal{A}}|$ i.i.d. data samples $\mathbf{y}_{\BFtau_{\mathcal{A}}}$ , with $y_{j}\sim F_{\theta^{*}}$ , let $\hat{\theta}$ be the coalition sample mean and $\hat{F}$ be the predicted distribution of $y$ under $\hat{\theta}$ . Then we have

p^{*}(\hat{F})\sim\frac{1}{|\BFtau_{\mathcal{A}}|}Erlang(|\BFtau_{\mathcal{A}}% |,\lambda)=Erlang(|\BFtau_{\mathcal{A}}|,|\BFtau_{\mathcal{A}}|\lambda).

Let $X_{|\BFtau_{\mathcal{A}}|,\lambda}\sim Erlang(|\BFtau_{\mathcal{A}}|,|\BFtau_{% \mathcal{A}}|\lambda)$ , we have

p^{*}(\hat{F})(1-F_{\theta^{*}}(p^{*}(\hat{F})))\sim X_{|\BFtau_{\mathcal{A}}|% ,\lambda}exp(-\lambda X_{|\BFtau_{\mathcal{A}}|,\lambda}).

For simplicity, let $\theta^{*}=1$ (so that $\lambda=1$ ), and

v(|\BFtau_{\mathcal{A}}|)=\int^{\infty}_{0}xexp(-\lambda x)\frac{(\lambda|% \BFtau_{\mathcal{A}}|)^{|\BFtau_{\mathcal{A}}|}x^{|\BFtau_{\mathcal{A}}|-1}exp% (-\lambda|\BFtau_{\mathcal{A}}|x)}{(|\BFtau_{\mathcal{A}}|-1)!}dx=\left(\frac{% |\BFtau_{\mathcal{A}}|}{|\BFtau_{\mathcal{A}}|+1}\right)^{|\BFtau_{\mathcal{A}% }|+1},

where $v(|\BFtau_{\mathcal{A}}|)$ is a strictly increasing and concave function. It’s worth mentioning that in this example, we assume the platform could get an accurate sample mean, without considering the loss in performing FL.
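The closed form above can be tabulated directly. The following sketch evaluates $v(n)=\left(\frac{n}{n+1}\right)^{n+1}$ with exact fractions and checks monotonicity and concavity through first and second differences on a finite range. It takes $\lambda=1$ as in the example; the range $n\leq 30$ and the helper names are arbitrary choices for illustration, so this is a finite check rather than a proof.

```python
from fractions import Fraction


def v(n):
    """Characteristic function of the pricing example with lambda = 1."""
    return Fraction(n, n + 1) ** (n + 1)


def first_differences(n_max):
    """Marginal values v(n + 1) - v(n) for n = 1, ..., n_max - 1."""
    return [v(n + 1) - v(n) for n in range(1, n_max)]


if __name__ == "__main__":
    diffs = first_differences(30)
    increasing = all(d > 0 for d in diffs)
    concave = all(b < a for a, b in zip(diffs, diffs[1:]))
    print(float(v(1)), float(v(2)), float(v(30)))
    print("increasing:", increasing, "concave:", concave)
```

The marginal value of an extra sample shrinks as the coalition grows, which is the diminishing-returns property that drives the data-splitting incentive in Theorem 4.1.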

In general, it is often difficult to obtain a closed-form solution. For such cases, we provide an expression of the characteristic function based on the Lipschitz assumption. We first state the Lipschitzness assumptions in the following:

Assumption 8.1

Lipschitzness assumptions for the decision-aware objective are as follows

(A) (Lipschitzness of $\BFw^{*}$ ) For any $\BFtheta_{1},\BFtheta_{2}\in\BFTheta$ , we have

\|\BFw^{*}(\BFtheta_{1})-\BFw^{*}(\BFtheta_{2})\|\leq L_{w}\|\BFtheta_{1}-\BFtheta_{2}\|.

(B) (Lipschitzness of $r$ ) The reward is Lipschitz with respect to the decision $\BFw$ :

|r(\BFw_{1},\BFy)-r(\BFw_{2},\BFy)|\leq L_{r}\|\BFw_{1}-\BFw_{2}\|.

Assumption 8.1.A states that the optimal solution $\BFw^{*}(\cdot)$ is $L_{w}$ -Lipschitz with respect to the parameter $\BFtheta$ . It can be further justified when $F_{\theta}$ is a discrete distribution and $r$ is strongly concave with respect to $w$ (Qi et al. 2021). Assumption 8.1.B is a common assumption that the reward function is Lipschitz with respect to the decision.

Under these assumptions, we have

|\pi(\hat{\BFtheta}_{|\BFtau_{\mathcal{A}}|})-z^{*}|\leq L_{r}L_{w}\|\hat{% \BFtheta}_{|\BFtau_{\mathcal{A}}|}-\BFtheta^{*}\|.

Then for any $\BFtau_{\mathcal{A}}$ , we can represent

z^{*}-v(|\BFtau_{\mathcal{A}}|)\leq L_{r}L_{w}\|\hat{\BFtheta}_{|\BFtau_{% \mathcal{A}}|}-\BFtheta^{*}\|=L_{r,w}\|\hat{\BFtheta}_{|\BFtau_{\mathcal{A}}|}% -\BFtheta^{*}\|,

where $L_{r,w}=L_{r}L_{w}$ .

9 Proof for Proposition 5.2

In this proof, we omit the dependency on $\BFtau_{\mathcal{A}}$ in $\hat{\BFtheta}^{FL}_{\BFtau_{\mathcal{A}}}$ for notational simplicity and use $\hat{\BFtheta}_{FL}$ to denote $\hat{\BFtheta}^{FL}_{\BFtau_{\mathcal{A}}}$ . Under the Shapley equilibrium, all agents provide data but split each sample into a separate agent identity. Hence $\BFtau=[1,\dots,1]\in\mathbb{R}^{|\BFm|\times 1}$ . By Assumption 5.1, for some $\mu>0$ and all $\BFx,\ \BFy$ , $L(\cdot)$ satisfies

L(\BFy)\geq L(\BFx)+\nabla L(\BFx)^{T}(\BFy-\BFx)+\frac{\mu}{2}\|\BFy-\BFx\|^{% 2}.

Hence, for $\BFtheta^{*}_{FL}=\argmin_{\BFtheta}L(\BFtheta)$ ,

L(\hat{\BFtheta}_{FL})-L(\BFtheta^{*}_{FL})\leq\epsilon\ \Rightarrow\ \|\BFtheta^{*}_{FL}-\hat{\BFtheta}_{FL}\|\leq\left(\frac{2\epsilon}{\mu}\right)^{1/2}.	(13)

We now consider the auxiliary training result $\hat{\BFtheta}^{a}_{FL}=\frac{1}{T}\sum^{T-1}_{t=0}\frac{1}{|\BFm|}\sum^{|\BFm|}_{k=1}\BFtheta^{t}_{k}$ (which is not actually computed in the algorithm but is useful for the analysis). Let $\hat{\BFtheta}_{FL}=\frac{1}{T}\sum^{T-1}_{t=0}\BFtheta^{t}$ , with $\BFtheta^{t}=\BFtheta^{t-1}$ if $t\notin\{H,2H,\dots,T-1\}$ . Then

$\displaystyle\|\hat{\BFtheta}^{a}_{FL}-\hat{\BFtheta}_{FL}\|$	$\displaystyle=\|\frac{1}{T}\sum^{T}_{t=1}\left(\frac{1}{|\BFm|}\sum^{|\BFm|}_{k=1}\BFtheta^{t}_{k}-\BFtheta^{t}\right)\|$	(14)
	$\displaystyle\leq\frac{1}{T}\sum^{T}_{t=1}\|\frac{1}{|\BFm|}\sum^{|\BFm|}_{k=1}\BFtheta^{t}_{k}-\BFtheta^{t}\|$
	$\displaystyle\leq\frac{1}{T}\sum_{t^{\prime}\in\{H,2H,\dots,T-1\}}\sum^{H-1}_{j=1}\|\frac{1}{|\BFm|}\sum^{|\BFm|}_{k=1}\BFtheta^{t^{\prime}+j}_{k}-\BFtheta^{t^{\prime}}\|$
	$\displaystyle=\frac{1}{T}\sum_{t^{\prime}\in\{H,2H,\dots,T-1\}}\sum^{H-1}_{j=1}\|\frac{1}{|\BFm|}\sum^{|\BFm|}_{k=1}(\BFtheta^{t^{\prime}+j-1}_{k}-\rho\nabla L_{k}(\BFtheta^{t^{\prime}+j-1}_{k}))-\BFtheta^{t^{\prime}}\|$
	$\displaystyle\leq\frac{1}{T}\frac{T}{H}\sum^{H-1}_{j=1}j\rho\xi\leq\frac{\rho H\xi}{2}.$

Define $\sigma^{2}=\frac{1}{|\BFm|}\sum^{|\BFm|}_{k=1}\|\nabla L_{k}(\BFtheta^{*}_{FL})\|^{2}$ , and let $L$ be the smoothness parameter such that each $L_{k}$ satisfies

0\leq L_{k}(\BFx)-L_{k}(\BFy)-\langle\nabla L_{k}(\BFy),\BFx-\BFy\rangle\leq% \frac{L}{2}\|\BFx-\BFy\|^{2}.

Following Corollary 1 of Khaled et al. (2019), under the data-splitting equilibrium and for large $|\BFm|$ of the same order as $T$ , we choose $H=T^{1/4}|\BFm|^{-3/4}$ so that the total number of communications has the same order of dependency on $T$ and $|\BFm|$ . With the optimized step size $\rho=\frac{\sqrt{|\BFm|}}{4L\sqrt{T}}$ , this leads to

L(\hat{\BFtheta}^{a}_{FL})-L({\BFtheta}^{*}_{FL})\leq\left(8L\|\BFtheta_{0}-% \BFtheta^{*}\|^{2}+\frac{3\sigma^{2}}{2L}\right)\frac{1}{\sqrt{T|\BFm|}},

which, by equation (13), implies

\|\BFtheta^{*}_{FL}-\hat{\BFtheta}^{a}_{FL}\|\leq\left(\frac{16L\|\BFtheta_{0}% -\BFtheta^{*}\|^{2}}{\mu}+\frac{3\sigma^{2}}{L\mu}\right)^{1/2}(T|\BFm|)^{-1/4}.

Moreover, by equation (14),

\|\hat{\BFtheta}^{a}_{FL}-\hat{\BFtheta}_{FL}\|\leq\frac{\xi}{8L}(T|\BFm|)^{-1% /4}.

Hence,

\|\BFtheta^{*}_{FL}-\hat{\BFtheta}_{FL}\|\leq\left(\left(\frac{16L\|\BFtheta_{% 0}-\BFtheta^{*}\|^{2}}{\mu}+\frac{3\sigma^{2}}{L\mu}\right)^{1/2}+\frac{\xi}{8% L}\right)(T|\BFm|)^{-1/4}.

We want to guarantee ${P(\|\hat{\BFtheta}_{FL}-\BFtheta^{*}\|\geq\varepsilon(|\BFm|,\delta_{0}))\leq\delta_{0}}$ . Suppose the MLE optimal estimator $\BFtheta^{*}_{FL}$ satisfies ${P(\|\BFtheta^{*}_{FL}-\BFtheta^{*}\|\geq\frac{\varepsilon(|\BFm|,\delta_{0})}{2})\leq p_{0}}$ . Then a sufficient condition is

\|\hat{\BFtheta}_{FL}-\hat{\BFtheta}\|\leq\left(\left(\frac{16L\|\BFtheta_{0}-% \BFtheta^{*}\|^{2}}{\mu}+\frac{3\sigma^{2}}{L\mu}\right)^{1/2}+\frac{\xi}{8L}% \right)(T|\BFm|)^{-1/4}\leq\frac{\varepsilon(|\BFm|,\delta_{0})}{2}.

Hence, $N_{sync}=\frac{T}{H}=(T|\BFm|)^{3/4}$ should satisfy

{N_{sync}=(T|\BFm|)^{3/4}\geq\left(\left(\frac{64L\|\BFtheta_{0}-\BFtheta^{*}\|^{2}}{\mu}+\frac{12\sigma^{2}}{L\mu}\right)^{1/2}+\frac{\xi}{4L}\right)^{3}(\varepsilon(|\BFm|,\delta_{0}))^{-3}.}	(15)
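The bound (15) follows from requiring the FL error bound to be at most $\varepsilon/2$ and cubing the resulting inequality on $(T|\BFm|)^{1/4}$ . The sketch below evaluates both expressions so the algebra can be checked numerically. All parameter values are made up for illustration, `dist0_sq` stands for $\|\BFtheta_{0}-\BFtheta^{*}\|^{2}$ , `eps` stands for $\varepsilon(|\BFm|,\delta_{0})$ , and the function names are hypothetical.

```python
import math


def fl_error_bound(L, mu, sigma2, xi, dist0_sq, T, m):
    """Upper bound on ||theta*_FL - theta_hat_FL|| as a function of T and |m|."""
    a = math.sqrt(16 * L * dist0_sq / mu + 3 * sigma2 / (L * mu)) + xi / (8 * L)
    return a * (T * m) ** (-0.25)


def required_syncs(L, mu, sigma2, xi, dist0_sq, eps):
    """Right-hand side of (15): lower bound on N_sync = (T |m|)^(3/4)."""
    a = math.sqrt(64 * L * dist0_sq / mu + 12 * sigma2 / (L * mu)) + xi / (4 * L)
    return (a / eps) ** 3


if __name__ == "__main__":
    params = dict(L=2.0, mu=0.5, sigma2=1.0, xi=3.0, dist0_sq=4.0)
    eps = 0.1
    n_sync = required_syncs(eps=eps, **params)
    # At the threshold, T * |m| = n_sync^(4/3) and the error bound is eps / 2.
    tm = n_sync ** (4.0 / 3.0)
    print(n_sync, fl_error_bound(T=tm, m=1, **params), eps / 2)
```

Because the required number of synchronizations scales as $\varepsilon^{-3}$ , tightening the target accuracy by a factor of two multiplies the lower bound on $N_{sync}$ by eight.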

	$\displaystyle\psi_{T,k_{1}}$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+t+t_{1}}\right)\nabla v^{t_{1}+t}_{\|\BFtau\|}\right.$
	$\displaystyle\quad\left.+\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+t+T^{\prime}}\right)\nabla v^{T^{\prime}+t}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+t+t_{1}}+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+t_{1}}\right)\right.$
	$\displaystyle\quad\left.\nabla v^{t_{1}+t}_{\|\BFtau\|}+\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+T^{\prime}}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}-1}_{t_{1}=0}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}t\left(\sum^{\|M_{k}(\BFtau)\|}_{l=0}[C^{l-1}_{\|M_{k}(\BFtau)\|}+C^{l}_{\|M_{k}(\BFtau)\|}](-1)^{l}\frac{1}{\|\BFtau\|+l+t+t_{1}}\right.\right.$
	$\displaystyle\quad\left.\left.+(-1)^{\|M_{k}(\BFtau)\|+1}\frac{1}{\|\BFtau\|+(\|M_{k}(\BFtau)\|+1)+t+t_{1}}\right)\nabla v^{t_{1}+t}_{\|\BFtau\|}+\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+T^{\prime}}(\BFtau)\nabla v^{t+T^{\prime}}_{\|\BFtau\|}\right],$
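
The last equality in the chain above appears to rest on Pascal's rule. As a brief reminder, reading $C^{l}_{n}$ as the binomial coefficient "$n$ choose $l$" and writing $n=\|M_{k}(\BFtau)\|$ (a shorthand used only in this note), with the convention $C^{-1}_{n}=0$:

\begin{align*}
C^{l}_{n+1} &= C^{l-1}_{n}+C^{l}_{n}, \qquad 0\leq l\leq n,\\
C^{n+1}_{n+1} &= 1.
\end{align*}

The second identity is why the $l=n+1$ term is split off with coefficient $(-1)^{n+1}$ before the rule is applied to the remaining terms.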

	$\displaystyle\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K+1})}c_{t}(\BFtau)$	$\displaystyle=\sum^{m_{K+1}-1}_{j=0}\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}$
		$\displaystyle\quad\quad+\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\sum^{\|M_{k}(\BFtau)\|}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+m_{K+1})}$
		$\displaystyle=\sum^{m_{K+1}-1}_{j=0}\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}+\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}c_{t+m_{K+1}}(\BFtau)$
		$\displaystyle=\sum^{m_{K+1}-1}_{j=0}\sum_{\BFtau\in\mathcal{T}(\mathbf{m}_{K})}\sum^{\|M_{k}(\BFtau)\|+1}_{l=0}C^{l}_{\|M_{k}(\BFtau)\|+1}(-1)^{l}\frac{1}{\|\BFtau\|+l+(t+j)}+\frac{1}{t+m_{K+1}},$

	$\displaystyle\psi_{T,k_{1}}+\psi_{T^{\prime},k_{2}}$	$\displaystyle=\sum^{T}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{\|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right]$
		$\displaystyle\quad+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t=1}\left[\sum^{T}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{\|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right]$
		$\displaystyle=\sum^{T}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}$
		$\displaystyle\quad+\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}(t+t_{1})c_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{\|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right].$

$\displaystyle\psi_{T,k_{1}}+\psi_{T^{\prime},k_{2}}-\psi_{T+T^{\prime},k}$	$\displaystyle=\sum^{T}_{t=1}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}(t+t_{1})c_{t+t_{1}}(\BFtau)(\nabla v^{t_{1}+t}_{\|\BFtau\|}-\nabla v^{t_{1}+t-1}_{\|\BFtau\|})\right]$
	$\displaystyle\quad+\sum^{T^{\prime}}_{t=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[tc_{t}(\BFtau)\nabla v^{t}_{\|\BFtau\|}-(t+T)c_{t+T}(\BFtau)\nabla v^{t+T}_{\|\BFtau\|}\right]$
	$\displaystyle=\sum^{T-1}_{t=0}\left[\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}(t+t_{1}+1)c_{t+t_{1}+1}(\BFtau)(\nabla v^{t_{1}+t+1}_{\|\BFtau\|}-\nabla v^{t_{1}+t}_{\|\BFtau\|})\right]$
	$\displaystyle\quad+\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)\nabla v^{t_{1}+t}_{\|\BFtau\|}-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\nabla v^{t_{1}+t+1}_{\|\BFtau\|}\right]$	(10)
	$\displaystyle=\sum_{t=0}^{T-1}\sum^{T^{\prime}}_{t_{1}=1}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}.$	(11)
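
The passage from (10) to (11) can be checked termwise. The following sketch uses shorthands that are not in the original: $s=t+t_{1}$, $a_{s}=s\,c_{s}(\BFtau)$ and $g_{s}=\nabla v^{s}_{\|\BFtau\|}$, for a fixed triple $(t,t_{1},\BFtau)$.

\begin{align*}
a_{s+1}\left(g_{s+1}-g_{s}\right)+\left(a_{s}g_{s}-a_{s+1}g_{s+1}\right)
&= a_{s+1}g_{s+1}-a_{s+1}g_{s}+a_{s}g_{s}-a_{s+1}g_{s+1}\\
&= \left(a_{s}-a_{s+1}\right)g_{s}.
\end{align*}

In the same notation, the second sum in (10) is a telescoped form of the boundary terms in the line before it: for fixed $t_{1}$,

\begin{align*}
\sum_{t=0}^{T-1}\left(a_{t_{1}+t}\,g_{t_{1}+t}-a_{t_{1}+t+1}\,g_{t_{1}+t+1}\right)
= a_{t_{1}}g_{t_{1}}-a_{t_{1}+T}\,g_{t_{1}+T}.
\end{align*}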

	$\displaystyle\eqref{eq: thm1-2}=$	$\displaystyle\sum_{t,t_{1}\in T^{+}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
		$\displaystyle+\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
	$\displaystyle\geq$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
	$\displaystyle=$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),\|\BFtau\|\leq\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
		$\displaystyle+\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),\|\BFtau\|>\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\|\BFtau\|}$
	$\displaystyle>$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),\|\BFtau\|\leq\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\bar{\|\tau\|}_{t+t_{1}}}$
		$\displaystyle+\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]}),\|\BFtau\|>\bar{\|\tau\|}_{t+t_{1}}}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\bar{\|\tau\|}_{t+t_{1}}}$
	$\displaystyle=$	$\displaystyle\sum_{t,t_{1}\in T^{-}}\sum_{\BFtau\in\mathcal{T}(\BFm_{[2:K]})}\left[(t_{1}+t)c_{t_{1}+t}(\BFtau)-(t_{1}+t+1)c_{t_{1}+t+1}(\BFtau)\right]\nabla v^{t_{1}+t}_{\bar{\|\tau\|}_{t+t_{1}}}=0.$
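
One way to read the final "$=0$" is sketched below. It assumes the normalization $\sum_{\BFtau}c_{s}(\BFtau)=1/s$ over the relevant index set, which the recursion for $\sum_{\BFtau}c_{t}(\BFtau)$ appears to establish but which is taken here as an assumption; $s=t+t_{1}$ is a shorthand introduced only for this note. Since $\nabla v^{s}_{\bar{\|\tau\|}_{s}}$ does not depend on $\BFtau$, it factors out of the inner sum:

\begin{align*}
\sum_{\BFtau}\left[s\,c_{s}(\BFtau)-(s+1)\,c_{s+1}(\BFtau)\right]\nabla v^{s}_{\bar{\|\tau\|}_{s}}
&=\left(s\cdot\frac{1}{s}-(s+1)\cdot\frac{1}{s+1}\right)\nabla v^{s}_{\bar{\|\tau\|}_{s}}
=0.
\end{align*}

Under that assumption every $(t,t_{1})$ summand vanishes, so the whole sum over $T^{-}$ is zero.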
