Delft University of Technology, Delft, The Netherlands
Email: {b.a.cox,j.m.galjaard,a.shankar,j.decouchant}@tudelft.nl

Parameterizing Federated Continual Learning for Reproducible Research

Bart Cox (ORCID 0000-0001-5209-6161), Jeroen Galjaard (ORCID 0000-0003-3681-7226), Aditya Shankar (ORCID 0009-0009-3046-8724), Jérémie Decouchant (ORCID 0000-0001-9143-3984), Lydia Y. Chen (ORCID 0000-0002-4228-6735)

Abstract

Federated Learning (FL) systems operate in heterogeneous and ever-evolving environments that challenge their performance. Under real deployments, the learning tasks of clients can also evolve with time, which calls for the integration of methodologies such as Continual Learning. To enable research reproducibility, we propose a set of experimental best practices that precisely capture and emulate complex learning scenarios. Our framework, Freddie, is the first entirely configurable framework for Federated Continual Learning (FCL), and it can be seamlessly deployed on a large number of machines thanks to the use of Kubernetes and containerization. We demonstrate the effectiveness of Freddie on two use cases, (i) large-scale FL on CIFAR100 and (ii) heterogeneous task sequences in FCL, which highlight unaddressed performance challenges in FCL scenarios.

Keywords: Federated Continual Learning · Resource and Data heterogeneity

1 Introduction

Federated Learning (FL) [12] performs distributed optimization thanks to a central federator server that maintains a global model using model updates computed by clients. It is common for data to be distributed among the clients of an FL system in a non-independent and identically distributed (non-IID) way. Moreover, in practice, client learning tasks also evolve over time. Continual Learning (CL) [5] is a technique that addresses the scenario where a model is continuously trained on evolving client tasks.

One of the key challenges in CL is catastrophic forgetting: parameters or semantic representations learned for past tasks drift under the influence of new tasks. Three categories of techniques address this challenge [6]. Replay mechanisms, like GEM [8] and DGR [15], retain or generate data from earlier tasks for new task adaptation, which allows the network to revisit previously learned tasks. Regularization techniques, such as EWC [5], penalize the divergence of model parameters, preventing the adaptation process on new tasks from deviating too far from the model learned on prior tasks. Parameter isolation methods dedicate specific weights of the network to the task at hand, i.e., they use a mask to freeze the weights of other tasks [10].
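To make the regularization idea concrete, the EWC objective of [5] for learning a new task $B$ after a task $A$ is commonly written as below. The symbol names are a notational choice for this illustration: $\mathcal{L}_B$ is the loss on the new task, $\theta^{*}_{A}$ the parameters retained from the previous task, $F_i$ the diagonal Fisher information estimate for parameter $i$, and $\lambda$ the strength of the penalty.

$$
\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} \, F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2
$$

Parameters with a large $F_i$ were important for task $A$ and are therefore kept close to their previous values, while the remaining parameters stay free to adapt to task $B$.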

Continual Learning allows a client to learn from its previous tasks if features are repeated over time. Federated Continual Learning [18] (FCL) combines CL and FL, enabling clients to indirectly learn from each other. Existing CL frameworks do not take this indirect learning into account and therefore provide limited support for Federated Continual Learning.

In addition, reproducing FCL results that were obtained in deployment is difficult. For example, experimental environments are often tightly controlled and steady, while real-world environments are often dynamic and heterogeneous, and clients might be temporarily busy processing co-located tasks. Several FL simulation [9, 14] and emulation [1, 14] frameworks have been proposed, but they cannot be easily extended to support heterogeneous data, learning tasks and hardware platforms. Moreover, frameworks that focus on enabling large-scale FL experiments impose a significant overhead to manage the execution or require the use of a strict pipeline.

In this paper, we address the lack of a scalable yet flexible framework for reproducible FCL experiments. Overall, we make the following contributions:

• We identify key requirements for FL and FCL emulation: ease of use, reproducibility, support for complex workloads, and resource heterogeneity.

• We develop Freddie, a framework for Federated and distributed machine learning, which is the first open-source framework (https://gitlab.ewi.tudelft.nl/dmls/publications/freddie) that addresses these requirements. Freddie supports small-scale deployments, i.e., single-machine simulations, and large-scale emulation over self-managed and cloud systems using Kubernetes. Freddie enables the emulation of both data and resource heterogeneity.

• We provide benchmark generation methods for FCL that explore both data and task heterogeneity across clients, yielding realistic workloads tailored for Federated Continual Learning systems.

2 Related Work

Federated Learning (FL). Existing FL frameworks support a fixed set of learning tasks across clients. Flower [1] provides a client-server framework that needs to be manually started on different devices. In contrast, Fate [3] focuses on providing a secure and production-ready federated learning setup. Fate supports Kubernetes deployments but requires the use of its pipelines to run experiments. Although this might provide desirable security additions for production, it caters less to prototyping and active research needs. Beyond research endeavours, popular deep learning platforms such as TorchX and TensorFlow Federated can respectively run distributed and federated experiments at scale, but they lack the flexibility to use other ML libraries.

Continual Learning (CL). FACIL [11], PyCIL [19], and Pycontinual [4] provide CL frameworks and CL algorithms such as LwF, iCaRL, EWC, and GEM. Continual World [17] adds a simulation world for robotics tasks for Continual Reinforcement Learning. Avalanche [7] is focused on reproducible End-to-End Continual Learning. The aforementioned frameworks support CL only on a single machine. FedWEIT [18] combines parameter isolation and regularization and extends CL to a federated setting. However, it does not consider the impact of task sequences on the global model’s quality. Last but not least, current FL frameworks cannot be easily extended to support CL scenarios where the output types evolve.

Continual Learning methods leverage task IDs during training and evaluation, enabling the exploitation of a specific set of model weights or restricting the output classes based on a task’s ID. The corresponding scenarios are known as Task-Incremental Learning (task-IL) and Domain-Incremental Learning (domain-IL) [16], and both are supported by Freddie.

3 Specifying Data Heterogeneity and Learning Workloads

This section briefly surveys system requirements addressed by Freddie.

Classical FL parameters. The number of clients and the federator’s aggregation and client selection strategies must be configurable. In addition, FL-related and common hyperparameters, such as the number of training epochs and the learning rate, must be configurable.
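As an illustration of a configurable aggregation strategy, the following minimal sketch implements the weighted averaging of FedAvg [12] on plain Python lists. The function name `fedavg` and the dictionary-of-lists weight format are assumptions made for this example and are not taken from Freddie’s code base.

```python
from typing import Dict, List, Tuple

# Hypothetical weight format: layer name -> flat list of parameters.
Weights = Dict[str, List[float]]


def fedavg(updates: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its sample count."""
    total = sum(num_samples for _, num_samples in updates)
    if total == 0:
        raise ValueError("clients reported no training samples")
    reference = updates[0][0]
    averaged: Weights = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(weights[name][i] * n for weights, n in updates) / total
            for i in range(len(values))
        ]
    return averaged


# Two clients holding 10 and 30 samples respectively.
clients = [({"layer": [1.0, 2.0]}, 10), ({"layer": [3.0, 4.0]}, 30)]
print(fedavg(clients))  # {'layer': [2.5, 3.5]}
```

Swapping this function for another one with the same signature is one simple way to make the aggregation strategy a parameter of an experiment.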

Statistical heterogeneity. As in FL, local data distributions remain of high importance, and their non-iidness should be configurable. In the context of FCL, local distributions also limit the tasks that clients might be able to train for.
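One common way to make the degree of non-IID-ness a single tunable parameter is label-skew partitioning with a Dirichlet distribution. The sketch below is an assumption for illustration only: the text does not state which partitioning method Freddie uses, and the function name `dirichlet_partition` and the parameter `alpha` are hypothetical. Smaller values of `alpha` produce more skewed client label distributions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def dirichlet_partition(labels: List[int], num_clients: int,
                        alpha: float, seed: int = 0) -> Dict[int, List[int]]:
    """Assign sample indices to clients with Dirichlet(alpha) label skew."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    parts: Dict[int, List[int]] = {c: [] for c in range(num_clients)}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of normalised Gamma draws.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0
        start, cumulative = 0, 0.0
        for client in range(num_clients):
            cumulative += gammas[client] / total
            if client == num_clients - 1:
                end = len(indices)  # the last client takes the remainder
            else:
                end = min(len(indices), round(cumulative * len(indices)))
            parts[client].extend(indices[start:end])
            start = end
    return parts


labels = [i % 4 for i in range(40)]  # toy dataset with 4 classes
parts = dirichlet_partition(labels, num_clients=3, alpha=0.5, seed=1)
print({client: len(indices) for client, indices in parts.items()})
```

Because the generator is seeded, the same `seed` reproduces the same split, which is the property a reproducible experiment description needs.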

Resource heterogeneity. It should be possible to specify the computing power of the clients and of the federator, and the characteristics (latency, throughput) of the network links that interconnect them.

Task description. The task sequence of each client can be specified. A non-IID task distribution can be likened to the situation where clients learn tasks with high intra-task variance, e.g., due to different domains. In such settings, it is often unclear how the quality of current CL methods is impacted by aggregation.
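The four groups of requirements above could be gathered in a single experiment description. The following sketch shows one possible shape of such a description. All class names, field names, and default values are hypothetical and do not reflect Freddie’s actual configuration schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceSpec:
    cpu_cores: float   # CPU cores granted to the node
    memory_gb: float   # memory limit in GB


@dataclass
class ExperimentConfig:
    # Classical FL parameters
    num_clients: int
    clients_per_round: int
    rounds: int
    aggregation: str = "FedAvg"
    local_epochs: int = 1
    learning_rate: float = 0.01
    # Statistical heterogeneity (hypothetical knob: 0 = IID, larger = more skew)
    label_skew: float = 0.0
    # Resource heterogeneity
    federator: ResourceSpec = field(default_factory=lambda: ResourceSpec(2, 2))
    client: ResourceSpec = field(default_factory=lambda: ResourceSpec(1, 2))
    link_latency_ms: float = 0.0
    link_throughput_mbps: float = 1000.0
    # Task description: client id -> ordered list of task ids
    task_sequences: Dict[int, List[int]] = field(default_factory=dict)
    seed: int = 42

    def validate(self) -> None:
        """Reject descriptions that cannot be deployed."""
        if not 0 < self.clients_per_round <= self.num_clients:
            raise ValueError("clients_per_round must be in [1, num_clients]")
        for client_id in self.task_sequences:
            if not 0 <= client_id < self.num_clients:
                raise ValueError(f"unknown client id {client_id}")


config = ExperimentConfig(num_clients=5, clients_per_round=5, rounds=100,
                          task_sequences={0: [0, 1, 2], 1: [2, 0, 1]})
config.validate()
```

Keeping every knob, including the seed, in one serializable object is what lets an experiment be rerun from its description alone.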

Figure 1: Overview of Freddie. An Orchestrator and an Extractor are respectively used for deploying experiments and collecting data. Experiments are run as TrainJobs managed by Kubeflow Training Operators. Within such a job, the experiment is controlled by the federator and learned by the clients.

4 Freddie: A Framework for Reproducible FCL Research

In this section, for space reasons, we focus on Freddie’s implementation based on containerization and orchestration methods, and on its support for FCL.

Kubernetes and containers. Fig. 1 presents a high-level overview of Freddie. The Orchestrator provides functionality to kick off experiments, which are in turn deployed on the cluster by Kubeflow’s training operators. The Extractor provides volumes for experiments to write to, backed by an NFS provisioner and server. Experiments themselves are performed by federator and client nodes. The overall flow of the federated learning system with Freddie is as follows. First, the user submits an experiment description of the system and hyper-parameters. Following this, the Orchestrator deploys and monitors the experiment. This design allows users to scale their experiments up from small-scale prototyping with minimal effort. The Extractor allows users to store and retrieve experiment statistics and artefacts created by the federator or clients.

The communication between any two parties in the system is asynchronous, allowing the development of FL systems with non-blocking federator-client interactions. With this flexibility, clients can run in Kubernetes clusters and on individual machines. To allow users to describe their experiment as distributed code, we rely on Kubeflow’s training operators, which provide a means to set up distributed learning on Kubernetes with popular ML libraries.
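A non-blocking federator-client interaction can be sketched with Python’s `asyncio`. This is a toy stand-in that assumes nothing about Freddie’s actual communication layer: `client_train` and `run_round` are hypothetical names, and the sleep call merely simulates local training and network delay.

```python
import asyncio
import random
from typing import Dict, List, Tuple


async def client_train(client_id: int, round_id: int) -> Tuple[int, str]:
    """Simulate local training of variable duration on one client."""
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return client_id, f"update-r{round_id}-c{client_id}"


async def run_round(round_id: int, client_ids: List[int]) -> Dict[int, str]:
    """Dispatch training to all clients and collect updates as they arrive."""
    tasks = [asyncio.create_task(client_train(c, round_id)) for c in client_ids]
    updates: Dict[int, str] = {}
    for finished in asyncio.as_completed(tasks):
        client_id, update = await finished
        updates[client_id] = update  # the federator never waits on a specific client
    return updates


updates = asyncio.run(run_round(0, [0, 1, 2]))
print(sorted(updates))
```

Because the federator processes updates in completion order, a slow client delays only its own contribution, which is what makes heterogeneous client speeds tolerable within a round.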

Novel support for FCL. Freddie supports the state-of-the-art algorithms for FCL [18] and common CL methods such as EWC [5] and GEM [8]. For CL, Freddie implements Task-IL and Domain-IL [16] through the use of sliding, expanding, and full window mechanisms. A sliding window restricts the output classes to those of the task evaluated at time $t$. Expanding windows do not use the task IDs to make any such restriction, so the output classes include all classes learned until that time. A full window does not restrict outputs based on the task and can be used in the standard federated learning scenario.
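The three window mechanisms can be read as three ways of choosing which output classes a prediction may fall in. The sketch below follows the description above under two assumptions: tasks are learned in index order `0..t`, and the function names `allowed_classes` and `masked_argmax` are hypothetical rather than part of Freddie’s API.

```python
from typing import List, Sequence


def allowed_classes(mode: str, task_classes: Sequence[Sequence[int]],
                    t: int) -> List[int]:
    """Return the output classes a prediction may take at time t."""
    if mode == "sliding":      # only the classes of the evaluated task
        return sorted(task_classes[t])
    if mode == "expanding":    # every class learned up to and including t
        return sorted({c for task in task_classes[: t + 1] for c in task})
    if mode == "full":         # no task-based restriction at all
        return sorted({c for task in task_classes for c in task})
    raise ValueError(f"unknown window mode: {mode}")


def masked_argmax(logits: Sequence[float], allowed: Sequence[int]) -> int:
    """Predict the highest-scoring class among the allowed ones."""
    return max(allowed, key=lambda c: logits[c])


task_classes = [[0, 1], [2, 3], [4, 5]]   # three tasks with two classes each
logits = [0.1, 0.2, 0.3, 0.9, 1.5, 0.0]

print(masked_argmax(logits, allowed_classes("sliding", task_classes, 0)))    # 1
print(masked_argmax(logits, allowed_classes("expanding", task_classes, 1)))  # 3
print(masked_argmax(logits, allowed_classes("full", task_classes, 0)))       # 4
```

The same logits yield three different predictions, which shows why the window choice must be reported alongside any accuracy figure.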

The added complexity of FCL means that workloads over the same set of tasks can produce different results. We devise three schemes that partition tasks differently and that can be used to evaluate an FCL scheme over a representative set of scenarios. We call these schemes Column, Balanced, and Shuffled. In Column, shown in Fig. 2(a), all clients handle tasks in the same order, thereby naively splitting the CL workload across clients, which results in a significant expected catastrophic forgetting effect. Balanced, depicted in Fig. 2(b), aims to prevent catastrophic forgetting by organizing the tasks so that different clients train on them across multiple steps. This partitioning scheme lessens the effect of catastrophic forgetting and results in a task being trained on by at most one client. While this scheme addresses short-term forgetting, long-term forgetting may still occur. Shuffled, see Fig. 2(c), randomly orders each client’s tasks, relying on pseudo-randomness in conjunction with a pre-specified seed.
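The three schemes can be sketched as functions that return one task order per client. Column and Shuffled follow the description directly. For Balanced, the sketch assumes a cyclic shift of the task order per client, so that no two clients train on the same task at the same step when there are at most as many clients as tasks. This is one plausible reading of the scheme, not necessarily the exact construction used in Freddie, and the function names are hypothetical.

```python
import random
from typing import List


def column(num_clients: int, num_tasks: int) -> List[List[int]]:
    """All clients process the tasks in the same order."""
    return [list(range(num_tasks)) for _ in range(num_clients)]


def balanced(num_clients: int, num_tasks: int) -> List[List[int]]:
    """Assumed construction: client c starts at task c and cycles through."""
    return [[(c + step) % num_tasks for step in range(num_tasks)]
            for c in range(num_clients)]


def shuffled(num_clients: int, num_tasks: int, seed: int) -> List[List[int]]:
    """Each client gets a pseudo-random order drawn from a seeded generator."""
    rng = random.Random(seed)
    schedules = []
    for _ in range(num_clients):
        order = list(range(num_tasks))
        rng.shuffle(order)
        schedules.append(order)
    return schedules


print(column(2, 3))    # [[0, 1, 2], [0, 1, 2]]
print(balanced(3, 3))  # [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
print(shuffled(2, 3, seed=7) == shuffled(2, 3, seed=7))  # True
```

Fixing the seed makes the Shuffled workload repeatable across runs, so differences between experiments can be attributed to the method under test rather than to the task order.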

5 Performance evaluation

We demonstrate some features of Freddie through experiments. For this purpose, we use the overlapping CIFAR100 dataset. The labels that already exist in CIFAR100 are used to partition the data into different tasks for FCL, following the same steps as in [18]. FL can be viewed as a corner case of FCL where there is a single task consisting of the whole dataset. We first consider an FL scenario with the default version of CIFAR100, and then consider the overlapping CIFAR100 split into 10 separate tasks in an FCL scenario. We use the average accuracy metric following the CL literature [2, 13].
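For reference, the average accuracy metric can be computed as below. The sketch assumes an accuracy table `acc[t][i]` that holds the test accuracy on task `i` measured after training on task `t`. The table layout, the toy values, and the function name `average_accuracy` are choices made for this example.

```python
from typing import Sequence


def average_accuracy(acc: Sequence[Sequence[float]], t: int) -> float:
    """Mean accuracy over all tasks seen up to and including task t."""
    seen = acc[t][: t + 1]
    return sum(seen) / len(seen)


# Toy values: accuracy on earlier tasks decays as new tasks are learned.
acc = [
    [1.0],
    [0.5, 1.0],
    [0.25, 0.5, 0.75],
]
print([average_accuracy(acc, t) for t in range(3)])  # [1.0, 0.75, 0.5]
```

The metric drops whenever accuracy on earlier tasks decays, so it reflects forgetting as well as performance on the current task.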

Scalability. To investigate Freddie’s emulation capability, we perform a small-scale and a large-scale experiment on a Google Kubernetes Engine (GKE) cluster to cover possible use cases. During deployment, the pods of the federator and clients were run on a separate node pool scaled to meet each experiment’s requirements. We study the performance of an FL experiment emulated on a CPU-enabled Kubernetes cluster, where multiple clients may run on a single Kubernetes node. Parameters of the experiments are provided in Tab. 1.

Table 1: System and hyperparameters used in ‘small’ and ‘scale’ experiments. All experiments were run on ‘e2-standard-8’ nodes.

Header cells (in order): System; Federator (F); Clients (C); Nodes; CPU (F/C); Memory (F/C); Strategy; #Rounds (R); #C/R; Model; Data

Small (values in order): 2,2,3; 5,10,20; 2/1; 2/2G; FedAvg; 100; LeNet; CIFAR10

Scale (values in order): 4,12; 25,75; 2/2,6G; all; ResNet; CIFAR100

The small experiment in Fig. 3(a) depicts the spread of the round times of the clients (scaled) and of the federator, with 5 selected clients per round. The client round duration is scaled by the number of clients (World Size, WS), i.e., by $|\mathcal{D}_{Cifar}|/\text{WS}$, to account for differences in the sizes of clients’ datasets as the WS increases. The outliers in the plots originate from the first epoch run on clients, which is inherently slower due to loading data into memory. The scaled client duration would be expected to stay relatively constant, yet the results show an increase as the number of clients increases (from 115 to 123 s, and from 136 to 138 s). Similarly, the federator’s round duration is positively correlated with WS. The number of clients co-scheduled on the same node can explain this trend, as the networking overhead stays the same.

For the scale experiments, we provide the round time density estimate in Fig. 3(b). The client round times exhibit the same range of processing times that were observed in the ‘small’ setting. In both settings, participating clients in each round may run on the same node, varying from 3 to 7 clients per node. Similar settings apply in the ‘small’ configuration that involves 20 clients, where 4 nodes are used, which supports the explanation that resource contention due to co-scheduling is the likely cause of the increased client round time with 20 clients. The different modes within the clients’ round duration can be explained by imperfect data splits and the imbalanced number of clients co-scheduled with the federator. The federator’s density estimate shows a similar pattern with two distinct modes. With the cluster configurations employed, i.e., 4 and 12 nodes, it is possible for the federator to be co-scheduled on a machine with different numbers of clients. As a result, the federator experiences a variable level of resource contention. In addition, both modes shift upwards as the number of clients increases, which is expected due to the increased communication volumes.

Task-IL vs Domain-IL. Assuming that the model knows the ID of the task it is currently training on, or evaluating, increases its accuracy. For FCL, Task-Incremental Learning and Domain-Incremental Learning are implemented using the sliding and expanding windows, respectively. Recall that sliding windows use task IDs, contrary to expanding windows. For the overlapping CIFAR100 dataset, if one assumes that the task ID is known, then the number of output classes is restricted to only the five sub-classes within that task, which leads to a higher probability of classifying correctly in Task-IL scenarios. Under the expanding window scheme, classification outputs one of $5T$ classes, where $T$ is the number of tasks learned until evaluation time, so the probability of classifying correctly is lower than in the sliding window scenario. This difference is evident in Fig. 4(a), which shows the positive impact of using a task ID on accuracy. A sliding window results in higher accuracy than an expanding window, but the latter sometimes has to be used because of the application use case. For this reason, Freddie supports both Task-IL and Domain-IL.

FCL Task Heterogeneity. As discussed in Section 3, tasks can be processed in different orders at each client. To demonstrate the effects that different sequences of tasks produce, we implement the Overlapped-CIFAR100 dataset with 20 tasks that can be used for FCL [18]. The accuracy in Fig. 4(b) is calculated as the average accuracy of all tasks seen until that point, resulting in expected ‘drops’ in accuracy as new tasks are introduced. Indeed, the learning curves in Fig. 4(b) show noticeable drops over time. However, different trends are visible between workloads. The column scheme suffers from more pronounced catastrophic forgetting than the shuffled and balanced schemes, resulting in lower accuracy. We observe that the column scheme, on average, results in a 4% test accuracy drop compared to the balanced and shuffled schemes.

6 Conclusion

We presented Freddie, the first framework for reproducible Federated Continual Learning research, which is motivated by the increasing importance of Federated and Continual Learning. Freddie’s deployment abilities on different platforms, scalability with the number of clients, and support for data and task heterogeneity provide FL practitioners with a powerful tool. Our experimental results showcase previously unaddressed performance issues that Federated Continual Learning systems might face: severe catastrophic forgetting in different task heterogeneity settings. Freddie is open-source, and will soon be extended to support new CL datasets, algorithms, and generative models.

References

[1] Beutel, D.J., Topal, T., Mathur, A., et al.: Flower: A friendly federated learning research framework. CoRR abs/2007.14390 (2020)
[2] Chaudhry, A., Ranzato, M., Rohrbach, M., et al.: Efficient lifelong learning with A-GEM. In: ICLR (2019)
[3] FedAI: Fate (2019), https://fate.fedai.org/
[4] Ke, Z., Liu, B., Ma, N., et al.: Achieving forgetting prevention and knowledge transfer in continual learning. In: NeurIPS (2021)
[5] Kirkpatrick, J., Pascanu, R., Rabinowitz, N.C., et al.: Overcoming catastrophic forgetting in neural networks. CoRR abs/1612.00796 (2016)
[6] Lange, M.D., Aljundi, R., Masana, M., et al.: Continual learning: A comparative study on how to defy forgetting in classification tasks. CoRR abs/1909.08383 (2019)
[7] Lomonaco, V., Pellegrini, L., Cossu, A., et al.: Avalanche: an end-to-end library for continual learning. In: CVPR (2021)
[8] Lopez-Paz, D., Ranzato, M.A.: Gradient episodic memory for continual learning. In: NeurIPS (2017)
[9] Ma, Y., Yu, D., Wu, T., et al.: PaddlePaddle: An open-source deep learning platform from industrial practice. Frontiers of Data and Computing 1(1), 105–115 (2019)
[10] Mallya, A., Lazebnik, S.: Packnet: Adding multiple tasks to a single network by iterative pruning. In: CVPR (2018)
[11] Masana, M., Liu, X., Twardowski, B., Menta, M., et al.: Class-incremental learning: Survey and performance evaluation on image classification. IEEE Trans. Pattern Anal. Mach. Intell. 45(5), 5513–5533 (2023)
[12] McMahan, B., Moore, E., Ramage, D., et al.: Communication-efficient learning of deep networks from decentralized data. In: Artificial intelligence and statistics. pp. 1273–1282. PMLR (2017)
[13] Mirzadeh, S.I., Farajtabar, M., Pascanu, R., et al.: Understanding the role of training regimes in continual learning. In: NeurIPS (2020)
[14] Reina, G.A., Gruzdev, A., Foley, P., et al.: Openfl: An open-source framework for federated learning. CoRR abs/2105.06413 (2021)
[15] Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: NeurIPS. pp. 2990–2999 (2017)
[16] van de Ven, G.M., Tolias, A.S.: Three scenarios for continual learning. CoRR abs/1904.07734 (2019)
[17] Wołczyk, M., Zając, M., Pascanu, R., et al.: Continual world: A robotic benchmark for continual reinforcement learning. In: NeurIPS (2021)
[18] Yoon, J., Jeong, W., Lee, G., et al.: Federated continual learning with weighted inter-client transfer. In: ICML (2021)
[19] Zhou, D.W., Wang, F.Y., Ye, H.J., et al.: Pycil: A python toolbox for class-incremental learning. CoRR abs/2112.12533 (2021)