-
Personalized Federated Learning with Clustering: Non-IID Heart Rate Variability Data Application
Authors:
Joo Hun Yoo,
Ha Min Son,
Hyejun Jeong,
Eun-Hye Jang,
Ah Young Kim,
Han Young Yu,
Hong ** Jeon,
Tai-Myoung Chung
Abstract:
While machine learning techniques are being applied to various fields for their exceptional ability to find complex relations in large datasets, the strengthening of regulations on data ownership and privacy is making their application to medical data increasingly difficult. In light of this, Federated Learning has recently been proposed as a solution to train on private data without breach of confidentiality. This preservation of privacy is particularly appealing in the field of healthcare, where patient data is highly confidential. However, many studies have shown that its assumption of Independent and Identically Distributed data is unrealistic for medical data. In this paper, we propose Personalized Federated Cluster Models, a hierarchical clustering-based FL process, to predict Major Depressive Disorder severity from Heart Rate Variability. By allowing clients to receive a more personalized model, we address problems caused by non-IID data, showing an accuracy increase in severity prediction. This increase in performance may be sufficient to use Personalized Federated Cluster Models in many existing Federated Learning scenarios.
Submitted 10 August, 2021; v1 submitted 4 August, 2021;
originally announced August 2021.
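As a rough illustration of the idea in this abstract (a hierarchical clustering-based FL process in which each client receives a cluster-level model), the sketch below groups clients and averages weights within each group. The abstract does not say what is clustered, which linkage is used, or how the models are aggregated, so everything here is an assumption: clients are represented by flat weight vectors, clustering is agglomerative with average linkage on Euclidean distance, and aggregation is an unweighted mean. The function names `cluster_clients`, `average_weights`, and `personalized_round` are hypothetical.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length weight vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_clients(updates, n_clusters):
    """Agglomerative clustering with average linkage (an assumed choice).

    updates: list of flat weight vectors, one per client.
    Returns a list of clusters, each a list of client indices.
    """
    clusters = [[i] for i in range(len(updates))]
    while len(clusters) > n_clusters:

        def linkage(pair):
            a, b = pair
            dists = [
                euclidean(updates[i], updates[j])
                for i in clusters[a]
                for j in clusters[b]
            ]
            return sum(dists) / len(dists)

        # Merge the two closest clusters; combinations() guarantees a < b.
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def average_weights(updates, members):
    """Unweighted mean of the weight vectors of the given clients."""
    dim = len(updates[members[0]])
    return [sum(updates[i][d] for i in members) / len(members) for d in range(dim)]


def personalized_round(updates, n_clusters):
    """Return one model per client: the average of that client's cluster."""
    models = {}
    for members in cluster_clients(updates, n_clusters):
        cluster_model = average_weights(updates, members)
        for i in members:
            models[i] = cluster_model
    return models


if __name__ == "__main__":
    # Toy data: two clearly separated groups of clients (made up for illustration).
    toy_updates = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
    for client, model in sorted(personalized_round(toy_updates, 2).items()):
        print(client, model)
```

With one cluster this reduces to plain averaging over all clients; with more clusters, clients whose updates differ strongly no longer share a model, which is one way to read the abstract's point about non-IID data.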
-
"Playing the whole game": A data collection and analysis exercise with Google Calendar
Authors:
Albert Y. Kim,
Johanna Hardin
Abstract:
We provide a computational exercise suitable for early introduction in an undergraduate statistics or data science course that allows students to 'play the whole game' of data science: performing both data collection and data analysis. While many teaching resources exist for data analysis, such resources are not as abundant for data collection given the inherent difficulty of the task. Our proposed exercise centers around student use of Google Calendar to collect data with the goal of answering the question 'How do I spend my time?' On the one hand, the exercise involves answering a question with near universal appeal, but on the other hand, the data collection mechanism is not beyond the reach of a typical undergraduate student. A further benefit of the exercise is that it provides an opportunity for discussions on ethical questions and considerations that data providers and data analysts face in today's age of large-scale internet-based data collection.
Submitted 18 June, 2020; v1 submitted 23 February, 2020;
originally announced February 2020.
-
Integrating data science ethics into an undergraduate major: A case study
Authors:
Benjamin S. Baumer,
Randi L. Garcia,
Albert Y. Kim,
Katherine M. Kinnaird,
Miles Q. Ott
Abstract:
We present a programmatic approach to incorporating ethics into an undergraduate major in statistical and data sciences. We discuss departmental-level initiatives designed to meet the National Academy of Sciences recommendation for integrating ethics into the curriculum from top-to-bottom as our majors progress from our introductory courses to our senior capstone course, as well as from side-to-side through co-curricular programming. We also provide six examples of data science ethics modules used in five different courses at our liberal arts college, each focusing on a different ethical consideration. The modules are designed to be portable such that they can be flexibly incorporated into existing courses at different levels of instruction with minimal disruption to syllabi. We connect our efforts to a growing body of literature on the teaching of data science ethics, present assessments of our effectiveness, and conclude with next steps and final thoughts.
Submitted 31 January, 2022; v1 submitted 21 January, 2020;
originally announced January 2020.
-
Optimizing Waiting Thresholds Within A State Machine
Authors:
Rohit Pandey,
Yifan Chang,
Cameron White,
Gaurav Jagtiani,
Aerin Young Kim,
Gil Lapid Shafriri,
Sathya Singh
Abstract:
Azure (the cloud service provided by Microsoft) is composed of physical computing units which are called nodes. These nodes are controlled by a software component called Fabric Controller (FC), which can consider the nodes to be in one of many different states such as Ready, Unhealthy, Booting, etc. Some of these states correspond to a node being unresponsive to FC's requests. When a node goes unresponsive for more than a set threshold, FC intervenes and reboots the node. We minimized the downtime caused by the intervention threshold when a node switches to the Unhealthy state by fitting various heavy-tail probability distributions. We consider using features of the node to customize the organic recovery model to the individual nodes that go unhealthy. This regression approach allows us to use information about the node, such as hardware, software versions, and historical performance indicators, to inform the organic recovery model and hence the optimal threshold. In another direction, we consider generalizing this to an arbitrary number of thresholds within the node state machine (or Markov chain). When the states become intertwined in ways that different thresholds start affecting each other, we can't simply optimize each of them in isolation. For best results, we must consider this as an optimization problem in many variables (the number of thresholds). We no longer have a nice closed form solution for this more complex problem as we did with one threshold, but we can still use numerical techniques (gradient descent) to solve it.
Submitted 8 October, 2018;
originally announced October 2018.
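To make the single-threshold case in this abstract concrete, here is a minimal sketch under a simplified model that is assumed here and not taken from the paper. The organic recovery time T of an unresponsive node follows a Lomax (Pareto type II) distribution with survival function S(t) = (1 + t/scale)^(-shape), and a reboot triggered at the threshold adds a fixed downtime `reboot_cost`. Expected downtime is then D(tau) = E[min(T, tau)] + S(tau) * reboot_cost, and setting D'(tau) = 0 gives hazard(tau) = 1 / reboot_cost, i.e. tau* = shape * reboot_cost - scale (clipped at 0). All function and parameter names are hypothetical.

```python
def survival(t, shape, scale):
    """P(T > t) for the assumed Lomax recovery-time distribution."""
    return (1.0 + t / scale) ** (-shape)


def expected_downtime(threshold, shape, scale, reboot_cost):
    """D(tau) = integral_0^tau S(t) dt + S(tau) * reboot_cost.

    The integral has a closed form for shape != 1 (this sketch does not
    handle shape == 1, where the integral becomes a logarithm).
    """
    waited = scale / (shape - 1.0) * (
        1.0 - (1.0 + threshold / scale) ** (1.0 - shape)
    )
    return waited + survival(threshold, shape, scale) * reboot_cost


def optimal_threshold(shape, scale, reboot_cost):
    """Closed-form minimizer of D under the assumed model.

    The Lomax hazard shape / (scale + t) is decreasing, so D falls while the
    hazard exceeds 1 / reboot_cost and rises afterwards. If the hazard is
    already below 1 / reboot_cost at t = 0, rebooting immediately is best.
    """
    return max(0.0, shape * reboot_cost - scale)


if __name__ == "__main__":
    # Illustrative parameters only; they are not fitted to any real data.
    shape, scale, reboot_cost = 2.0, 10.0, 30.0
    tau = optimal_threshold(shape, scale, reboot_cost)
    print("optimal threshold:", tau)  # 50.0
    for candidate in (40.0, tau, 60.0):
        print(candidate, expected_downtime(candidate, shape, scale, reboot_cost))
```

With several interacting thresholds no such closed form is available in general, which is where the abstract turns to gradient descent over all thresholds jointly.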
-
Curriculum Guidelines for Undergraduate Programs in Data Science
Authors:
Richard De Veaux,
Mahesh Agarwal,
Maia Averett,
Benjamin Baumer,
Andrew Bray,
Thomas Bressoud,
Lance Bryant,
Lei Cheng,
Amanda Francis,
Robert Gould,
Albert Y. Kim,
Matt Kretchmar,
Qin Lu,
Ann Moskol,
Deborah Nolan,
Roberto Pelayo,
Sean Raleigh,
Ricky J. Sethi,
Mutiara Sondjaja,
Neelesh Tiruviluamala,
Paul Uhlig,
Talitha Washington,
Curtis Wesley,
David White,
** Ye
Abstract:
The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program met for the purpose of composing guidelines for undergraduate programs in Data Science. The group consisted of 25 undergraduate faculty from a variety of institutions in the U.S., primarily from the disciplines of mathematics, statistics and computer science. These guidelines are meant to provide some structure for institutions planning for or revising a major in Data Science.
Submitted 21 January, 2018;
originally announced January 2018.