
Open Case Studies: Statistics and Data Science Education through Real-World Applications

Authors: Carrie Wright, Qier Meng, Michael R. Breshock, Lyla Atta, Margaret A. Taub, Leah R Jager, John Muschelli, Stephanie C. Hicks

Abstract: With unprecedented and growing interest in data science education, there are limited educator materials that provide meaningful opportunities for learners to practice statistical thinking, as defined by Wild and Pfannkuch (1999), with messy data addressing real-world challenges. As a solution, Nolan and Speed (1999) advocated for bringing applications to the forefront in undergraduate statistics curriculum with the use of in-depth case studies to encourage and develop statistical thinking in the classroom. Limitations to this approach include the significant time investment required to develop a case study -- namely, to select a motivating question and to create an illustrative data analysis -- and the domain expertise needed. As a result, case studies based on realistic challenges, not toy examples, are scarce. To address this, we developed the Open Case Studies (https://www.opencasestudies.org) project, which offers a new statistical and data science education case study model. This educational resource provides self-contained, multimodal, peer-reviewed, and open-source guides (or case studies) from real-world examples for active experiences of complete data analyses. We developed an educator's guide describing how to most effectively use the case studies, how to modify and adapt components of the case studies in the classroom, and how to contribute new case studies (https://www.opencasestudies.org/OCS_Guide).

Submitted 12 January, 2023; originally announced January 2023.

Comments: 16 pages in main text, 3 figures, and 2 tables; 9 pages in supplement

arXiv:2104.12555 [pdf, other]

Linking open-source code commits and MOOC grades to evaluate massive online open peer review

Authors: Siruo Wang, Leah R. Jager, Kai Kammers, Aboozar Hadavand, Jeffrey T. Leek

Abstract: Massive Open Online Courses (MOOCs) have been used by students as a low-cost and low-touch educational credential in a variety of fields. Understanding the grading mechanisms behind these course assignments is important for evaluating MOOC credentials. A common approach to grading free-response assignments is massive scale peer-review, especially used for assignments that are not easy to grade programmatically. It is difficult to assess these approaches since the responses typically require human evaluation. Here we link data from public code repositories on GitHub and course grades for a large massive-online open course to study the dynamics of massive scale peer review. This has important implications for understanding the dynamics of difficult to grade assignments. Since the research was not hypothesis-driven, we described the results in an exploratory framework. We find three distinct clusters of repeated peer-review submissions and use these clusters to study how grades change in response to changes in code submissions. Our exploration also leads to an important observation that massive scale peer-review scores are highly variable, increase, on average, with repeated submissions, and changes in scores are not closely tied to the code changes that form the basis for the re-submissions.

Submitted 15 April, 2021; originally announced April 2021.

arXiv:1811.02021 [pdf, other]

Using GitHub Classroom To Teach Statistics

Authors: Jacob Fiksel, Johanna S. Hardin, Leah R. Jager, Margaret A. Taub

Abstract: Git and GitHub are common tools for keeping track of multiple versions of data analytic content, which allow for more than one person to simultaneously work on a project. GitHub Classroom aims to provide a way for students to work on and submit their assignments via Git and GitHub, giving teachers an opportunity to teach these version control tools as part of their course. In the Fall 2017 semester, we implemented GitHub Classroom in two educational settings--an introductory computational statistics lab and a more advanced computational statistics course. We found many educational benefits of implementing GitHub Classroom, such as easily providing coding feedback during assignments and making students more confident in their ability to collaborate and use version control tools for future data science work. To encourage and ease the transition into using GitHub Classroom, we provide free and publicly available resources--both for students to begin using Git/GitHub and for teachers to use GitHub Classroom for their own courses.

Submitted 5 November, 2018; originally announced November 2018.

arXiv:1301.3718 [pdf]

Empirical estimates suggest most published medical research is true

Authors: Leah R. Jager, Jeffrey T. Leek

Abstract: The accuracy of published medical research is critical both for scientists, physicians and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature using reported P-values as the data. We then collect P-values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, P = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, P = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.

Submitted 16 January, 2013; originally announced January 2013.

Comments: 11 pages, 4 figures, Correspondence to J. Leek

Showing 1–4 of 4 results for author: Jager, L R