-
edibble: An R package to encapsulate elements of experimental designs for better planning, management and workflow
Authors:
Emi Tanaka
Abstract:
I present an R package called edibble that facilitates the design of experiments by encapsulating elements of the experiment in a series of composable functions. This package is an interpretation of "the grammar of experimental designs" by Tanaka (2023) in the R programming language. The main features of the edibble package are demonstrated, illustrating how it can be used to create a wide array of experimental designs. The implemented system aims to encourage cognitive thinking for holistic planning and data management of experiments in a streamlined workflow. This workflow can increase the inherent value of experimental data by reducing potential errors or noise through careful preplanning, as well as ensuring fit-for-purpose analysis of experimental data.
Submitted 16 November, 2023;
originally announced November 2023.
-
A Plot is Worth a Thousand Tests: Assessing Residual Diagnostics with the Lineup Protocol
Authors:
Weihao Li,
Dianne Cook,
Emi Tanaka,
Susan VanderPlas
Abstract:
Regression experts consistently recommend plotting residuals for model diagnosis, despite the availability of many numerical hypothesis test procedures designed to use residuals to assess problems with a model fit. Here we provide evidence for why this is good advice using data from a visual inference experiment. We show how conventional tests are too sensitive, which means that too often the conclusion would be that the model fit is inadequate. The experiment uses the lineup protocol which puts a residual plot in the context of null plots. This helps generate reliable and consistent reading of residual plots for better model diagnosis. It can also help in an obverse situation where a conventional test would fail to detect a problem with a model due to contaminated data. The lineup protocol also detects a range of departures from good residuals simultaneously. Supplemental materials for the article are available online.
Submitted 24 March, 2024; v1 submitted 11 August, 2023;
originally announced August 2023.
-
Towards a unified language in experimental designs propagated by a software framework
Authors:
Emi Tanaka
Abstract:
Experiments require human decisions in the design process, which in turn are reformulated and summarized as inputs into a system (computational or otherwise) to generate the experimental design. I leverage this system to promote a language of experimental designs by proposing a novel computational framework, called "the grammar of experimental designs", to specify experimental designs based on an object-oriented programming system that declaratively encapsulates the experimental structure. The framework aims to engage human cognition by building experimental designs with modular functions that modify a targeted singular element of the experimental design object. The syntax and semantics of the framework are built upon consideration from multiple perspectives. While the core framework is language-agnostic, the framework is implemented in the `edibble` R-package. A range of examples is shown to demonstrate the utility of the framework.
Submitted 24 July, 2023; v1 submitted 11 July, 2023;
originally announced July 2023.
-
Current state and prospects of R-packages for the design of experiments
Authors:
Emi Tanaka,
Dewi Amaliah
Abstract:
Re-running an experiment is generally costly and, in some cases, impossible due to limited resources; therefore, the design of an experiment plays a critical role in increasing the quality of experimental data. In this paper, we describe the current state of R-packages for the design of experiments through an exploratory data analysis of package downloads, package metadata, and a comparison of characteristics with other topics. We observed that experimental designs in practice appear to be sufficiently manufactured by a small number of packages, and the development of experimental designs often occurs in silos. We also discuss the interface designs of widely utilized R packages in the field of experimental design and discuss their future prospects for advancing the field in practice.
Submitted 13 December, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database
Authors:
Dewi Amaliah,
Dianne Cook,
Emi Tanaka,
Kate Hyde,
Nicholas Tierney
Abstract:
Textbook data is essential for teaching statistics and data science methods because it is clean, allowing the instructor to focus on methodology. Ideally, textbook data sets are refreshed regularly, especially when they are subsets taken from an on-going data collection. It is also important to use contemporary data for teaching, to imbue the sense that the methodology is relevant today. This paper describes the trials and tribulations of refreshing a textbook data set on wages, extracted from the National Longitudinal Survey of Youth (NLSY79) in the early 1990s. The data is useful for teaching modeling and exploratory analysis of longitudinal data. Subsets of NLSY79, including the wages data, can be found in supplementary files from numerous textbooks and research articles. The NLSY79 database has been continuously updated through to 2018, so new records are available. Here we describe our journey to refresh the wages data, and document the process so that the data can be regularly updated into the future. Our journey was difficult because the steps and decisions taken to get from the raw data to the wages textbook subset have not been clearly articulated. We have been diligent to provide a reproducible workflow for others to follow, which also hopefully inspires more attempts at refreshing data for teaching. Three new data sets and the code to produce them are provided in the open source R package called `yowie`.
Submitted 12 May, 2022;
originally announced May 2022.
-
Symbolic Formulae for Linear Mixed Models
Authors:
Emi Tanaka,
Francis K. C. Hui
Abstract:
A statistical model is a mathematical representation of an often simplified or idealised data-generating process. In this paper, we focus on a particular type of statistical model, called linear mixed models (LMMs), that is widely used in many disciplines, e.g. agriculture, ecology, econometrics, psychology. Mixed models, also commonly known as multi-level, nested, hierarchical or panel data models, incorporate a combination of fixed and random effects, with LMMs being a special case. The inclusion of random effects in particular gives LMMs considerable flexibility in accounting for many types of complex correlated structures often found in data. This flexibility, however, has given rise to a number of ways by which an end-user can specify the precise form of the LMM that they wish to fit in statistical software. In this paper, we review the software design for specification of the LMM (and its special case, the linear model), focusing in particular on the use of high-level symbolic model formulae and two popular but contrasting R-packages in lme4 and asreml.
Submitted 19 November, 2019;
originally announced November 2019.
-
Simple robust genomic prediction and outlier detection for a multi-environmental field trial
Authors:
Emi Tanaka
Abstract:
The aim of plant breeding trials is often to identify germplasms that are well adapted to target environments. These germplasms are identified through genomic prediction from the analysis of a multi-environmental field trial (MET) using linear mixed models. The occurrence of outliers in MET is common and known to adversely impact the accuracy of genomic prediction, yet the detection of outliers, and subsequently their treatment, is often neglected. There are a number of reasons for this: complex data such as MET give rise to distinct levels of residuals and thus pose additional challenges for an outlier detection method, and many linear mixed model software packages are ill-equipped for robust prediction. We present outlier detection methods using a holistic approach that borrows strength across trials. We furthermore evaluate a simple robust genomic prediction that is applicable to any linear mixed model software. These are demonstrated using simulation based on two real bread wheat yield METs with a partially replicated design and an alpha lattice design.
Submitted 19 July, 2018;
originally announced July 2018.