-
On Measuring Calibration of Discrete Probabilistic Neural Networks
Authors:
Spencer Young,
Porter Jenkins
Abstract:
As machine learning systems become increasingly integrated into real-world applications, accurately representing uncertainty is crucial for enhancing their safety, robustness, and reliability. Training neural networks to fit high-dimensional probability distributions via maximum likelihood has become an effective method for uncertainty quantification. However, such models often exhibit poor calibration, leading to overconfident predictions. Traditional metrics like Expected Calibration Error (ECE) and Negative Log Likelihood (NLL) have limitations, including biases and parametric assumptions. This paper proposes a new approach using conditional kernel mean embeddings to measure calibration discrepancies without these biases and assumptions. Preliminary experiments on synthetic data demonstrate the method's potential, with future work planned for more complex applications.
Submitted 20 May, 2024;
originally announced May 2024.
-
The reliability of the gender Implicit Association Test (gIAT) for high-ability careers
Authors:
S. Stanley Young,
Warren B. Kindzierski
Abstract:
Males outnumber females in many high-ability careers in the fields of science, technology, engineering, and mathematics (STEM) and academic medicine, to name a few. These differences are often attributed to subconscious bias as measured by the gender Implicit Association Test (gIAT). We compute p-value plots for results from two meta-analyses: one examines the predictive power of gIAT, and the other examines the predictive power of vocational interests, i.e. personal interests and behaviors, for explaining gender differences in high-ability careers. The results are clear: the gender Implicit Association Test provides little or no information on male versus female differences, whereas vocational interests are strongly predictive. Researchers of implicit bias should expand their modeling to include additional relevant covariates. In short, these meta-analyses provide no support for the gender Implicit Association Test influencing choice and gender differences of high-ability careers.
Submitted 15 May, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Reproducibility of Implicit Association Test (IAT) -- Case study of meta-analysis of racial bias research claims
Authors:
S. Stanley Young,
Warren B. Kindzierski
Abstract:
The Implicit Association Test (IAT) is widely used to measure hidden (subconscious) human biases, or implicit bias, on many topics: race, gender, age, ethnicity, and religion stereotypes. There is a need to understand the reliability of these measures as they are being used in many decisions in society today. A case study was undertaken to independently test the reliability of (ability to reproduce) racial bias research claims of Black-White relations based on IAT (implicit bias) and explicit bias measurements using statistical p-value plots. These claims were for IAT-real world behavior correlations and explicit bias-real world behavior correlations of Black-White relations.
The p-value plots were constructed using data sets from published literature, and the plots exhibited considerable randomness for all correlations examined. This randomness supports a lack of correlation between IAT (implicit bias) and explicit bias measurements with real world behaviors of Whites towards Blacks. These findings were for microbehaviors (measures of nonverbal and subtle verbal behavior) and person perception judgments (explicit judgments about others). Findings of the p-value plots were consistent with the case study research claim that the IAT provides little insight into who will discriminate against whom. It was also observed that the amount of real world variance explained by the IAT and explicit bias measurements was small, less than 5 percent. Others have noted that the poor performance of both the IAT and explicit bias measurements is mostly consistent with a "flawed instruments" explanation: problems in theories that motivated development and use of these instruments.
Submitted 21 December, 2023;
originally announced December 2023.
-
Statistical reliability of meta-analysis research claims for gas stove cooking-childhood respiratory health associations
Authors:
Warren B. Kindzierski,
S. Stanley Young,
John D. Dunn
Abstract:
Odds ratios or p-values from individual observational studies can be combined to examine a common cause-effect research question in meta-analysis. However, reliability of individual studies used in meta-analysis should not be taken for granted as claimed cause-effect associations may not reproduce. An evaluation was undertaken on meta-analysis of base papers examining gas stove cooking, including nitrogen dioxide, NO2, and childhood asthma and wheeze associations. Numbers of hypotheses tested in 14 of 27 base papers, 52 percent, used in meta-analysis of asthma and wheeze were counted. Test statistics used in the meta-analysis, 40 odds ratios with 95 percent confidence limits, were converted to p-values and presented in p-value plots. The median and interquartile range of possible numbers of hypotheses tested in the 14 base papers was 15,360 (6,336 to 49,152). None of the 14 base papers made mention of correcting for multiple testing, nor was any explanation offered if no multiple testing procedure was used. Given large numbers of hypotheses available, statistics drawn from base papers and used for meta-analysis are likely biased. Even so, p-value plots for gas stove-current asthma and gas stove-current wheeze associations show randomness consistent with unproven gas stove harms. The meta-analysis fails to provide reliable evidence for public health policy making on gas stove harms to children in North America. NO2 is not established as a biologically plausible explanation of a causal link with childhood asthma. Biases (multiple testing and p-hacking) cannot be ruled out as explanations for a gas stove-current asthma association claim. Selective reporting is another bias in published literature of gas stove-childhood respiratory health studies. Keywords: gas stove, asthma, meta-analysis, p-value plot, multiple testing, p-hacking
Submitted 4 June, 2023; v1 submitted 26 March, 2023;
originally announced April 2023.
-
Mortality Rates of US Counties: Are they Reliable and Predictable?
Authors:
Robert L. Obenchain,
S. Stanley Young
Abstract:
We examine US County-level observational data on Lung Cancer mortality rates in 2012 and overall Circulatory Respiratory mortality rates in 2016 as well as their "Top Ten" potential causes from Federal or State sources. We find that these two mortality rates for 2,812 US Counties have remarkably little in common. Thus, for predictive modeling, we use a single "compromise" measure of mortality that has several advantages. The vast majority of our new findings have simple implications that we illustrate graphically.
Submitted 16 May, 2023; v1 submitted 6 March, 2023;
originally announced March 2023.
-
A Novel Mixture Model for Characterizing Human Aiming Performance Data
Authors:
Yanxi Li,
Derek S. Young,
Julien Gori,
Olivier Rioul
Abstract:
Fitts' law is often employed as a predictive model for human movement, especially in the field of human-computer interaction. Models with an assumed Gaussian error structure are usually adequate when applied to data collected from controlled studies. However, observational data (often referred to as data gathered "in the wild") typically display noticeable positive skewness relative to a mean trend as users do not routinely try to minimize their task completion time. As such, the exponentially-modified Gaussian (EMG) regression model has been applied to aimed movements data. However, it is also of interest to reasonably characterize those regions where a user likely was not trying to minimize their task completion time. In this paper, we propose a novel model with a two-component mixture structure -- one Gaussian and one exponential -- on the errors to identify such a region. An expectation-conditional-maximization (ECM) algorithm is developed for estimation of such a model and some properties of the algorithm are established. The efficacy of the proposed model, as well as its ability to inform model-based clustering, are addressed in this work through extensive simulations and an insightful analysis of a human aiming performance study.
Submitted 30 September, 2022;
originally announced September 2022.
-
EPA Particulate Matter Data -- Analyses using Local Control Strategy
Authors:
Robert L. Obenchain,
S. Stanley Young
Abstract:
Statistical Learning methodology for analysis of large collections of cross-sectional observational data can be most effective when the approach used is both Nonparametric and Unsupervised. We illustrate use of our NU Learning approach on 2016 US environmental epidemiology data that we have made freely available. We encourage other researchers to download these data, apply whatever methodology they wish, and contribute to development of a broad-based "consensus view" of potential effects of Secondary Organic Aerosols (volatile organic compounds of predominantly biogenic or anthropogenic origin) within PM2.5 particulate matter on circulatory and/or respiratory mortality. Our analyses here focus on the question: "Are regions with relatively high air-borne biogenic particulate matter also expected to have relatively high circulatory and/or respiratory mortality?"
Submitted 19 December, 2022; v1 submitted 1 September, 2022;
originally announced September 2022.
-
Case Study: Evaluation of a meta-analysis of the association between soy protein and cardiovascular disease
Authors:
S. Stanley Young,
Warren B. Kindzierski,
Douglas Hawkins,
Paul Fogel,
Terry Meyer
Abstract:
It is well known that claims coming from observational studies most often fail to replicate. Experimental (randomized) trials, where conditions are under researcher control, have a high reputation, and meta-analyses of experimental trials are considered the best possible evidence. Given the irreproducibility crisis, experiments lately are starting to be questioned. There is a need to know the reliability of claims coming from randomized trials. A case study is presented here independently examining a published meta-analysis of randomized trials claiming that soy protein intake improves cardiovascular health. Counting and p-value plotting techniques (standard p-value plot, p-value expectation plot, and volcano plot) are used. Counting (search space) analysis indicates that reported p-values from the meta-analysis could be biased low due to multiple testing and multiple modeling. Plotting techniques used to visualize the behavior of the data set used for meta-analysis suggest that statistics drawn from the base papers do not satisfy key assumptions of a random-effects meta-analysis. These assumptions include using unbiased statistics all drawn from the same population. Also, publication bias is unaddressed in the meta-analysis. The claim that soy protein intake should improve cardiovascular health is not supported by our analysis.
Submitted 28 November, 2021;
originally announced December 2021.
-
Evaluation of a meta-analysis of the association between red and processed meat and selected human health effects
Authors:
S. Stanley Young,
Warren Kindzierski
Abstract:
Background: Risk ratios or p-values from multiple, independent studies, observational or randomized, can be computationally combined to provide an overall assessment of a research question in meta-analysis. However, an irreproducibility crisis currently afflicts a wide range of scientific disciplines, including nutritional epidemiology. An evaluation was undertaken to assess the reliability of a meta-analysis examining the association between red and processed meat and selected human health effects (all-cause mortality, cardiovascular mortality, overall cancer mortality, breast cancer incidence, colorectal cancer incidence, type 2 diabetes incidence).
Methods: The number of statistical tests and models were counted in 15 randomly selected base papers (14%) from 105 used in the meta-analysis. Relative risk with 95% confidence limits for 125 risk results were converted to p-values and p-value plots were constructed to evaluate the effect heterogeneity of the p-values.
Results: The number of statistical tests possible in the 15 randomly selected base papers was large, median = 20,736 (interquartile range = 1,728 to 331,776). Each p-value plot for the six selected health effects showed either a random pattern (p-values > 0.05), or a two-component mixture with small p-values < 0.001 while other p-values appeared random. Given potentially large numbers of statistical tests conducted in the 15 selected base papers, questionable research practices cannot be ruled out as explanations for small p-values.
Conclusions: This independent analysis, which complements the findings of the original meta-analysis, finds that the base papers used in the red and processed meat meta-analysis do not provide evidence for the claimed health effects.
Submitted 9 November, 2021;
originally announced November 2021.
-
Standard meta-analysis methods are not robust
Authors:
S. Stanley Young,
Warren B. Kindzierski
Abstract:
P values or risk ratios from multiple, independent studies, observational or randomized, can be computationally combined to provide an overall assessment of a research question in meta-analysis. There is a need to examine the reliability of these methods of combination. It is typical in observational studies to statistically test many questions and not correct the analysis results for multiple testing or multiple modeling, MTMM. The same problem can happen for randomized, experimental trials. There is the additional problem that some of the base studies may be using fabricated or fraudulent data. If there is no attention to MTMM or fraud in the base studies, there is no guarantee that the results to be combined are unbiased, the key requirement for the valid combining of results. We note that methods of combination are not robust; even one extreme base study value can overwhelm standard methods of combination. It is possible that multiple, extreme (MTMM or fraudulent) results can feed from the base studies to bias the combined result. A meta-analysis of observational (or even randomized) studies may not be reliable. Examples are given along with some methods to evaluate existing base studies and meta-analysis studies.
Submitted 26 October, 2021;
originally announced October 2021.
-
Particulate Matter Exposure and Lung Cancer: A Review of two Meta-Analysis Studies
Authors:
S. Stanley Young,
Warren Kindzierski
Abstract:
The current regulatory paradigm is that PM2.5, over time, causes lung cancer. This claim is based on cohort studies and meta-analyses that use cohort studies as their base studies. There is a need to evaluate the reliability of this causal claim. Our idea is to examine the base studies with respect to multiple testing and multiple modeling and to look closer at the meta-analyses using p-value plots. For the two meta-analyses we investigated, some extremely small p-values were observed in some of the base studies, which we think are due to a combination of bias and small standard errors. The p-value plot for one meta-analysis indicates no effect. For the other meta-analysis, we note the p-value plot is consistent with a two-component mixture. Small p-values might be real or due to some combination of p-hacking, publication bias, covariate problems, etc. The large p-values could indicate no real effect, or be wrong due to low power, missing covariates, etc. We conclude that the results are ambiguous at best. These meta-analyses do not establish that PM2.5 is causal of lung tumors.
Submitted 4 November, 2020;
originally announced November 2020.
-
PM2.5 and all-cause mortality
Authors:
S. Stanley Young,
Warren Kindzierski
Abstract:
The US EPA and the WHO claim that PM2.5 is causal of all-cause deaths. Both support and fund research on air quality and health effects. WHO funded a massive systematic review and meta-analyses of air quality and health-effect papers. 1,632 literature papers were reviewed and 196 were selected for meta-analyses. The standard air components, particulate matter, PM10 and PM2.5, nitrogen dioxide, NO2, and ozone, were selected as causes and all-cause and cause-specific mortalities were selected as outcomes. A claim was made for PM2.5 and all-cause deaths, risk ratio of 1.0065, with confidence limits of 1.0044 to 1.0086. There is a need to evaluate the reliability of this causal claim. Based on a p-value plot and discussion of several forms of bias, we conclude that the association is not causal.
Submitted 31 October, 2020;
originally announced November 2020.
-
Reliability of meta-analysis of an association between ambient air quality and development of asthma later in life
Authors:
S. Stanley Young,
Kai-Chieh Cheng,
** Hua Chen,
Shu-Chuan Chen,
Warren B. Kindzierski
Abstract:
Claims from observational studies often fail to replicate. A study was undertaken to assess the reliability of cohort studies used in a highly cited meta-analysis of the association between ambient nitrogen dioxide, NO2, and fine particulate matter, PM2.5, concentrations early in life and development of asthma later in life. The numbers of statistical tests possible were estimated for 19 base papers considered for the meta-analysis. A p-value plot for NO2 and PM2.5 was constructed to evaluate effect heterogeneity of p-values used from the base papers. The numbers of statistical tests possible in the base papers were large - median 13,824, interquartile range 1,536-221,184; range 96-42M, in comparison to statistical test results presented. Statistical test results drawn from the base papers are unlikely to provide unbiased measures for meta-analysis. The p-value plot indicated that heterogeneity of the NO2 results across the base papers is consistent with a two-component mixture. First, it makes no sense to average across a mixture in meta-analysis. Second, the shape of the p-value plot for NO2 appears consistent with the possibility of analysis manipulation to obtain small p-values in several of the cohort studies. As for PM2.5, all corresponding p-values fall on a 45-degree line indicating complete randomness rather than a true association. Our interpretation of the meta-analysis is that the random p-values indicating no cause-effect associations are more plausible and that their meta-analysis will not likely replicate in the absence of bias. We conclude that claims made in the base papers used for meta-analysis are unreliable due to bias induced by multiple testing and multiple modelling, MTMM. We also show there is evidence that the heterogeneity across the base papers used for meta-analysis is more complex than simple sampling from a normal process.
Submitted 18 October, 2020;
originally announced October 2020.
-
Evaluation of a meta-analysis of ambient air quality as a risk factor for asthma exacerbation
Authors:
Warren B. Kindzierski,
S. Stanley Young,
Terry G. Meyer,
John D. Dunn
Abstract:
False-positive results and bias may be common features of the biomedical literature today, including risk factor-chronic disease research. A study was undertaken to assess the reliability of base studies used in a meta-analysis examining whether carbon monoxide, particulate matter 10 and 2.5 micrometers, sulfur dioxide, nitrogen dioxide and ozone are risk factors for asthma exacerbation (hospital admission and emergency room visits for asthma attack). The number of statistical tests and models were counted in 17 randomly selected base papers from 87 used in the meta-analysis. P-value plots for each air component were constructed to evaluate the effect heterogeneity of p-values used from all 87 base papers. The number of statistical tests possible in the 17 selected base papers was large, median=15,360 (interquartile range=1,536 to 40,960), in comparison to results presented. Each p-value plot showed a two-component mixture with small p-values less than .001 while other p-values appeared random (p-values greater than .05). Given potentially large numbers of statistical tests conducted in the 17 selected base papers, p-hacking cannot be ruled out as an explanation for small p-values. Our interpretation of the meta-analysis is that the random p-values indicating null associations are more plausible and that the meta-analysis will not likely replicate in the absence of bias. We conclude the meta-analysis and base papers used are unreliable and do not offer evidence of value to inform public health practitioners about air quality as a risk factor for asthma exacerbation. The following areas are crucial for enabling improvements in risk factor-chronic disease observational studies at the funding agency and journal level: preregistration, changes in funding agency and journal editor (and reviewer) practices, open sharing of data and facilitation of reproducibility research.
Submitted 16 October, 2020;
originally announced October 2020.
-
Exascale Deep Learning to Accelerate Cancer Research
Authors:
Robert M. Patton,
J. Travis Johnston,
Steven R. Young,
Catherine D. Schuman,
Thomas E. Potok,
Derek C. Rose,
Seung-Hwan Lim,
Junghoon Chae,
Le Hou,
Shahira Abousamra,
Dimitris Samaras,
Joel Saltz
Abstract:
Deep learning, through the use of neural networks, has demonstrated remarkable ability to automate many routine tasks when presented with sufficient data for training. The neural network architecture (e.g. number of layers, types of layers, connections between layers, etc.) plays a critical role in determining what, if anything, the neural network is able to learn from the training data. The trend for neural network architectures, especially those trained on ImageNet, has been to grow ever deeper and more complex. The result has been ever increasing accuracy on benchmark datasets with the cost of increased computational demands. In this paper we demonstrate that neural network architectures can be automatically generated, tailored for a specific application, with dual objectives: accuracy of prediction and speed of prediction. Using MENNDL--an HPC-enabled software stack for neural architecture search--we generate a neural network with comparable accuracy to state-of-the-art networks on a cancer pathology dataset that is also $16\times$ faster at inference. The speedup in inference is necessary because of the volume and velocity of cancer pathology data; specifically, the previous state-of-the-art networks are too slow for individual researchers without access to HPC systems to keep pace with the rate of data generation. Our new model enables researchers with modest computational resources to analyze newly generated data faster than it is collected.
Submitted 26 September, 2019;
originally announced September 2019.
-
Many perspectives on Deborah Mayo's "Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars"
Authors:
Andrew Gelman,
Brian Haig,
Christian Hennig,
Art Owen,
Robert Cousins,
Stan Young,
Christian Robert,
Corey Yanofsky,
E. J. Wagenmakers,
Ron Kenett,
Daniel Lakeland
Abstract:
The new book by philosopher Deborah Mayo is relevant to data science for topical reasons, as she takes various controversial positions regarding hypothesis testing and statistical practice, and also as an entry point to thinking about the philosophy of statistics. The present article is a slightly expanded version of a series of informal reviews and comments on Mayo's book. We hope this discussion will introduce people to Mayo's ideas along with other perspectives on the topics she addresses.
Submitted 29 May, 2019; v1 submitted 21 May, 2019;
originally announced May 2019.
-
Evaluation of a meta-analysis of air quality and heart attacks, a case study
Authors:
S. Stanley Young,
Warren B. Kindzierski
Abstract:
It is generally acknowledged that claims from observational studies often fail to replicate. An exploratory study was undertaken to assess the reliability of base studies used in meta-analysis of short-term air quality-myocardial infarction risk and to judge the reliability of statistical evidence from meta-analysis that uses data from observational studies. A highly cited meta-analysis paper examining whether short-term air quality exposure triggers myocardial infarction was evaluated as a case study. The paper considered six air quality components - carbon monoxide, nitrogen dioxide, sulfur dioxide, particulate matter 10 and 2.5 micrometers in diameter (PM10 and PM2.5), and ozone. The number of possible questions and statistical models at issue in each of 34 base papers used were estimated and p-value plots for each of the air components were constructed to evaluate the effect heterogeneity of p-values used from the base papers. Analysis search spaces (number of statistical tests possible) in the base papers were large, median of 12,288, interquartile range: 2,496 to 58,368, in comparison to actual statistical test results presented. Statistical test results taken from the base papers may not provide unbiased measures of effect for meta-analysis. Shapes of p-value plots for the six air components were consistent with the possibility of analysis manipulation to obtain small p-values in several base papers. Results suggest the appearance of heterogeneous, researcher-generated p-values used in the meta-analysis rather than unbiased evidence of real effects for air quality. We conclude that this meta-analysis does not provide reliable evidence for an association of air quality components with myocardial risk.
Submitted 2 April, 2019;
originally announced April 2019.
-
The reliability of an environmental epidemiology meta-analysis, a case study
Authors:
S. Stanley Young,
Mithun Kumar Acharjee,
Kumer Das
Abstract:
Summary
Background Claims made in science papers are coming under increased scrutiny with many claims failing to replicate. Meta-analysis studies that use unreliable observational studies should be in question. We examine the reliability of the base studies used in an air quality/heart attack meta-analysis and the resulting meta-analysis.
Methods A meta-analysis study that includes 14 observational air quality/heart attack studies is examined for its statistical reliability. We use simple counting to evaluate the reliability of the base papers and a p-value plot of the p-values from the base studies to examine study heterogeneity.
Findings We find that the base papers have massive multiple testing and multiple modeling with no statistical adjustments. Statistics coming from the base papers are not guaranteed to be unbiased, a requirement for a valid meta-analysis. There is study heterogeneity for the base papers with strong evidence for so-called p-hacking.
Interpretation We make two observations: there are many claims at issue in each of the 14 base studies so uncorrected multiple testing is a serious issue. We find the base papers and the resulting meta-analysis are unreliable.
Submitted 2 February, 2019;
originally announced February 2019.
-
Deep Learning for Vertex Reconstruction of Neutrino-Nucleus Interaction Events with Combined Energy and Time Data
Authors:
Linghao Song,
Fan Chen,
Steven R. Young,
Catherine D. Schuman,
Gabriel Perdue,
Thomas E. Potok
Abstract:
We present a deep learning approach for vertex reconstruction of neutrino-nucleus interaction events, a problem in the domain of high energy physics. In this approach, we combine both energy and timing data that are collected in the MINERvA detector to perform classification and regression tasks. We show that the resulting network achieves higher accuracy than previous results while requiring a smaller model size and less training time. In particular, the proposed model outperforms the state-of-the-art by 4.00% on classification accuracy. For the regression task, our model achieves 0.9919 on the coefficient of determination, higher than the previous work (0.96).
Submitted 2 February, 2019;
originally announced February 2019.
-
Combined background information for meta-analysis evaluation
Authors:
S. Stanley Young,
Warren Kindzierski
Abstract:
Massive numbers of meta-analysis studies are being published. A Google Scholar search of "systematic review and meta-analysis" returns about 452k hits since 2014. The search was done on Jan 14, 2019. There is a need to have some way to judge the reliability of a positive claim made in a meta-analysis that uses observational studies. Our idea is to examine the quality of the observational studies used in the meta-analysis and to examine the heterogeneity of those studies. We provide background information and examples: a listing of negative studies, a simulation of p-value plots, and multiple examples of p-value plots.
Submitted 15 January, 2019; v1 submitted 13 August, 2018;
originally announced August 2018.
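A p-value plot, as referred to in the abstract above, is commonly built by sorting the p-values from the base studies and plotting them against their rank. When there is no real effect, p-values are uniformly distributed, so the sorted values fall roughly on a straight line. The sketch below is a minimal illustration under that assumption, not the authors' code; the function names `p_value_plot_points` and `simulate_null_p_values` are hypothetical.

```python
import random


def p_value_plot_points(p_values):
    """Return (rank, p) pairs with the p-values sorted ascending; ranks start at 1."""
    ordered = sorted(p_values)
    return list(enumerate(ordered, start=1))


def simulate_null_p_values(n_studies, seed=0):
    """Draw p-values for n_studies under the null hypothesis (uniform on [0, 1))."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_studies)]


if __name__ == "__main__":
    # Hypothetical example: 15 simulated studies with no real effect.
    for rank, p in p_value_plot_points(simulate_null_p_values(15)):
        print(f"{rank:2d}  {p:.3f}")
```

Plotting rank on the horizontal axis and p on the vertical axis gives the p-value plot; a marked departure from a straight line is what the abstract treats as a sign of heterogeneity.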
-
Deep Super Learner: A Deep Ensemble for Classification Problems
Authors:
Steven Young,
Tamer Abdou,
Ayse Bener
Abstract:
Deep learning has become very popular for tasks such as predictive modeling and pattern recognition in handling big data. Deep learning is a powerful machine learning method that extracts lower level features and feeds them forward for the next layer to identify higher level features that improve performance. However, deep neural networks have drawbacks, which include many hyper-parameters and infinite architectures, opaqueness into results, and relatively slower convergence on smaller datasets. While traditional machine learning algorithms can address these drawbacks, they are not typically capable of the performance levels achieved by deep neural networks. To improve performance, ensemble methods are used to combine multiple base learners. Super learning is an ensemble that finds the optimal combination of diverse learning algorithms. This paper proposes deep super learning as an approach which achieves log loss and accuracy results competitive to deep neural networks while employing traditional machine learning algorithms in a hierarchical structure. The deep super learner is flexible, adaptable, and easy to train with good performance across different tasks using identical hyper-parameter values. Using traditional machine learning requires fewer hyper-parameters, allows transparency into results, and has relatively fast convergence on smaller datasets. Experimental results show that the deep super learner has superior performance compared to the individual base learners, single-layer ensembles, and in some cases deep neural networks. Performance of the deep super learner may further be improved with task-specific tuning.
Submitted 6 March, 2018;
originally announced March 2018.
-
A methodology for calculating the latency of GPS-probe data
Authors:
Zhongxiang Wang,
Masoud Hamedi,
Stanley Young
Abstract:
Crowdsourced GPS probe data has been gaining popularity in recent years as a source for real-time traffic information. Efforts have been made to evaluate the quality of such data from different perspectives. One quality indicator of any traffic data source is latency, which describes the punctuality of the data and is critical for real-time operations, emergency response, and traveler information systems. This paper offers a methodology for measuring probe data latency with respect to a selected reference source. Although Bluetooth re-identification data is used as the reference source, the methodology can be applied to any other ground-truth data source of choice (e.g. Automatic License Plate Readers, Electronic Toll Tag). The core of the methodology is a maximum pattern matching algorithm that works with three different fitness objectives. To test the methodology, sample field reference data were collected on multiple freeway segments for a two-week period using portable Bluetooth sensors as ground truth. Equivalent GPS probe data was obtained from a private vendor, and its latency was evaluated. Latency at different times of the day, the impact of the road segmentation scheme on latency, and the sensitivity of the latency to both speed slowdown and recovery from slowdown episodes are also discussed.
Submitted 18 January, 2018;
originally announced January 2018.
-
A Benchmarking Environment for Reinforcement Learning Based Task Oriented Dialogue Management
Authors:
Iñigo Casanueva,
Paweł Budzianowski,
Pei-Hao Su,
Nikola Mrkšić,
Tsung-Hsien Wen,
Stefan Ultes,
Lina Rojas-Barahona,
Steve Young,
Milica Gašić
Abstract:
Dialogue assistants are rapidly becoming an indispensable daily aid. To avoid the significant effort needed to hand-craft the required dialogue flow, the Dialogue Management (DM) module can be cast as a continuous Markov Decision Process (MDP) and trained through Reinforcement Learning (RL). Several RL models have been investigated over recent years. However, the lack of a common benchmarking framework makes it difficult to perform a fair comparison between different models and their capability to generalise to different environments. Therefore, this paper proposes a set of challenging simulated environments for dialogue model development and evaluation. To provide some baselines, we investigate a number of representative parametric algorithms, namely deep reinforcement learning algorithms - DQN, A2C and Natural Actor-Critic and compare them to a non-parametric model, GP-SARSA. Both the environments and policy models are implemented using the publicly available PyDial toolkit and released on-line, in order to establish a testbed framework for further experiments and to facilitate experimental reproducibility.
Submitted 6 April, 2018; v1 submitted 29 November, 2017;
originally announced November 2017.
-
A cross-vendor and cross-state analysis of the GPS-probe data latency
Authors:
Zhongxiang Wang,
Masoud Hamedi,
Elham Sharifi,
Stanley Young
Abstract:
Crowdsourced GPS probe data has become a major source for real-time traffic information applications. In addition to traditional traveler advisory systems such as dynamic message signs (DMS) and 511 systems, probe data is being used for automatic incident detection, Integrated Corridor Management (ICM), end-of-queue warning systems, and mobility-related smartphone applications. Several private sector vendors offer minute-by-minute network-wide travel time and speed probe data. The quality of such data in terms of deviation of the reported travel times and speeds from ground truth has been extensively studied in recent years, and as a result concerns over the accuracy of probe data have mostly faded away. However, the latency of probe data, defined as the lag between the time that the traffic is perturbed and the time that the disturbance in traffic speed is reported in the outsourced data feed, has become a subject of interest. The extent of latency of probe data for real-time applications is critical, so it is important to have a good understanding of the amount of latency and its influencing factors. This paper uses high-quality independent Bluetooth/Wi-Fi re-identification data collected on multiple freeway segments in three different states to measure the latency of the vehicle probe data provided by three major vendors. The statistical distribution of the latency and its sensitivity to speed slowdown and recovery periods are discussed.
Submitted 17 January, 2018; v1 submitted 7 November, 2017;
originally announced November 2017.
-
The reliability of a nutritional meta-analysis study
Authors:
Karl E. Peace,
Jingjing Yin,
Haresh Rochani,
Sarbesh Pandeya,
S. Stanley Young
Abstract:
Background: Many researchers have studied the relationship between diet and health. There are papers showing an association between the consumption of sugar-sweetened beverages and Type 2 diabetes. Many meta-analyses use individual studies that do not adjust for multiple testing or multiple modeling and thus provide biased estimates of effect. Hence the claims reported in a meta-analysis paper may be unreliable if the primary papers do not ensure unbiased estimates of effect. Objective: Determine the statistical reliability of 10 papers and indirectly the reliability of the meta-analysis study. Method: We examined ten primary papers used in a meta-analysis paper and counted the numbers of outcomes, predictors, and covariates. We estimated the size of the potential analysis search space available to the authors of these papers; i.e. the number of comparisons and models available. Since we noticed that there were differences between predictors and covariates cited in the abstract and in the text, we applied this formula to information found in the abstracts, Space A, as well as the text, Space T, of each primary paper. Results: The median and range of the number of comparisons possible across the primary papers are 6.5 and (2-12,288) for the abstracts, and 196,608 and (3,072-117,117,952) for the texts. Note that the median of 6.5 for Space A is misleading as each primary study has 60-165 foods not mentioned in the abstract. Conclusion: Given that testing is at the 0.05 level and the number of comparisons is very large, nominal statistical significance is very weak support for a claim. The claims in these papers are not statistically supported and hence are unreliable. Thus, the claims of the meta-analysis paper lack evidentiary confirmation.
Submitted 5 October, 2017;
originally announced October 2017.
-
Reward-Balancing for Statistical Spoken Dialogue Systems using Multi-objective Reinforcement Learning
Authors:
Stefan Ultes,
Paweł Budzianowski,
Iñigo Casanueva,
Nikola Mrkšić,
Lina Rojas-Barahona,
Pei-Hao Su,
Tsung-Hsien Wen,
Milica Gašić,
Steve Young
Abstract:
Reinforcement learning is widely used for dialogue policy optimization where the reward function often consists of more than one component, e.g., the dialogue success and the dialogue length. In this work, we propose a structured method for finding a good balance between these components by searching for the optimal reward component weighting. To render this search feasible, we use multi-objective reinforcement learning to significantly reduce the number of training dialogues required. We apply our proposed method to find optimized component weights for six domains and compare them to a default baseline.
Submitted 19 July, 2017;
originally announced July 2017.
-
Latent Intention Dialogue Models
Authors:
Tsung-Hsien Wen,
Yishu Miao,
Phil Blunsom,
Steve Young
Abstract:
Developing a dialogue agent that is capable of making autonomous decisions and communicating by natural language is one of the long-term goals of machine learning research. Traditional approaches either rely on hand-crafting a small state-action set for applying reinforcement learning that is not scalable or constructing deterministic models for learning dialogue sentences that fail to capture natural conversational variability. In this paper, we propose a Latent Intention Dialogue Model (LIDM) that employs a discrete latent variable to learn underlying dialogue intentions in the framework of neural variational inference. In a goal-oriented dialogue scenario, these latent intentions can be interpreted as actions guiding the generation of machine responses, which can be further refined autonomously by reinforcement learning. The experimental evaluation of LIDM shows that the model out-performs published benchmarks for both corpus-based and human evaluation, demonstrating the effectiveness of discrete latent variable models for learning goal-oriented dialogues.
Submitted 29 May, 2017;
originally announced May 2017.
-
Conditional Generation and Snapshot Learning in Neural Dialogue Systems
Authors:
Tsung-Hsien Wen,
Milica Gasic,
Nikola Mrksic,
Lina M. Rojas-Barahona,
Pei-Hao Su,
Stefan Ultes,
David Vandyke,
Steve Young
Abstract:
Recently a variety of LSTM-based conditional language models (LM) have been applied across a range of language generation tasks. In this work we study various model architectures and different ways to represent and aggregate the source information in an end-to-end neural dialogue system framework. A method called snapshot learning is also proposed to facilitate learning from supervised sequential signals by applying a companion cross-entropy objective function to the conditioning vector. The experimental and analytical results demonstrate firstly that competition occurs between the conditioning vector and the LM, and the differing architectures provide different trade-offs between the two. Secondly, the discriminative power and transparency of the conditioning vector is key to providing both model interpretability and better performance. Thirdly, snapshot learning leads to consistent performance improvements independent of which architecture is used.
Submitted 10 June, 2016;
originally announced June 2016.
-
A Network-based End-to-End Trainable Task-oriented Dialogue System
Authors:
Tsung-Hsien Wen,
David Vandyke,
Nikola Mrksic,
Milica Gasic,
Lina M. Rojas-Barahona,
Pei-Hao Su,
Stefan Ultes,
Steve Young
Abstract:
Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring costly labelled datasets to solve a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable goal-oriented dialogue system along with a new way of collecting dialogue data based on a novel pipe-lined Wizard-of-Oz framework. This approach allows us to develop dialogue systems easily and without making too many assumptions about the task at hand. The results show that the model can converse with human subjects naturally whilst helping them to accomplish tasks in a restaurant search domain.
Submitted 24 April, 2017; v1 submitted 15 April, 2016;
originally announced April 2016.
-
Bias and response heterogeneity in an air quality data set
Authors:
S. Stanley Young,
Robert L. Obenchain,
Christophe Lambert
Abstract:
It is well-known that claims coming from observational studies often fail to replicate when rigorously re-tested. The technical problems include multiple testing, multiple modeling and bias. Any or all of these problems can give rise to claims that will fail to replicate. There is a need for statistical methods that are easily applied, are easy to understand, and are likely to give reliable results. In particular, simple ways for reducing the influence of bias are essential. In this paper, the Local Control method developed by Robert Obenchain is explicated using a small air quality/longevity data set first analyzed in the New England Journal of Medicine. The benefits of our paper are twofold. First, we describe a reliable strategy for analysis of observational data. Second and importantly, the global claim that longevity increases with improvements in air quality made in the NEJM paper needs to be modified. There is subgroup heterogeneity in the effect of air quality on longevity (one size does not fit all), and this heterogeneity is largely explained by factors other than air quality.
Submitted 7 April, 2015; v1 submitted 3 April, 2015;
originally announced April 2015.
-
Air quality and acute deaths in California, 2000-2012
Authors:
Kenneth K. Lopiano,
Richard L. Smith,
S. Stanley Young
Abstract:
Many studies have sought to determine if there is an association between air quality and acute deaths. Many consider it plausible that current levels of air quality cause acute deaths. However, several factors call causation and even association into question. Observational data sets are large and complex. Multiple testing and multiple modeling can lead to false positive findings. Publication, confirmation and other biases are also possible problems.
Moreover, the fact that most data sets used in studies evaluating the relationships among air quality and public health outcomes are not publicly available makes reproducing the claims nearly impossible. Here we have built and made publicly available a dataset containing daily air quality levels (PM2.5 and ozone), daily temperature levels (minimum and maximum), and daily relative humidity levels for the eight most populous California air basins. We analyzed the dataset using a moving median analysis, a standard time series analysis, and a prediction analysis within the following analysis strategy. We examine the eight air basins separately to see if estimates replicate across locations. We use leave-one-year-out cross-validation analysis to evaluate predictions. Both the moving medians analysis and the standard time series analysis found little evidence for association between air quality and acute deaths. The prediction analysis process was run as a large factorial design using different models and holding out one year at a time. Among the variables used to predict acute death, most of the daily death variability was explained by time of year or weather variables. In summary, the empirical evidence is that current levels of air quality, ozone and PM2.5, are not causally related to acute deaths for California. An empirical and logical case can be made that air quality is not causally related to acute deaths for the rest of the United States.
Submitted 13 May, 2015; v1 submitted 10 February, 2015;
originally announced February 2015.
-
Statistical Modeling in Continuous Speech Recognition (CSR)(Invited Talk)
Authors:
Steve Young
Abstract:
Automatic continuous speech recognition (CSR) is sufficiently mature that a variety of real world applications are now possible including large vocabulary transcription and interactive spoken dialogues. This paper reviews the evolution of the statistical modelling techniques which underlie current-day systems, specifically hidden Markov models (HMMs) and N-grams. Starting from a description of the speech signal and its parameterisation, the various modelling assumptions and their consequences are discussed. It then describes various techniques by which the effects of these assumptions can be mitigated. Despite the progress that has been made, the limitations of current modelling techniques are still evident. The paper therefore concludes with a brief review of some of the more fundamental modelling work now in progress.
Submitted 10 January, 2013;
originally announced January 2013.