-
Applications of Improvements to the Pythagorean Won-Loss Expectation in Optimizing Rosters
Authors:
Alexander F. Almeida,
Kevin Dayaratna,
Steven J. Miller,
Andrew K. Yang
Abstract:
Bill James' Pythagorean formula has for decades done an excellent job estimating a baseball team's winning percentage from very little data: if the average runs scored and allowed are denoted respectively by ${\rm RS}$ and ${\rm RA}$, there is some $\gamma$ such that the winning percentage is approximately ${\rm RS}^\gamma/ ({\rm RS}^\gamma+ {\rm RA}^\gamma)$. One important application is determining the value of different players to the team, as the formula allows us to estimate how many more wins we would have given a fixed increase in run production. We summarize earlier work on the subject, and extend the earlier theoretical model of Miller (who estimated the run distributions as arising from independent Weibull distributions with the same shape parameter; this model has been observed to describe the run data well). We now model runs scored and allowed as being drawn from independent Weibull distributions where the shape parameter is not necessarily the same, and then use the Method of Moments to solve a system of four equations in four unknowns. Doing so yields a predicted winning percentage that is consistently better than earlier models over the last 30 MLB seasons (1994 to 2023). This comes at a small cost: we no longer have a closed form expression but must evaluate a two-dimensional integral of two Weibull distributions and numerically estimate the solutions to the system of equations. As these are trivial to do with simple computational programs, it is well worth adopting this framework and avoiding the issues of implementing the Method of Least Squares or the Method of Maximum Likelihood.
Submitted 20 February, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
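The numerical evaluation mentioned in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the winning percentage is modeled as $P(X > Y)$ for independent Weibull variables $X$ (runs scored) and $Y$ (runs allowed), and it uses the fact that for independent continuous variables the double integral collapses to $\int_0^1 F_Y(F_X^{-1}(u))\,du$. The function name, parameter names, and step count are hypothetical, and the Method of Moments fit of the parameters is not shown.

```python
import math


def weibull_win_probability(scale_rs, shape_rs, scale_ra, shape_ra, steps=20000):
    """Approximate P(X > Y) for independent Weibull X and Y.

    X models runs scored (scale_rs, shape_rs) and Y models runs allowed
    (scale_ra, shape_ra). All names are illustrative assumptions.
    """
    total = 0.0
    for i in range(steps):
        # Midpoint rule on u in (0, 1); u plays the role of F_X(x).
        u = (i + 0.5) / steps
        # Inverse Weibull CDF: x = scale * (-ln(1 - u))^(1 / shape).
        x = scale_rs * (-math.log1p(-u)) ** (1.0 / shape_rs)
        # Weibull CDF of Y evaluated at x gives P(Y < x).
        total += 1.0 - math.exp(-((x / scale_ra) ** shape_ra))
    return total / steps


# Hypothetical parameter values, chosen only for illustration.
equal_shape = weibull_win_probability(5.0, 1.8, 4.5, 1.8)
closed_form = 5.0 ** 1.8 / (5.0 ** 1.8 + 4.5 ** 1.8)
different_shape = weibull_win_probability(5.0, 1.8, 4.5, 1.6)
```

When the two shape parameters agree, the numerical value should match the closed-form Pythagorean expression in the scale parameters, which gives a convenient sanity check before trying unequal shapes.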
-
Lessons from the German Tank Problem
Authors:
George Clark,
Alex Gonye,
Steven J Miller
Abstract:
During World War II the German army used tanks to devastating advantage. The Allies needed accurate estimates of their tank production and deployment. They used two approaches to find these values: spies, and statistics. This note describes the statistical approach. Assuming the tanks are labeled consecutively starting at 1, if we observe $k$ serial numbers from an unknown number $N$ of tanks, with the maximum observed value $m$, then the best estimate for $N$ is $m(1 + 1/k) - 1$. This is now known as the German Tank Problem, and is a terrific example of the applicability of mathematics and statistics in the real world. The first part of the paper reproduces known results, specifically deriving this estimate and comparing its effectiveness to that of the spies. The second part presents a result we have not found in print elsewhere, the generalization to the case where the smallest value is not necessarily 1. We emphasize in detail why we are able to obtain such clean, closed-form expressions for the estimates, and conclude with an appendix highlighting how to use this problem to teach regression and how statistics can help us find functional relationships.
Submitted 21 January, 2021; v1 submitted 19 January, 2021;
originally announced January 2021.
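The estimate $m(1 + 1/k) - 1$ quoted in the abstract above can be tried out with a short sketch. The function name and the sample serial numbers are hypothetical, and the code assumes, as the abstract does, that tanks are labeled consecutively starting at 1.

```python
def estimate_tank_count(serials):
    """Estimate the total number of tanks N from observed serial numbers.

    Uses the estimate m * (1 + 1/k) - 1, where m is the largest observed
    serial number and k is the number of observations.
    """
    if not serials:
        raise ValueError("need at least one observed serial number")
    k = len(serials)
    m = max(serials)
    return m * (1 + 1 / k) - 1


# Hypothetical observations: k = 4 serial numbers with maximum m = 60.
print(estimate_tank_count([19, 40, 42, 60]))  # prints 74.0
```

The estimate always lies above the observed maximum, with the correction shrinking as more serial numbers are observed.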
-
Categorical Co-Frequency Analysis: Clustering Diagnosis Codes to Predict Hospital Readmissions
Authors:
Hallee E. Wong,
Brianna C. Heggeseth,
Steven J. Miller
Abstract:
Accurately predicting patients' risk of 30-day hospital readmission would enable hospitals to efficiently allocate resource-intensive interventions. We develop a new method, Categorical Co-Frequency Analysis (CoFA), for clustering diagnosis codes from the International Classification of Diseases (ICD) according to the similarity in relationships between covariates and readmission risk. CoFA measures the similarity between diagnoses by the frequency with which two diagnoses are split in the same direction versus split apart in random forests to predict readmission risk. Applying CoFA to de-identified data from Berkshire Medical Center, we identified three groups of diagnoses that vary in readmission risk. To evaluate CoFA, we compared readmission risk models using ICD majors and CoFA groups to a baseline model without diagnosis variables. We found substituting ICD majors for the CoFA-identified clusters simplified the model without compromising the accuracy of predictions. Fitting separate models for each ICD major and CoFA group did not improve predictions, suggesting that readmission risk may be more homogeneous than heterogeneous across diagnosis groups.
Submitted 31 August, 2019;
originally announced September 2019.
-
Recovery of the fetal electrocardiogram for morphological analysis from two trans-abdominal channels via optimal shrinkage
Authors:
Pei-Chun Su,
Stephen Miller,
Salim Idriss,
Piers Barker,
Hau-Tieng Wu
Abstract:
We propose a novel algorithm to recover the fetal electrocardiogram (ECG) for both fetal heart rate analysis and morphological analysis of its waveform from two or three trans-abdominal maternal ECG channels. We design an algorithm based on optimal shrinkage and the nonlocal Euclidean median under the wave-shape manifold model. For the fetal heart rate analysis, the algorithm is evaluated on a publicly available database, the 2013 PhysioNet/Computing in Cardiology Challenge, set A. For the morphological analysis, we propose to simulate semi-real databases by mixing the MIT-BIH Normal Sinus Rhythm Database and MITDB Arrhythmia Database. For the fetal R peak detection, the proposed algorithm outperforms all algorithms under comparison. For the morphological analysis, the algorithm provides an encouraging result in recovery of the fetal ECG waveform, including PR, QT and ST intervals, even when the fetus has arrhythmia. To the best of our knowledge, this is the first work focusing on recovering the fetal ECG for morphological analysis from two or three channels with an algorithm potentially applicable to continuous fetal electrocardiographic monitoring, which creates the potential for long-term monitoring.
Submitted 8 August, 2019; v1 submitted 20 April, 2019;
originally announced April 2019.
-
Use of Modality and Negation in Semantically-Informed Syntactic MT
Authors:
Kathryn Baker,
Michael Bloodgood,
Bonnie J. Dorr,
Chris Callison-Burch,
Nathaniel W. Filardo,
Christine Piatko,
Lori Levin,
Scott Miller
Abstract:
This paper describes the resource- and system-building efforts of an eight-week Johns Hopkins University Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, the creation of a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation) and a holder (an experiencer of modality). We describe how our MN lexicon was semi-automatically produced and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set.
We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. While the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described in this paper. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality.
Submitted 5 February, 2015;
originally announced February 2015.
-
Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach
Authors:
Kathryn Baker,
Michael Bloodgood,
Chris Callison-Burch,
Bonnie J. Dorr,
Nathaniel W. Filardo,
Lori Levin,
Scott Miller,
Christine Piatko
Abstract:
We describe a unified and coherent syntactic framework for supporting a semantically-informed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English translation task. This finding supports the hypothesis (posed by many researchers in the MT community, e.g., in DARPA GALE) that both syntactic and semantic information are critical for improving translation quality---and further demonstrates that large gains can be achieved for low-resource languages with different word order than English.
Submitted 24 September, 2014;
originally announced September 2014.
-
Relieving and Readjusting Pythagoras
Authors:
Victor Luo,
Steven J. Miller
Abstract:
Bill James invented the Pythagorean expectation in the late 1970s to predict a baseball team's winning percentage knowing just their runs scored and allowed. His original formula estimates a winning percentage of ${\rm RS}^2/({\rm RS}^2+{\rm RA}^2)$, where ${\rm RS}$ stands for runs scored and ${\rm RA}$ for runs allowed; later versions found better agreement with data by replacing the exponent 2 with numbers near 1.83. Miller and his colleagues provided a theoretical justification by modeling runs scored and allowed by independent Weibull distributions. They showed that a single Weibull distribution did a very good job of describing runs scored and allowed, and led to a predicted won-loss percentage of $({\rm RS_{\rm obs}}-1/2)^\gamma/ (({\rm RS_{\rm obs}}-1/2)^\gamma+ ({\rm RA_{\rm obs}}-1/2)^\gamma)$, where ${\rm RS_{\rm obs}}$ and ${\rm RA_{\rm obs}}$ are the observed runs scored and allowed and $\gamma$ is the shape parameter of the Weibull (typically close to 1.8). We show a linear combination of Weibulls more accurately determines a team's run production and increases the prediction accuracy of a team's winning percentage by an average of about 25% (thus while the currently used variants of the original predictor are accurate to about four games a season, the new combination is accurate to about three). The new formula is more involved computationally; however, it can be easily computed on a laptop in a matter of minutes from publicly available season data. It performs as well as (or slightly better than) the related Pythagorean formulas in use, and has the additional advantage of having a theoretical justification for its parameter values (and not just an optimization of parameters to minimize prediction error).
Submitted 16 June, 2014; v1 submitted 12 June, 2014;
originally announced June 2014.
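The single-Weibull prediction quoted in the abstract above, with its shift of $1/2$, can be evaluated with a few lines of code. This is a sketch only: the function names, the default $\gamma = 1.8$, the 162-game season length, and the sample run averages are assumptions for illustration, and the paper's linear combination of Weibulls is not implemented here.

```python
def shifted_pythagorean(rs_obs, ra_obs, gamma=1.8):
    """Predicted winning percentage from observed average runs per game.

    Implements (RS_obs - 1/2)^gamma / ((RS_obs - 1/2)^gamma + (RA_obs - 1/2)^gamma).
    Both averages must exceed 1/2 for the powers to be well defined.
    """
    if rs_obs <= 0.5 or ra_obs <= 0.5:
        raise ValueError("observed run averages must exceed 1/2")
    scored = (rs_obs - 0.5) ** gamma
    allowed = (ra_obs - 0.5) ** gamma
    return scored / (scored + allowed)


def expected_wins(rs_obs, ra_obs, games=162, gamma=1.8):
    """Scale the predicted winning percentage to a season of `games` games."""
    return games * shifted_pythagorean(rs_obs, ra_obs, gamma)


# Hypothetical team averaging 5.0 runs scored and 4.5 runs allowed per game.
wins = expected_wins(5.0, 4.5)
```

Comparing `expected_wins` before and after a fixed change in `rs_obs` gives a rough estimate of how many wins that change in run production is worth.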
-
Pythagoras at the Bat
Authors:
Steven J. Miller,
Taylor Corcoran,
Jennifer Gossels,
Victor Luo,
Jaclyn Porfilio
Abstract:
The Pythagorean formula is one of the most popular ways to measure the true ability of a team. It is very easy to use, estimating a team's winning percentage from the runs they score and allow. This data is readily available on standings pages; no computationally intensive simulations are needed. Normally accurate to within a few games per season, it allows teams to determine how much a run is worth in different situations. This determination helps solve some of the most important economic decisions a team faces: how much a player is worth, which players should be pursued, and how much they should be offered. We discuss the formula and these applications in detail, and provide a theoretical justification, both for the formula and for simpler linear estimators of a team's winning percentage. The calculations and modeling are discussed in detail, and when possible multiple proofs are given. We analyze the 2012 season in detail, and see that the data for that and other recent years support our modeling conjectures. We conclude with a discussion of work in progress to generalize the formula and increase its predictive power \emph{without} needing expensive simulations, though at the cost of requiring play-by-play data.
Submitted 29 May, 2014;
originally announced June 2014.
-
The Pythagorean Won-Loss Formula and Hockey: A Statistical Justification for Using the Classic Baseball Formula as an Evaluative Tool in Hockey
Authors:
Kevin D. Dayaratna,
Steven J. Miller
Abstract:
Originally devised for baseball, the Pythagorean Won-Loss formula estimates the percentage of games a team should have won at a particular point in a season. For decades, this formula had no mathematical justification. In 2006, Steven Miller provided a statistical derivation by making some heuristic assumptions about the distributions of runs scored and allowed by baseball teams. We make a similar set of assumptions about hockey teams and show that the formula is just as applicable to hockey as it is to baseball. We hope that this work spurs research in the use of the Pythagorean Won-Loss formula as an evaluative tool for sports outside baseball.
Submitted 18 October, 2013; v1 submitted 8 August, 2012;
originally announced August 2012.
-
First Order Approximations of the Pythagorean Won-Loss Formula for Predicting MLB Teams' Winning Percentages
Authors:
Kevin D. Dayaratna,
Steven J. Miller
Abstract:
We mathematically prove that an existing linear predictor of baseball teams' winning percentages (Jones and Tappin 2005) is simply a first-order approximation to Bill James' Pythagorean Won-Loss formula and can thus be written in terms of the formula's well-known exponent. We estimate the linear model on twenty seasons of Major League Baseball data and are able to verify that the resulting coefficient estimate, with 95% confidence, is virtually identical to the empirically accepted value of 1.82. Our work thus helps explain why this simple and elegant model is such a strong linear predictor.
Submitted 21 May, 2012;
originally announced May 2012.
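To illustrate what a first-order approximation means for the abstract above, the following worked expansion linearizes the Pythagorean formula about a point where runs scored and allowed are equal. The expansion point $R$ is an assumption made for this illustration; the paper's own parametrization of the linear predictor may differ.

```latex
f({\rm RS}, {\rm RA}) = \frac{{\rm RS}^\gamma}{{\rm RS}^\gamma + {\rm RA}^\gamma},
\qquad
\frac{\partial f}{\partial {\rm RS}}
  = \frac{\gamma\, {\rm RS}^{\gamma-1}\, {\rm RA}^{\gamma}}{({\rm RS}^\gamma + {\rm RA}^\gamma)^2},
\qquad
\frac{\partial f}{\partial {\rm RA}}
  = -\frac{\gamma\, {\rm RS}^{\gamma}\, {\rm RA}^{\gamma-1}}{({\rm RS}^\gamma + {\rm RA}^\gamma)^2}.

% Evaluate at RS = RA = R:
f(R, R) = \frac{1}{2},
\qquad
\left.\frac{\partial f}{\partial {\rm RS}}\right|_{(R,R)} = \frac{\gamma}{4R},
\qquad
\left.\frac{\partial f}{\partial {\rm RA}}\right|_{(R,R)} = -\frac{\gamma}{4R}.

% First-order Taylor expansion:
f({\rm RS}, {\rm RA}) \approx \frac{1}{2} + \frac{\gamma}{4R}\,({\rm RS} - {\rm RA}).
```

With $\gamma = 1.82$ and a hypothetical $R = 4.5$ runs per game, the slope $\gamma/(4R)$ is about 0.10, so in this linearization each additional run of per-game differential shifts the predicted winning percentage by roughly ten percentage points.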