-
"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity
Authors:
William H. Press
Abstract:
We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that "learns" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.
Submitted 7 June, 2024;
originally announced June 2024.
-
NYSE Price Correlations Are Arbitrageable Over Hours and Predictable Over Years
Authors:
William H. Press
Abstract:
Trade prices of about 1000 New York Stock Exchange-listed stocks are studied at one-minute time resolution over the continuous five year period 2018--2022. For each stock, in dollar-volume-weighted transaction time, the discrepancy from a Brownian-motion martingale is measured on timescales of minutes to several days. The result is well fit by a power-law shot-noise (or Gaussian) process with Hurst exponent 0.465, that is, slightly mean-reverting. As a check, we execute an arbitrage strategy on simulated Hurst-exponent data, and a comparable strategy in backtesting on the actual data, obtaining similar results (annualized returns $\sim 60$\% if zero transaction costs). Next examining the cross-correlation structure of the $\sim 1000$ stocks, we find that, counterintuitively, correlations increase with time lag in the range studied. We show that this behavior can be quantitatively explained if the mean-reverting Hurst component of each stock is uncorrelated, i.e., does not share that stock's overall correlation with other stocks. Overall, we find that $\approx 45$\% of a stock's 1-hour returns variance is explained by its particular correlations to other stocks, but that most of this is simply explained by the movement of all stocks together. Unexpectedly, the fraction of variance explained is greatest when price volatility is high, for example during COVID-19 year 2020. An arbitrage strategy with cross-correlations does significantly better than without (annualized returns $\sim 100$\% if zero transaction costs). Measured correlations from any single year in 2018--2022 are about equally good in predicting all the other years, indicating that an overall correlation structure is persistent over the whole period.
Submitted 14 May, 2023;
originally announced May 2023.
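The Hurst exponent quoted in this abstract can be illustrated with a simple variance-scaling estimate. The sketch below is not the paper's procedure (which works in dollar-volume-weighted transaction time and fits a shot-noise process); it only assumes the textbook relation Var[x(t+lag) - x(t)] ~ lag^(2H), and the function name and lag choices are hypothetical.

```python
import math
import random

def hurst_from_increments(path, lags=(1, 2, 4, 8, 16)):
    """Estimate H from the scaling Var[x(t+lag) - x(t)] ~ lag**(2H)."""
    xs, ys = [], []
    for lag in lags:
        diffs = [path[i + lag] - path[i] for i in range(len(path) - lag)]
        mean = sum(diffs) / len(diffs)
        var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
        xs.append(math.log(lag))
        ys.append(math.log(var))
    # Ordinary least-squares slope of log-variance against log-lag.
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return (num / den) / 2.0

# Simulated Gaussian random walk (illustrative data, not market prices).
rng = random.Random(0)
walk = [0.0]
for _ in range(20000):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))
h_walk = hurst_from_increments(walk)
```

For an ordinary random walk the estimate should come out close to 0.5; a value below 0.5, such as the 0.465 reported above, indicates mild mean reversion.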
-
Optimal Cross-Correlation Estimates from Asynchronous Tick-by-Tick Trading Data
Authors:
William H. Press
Abstract:
Given two time series, A and B, sampled asynchronously at different times {t_A_i} and {t_B_j}, termed "ticks", how can one best estimate the correlation coefficient ρ between changes in A and B? We derive a natural, minimum-variance estimator that does not use any interpolation or binning, then derive from it a fast (linear time) estimator that is demonstrably nearly as good. This "fast tickwise estimator" is compared in simulation to the usual method of interpolating changes to a regular grid. Even when the grid spacing is optimized for the particular parameters (not often possible in practice), the fast tickwise estimator has generally smaller estimation errors, often by a large factor. These results are directly applicable to tick-by-tick price data of financial assets.
Submitted 18 March, 2023;
originally announced March 2023.
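For orientation, the baseline that this abstract compares against ("interpolating changes to a regular grid") might look like the sketch below. It is not the paper's tickwise estimator; it assumes previous-tick interpolation and a plain Pearson correlation of gridded changes, and all function names are hypothetical.

```python
import bisect
import math

def previous_tick(times, values, t):
    """Most recent value at or before time t (times must be sorted)."""
    i = bisect.bisect_right(times, t) - 1
    return values[max(i, 0)]

def grid_correlation(t_a, a, t_b, b, spacing):
    """Pearson correlation of changes in A and B on a regular grid."""
    start = max(t_a[0], t_b[0])
    stop = min(t_a[-1], t_b[-1])
    n = int((stop - start) / spacing)
    grid = [start + k * spacing for k in range(n + 1)]
    # Changes of each series between consecutive grid points.
    da = [previous_tick(t_a, a, grid[k + 1]) - previous_tick(t_a, a, grid[k])
          for k in range(n)]
    db = [previous_tick(t_b, b, grid[k + 1]) - previous_tick(t_b, b, grid[k])
          for k in range(n)]
    ma = sum(da) / n
    mb = sum(db) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(da, db))
    va = sum((x - ma) ** 2 for x in da)
    vb = sum((y - mb) ** 2 for y in db)
    return cov / math.sqrt(va * vb)

# Toy ticks: B is an exact linear function of A, sampled at the same times.
ticks = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
series_a = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
series_b = [2.0 * v + 1.0 for v in series_a]
rho_demo = grid_correlation(ticks, series_a, ticks, series_b, spacing=1.0)
```

The result depends on `spacing`, which is the tuning parameter the abstract notes is rarely optimizable in practice.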
-
Should the Endless Frontier of Federal Science be Expanded?
Authors:
David Baltimore,
Robert Conn,
William H Press,
Thomas Rosenbaum,
David N Spergel,
Shirley M Tilghman,
Harold Varmus
Abstract:
Scientific research in the United States could receive a large increase in federal funding -- up to 100 billion dollars over five years -- if proposed legislation entitled the Endless Frontier Act becomes law. This bipartisan and bicameral bill, introduced in May 2020 by Senators Chuck Schumer (D-NY) and Todd Young (R-IN) and Congressmen Ro Khanna (D-CA) and Mike Gallagher (R-WI), is intended to expand the funding of the physical sciences, engineering, and technology at the National Science Foundation (NSF) and create a new Technology Directorate focused on use-inspired research. In addition to provisions to protect the NSF's current missions, a minimum of 15\% of the newly appropriated funds would be used to enhance NSF's basic science portfolio. The Endless Frontier Act offers a rare opportunity to enhance the breadth and financial support of the American research enterprise. In this essay, we consider the benefits and the liabilities of the proposed legislation and recommend changes that would further strengthen it.
Submitted 17 March, 2021;
originally announced March 2021.
-
Likelihood Models for Forensic Genealogy
Authors:
William H. Press,
John Hawkins
Abstract:
In the idealized Morgan model of crossover, we study the probability distributions of shared DNA (identical by descent) between individuals having a wide range of relationships (not just lineal descendants), especially cases for which previous work produces inaccurate results. Using Monte Carlo simulation, we show that a particular, complicated functional form with just one continuous fitted parameter accurately approximates the distributions in all cases tried. Analysis of that functional form shows that it is close to a normal distribution, not in shared fraction f, but in the square-root of f. We describe a multivariate normal model in this variable for use as a practical framework for several general tasks in forensic genealogy that are currently done by less-accurate and less well-founded methods.
Submitted 6 October, 2020;
originally announced October 2020.
-
An Indel-Resistant Error-Correcting Code for DNA-Based Information Storage
Authors:
William H. Press,
John A. Hawkins
Abstract:
Synthetic DNA can in principle be used for the archival storage of arbitrary data. Because errors are introduced during DNA synthesis, storage, and sequencing, an error-correcting code (ECC) is necessary for error-free recovery of the data. Previous work has utilized ECCs that can correct substitution errors, but not insertion or deletion errors (indels), instead relying on sequencing depth and multiple alignment to detect and correct indels -- in effect an inefficient multiple-repetition code. This paper describes an ECC, termed "HEDGES", that corrects simultaneously for substitutions, insertions, and deletions in a single read. Varying code rates allow for correction of up to ~10% nucleotide errors and achieve 50% or better of the estimated Shannon limit.
Submitted 3 December, 2018;
originally announced December 2018.
-
Density-Dependent Luminosity Functions for Galaxies in the Las Campanas Redshift Survey
Authors:
Benjamin C. Bromley,
William H. Press,
Huan Lin,
Robert P. Kirshner
Abstract:
Galaxies in the Las Campanas Redshift Survey are classified according to their spectra, and the resulting spectral types are analyzed to determine if local environment affects their properties. We find that the luminosity function of early-type objects varies as a function of local density. Our results suggest that early-type galaxies (presumably ellipticals and S0's) are, on average, fainter when they are located in high-density regions of the Universe. The same effect may operate for some, but not all, late-type galaxies. We discuss the implications of this result for theories of galaxy formation and evolution.
Submitted 14 May, 1998;
originally announced May 1998.
-
Magnification Ratio of the Fluctuating Light in Gravitational Lens 0957+561
Authors:
William H. Press,
George B. Rybicki
Abstract:
Radio observations establish the B/A magnification ratio of gravitational lens 0957+561 at about 0.75. Yet, for more than 15 years, the optical magnification ratio has been between 0.9 and 1.12. The accepted explanation is microlensing of the optical source. However, this explanation is mildly discordant with (i) the relative constancy of the optical ratio, and (ii) recent data indicating possible non-achromaticity in the ratio. To study these issues, we develop a statistical formalism for separately measuring, in a unified manner, the magnification ratio of the fluctuating and constant parts of the light curve. Applying the formalism to the published data of Kundić et al. (1997), we find that the magnification ratios of fluctuating parts in both the g and r colors agree with the magnification ratio of the constant part in g-band, and tend to disagree with the r-band value. One explanation could be about 0.1 mag of consistently unsubtracted r light from the lensing galaxy G1, which seems unlikely. Another could be that 0957+561 is approaching a caustic in the microlensing pattern.
Submitted 17 March, 1998;
originally announced March 1998.
-
Spectral Classification and Luminosity Function of Galaxies in the Las Campanas Redshift Survey
Authors:
Benjamin C. Bromley,
William H. Press,
Huan Lin,
Robert P. Kirshner
Abstract:
We construct a spectral classification scheme for the galaxies of the Las Campanas Redshift Survey (LCRS) based on a principal component analysis of the measured galaxy spectra. We interpret the physical significance of our six spectral types and conclude that they are sensitive to morphological type and the amount of active star formation. In this first analysis of the LCRS to include spectral classification, we estimate the general luminosity function, expressed as a weighted sum of the type-specific luminosity functions. In the R-band magnitude range of -23 < M <= -16.5, this function exhibits a broad shoulder centered near M = -20, and an increasing faint-end slope which formally converges on an alpha value of about -1.8 in the faint limit. The Schechter parameterization does not provide a good representation in this case, a fact which may partly explain the reported discrepancy between the luminosity functions of the LCRS and other redshift catalogs such as the Century Survey (Geller et al. 1997). The discrepancy may also arise from environmental effects such as the density-morphology relationship for which we see strong evidence in the LCRS galaxies. However, the Schechter parameterization is more effective for the luminosity functions of the individual spectral types. The data show a significant, progressive steepening of the faint-end slope, from alpha = +0.5 for early-type objects, to alpha = -1.8 for the extreme late-type galaxies. The extreme late-type population has a sufficiently high space density that its contribution to the general luminosity function is expected to dominate fainter than M = -16. We conclude that an evaluation of type-dependence is essential to any assessment of the general luminosity function.
Submitted 14 May, 1998; v1 submitted 19 November, 1997;
originally announced November 1997.
-
Understanding Data Better with Bayesian and Global Statistical Methods
Authors:
William H. Press
Abstract:
To understand their data better, astronomers need to use statistical tools that are more advanced than traditional ``freshman lab'' statistics. As an illustration, the problem of combining apparently incompatible measurements of a quantity is presented from both the traditional, and a more sophisticated Bayesian, perspective. Explicit formulas are given for both treatments. Results are shown for the value of the Hubble Constant, and a 95% confidence interval of 66 < H0 < 82 (km/s/Mpc) is obtained.
Submitted 22 April, 1996;
originally announced April 1996.
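As a point of reference for the "traditional" treatment mentioned in this abstract, the sketch below combines measurements by an inverse-variance weighted mean and reports chi-square as a compatibility check. This is an assumption about what "traditional" means here, not the paper's explicit formulas, and the input numbers are made up for illustration (they are not Hubble constant measurements).

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean, its standard error, and chi-square."""
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / wsum
    sigma = math.sqrt(1.0 / wsum)
    # Chi-square of the measurements about the combined value;
    # compare with len(values) - 1 degrees of freedom.
    chi2 = sum(((v - mean) / s) ** 2 for v, s in zip(values, sigmas))
    return mean, sigma, chi2

# Two illustrative measurements that disagree by 4 sigma each way.
mean_demo, sigma_demo, chi2_demo = weighted_mean([10.0, 14.0], [1.0, 1.0])
```

Here chi-square is 8 for one degree of freedom, so the quoted error on the combined value understates the real uncertainty. That is the kind of apparent incompatibility the Bayesian treatment in the paper is meant to address.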
-
Determining the Motion of the Local Group Using SN Ia Light Curve Shapes
Authors:
Adam G. Riess,
William H. Press,
Robert P. Kirshner
Abstract:
We have measured our Galaxy's motion relative to distant galaxies in which type Ia supernovae (SN Ia) have been observed. The effective recession velocity of this sample is 7000 km s$^{-1}$, which approaches the depth of the survey of brightest cluster galaxies by Lauer and Postman (1994). We use the Light Curve Shape (LCS) method for deriving distances to SN Ia, providing relative distance estimates to individual supernovae with a precision of $\sim$ 5\% (Riess, Press, \& Kirshner 1995). Analyzing the distribution on the sky of velocity residuals from a pure Hubble flow for 13 recent SN Ia drawn primarily from the Calán/Tololo survey (Hamuy 1993a, 1994, 1995a, 1995b, Maza et al. 1994), we find the best solution for the motion of the Local Group in this frame is 600 $\pm 350$ km s$^{-1}$ in the direction {\it l}=260$°$ b=+54$°$ with a 1 $σ$ error ellipse that measures 90$°$ by 25$°$. This solution is consistent with the rest frame of the cosmic microwave background (CMB) as determined by the Cosmic Background Explorer (COBE) measurement of the dipole temperature anisotropy (Smoot et al. 1992). It is inconsistent with the velocity observed by Lauer and Postman.
Submitted 6 December, 1994;
originally announced December 1994.
-
Using SN Ia Light Curve Shapes to Measure The Hubble Constant
Authors:
Adam G. Riess,
William H. Press,
Robert P. Kirshner
Abstract:
We present an empirical method which uses visual band light curve shapes (LCS) to estimate the luminosity of type Ia supernovae (SN Ia). This method is first applied to a ``training set'' of 8 SN Ia light curves with independent distance estimates to derive the correlation between the LCS and the luminosity. We employ a linear estimation algorithm of the type developed by Rybicki and Press (1992). The result is similar to that obtained by Hamuy et al. (1995a) with the advantage that LCS produces quantitative error estimates for the distance. We then examine the light curves for 13 SN Ia to determine the LCS distances of these supernovae. The Hubble diagram constructed using these LCS distances has a remarkably small dispersion of $σ_V$ = 0.21 mag. We use the light curve of SN 1972E and the Cepheid distance to NGC 5253 to derive $67 \pm 7 $ km s$^{-1}$ Mpc$^{-1}$ for the Hubble constant.
Submitted 18 October, 1994;
originally announced October 1994.
-
A Class of Fast Methods for Processing Irregularly Sampled or Otherwise Inhomogeneous One-Dimensional Data
Authors:
George B. Rybicki,
William H. Press
Abstract:
With the ansatz that a data set's correlation matrix has a certain parametrized form (one general enough, however, to allow the arbitrary specification of a slowly-varying decorrelation distance and population variance) the general machinery of Wiener or optimal filtering can be reduced from $O(n^3)$ to $O(n)$ operations, where $n$ is the size of the data set. The implied vast increases in computational speed can allow many common sub-optimal or heuristic data analysis methods to be replaced by fast, relatively sophisticated, statistical algorithms. Three examples are given: data rectification, high- or low- pass filtering, and linear least squares fitting to a model with unaligned data points.
Submitted 20 May, 1994;
originally announced May 1994.
-
Properties of High-Redshift Lyman Alpha Clouds II. Statistical Properties of the Clouds
Authors:
William H. Press,
George B. Rybicki
Abstract:
Curve of growth analysis, applied to the Lyman series absorption ratios deduced in our previous paper, yields a measurement of the logarithmic slope of distribution of \Lya\ clouds in column density $N$. The observed exponential distribution of the clouds' equivalent widths $W$ is then shown to require a broad distribution of velocity parameters $b$, extending up to 80 km s$^{-1}$. We show how the exponential itself emerges in a natural way. An absolute normalization for the differential distribution of cloud numbers in $z$, $N$, and $b$ is obtained. By detailed analysis of absorption fluctuations along the line of sight we are able to put upper limits on the cloud-cloud correlation function $ξ$ on several megaparsec length scales. We show that observed $b$ values, if thermal, are incompatible, in several different ways, with the hypothesis of equilibrium heating and ionization by a background UV flux. Either a significant component of $b$ is due to bulk motion (which we argue against on several grounds), or else the clouds are out of equilibrium, and hotter than is implied by their ionization state, a situation which could be indicative of recent adiabatic collapse.
Submitted 29 March, 1993;
originally announced March 1993.
-
Properties of High-Redshift Lyman Alpha Clouds I. Statistical Analysis of the SSG Quasars
Authors:
William H. Press,
George B. Rybicki,
Donald P. Schneider
Abstract:
Techniques for the statistical analysis of the \Lya\ forest in high redshift quasars are developed, and applied to the low resolution (25 Å) spectra of 29 of the 33 quasars in the Schneider-Schmidt-Gunn (SSG) sample. We find that the mean absorption increases with $z$ approximately as a power law $(1+z)^{γ+1}$ with $γ = 2.46\pm 0.37$. The mean ratio of \Lya\ to Lyman $β$ absorption in the clouds is $0.476\pm 0.054$. We also detect, and obtain ratios, for Lyman $β$, $γ$, and possibly $ε$. We are also able to quantify the fluctuations of the absorption around its mean, and find that these are comparable to, or perhaps slightly larger than, those expected from an uncorrelated distribution of clouds. The techniques in this paper, which include the use of bootstrap resampling of the quasar sample to obtain estimated errors and error covariances, and a mathematical treatment of absorption from a (possibly non-uniform) stochastic distribution of lines, should be applicable to future, more extensive, data sets.
Submitted 29 March, 1993;
originally announced March 1993.
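The bootstrap resampling mentioned in this abstract can be sketched generically as follows. This is a minimal illustration of the technique, not the paper's analysis: the per-object values are placeholders, and `bootstrap_error` and `sample_mean` are hypothetical names.

```python
import math
import random

def sample_mean(xs):
    return sum(xs) / len(xs)

def bootstrap_error(sample, statistic, n_resamples=2000, seed=0):
    """Standard error of `statistic`, from resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    stats = []
    for _ in range(n_resamples):
        # Draw n objects with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(resample))
    mean = sum(stats) / len(stats)
    var = sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)
    return math.sqrt(var)

# Placeholder data: one number per object in a 29-object sample.
per_object = [float(k) for k in range(29)]
err_demo = bootstrap_error(per_object, sample_mean)
```

Resampling whole objects (here, whole quasars) rather than individual data points keeps correlations within each object intact, which is why the same scheme can also yield error covariances between several statistics computed on each resample.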