Search | arXiv e-print repository

"Minus-One" Data Prediction Generates Synthetic Census Data with Good Crosstabulation Fidelity

Abstract: We propose to capture relevant statistical associations in a dataset of categorical survey responses by a method, here termed MODP, that "learns" a probabilistic prediction function L. Specifically, L predicts each question's response based on the same respondent's answers to all the other questions. Draws from the resulting probability distribution become synthetic responses. Applying this methodology to the PUMS subset of Census ACS data, and with a learned L akin to multiple parallel logistic regression, we generate synthetic responses whose crosstabulations (two-point conditionals) are found to have a median accuracy of ~5% across all crosstabulation cells, with cell counts ranging over four orders of magnitude. We investigate and attempt to quantify the degree to which the privacy of the original data is protected.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 35 pages, 17 figures, 6 tables

MSC Class: 62P25 ACM Class: J.4

arXiv:2305.08241 [pdf, other]

NYSE Price Correlations Are Abitrageable Over Hours and Predictable Over Years

Authors: William H. Press

Abstract: Trade prices of about 1000 New York Stock Exchange-listed stocks are studied at one-minute time resolution over the continuous five year period 2018--2022. For each stock, in dollar-volume-weighted transaction time, the discrepancy from a Brownian-motion martingale is measured on timescales of minutes to several days. The result is well fit by a power-law shot-noise (or Gaussian) process with Hurst exponent 0.465, that is, slightly mean-reverting. As a check, we execute an arbitrage strategy on simulated Hurst-exponent data, and a comparable strategy in backtesting on the actual data, obtaining similar results (annualized returns $\sim 60$\% if zero transaction costs). Next examining the cross-correlation structure of the $\sim 1000$ stocks, we find that, counterintuitively, correlations increase with time lag in the range studied. We show that this behavior can be quantitatively explained if the mean-reverting Hurst component of each stock is uncorrelated, i.e., does not share that stock's overall correlation with other stocks. Overall, we find that $\approx 45$\% of a stock's 1-hour returns variance is explained by its particular correlations to other stocks, but that most of this is simply explained by the movement of all stocks together. Unexpectedly, the fraction of variance explained is greatest when price volatility is high, for example during COVID-19 year 2020. An arbitrage strategy with cross-correlations does significantly better than without (annualized returns $\sim 100$\% if zero transaction costs). Measured correlations from any single year in 2018--2022 are about equally good in predicting all the other years, indicating that an overall correlation structure is persistent over the whole period.

Submitted 14 May, 2023; originally announced May 2023.

Comments: 48 pages, 21 figures, 2 tables

MSC Class: 91G15

arXiv:2303.16153 [pdf, other]

Optimal Cross-Correlation Estimates from Asynchronous Tick-by-Tick Trading Data

Authors: William H. Press

Abstract: Given two time series, A and B, sampled asynchronously at different times {t_A_i} and {t_B_j}, termed "ticks", how can one best estimate the correlation coefficient ρ between changes in A and B? We derive a natural, minimum-variance estimator that does not use any interpolation or binning, then derive from it a fast (linear time) estimator that is demonstrably nearly as good. This "fast tickwise estimator" is compared in simulation to the usual method of interpolating changes to a regular grid. Even when the grid spacing is optimized for the particular parameters (not often possible in practice), the fast tickwise estimator has generally smaller estimation errors, often by a large factor. These results are directly applicable to tick-by-tick price data of financial assets.

Submitted 18 March, 2023; originally announced March 2023.

Comments: 21 pages, 6 figures, 3 tables

MSC Class: 91G15

arXiv:2103.09614 [pdf]

Should the Endless Frontier of Federal Science be Expanded?

Authors: David Baltimore, Robert Conn, William H Press, Thomas Rosenbaum, David N Spergel, Shirley M Tilghman, Harold Varmus

Abstract: Scientific research in the United States could receive a large increase in federal funding -- up to 100 billion dollars over five years -- if proposed legislation entitled the Endless Frontier Act becomes law. This bipartisan and bicameral bill, introduced in May 2020 by Senators Chuck Schumer (D-NY) and Todd Young (R-IN) and Congressmen Ro Khanna (D-CA) and Mike Gallagher (R-WI), is intended to expand the funding of the physical sciences, engineering, and technology at the National Science Foundation (NSF) and create a new Technology Directorate focused on use-inspired research. In addition to provisions to protect the NSF's current missions, a minimum of 15\% of the newly appropriated funds would be used to enhance NSF's basic science portfolio. The Endless Frontier Act offers a rare opportunity to enhance the breadth and financial support of the American research enterprise. In this essay, we consider the benefits and the liabilities of the proposed legislation and recommend changes that would further strengthen it.

Submitted 17 March, 2021; originally announced March 2021.

Comments: Appeared as an AAAS Policy Alert On-line

arXiv:2010.02985 [pdf, other]

Likelihood Models for Forensic Genealogy

Authors: William H. Press, John Hawkins

Abstract: In the idealized Morgan model of crossover, we study the probability distributions of shared DNA (identical by descent) between individuals having a wide range of relationships (not just lineal descendants), especially cases for which previous work produces inaccurate results. Using Monte Carlo simulation, we show that a particular, complicated functional form with just one continuous fitted parameter accurately approximates the distributions in all cases tried. Analysis of that functional form shows that it is close to a normal distribution, not in shared fraction f, but in the square-root of f. We describe a multivariate normal model in this variable for use as a practical framework for several general tasks in forensic genealogy that are currently done by less-accurate and less well-founded methods.

Submitted 6 October, 2020; originally announced October 2020.

Comments: 26 pages, 5 figures, 2 tables

arXiv:1812.01112 [pdf, other]

An Indel-Resistant Error-Correcting Code for DNA-Based Information Storage

Authors: William H. Press, John A. Hawkins

Abstract: Synthetic DNA can in principle be used for the archival storage of arbitrary data. Because errors are introduced during DNA synthesis, storage, and sequencing, an error-correcting code (ECC) is necessary for error-free recovery of the data. Previous work has utilized ECCs that can correct substitution errors, but not insertion or deletion errors (indels), instead relying on sequencing depth and multiple alignment to detect and correct indels -- in effect an inefficient multiple-repetition code. This paper describes an ECC, termed "HEDGES", that corrects simultaneously for substitutions, insertions, and deletions in a single read. Varying code rates allow for correction of up to ~10% nucleotide errors and achieve 50% or better of the estimated Shannon limit.

Submitted 3 December, 2018; originally announced December 2018.

Comments: 24 pages, 8 figures, 22 references

arXiv:astro-ph/9805197 [pdf, ps, other]

Density-Dependent Luminosity Functions for Galaxies in the Las Campanas Redshift Survey

Authors: Benjamin C. Bromley, William H. Press, Huan Lin, Robert P. Kirshner

Abstract: Galaxies in the Las Campanas Redshift Survey are classified according to their spectra, and the resulting spectral types are analyzed to determine if local environment affects their properties. We find that the luminosity function of early-type objects varies as a function of local density. Our results suggest that early-type galaxies (presumably ellipticals and S0's) are, on average, fainter when they are located in high-density regions of the Universe. The same effect may operate for some, but not all, late-type galaxies. We discuss the implications of this result for theories of galaxy formation and evolution.

Submitted 14 May, 1998; originally announced May 1998.

Comments: 7 pages (LaTeX), 2 figures (Postscript). Submitted to the Astrophysical Journal

arXiv:astro-ph/9803193 [pdf, ps, other]

doi 10.1086/306322

Magnification Ratio of the Fluctuating Light in Gravitational Lens 0957+561

Authors: William H. Press, George B. Rybicki

Abstract: Radio observations establish the B/A magnification ratio of gravitational lens 0957+561 at about 0.75. Yet, for more than 15 years, the optical magnification ratio has been between 0.9 and 1.12. The accepted explanation is microlensing of the optical source. However, this explanation is mildly discordant with (i) the relative constancy of the optical ratio, and (ii) recent data indicating possible non-achromaticity in the ratio. To study these issues, we develop a statistical formalism for separately measuring, in a unified manner, the magnification ratio of the fluctuating and constant parts of the light curve. Applying the formalism to the published data of Kundić et al. (1997), we find that the magnification ratios of the fluctuating parts in both the g and r colors agree with the magnification ratio of the constant part in g-band, and tend to disagree with the r-band value. One explanation could be about 0.1 mag of consistently unsubtracted r light from the lensing galaxy G1, which seems unlikely. Another could be that 0957+561 is approaching a caustic in the microlensing pattern.

Submitted 17 March, 1998; originally announced March 1998.

Comments: 12 pages including 1 PostScript figure

Report number: CfA-TA98-144

arXiv:astro-ph/9711227 [pdf, ps, other]

doi 10.1086/306144

Spectral Classification and Luminosity Function of Galaxies in the Las Campanas Redshift Survey

Authors: Benjamin C. Bromley, William H. Press, Huan Lin, Robert P. Kirshner

Abstract: We construct a spectral classification scheme for the galaxies of the Las Campanas Redshift Survey (LCRS) based on a principal component analysis of the measured galaxy spectra. We interpret the physical significance of our six spectral types and conclude that they are sensitive to morphological type and the amount of active star formation. In this first analysis of the LCRS to include spectral classification, we estimate the general luminosity function, expressed as a weighted sum of the type-specific luminosity functions. In the R-band magnitude range of -23 < M <= -16.5, this function exhibits a broad shoulder centered near M = -20, and an increasing faint-end slope which formally converges on an alpha value of about -1.8 in the faint limit. The Schechter parameterization does not provide a good representation in this case, a fact which may partly explain the reported discrepancy between the luminosity functions of the LCRS and other redshift catalogs such as the Century Survey (Geller et al. 1997). The discrepancy may also arise from environmental effects such as the density-morphology relationship for which we see strong evidence in the LCRS galaxies. However, the Schechter parameterization is more effective for the luminosity functions of the individual spectral types. The data show a significant, progressive steepening of the faint-end slope, from alpha = +0.5 for early-type objects, to alpha = -1.8 for the extreme late-type galaxies. The extreme late-type population has a sufficiently high space density that its contribution to the general luminosity function is expected to dominate fainter than M = -16. We conclude that an evaluation of type-dependence is essential to any assessment of the general luminosity function.

Submitted 14 May, 1998; v1 submitted 19 November, 1997; originally announced November 1997.

Comments: 21 pages (LaTeX), 7 figures (Postscript). To appear in the Astrophysical Journal. The discussion of environmental dependence of luminosity functions has been shortened; the material from the earlier version now appears in a separate manuscript (astro-ph/9805197)

arXiv:astro-ph/9604126 [pdf, ps]

Understanding Data Better with Bayesian and Global Statistical Methods

Authors: William H. Press

Abstract: To understand their data better, astronomers need to use statistical tools that are more advanced than traditional ``freshman lab'' statistics. As an illustration, the problem of combining apparently incompatible measurements of a quantity is presented from both the traditional, and a more sophisticated Bayesian, perspective. Explicit formulas are given for both treatments. Results are shown for the value of the Hubble Constant, and a 95% confidence interval of 66 < H0 < 82 (km/s/Mpc) is obtained.

Submitted 22 April, 1996; originally announced April 1996.

Comments: 14 pages PostScript includes embedded figures. Paper given at Unsolved Problems in Astrophysics conference, Princeton, April 1995

Report number: CfA-TAD-96-114

Journal ref: in "Unsolved Problems in Astrophysics", Proceedings of Conference in Honor of John Bahcall, J.P. Ostriker, ed. (Princeton: Princeton University Press, 1996 [in press])

arXiv:astro-ph/9412017 [pdf, ps, other]

doi 10.1086/187897

Determining the Motion of the Local Group Using SN Ia Light Curve Shapes

Authors: Adam G. Riess, William H. Press, Robert P. Kirshner

Abstract: We have measured our Galaxy's motion relative to distant galaxies in which type Ia supernovae (SN Ia) have been observed. The effective recession velocity of this sample is 7000 km s$^{-1}$, which approaches the depth of the survey of brightest cluster galaxies by Lauer and Postman (1994). We use the Light Curve Shape (LCS) method for deriving distances to SN Ia, providing relative distance estimates to individual supernovae with a precision of $\sim$ 5\% (Riess, Press, \& Kirshner 1995). Analyzing the distribution on the sky of velocity residuals from a pure Hubble flow for 13 recent SN Ia drawn primarily from the Calán/Tololo survey (Hamuy 1993a, 1994, 1995a, 1995b, Maza et al. 1994), we find the best solution for the motion of the Local Group in this frame is 600 $\pm 350$ km s$^{-1}$ in the direction b=260$°$ {\it l}=+54$°$ with a 1 $σ$ error ellipse that measures 90$°$ by 25$°$. This solution is consistent with the rest frame of the cosmic microwave background (CMB) as determined by the Cosmic Background Explorer (COBE) measurement of the dipole temperature anisotropy (Smoot et al. 1992). It is inconsistent with the velocity observed by Lauer and Postman.

Submitted 6 December, 1994; originally announced December 1994.

Comments: 12 pp + 2 figs, posted as uuencoded tar.Z file which will uudecode-uncompress-untar to LaTeX file (uses aas macros) and two postscript figure files. Files (including postscript version of text) also available by anon ftp to cfata4.harvard.edu as pub/localgroup*

Journal ref: Astrophys.J. 445 (1995) L91

arXiv:astro-ph/9410054 [pdf, ps, other]

doi 10.1086/187704

Using SN Ia Light Curve Shapes to Measure The Hubble Constant

Authors: Adam G. Riess, William H. Press, Robert P. Kirshner

Abstract: We present an empirical method which uses visual band light curve shapes (LCS) to estimate the luminosity of type Ia supernovae (SN Ia). This method is first applied to a ``training set'' of 8 SN Ia light curves with independent distance estimates to derive the correlation between the LCS and the luminosity. We employ a linear estimation algorithm of the type developed by Rybicki and Press (1992). The result is similar to that obtained by Hamuy et al. (1995a) with the advantage that LCS produces quantitative error estimates for the distance. We then examine the light curves for 13 SN Ia to determine the LCS distances of these supernovae. The Hubble diagram constructed using these LCS distances has a remarkably small dispersion of $σ_V$ = 0.21 mag. We use the light curve of SN 1972E and the Cepheid distance to NGC 5253 to derive $67 \pm 7 $ km s$^{-1}$ Mpc$^{-1}$ for the Hubble constant.

Submitted 18 October, 1994; originally announced October 1994.

Comments: 10 pages + 2 figures, Postscript file includes text and figures, Submitted to Ap.J. (Letters), Harvard-Smithsonian Center for Astrophysics Preprint 4999

Journal ref: Astrophys.J. 438 (1995) L17-20

arXiv:comp-gas/9405004 [pdf, ps, other]

doi 10.1103/PhysRevLett.74.1060

A Class of Fast Methods for Processing Irregularly Sampled or Otherwise Inhomogeneous One-Dimensional Data

Authors: George B. Rybicki, William H. Press

Abstract: With the ansatz that a data set's correlation matrix has a certain parametrized form (one general enough, however, to allow the arbitrary specification of a slowly-varying decorrelation distance and population variance) the general machinery of Wiener or optimal filtering can be reduced from $O(n^3)$ to $O(n)$ operations, where $n$ is the size of the data set. The implied vast increases in computational speed can allow many common sub-optimal or heuristic data analysis methods to be replaced by fast, relatively sophisticated, statistical algorithms. Three examples are given: data rectification, high- or low- pass filtering, and linear least squares fitting to a model with unaligned data points.

Submitted 20 May, 1994; originally announced May 1994.

Comments: 7 pages, LaTeX with REVTeX 3.0 macros, no figures. A toolkit with implementations (in Fortran 90) of the algorithms is available by anonymous ftp to cfata4.harvard.edu

arXiv:astro-ph/9303017 [pdf, ps, other]

doi 10.1086/173419

Properties of High-Redshift Lyman Alpha Clouds II. Statistical Properties of the Clouds

Authors: William H. Press, George B. Rybicki

Abstract: Curve of growth analysis, applied to the Lyman series absorption ratios deduced in our previous paper, yields a measurement of the logarithmic slope of distribution of \Lya\ clouds in column density $N$. The observed exponential distribution of the clouds' equivalent widths $W$ is then shown to require a broad distribution of velocity parameters $b$, extending up to 80 km s$^{-1}$. We show how the exponential itself emerges in a natural way. An absolute normalization for the differential distribution of cloud numbers in $z$, $N$, and $b$ is obtained. By detailed analysis of absorption fluctuations along the line of sight we are able to put upper limits on the cloud-cloud correlation function $ξ$ on several megaparsec length scales. We show that observed $b$ values, if thermal, are incompatible, in several different ways, with the hypothesis of equilibrium heating and ionization by a background UV flux. Either a significant component of $b$ is due to bulk motion (which we argue against on several grounds), or else the clouds are out of equilibrium, and hotter than is implied by their ionization state, a situation which could be indicative of recent adiabatic collapse.

Submitted 29 March, 1993; originally announced March 1993.

Comments: 32 pages, LaTeX using aastex30 macros, submitted to Ap.J

Journal ref: Astrophys.J. 418 (1993) 585

arXiv:astro-ph/9303016 [pdf, ps, other]

doi 10.1086/173057

Properties of High-Redshift Lyman Alpha Clouds I. Statistical Analysis of the SSG Quasars

Authors: William H. Press, George B. Rybicki, Donald P. Schneider

Abstract: Techniques for the statistical analysis of the \Lya\ forest in high redshift quasars are developed, and applied to the low resolution (25 Å) spectra of 29 of the 33 quasars in the Schneider-Schmidt-Gunn (SSG) sample. We find that the mean absorption increases with $z$ approximately as a power law $(1+z)^{γ+1}$ with $γ= 2.46\pm 0.37$. The mean ratio of \Lya\ to Lyman $β$ absorption in the clouds is $0.476\pm 0.054$. We also detect, and obtain ratios, for Lyman $β$, $γ$, and possibly $ε$. We are also able to quantify the fluctuations of the absorption around its mean, and find that these are comparable to, or perhaps slightly larger than, that expected from an uncorrelated distribution of clouds. The techniques in this paper, which include the use of bootstrap resampling of the quasar sample to obtain estimated errors and error covariances, and a mathematical treatment of absorption from a (possibly non-uniform) stochastic distribution of lines, should be applicable to future, more extensive, data sets.

Submitted 29 March, 1993; originally announced March 1993.

Comments: 29 pages, LaTeX using aastex30 macros, forthcoming as CfA preprint

Journal ref: Astrophys.J. 414 (1993) 64-81

Showing 1–15 of 15 results for author: Press, W H