-
Conditional Randomization Tests for Behavioral and Neural Time Series
Authors:
Kenneth D. Harris,
Kevin J. Miller
Abstract:
Randomization tests allow simple and unambiguous tests of null hypotheses, by comparing observed data to a null ensemble in which experimentally-controlled variables are randomly resampled. In behavioral and neuroscience experiments, however, the stimuli presented often depend on the subject's previous actions, so simple randomization tests are not possible. We describe how conditional randomization can be used to perform exact hypothesis tests in this situation, and illustrate it with two examples. We contrast conditional randomization with a related approach of tangent randomization, in which stimuli are resampled based only on events occurring in the past, which is not valid for all choices of test statistic. We discuss how to design experiments that allow conditional randomization tests to be used.
Submitted 6 November, 2023;
originally announced November 2023.
-
Spectral gap-based deterministic tensor completion
Authors:
Kameron Decker Harris,
Oscar López,
Angus Read,
Yizhe Zhu
Abstract:
Tensor completion is a core machine learning algorithm used in recommender systems and other domains with missing data. While the matrix case is well-understood, theoretical results for tensor problems are limited, particularly when the sampling patterns are deterministic. Here we bound the generalization error of the solutions of two tensor completion methods, Poisson loss and atomic norm minimization, providing tighter bounds in terms of the target tensor rank. If the ground-truth tensor is order $t$ with CP-rank $r$, the dependence on $r$ is improved from $r^{2(t-1)(t^2-t-1)}$ in arXiv:1910.10692 to $r^{2(t-1)(3t-5)}$. The error in our bounds is deterministically controlled by the spectral gap of the sampling sparsity pattern. We also prove several new properties for the atomic tensor norm, reducing the rank dependence from $r^{3t-3}$ in arXiv:1711.04965 to $r^{3t-5}$ under random sampling schemes. A limitation is that atomic norm minimization, while theoretically interesting, leads to inefficient algorithms. However, numerical experiments illustrate the dependence of the reconstruction error on the spectral gap for the practical max-quasinorm, ridge penalty, and Poisson loss minimization algorithms. This view through the spectral gap is a promising window for further study of tensor algorithms.
Submitted 9 June, 2023;
originally announced June 2023.
-
The martingale Z-test
Authors:
Kenneth D. Harris
Abstract:
We describe a statistical test for association of two autocorrelated time series, one of which is generated randomly at each time point from a known but possibly history-dependent distribution. The null hypothesis is that at each time point, the two variables are independent, conditional on history until that time point. We define a test statistic that is a martingale under the null hypothesis and describe an asymptotic test for it based on the martingale central limit theorem. If we reject this null hypothesis, we may infer an immediate causal effect of the randomized variable on the measured variable.
Submitted 4 July, 2022;
originally announced July 2022.
-
BrainViewer: interacting with spatial connectome data at the mesoscale
Authors:
Seth Daetwiler,
Angus Read,
Jessica Stillwell,
Kameron Decker Harris
Abstract:
Scientists construct connectomes, comprehensive descriptions of neuronal connections across a brain, in order to better understand and model brain function. Interactive visualizations of these pathways would enable exploratory analysis of such information flows. Current tools can be used to see individual tracing experiments which are used to build mesoscale connectomes of the mouse brain, but not the brain network itself. We present a connectivity visualization program called BrainViewer, which we use with a high-resolution mouse cortical connectome. This has the ability to display connectomes from other datasets when they become available and compare spatial connectivity across multiple brain structures. Our tool, optimized for speed and portability, presents a GUI visualization in 2-D top view and flatmap projections, allowing users to select and explore the connections of every source voxel to everywhere else in the cortex. Anatomists and other neuroscientists will find BrainViewer useful for building understanding beyond the known topography of cortical connectivity.
Submitted 4 May, 2022;
originally announced May 2022.
-
Tests for partial correlation between repeatedly observed nonstationary nonlinear timeseries
Authors:
Kenneth D. Harris,
Alex E. Yuan
Abstract:
We describe two families of statistical tests to detect partial correlation in vectorial timeseries. The tests measure whether an observed timeseries Y can be predicted from a second series X, even after accounting for a third series Z which may correlate with X. They do not make any assumptions on the nature of these timeseries, such as stationarity or linearity, but they do require that multiple statistically independent recordings of the 3 series are available. Intuitively, the tests work by asking if the series Y recorded on one experiment can be better predicted from X recorded on the same experiment than on a different experiment, after accounting for the prediction from Z recorded on both experiments.
Submitted 24 April, 2024; v1 submitted 13 June, 2021;
originally announced June 2021.
-
A Shift Test for Independence in Generic Time Series
Authors:
Kenneth D. Harris
Abstract:
We describe a family of conservative statistical tests for independence of two autocorrelated time series. The series may take values in any sets, and one of them must be stationary. A user-specified function quantifying the association of a segment of the two series is compared to an ensemble obtained by time-shifting the stationary series -N to N steps. If the series are independent, the unshifted value is in the top m shifted values with probability at most m/(N+1). For large N, the probability approaches m/(2N+1). A conservative test rejects independence at significance α if the unshifted value is in the top α(N+1), and has half the power of an approximate test valid in the large N limit. We illustrate this framework with a test for correlation of autocorrelated categorical time series.
Submitted 12 December, 2020;
originally announced December 2020.
-
On 1/n neural representation and robustness
Authors:
Josue Nassar,
Piotr Aleksander Sokol,
SueYeon Chung,
Kenneth D. Harris,
Il Memming Park
Abstract:
Understanding the nature of representation in neural networks is a goal shared by neuroscience and machine learning. It is therefore exciting that both fields converge not only on shared questions but also on similar approaches. A pressing question in these areas is understanding how the structure of the representation used by neural networks affects both their generalization, and robustness to perturbations. In this work, we investigate the latter by juxtaposing experimental results regarding the covariance spectrum of neural representations in the mouse V1 (Stringer et al) with artificial neural networks. We use adversarial robustness to probe Stringer et al's theory regarding the causal role of a 1/n covariance spectrum. We empirically investigate the benefits such a neural code confers in neural networks, and illuminate its role in multi-layer architectures. Our results show that imposing the experimentally observed structure on artificial neural networks makes them more robust to adversarial attacks. Moreover, our findings complement the existing theory relating wide neural networks to kernel methods, by showing the role of intermediate representations.
Submitted 8 December, 2020;
originally announced December 2020.
-
Deterministic tensor completion with hypergraph expanders
Authors:
Kameron Decker Harris,
Yizhe Zhu
Abstract:
We provide a novel analysis of low-rank tensor completion based on hypergraph expanders. As a proxy for rank, we minimize the max-quasinorm of the tensor, which generalizes the max-norm for matrices. Our analysis is deterministic and shows that the number of samples required to approximately recover an order-$t$ tensor with at most $n$ entries per dimension is linear in $n$, under the assumption that the rank and order of the tensor are $O(1)$. As steps in our proof, we find a new expander mixing lemma for a $t$-partite, $t$-uniform regular hypergraph model, and prove several new properties about tensor max-quasinorm. To the best of our knowledge, this is the first deterministic analysis of tensor completion. We develop a practical algorithm that solves a relaxed version of the max-quasinorm minimization problem, and we demonstrate its efficacy with numerical experiments.
Submitted 29 July, 2021; v1 submitted 23 October, 2019;
originally announced October 2019.
-
Additive function approximation in the brain
Authors:
Kameron Decker Harris
Abstract:
Many biological learning systems such as the mushroom body, hippocampus, and cerebellum are built from sparsely connected networks of neurons. For a new understanding of such networks, we study the function spaces induced by sparse random features and characterize what functions may and may not be learned. A network with $d$ inputs per neuron is found to be equivalent to an additive model of order $d$, whereas with a degree distribution the network combines additive terms of different orders. We identify three specific advantages of sparsity: additive function approximation is a powerful inductive bias that limits the curse of dimensionality, sparse networks are stable to outlier noise in the inputs, and sparse random features are scalable. Thus, even simple brain architectures can be powerful function approximators. Finally, we hope that this work helps popularize kernel theories of networks among computational neuroscientists.
Submitted 13 September, 2019; v1 submitted 5 September, 2019;
originally announced September 2019.
-
Centering Data Improves the Dynamic Mode Decomposition
Authors:
Seth M. Hirsh,
Kameron Decker Harris,
J. Nathan Kutz,
Bingni W. Brunton
Abstract:
Dynamic mode decomposition (DMD) is a data-driven method that models high-dimensional time series as a sum of spatiotemporal modes, where the temporal modes are constrained by linear dynamics. For nonlinear dynamical systems exhibiting strongly coherent structures, DMD can be a useful approximation to extract dominant, interpretable modes. In many domains with large spatiotemporal data---including fluid dynamics, video processing, and finance---the dynamics of interest are often perturbations about fixed points or equilibria, which motivates the application of DMD to centered (i.e. mean-subtracted) data. In this work, we show that DMD with centered data is equivalent to incorporating an affine term in the dynamic model and is not equivalent to computing a discrete Fourier transform. Importantly, DMD with centering can always be used to compute eigenvalue spectra of the dynamics. However, in many cases DMD without centering cannot model the corresponding dynamics, most notably if the dynamics have full effective rank. Additionally, we generalize the notion of centering to extracting arbitrary, but known, fixed frequencies from the data. We corroborate these theoretical results numerically on three nonlinear examples: the Lorenz system, a surveillance video, and brain recordings. Since centering the data is simple and computationally efficient, we recommend it as a preprocessing step before DMD; furthermore, we suggest that it can be readily used in conjunction with many other popular implementations of the DMD algorithm.
Submitted 13 June, 2019;
originally announced June 2019.
-
Time-varying Autoregression with Low Rank Tensors
Authors:
Kameron Decker Harris,
Aleksandr Aravkin,
Rajesh Rao,
Bingni Wen Brunton
Abstract:
We present a windowed technique to learn parsimonious time-varying autoregressive models from multivariate timeseries. This unsupervised method uncovers interpretable spatiotemporal structure in data via non-smooth and non-convex optimization. In each time window, we assume the data follow a linear model parameterized by a system matrix, and we model this stack of potentially different system matrices as a low rank tensor. Because of its structure, the model is scalable to high-dimensional data and can easily incorporate priors such as smoothness over time. We find the components of the tensor using alternating minimization and prove that any stationary point of this algorithm is a local minimum. We demonstrate on a synthetic example that our method identifies the true rank of a switching linear system in the presence of noise. We illustrate our model's utility and superior scalability over extant methods when applied to several synthetic and real-world examples: two types of time-varying linear systems, worm behavior, sea surface temperature, and monkey brain datasets.
Submitted 19 May, 2020; v1 submitted 20 May, 2019;
originally announced May 2019.
-
Characterizing the invariances of learning algorithms using category theory
Authors:
Kenneth D. Harris
Abstract:
Many learning algorithms have invariances: when their training data is transformed in certain ways, the function they learn transforms in a predictable manner. Here we formalize this notion using concepts from the mathematical field of category theory. The invariances that a supervised learning algorithm possesses are formalized by categories of predictor and target spaces, whose morphisms represent the algorithm's invariances, and an index category whose morphisms represent permutations of the training examples. An invariant learning algorithm is a natural transformation between two functors from the product of these categories to the category of sets, representing training datasets and learned functions respectively. We illustrate the framework by characterizing and contrasting the invariances of linear regression and ridge regression.
Submitted 6 May, 2019;
originally announced May 2019.
-
Greedy low-rank algorithm for spatial connectome regression
Authors:
Patrick Kürschner,
Sergey Dolgov,
Kameron Decker Harris,
Peter Benner
Abstract:
Recovering brain connectivity from tract tracing data is an important computational problem in the neurosciences. Mesoscopic connectome reconstruction was previously formulated as a structured matrix regression problem (Harris et al., 2016), but existing techniques do not scale to the whole-brain setting. The corresponding matrix equation is challenging to solve due to large scale, ill-conditioning, and a general form that lacks a convergent splitting. We propose a greedy low-rank algorithm for the connectome reconstruction problem in very high dimensions. The algorithm approximates the solution by a sequence of rank-one updates which exploit the sparse and positive definite problem structure. This algorithm was described previously (Kressner and Sirković, 2015) but never implemented for this connectome problem, leading to a number of challenges. We have had to design judicious stopping criteria and employ efficient solvers for the three main sub-problems of the algorithm, including an efficient GPU implementation that alleviates the main bottleneck for large datasets. The performance of the method is evaluated on three examples: an artificial "toy" dataset and two whole-cortex instances using data from the Allen Mouse Brain Connectivity Atlas. We find that the method is significantly faster than previous methods and that moderate ranks offer good approximation. This speedup allows for the estimation of increasingly large-scale connectomes across taxa as these data become available from tracing experiments. The data and code are available online.
Submitted 1 November, 2019; v1 submitted 16 August, 2018;
originally announced August 2018.
-
Spectral gap in random bipartite biregular graphs and applications
Authors:
Gerandy Brito,
Ioana Dumitriu,
Kameron Decker Harris
Abstract:
We prove an analogue of Alon's spectral gap conjecture for random bipartite, biregular graphs. We use the Ihara-Bass formula to connect the non-backtracking spectrum to that of the adjacency matrix, employing the moment method to show there exists a spectral gap for the non-backtracking matrix. A byproduct of our main theorem is that random rectangular zero-one matrices with fixed row and column sums are full-rank with high probability. Finally, we illustrate applications to community detection, coding theory, and deterministic matrix completion.
Submitted 2 June, 2021; v1 submitted 20 April, 2018;
originally announced April 2018.
-
Different roles for inhibition in the rhythm-generating respiratory network
Authors:
Kameron Decker Harris,
Tatiana Dashevskiy,
Joshua Mendoza,
Alfredo J. Garcia III,
Jan-Marino Ramirez,
Eric Shea-Brown
Abstract:
Unraveling the interplay of excitation and inhibition within rhythm-generating networks remains a fundamental issue in neuroscience. We use a biophysical model to investigate the different roles of local and long-range inhibition in the respiratory network, a key component of which is the pre-Bötzinger complex inspiratory microcircuit. Increasing inhibition within the microcircuit results in a limited number of out-of-phase neurons before rhythmicity and synchrony degenerate. Thus, unstructured local inhibition is destabilizing and cannot support the generation of more than one rhythm. A two-phase rhythm requires restructuring the network into two microcircuits coupled by long-range inhibition in the manner of a half-center. In this context, inhibition leads to greater stability of the two out-of-phase rhythms. We support our computational results with in vitro recordings from mouse pre-Bötzinger complex. Partial excitation block leads to increased rhythmic variability, but this recovers following blockade of inhibition. Our results support the idea that local inhibition in the pre-Bötzinger complex is present to allow for descending control of synchrony or robustness to adverse conditions like hypoxia. We conclude that the balance of inhibition and excitation determines the stability of rhythmogenesis, but with opposite roles within and between areas. These different inhibitory roles may apply to a variety of rhythmic behaviors that emerge in widespread pattern generating circuits of the nervous system.
Submitted 12 June, 2017; v1 submitted 13 October, 2016;
originally announced October 2016.
-
High resolution neural connectivity from incomplete tracing data using nonnegative spline regression
Authors:
Kameron Decker Harris,
Stefan Mihalas,
Eric Shea-Brown
Abstract:
Whole-brain neural connectivity data are now available from viral tracing experiments, which reveal the connections between a source injection site and elsewhere in the brain. These hold the promise of revealing spatial patterns of connectivity throughout the mammalian brain. To achieve this goal, we seek to fit a weighted, nonnegative adjacency matrix among 100 $μ$m brain "voxels" using viral tracer data. Despite a multi-year experimental effort, injections provide incomplete coverage, and the number of voxels in our data is orders of magnitude larger than the number of injections, making the problem severely underdetermined. Furthermore, projection data are missing within the injection site because local connections there are not separable from the injection signal.
We use a novel machine-learning algorithm to meet these challenges and develop a spatially explicit, voxel-scale connectivity map of the mouse visual system. Our method combines three features: a matrix completion loss for missing data, a smoothing spline penalty to regularize the problem, and (optionally) a low rank factorization. We demonstrate the consistency of our estimator using synthetic data and then apply it to newly available Allen Mouse Brain Connectivity Atlas data for the visual system. Our algorithm is significantly more predictive than current state of the art approaches which assume regions to be homogeneous. We demonstrate the efficacy of a low rank version on visual cortex data and discuss the possibility of extending this to a whole-brain connectivity matrix at the voxel scale.
Submitted 26 October, 2016; v1 submitted 24 May, 2016;
originally announced May 2016.
-
Reply to Garcia et al.: Common mistakes in measuring frequency dependent word characteristics
Authors:
P. S. Dodds,
E. M. Clark,
S. Desu,
M. R. Frank,
A. J. Reagan,
J. R. Williams,
L. Mitchell,
K. D. Harris,
I. M. Kloumann,
J. P. Bagrow,
K. Megerdoomian,
M. T. McMahon,
B. F. Tivnan,
C. M. Danforth
Abstract:
We demonstrate that the concerns expressed by Garcia et al. are misplaced, due to (1) a misreading of our findings in [1]; (2) a widespread failure to examine and present words in support of asserted summary quantities based on word usage frequencies; and (3) a range of misconceptions about word usage frequency, word rank, and expert-constructed word lists. In particular, we show that the English component of our study compares well statistically with two related surveys, that no survey design influence is apparent, and that estimates of measurement error do not explain the positivity biases reported in our work and that of others. We further demonstrate that for the frequency dependence of positivity---of which we explored the nuances in great detail in [1]---Garcia et al. did not perform a reanalysis of our data---they instead carried out an analysis of a different, statistically improper data set and introduced a nonlinearity before performing linear regression.
Submitted 28 May, 2015; v1 submitted 25 May, 2015;
originally announced May 2015.
-
Human language reveals a universal positivity bias
Authors:
Peter Sheridan Dodds,
Eric M. Clark,
Suma Desu,
Morgan R. Frank,
Andrew J. Reagan,
Jake Ryland Williams,
Lewis Mitchell,
Kameron Decker Harris,
Isabel M. Kloumann,
James P. Bagrow,
Karine Megerdoomian,
Matthew T. McMahon,
Brian F. Tivnan,
Christopher M. Danforth
Abstract:
Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (1) the words of natural human language possess a universal positivity bias; (2) the estimated emotional content of words is consistent between languages under translation; and (3) this positivity bias is strongly independent of frequency of word usage. Alongside these general regularities, we describe inter-language variations in the emotional spectrum of languages which allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.
Submitted 15 June, 2014;
originally announced June 2014.
-
High-dimensional cluster analysis with the Masked EM Algorithm
Authors:
Shabnam N. Kadir,
Dan F. M. Goodman,
Kenneth D. Harris
Abstract:
Cluster analysis faces two problems in high dimensions: first, the `curse of dimensionality' that can lead to overfitting and poor generalization performance; and second, the sheer time taken for conventional algorithms to process large amounts of high-dimensional data. In many applications, only a small subset of features provides information about the cluster membership of any one data point; however, this informative feature subset may not be the same for all data points. Here we introduce a `Masked EM' algorithm for fitting mixture of Gaussians models in such cases. We show that the algorithm performs close to optimally on simulated Gaussian data, and in an application of `spike sorting' of high channel-count neuronal recordings.
Submitted 11 September, 2013;
originally announced September 2013.
-
How (not) to assess the importance of correlations for the matching of spontaneous and evoked activity: a response
Authors:
Michael Okun,
Pierre Yger,
Kenneth D. Harris
Abstract:
A response to a comment of Fiser et al.
Submitted 13 March, 2013;
originally announced March 2013.
-
Dynamical influence processes on networks: General theory and applications to social contagion
Authors:
Kameron Decker Harris,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
We study binary state dynamics on a network where each node acts in response to the average state of its neighborhood. Allowing varying amounts of stochasticity in both the network and node responses, we find different outcomes in random and deterministic versions of the model. In the limit of a large, dense network, however, we show that these dynamics coincide. We construct a general mean field theory for random networks and show this predicts that the dynamics on the network are a smoothed version of the average response function dynamics. Thus, the behavior of the system can range from steady state to chaotic depending on the response functions, network connectivity, and update synchronicity. As a specific example, we model the competing tendencies of imitation and non-conformity by incorporating an off-threshold into standard threshold models of social contagion. In this way we attempt to capture important aspects of fashions and societal trends. We compare our theory to extensive simulations of this "limited imitation contagion" model on Poisson random graphs, finding agreement between the mean-field theory and stochastic simulations.
Submitted 7 July, 2014; v1 submitted 6 March, 2013;
originally announced March 2013.
-
The Geography of Happiness: Connecting Twitter sentiment and expression, demographics, and objective characteristics of place
Authors:
Lewis Mitchell,
Kameron Decker Harris,
Morgan R. Frank,
Peter Sheridan Dodds,
Christopher M. Danforth
Abstract:
We conduct a detailed investigation of correlations between real-time expressions of individuals made across the United States and a wide range of emotional, geographic, demographic, and health characteristics. We do so by combining (1) a massive, geo-tagged data set comprising over 80 million words generated over the course of several recent years on the social network service Twitter and (2) annually-surveyed characteristics of all 50 states and close to 400 urban populations. Among many results, we generate taxonomies of states and cities based on their similarities in word use; estimate the happiness levels of states and cities; correlate highly-resolved demographic characteristics with happiness levels; and connect word choice and message length with urban characteristics such as education levels and obesity rates. Our results show how social media may potentially be used to estimate real-time levels and changes in population-level measures such as obesity rates.
Submitted 18 May, 2013; v1 submitted 13 February, 2013;
originally announced February 2013.
-
On-off Threshold Models of Social Contagion
Authors:
Kameron Decker Harris
Abstract:
We study binary state contagion dynamics on a social network where nodes act in response to the average state of their neighborhood. We model the competing tendencies of imitation and non-conformity by incorporating an off-threshold into standard threshold models of behavior. In this way, we attempt to capture important aspects of fashions and general societal trends. Allowing varying amounts of stochasticity in both the network and node responses, we find different outcomes in the random and deterministic versions of the model. In the limit of a large, dense network, however, we show that these dynamics coincide. The dynamical behavior of the system ranges from steady state to chaotic depending on network connectivity and update synchronicity. We construct a mean field theory for general random networks. In the undirected case, the mean field theory predicts that the dynamics on the network are a smoothed version of the average node response dynamics. We compare our theory to extensive simulations on Poisson random graphs with node responses that average to the chaotic tent map.
Submitted 10 September, 2012;
originally announced September 2012.
-
Limited Imitation Contagion on Random Networks: Chaos, Universality, and Unpredictability
Authors:
Peter Sheridan Dodds,
Kameron Decker Harris,
Christopher M. Danforth
Abstract:
We study a family of binary state, socially-inspired contagion models which incorporate imitation limited by an aversion to complete conformity. We uncover rich behavior in our models whether operating with either probabilistic or deterministic individual response functions on both dynamic and fixed random networks. In particular, we find significant variation in the limiting behavior of a population's infected fraction, ranging from steady-state to chaotic. We show that period doubling arises as we increase the average node degree, and that the universality class of this well known route to chaos depends on the interaction structure of random networks rather than the microscopic behavior of individual nodes. We find that increasing the fixedness of the system tends to stabilize the infected fraction, yet disjoint, multiple equilibria are possible depending solely on the choice of the initially infected node.
Submitted 7 March, 2013; v1 submitted 1 August, 2012;
originally announced August 2012.
-
Twitter reciprocal reply networks exhibit assortativity with respect to happiness
Authors:
Catherine A. Bliss,
Isabel M. Kloumann,
Kameron Decker Harris,
Christopher M. Danforth,
Peter Sheridan Dodds
Abstract:
The advent of social media has provided an extraordinary, if imperfect, 'big data' window into the form and evolution of social networks. Based on nearly 40 million message pairs posted to Twitter between September 2008 and February 2009, we construct and examine the revealed social network structure and dynamics over the time scales of days, weeks, and months. At the level of user behavior, we employ our recently developed hedonometric analysis methods to investigate patterns of sentiment expression. We find users' average happiness scores to be positively and significantly correlated with those of users one, two, and three links away. We strengthen our analysis by proposing and using a null model to test the effect of network topology on the assortativity of happiness. We also find evidence that more well connected users write happier status updates, with a transition occurring around Dunbar's number. More generally, our work provides evidence of a social sub-network structure within Twitter and raises several methodological points of interest with regard to social network reconstructions.
Submitted 11 May, 2012; v1 submitted 5 December, 2011;
originally announced December 2011.
-
Predicting flow reversals in chaotic natural convection using data assimilation
Authors:
Kameron Decker Harris,
El Hassan Ridouane,
Darren L. Hitt,
Christopher M. Danforth
Abstract:
A simplified model of natural convection, similar to the Lorenz (1963) system, is compared to computational fluid dynamics simulations in order to test data assimilation methods and better understand the dynamics of convection. The thermosyphon is represented by a long time flow simulation, which serves as a reference "truth". Forecasts are then made using the Lorenz-like model and synchronized to noisy and limited observations of the truth using data assimilation. The resulting analysis is observed to infer dynamics absent from the model when using short assimilation windows.
Furthermore, chaotic flow reversal occurrence and residency times in each rotational state are forecast using analysis data. Flow reversals have been successfully forecast in the related Lorenz system, as part of a perfect model experiment, but never in the presence of significant model error or unobserved variables. Finally, we provide new details concerning the fluid dynamical processes present in the thermosyphon during these flow reversals.
Submitted 20 April, 2012; v1 submitted 29 August, 2011;
originally announced August 2011.
-
Direct, physically motivated derivation of triggering probabilities for spreading processes on generalized random networks
Authors:
Kameron Decker Harris,
Joshua L. Payne,
Peter Sheridan Dodds
Abstract:
We derive a general expression for the probability of global spreading starting from a single infected seed for contagion processes acting on generalized, correlated random networks. We employ a simple probabilistic argument that encodes the spreading mechanism in an intuitive, physical fashion. We use our approach to directly and systematically obtain triggering probabilities for contagion processes acting on a collection of random network families including bipartite random networks. We find the contagion condition, the location of the phase transition into an endemic state, from an expansion about the disease-free state.
Submitted 30 June, 2015; v1 submitted 26 August, 2011;
originally announced August 2011.
-
Positivity of the English language
Authors:
Isabel M. Kloumann,
Christopher M. Danforth,
Kameron Decker Harris,
Catherine A. Bliss,
Peter Sheridan Dodds
Abstract:
Over the last million years, human language has emerged and evolved as a fundamental instrument of social communication and semiotic representation. People use language in part to convey emotional information, leading to the central and contingent questions: (1) What is the emotional spectrum of natural language? and (2) Are natural languages neutrally, positively, or negatively biased? Here, we report that the human-perceived positivity of over 10,000 of the most frequently used English words exhibits a clear positive bias. More deeply, we characterize and quantify distributions of word positivity for four large and distinct corpora, demonstrating that their form is broadly invariant with respect to frequency of word use.
Submitted 12 January, 2012; v1 submitted 25 August, 2011;
originally announced August 2011.
-
Empirical correction of a toy climate model
Authors:
Nicholas A. Allgaier,
Kameron D. Harris,
Christopher M. Danforth
Abstract:
Improving the accuracy of forecast models for physical systems such as the atmosphere is a crucial ongoing effort. Errors in state estimation for these often highly nonlinear systems have been the primary focus of recent research, but as that error has been successfully diminished, the role of model error in forecast uncertainty has duly increased. The present study is an investigation of a particular empirical correction procedure that is of special interest because it considers the model a "black box", and therefore can be applied widely with little modification. The procedure involves the comparison of short model forecasts with a reference "truth" system during a training period in order to calculate systematic (1) state-independent model bias and (2) state-dependent error patterns. An estimate of the likelihood of the latter error component is computed from the current state at every timestep of model integration. The effectiveness of this technique is explored in two experiments: (1) a perfect model scenario, in which models have the same structure and dynamics as the true system, differing only in parameter values; and (2) a more realistic scenario, in which models are structurally different (in dynamics, dimension, and parameterization) from the target system. In each case, the results suggest that the correction procedure is more effective for reducing error and prolonging forecast usefulness than parameter tuning. However, the cost of this increase in average forecast accuracy is the creation of substantial qualitative differences between the dynamics of the corrected model and the true system. A method to mitigate the structural damage caused by empirical correction and further increase forecast accuracy is presented.
Submitted 13 July, 2011;
originally announced July 2011.
-
Exact solutions for social and biological contagion models on mixed directed and undirected, degree-correlated random networks
Authors:
Joshua L. Payne,
Kameron Decker Harris,
Peter Sheridan Dodds
Abstract:
We derive analytic expressions for the possibility, probability, and expected size of global spreading events starting from a single infected seed for a broad collection of contagion processes acting on random networks with both directed and undirected edges and arbitrary degree-degree correlations. Our work extends previous theoretical developments for the undirected case, and we provide numerical support for our findings by investigating an example class of networks for which we are able to obtain closed-form expressions.
Submitted 14 June, 2011; v1 submitted 28 February, 2011;
originally announced March 2011.
-
Direct, physically-motivated derivation of the contagion condition for spreading processes on generalized random networks
Authors:
Peter Sheridan Dodds,
Kameron Decker Harris,
Joshua L. Payne
Abstract:
For a broad range of single-seed contagion processes acting on generalized random networks, we derive a unifying analytic expression for the possibility of global spreading events in a straightforward, physically intuitive fashion. Our reasoning lays bare a direct mechanical understanding of an archetypal spreading phenomenon that is not evident in circuitous extant mathematical approaches.
Submitted 16 May, 2011; v1 submitted 28 January, 2011;
originally announced January 2011.
-
Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter
Authors:
Peter Sheridan Dodds,
Kameron Decker Harris,
Isabel M. Kloumann,
Catherine A. Bliss,
Christopher M. Danforth
Abstract:
Individual happiness is a fundamental societal metric. Normally measured through self-report, happiness has often been indirectly characterized and overshadowed by more readily quantifiable economic indicators such as gross domestic product. Here, we examine expressions made on the online, global microblog and social networking service Twitter, uncovering and explaining temporal variations in happiness and information levels over timescales ranging from hours to years. Our data set comprises over 46 billion words contained in nearly 4.6 billion expressions posted over a 33 month span by over 63 million unique users. In measuring happiness, we use a real-time, remote-sensing, non-invasive, text-based approach---a kind of hedonometer. In building our metric, made available with this paper, we conducted a survey to obtain happiness evaluations of over 10,000 individual words, representing a tenfold size improvement over similar existing word sets. Rather than being ad hoc, our word list is chosen solely by frequency of usage and we show how a highly robust metric can be constructed and defended.
Submitted 8 December, 2011; v1 submitted 26 January, 2011;
originally announced January 2011.