-
Predicting and Explaining Behavioral Data with Structured Feature Space Decomposition
Authors:
Peter G Fennell,
Zhiya Zuo,
Kristina Lerman
Abstract:
Modeling human behavioral data is challenging due to its scale, sparseness (few observations per individual), heterogeneity (differently behaving individuals), and class imbalance (few observations of the outcome of interest). An additional challenge is learning an interpretable model that not only accurately predicts outcomes, but also identifies important factors associated with a given behavior. To address these challenges, we describe a statistical approach to modeling behavioral data called the structured sum-of-squares decomposition (S3D). The algorithm, which is inspired by decision trees, selects important features that collectively explain the variation of the outcome, quantifies correlations between the features, and partitions the subspace of important features into smaller, more homogeneous blocks that correspond to similarly-behaving subgroups within the population. This partitioned subspace allows us to predict and analyze the behavior of the outcome variable both statistically and visually, giving a medium to examine the effect of various features and to create explainable predictions. We apply S3D to learn models of online activity from large-scale data collected from diverse sites, such as Stack Exchange, Khan Academy, Twitter, Duolingo, and Digg. We show that S3D creates parsimonious models that can predict outcomes in the held-out data at levels comparable to state-of-the-art approaches, but in addition, produces interpretable models that provide insights into behaviors. This is important not only for informing strategies aimed at changing behavior and designing social systems, but also for explaining predictions, a critical step towards minimizing algorithmic bias.
Submitted 5 October, 2018;
originally announced October 2018.
-
Degree Correlations Amplify the Growth of Cascades in Networks
Authors:
Xin-Zeng Wu,
Peter G. Fennell,
Allon G. Percus,
Kristina Lerman
Abstract:
Networks facilitate the spread of cascades, allowing a local perturbation to percolate via interactions between nodes and their neighbors. We investigate how network structure affects the dynamics of a spreading cascade. By accounting for the joint degree distribution of a network within a generating function framework, we can quantify how degree correlations affect both the onset of global cascades and the propensity of nodes of a specific degree class to trigger large cascades. However, not all degree correlations are equally important in a spreading process. We introduce a new measure of degree assortativity that accounts for correlations among nodes relevant to a spreading cascade. We show that the critical point defining the onset of global cascades has a monotone relationship to this new assortativity measure. In addition, we show that the choice of nodes to seed the largest cascades is strongly affected by degree correlations. Contrary to traditional wisdom, when degree assortativity is positive, low-degree nodes are more likely to generate the largest cascades. Our work suggests that it may be possible to tailor spreading processes by manipulating the higher-order structure of networks.
Submitted 14 July, 2018;
originally announced July 2018.
-
Using Simpson's Paradox to Discover Interesting Patterns in Behavioral Data
Authors:
Nazanin Alipourfard,
Peter G. Fennell,
Kristina Lerman
Abstract:
We describe a data-driven discovery method that leverages Simpson's paradox to uncover interesting patterns in behavioral data. Our method systematically disaggregates data to identify subgroups within a population whose behavior deviates significantly from the rest of the population. Given an outcome of interest and a set of covariates, the method follows three steps. First, it disaggregates data into subgroups, by conditioning on a particular covariate, so as to minimize the variation of the outcome within the subgroups. Next, it models the outcome as a linear function of another covariate, both in the subgroups and in the aggregate data. Finally, it compares trends to identify disaggregations that produce subgroups with different behaviors from the aggregate. We illustrate the method by applying it to three real-world behavioral datasets, including Q&A site Stack Exchange and online learning platforms Khan Academy and Duolingo.
Submitted 8 May, 2018;
originally announced May 2018.
-
Can you Trust the Trend: Discovering Simpson's Paradoxes in Social Data
Authors:
Nazanin Alipourfard,
Peter G. Fennell,
Kristina Lerman
Abstract:
We investigate how Simpson's paradox affects analysis of trends in social data. According to the paradox, the trends observed in data that has been aggregated over an entire population may be different from, and even opposite to, those of the underlying subgroups. Failure to take this effect into account can lead analysis to wrong conclusions. We present a statistical method to automatically identify Simpson's paradox in data by comparing statistical trends in the aggregate data to those in the disaggregated subgroups. We apply the approach to data from Stack Exchange, a popular question-answering platform, to analyze factors affecting answerer performance, specifically, the likelihood that an answer written by a user will be accepted by the asker as the best answer to his or her question. Our analysis confirms a known Simpson's paradox and identifies several new instances. These paradoxes provide novel insights into user behavior on Stack Exchange.
Submitted 13 January, 2018;
originally announced January 2018.
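The trend comparison described in the abstract above (aggregate trend versus subgroup trends) can be illustrated with a minimal sketch. This is not the authors' method or code: the function names (`slope`, `compare_trends`, `trend_reverses`), the use of an ordinary least-squares slope as the "trend", and the toy data are all assumptions made for illustration, and the subgroups are taken as given rather than discovered.

```python
from collections import defaultdict


def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def compare_trends(rows):
    """rows holds (group, x, y) triples; returns (aggregate slope, per-group slopes)."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    aggregate = slope([(x, y) for _, x, y in rows])
    per_group = {group: slope(points) for group, points in groups.items()}
    return aggregate, per_group


def trend_reverses(aggregate, per_group):
    """True when every subgroup slope has the opposite sign to the aggregate slope."""
    return all(s * aggregate < 0 for s in per_group.values())


# Hypothetical toy data: y falls with x inside each group, but group "b"
# sits at higher x and higher y, so the pooled trend points upward.
rows = [
    ("a", 1, 10), ("a", 2, 9), ("a", 3, 8),
    ("b", 6, 15), ("b", 7, 14), ("b", 8, 13),
]

aggregate, per_group = compare_trends(rows)
print(aggregate, per_group, trend_reverses(aggregate, per_group))
```

In this toy data the sign of the pooled slope differs from the sign of both subgroup slopes, which is the kind of reversal the abstract refers to. A real analysis would also need a significance test on each fitted trend, which this sketch omits.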
-
Multistate dynamical processes on networks: Analysis through degree-based approximation frameworks
Authors:
Peter G. Fennell,
James P. Gleeson
Abstract:
Multistate dynamical processes on networks, where nodes can occupy one of a multitude of discrete states, are gaining widespread use because of their ability to recreate realistic, complex behaviour that cannot be adequately captured by simpler binary-state models. In epidemiology, multistate models are employed to predict the evolution of real epidemics, while multistate models are used in the social sciences to study diverse opinions and complex phenomena such as segregation. In this paper, we introduce generalized approximation frameworks for the study and analysis of multistate dynamical processes on networks. These frameworks are degree-based, allowing for the analysis of the effect of network connectivity structures on dynamical processes. We illustrate the utility of our approach with the analysis of two specific dynamical processes from the epidemiological and physical sciences. The approximation frameworks that we develop, along with open-source numerical solvers, provide a unifying framework and a valuable suite of tools for the interdisciplinary study of multistate dynamical processes on networks.
Submitted 25 September, 2017;
originally announced September 2017.
-
The limitations of discrete-time approaches to continuous-time contagion dynamics
Authors:
Peter G. Fennell,
Sergey Melnik,
James P. Gleeson
Abstract:
Continuous-time Markov process models of contagions are widely studied, not least because of their utility in predicting the evolution of real-world contagions and in formulating control measures. It is often the case, however, that discrete-time approaches are employed to analyze such models or to simulate them numerically. In such cases, time is discretized into uniform steps and transition rates between states are replaced by transition probabilities. In this paper, we illustrate potential limitations to this approach. We show how discretizing time leads to a restriction on the values of the model parameters that can accurately be studied. We examine numerical simulation schemes employed in the literature, showing how synchronous-type updating schemes can bias discrete-time formalisms when compared against continuous-time formalisms. Event-based simulations, such as the Gillespie algorithm, are proposed as optimal simulation schemes both in terms of replicating the continuous-time process and computational speed. Finally, we show how discretizing time can affect the value of the epidemic threshold for large values of the infection rate and the recovery rate, even if the ratio between the former and the latter is small.
Submitted 22 February, 2016;
originally announced March 2016.
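One way to see why replacing transition rates with transition probabilities can restrict the usable parameter values is sketched below. The formulas are assumptions chosen for illustration, not taken from the paper: a commonly used linear approximation `rate * dt`, compared against the probability `1 - exp(-rate * dt)` that a single exponentially distributed event with that rate occurs within a step of length `dt`. The function names are hypothetical.

```python
import math


def linear_probability(rate, dt):
    """First-order approximation of the per-step transition probability."""
    return rate * dt


def exact_probability(rate, dt):
    """Probability that an exponential event with this rate fires within dt."""
    return 1.0 - math.exp(-rate * dt)


def is_valid_probability(p):
    """A per-step transition probability must lie in [0, 1]."""
    return 0.0 <= p <= 1.0


# Hypothetical parameter values, chosen only to show the two regimes.
for rate, dt in [(0.1, 0.1), (2.0, 1.0)]:
    p_lin = linear_probability(rate, dt)
    p_exact = exact_probability(rate, dt)
    print(rate, dt, p_lin, p_exact, is_valid_probability(p_lin))
```

Under these assumptions the linear approximation stays close to the exponential form when `rate * dt` is small, and stops being a valid probability once `rate * dt` exceeds 1, so a fixed step size bounds the rates that can be represented.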
-
Visualising stock flow consistent models as directed acyclic graphs
Authors:
Peter G. Fennell,
David O'Sullivan,
Antoine Godin,
Stephen Kinsella
Abstract:
We show how every stock-flow consistent model of the macroeconomy can be represented as a directed acyclic graph. The advantages of representing the model in this way include graphical clarity, causal inference, and model specification. We provide many examples implemented with a new software package.
Submitted 16 September, 2014;
originally announced September 2014.
-
Analytical approach to the dynamics of facilitated spin models on random networks
Authors:
Peter G. Fennell,
James P. Gleeson,
Davide Cellai
Abstract:
Facilitated spin models were introduced some decades ago to mimic systems characterized by a glass transition. Recent developments have shown that a class of facilitated spin models is also able to reproduce characteristic signatures of the structural relaxation properties of glass-forming liquids. While the equilibrium phase diagram of these models can be calculated analytically, the dynamics are usually investigated numerically. Here we propose a new network-based approach, called approximate master equation (AME), to the dynamics of the Fredrickson-Andersen model. The approach correctly predicts the critical temperature at which the glass transition occurs. We also find excellent agreement between the theory and the numerical simulations for the transient regime, except in close proximity to the liquid-glass transition. Finally, we analytically characterize the critical clusters of the model and show that the departures between our AME approach and the Monte Carlo simulations can be related to the large interface between frozen and unfrozen spins at temperatures close to the glass transition.
Submitted 6 November, 2014; v1 submitted 1 May, 2014;
originally announced May 2014.