-
Average Estimates in Line Graphs Are Biased Toward Areas of Higher Variability
Authors:
Dominik Moritz,
Lace M. Padilla,
Francis Nguyen,
Steven L. Franconeri
Abstract:
We investigate variability overweighting, a previously undocumented bias in line graphs, where estimates of average value are biased toward areas of higher variability in that line. We found this effect across two preregistered experiments with 140 and 420 participants. These experiments also show that the bias is reduced when using a dot encoding of the same series. We can model the bias with the…
▽ More
We investigate variability overweighting, a previously undocumented bias in line graphs, where estimates of average value are biased toward areas of higher variability in that line. We found this effect across two preregistered experiments with 140 and 420 participants. These experiments also show that the bias is reduced when using a dot encoding of the same series. We can model the bias with the average of the data series and the average of the points drawn along the line. This bias might arise because higher variability leads to stronger weighting in the average calculation, either due to the longer line segments (even though those segments contain the same number of data values) or line segments with higher variability being otherwise more visually salient. Understanding and predicting this bias is important for visualization design guidelines, recommendation systems, and tool builders, as the bias can adversely affect estimates of averages and trends.
△ Less
Submitted 7 August, 2023;
originally announced August 2023.
-
An Unsupervised Machine Learning Approach to Assess the ZIP Code Level Impact of COVID-19 in NYC
Authors:
Fadoua Khmaissia,
Pegah Sagheb Haghighi,
Aarthe Jayaprakash,
Zhenwei Wu,
Sokratis Papadopoulos,
Yuan Lai,
Freddy T. Nguyen
Abstract:
New York City has been recognized as the world's epicenter of the novel Coronavirus pandemic. To identify the key inherent factors that are highly correlated to the Increase Rate of COVID-19 new cases in NYC, we propose an unsupervised machine learning framework. Based on the assumption that ZIP code areas with similar demographic, socioeconomic, and mobility patterns are likely to experience simi…
▽ More
New York City has been recognized as the world's epicenter of the novel Coronavirus pandemic. To identify the key inherent factors that are highly correlated to the Increase Rate of COVID-19 new cases in NYC, we propose an unsupervised machine learning framework. Based on the assumption that ZIP code areas with similar demographic, socioeconomic, and mobility patterns are likely to experience similar outbreaks, we select the most relevant features to perform a clustering that can best reflect the spread, and map them down to 9 interpretable categories. We believe that our findings can guide policy makers to promptly anticipate and prevent the spread of the virus by taking the right measures.
△ Less
Submitted 18 September, 2020; v1 submitted 10 June, 2020;
originally announced June 2020.
-
Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities
Authors:
Marinka Zitnik,
Francis Nguyen,
Bo Wang,
Jure Leskovec,
Anna Goldenberg,
Michael M. Hoffman
Abstract:
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integ…
▽ More
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in develo** such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
△ Less
Submitted 10 October, 2018; v1 submitted 30 June, 2018;
originally announced July 2018.