-
Unraveling implicit human behavioral effects on dynamic characteristics of Covid-19 daily infection rates in Taiwan
Authors:
Ting-Li Chen,
Elizabeth P. Chou,
Min-Yi Chen,
Hsieh Fushing
Abstract:
We study Covid-19 spreading dynamics underlying 84 curves of daily Covid-19 infection rates pertaining to 84 districts belonging to the largest seven cities in Taiwan during her pristine surge period. Our computational developments begin with selecting and extracting 18 features from each smoothed district-specific curve. This computing step converts unstructured data into structured data, with which we then demonstrate asymmetric growth and decline dynamics among all involved curves. Specifically, based on Information Theoretical measurements of conditional entropy and mutual information, we compute major factors of order-1 and order-2 that reveal significant effects on the curves' peak value and curvature at peak, which are two essential features characterizing all the curves. Further, we investigate and demonstrate major factors determining the geographically and socio-economically induced behavioral effects by encoding each of these 84 districts with two binary characteristics: North-vs-South and Urban-vs-suburban. Furthermore, based on this data-driven knowledge at the district scale, we go on to study fine-scale behavioral effects on infectious disease spreading through similarity among 96 age-group-specific curves of daily infection rate within 12 urban districts of Taipei and 12 suburban districts of New Taipei City, which together account for almost one-quarter of the island nation's total population. We conclude that human living, traveling, and working behaviors do implicitly and profoundly affect the spreading dynamics of Covid-19 across Taiwan.
Submitted 20 November, 2022;
originally announced November 2022.
-
Learned practical guidelines for evaluating Conditional Entropy and Mutual Information in discovering major factors of response-vs-covariate dynamics
Authors:
Ting-Li Chen,
Hsieh Fushing,
Elizabeth P. Chou
Abstract:
We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs-covariate (Re-Co) dynamics that is described without any explicit functional structures. We then resolve these topics' data analysis tasks by discovering the major factors underlying such Re-Co dynamics, making use only of the data's categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon's conditional entropy (CE) and mutual information ($I[Re; Co]$) as two key Information Theoretical measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and $I[Re; Co]$ in accord with the criterion called [C1:confirmable]. Under the [C1:confirmable] criterion, we make no attempt to acquire consistent estimates of these Information Theoretical measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly carry out six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.
Submitted 6 September, 2022;
originally announced September 2022.
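To make the two measurements named in the abstract above concrete, the following is a minimal sketch of plug-in conditional entropy and mutual information evaluated on a contingency table. It is not the authors' protocol: the function names, the table layout (rows as covariate categories, columns as response categories), and the toy tables are all assumptions made for illustration, and the [C1:confirmable] criterion is not implemented here.

```python
import math

def entropy(counts):
    """Shannon entropy (in nats) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def conditional_entropy(table):
    """H[Re | Co]: row-entropies weighted by row totals.

    Assumed layout: each row is one covariate (Co) category,
    each column is one response (Re) category.
    """
    grand = sum(sum(row) for row in table)
    return sum((sum(row) / grand) * entropy(row) for row in table if sum(row) > 0)

def mutual_information(table):
    """I[Re; Co] = H[Re] - H[Re | Co], evaluated on the same table."""
    column_totals = [sum(col) for col in zip(*table)]
    return entropy(column_totals) - conditional_entropy(table)

# Hypothetical toy tables: Co carries no information vs. Co determines Re.
independent = [[10, 10], [10, 10]]
determined = [[20, 0], [0, 20]]
```

In this sketch a covariate with a large drop from H[Re] to H[Re | Co] would be a candidate major factor; how large a drop counts as confirmable is exactly what the paper's guidelines address and is left open here.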
-
Multiscale major factor selections for complex system data with structural dependency and heterogeneity
Authors:
Hsieh Fushing,
Elizabeth Chou,
Ting-Li Chen
Abstract:
Based on structured data derived from large complex systems, we computationally further develop and refine a major factor selection protocol by accommodating structural dependency and heterogeneity among many features to unravel data's information content. Two operational concepts, ``de-associating'' and its counterpart ``shadowing'', which play key roles in our protocol, are reasoned, explained, and carried out via contingency table platforms. Through its ``de-associating'' capability, this protocol manifests data's information content by identifying which covariate feature-sets do or do not provide information beyond the first identified major factors, and hence which join the collection of major factors as secondary members. Our computational developments begin with globally characterizing a complex system by the structural dependency between multiple response (Re) features and many covariate (Co) features. We first apply our major factor selection protocol to a Behavioral Risk Factor Surveillance System (BRFSS) data set to demonstrate discoveries of localities where patients with heart disease become either majorities or further reduced minorities, in sharp contrast to the data's imbalanced nature. We then study a Major League Baseball (MLB) data set consisting of 12 pitchers across 3 seasons, reveal detailed multiscale information content regarding pitching dynamics, and provide nearly perfect resolutions to the Multiclass Classification (MCC) problem and the difficult task of detecting idiosyncratic changes of any individual pitcher across multiple seasons. We conclude by postulating an intuitive conjecture that inferential topics related to large complex systems can only be efficiently resolved through discoveries of data's multiscale information content reflecting the system's authentic structural dependency and heterogeneity.
Submitted 6 September, 2022;
originally announced September 2022.
-
Extreme-K categorical samples problem
Authors:
Elizabeth Chou,
Catie McVey,
Yin-Chen Hsieh,
Sabrina Enriquez,
Fushing Hsieh
Abstract:
With histograms as its foundation, we develop Categorical Exploratory Data Analysis (CEDA) under the extreme-$K$ sample problem, and illustrate its universal applicability through four 1D categorical datasets. Given a sizable $K$, CEDA's ultimate goal amounts to discovering data's information content by carrying out two data-driven computational tasks: 1) establish a tree geometry upon $K$ populations as a platform for discovering a wide spectrum of patterns among populations; 2) evaluate each geometric pattern's reliability. In CEDA developments, each population gives rise to a row vector of category proportions. Upon the data matrix's row-axis, we discuss the pros and cons of Euclidean distance against its weighted version for building a binary clustering tree geometry. The criterion of choice rests on degrees of uniformness in column-blocks framed by this binary clustering tree. Each tree-leaf (population) is then encoded with a binary code sequence, as is each tree-based pattern. For evaluating reliability, we adopt row-wise multinomial randomness to generate an ensemble of matrix mimicries, and hence an ensemble of mimicked binary trees. The reliability of any observed pattern is its recurrence rate within the tree ensemble. A high reliability value means a deterministic pattern. Our four applications of CEDA illuminate four significant aspects of extreme-$K$ sample problems.
Submitted 29 July, 2020;
originally announced July 2020.
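The reliability idea in the abstract above (resample each row from a multinomial, rebuild the geometry, count how often a pattern recurs) can be sketched as follows. This is a deliberately simplified stand-in, not the paper's method: instead of a binary clustering tree, the "pattern" here is just which pair of populations is closest in Euclidean distance between proportion vectors. All function names, the toy count matrix, the number of mimicries, and the seed are assumptions.

```python
import math
import random

def resample_row(counts, rng):
    """Draw a multinomial mimicry of one row, keeping its total count."""
    n = sum(counts)
    draws = rng.choices(range(len(counts)), weights=counts, k=n)
    new = [0] * len(counts)
    for d in draws:
        new[d] += 1
    return new

def proportions(counts):
    """Convert a row of counts into a row vector of category proportions."""
    n = sum(counts)
    return [c / n for c in counts]

def closest_pair(matrix):
    """Simplified 'pattern': indices of the two rows with the smallest distance."""
    rows = [proportions(r) for r in matrix]
    pairs = [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))]
    return min(pairs, key=lambda p: math.dist(rows[p[0]], rows[p[1]]))

def reliability(matrix, pattern, n_mimicries=200, seed=0):
    """Recurrence rate of `pattern` across row-wise multinomial mimicries."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_mimicries):
        mimic = [resample_row(r, rng) for r in matrix]
        hits += closest_pair(mimic) == pattern
    return hits / n_mimicries

# Hypothetical counts for three populations over three categories.
observed = [[900, 50, 50], [880, 60, 60], [50, 900, 50]]
pattern = closest_pair(observed)
score = reliability(observed, pattern)
```

Replacing `closest_pair` with a function that builds a clustering tree and extracts a branch membership would bring the sketch closer to the tree-based patterns the abstract describes.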
-
Similarity Search for Efficient Active Learning and Search of Rare Concepts
Authors:
Cody Coleman,
Edward Chou,
Julian Katz-Samuels,
Sean Culatana,
Peter Bailis,
Alexander C. Berg,
Robert Nowak,
Roshan Sumbaly,
Matei Zaharia,
I. Zeki Yalniz
Abstract:
Many active learning and search approaches are intractable for large-scale industrial settings with billions of unlabeled examples. Existing approaches search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In this paper, we improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set instead of scanning over all of the unlabeled data. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a de-identified and aggregated dataset of 10 billion images provided by a large internet company. Our approach achieved mean average precision and recall similar to those of the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, thus enabling web-scale active learning.
Submitted 22 July, 2021; v1 submitted 30 June, 2020;
originally announced July 2020.
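The candidate-pool restriction described in the abstract above can be illustrated with a small sketch: only the nearest unlabeled neighbors of the currently labeled points are considered for the next round of labeling. The function name, the toy 2D points, and the brute-force distance ranking are assumptions for illustration; a brute-force ranking still touches every unlabeled point, so a system at the scale the abstract mentions would presumably rely on a similarity search index instead.

```python
import math

def k_nearest_unlabeled(points, labeled, k):
    """Return the sorted union of each labeled point's k nearest unlabeled neighbors.

    Brute force for clarity only; this stands in for a similarity search lookup.
    """
    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(points)) if i not in labeled_set]
    pool = set()
    for i in labeled:
        ranked = sorted(unlabeled, key=lambda j: math.dist(points[i], points[j]))
        pool.update(ranked[:k])
    return sorted(pool)

# Hypothetical embeddings: a cluster near the origin, one near (5, 5), one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
pool = k_nearest_unlabeled(points, labeled=[0], k=2)
```

A selection strategy (for example, picking the most uncertain candidates) would then score only the indices in `pool` rather than every unlabeled example.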
-
Categorical Exploratory Data Analysis: From Multiclass Classification and Response Manifold Analytics perspectives of baseball pitching dynamics
Authors:
Fushing Hsieh,
Elizabeth P. Chou
Abstract:
From the two coupled perspectives of Multiclass Classification (MCC) and Response Manifold Analytics (RMA), we develop Categorical Exploratory Data Analysis (CEDA) on the PITCHf/x database for the information content of Major League Baseball's (MLB) pitching dynamics. MCC and RMA information contents are represented by one collection of multiscale pattern categories from mixing geometries and one collection of global-to-local geometric localities from response-covariate manifolds, respectively. These collections shed light on the pitching dynamics and map out the uncertainty of popular machine learning approaches. In the MCC setting, a label embedding tree based on an indirect distance measure leads to the discovery of asymmetry of mixing geometries among labels' point-clouds. A selected chain of complementary covariate feature groups collectively brings out multi-order mixing geometric pattern categories. Such categories then reveal the true nature of MCC predictive inferences. In the RMA setting, multiple response features couple with multiple major covariate features to demonstrate manifolds bearing physical principles, with a lattice of natural localities. With minor features' heterogeneous effects being locally identified, such localities jointly weave their focal characteristics into system understanding and provide a platform for RMA predictive inferences. Our CEDA works for universal data types, adopts non-linear associations, and facilitates efficient feature selection and inference.
Submitted 25 June, 2020;
originally announced June 2020.
-
Dimension Reduction of High-Dimensional Datasets Based on Stepwise SVM
Authors:
Elizabeth P. Chou,
Tzu-Wei Ko
Abstract:
The current study proposes a dimension reduction method, stepwise support vector machine (SVM), to reduce the dimensions of large p small n datasets. The proposed method is compared with other dimension reduction methods, namely, the Pearson product-moment correlation coefficient (PCCs), recursive feature elimination based on random forest (RF-RFE), and principal component analysis (PCA), by using five gene expression datasets. Additionally, the prediction performance of the variables selected by our method is evaluated. The study found that stepwise SVM can effectively select the important variables and achieve good prediction performance. Moreover, the predictions of stepwise SVM for the reduced datasets were better than those for the unreduced datasets. The performance of stepwise SVM was more stable than that of PCA and RF-RFE, but the performance difference with respect to PCCs was minimal. It is necessary to reduce the dimensions of large p small n datasets. We believe that stepwise SVM can effectively eliminate noise in data and improve the prediction accuracy in any large p small n dataset.
Submitted 9 November, 2017;
originally announced November 2017.