-
Informational Rescaling of PCA Maps with Application to Genetic Distance
Authors:
Nassim Nicholas Taleb,
Pierre Zalloua,
Khaled Elbassioni,
Andreas Henschel,
Daniel E. Platt
Abstract:
We discuss the inadequacy of covariances/correlations and other measures in L2 as relative distance metrics under some conditions. We propose a computationally simple heuristic to transform a map based on standard principal component analysis (PCA) (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Rescaling Principal Component based distances using MI allows a representation of relative statistical associations when, as in genetics, it is applied on bit measurements between individuals' genomic mutual information.
This entropy rescaled PCA, while preserving order relationships (along a dimension), changes the relative distances to make them linear to information. We show the effect on the entire world population and some subsamples, which leads to significant differences with the results of current research.
Submitted 4 March, 2024; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Group invariant machine learning by fundamental domain projections
Authors:
Benjamin Aslan,
Daniel Platt,
David Sheard
Abstract:
We approach the well-studied problem of supervised group invariant and equivariant machine learning from the point of view of geometric topology. We propose a novel approach using a pre-processing step, which involves projecting the input data into a geometric space which parametrises the orbits of the symmetry group. This new data can then be the input for an arbitrary machine learning model (neural network, random forest, support-vector machine etc).
We give an algorithm to compute the geometric projection, which is efficient to implement, and we illustrate our approach on some example machine learning problems (including the well-studied problem of predicting Hodge numbers of CICY matrices), in each case finding an improvement in accuracy versus others in the literature. The geometric topology viewpoint also allows us to give a unified description of so-called intrinsic approaches to group equivariant machine learning, which encompasses many other approaches in the literature.
Submitted 4 February, 2022;
originally announced February 2022.
-
The Energy Footprint of Blockchain Consensus Mechanisms Beyond Proof-of-Work
Authors:
Moritz Platt,
Johannes Sedlmeir,
Daniel Platt,
Paolo Tasca,
Jiahua Xu,
Nikhil Vadgama,
Juan Ignacio Ibañez
Abstract:
Popular distributed ledger technology (DLT) systems using proof-of-work (PoW) for Sybil attack resistance have extreme energy requirements, drawing stern criticism from academia, businesses, and the media. DLT systems building on alternative consensus mechanisms, foremost proof-of-stake (PoS), aim to address this downside. In this paper, we take a first step towards comparing the energy requirements of such systems to understand whether they achieve this goal equally well. While multiple studies have been undertaken that analyze the energy demands of individual blockchains, little comparative work has been done. We approach this research question by formalizing a basic consumption model for PoS blockchains. Applying this model to six archetypal blockchains generates three main findings: First, we confirm the concerns around the energy footprint of PoW by showing that Bitcoin's energy consumption exceeds the energy consumption of all PoS-based systems analyzed by at least three orders of magnitude. Second, we illustrate that there are significant differences in energy consumption among the PoS-based systems analyzed, with permissionless systems having an overall larger energy footprint. Third, we point out that the type of hardware that validators use has a considerable impact on whether PoS blockchains' energy consumption is comparable with or considerably larger than that of centralized, non-DLT systems.
Submitted 4 April, 2022; v1 submitted 8 September, 2021;
originally announced September 2021.
-
Inferring COVID-19 Biological Pathways from Clinical Phenotypes via Topological Analysis
Authors:
Negin Karisani,
Daniel E. Platt,
Saugata Basu,
Laxmi Parida
Abstract:
COVID-19 has caused thousands of deaths around the world and also resulted in a large international economic disruption. Identifying the pathways associated with this illness can help medical researchers to better understand the properties of the condition. This process can be carried out by analyzing the medical records. It is crucial to develop tools and models that can aid researchers with this process in a timely manner. However, medical records are often unstructured clinical notes, and this poses significant challenges to developing automated systems. In this article, we propose a pipeline to aid practitioners in analyzing clinical notes and revealing the pathways associated with this disease. Our pipeline relies on topological properties and consists of three steps: 1) pre-processing the clinical notes to extract the salient concepts, 2) constructing a feature space of the patients to characterize the extracted concepts, and finally, 3) leveraging the topological properties to distill the available knowledge and visualize the result. Our experiments on a publicly available dataset of COVID-19 clinical notes testify that our pipeline can indeed extract meaningful pathways.
Submitted 1 May, 2022; v1 submitted 18 January, 2021;
originally announced January 2021.
-
CNN Architectures for Large-Scale Audio Classification
Authors:
Shawn Hershey,
Sourish Chaudhuri,
Daniel P. W. Ellis,
Jort F. Gemmeke,
Aren Jansen,
R. Channing Moore,
Manoj Plakal,
Devin Platt,
Rif A. Saurous,
Bryan Seybold,
Malcolm Slaney,
Ron J. Weiss,
Kevin Wilson
Abstract:
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
Submitted 10 January, 2017; v1 submitted 29 September, 2016;
originally announced September 2016.