-
When are Deep Networks really better than Decision Forests at small sample sizes, and how?
Authors:
Haoyin Xu,
Kaleab A. Kinfu,
Will LeVine,
Sambit Panda,
Jayanta Dey,
Michael Ainsworth,
Yu-Chung Peng,
Madi Kusmanov,
Florian Engert,
Christopher M. White,
Joshua T. Vogelstein,
Carey E. Priebe
Abstract:
Deep networks and decision forests (such as random forests and gradient boosted trees) are the leading machine learning methods for structured and tabular data, respectively. Many papers have empirically compared large numbers of classifiers on one or two different domains (e.g., on 100 different tabular data settings). However, a careful conceptual and empirical comparison of these two strategies using the most contemporary best practices has yet to be performed. Conceptually, we illustrate that both can be profitably viewed as "partition and vote" schemes. Specifically, the representation space that they both learn is a partitioning of feature space into a union of convex polytopes. For inference, each decides on the basis of votes from the activated nodes. This formulation allows for a unified basic understanding of the relationship between these methods. Empirically, we compare these two strategies on hundreds of tabular data settings, as well as several vision and auditory settings. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found forests to excel at tabular and structured data (vision and audition) with small sample sizes, whereas deep nets performed better on structured data with larger sample sizes. This suggests that further gains in both scenarios may be realized via further combining aspects of forests and networks. We will continue revising this technical report in the coming months with updated results.
Submitted 2 November, 2021; v1 submitted 31 August, 2021;
originally announced August 2021.
-
Discovery of Self-Assembling $π$-Conjugated Peptides by Active Learning-Directed Coarse-Grained Molecular Simulation
Authors:
Kirill Shmilovich,
Rachael A. Mansbach,
Hythem Sidky,
Olivia E. Dunne,
Sayak Subhra Panda,
John D. Tovar,
Andrew L. Ferguson
Abstract:
Electronically-active organic molecules have demonstrated great promise as novel soft materials for energy harvesting and transport. Self-assembled nanoaggregates formed from $π$-conjugated oligopeptides composed of an aromatic core flanked by oligopeptide wings offer emergent optoelectronic properties within a water soluble and biocompatible substrate. Nanoaggregate properties can be controlled by tuning core chemistry and peptide composition, but the sequence-structure-function relations remain poorly characterized. In this work, we employ coarse-grained molecular dynamics simulations within an active learning protocol employing deep representational learning and Bayesian optimization to efficiently identify molecules capable of assembling pseudo-1D nanoaggregates with good stacking of the electronically-active $π$-cores. We consider the DXXX-OPV3-XXXD oligopeptide family, where D is an Asp residue and OPV3 is an oligophenylene vinylene oligomer (1,4-distyrylbenzene), to identify the top performing XXX tripeptides within all 20$^3$ = 8,000 possible sequences. By direct simulation of only 2.3% of this space, we identify molecules predicted to exhibit superior assembly relative to those reported in prior work. Spectral clustering of the top candidates reveals new design rules governing assembly. This work establishes new understanding of DXXX-OPV3-XXXD assembly, identifies promising new candidates for experimental testing, and presents a computational design platform that can be generically extended to other peptide-based and peptide-like systems.
Submitted 26 January, 2020;
originally announced February 2020.
-
Fuzzy expert system for prediction of prostate cancer
Authors:
Juthika Mahanta,
Subhasis Panda
Abstract:
A fuzzy expert system (FES) for the prediction of prostate cancer (PC) is proposed in this article. Age, prostate-specific antigen (PSA), prostate volume (PV) and $\%$ Free PSA ($\%$FPSA) are fed as inputs into the FES, and prostate cancer risk (PCR) is obtained as the output. The output is calculated using knowledge-based rules in a Mamdani-type inference method. If PCR $\ge 50\%$, then the patient shall be advised to go for a biopsy test for confirmation. The efficacy of the designed FES is tested against a clinical data set. The true prediction for all the patients turns out to be $68.91\%$, whereas for positive biopsy cases alone it rises to $73.77\%$. This simple yet effective FES can be used as a supportive tool for decision making in medical diagnosis.
Submitted 1 December, 2018;
originally announced December 2018.
-
MUSIC: A Hybrid Computing Environment for Burrows-Wheeler Alignment for Massive Amount of Short Read Sequence Data
Authors:
Saurabh Gupta,
Sanjoy Chaudhury,
Binay Panda
Abstract:
High-throughput DNA sequencers are becoming indispensable in our understanding of diseases at the molecular level, in marker-assisted selection in agriculture, and in microbial genetics research. These sequencing instruments produce enormous amounts of data (often terabytes of raw data in a month) that require efficient analysis, management and interpretation. The commonly used sequencing instrument today produces billions of short reads (up to 150 bases) from each run. The first step in the data analysis is alignment of these short reads to the reference genome of choice. Several open source algorithms are available for sequence alignment to the reference genome. These tools normally have a high computational overhead, both in terms of number of processors and memory. Here, we propose a hybrid-computing environment called MUSIC (Mapping USIng hybrid Computing) for one of the most popular open source sequence alignment algorithms, BWA, using accelerators that show significant improvement in speed over the serial code.
Submitted 4 February, 2014;
originally announced February 2014.
-
A Quantitative Understanding of Human Sex Chromosomal Genes
Authors:
Sk. Sarif Hassan,
Pabitra Pal Choudhury,
Antara Sengupta,
Binayak Sahu,
Rojalin Mishra,
Devendra Kumar Yadav,
Saswatee Panda,
Dharamveer Pradhan,
Shrusti Dash,
Gourav Pradhan
Abstract:
In the last few decades, the human allosomes have received intensive attention from researchers. The allosomes have now been sequenced, and about 2000 and 78 genes have been found in the human X and Y chromosomes, respectively. The hemizygosity of the human X chromosome in males exposes recessive disease alleles, and this phenomenon has prompted decades of intensive study of X-linked disorders. By contrast, the small size of the human Y chromosome and its prominent long-arm heterochromatic region suggested an absence of function beyond sex determination. The present problem is to determine whether a given sequence of nucleotides, i.e., a DNA sequence, is a human X or Y chromosomal gene, without any biological experimental support. In our view, a proper quantitative understanding of these genes is required to justify or nullify whether a given sequence is a human X or Y chromosomal gene. In this paper, some of the X and Y chromosomal genes are quantified at the genomic and proteomic levels through fractal geometric and mathematical morphometric analysis. Using the proposed quantitative model, one can easily make a probable justification or deterministic nullification of whether a given sequence of nucleotides is a probable human X or Y chromosomal gene, without any biological experiment. Of course, a further biological experiment is essential to validate it as a probable human X or Y chromosomal gene homologue. This study would enable biologists to understand these genes in a more quantitative manner, rather than only through their qualitative features.
Submitted 1 December, 2013; v1 submitted 23 July, 2012;
originally announced July 2012.