-
Global Mapping of Gene/Protein Interactions in PubMed Abstracts: A Framework and an Experiment with P53 Interactions
Authors:
Xin Li,
Hsinchun Chen,
Zan Huang,
Hua Su,
Jesse D. Martinez
Abstract:
Gene/protein interactions provide critical information for a thorough understanding of cellular processes. Recently, considerable interest and effort have been focused on the construction and analysis of genome-wide gene networks. The large body of biomedical literature is an important source of gene/protein interaction information. Recent advances in text mining tools have made it possible to automatically extract such documented interactions from free-text literature. In this paper, we propose a comprehensive framework for constructing and analyzing large-scale gene functional networks based on the gene/protein interactions extracted from biomedical literature repositories using text mining tools. Our proposed framework consists of analyses of the network topology, network topology-gene function relationship, and temporal network evolution to distill valuable information embedded in the gene functional interactions in literature. We demonstrate the application of the proposed framework using a testbed of P53-related PubMed abstracts, which shows that literature-based P53 networks exhibit small-world and scale-free properties. We also found that high-degree genes in the literature-based networks have a high probability of appearing in the manually curated database and genes in the same pathway tend to form local clusters in our literature-based networks. Temporal analysis showed that genes interacting with many other genes tend to be involved in a large number of newly discovered interactions.
Submitted 21 April, 2022;
originally announced April 2022.
-
Meta-analysis of commercial-scale trials as a means to improve decision-making processes in the poultry industry: a phytogenic feed additive case study
Authors:
Diego A. Martinez,
Carol L. Ponce-de-Leon,
Carlos Vilchez
Abstract:
Background and Objective: In the current study, we sought to determine the value of a meta-analysis to improve decision-making processes related to nutrition in the poultry industry. To this end, nine commercial size experiments were conducted to test the effect of a phytogenic feed additive and three approaches were applied to the data. Materials and Methods: In all experiments, 1-day-old male Cobb 500 chicks were used and fed corn-soybean meal diets. Two dietary treatments were tested: T1, control diet and T2, control diet + feed additive at a 0.05% inclusion rate. The experimental units were broiler houses (7 experiments), floor pens (1 experiment) and cages (1 experiment). The response variables were final body weight, feed intake, feed conversion ratio, mortality and production efficiency. Analyses of variance of data from each and all the experiments were performed using SAS under completely randomized non-blocked or blocked designs, respectively. The meta-analyses were performed in R programming language. Results: No statistically significant effects were found in the evaluated variables in any of the independent experiments (p>0.12), nor following the application of a block design (p>0.08). The meta-analyses showed no statistically significant global effects in terms of final body weight (p>0.19), feed intake (p>0.23), mortality (p>0.09), or European Production Efficiency Factor (p>0.08); however, a positive global effect was found with respect to feed conversion ratio (p<0.046). Conclusion: This meta-analysis demonstrated that the phytogenic feed additive improved the efficiency of birds to convert feed to body weight (35 g less feed per 1 kg of body weight obtained). Thus, the use of meta-analyses in commercial-scale poultry trials can increase statistical power and as a result, help to detect statistical differences if they exist.
Submitted 5 January, 2022;
originally announced January 2022.
-
Radiologist-level Performance by Using Deep Learning for Segmentation of Breast Cancers on MRI Scans
Authors:
Lukas Hirsch,
Yu Huang,
Shaojun Luo,
Carolina Rossi Saccarelli,
Roberto Lo Gullo,
Isaac Daimiel Naranjo,
Almir G. V. Bitencourt,
Natsuko Onishi,
Eun Sook Ko,
Doris Leithner,
Daly Avendano,
Sarah Eskreis-Winkler,
Mary Hughes,
Danny F. Martinez,
Katja Pinker,
Krishna Juluru,
Amin E. El-Rowmeim,
Pierre Elnajjar,
Elizabeth A. Morris,
Hernan A. Makse,
Lucas C Parra,
Elizabeth J. Sutton
Abstract:
Purpose: To develop a deep network architecture that would achieve fully automated radiologist-level segmentation of cancers at breast MRI.
Materials and Methods: In this retrospective study, 38229 examinations (composed of 64063 individual breast scans from 14475 patients) were performed in female patients (age range, 12-94 years; mean age, 52 years +/- 10 [standard deviation]) who presented between 2002 and 2014 at a single clinical site. A total of 2555 breast cancers were selected that had been segmented on two-dimensional (2D) images by radiologists, as well as 60108 benign breasts that served as examples of noncancerous tissue; all these were used for model training. For testing, an additional 250 breast cancers were segmented independently on 2D images by four radiologists. Authors selected among several three-dimensional (3D) deep convolutional neural network architectures, input modalities, and harmonization methods. The outcome measure was the Dice score for 2D segmentation, which was compared between the network and radiologists by using the Wilcoxon signed rank test and the two one-sided test procedure.
Results: The highest-performing network on the training set was a 3D U-Net with dynamic contrast-enhanced MRI as input and with intensity normalized for each examination. In the test set, the median Dice score of this network was 0.77 (interquartile range, 0.26). The performance of the network was equivalent to that of the radiologists (two one-sided test procedures with radiologist performance of 0.69-0.84 as equivalence bounds, P <= .001 for both; n = 250).
Conclusion: When trained on a sufficiently large dataset, the developed 3D U-Net performed as well as fellowship-trained radiologists in detailed 2D segmentation of breast cancers at routine clinical MRI.
Submitted 12 April, 2022; v1 submitted 21 September, 2020;
originally announced September 2020.
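The abstract above reports the Dice score as its outcome measure for 2D segmentation. As a rough illustration only, and not the authors' evaluation code, the score for two binary masks can be sketched as follows. The function name `dice_score`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are all assumptions made for this example.

```python
def dice_score(pred, truth):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary 2D masks.

    Masks are nested lists of 0/1 values (a hypothetical format chosen
    for this sketch). Two empty masks are treated as a perfect match.
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels marked as lesion in both masks.
    overlap = sum(p and t for p, t in zip(flat_pred, flat_truth))
    total = sum(flat_pred) + sum(flat_truth)
    return 1.0 if total == 0 else 2.0 * overlap / total


if __name__ == "__main__":
    network_mask = [[1, 1, 0], [0, 0, 0]]
    radiologist_mask = [[1, 0, 0], [1, 0, 0]]
    print(dice_score(network_mask, radiologist_mask))
```

A score of 1 means the two masks coincide exactly and 0 means they share no pixels, which is why the abstract can compare the network against each radiologist on a common scale.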
-
TODS: An Automated Time Series Outlier Detection System
Authors:
Kwei-Herng Lai,
Daochen Zha,
Guanchu Wang,
Junjie Xu,
Yue Zhao,
Devesh Kumar,
Yile Chen,
Purav Zumkhawaka,
Minyang Wan,
Diego Martinez,
Xia Hu
Abstract:
We present TODS, an automated Time Series Outlier Detection System for research and industrial applications. TODS is a highly modular system that supports easy pipeline construction. The basic building block of TODS is the primitive, which is an implementation of a function with hyperparameters. TODS currently supports 70 primitives, including data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module. Users can freely construct a pipeline using these primitives and perform end-to-end outlier detection with the constructed pipeline. TODS provides a Graphical User Interface (GUI), where users can flexibly design a pipeline with drag-and-drop. Moreover, a data-driven searcher is provided to automatically discover the most suitable pipelines given a dataset. TODS is released under the Apache 2.0 license at https://github.com/datamllab/tods.
Submitted 7 January, 2021; v1 submitted 18 September, 2020;
originally announced September 2020.
-
Simultaneously Evolving Deep Reinforcement Learning Models using Multifactorial Optimization
Authors:
Aritz D. Martinez,
Eneko Osaba,
Javier Del Ser,
Francisco Herrera
Abstract:
In recent years, Multifactorial Optimization (MFO) has gained notable momentum in the research community. MFO is known for its inherent capability to efficiently address multiple optimization tasks at the same time, while transferring information among such tasks to improve their convergence speed. On the other hand, the quantum leap made by Deep Q Learning (DQL) in the Machine Learning field has allowed facing Reinforcement Learning (RL) problems of unprecedented complexity. Unfortunately, complex DQL models usually find it difficult to converge to optimal policies due to the lack of exploration or sparse rewards. In order to overcome these drawbacks, pre-trained models are widely harnessed via Transfer Learning, extrapolating knowledge acquired in a source task to the target task. Besides, meta-heuristic optimization has been shown to reduce the lack of exploration of DQL models. This work proposes an MFO framework capable of simultaneously evolving several DQL models towards solving interrelated RL tasks. Specifically, our proposed framework blends together the benefits of meta-heuristic optimization, Transfer Learning and DQL to automate the process of knowledge transfer and policy learning of distributed RL agents. Thorough experimentation is presented and discussed so as to assess the performance of the framework, its comparison to the traditional methodology for Transfer Learning in terms of convergence, speed and policy quality, and the intertask relationships found and exploited over the search process.
Submitted 23 March, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Grapevine: A Wine Prediction Algorithm Using Multi-dimensional Clustering Methods
Authors:
Richard Diehl Martinez,
Geoffrey Angus,
Rooz Mahdavian
Abstract:
We present a method for a wine recommendation system that employs multidimensional clustering and unsupervised learning methods. Our algorithm first performs clustering on a large corpus of wine reviews. It then uses the resulting wine clusters as an approximation of the most common flavor palates, recommending a user a wine by optimizing over a price-quality ratio within clusters that they demonstrated a preference for.
Submitted 29 June, 2018;
originally announced July 2018.
-
How to Assess the Impact of Quality and Patient Safety Interventions with Routinely Collected Longitudinal Data
Authors:
Diego A. Martinez,
Mehdi Jalalpour,
David T. Efron,
Scott R. Levin
Abstract:
Measuring the effect of patient safety improvement efforts is needed to determine their value but is difficult due to the inherent complexities of hospital operations. In this paper, we show by case study how interrupted time series design can be used to isolate and measure the impact of interventions while accounting for confounders often present in complex health delivery systems. We searched for time-stamped data from electronic medical records and operating room information systems associated with perioperative patient flow in a large, urban, academic hospital in Baltimore, Maryland. We limited the search to adult cases performed between January 2015 and March 2017. We used segmented regression and Box-Jenkins methods to measure the effect of perioperative throughput improvement efforts and account for the loss of high volume surgeons, surgical volume, and occupancy. We identified a significant decline of operating room exit delays of about 50%, achieved in 6 months and sustained over 14 months. By longitudinal assessment of intervention effects, rather than cross-sectional comparison, our measurement tool estimated and provided inferences of change-points over time while taking into account the magnitude of other latent systems factors.
Submitted 26 June, 2018;
originally announced June 2018.
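The abstract above names segmented regression as one of its tools for interrupted time series analysis. The sketch below shows only the core idea under strong simplifying assumptions: it fits one ordinary least squares line before a known change point and one after it, then reports the level and slope changes. It does not model autocorrelation (the Box-Jenkins part) or the confounders the authors adjust for, and the names `fit_line` and `segmented_fit` and the sample series are hypothetical.

```python
def fit_line(ts, ys):
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def segmented_fit(ys, change_point):
    """Fit separate lines before and after `change_point` (an index into ys).

    Each segment needs at least two observations. The level change is the
    gap between the two fitted lines evaluated at the change point.
    """
    ts = list(range(len(ys)))
    pre_b0, pre_b1 = fit_line(ts[:change_point], ys[:change_point])
    post_b0, post_b1 = fit_line(ts[change_point:], ys[change_point:])
    level_change = (post_b0 + post_b1 * change_point) - (pre_b0 + pre_b1 * change_point)
    return {
        "pre_slope": pre_b1,
        "level_change": level_change,
        "slope_change": post_b1 - pre_b1,
    }


if __name__ == "__main__":
    # Made-up monthly delay values: rising trend, then a drop and a plateau.
    delays = [10, 11, 12, 13, 7, 7, 7, 7]
    print(segmented_fit(delays, change_point=4))
```

Fitting the two segments separately is equivalent to a single regression with level-change and slope-change terms, which keeps the example free of matrix algebra.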
-
The Impact of an AirBnb Host's Listing Description 'Sentiment' and Length On Occupancy Rates
Authors:
Richard Diehl Martinez,
Anthony Carrington,
Tiffany Kuo,
Lena Tarhuni,
Nour Adel Zaki Abdel-Motaal
Abstract:
There has been significant literature regarding the way product review sentiment affects brand loyalty. Intrigued by how natural language influences consumer choice, we were motivated to examine whether an AirBnb host's occupancy rate (how often their listing is booked out of the days they indicated their listing was available) can be determined by the perceived sentiment and length of their description summary. Our main goal, more generally, was to determine which features, including (but not limited to) sentiment and description length, most influence a host's occupancy rate. We define sentiment score through a natural language algorithm process, based on the AFINN dictionary. Using AirBnB data on New York City, our hypothesis is that higher sentiment scores (more positive descriptions) and longer summary length lead to higher occupancy rates. Our results show that while longer summary length may positively influence occupancy rates, more positive summary descriptions have no effect. Instead, we find that other factors such as number of reviews and number of amenities, in addition to summary length, are better indicators of occupancy rate.
Submitted 25 November, 2017;
originally announced November 2017.
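The abstract above derives a sentiment score from the AFINN dictionary and pairs it with description length. A minimal sketch of that kind of dictionary-based scoring follows. The word list `TOY_LEXICON` is a small hypothetical stand-in with made-up valences, not the actual AFINN list, and the tokenization rule and function name `sentiment_and_length` are likewise assumptions rather than the authors' pipeline.

```python
import re

# Hypothetical stand-in lexicon: word -> integer valence.
# The real AFINN list is much larger; these values are illustrative only.
TOY_LEXICON = {"great": 3, "lovely": 3, "clean": 2, "noisy": -1, "dirty": -2}


def sentiment_and_length(text, lexicon=TOY_LEXICON):
    """Return (summed word valence, token count) for a listing summary."""
    # Lowercase and keep runs of letters or apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0) for token in tokens)
    return score, len(tokens)


if __name__ == "__main__":
    summary = "Lovely, clean room. Slightly noisy street!"
    print(sentiment_and_length(summary))
```

The two returned values correspond to the two features the abstract tests against occupancy rate; words missing from the lexicon simply contribute zero.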
-
The Nu Class of Low-Degree-Truncated, Rational, Generalized Functions. Ib. Integrals of Matern-correlation functions for all odd-half-integer class parameters
Authors:
Selden Crary,
Richard Diehl Martinez,
Michael Saunders
Abstract:
This paper is an extension of Parts I and Ia of a series about Nu-class generalized functions. We provide hand-generated algebraic expressions for integrals of single Matern-covariance functions, as well as for products of two Matern-covariance functions, for all odd-half-integer class parameters. These are useful both for IMSPE-optimal design software and for testing universality of Nu-class generalized-function properties, across covariance classes.
Submitted 22 May, 2019; v1 submitted 3 July, 2017;
originally announced July 2017.
-
Comparison of statistical sampling methods with ScannerBit, the GAMBIT scanning module
Authors:
The GAMBIT Scanner Workgroup:
Gregory D. Martinez,
James McKay,
Ben Farmer,
Pat Scott,
Elinore Roebber,
Antje Putze,
Jan Conrad
Abstract:
We introduce ScannerBit, the statistics and sampling module of the public, open-source global fitting framework GAMBIT. ScannerBit provides a standardised interface to different sampling algorithms, enabling the use and comparison of multiple computational methods for inferring profile likelihoods, Bayesian posteriors, and other statistical quantities. The current version offers random, grid, raster, nested sampling, differential evolution, Markov Chain Monte Carlo (MCMC) and ensemble Monte Carlo samplers. We also announce the release of a new standalone differential evolution sampler, Diver, and describe its design, usage and interface to ScannerBit. We subject Diver and three other samplers (the nested sampler MultiNest, the MCMC GreAT, and the native ScannerBit implementation of the ensemble Monte Carlo algorithm T-Walk) to a battery of statistical tests. For this we use a realistic physical likelihood function, based on the scalar singlet model of dark matter. We examine the performance of each sampler as a function of its adjustable settings, and the dimensionality of the sampling problem. We evaluate performance on four metrics: optimality of the best fit found, completeness in exploring the best-fit region, number of likelihood evaluations, and total runtime. For Bayesian posterior estimation at high resolution, T-Walk provides the most accurate and timely mapping of the full parameter space. For profile likelihood analysis in less than about ten dimensions, we find that Diver and MultiNest score similarly in terms of best fit and speed, outperforming GreAT and T-Walk; in ten or more dimensions, Diver substantially outperforms the other three samplers on all metrics.
Submitted 15 October, 2017; v1 submitted 22 May, 2017;
originally announced May 2017.
-
Probing the Geometry of Data with Diffusion Fréchet Functions
Authors:
Diego Hernán Díaz Martínez,
Christine H. Lee,
Peter T. Kim,
Washington Mio
Abstract:
Many complex ecosystems, such as those formed by multiple microbial taxa, involve intricate interactions amongst various sub-communities. The most basic relationships are frequently modeled as co-occurrence networks in which the nodes represent the various players in the community and the weighted edges encode levels of interaction. In this setting, the composition of a community may be viewed as a probability distribution on the nodes of the network. This paper develops methods for modeling the organization of such data, as well as their Euclidean counterparts, across spatial scales. Using the notion of diffusion distance, we introduce diffusion Frechet functions and diffusion Frechet vectors associated with probability distributions on Euclidean space and the vertex set of a weighted network, respectively. We prove that these functional statistics are stable with respect to the Wasserstein distance between probability measures, thus yielding robust descriptors of their shapes. We apply the methodology to investigate bacterial communities in the human gut, seeking to characterize divergence from intestinal homeostasis in patients with Clostridium difficile infection (CDI) and the effects of fecal microbiota transplantation, a treatment used in CDI patients that has proven to be significantly more effective than traditional treatment with antibiotics. The proposed method proves useful in deriving a biomarker that might help elucidate the mechanisms that drive these processes.
Submitted 7 March, 2017; v1 submitted 16 May, 2016;
originally announced May 2016.
-
The Shape of Data and Probability Measures
Authors:
Diego Hernán Díaz Martínez,
Facundo Mémoli,
Washington Mio
Abstract:
We introduce the notion of multiscale covariance tensor fields (CTF) associated with Euclidean random variables as a gateway to the shape of their distributions. Multiscale CTFs quantify variation of the data about every point in the data landscape at all spatial scales, unlike the usual covariance tensor that only quantifies global variation about the mean. Empirical forms of localized covariance previously have been used in data analysis and visualization, but we develop a framework for the systematic treatment of theoretical questions and computational models based on localized covariance. We prove strong stability theorems with respect to the Wasserstein distance between probability measures, obtain consistency results, as well as estimates for the rate of convergence of empirical CTFs. These results ensure that CTFs are robust to sampling, noise and outliers. We provide numerous illustrations of how CTFs let us extract shape from data and also apply CTFs to manifold clustering, the problem of categorizing data points according to their noisy membership in a collection of possibly intersecting, smooth submanifolds of Euclidean space. We prove that the proposed manifold clustering method is stable and carry out several experiments to validate the method.
Submitted 27 February, 2017; v1 submitted 15 September, 2015;
originally announced September 2015.