Machine Learning for automatic identification of new minor species
Authors:
Frederic Schmidt,
Guillaume Cruz Mermy,
Justin Erwin,
Severine Robert,
Lori Neary,
Ian R. Thomas,
Frank Daerden,
Bojan Ristic,
Manish R. Patel,
Giancarlo Bellucci,
Jose-Juan Lopez-Moreno,
Ann-Carine Vandaele
Abstract:
One of the main difficulties in analyzing modern spectroscopic datasets is the large amount of data. For example, in atmospheric transmittance spectroscopy, the solar occultation (SO) channel of the NOMAD instrument onboard the ESA ExoMars2016 satellite, the Trace Gas Orbiter (TGO), produced $\sim$10 million spectra in 20000 acquisition sequences between the beginning of the mission in April 2018 and 15 January 2020. Other datasets are even larger, with $\sim$billions of spectra for OMEGA onboard Mars Express or CRISM onboard Mars Reconnaissance Orbiter. Usually, new lines are discovered after a long iterative process of model fitting and manual residual analysis. Here we propose a new method, based on unsupervised machine learning, to automatically detect new minor species. Although precise quantification is out of scope, this tool can also be used to quickly summarize the dataset by giving a few endmembers ("sources") and their abundances. We approximate the non-linear dataset by a linear mixture of abundances and source spectra (endmembers). We use unsupervised source separation in the form of non-negative matrix factorization to estimate those quantities. Several methods are tested on synthetic and simulated data. Our approach is dedicated to detecting minor species spectra rather than precisely quantifying them. On a synthetic example, this approach is able to detect chemical compounds present in the form of 100 hidden spectra out of $10^4$, at 1.5 times the noise level. Results on simulated NOMAD-SO spectra targeting CH$_{4}$ show that detection limits are in the range of 100-500 ppt in favorable conditions. Results on real Martian data from NOMAD-SO show that CO$_{2}$ and H$_{2}$O are present, as expected, but CH$_{4}$ is absent. Nevertheless, we confirm a set of new unexpected lines in the database, attributed by the ACS instrument team to the CO$_{2}$ magnetic dipole.
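The linear mixture model described in the abstract can be illustrated with a minimal sketch. The abstract does not say which non-negative matrix factorization variant the authors use, so the code below swaps in the classic Lee-Seung multiplicative updates for the Frobenius norm as an assumption. The function names (`nmf_multiplicative`, `relative_error`), the toy line shapes, and all numerical settings are hypothetical and chosen only for illustration; the data are synthetic, not NOMAD-SO spectra.

```python
import numpy as np


def nmf_multiplicative(X, n_sources, n_iter=300, eps=1e-9, seed=0):
    """Factor non-negative X (spectra x channels) as A @ S with A, S >= 0.

    A holds the abundances (spectra x sources), S the source spectra
    (sources x channels). Uses Lee-Seung multiplicative updates, which
    is an assumed variant, not necessarily the one used in the paper.
    """
    rng = np.random.default_rng(seed)
    n_spectra, n_channels = X.shape
    A = rng.random((n_spectra, n_sources)) + eps
    S = rng.random((n_sources, n_channels)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        S *= (A.T @ X) / (A.T @ A @ S + eps)
        A *= (X @ S.T) / (A @ S @ S.T + eps)
    return A, S


def relative_error(X, A, S):
    """Relative Frobenius reconstruction error of the factorization."""
    return np.linalg.norm(X - A @ S) / np.linalg.norm(X)


# Hypothetical toy dataset: one major source in every spectrum and one
# minor source hidden in only 10 of 1000 spectra.
rng = np.random.default_rng(42)
channels = np.arange(120)
major = np.exp(-0.5 * ((channels - 40) / 3.0) ** 2)
minor = np.exp(-0.5 * ((channels - 85) / 2.0) ** 2)
S_true = np.vstack([major, minor])

A_true = np.zeros((1000, 2))
A_true[:, 0] = rng.uniform(0.5, 1.0, size=1000)
A_true[:10, 1] = rng.uniform(0.1, 0.3, size=10)

# Add small Gaussian noise and clip so the data stay non-negative.
noise = 0.01 * rng.standard_normal((1000, 120))
X = np.clip(A_true @ S_true + noise, 0.0, None)

A_est, S_est = nmf_multiplicative(X, n_sources=2)
print("relative reconstruction error:", relative_error(X, A_est, S_est))
```

In a sketch like this, a candidate minor species would show up as a recovered row of `S_est` whose abundance column in `A_est` is non-negligible in only a small subset of spectra. Whether that happens on real data depends on the noise level and on the number of sources requested.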
Submitted 15 December, 2020;
originally announced December 2020.
A Big Data Analytics Framework to Predict the Risk of Opioid Use Disorder
Authors:
Md Mahmudul Hasan,
Md. Noor-E-Alam,
Mehul Rakeshkumar Patel,
Alicia Sasser Modestino,
Leon D. Sanchez,
Gary Young
Abstract:
Overdoses related to prescription opioids have reached an epidemic level in the US, creating an unprecedented national crisis. This has been exacerbated in part by the lack of tools to help physicians predict whether a patient will develop opioid use disorder. Little is known about how machine learning can be applied to a big-data platform to ensure informed, sustained and judicious prescribing of opioids, in particular for the commercially insured population. This study explores Massachusetts All Payer Claims Data, a de-identified healthcare dataset, and proposes a machine learning framework to examine how naïve users develop opioid use disorder. We apply several feature selection techniques to identify influential demographic and clinical features associated with opioid use disorder from a class-imbalanced analytic sample. We then compare the predictive power of four well-known machine learning algorithms (Logistic Regression, Random Forest, Decision Tree, and Gradient Boosting) in predicting the risk of opioid use disorder. The results show that the Random Forest model outperforms the other three algorithms in identifying these features, some of which are consistent with prior clinical findings. Moreover, alongside its higher predictive accuracy, the proposed framework is capable of extracting some risk factors that will add significant knowledge to what is already known in the extant literature. We anticipate that this study will help healthcare practitioners improve the current prescribing practice of opioids and contribute to curbing the increasing rate of opioid addiction and overdose.
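The four-way model comparison described in the abstract can be sketched on a class-imbalanced sample as follows. This is only an illustration under stated assumptions: it uses scikit-learn, a synthetic dataset from `make_classification` in place of the claims data, ROC AUC as the comparison metric, and arbitrary hyperparameters. None of these choices are taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an imbalanced sample: about 5% positives.
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=5,
    weights=[0.95, 0.05],
    random_state=0,
)

# A stratified split keeps the class ratio similar in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hyperparameters are illustrative assumptions, not the study's settings.
models = {
    "Logistic Regression": LogisticRegression(
        max_iter=1000, class_weight="balanced"
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Decision Tree": DecisionTreeClassifier(
        max_depth=5, class_weight="balanced", random_state=0
    ),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Score with the predicted probability of the positive class.
    positive_scores = model.predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, positive_scores)

for name, value in sorted(auc.items(), key=lambda item: -item[1]):
    print(f"{name}: ROC AUC = {value:.3f}")
```

ROC AUC is used here because plain accuracy can look high on imbalanced data even for a model that never predicts the minority class. The ranking on synthetic data says nothing about the ranking reported in the study.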
Submitted 30 May, 2020; v1 submitted 6 April, 2019;
originally announced April 2019.