-
DataDock: An Open Source Data Hub for Research
Authors:
Lexington Whalen,
Homayoun Valafar
Abstract:
Every research project necessitates data, often requiring sharing and collaborative review within a team. However, there is a dearth of good open-source data sharing and reviewing services. Existing file-sharing services generally mandate paid subscriptions for increased storage or additional members, diverting research funds from addressing the core research problem that a lab is attempting to work on. Moreover, these services often lack direct features for reviewing or commenting on data quality, a vital part of ensuring high-quality data generation. In response to these challenges, we present DataDock, a specialized file transfer service crafted specifically for researchers. DataDock operates as an application hosted on a research lab server. This design ensures that, with access to a machine and an internet connection, teams can facilitate file storage, transfer, and review without incurring extra costs. Being an open-source project, DataDock can be customized to suit the unique requirements of any research team, and is able to evolve to meet the needs of the research community. We also note that there are no limitations with respect to what data can be shared, downloaded, or commented on. As DataDock is agnostic to the file type, it can be used in any field from bioinformatics to particle physics; as long as it can be stored in a file, it can be shared. We open source the code here: https://github.com/lxaw/DataDock
Submitted 26 June, 2024; v1 submitted 14 April, 2024;
originally announced June 2024.
-
Automated Measurement of Vascular Calcification in Femoral Endarterectomy Patients Using Deep Learning
Authors:
Alireza Bagheri Rajeoni,
Breanna Pederson,
Daniel G. Clair,
Susan M. Lessner,
Homayoun Valafar
Abstract:
Atherosclerosis, a chronic inflammatory disease affecting the large arteries, presents a global health risk. Accurate analysis of diagnostic images, like computed tomographic angiograms (CTAs), is essential for staging and monitoring the progression of atherosclerosis-related conditions, including peripheral arterial disease (PAD). However, manual analysis of CTA images is time-consuming and tedious. To address this limitation, we employed a deep learning model to segment the vascular system in CTA images of PAD patients undergoing femoral endarterectomy surgery and to measure vascular calcification from the left renal artery to the patella. Utilizing proprietary CTA images of 27 patients undergoing femoral endarterectomy surgery provided by Prisma Health Midlands, we developed a Deep Neural Network (DNN) model to first segment the arterial system, starting from the descending aorta to the patella, and second, to provide a metric of arterial calcification. Our designed DNN achieved 83.4% average Dice accuracy in segmenting arteries from aorta to patella, advancing the state-of-the-art by 0.8%. Furthermore, our work is the first to present a robust statistical analysis of automated calcification measurement in the lower extremities using deep learning, attaining a Mean Absolute Percentage Error (MAPE) of 9.5% and a correlation coefficient of 0.978 between automated and manual calcification scores. These findings underscore the potential of deep learning techniques as a rapid and accurate tool for medical professionals to assess calcification in the abdominal aorta and its branches above the patella. The developed DNN model and related documentation for this project are available on GitHub at https://github.com/pip-alireza/DeepCalcScoring.
Submitted 27 November, 2023;
originally announced November 2023.
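The abstract above reports three evaluation figures: a Dice score for segmentation, and a MAPE and a correlation coefficient between automated and manual calcification scores. As a rough illustration of what those metrics measure, here is a minimal sketch using their standard textbook definitions. It is not taken from the authors' repository; the function names and the flat 0/1 mask representation are assumptions made for this example, and the correlation is assumed to be Pearson's.

```python
from math import sqrt

def dice(pred, truth):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match (a common convention).
    return 1.0 if total == 0 else 2.0 * overlap / total

def mape(manual, automated):
    """Mean absolute percentage error (in percent), relative to the manual scores."""
    errors = [abs(a - m) / abs(m) for m, a in zip(manual, automated)]
    return 100.0 * sum(errors) / len(errors)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)
```

In practice the masks would be 3D voxel arrays flattened to one dimension, and `mape` assumes that no manual score is zero.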
-
TransONet: Automatic Segmentation of Vasculature in Computed Tomographic Angiograms Using Deep Learning
Authors:
Alireza Bagheri Rajeoni,
Breanna Pederson,
Ali Firooz,
Hamed Abdollahi,
Andrew K. Smith,
Daniel G. Clair,
Susan M. Lessner,
Homayoun Valafar
Abstract:
Pathological alterations in the human vascular system underlie many chronic diseases, such as atherosclerosis and aneurysms. However, manually analyzing diagnostic images of the vascular system, such as computed tomographic angiograms (CTAs), is a time-consuming and tedious process. To address this issue, we propose a deep learning model to segment the vascular system in CTA images of patients undergoing surgery for peripheral arterial disease (PAD). Our study focused on accurately segmenting the vascular system (1) from the descending thoracic aorta to the iliac bifurcation and (2) from the descending thoracic aorta to the knees in CTA images using deep learning techniques. Our approach achieved average Dice accuracies of 93.5% and 80.64% on the test dataset for (1) and (2), respectively, highlighting its high accuracy and potential clinical utility. These findings demonstrate the use of deep learning techniques as a valuable tool for medical professionals to analyze the health of the vascular system efficiently and accurately. Please visit the GitHub page for this paper at https://github.com/pip-alireza/TransOnet.
Submitted 16 November, 2023;
originally announced November 2023.
-
Wordification: A New Way of Teaching English Spelling Patterns
Authors:
Lexington Whalen,
Nathan Bickel,
Shash Comandur,
Dalton Craven,
Stanley Dubinsky,
Homayoun Valafar
Abstract:
Literacy, or the ability to read and write, is a crucial indicator of success in life and greater society. It is estimated that 85% of people in juvenile delinquent systems cannot adequately read or write, that more than half of those with substance abuse issues have complications in reading or writing, and that two-thirds of those who do not complete high school lack proper literacy skills. Furthermore, young children who do not possess reading skills matching grade level by the fourth grade are approximately 80% likely to not catch up at all. Many may believe that in a developed country such as the United States, literacy fails to be an issue; however, this is a dangerous misunderstanding. Globally an estimated 1.19 trillion dollars are lost every year due to issues in literacy; in the USA, the loss is an estimated 300 billion. To put it in more shocking terms, one in five American adults still fail to comprehend basic sentences. Making matters worse, the only tools available now to correct a lack of reading and writing ability are found in expensive tutoring or other programs that oftentimes fail to be able to reach the required audience. In this paper, our team puts forward a new way of teaching English spelling and word recognition to grade school students in the United States: Wordification. Wordification is a web application designed to teach English literacy using principles of linguistics applied to the orthographic and phonological properties of words in a manner not fully utilized previously in any computer-based teaching application.
Submitted 10 November, 2023; v1 submitted 29 August, 2023;
originally announced September 2023.
-
nD-PDPA: nDimensional Probability Density Profile Analysis
Authors:
Arjang Fahim,
Stephanie Irausquin,
Homayoun Valafar
Abstract:
Despite the recent advances in various Structural Genomics Projects, a large gap remains between the number of sequenced and structurally characterized proteins. Some reasons for this discrepancy include technical difficulties, labor, and the cost related to determining a structure by experimental methods such as NMR spectroscopy. Several computational methods have been developed to expand the applicability of NMR spectroscopy by addressing temporal and economical problems more efficiently. While these methods demonstrate successful outcomes to solve more challenging and structurally novel proteins, the cost has not been reduced significantly. Probability Density Profile Analysis (PDPA) has been previously introduced by our lab to directly address the economics of structure determination of routine proteins and the identification of novel structures from a minimal set of unassigned NMR data. 2D-PDPA (in which 2D denotes incorporation of data from two alignment media) has been successful in identifying the structural homolog of an unknown protein within a library of ~1000 decoy structures. In order to further expand the selectivity and sensitivity of PDPA, the incorporation of additional data was necessary. However, the expansion of the original PDPA approach was limited by its computational requirements, where the inclusion of additional data would render it computationally intractable. Here we present the most recent developments of the PDPA method (nD-PDPA: n Dimensional Probability Density Profile Analysis) that eliminate 2D-PDPA's computational limitations and allow inclusion of RDC data from multiple vector types in multiple alignment media.
Submitted 5 April, 2023;
originally announced April 2023.
-
Comprehensive and user-analytics-friendly cancer patient database for physicians and researchers
Authors:
Ali Firooz,
Avery T. Funkhouser,
Julie C. Martin,
W. Jeffery Edenfield,
Homayoun Valafar,
Anna V. Blenda
Abstract:
Nuanced cancer patient care is needed, as the development and clinical course of cancer is multifactorial with influences from the general health status of the patient, germline and neoplastic mutations, co-morbidities, and environment. To effectively tailor an individualized treatment to each patient, such multifactorial data must be presented to providers in an easy-to-access and easy-to-analyze fashion. To address the need, a relational database has been developed integrating status of cancer-critical gene mutations, serum galectin profiles, serum and tumor glycomic profiles, with clinical, demographic, and lifestyle data points of individual cancer patients. The database, as a backend, provides physicians and researchers with a single, easily accessible repository of cancer profiling data to aid-in and enhance individualized treatment. Our interactive database allows care providers to amalgamate cohorts from these groups to find correlations between different data types with the possibility of finding "molecular signatures" based upon a combination of genetic mutations, galectin serum levels, glycan compositions, and patient clinical data and lifestyle choices. Our project provides a framework for an integrated, interactive, and growing database to analyze molecular and clinical patterns across cancer stages and subtypes and provides opportunities for increased diagnostic and prognostic power.
Submitted 1 February, 2023;
originally announced February 2023.
-
On Creating a Comprehensive Food Database
Authors:
Lexington Whalen,
Brie Turner-McGrievy,
Matthew McGrievy,
Andrew Hester,
Homayoun Valafar
Abstract:
Studies with the primary aim of addressing eating disorders focus on assessing the nutrient content of food items with an exclusive focus on caloric intake. There are two primary impediments that can be noted in these studies. The first of these relates to the fact that caloric intake of each food item is calculated from an existing database. The second concerns the scientific significance of caloric intake used as the single measure of nutrient content. By requiring an existing database, researchers are forced to find some source of a comprehensive set of food items as well as their respective nutrients. This search alone is a difficult task, and if completed often leads to the requirement of a paid API service. These services are expensive and non-customizable, taking away funding that could be aimed at other parts of the study only to give an unwieldy database that cannot be modified or contributed to. In this work, we introduce a new rendition of the USDA's food database that includes both foods found in grocery stores and those found in restaurants or fast food places. At the moment, we have accumulated roughly 1.5 million food entries consisting of approximately 18,000 brands and 100 restaurants in the United States. These foods also have an abundance of nutrient data associated with them, from the caloric amount to saturated fat levels. The data is stored in MySQL format and is spread among five major tables. We have also procured images for these food entries when available, and have included all of our data and program scripts in an open source repository.
Submitted 25 January, 2023;
originally announced January 2023.
-
Human Activity Recognition on Time Series Accelerometer Sensor Data using LSTM Recurrent Neural Networks
Authors:
Chrisogonas O. Odhiambo,
Sanjoy Saha,
Corby K. Martin,
Homayoun Valafar
Abstract:
The use of sensors available through smart devices has pervaded everyday life in several applications including human activity monitoring, healthcare, and social networks. In this study, we focus on the use of smartwatch accelerometer sensors to recognize eating activity. More specifically, we collected sensor data from 10 participants while they consumed pizza. Using this information, and other comparable data available for similar events such as smoking and medication-taking, and dissimilar activities of jogging, we developed an LSTM-ANN architecture that has demonstrated 90% success in identifying individual bites compared to a puff, medication-taking or jogging activities.
Submitted 3 June, 2022;
originally announced June 2022.
-
Application of Dimensional Reduction in Artificial Neural Networks to Improve Emergency Department Triage During Chemical Mass Casualty Incidents
Authors:
Nicholas D. Boltin,
Joan M. Culley,
Homayoun Valafar
Abstract:
Chemical Mass Casualty Incidents (MCI) place a heavy burden on hospital staff and resources. Machine Learning (ML) tools can provide efficient decision support to caregivers. However, ML models require large volumes of data for the most accurate results, which is typically not feasible in the chaotic nature of a chemical MCI. This study examines the application of four statistical dimension reduction techniques: Random Selection, Covariance/Variance, Pearson's Linear Correlation, and Principal Component Analysis to reduce a dataset of 311 hazardous chemicals and 79 related signs and symptoms (SSx). An Artificial Neural Network pipeline was developed to create comparative models. Results show that the number of signs and symptoms needed to determine a chemical culprit can be reduced to nearly 40 SSx without losing significant model accuracy. Evidence also suggests that the application of dimension reduction methods can improve ANN model performance accuracy.
Submitted 1 April, 2022;
originally announced April 2022.
-
Application of Machine Learning to Sleep Stage Classification
Authors:
Andrew Smith,
Hardik Anand,
Snezana Milosavljevic,
Katherine M. Rentschler,
Ana Pocivavsek,
Homayoun Valafar
Abstract:
Sleep studies are imperative to recapitulate phenotypes associated with sleep loss and uncover mechanisms contributing to psychopathology. Most often, investigators manually classify the polysomnography into vigilance states, which is time-consuming, requires extensive training, and is prone to inter-scorer variability. While many works have successfully developed automated vigilance state classifiers based on multiple EEG channels, we aim to produce an automated and open-access classifier that can reliably predict vigilance state based on a single cortical electroencephalogram (EEG) from rodents to minimize the disadvantages that accompany tethering small animals via wires to computer programs. Approximately 427 hours of continuously monitored EEG, electromyogram (EMG), and activity were labeled by a domain expert out of 571 hours of total data. Here we evaluate the performance of various machine learning techniques on classifying 10-second epochs into one of three discrete classes: paradoxical, slow-wave, or wake. Our investigations include Decision Trees, Random Forests, Naive Bayes Classifiers, Logistic Regression Classifiers, and Artificial Neural Networks. These methodologies have achieved accuracies ranging from approximately 74% to approximately 96%. Most notably, the Random Forest and the ANN achieved remarkable accuracies of 95.78% and 93.31%, respectively. Here we have shown the potential of various machine learning classifiers to automatically, accurately, and reliably classify vigilance states based on a single EEG reading and a single EMG reading.
Submitted 22 May, 2022; v1 submitted 4 November, 2021;
originally announced November 2021.
-
Application of Machine Learning in Early Recommendation of Cardiac Resynchronization Therapy
Authors:
Brendan E. Odigwe,
Francis G. Spinale,
Homayoun Valafar
Abstract:
Heart failure (HF) is a leading cause of morbidity, mortality, and health care costs. Prolonged conduction through the myocardium can occur with HF, and a device-driven approach, termed cardiac resynchronization therapy (CRT), can improve left ventricular (LV) myocardial conduction patterns. While a functional benefit of CRT has been demonstrated, a large proportion of HF patients (30-50%) receiving CRT do not show sufficient improvement. Moreover, identifying HF patients that would benefit from CRT prospectively remains a clinical challenge. Accordingly, strategies to effectively predict those HF patients that would derive a functional benefit from CRT hold great medical and socio-economic importance. Thus, we used machine learning methods of classifying HF patients, namely Cluster Analysis, Decision Trees, and Artificial Neural Networks, to develop predictive models of individual outcomes following CRT. Clinical, functional, and biomarker data were collected in HF patients before and following CRT. A prospective 6-month endpoint of a reduction in LV volume was defined as a CRT response. Using this approach (418 responders, 412 non-responders), each with 56 parameters, we could classify HF patients based on their response to CRT with more than 95% success. We have demonstrated that using machine learning approaches can identify HF patients with a high probability of a positive CRT response (95% accuracy), and of equal importance, identify those HF patients that would not derive a functional benefit from CRT. Developing this approach into a clinical algorithm to assist in clinical decision-making regarding the use of CRT in HF patients would potentially improve outcomes and reduce health care costs.
Submitted 13 September, 2021;
originally announced September 2021.
-
MedSensor: Medication Adherence Monitoring Using Neural Networks on Smartwatch Accelerometer Sensor Data
Authors:
Chrisogonas Odhiambo,
Pamela Wright,
Cindy Corbett,
Homayoun Valafar
Abstract:
Poor medication adherence presents serious economic and health problems including compromised treatment effectiveness, medical complications, and loss of billions of dollars in wasted medicine or procedures. Though various interventions have been proposed to address this problem, there is an urgent need to leverage light, smart, and minimally obtrusive technology such as smartwatches to develop user tools to improve medication use and adherence. In this study, we conducted several experiments on medication-taking activities, developed a smartwatch Android application to collect the accelerometer hand gesture data from the smartwatch, and conveyed the data collected to a central cloud database. We developed neural networks, then trained the networks on the sensor data to recognize medication and non-medication gestures. With the proposed machine learning algorithm approach, this study was able to achieve average accuracy scores of 97% on the protocol-guided gesture data, and 95% on natural gesture data.
Submitted 18 May, 2021;
originally announced May 2021.
-
Structure Calculation and Reconstruction of Discrete State Dynamics from Residual Dipolar Couplings using REDCRAFT
Authors:
Casey A. Cole,
Rishi Mukhapadhyay,
Hanin Omar,
Mirko Hennig,
Homayoun Valafar
Abstract:
Residual Dipolar Couplings (RDCs) acquired by Nuclear Magnetic Resonance (NMR) spectroscopy can be an indispensable source of information in investigation of molecular structures and dynamics. Here we present a complete strategy for structure calculation and reconstruction of discrete state dynamics from RDC data. Our method utilizes the previously presented REDCRAFT software package and its dynamic-profile analysis to complete the task of fragmented structure determination and identification of the onset of dynamics from RDC data. Fragmented structure determination was used to demonstrate successful structure calculation of static and dynamic domains for several models of dynamics. We provide a mechanism of producing an ensemble of conformations for the dynamical regions that describe any observed order tensor discrepancies between the static and dynamic domains within a protein. In addition, the presented method is capable of approximating relative occupancy of each conformational state. The developed methodology has been evaluated on simulated RDC data with 1 Hz of error from an 83 residue α protein (PDBID 1A1Z), and a 213 residue α/β protein DGCR8 (PDBID 2YT4). Using 1A1Z, various models of arc and complex two and three discrete-state dynamics were simulated. MD simulation was used to generate a 2-state dynamics for DGCR8. In both instances our method reproduced structure of the protein including the conformational ensemble to within less than 2 Å. Based on our investigations, arc motions with more than 30° of rotation are recognized as internal dynamics and are reconstructed with sufficient accuracy. Furthermore, states with relative occupancies above 20% are consistently recognized and reconstructed successfully. Arc motions with magnitude of 15° or relative occupancy of less than 10% are consistently unrecognizable as dynamical regions.
Submitted 18 March, 2021;
originally announced March 2021.
-
Parallel Implementation of Distributed Global Optimization (DGO)
Authors:
Homayoun Valafar,
Okan K. Ersoy,
Faramarz Valafar
Abstract:
Parallel implementations of distributed global optimization (DGO) [13] on MP-1 and NCUBE parallel computers revealed an approximate O(n) increase in the performance of this algorithm. Therefore, the implementation of the DGO on parallel processors can remedy the only drawback of this algorithm, which is the O(n^2) growth of execution time as the number of dimensions increases. The speedup factor of the parallel implementations of DGO is measured with respect to the sequential execution time of the identical problem on a SPARC IV computer. The best speedup was achieved by the SIMD implementation of the algorithm on the MP-1, with a total speedup of 126 for an optimization problem with n = 9. This optimization problem was distributed across 128 PEs of Mas-Par.
Submitted 16 December, 2020;
originally announced December 2020.
-
Reduction in the complexity of 1D 1H-NMR spectra by the use of Frequency to Information Transformation
Authors:
Homayoun Valafar,
Faramarz Valafar
Abstract:
Analysis of 1H-NMR spectra is often hindered by large variations that occur during the collection of these spectra. Large solvent and standard peaks, baseline drift and negative peaks (due to improper phasing) are among some of these variations. Furthermore, some instrument dependent alterations, such as incorrect shimming, are also embedded in the recorded spectrum. The unpredictable nature of these alterations of the signal has rendered the automated and instrument independent computer analysis of these spectra unreliable. In this paper, a novel method of extracting the information content of a signal (in this paper, frequency domain 1H-NMR spectrum), called the frequency-information transformation (FIT), is presented and compared to a previously used method (SPUTNIK). FIT can successfully extract the information in a signal that is relevant to a pattern matching task, while discarding the remainder of the signal, by transforming a Fourier transformed signal into an information spectrum (IS). This technique exhibits the ability of decreasing the inter-class correlation coefficients while increasing the intra-class correlation coefficients. In other words, different spectra of the same molecule will resemble each other more closely, while the spectra of different molecules will look more different from each other. This feature allows easier automated identification and analysis of molecules based on their spectral signatures using computer algorithms.
Submitted 16 December, 2020;
originally announced December 2020.
-
Distributed Global Optimization (DGO)
Authors:
Homayoun Valafar,
Okan K. Ersoy,
Faramarz Valafar
Abstract:
A new technique of global optimization and its applications in particular to neural networks are presented. The algorithm is also compared to other global optimization algorithms such as Gradient descent (GD), Monte Carlo (MC), Genetic Algorithm (GA) and other commercial packages. This new optimization technique proved itself worthy of further study after observing its accuracy of convergence, speed of convergence and ease of use. Some of the advantages of this new optimization technique are listed below: 1. Optimizing function does not have to be continuous or differentiable. 2. No random mechanism is used, therefore this algorithm does not inherit the slow speed of random searches. 3. There are no fine-tuning parameters (such as the step rate of G.D. or temperature of S.A.) needed for this technique. 4. This algorithm can be implemented on parallel computers so that there is little increase in computation time (compared to linear increase) as the number of dimensions increases. The time complexity of O(n) is achieved.
Submitted 16 December, 2020;
originally announced December 2020.
-
TALI: Protein Structure Alignment Using Backbone Torsion Angles
Authors:
Xijiang Miao,
Michael G. Bryson,
Homayoun Valafar
Abstract:
This article introduces a novel protein structure alignment method (named TALI) based on the protein backbone torsion angle instead of the more traditional distance matrix. Because the structural alignment of the two proteins is based on the comparison of two sequences of numbers (backbone torsion angles), we can take advantage of a large number of well-developed methods such as Smith-Waterman or Needleman-Wunsch. Here we report the result of TALI in comparison to other structure alignment methods such as DALI, CE, and SSM as well as sequence alignment based on PSI-BLAST. TALI demonstrated great success over all other methods in application to challenging proteins. TALI was more successful in recognizing remote structural homology. TALI also demonstrated an ability to identify structural homology between two proteins where the structural difference was due to a rotation of internal domains by nearly 180°.
Submitted 11 December, 2020;
originally announced December 2020.
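The idea in the TALI abstract above, aligning two proteins as sequences of backbone torsion angles with a standard dynamic-programming method, can be illustrated with a minimal Smith-Waterman sketch. This is not the TALI implementation: the scoring function `pair_score`, its `tolerance` parameter, and the linear `gap` penalty are assumptions chosen only for illustration.

```python
def angular_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def pair_score(res_a, res_b, tolerance=30.0):
    """Hypothetical score for two (phi, psi) pairs.

    1.0 for identical angles, 0.0 at the tolerance, negative beyond it.
    """
    worst = max(angular_diff(res_a[0], res_b[0]),
                angular_diff(res_a[1], res_b[1]))
    return 1.0 - worst / tolerance


def smith_waterman(seq_a, seq_b, gap=1.0):
    """Return the best local alignment score and the cell (i, j) where it ends."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0.0,                                                # start a new local alignment
                H[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1]),
                H[i - 1][j] - gap,                                  # gap in seq_b
                H[i][j - 1] - gap,                                  # gap in seq_a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos


# Two toy backbones sharing a four-residue helical stretch (phi, psi in degrees).
helix = [(-60.0, -45.0)] * 4
protein_a = [(120.0, 130.0)] + helix
protein_b = helix + [(-120.0, 140.0)]
score, end = smith_waterman(protein_a, protein_b)
```

Because the angles wrap at 360°, the score uses the circular difference rather than a plain subtraction; any real scoring scheme would need to be tuned against known alignments.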
-
A Comparative study of Artificial Neural Networks Using Reinforcement learning and Multidimensional Bayesian Classification Using Parzen Density Estimation for Identification of GC-EIMS Spectra of Partially Methylated Alditol Acetates
Authors:
Faramarz Valafar,
Homayoun Valafar
Abstract:
This study reports the development of a pattern recognition search engine for a World Wide Web-based database of gas chromatography-electron impact mass spectra (GC-EIMS) of partially methylated alditol acetates (PMAAs). Here, we also report comparative results for two pattern recognition techniques that were employed for this study. The first technique is a statistical technique using Bayesian classifiers and Parzen density estimators. The second technique involves an artificial neural network module trained with reinforcement learning. We demonstrate here that both systems perform well in identifying spectra with small amounts of noise. The performance of both systems degrades with degrading signal-to-noise ratio (SNR). When dealing with partial spectra (missing data), the artificial neural network system performs better. The developed system is implemented on the World Wide Web, and is intended to identify PMAAs using submitted spectra of these molecules recorded on any GC-EIMS instrument. The system, therefore, is insensitive to instrument- and column-dependent variations in GC-EIMS spectra.
Submitted 31 July, 2020;
originally announced August 2020.
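The first technique named in the abstract above, a Bayesian classifier with Parzen density estimation, can be sketched in a few lines. The sketch below is a generic textbook version, not the authors' system: the Gaussian kernel, the bandwidth `h`, and the toy two-dimensional "spectra" are all assumptions made for illustration.

```python
import math


def parzen_density(x, samples, h=1.0):
    """Gaussian-kernel Parzen estimate of the density at x from training samples."""
    d = len(x)
    norm = (2.0 * math.pi * h * h) ** (d / 2.0)
    total = 0.0
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (len(samples) * norm)


def classify(x, training, h=1.0):
    """Return the label maximizing prior * class-conditional Parzen density.

    `training` maps each class label to a list of feature vectors.
    """
    n_total = sum(len(vectors) for vectors in training.values())

    def posterior(label):
        prior = len(training[label]) / n_total
        return prior * parzen_density(x, training[label], h)

    return max(training, key=posterior)


# Toy data: two classes of two-dimensional feature vectors (hypothetical).
training = {
    "A": [[0.0, 0.0], [0.0, 1.0]],
    "B": [[5.0, 5.0], [6.0, 5.0]],
}
label_near_a = classify([0.2, 0.4], training)
label_near_b = classify([5.5, 5.1], training)
```

The bandwidth `h` controls how smooth the estimated density is; for real spectra it would have to be chosen by validation rather than fixed at 1.0.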
-
Parallel, Self Organizing, Consensus Neural Networks
Authors:
Homayoun Valafar,
Faramarz Valafar,
Okan Ersoy
Abstract:
A new neural network architecture (PSCNN) is developed to improve performance and speed of such networks. The architecture has all the advantages of the previous models such as self-organization and possesses some other superior characteristics such as input parallelism and decision making based on consensus. Due to the properties of this network, it was studied with respect to implementation on a Parallel Processor (Ncube Machine) as well as a regular sequential machine. The architecture self organizes its own modules in a way to maximize performance. Since it is completely parallel, both recall and learning procedures are very fast. The performance of the network was compared to the Backpropagation networks in problems of language perception, remote sensing and binary logic (Exclusive-Or). PSCNN showed superior performance in all cases studied.
Submitted 30 July, 2020;
originally announced August 2020.
-
Identification of 1H-NMR Spectra of Xyloglucan Oligosaccharides: A Comparative Study of Artificial Neural Networks and Bayesian Classification Using Nonparametric Density Estimation
Authors:
Faramarz Valafar,
Homayoun Valafar,
William S. York
Abstract:
Proton nuclear magnetic resonance (1H-NMR) is a widely used tool for chemical structural analysis. However, 1H-NMR spectra suffer from natural aberrations that render computer-assisted automated identification of these spectra difficult, and at times impossible. Previous efforts have successfully implemented instrument-dependent or conditional identification of these spectra. In this paper, we report the first instrument-independent computer-assisted automated identification system for a group of complex carbohydrates known as the xyloglucan oligosaccharides. The developed system is also implemented on the World Wide Web (http://www.ccrc.uga.edu) as part of an identification package called the CCRC-Net and is intended to recognize any submitted 1H-NMR spectrum of these structures with reasonable signal-to-noise ratio, recorded on any 500 MHz NMR instrument. The system uses Artificial Neural Network (ANN) technology and is insensitive to the instrument- and environment-dependent variations in 1H-NMR spectroscopy. In this paper, comparative results of the ANN engine versus a multidimensional Bayes' classifier are also presented.
Submitted 30 July, 2020;
originally announced August 2020.
-
An Investigation in Optimal Encoding of Protein Primary Sequence for Structure Prediction by Artificial Neural Networks
Authors:
Aaron Hein,
Casey Cole,
Homayoun Valafar
Abstract:
Machine learning and the use of neural networks have increased precipitously over the past few years, primarily due to the ever-increasing accessibility of data and the growth of computational power. It has become increasingly easy to harness the power of machine learning for predictive tasks. Protein structure prediction is one area where neural networks are becoming increasingly popular and successful. Although very powerful, the use of ANNs requires selection of the most appropriate input/output encoding, architecture, and class to produce optimal results. In this investigation we have explored and evaluated the effect of several conventional and newly proposed input encodings and selected an optimal architecture. We considered 11 variations of input encoding, 11 alternative window sizes, and 7 different architectures. In total, we evaluated 2,541 permutations in application to the training and testing of more than 10,000 protein structures over the course of 3 months. Our investigations concluded that one-hot encoding, the use of LSTMs, and window sizes of 9, 11, and 15 produce the optimal outcome. Through this optimization, we were able to improve the quality of protein structure prediction by predicting the φ dihedrals to within 14°-16° and ψ dihedrals to within 23°-25°. This is a notable improvement compared to similar previous investigations.
Submitted 2 August, 2020;
originally announced August 2020.
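To make the terms "one-hot encoding" and "window size" in the abstract above concrete, here is a minimal sketch of a sliding-window one-hot encoder for a protein primary sequence. It is an illustration only, not the authors' pipeline: the 20-letter alphabet ordering, the use of an all-zero vector for padding positions, and the function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; ordering is arbitrary
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """20-element vector with a single 1; all zeros for padding or unknown symbols."""
    vec = [0] * len(AMINO_ACIDS)
    if residue in INDEX:
        vec[INDEX[residue]] = 1
    return vec


def encode_windows(sequence, window=9):
    """Return one flattened one-hot window per residue, centred on that residue."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so that it has a centre residue")
    half = window // 2
    padded = "-" * half + sequence + "-" * half  # '-' marks positions past the termini
    encoded = []
    for centre in range(len(sequence)):
        chunk = padded[centre:centre + window]
        encoded.append([bit for res in chunk for bit in one_hot(res)])
    return encoded


# A nine-residue toy sequence encoded with a window of 9 residues.
features = encode_windows("ACDEFGHIK", window=9)
```

Each residue becomes a vector of `window * 20` inputs, so a window of 9 yields 180 features per residue; the dihedral angles of the centre residue would be the prediction target.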
-
Process of Efficiently Parallelizing a Protein Structure Determination Algorithm
Authors:
Michael Bryson,
Xijiang Miao,
Homayoun Valafar
Abstract:
Computational protein structure determination involves optimization in a problem space much too large to search exhaustively. Existing approaches include optimization algorithms such as gradient descent and simulated annealing, but these typically only find local minima. One novel approach, implemented in REDcRAFT, is to fold a protein residue by residue instead of all at once. This simulates a protein folding as each residue exits from the generating ribosome. While REDcRAFT exponentially reduces the problem space so it can be explored in polynomial time, it is still extremely computationally demanding. This algorithm does have the advantage that most of the execution time is spent in inherently parallelizable code. However, preliminary results from parallel execution indicate that approximately two-thirds of execution time is dedicated to system overhead. Additionally, by carefully analyzing and timing the structure of the program, the major bottlenecks can be identified. After addressing these issues, REDcRAFT becomes a scalable parallel application with nearly two orders of magnitude improvement.
Submitted 31 July, 2020;
originally announced August 2020.
-
A Preliminary Investigation in the Molecular Basis of Host Shutoff Mechanism in SARS-CoV
Authors:
Niharika Pandala,
Casey A. Cole,
Devaun McFarland,
Anita Nag,
Homayoun Valafar
Abstract:
Recent events leading to the worldwide pandemic of COVID-19 have demonstrated the effective use of genomic sequencing technologies to establish the genetic sequence of this virus. In contrast, the COVID-19 pandemic has demonstrated the absence of computational approaches to understand the molecular basis of this infection rapidly. Here we present an integrated approach to the study of the nsp1 protein in SARS-CoV-1, which plays an essential role in maintaining the expression of viral proteins and further disabling the host protein expression, also known as the host shutoff mechanism. We present three independent methods of evaluating two potential binding sites speculated to participate in host shutoff by nsp1. We have combined results from computed models of nsp1, with deep mining of all existing protein structures (using PDBMine), and binding site recognition (using msTALI) to examine the two sites consisting of residues 55-59 and 73-80. Based on our preliminary results, we conclude that the residues 73-80 appear as the regions that facilitate the critical initial steps in the function of nsp1. Given the 90% sequence identity between nsp1 from SARS-CoV-1 and SARS-CoV-2, we conjecture the same critical initiation step in the function of COVID-19 nsp1.
Submitted 23 July, 2020;
originally announced July 2020.
-
Assessing the Precision and Recall of msTALI as Applied to an Active-Site Study on Fold Families
Authors:
Devaun McFarland,
Homayoun Valafar
Abstract:
Proteins execute various activities required by biological cells. Further, they structurally support and promote important biochemical reactions which functionally are sparked by active-sites. Active-sites are regions where reactions and binding events take place directly; they foster protein purpose. Describing functional relationships depends on factors that incorporate sequence, structure, and the biochemical properties of amino acids that form proteins. Our approach to active-site description is computational, and many other approaches utilizing available protein data fall short of ideal. Successful recognition of functional interactions is crucial to advancements in protein annotation and the bioinformatics field at large. This research outlines our Multiple Structure Torsion Angle Alignment (msTALI) as a suitable strategy for addressing active-site identification by comparing results to other existing methods. Specifically, we address the precision of msTALI across three protein families. Our target proteins are PDBIDs 1A2B, 1B4V, 1B8S, 1COY, 1CXZ, 3COX, 1D7E, 1DPF, 1F9I, 1FTN, 1IJH, 1KOU, 1NWZ, 2PHY, and 1SIC.
Submitted 7 May, 2020;
originally announced May 2020.
-
An Artificial Neural Network Based Approach for Identification of Native Protein Structures using an Extended ForceField
Authors:
Timothy Matthew Fawcett,
Stephanie Irausquin,
Mikhail Simin,
Homayoun Valafar
Abstract:
Current protein forcefields like the ones seen in CHARMM or Xplor-NIH have many terms that include bonded and non-bonded terms. Yet the forcefields do not take into account the use of hydrogen bonds which are important for secondary structure creation and stabilization of proteins. SCOPE is an open-source program that generates proteins from rotamer space. It then creates a forcefield that uses only non-bonded and hydrogen bond energy terms to create a profile for a given protein. The profiles can then be used in an artificial neural network to create a linear model that is funneled to the true protein conformation.
Submitted 5 March, 2020;
originally announced March 2020.
-
Recognition of Smoking Gesture Using Smart Watch Technology
Authors:
Casey A. Cole,
Bethany Janos,
Dien Anshari,
James F. Thrasher,
Scott Strayer,
Homayoun Valafar
Abstract:
Diseases resulting from prolonged smoking are the most common preventable causes of death in the world today. In this report we investigate the success of utilizing accelerometer sensors in smart watches to identify smoking gestures. Early identification of smoking gestures can help to initiate the appropriate intervention method and prevent relapses in smoking. Our experiments indicate 85%-95% success rates in identification of the smoking gesture among other similar gestures using Artificial Neural Networks (ANNs). Our investigations concluded that information obtained from the x-dimension of accelerometers is the best means of identifying the smoking gesture, while the y and z dimensions are helpful in eliminating other gestures such as eating, drinking, and nose scratching. We utilized sensor data from the Apple Watch during the training of the ANN. Using sensor data from another participant collected on a Pebble Steel, we obtained a smoking identification accuracy of greater than 90% when using an ANN trained on data previously collected from the Apple Watch. Finally, we have demonstrated the possibility of using smart watches to perform continuous monitoring of daily activities.
Submitted 5 March, 2020;
originally announced March 2020.
-
An AI model for Rapid and Accurate Identification of Chemical Agents in Mass Casualty Incidents
Authors:
Nicholas Boltin,
Daniel Vu,
Bethany Janos,
Alyssa Shofner,
Joan Culley,
Homayoun Valafar
Abstract:
In this report we examine the effectiveness of WISER in identification of a chemical culprit during a chemical-based Mass Casualty Incident (MCI). We also evaluate and compare Binary Decision Tree (BDT) and Artificial Neural Networks (ANN) using the same experimental conditions as WISER. The reverse-engineered set of Signs/Symptoms from the WISER application was used as the training set and 31,100 simulated patient records were used as the testing set. Three sets of simulated patient records were generated by 5%, 10% and 15% perturbation of the Signs/Symptoms of each chemical record. While all three methods achieved a 100% training accuracy, WISER, BDT and ANN produced performances in the ranges of 1.8%-0%, 65%-26%, and 67%-21%, respectively. A preliminary investigation of dimensional reduction using ANN illustrated a dimensional collapse from 79 variables to 40 with little loss of classification performance.
Submitted 12 December, 2019;
originally announced January 2020.
-
De Novo Assembly of Uca minax Transcriptome from Next Generation Sequencing
Authors:
Hanin Omar,
Casey A. Cole,
Arjang Fahim,
Giuliana Gusmaroli,
Stephen Borgianini,
Homayoun Valafar
Abstract:
High-throughput cDNA sequencing (RNA-seq) is a very powerful technique to quantify gene expression in an unbiased way. The Crustacean family is among the groups of organisms sparsely represented in current genomic databases. Here we present transcriptome data from Uca minax (red-jointed fiddler crab) as an opportunity to extend our knowledge. Next generation sequencing was performed on six tissue samples from Uca minax using the Illumina HiSeq system. Six transcriptome libraries were created using Trinity, a free, open-source software tool for de novo transcriptome assembly of high-throughput mRNA sequencing (RNA-seq) data in the absence of a reference genome. In addition, several tools that aid in management of data were used, such as RSEM, Bowtie, Blast, and IGV, a tool for visualizing RNA-seq analysis results. Fast quality control (FastQC) analysis of the raw sequenced files revealed that both adapter and PCR primer sequences were prevalent, which may require a preprocessing step.
Submitted 9 January, 2020;
originally announced January 2020.
-
An Investigation of Minimum Data Requirement for Successful Structure Determination of Pf2048.1 with REDCRAFT
Authors:
Casey A. Cole,
Daniela Ishimaru,
Mirko Hennig,
Homayoun Valafar
Abstract:
Traditional approaches to elucidation of protein structures by NMR spectroscopy rely on distance restraints also known as nuclear Overhauser effects (NOEs). The use of NOEs as the primary source of structure determination by NMR spectroscopy is time consuming and expensive. Residual Dipolar Couplings (RDCs) have become an alternate approach for structure calculation by NMR spectroscopy. In this work we report our results for structure calculation of the novel protein PF2048.1 from RDC data and establish the minimum data requirement for successful structure calculation using the software package REDCRAFT. Our investigations start with utilizing four sets of synthetic RDC data in two alignment media and proceed by reducing the RDC data to the final limit of {CN, NH} and {NH} from two alignment media, respectively. Our results indicate that structure elucidation of this protein is possible with as little as {CN, NH} and {NH} to within 0.533 Å of the target structure.
Submitted 9 January, 2020;
originally announced January 2020.
-
State Transition Modeling of the Smoking Behavior using LSTM Recurrent Neural Networks
Authors:
Chrisogonas O. Odhiambo,
Casey A. Cole,
Alaleh Torkjazi,
Homayoun Valafar
Abstract:
The use of sensors has pervaded everyday life in several applications including human activity monitoring, healthcare, and social networks. In this study, we focus on the use of smartwatch sensors to recognize smoking activity. More specifically, we have reformulated the previous work in detection of smoking to include in-context recognition of smoking. Our reformulation of the smoking gesture as a state-transition model that consists of the mini-gestures hand-to-lip, hand-on-lip, and hand-off-lip has demonstrated improvement in detection rates nearing 100% using conventional neural networks. In addition, we have begun the utilization of Long Short-Term Memory (LSTM) neural networks to allow for in-context detection of gestures with accuracy nearing 97%.
Submitted 7 January, 2020;
originally announced January 2020.
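The state-transition view of the smoking gesture described in the abstract above can be illustrated with a small finite-state sketch. It assumes that some upstream classifier has already labelled each sensor window with a mini-gesture name; the label strings, the `"other"` label, and the reset rules are assumptions for illustration and are not taken from the paper.

```python
ORDER = ["hand-to-lip", "hand-on-lip", "hand-off-lip"]


def count_smoking_gestures(labels):
    """Count complete hand-to-lip -> hand-on-lip -> hand-off-lip sequences.

    `labels` is a list of per-window mini-gesture labels. A label may repeat
    over consecutive windows; any out-of-order label resets the model.
    """
    state = 0   # index into ORDER of the next expected mini-gesture
    count = 0
    for label in labels:
        if label == ORDER[state]:
            state += 1
            if state == len(ORDER):      # all three mini-gestures seen in order
                count += 1
                state = 0
        elif state > 0 and label == ORDER[state - 1]:
            continue                     # current mini-gesture spans several windows
        elif label == ORDER[0]:
            state = 1                    # a new gesture attempt begins
        else:
            state = 0                    # unrelated activity breaks the sequence
    return count


# One complete gesture, then an incomplete one that skips hand-on-lip.
stream = [
    "other", "hand-to-lip", "hand-to-lip", "hand-on-lip", "hand-on-lip",
    "hand-off-lip", "other", "hand-to-lip", "hand-off-lip",
]
puffs = count_smoking_gestures(stream)
```

Requiring the three mini-gestures in order is what lets such a model reject isolated look-alike motions, since a single hand-to-lip movement without the rest of the sequence is never counted.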
-
Automated Analysis of Femoral Artery Calcification Using Machine Learning Techniques
Authors:
Liang Zhao,
Brendan Odigwe,
Susan Lessner,
Daniel G. Clair,
Firas Mussa,
Homayoun Valafar
Abstract:
We report an object tracking algorithm that combines geometrical constraints, thresholding, and motion detection for tracking of the descending aorta and the network of major arteries that branch from the aorta, including the iliac and femoral arteries. Using our automated identification and analysis, the arterial system was identified with more than 85% success when compared to human annotation. Furthermore, the reported automated system is capable of producing a stenosis profile and a calcification score similar to the Agatston score. The use of stenosis and calcification profiles will lead to the development of better-informed diagnostic and prognostic tools.
Submitted 12 December, 2019;
originally announced December 2019.
-
Modelling of Sickle Cell Anemia Patients Response to Hydroxyurea using Artificial Neural Networks
Authors:
Brendan E. Odigwe,
Jesuloluwa S. Eyitayo,
Celestine I. Odigwe,
Homayoun Valafar
Abstract:
Hydroxyurea (HU) has been shown to be effective in alleviating the symptoms of Sickle Cell Anemia disease. While Hydroxyurea reduces the complications associated with Sickle Cell Anemia in some patients, others do not benefit from this drug and experience deleterious effects since it is also a chemotherapeutic agent. Therefore, the main question asked by the responsible physician is for whom the administration of HU should be considered a viable option. We address this question by developing modeling techniques that can predict a patient's response to HU and therefore spare the non-responsive patients from the unnecessary effects of HU on the values of 22 parameters that can be obtained from blood samples in 122 patients. Using this data, we developed Deep Artificial Neural Network models that can predict, with 92.6% accuracy, the final HbF value of a subject after undergoing HU therapy. Our current studies are focusing on forecasting a patient's HbF response 30 days ahead of time.
Submitted 25 November, 2019;
originally announced November 2019.
-
PDBMine: A Reformulation of the Protein Data Bank to Facilitate Structural Data Mining
Authors:
Casey A Cole,
Christopher Ott,
Diego Valdes,
Homayoun Valafar
Abstract:
Large-scale initiatives such as the Human Genome Project and Structural Genomics, as well as individual research teams, have provided large deposits of genomic and proteomic data. The transfer of data to knowledge has become one of the existing challenges, which is a consequence of capturing data in databases that are optimally designed for archiving and not mining. In this research, we have targeted the Protein Data Bank (PDB) and demonstrated a transformation of its content, named PDBMine, that reduces storage space by an order of magnitude and allows for powerful mining in relation to the topic of protein structure determination. We have demonstrated the utility of PDBMine in exploring the prevalence of dimeric and trimeric amino acid sequences and provided a mechanism of predicting protein structure.
Submitted 19 November, 2019;
originally announced November 2019.
-
Improvements of the REDCRAFT Software Package
Authors:
Casey A Cole,
Caleb Parks,
Julian Rachele,
Homayoun Valafar
Abstract:
Traditional approaches to elucidation of protein structures by NMR spectroscopy rely on distance restraints also known as nuclear Overhauser effects (NOEs). The use of NOEs as the primary source of structure determination by NMR spectroscopy is time consuming and expensive. Residual Dipolar Couplings (RDCs) have become an alternate approach for structure calculation by NMR spectroscopy. In previous works, the software package REDCRAFT has been presented as a means of harnessing the information contained in RDCs for structure calculation of proteins. In this work, we present significant improvements to the REDCRAFT package including: refinement of the decimation procedure, the inclusion of a graphical user interface, adoption of NEF standards, and addition of scripts for enhanced protein modeling options. The improvements to REDCRAFT have resulted in the ability to fold proteins that the previous versions were unable to fold. For instance, we report the results of folding of the protein 1A1Z in the presence of highly erroneous data.
Submitted 19 November, 2019;
originally announced November 2019.
-
Aligning Multiple Protein Structures using Biochemical and Biophysical Properties
Authors:
Paul Shealy,
Homayoun Valafar
Abstract:
Aligning multiple protein structures can yield valuable information about structural similarities among related proteins, as well as provide insight into evolutionary relationships between proteins in a family. We have developed an algorithm (msTALI) for aligning multiple protein structures using biochemical and biophysical properties, including torsion angles, secondary structure, hydrophobicity, and surface accessibility. The algorithm is a progressive alignment algorithm motivated by popular techniques from multiple sequence alignment. It has demonstrated success in aligning the major structural regions of a set of proteins from the s/r kinase family. The algorithm was also successful at aligning functional residues of these proteins. In addition, the algorithm was successful in aligning seven members of the acyl carrier protein family, including both experimentally derived and computationally modeled structures.
Submitted 6 November, 2019;
originally announced November 2019.
-
Using Residual Dipolar Couplings from Two Alignment Media to Detect Structural Homology
Authors:
Ryan Yandle,
Rishi Mukhopadhyay,
Homayoun Valafar
Abstract:
The method of Probability Density Profile Analysis was introduced previously as a tool to find the best match between a set of experimentally generated Residual Dipolar Couplings and a set of known protein structures. While it proved effective on small databases in identifying protein fold families, and in picking the best result from the computational protein folding tool ROBETTA, larger data sets require more data. Here, the method of 2-D Probability Density Profile Analysis is presented, which incorporates paired RDC data from two alignment media for N-H vectors. The method was tested using synthetic RDC data generated with +/-1 Hz error. The results show that the addition of information from a second alignment medium makes 2-D PDPA a much more effective tool that is able to identify a structure from a database of 600 protein fold family representatives.
Submitted 6 November, 2019;
originally announced November 2019.
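As an illustration of the synthetic test data described in the abstract above, the following is a minimal sketch of how paired N-H RDCs from two alignment media might be simulated. It is not the authors' code. It assumes the standard relation D = D_max * v^T S v between a unit bond vector v and a traceless order tensor S, treats the +/-1 Hz error as uniform noise, and uses an illustrative D_max value and tensors that are diagonal in the same frame for simplicity. All function names and constants are hypothetical.

```python
import math
import random

# Assumed N-H dipolar coupling constant in Hz (illustrative value and sign only).
D_MAX_NH = 24350.0


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def diagonal_tensor(sxx, syy):
    """Build a traceless order tensor that is diagonal in the molecular frame."""
    return [[sxx, 0.0, 0.0],
            [0.0, syy, 0.0],
            [0.0, 0.0, -(sxx + syy)]]


def rdc(v, tensor, d_max=D_MAX_NH):
    """Back-compute one RDC as d_max * v^T S v for the bond vector v."""
    v = unit(v)
    return d_max * sum(v[i] * tensor[i][j] * v[j]
                       for i in range(3) for j in range(3))


def paired_rdcs(vectors, tensor_1, tensor_2, error_hz=1.0, rng=None):
    """Return (D1, D2) per bond vector, one value per alignment medium.

    Uniform noise in [-error_hz, +error_hz] is an assumption about how the
    +/-1 Hz error could be modeled.
    """
    rng = rng or random.Random(0)
    pairs = []
    for v in vectors:
        d1 = rdc(v, tensor_1) + rng.uniform(-error_hz, error_hz)
        d2 = rdc(v, tensor_2) + rng.uniform(-error_hz, error_hz)
        pairs.append((d1, d2))
    return pairs


if __name__ == "__main__":
    nh_vectors = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    medium_1 = diagonal_tensor(3e-4, 5e-4)
    medium_2 = diagonal_tensor(-4e-4, 1e-4)
    for d1, d2 in paired_rdcs(nh_vectors, medium_1, medium_2):
        print(f"{d1:8.2f} Hz  {d2:8.2f} Hz")
```

Each (D1, D2) pair is one point in the two-dimensional RDC space that a 2-D profile would be built from. In practice the two media would generally have differently oriented tensors, which this sketch does not model.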
-
Automated Assignment of Backbone Resonances Using Residual Dipolar Couplings Acquired from a Protein with Known Structure
Authors:
P. Shealy,
R. Mukhopadhyay,
S. Smith,
H. Valafar
Abstract:
Resonance assignment is a critical first step in the investigation of protein structures using NMR spectroscopy. The development of assignment methods that require less experimental data is possible with prior knowledge of the macromolecular structure. Automated methods of performing the task of resonance assignment can significantly reduce the financial cost and time requirement for protein structure determination. Such methods can also be beneficial in validating a protein's solution state structure. Here we present a new approach to the assignment problem. Our approach uses only RDC data to assign backbone resonances. It provides simultaneous order tensor estimation and assignment. Our approach compares independent order tensor estimates to determine when the correct order tensor has been found. We demonstrate the algorithm's viability using simulated data from the protein domain 1A1Z.
Submitted 1 November, 2019;
originally announced November 2019.
-
Protein Fold Family Recognition From Unassigned Residual Dipolar Coupling Data
Authors:
Rishi Mukhopadhyay,
Paul Shealy,
Homayoun Valafar
Abstract:
Despite many advances in computational modeling of protein structures, these methods have not been widely utilized by experimental structural biologists. Two major obstacles are preventing the transition from a purely experimental to a purely computational mode of protein structure determination. The first problem is that most computational methods need a large library of computed structures that span a large variety of protein fold families, while structural genomics initiatives have slowed in their ability to provide novel protein folds in recent years. The second problem is an unwillingness to trust computational models that have no experimental backing. In this paper we test a potential solution to these problems, which we call Probability Density Profile Analysis (PDPA). It utilizes unassigned residual dipolar coupling data, which are relatively cheap to acquire from NMR experiments.
Submitted 1 November, 2019;
originally announced November 2019.
-
Minimum Data Requirements and Supplemental Angle Constraints for Protein Structure Prediction with REDCRAFT
Authors:
E. Timko,
P. Shealy,
M. Bryson,
H. Valafar
Abstract:
One algorithm to predict protein structure is the residual dipolar coupling based residue assembly and filter tool (REDCRAFT). This algorithm exploits an exponential reduction of the search space of all possible structures to find a structure that best fits a set of experimental residual dipolar couplings. However, the minimum amount of data required to successfully determine a protein's structure using REDCRAFT has not been previously investigated. Here we explore the effect of reducing the amount of data used to fold proteins. Our goal is to reduce experimental data collection times while retaining the accuracy levels previously achieved with larger amounts of data. We also investigate incorporating a priori secondary structure information into REDCRAFT to improve its structure prediction ability.
Submitted 6 November, 2019; v1 submitted 31 October, 2019;
originally announced October 2019.