-
$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Authors:
Kuzma Khrabrov,
Anton Ber,
Artem Tsypin,
Konstantin Ushenin,
Egor Rumiantsev,
Alexander Telepov,
Dmitry Protasov,
Ilya Shenbin,
Anton Alekseev,
Mikhail Shirokikh,
Sergey Nikolenko,
Elena Tutubalina,
Artur Kadurin
Abstract:
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times as many conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($ω$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
Submitted 20 June, 2024;
originally announced June 2024.
-
Benefits of mirror weight symmetry for 3D mesh segmentation in biomedical applications
Authors:
Vladislav Dordiuk,
Maksim Dzhigil,
Konstantin Ushenin
Abstract:
3D mesh segmentation is an important task with many biomedical applications. The human body has bilateral symmetry and some variation in organ positions, so rotation- and inversion-invariant layers can be expected to benefit convolutional neural networks that perform biomedical segmentation. In this study, we show the impact of weight symmetry in neural networks that perform 3D mesh segmentation. We analyze the problem of 3D mesh segmentation for pathological vessel structures (aneurysms) and conventional anatomical structures (endocardium and epicardium of ventricles). Local geometrical features are encoded as samples of the signed distance function, and the neural network makes a prediction for each mesh node. We show that weight symmetry improves accuracy by 1 to 3% and allows the number of trainable parameters to be reduced by up to 8 times without loss of performance, provided the neural network has at least three convolutional layers. This also holds for very small training sets.
Submitted 6 November, 2023; v1 submitted 29 September, 2023;
originally announced September 2023.
-
Compressor-Based Classification for Atrial Fibrillation Detection
Authors:
Nikita Markov,
Konstantin Ushenin,
Yakov Bozhko,
Olga Solovyova
Abstract:
Atrial fibrillation (AF) is one of the most common arrhythmias with challenging public health implications. Therefore, automatic detection of AF episodes on ECG is one of the essential tasks in biomedical engineering. In this paper, we applied the recently introduced method of compressor-based text classification with the gzip algorithm to AF detection (binary classification between heart rhythms). We investigated the normalized compression distance applied to RR-interval and $Δ$RR-interval sequences (a $Δ$RR-interval is the difference between subsequent RR-intervals). We analyzed the configuration of the k-nearest neighbour classifier, the optimal window length, and the choice of data types for compression. Training on the full MIT-BIH Atrial Fibrillation database, we achieved good classification results, close to the best specialized AF detection algorithms (avg. sensitivity = 97.1%, avg. specificity = 91.7%, best sensitivity of 99.8%, best specificity of 97.6% with fivefold cross-validation). In addition, we evaluated the classification performance in the few-shot learning setting. Our results suggest that gzip compression-based classification, originally proposed for texts, is suitable for biomedical data and quantized continuous stochastic sequences in general.
Submitted 2 October, 2023; v1 submitted 25 August, 2023;
originally announced August 2023.
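The compressor-based classification described in the abstract above can be illustrated with a minimal sketch. It computes the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the gzip-compressed length, and uses it in a k-nearest-neighbour vote. The function names, the 20 ms quantization step, the space-separated text encoding, and the default k are assumptions made for illustration; the abstract does not state the configuration the authors actually used.

```python
import gzip
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input, used as C(.)."""
    return len(gzip.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    # Joining with a space is an assumption borrowed from text classification.
    cxy = compressed_size(x + b" " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def delta_rr(rr_ms):
    """Differences between subsequent RR-intervals (the delta-RR sequence)."""
    return [b - a for a, b in zip(rr_ms, rr_ms[1:])]


def encode(values, step_ms=20):
    """Quantize intervals and render them as text so gzip can compress them.

    The 20 ms step is a hypothetical choice, not a value from the paper.
    """
    return " ".join(str(round(v / step_ms)) for v in values).encode("ascii")


def knn_predict(query: bytes, train, k=3):
    """Majority label among the k training windows closest to query by NCD.

    train is a list of (encoded_window, label) pairs.
    """
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

In use, each fixed-length window of RR-intervals (or of `delta_rr` output) would be passed through `encode`, and `knn_predict` would compare it with labelled training windows. No training step is involved beyond storing those windows, which is why the approach fits the few-shot setting the abstract mentions.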
-
Computational anatomy atlas using multilayer perceptron with Lipschitz regularization
Authors:
Konstantin Ushenin,
Maksim Dzhigil,
Vladislav Dordiuk
Abstract:
A computational anatomy atlas is a set of internal organ geometries. It is based on data from real patients and complemented with virtual cases generated by a numerical approach. Atlases are in demand in computational physiology, especially in cardiological and neurophysiological applications. Usually, atlas generation uses an explicit object representation, such as voxel models or surface meshes. In this paper, we propose a method of atlas generation using an implicit representation of 3D objects. Our approach has two key stages. The first stage converts voxel models of segmented organs to implicit form using a standard multilayer perceptron. This stage smooths the model and reduces memory consumption. The second stage uses a multilayer perceptron with Lipschitz regularization. This neural network provides a smooth transition between implicitly defined 3D geometries. Our work shows examples of models of the left and right human ventricles. All code and data for this work are open.
Submitted 6 November, 2022;
originally announced November 2022.
-
Statistical model for describing heart rate variability in normal rhythm and atrial fibrillation
Authors:
Nikita Markov,
Ilya Kotov,
Konstantin Ushenin,
Yakov Bozhko
Abstract:
Heart rate variability (HRV) indices describe properties of interbeat intervals in the electrocardiogram (ECG). Usually, HRV is measured exclusively in normal sinus rhythm (NSR), excluding any form of paroxysmal rhythm. Atrial fibrillation (AF) is the most widespread cardiac arrhythmia in the human population. Usually, such abnormal rhythm is not analyzed and is assumed to be chaotic and unpredictable. Nonetheless, ranges of HRV indices differ between patients with AF, yet the physiological characteristics that influence them are poorly understood. In this study, we propose a statistical model that describes the relationship between HRV indices in NSR and AF. The model is based on the Mahalanobis distance, the k-nearest neighbour approach, and the multivariate normal distribution framework. Verification of the method was performed using 10-minute intervals of NSR and AF that were extracted from long-term Holter ECGs. For validation, we used the Bhattacharyya distance and the two-sample Kolmogorov-Smirnov test in a k-fold procedure. The model is able to predict at least 7 HRV indices with high precision.
Submitted 17 July, 2022;
originally announced July 2022.
-
Natural language processing for clusterization of genes according to their functions
Authors:
Vladislav Dordiuk,
Ekaterina Demicheva,
Fernando Polanco Espino,
Konstantin Ushenin
Abstract:
There are hundreds of methods for the analysis of data obtained in mRNA sequencing. Most of them focus on a small number of genes. In this study, we propose an approach that reduces the analysis of several thousand genes to the analysis of several clusters. The list of genes is enriched with information from open databases. Then, the descriptions are encoded as vectors using a pretrained language model (BERT) and some text processing approaches. The encoded gene functions then pass through dimensionality reduction and clusterization. To find the most efficient pipeline, we analyzed 180 pipeline variants with different methods in the major pipeline steps. The performance was evaluated with clusterization indexes and expert review of the results.
Submitted 17 July, 2022;
originally announced July 2022.
-
On uniqueness theorems for the inverse problem of Electrocardiography in the Sobolev spaces
Authors:
Vitaly Kalinin,
Alexander Shlapunov,
Konstantin Ushenin
Abstract:
We consider a mathematical model related to reconstruction of cardiac electrical activity from ECG measurements on the body surface. An application of recent developments in solving boundary value problems for elliptic and parabolic equations in Sobolev type spaces allows us to obtain uniqueness theorems for the model. The obtained results can be used as a sound basis for creating numerical methods for non-invasive mapping of the heart.
Submitted 28 September, 2022; v1 submitted 6 November, 2021;
originally announced November 2021.
-
Anomaly Detection in Image Datasets Using Convolutional Neural Networks, Center Loss, and Mahalanobis Distance
Authors:
Garnik Vareldzhan,
Kirill Yurkov,
Konstantin Ushenin
Abstract:
User activities generate a significant number of poor-quality or irrelevant images and data vectors that cannot be processed in the main data processing pipeline or included in the training dataset. Such samples can be found through manual analysis by an expert or with anomaly detection algorithms. There are several formal definitions of anomalous samples. For neural networks, anomalies are usually defined as out-of-distribution samples. This work proposes methods for supervised and semi-supervised detection of out-of-distribution samples in image datasets. Our approach extends a typical neural network that solves the image classification problem. Thus, after extension, a single neural network can solve the image classification and anomaly detection problems simultaneously. The proposed methods are based on the center loss and its effect on the deep feature distribution in the last hidden layer of the neural network. This paper provides an analysis of the proposed methods for LeNet and EfficientNet-B0 on the MNIST and ImageNet-30 datasets.
Submitted 13 April, 2021;
originally announced April 2021.
-
Effects of lead position, cardiac rhythm variation and drug-induced QT prolongation on performance of machine learning methods for ECG processing
Authors:
Marat Bogdanov,
Salim Baigildin,
Aygul Fabarisova,
Konstantin Ushenin,
Olga Solovyova
Abstract:
Machine learning shows great performance in various problems of electrocardiography (ECG) signal analysis. However, collecting a dataset for biomedical engineering is a very difficult task. Any dataset for ECG processing contains from 100 to 10,000 times fewer cases than datasets for image or text analysis. This issue is especially important because of physiological phenomena that can significantly change the morphology of heartbeats in ECG signals. In this preliminary study, we analyze the effects of lead choice from standard ECG recordings, ECG variation over 24 hours, and the effects of QT-prolonging agents on the performance of machine learning methods for ECG processing. We choose the problem of subject identification for analysis because this problem can be posed for almost any available ECG dataset. In the discussion, we compare our findings with observations from other works that use machine learning for ECG processing with different problem statements. Our results show the importance of enriching the training dataset with ECG signals acquired in specific physiological conditions to obtain good ECG processing performance in real applications.
Submitted 16 February, 2020; v1 submitted 10 December, 2019;
originally announced December 2019.
-
Phase mapping for cardiac unipolar electrograms with neural network instead of phase transformation
Authors:
Konstantin Ushenin,
Tatyana Nesterova,
Dmitry Shmarko,
Vladimir Sholokhov
Abstract:
Phase mapping is an approach to processing electrogram signals recorded from the surface of cardiac tissue. The main concept of phase mapping is the application of a phase transformation in order to obtain signals with useful properties. In our study, we propose to use a simple sawtooth signal instead of a phase signal for processing electrogram data and building phase maps. We call a transformation that provides this signal a phase-like transformation (PLT). The PLT is defined via a convolutional neural network that is trained on a dataset from computer models of cardiac tissue electrophysiology. The proposed approaches were validated on data from a detailed personalized model of human torso electrophysiology. This paper includes a visualization of the phase map based on PLT and shows the robustness of the proposed approaches in the analysis of the complex non-stationary periodic activity of excitable cardiac tissue.
Submitted 13 April, 2021; v1 submitted 21 November, 2019;
originally announced November 2019.
-
Comparison of UNet, ENet, and BoxENet for Segmentation of Mast Cells in Scans of Histological Slices
Authors:
Alexander Karimov,
Artem Razumov,
Ruslana Manbatchurina,
Ksenia Simonova,
Irina Donets,
Anastasia Vlasova,
Yulia Khramtsova,
Konstantin Ushenin
Abstract:
Deep neural networks show high accuracy in the problem of semantic and instance segmentation of biomedical data. However, this approach is computationally expensive. The computational cost may be reduced by simplifying the network after training or by choosing a suitable architecture that provides segmentation with less accuracy but does it much faster. In the present study, we analyzed the accuracy and performance of the UNet and ENet architectures for the problem of semantic image segmentation. In addition, we investigated the ENet architecture with some convolution layers replaced by box-convolution layers. The analysis was performed on an original dataset consisting of histology slices with mast cells. These cells provide a region for segmentation with different types of borders, which vary from clearly visible to ragged. ENet was less accurate than UNet by only about 1-2%, but ENet was 8-15 times faster than UNet.
Submitted 22 November, 2019; v1 submitted 15 September, 2019;
originally announced September 2019.