-
Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge
Authors:
Julien Delile,
Srayanta Mukherjee,
Anton Van Pamel,
Leonid Zhukov
Abstract:
Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet, LLMs are prone to highlight the most frequently seen pieces of information from the training set and to neglect the rare ones. In the field of biomedical research, the latest discoveries are key to academic and industrial actors and are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities, e.g., drugs, genes, diseases, with LLMs becomes a challenge of capturing the long-tail knowledge of the biomedical scientific production. To overcome this challenge, Retrieval Augmented Generation (RAG) has been proposed to alleviate some of the shortcomings of LLMs by augmenting the prompts with context retrieved from external datasets. RAG methods typically select the context via maximum similarity search over text embeddings. In this study, we show that RAG methods leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel information-retrieval method that leverages a knowledge graph to downsample these clusters and mitigate the information overload problem. Its retrieval performance is about twice as good as that of embedding-similarity alternatives on both precision and recall. Finally, we demonstrate that both embedding similarity and knowledge graph retrieval methods can be advantageously combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
Submitted 19 February, 2024;
originally announced February 2024.
-
SensorSCAN: Self-Supervised Learning and Deep Clustering for Fault Diagnosis in Chemical Processes
Authors:
Maksim Golyadkin,
Vitaliy Pozdnyakov,
Leonid Zhukov,
Ilya Makarov
Abstract:
Modern industrial facilities generate large volumes of raw sensor data during the production process. This data is used to monitor and control the processes and can be analyzed to detect and predict process abnormalities. Typically, the data has to be annotated by experts in order to be used in predictive modeling. However, manual annotation of large amounts of data can be difficult in industrial settings.
In this paper, we propose SensorSCAN, a novel method for unsupervised fault detection and diagnosis, designed for industrial chemical process monitoring. We demonstrate our model's performance on two publicly available datasets of the Tennessee Eastman Process with various faults. The results show that our method significantly outperforms existing approaches (+0.2-0.3 TPR for a fixed FPR) and effectively detects most of the process faults without expert annotation. Moreover, we show that the model fine-tuned on a small fraction of labeled data nearly reaches the performance of a SOTA model trained on the full dataset. We also demonstrate that our method is suitable for real-world applications where the number of faults is not known in advance. The code is available at https://github.com/AIRI-Institute/sensorscan.
Submitted 2 November, 2023; v1 submitted 17 August, 2022;
originally announced August 2022.
-
New drugs and stock market: how to predict pharma market reaction to clinical trial announcements
Authors:
Semen Budennyy,
Alexey Kazakov,
Elizaveta Kovtun,
Leonid Zhukov
Abstract:
Pharmaceutical companies operate in a strictly regulated and highly risky environment in which a single slip can lead to serious financial implications. Accordingly, the announcements of clinical trial results tend to determine the future course of events, hence being closely monitored by the public. In this work, we provide statistical evidence for the result promulgation influence on the public pharma market value. Whereas most works focus on retrospective impact analysis, the present research aims to predict the numerical values of announcement-induced changes in stock prices. For this purpose, we develop a pipeline that includes a BERT-based model for extracting sentiment polarity of announcements, a Temporal Fusion Transformer for forecasting the expected return, a graph convolution network for capturing event relationships, and gradient boosting for predicting the price change. The challenge of the problem lies in inherently different patterns of responses to positive and negative announcements, reflected in a stronger and more pronounced reaction to the negative news. Moreover, such phenomenon as the drop in stocks after the positive announcements affirms the counterintuitiveness of the price behavior. Importantly, we discover two crucial factors that should be considered while working within a predictive framework. The first factor is the drug portfolio size of the company, indicating the greater susceptibility to an announcement in the case of small drug diversification. The second one is the network effect of the events related to the same company or nosology. All findings and insights are gained on the basis of one of the biggest FDA (the Food and Drug Administration) announcement datasets, consisting of 5436 clinical trial announcements from 681 companies over the last five years.
Submitted 16 August, 2022; v1 submitted 11 August, 2022;
originally announced August 2022.
-
Anomaly segmentation model for defects detection in electroluminescence images of heterojunction solar cells
Authors:
Alexey Korovin,
Artem Vasilyev,
Fedor Egorov,
Dmitry Saykin,
Evgeny Terukov,
Igor Shakhray,
Leonid Zhukov,
Semen Budennyy
Abstract:
Efficient defect detection in solar cell manufacturing is crucial for stable green energy technology manufacturing. This paper presents a deep-learning-based automatic detection model, SeMaCNN, for classification and semantic segmentation of electroluminescent images for solar cell quality evaluation and anomaly detection. The core of the model is an anomaly detection algorithm based on Mahalanobis distance that can be trained in a semi-supervised manner on imbalanced data with a small number of digital electroluminescence images with relevant defects. This is particularly valuable for prompt model integration into the industrial landscape. The model has been trained with the on-plant collected dataset consisting of 68 748 electroluminescent images of heterojunction solar cells with a busbar grid. Our model achieves an accuracy of 92.5%, an F1 score of 95.8%, a recall of 94.8%, and a precision of 96.9% within the validation subset consisting of 1049 manually annotated images. The model was also tested on the open ELPV dataset and demonstrates stable performance with an accuracy of 94.6% and an F1 score of 91.1%. The SeMaCNN model demonstrates a good balance between its performance and computational costs, which makes it applicable for integration into quality control systems of solar cell manufacturing.
Submitted 1 October, 2022; v1 submitted 11 August, 2022;
originally announced August 2022.
-
Eco2AI: carbon emissions tracking of machine learning models as the first step towards sustainable AI
Authors:
Semen Budennyy,
Vladimir Lazarev,
Nikita Zakharenko,
Alexey Korovin,
Olga Plosskaya,
Denis Dimitrov,
Vladimir Arkhipkin,
Ivan Oseledets,
Ivan Barsola,
Ilya Egorov,
Aleksandra Kosterina,
Leonid Zhukov
Abstract:
The size and complexity of deep neural networks continue to grow exponentially, significantly increasing energy consumption for training and inference by these models. We introduce an open-source package eco2AI to help data scientists and researchers to track energy consumption and equivalent CO2 emissions of their models in a straightforward way. In eco2AI we put emphasis on accuracy of energy consumption tracking and correct regional CO2 emissions accounting. We encourage research community to search for new optimal Artificial Intelligence (AI) architectures with a lower computational cost. The motivation also comes from the concept of AI-based green house gases sequestrating cycle with both Sustainable AI and Green AI pathways.
Submitted 3 August, 2022; v1 submitted 31 July, 2022;
originally announced August 2022.
-
Towards Computationally Feasible Deep Active Learning
Authors:
Akim Tsvigun,
Artem Shelmanov,
Gleb Kuzmin,
Leonid Sanochkin,
Daniil Larionov,
Gleb Gusev,
Manvel Avetisian,
Leonid Zhukov
Abstract:
Active learning (AL) is a prominent technique for reducing the annotation effort required for training machine learning models. Deep learning offers a solution for several essential obstacles to deploying AL in practice but introduces many others. One of such problems is the excessive computational resources required to train an acquisition model and estimate its uncertainty on instances in the unlabeled pool. We propose two techniques that tackle this issue for text classification and tagging tasks, offering a substantial reduction of AL iteration duration and the computational overhead introduced by deep acquisition models in AL. We also demonstrate that our algorithm that leverages pseudo-labeling and distilled models overcomes one of the essential obstacles revealed previously in the literature. Namely, it was shown that due to differences between an acquisition model used to select instances during AL and a successor model trained on the labeled data, the benefits of AL can diminish. We show that our algorithm, despite using a smaller and faster acquisition model, is capable of training a more expressive successor model with higher performance.
Submitted 7 May, 2022;
originally announced May 2022.
-
Advanced service data provisioning in ROF-based mobile backhauls/fronthauls
Authors:
Mikhail E. Belkin,
Leonid Zhukov,
Alexander S. Sigov
Abstract:
A new cost-efficient concept to realize a real-time monitoring of quality-of-service metrics and other service data in 5G and beyond access network using a separate return channel based on a vertical cavity surface emitting laser in the optical injection locked mode that simultaneously operates as an optical transmitter and as a resonant cavity enhanced photodetector, is proposed and discussed. The feasibility and efficiency of the proposed approach are confirmed by a proof-of-concept experiment when optically transceiving high-speed digital signal with multi-position quadrature amplitude modulation of a radio-frequency carrier.
Submitted 31 January, 2022;
originally announced February 2022.
-
Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough
Authors:
Alexander Ponomarchuk,
Ilya Burenko,
Elian Malkin,
Ivan Nazarov,
Vladimir Kokh,
Manvel Avetisian,
Leonid Zhukov
Abstract:
The COVID-19 pandemic created a significant interest and demand for infection detection and monitoring solutions. In this paper we propose a machine learning method to quickly triage COVID-19 using recordings made on consumer devices. The approach combines signal processing methods with fine-tuned deep learning networks and provides methods for signal denoising, cough detection and classification. We have also developed and deployed a mobile application that uses symptoms checker together with voice, breath and cough signals to detect COVID-19 infection. The application showed robust performance on both open sourced datasets and on the noisy data collected during beta testing by the end users.
Submitted 10 January, 2022; v1 submitted 12 July, 2021;
originally announced July 2021.
-
Kernel classification of connectomes based on earth mover's distance between graph spectra
Authors:
Yulia Dodonova,
Mikhail Belyaev,
Anna Tkachev,
Dmitry Petrov,
Leonid Zhukov
Abstract:
In this paper, we tackle the problem of predicting phenotypes from structural connectomes. We propose that normalized Laplacian spectra can capture structural properties of brain networks, and hence graph spectral distributions are useful for the task of connectome-based classification. We introduce a kernel that is based on earth mover's distance (EMD) between spectral distributions of brain networks. We assess the performance of an SVM classifier with the proposed kernel for the task of classifying autism spectrum disorder versus typical development based on a publicly available dataset. Classification quality (area under the ROC-curve) obtained with the EMD-based kernel on spectral distributions is 0.71, which is higher than that based on simpler graph embedding methods.
Submitted 27 November, 2016;
originally announced November 2016.
-
Learning Alternative Name Spellings
Authors:
Jeffrey Sukharev,
Leonid Zhukov,
Alexandrin Popescul
Abstract:
Name matching is a key component of systems for entity resolution or record linkage. Alternative spellings of the same names are a common occurrence in many applications. We use the largest collection of genealogy person records in the world together with user search query logs to build name matching models. The procedure for building a crowd-sourced training set is outlined together with the presentation of our method. We cast the problem of learning alternative spellings as a machine translation problem at the character level. We use information retrieval evaluation methodology to show that, on our data, this method substantially outperforms a number of standard, well-known phonetic and string similarity methods in terms of precision and recall. Additionally, we rigorously compare the performance of the standard methods with each other. Our result can lead to a significant practical impact in entity resolution applications.
Submitted 7 May, 2014;
originally announced May 2014.
-
Chiral electromagnetic waves at the boundary of optical isomers: Quantum Cotton-Mouton effect
Authors:
L. E. Zhukov,
M. E. Raikh
Abstract:
We demonstrate that the boundary of two optical isomers with opposite directions of the gyration vectors (both parallel to boundary) can support propagation of electromagnetic wave in the direction perpendicular to the gyration axes (Cotton-Mouton geometry). The components of electromagnetic field in this wave decay exponentially into both media. The characteristic decay length is of the order of the Faraday rotation length for the propagation along the gyration axis. The remarkable property of the boundary wave is its chirality. Namely, the wave can propagate only in one direction determined by the relative sign of non-diagonal components of the dielectric tensor in contacting media. We find the dispersion law of the boundary wave for the cases of abrupt and smooth boundaries. We also study the effect of asymmetry between the contacting media on the boundary wave and generalize the result to the case of two parallel boundaries. Finally we consider the arrangement when the boundaries form a random network. We argue that at a point, when this network percolates, the corresponding boundary waves undergo quantum delocalization transition, similar to the quantum Hall transition.
Submitted 26 December, 1998;
originally announced December 1998.
-
Orthogonal localized wave functions of an electron in a magnetic field
Authors:
E. I. Rashba,
L. E. Zhukov,
A. L. Efros
Abstract:
We prove the existence of a set of two-scale magnetic Wannier orbitals w_{m,n}(r) on the infinite plane. The quantum numbers of these states are the positions {m,n} of their centers, which form a von Neumann lattice. The function w_{00} localized at the origin has a nearly Gaussian shape of exp(-r^2/4l^2)/sqrt(2Pi) for r < sqrt(2Pi)l, where l is the magnetic length. This region makes a dominating contribution to the normalization integral. Outside this region, the function w_{00}(r) is small, oscillates, and falls off with the Thouless critical exponent for magnetic orbitals, r^(-2). These functions form a convenient basis for many-electron problems.
Submitted 4 June, 1996; v1 submitted 5 March, 1996;
originally announced March 1996.
-
Two-electron state in a disordered 2D island: pairing caused by the Coulomb repulsion
Authors:
M. E. Raikh,
L. I. Glazman,
L. E. Zhukov
Abstract:
We show the existence of bound two-electron states in an almost depleted two-dimensional island. These two-electron states are carried by special compact configurations of four single-electron levels. The existence of these states does not require phonon mediation, and is facilitated by the disorder-induced potential relief and by the electron-electron repulsion only. The density of two-electron states is estimated and their evolution with the magnetic field is discussed.
Submitted 17 December, 1995;
originally announced December 1995.