Showing 1–26 of 26 results for author: Tutubalina, E

  1. arXiv:2406.14347  [pdf, other]

    physics.chem-ph cs.LG stat.ML

    $\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials

    Authors: Kuzma Khrabrov, Anton Ber, Artem Tsypin, Konstantin Ushenin, Egor Rumiantsev, Alexander Telepov, Dmitry Protasov, Ilya Shenbin, Anton Alekseev, Mikhail Shirokikh, Sergey Nikolenko, Elena Tutubalina, Artur Kadurin

    Abstract: Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets fo…

    Submitted 20 June, 2024; originally announced June 2024.

  2. arXiv:2311.12410  [pdf, other]

    cs.CL cs.AI cs.LG q-bio.QM

    nach0: Multimodal Natural and Chemical Languages Foundation Model

    Authors: Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alán Aspuru-Guzik, Alex Zhavoronkov

    Abstract: Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthe…

    Submitted 2 May, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

    Comments: Accepted to Chemical Science Journal. Models are publicly available via https://huggingface.co/insilicomedicine/nach0_base and https://huggingface.co/insilicomedicine/nach0_large

    Journal ref: Chemical Science, 15(22), 8380-8389, 2024

  3. Data and models for stance and premise detection in COVID-19 tweets: insights from the Social Media Mining for Health (SMM4H) 2022 shared task

    Authors: Vera Davydova, Huabin Yang, Elena Tutubalina

    Abstract: The COVID-19 pandemic has sparked numerous discussions on social media platforms, with users sharing their views on topics such as mask-wearing and vaccination. To facilitate the evaluation of neural models for stance detection and premise classification, we organized the Social Media Mining for Health (SMM4H) 2022 Shared Task 2. This competition utilized manually annotated posts on three COVID-19…

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: This paper is under review in the Journal of Biomedical Informatics

    ACM Class: I.2.7; J.3

    Journal ref: Journal of Biomedical Informatics, 2023

  4. arXiv:2311.06295  [pdf, other]

    physics.chem-ph cs.LG

    Gradual Optimization Learning for Conformational Energy Minimization

    Authors: Artem Tsypin, Leonid Ugadiarov, Kuzma Khrabrov, Alexander Telepov, Egor Rumiantsev, Alexey Skrynnik, Aleksandr I. Panov, Dmitry Vetrov, Elena Tutubalina, Artur Kadurin

    Abstract: Molecular conformation optimization is crucial to computer-aided drug discovery and materials design. Traditional energy minimization techniques rely on iterative optimization methods that use molecular forces calculated by a physical simulator (oracle) as anti-gradients. However, this is a computationally expensive approach that requires many interactions with a physical simulator. One way to acc…

    Submitted 12 March, 2024; v1 submitted 5 November, 2023; originally announced November 2023.

    Comments: Published as a conference paper at ICLR2024 (Poster)

  5. arXiv:2210.13238  [pdf, other]

    q-bio.QM cs.CL cs.LG

    Multimodal Model with Text and Drug Embeddings for Adverse Drug Reaction Classification

    Authors: Andrey Sakhovskiy, Elena Tutubalina

    Abstract: In this paper, we focus on the classification of tweets as sources of potential signals for adverse drug effects (ADEs) or drug reactions (ADRs). Following the intuition that text and drug structure representations are complementary, we introduce a multimodal model with two components. These components are state-of-the-art BERT-based models for language understanding and molecular property predict…

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: This paper is accepted to Journal of Biomedical Informatics

    Journal ref: Journal of Biomedical Informatics, Volume 135, 2022, 104182, ISSN 1532-0464

  6. NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities

    Authors: Natalia Loukachevitch, Suresh Manandhar, Elina Baral, Igor Rozhkov, Pavel Braslavski, Vladimir Ivanov, Tatiana Batura, Elena Tutubalina

    Abstract: This paper describes NEREL-BIO -- an annotation scheme and corpus of PubMed abstracts in Russian and a smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. NEREL-BIO annotation scheme covers both general and biomedical domains making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested…

    Submitted 21 October, 2022; originally announced October 2022.

    Comments: Submitted to Bioinformatics (Publisher: Oxford University Press)

    Journal ref: Bioinformatics, Volume 39, Issue 4, April 2023, btad161

  7. Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

    Authors: Mark Rofin, Vladislav Mikhailov, Mikhail Florinskiy, Andrey Kravchenko, Elena Tutubalina, Tatiana Shavrina, Daniel Karabekyan, Ekaterina Artemova

    Abstract: The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular in…

    Submitted 12 February, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: To appear in EACL 2023 (main)

  8. arXiv:2206.12514  [pdf, other]

    cs.CL

    DetIE: Multilingual Open Information Extraction Inspired by Object Detection

    Authors: Michael Vasilkovsky, Anton Alekseev, Valentin Malykh, Ilya Shenbin, Elena Tutubalina, Dmitriy Salikhov, Mikhail Stepnov, Andrey Chertok, Sergey Nikolenko

    Abstract: State of the art neural methods for open information extraction (OpenIE) usually extract triplets (or tuples) iteratively in an autoregressive or predicate-based manner in order not to produce duplicates. In this work, we propose a different approach to the problem that can be equally or more successful. Namely, we present a novel single-pass method for OpenIE inspired by object detection algorith…

    Submitted 24 June, 2022; originally announced June 2022.

    Comments: Accepted to the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22)

  9. Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian

    Authors: Tatiana Shamardina, Vladislav Mikhailov, Daniil Chernianskii, Alena Fenogenova, Marat Saidov, Anastasiya Valeeva, Tatiana Shavrina, Ivan Smurov, Elena Tutubalina, Ekaterina Artemova

    Abstract: We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, text si…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted to Dialogue-22

  10. RuNNE-2022 Shared Task: Recognizing Nested Named Entities

    Authors: Ekaterina Artemova, Maxim Zmeev, Natalia Loukachevitch, Igor Rozhkov, Tatiana Batura, Vladimir Ivanov, Elena Tutubalina

    Abstract: The RuNNE Shared Task approaches the problem of nested named entity recognition. The annotation schema is designed in such a way, that an entity may partially overlap or even be nested into another entity. This way, the named entity "The Yermolova Theatre" of type "organization" houses another entity "Yermolova" of type "person". We adopt the Russian NEREL dataset for the RuNNE Shared Task. NEREL…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: To appear in Dialogue 2022

  11. Near-Zero-Shot Suggestion Mining with a Little Help from WordNet

    Authors: Anton Alekseev, Elena Tutubalina, Sejeong Kwon, Sergey Nikolenko

    Abstract: In this work, we explore the constructive side of online reviews: advice, tips, requests, and suggestions that users provide about goods, venues, services, and other items of interest. To reduce training costs and annotation efforts needed to build a classifier for a specific label set, we present and evaluate several entailment-based zero-shot approaches to suggestion classification in a label-fu…

    Submitted 25 November, 2021; originally announced November 2021.

    Comments: Accepted to the 10th International Conference on Analysis of Images, Social Networks and Texts (AIST 2021)

    Journal ref: Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham

  12. Selection of pseudo-annotated data for adverse drug reaction classification across drug groups

    Authors: Ilseyar Alimova, Elena Tutubalina

    Abstract: Automatic monitoring of adverse drug events (ADEs) or reactions (ADRs) is currently receiving significant attention from the biomedical community. In recent years, user-generated data on social media has become a valuable resource for this task. Neural models have achieved impressive performance on automatic text classification for ADR detection. Yet, training and evaluation of these methods are c…

    Submitted 24 November, 2021; originally announced November 2021.

    Comments: Accepted to AIST 2021

    Journal ref: Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham

  13. arXiv:2111.10974  [pdf, other]

    cs.CV cs.AI cs.CL

    Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture

    Authors: Daria Bakshandaeva, Denis Dimitrov, Vladimir Arkhipkin, Alex Shonenkov, Mark Potanin, Denis Karachev, Andrey Kuznetsov, Anton Voronov, Vera Davydova, Elena Tutubalina, Aleksandr Petiushko

    Abstract: Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called Fusion Brain, the first competition which is targeted to make the universal architecture which could process different modalities (in this case, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge combines the following specific tasks: Code2code Transl…

    Submitted 28 December, 2022; v1 submitted 21 November, 2021; originally announced November 2021.

  14. arXiv:2108.13112  [pdf, other]

    cs.CL

    NEREL: A Russian Dataset with Nested Named Entities, Relations and Events

    Authors: Natalia Loukachevitch, Ekaterina Artemova, Tatiana Batura, Pavel Braslavski, Ilia Denisov, Vladimir Ivanov, Suresh Manandhar, Alexander Pugachev, Elena Tutubalina

    Abstract: In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse le…

    Submitted 3 September, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

    Comments: accepted to RANLP

  15. arXiv:2101.09311  [pdf, ps, other]

    cs.CL cs.IR

    Drug and Disease Interpretation Learning with Biomedical Entity Representation Transformer

    Authors: Zulfat Miftahutdinov, Artur Kadurin, Roman Kudrin, Elena Tutubalina

    Abstract: Concept normalization in free-form texts is a crucial step in every text-mining pipeline. Neural architectures based on Bidirectional Encoder Representations from Transformers (BERT) have achieved state-of-the-art results in the biomedical domain. In the context of drug discovery and development, clinical trials are necessary to establish the efficacy and safety of drugs. We investigate the effect…

    Submitted 22 January, 2021; originally announced January 2021.

    Comments: Accepted to the 43rd European Conference on Information Retrieval (ECIR 2021)

  16. arXiv:2010.15939  [pdf, ps, other]

    cs.CL cs.CY

    RuREBus: a Case Study of Joint Named Entity Recognition and Relation Extraction from e-Government Domain

    Authors: Vitaly Ivanin, Ekaterina Artemova, Tatiana Batura, Vladimir Ivanov, Veronika Sarkisyan, Elena Tutubalina, Ivan Smurov

    Abstract: We show-case an application of information extraction methods, such as named entity recognition (NER) and relation extraction (RE) to a novel corpus, consisting of documents, issued by a state agency. The main challenges of this corpus are: 1) the annotation scheme differs greatly from the one used for the general domain corpora, and 2) the documents are written in a language other than English. U…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: to appear in AIST 2020

  17. arXiv:2007.00257  [pdf, other]

    cs.CL

    So What's the Plan? Mining Strategic Planning Documents

    Authors: Ekaterina Artemova, Tatiana Batura, Anna Golenkovskaya, Vitaly Ivanin, Vladimir Ivanov, Veronika Sarkisyan, Ivan Smurov, Elena Tutubalina

    Abstract: In this paper we present a corpus of Russian strategic planning documents, RuREBus. This project is grounded both from language technology and e-government perspectives. Not only new language sources and tools are being developed, but also their applications to e-government research. We demonstrate the pipeline for creating a text corpus from scratch. First, the annotation schema is designed. Next…

    Submitted 7 July, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: 15 pages, 3 figures, 5 tables. The paper has been accepted for the Fifth International Conference on Digital Transformation and Global Society (DTGS 2020)

  18. Improving unsupervised neural aspect extraction for online discussions using out-of-domain classification

    Authors: Anton Alekseev, Elena Tutubalina, Valentin Malykh, Sergey Nikolenko

    Abstract: Deep learning architectures based on self-attention have recently achieved and surpassed state of the art results in the task of unsupervised aspect extraction and topic modeling. While models such as neural attention-based aspect extraction (ABAE) have been successfully applied to user-generated texts, they are less coherent when applied to traditional data sources such as news articles and newsg…

    Submitted 17 June, 2020; originally announced June 2020.

    Comments: Journal of Intelligent & Fuzzy Systems, pre-press, https://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs179908

  19. A large-scale COVID-19 Twitter chatter dataset for open scientific research -- an international collaboration

    Authors: Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, Yuning Ding, Katya Artemova, Elena Tutubalina, Gerardo Chowell

    Abstract: As the COVID-19 pandemic continues its march around the world, an unprecedented amount of open data is being generated for genetics and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated in the front lines of the COV…

    Submitted 13 November, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

    Comments: 8 pages, 1 figure 2 table. Update: new version of paper with up-to-date statistics and new co-authors

  20. The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews

    Authors: Elena Tutubalina, Ilseyar Alimova, Zulfat Miftahutdinov, Andrey Sakhovskiy, Valentin Malykh, Sergey Nikolenko

    Abstract: The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. The corpus itself consists of two parts, the raw one and the labelled one. The raw part includes 1.4 million health-related user-generated texts collected from…

    Submitted 7 April, 2020; originally announced April 2020.

    Comments: 9 pages, 9 tables, 4 figures

    Journal ref: Bioinformatics, 2020

  21. RecVAE: a New Variational Autoencoder for Top-N Recommendations with Implicit Feedback

    Authors: Ilya Shenbin, Anton Alekseev, Elena Tutubalina, Valentin Malykh, Sergey I. Nikolenko

    Abstract: Recent research has shown the advantages of using autoencoders based on deep neural networks for collaborative filtering. In particular, the recently proposed Mult-VAE model, which used the multinomial likelihood variational autoencoders, has shown excellent results for top-N recommendations. In this work, we propose the Recommender VAE (RecVAE) model that originates from our research on regulariz…

    Submitted 23 December, 2019; originally announced December 2019.

    Comments: In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), February 3-7, 2020, Houston, TX, USA. ACM, New York, NY, USA, 9 pages

  22. arXiv:1908.07069  [pdf, other]

    cs.IR cs.CL cs.LG

    CommentsRadar: Dive into Unique Data on All Comments on the Web

    Authors: Sergey Nikolenko, Elena Tutubalina, Zulfat Miftahutdinov, Eugene Beloded

    Abstract: We introduce an entity-centric search engine CommentsRadar that pairs entity queries with articles and user opinions covering a wide range of topics from top commented sites. The engine aggregates articles and comments for these articles, extracts named entities, links them together and with knowledge base entries, performs sentiment analysis, and aggregates the results, aiming to mine for temporal trends…

    Submitted 16 August, 2019; originally announced August 2019.

  23. arXiv:1907.07972  [pdf, ps, other]

    cs.CL cs.IR cs.LG

    Deep Neural Models for Medical Concept Normalization in User-Generated Texts

    Authors: Zulfat Miftahutdinov, Elena Tutubalina

    Abstract: In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a health-related entity mention in a free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS). This is a challenging task since medical terminology is very different when coming from health care professionals or from the ge…

    Submitted 18 July, 2019; originally announced July 2019.

    Comments: This is preprint of the paper "Deep Neural Models for Medical Concept Normalization in User-Generated Texts" to be published at ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop

    Journal ref: ACL SRW 2019

  24. arXiv:1901.07829  [pdf, other]

    cs.CL cs.AI

    AspeRa: Aspect-based Rating Prediction Model

    Authors: Sergey I. Nikolenko, Elena Tutubalina, Valentin Malykh, Ilya Shenbin, Anton Alekseev

    Abstract: We propose a novel end-to-end Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items and at the same time discovers coherent aspects of reviews that can be used to explain predictions or profile users. The AspeRa model uses max-margin losses for joint item and user embedding learning and a dual-headed architecture; it significantly outperforms…

    Submitted 23 January, 2019; originally announced January 2019.

    Comments: accepted to ECIR 2019

  25. Sequence Learning with RNNs for Medical Concept Normalization in User-Generated Texts

    Authors: Elena Tutubalina, Zulfat Miftahutdinov, Sergey Nikolenko, Valentin Malykh

    Abstract: In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a disease mention in free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS). This task is challenging since medical terminology is very different when coming from health care professionals or from the general public in th…

    Submitted 29 November, 2018; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:1811.07216

    Report number: ML4H/2018/117

    Journal ref: Journal of Biomedical Informatics. - 2018. - Vol.84, Is.. - P.93-102

  26. arXiv:1712.01213  [pdf, ps, other]

    cs.CL cs.CY

    An Encoder-Decoder Model for ICD-10 Coding of Death Certificates

    Authors: Elena Tutubalina, Zulfat Miftahutdinov

    Abstract: Information extraction from textual documents such as hospital records and health-related user discussions has become a topic of intense interest. The task of medical concept coding is to map a variable length text to medical concepts and corresponding classification codes in some external system or ontology. In this work, we utilize recurrent neural networks to automatically assign ICD-10 codes to…

    Submitted 4 December, 2017; originally announced December 2017.

    Journal ref: KFU at CLEF eHealth 2017 Task 1: ICD-10 Coding of English Death Certificates with Recurrent Neural Networks, CEUR Workshop Proceedings, Vol 1866, 2017