-
Anomalous Behavior of the Dielectric and Pyroelectric Responses of Ferroelectric Fine-Grained Ceramics
Authors:
Oleksandr S. Pylypchuk,
Serhii E. Ivanchenko,
Mykola Y. Yelisieiev,
Andrii S. Nikolenko,
Victor I. Styopkin,
Bohdan Pokhylko,
Vladyslav Kushnir,
Denis O. Stetsenko,
Oleksii Bereznykov,
Oksana V. Leschenko,
Eugene A. Eliseev,
Vladimir N. Poroshin,
Nicholas V. Morozovsky,
Victor V. Vainberg,
Anna N. Morozovska
Abstract:
We revealed the anomalous temperature behavior of the giant dielectric permittivity and unusual frequency dependences of the pyroelectric response of the fine-grained ceramics prepared by the spark plasma sintering of the ferroelectric BaTiO3 nanoparticles. The temperature dependences of the electro-resistivity indicate a frequency-dependent transition in the electro-transport mechanisms between the lower and higher conductivity states, accompanied by a maximum in the temperature dependence of the loss angle tangent. The pyroelectric thermal-wave probing revealed the existence of a spatially inhomogeneous counter-polarized ferroelectric state at the opposite surfaces of the ceramic sample. We described the anomalous temperature behavior of the giant dielectric response and losses using the core-shell model for ceramic grains and a modified Maxwell-Wagner approach. We assume that core shells and grain boundaries, which contain a high concentration of space charge carriers due to the presence of graphite inclusions in the inter-grain space, can effectively screen weakly conductive ferroelectric grain cores. The superparaelectric-like state with a giant dielectric response can appear in the paraelectric shells and inter-grain space due to the step-like thermal activation of localized polarons in these spatial regions, in agreement with the experimentally observed frequency-dependent transition of the electro-transport mechanism. The obtained results can be the key to describing the complex electrophysical properties inherent to strongly inhomogeneous media with electrically coupled insulating ferroelectric nanoregions and semiconducting superparaelectric-like regions.
Submitted 1 July, 2024;
originally announced July 2024.
-
Improving Interpretability and Robustness for the Detection of AI-Generated Images
Authors:
Tatiana Gaintseva,
Laida Kushnareva,
German Magai,
Irina Piontkovskaya,
Sergey Nikolenko,
Martin Benning,
Serguei Barannikov,
Gregory Slabaugh
Abstract:
With growing abilities of generative models, artificial content detection becomes an increasingly important and difficult task. However, all popular approaches to this problem suffer from poor generalization across domains and generative models. In this work, we focus on the robustness of AI-generated image (AIGI) detectors. We analyze existing state-of-the-art AIGI detection methods based on frozen CLIP embeddings and show how to interpret them, shedding light on how images produced by various AI generators differ from real ones. Next we propose two ways to improve robustness: based on removing harmful components of the embedding vector and based on selecting the best performing attention heads in the image encoder model. Our methods increase the mean out-of-distribution (OOD) classification score by up to 6% for cross-model transfer. We also propose a new dataset for AIGI detection and use it in our evaluation; we believe this dataset will help boost further research. The dataset and code are provided as a supplement.
Submitted 21 June, 2024;
originally announced June 2024.
-
$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Authors:
Kuzma Khrabrov,
Anton Ber,
Artem Tsypin,
Konstantin Ushenin,
Egor Rumiantsev,
Alexander Telepov,
Dmitry Protasov,
Ilya Shenbin,
Anton Alekseev,
Mikhail Shirokikh,
Sergey Nikolenko,
Elena Tutubalina,
Artur Kadurin
Abstract:
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($ω$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
Submitted 20 June, 2024;
originally announced June 2024.
-
ImplicitSLIM and How it Improves Embedding-based Collaborative Filtering
Authors:
Ilya Shenbin,
Sergey Nikolenko
Abstract:
We present ImplicitSLIM, a novel unsupervised learning approach for sparse high-dimensional data, with applications to collaborative filtering. Sparse linear methods (SLIM) and their variations show outstanding performance, but they are memory-intensive and hard to scale. ImplicitSLIM improves embedding-based models by extracting embeddings from SLIM-like models in a computationally cheap and memory-efficient way, without explicit learning of heavy SLIM-like models. We show that ImplicitSLIM improves performance and speeds up convergence for both state of the art and classical collaborative filtering methods. The source code for ImplicitSLIM, related models, and applications is available at https://github.com/ilya-shenbin/ImplicitSLIM.
Submitted 31 May, 2024;
originally announced June 2024.
-
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule
Authors:
Andrey Bout,
Alexander Podolskiy,
Sergey Nikolenko,
Irina Piontkovskaya
Abstract:
Progress in neural grammatical error correction (GEC) is hindered by the lack of annotated training data. Sufficient amounts of high-quality manually annotated data are not available, so recent research has relied on generating synthetic data, pretraining on it, and then fine-tuning on real datasets; performance gains have been achieved either by ensembling or by using huge pretrained models such as XXL-T5 as the backbone. In this work, we explore an orthogonal direction: how to use available data more efficiently. First, we propose auxiliary tasks that exploit the alignment between the original and corrected sentences, such as predicting a sequence of corrections. We formulate each task as a sequence-to-sequence problem and perform multi-task training. Second, we discover that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance, so we set out to find the best training schedule. Together, these two ideas lead to significant improvements, producing results that improve state of the art with much smaller models; in particular, we outperform the best models based on T5-XXL (11B parameters) with a BART-based model (400M parameters).
Submitted 20 November, 2023;
originally announced November 2023.
-
AI-generated text boundary detection with RoFT
Authors:
Laida Kushnareva,
Tatiana Gaintseva,
German Magai,
Serguei Barannikov,
Dmitry Abulkhanov,
Kristian Kuznetsov,
Eduard Tulchinskii,
Irina Piontkovskaya,
Sergey Nikolenko
Abstract:
Due to the rapid development of large language models, people increasingly often encounter texts that may start as written by a human but continue as machine-generated. Detecting the boundary between human-written and machine-generated parts of such texts is a challenging problem that has not received much attention in the literature. We attempt to bridge this gap and examine several ways to adapt state-of-the-art artificial text detection classifiers to the boundary detection setting. We push all detectors to their limits, using the Real or Fake text benchmark that contains short texts on several topics and includes generations of various language models. We use this diversity to deeply examine the robustness of all detectors in cross-domain and cross-model settings to provide baselines and insights for future research. In particular, we find that perplexity-based approaches to boundary detection tend to be more robust to peculiarities of domain-specific data than supervised fine-tuning of the RoBERTa model; we also find which features of the text confuse boundary detection algorithms and negatively influence their performance in cross-domain settings.
Submitted 2 April, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding
Authors:
Konstantin Yakovlev,
Alexander Podolskiy,
Andrey Bout,
Sergey Nikolenko,
Irina Piontkovskaya
Abstract:
Grammatical error correction (GEC) is an important NLP task that is currently usually solved with autoregressive sequence-to-sequence models. However, approaches of this class are inherently slow due to one-by-one token generation, so non-autoregressive alternatives are needed. In this work, we propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network that outputs a self-attention weight matrix that can be used in beam search to find the best permutation of input tokens (with auxiliary {ins} tokens) and a decoder network based on a step-unrolled denoising autoencoder that fills in specific tokens. This allows us to find the token permutation after only one forward pass of the permutation network, avoiding autoregressive constructions. We show that the resulting network improves over previously known non-autoregressive methods for GEC and reaches the level of autoregressive methods that do not use language-specific synthetic data generation methods. Our results are supported by a comprehensive experimental validation on the CoNLL-2014 and Write&Improve+LOCNESS datasets and an extensive ablation study that supports our architectural and algorithmic choices.
Submitted 14 November, 2023;
originally announced November 2023.
-
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
Authors:
Konstantin Yakovlev,
Gregory Polyakov,
Ilseyar Alimova,
Alexander Podolskiy,
Andrey Bout,
Sergey Nikolenko,
Irina Piontkovskaya
Abstract:
A recent trend in multimodal retrieval is related to postprocessing test set results via the dual-softmax loss (DSL). While this approach can bring significant improvements, it usually presumes that an entire matrix of test samples is available as DSL input. This work introduces a new postprocessing approach based on Sinkhorn transformations that outperforms DSL. Further, we propose a new postprocessing setting that does not require access to multiple test queries. We show that our approach can significantly improve the results of state of the art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus achieving a new state-of-the-art on several standard text-video retrieval datasets both with access to the entire test set and in the single-query setting.
Submitted 14 November, 2023;
originally announced November 2023.
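The abstract above names Sinkhorn transformations as a postprocessing step for a text-video similarity matrix. The sketch below shows only the generic Sinkhorn-Knopp idea (alternating row and column normalization of an exponentiated similarity matrix); it is not the authors' procedure, and the function name, `temperature`, and `n_iters` are illustrative assumptions.

```python
import math


def sinkhorn_normalize(sim, temperature=0.1, n_iters=100):
    """Generic Sinkhorn-Knopp sketch (hypothetical helper, not the paper's code).

    sim: list of rows, sim[i][j] = similarity of query i and candidate j.
    Returns a matrix whose rows and columns each sum to roughly 1.
    """
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in sim)
    k = [[math.exp((s - top) / temperature) for s in row] for row in sim]
    n_rows, n_cols = len(k), len(k[0])
    for _ in range(n_iters):
        # Row step: make every query's scores sum to 1.
        for i in range(n_rows):
            total = sum(k[i])
            k[i] = [v / total for v in k[i]]
        # Column step: make every candidate's scores sum to 1.
        for j in range(n_cols):
            total = sum(k[i][j] for i in range(n_rows))
            for i in range(n_rows):
                k[i][j] /= total
    return k


if __name__ == "__main__":
    # Toy similarity matrix (made-up numbers): queries 0 and 1 both prefer candidate 0.
    sim = [[0.9, 0.8, 0.1],
           [0.85, 0.3, 0.2],
           [0.2, 0.4, 0.6]]
    for row in sinkhorn_normalize(sim):
        print([round(v, 3) for v in row])
```

The column step is what discourages several queries from claiming the same candidate, which is the intuition shared with dual-softmax style postprocessing. Note that this sketch needs the whole query-by-candidate matrix, so it does not cover the single-query setting the abstract describes.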
-
Early Warning Prediction with Automatic Labeling in Epilepsy Patients
Authors:
Peng Zhang,
Ting Gao,
** Guo,
**qiao Duan,
Sergey Nikolenko
Abstract:
Early warning for epilepsy patients is crucial for their safety and well-being, in particular to prevent or minimize the severity of seizures. Through the patients' EEG data, we propose a meta learning framework to improve the prediction of early ictal signals. The proposed bi-level optimization framework can help automatically label noisy data at the early ictal stage, as well as optimize the training accuracy of the backbone model. To validate our approach, we conduct a series of experiments to predict seizure onset in various long-term windows, with LSTM and ResNet implemented as the baseline models. Our study demonstrates that not only the ictal prediction accuracy obtained by meta learning is significantly improved, but also the resulting model captures some intrinsic patterns of the noisy data that a single backbone model could not learn. As a result, the predicted probability generated by the meta network serves as a highly effective early warning indicator.
Submitted 11 January, 2024; v1 submitted 9 October, 2023;
originally announced October 2023.
-
Benchmarking Multilabel Topic Classification in the Kyrgyz Language
Authors:
Anton Alekseev,
Sergey I. Nikolenko,
Gulnara Kabaeva
Abstract:
Kyrgyz is a very underrepresented language in terms of modern natural language processing resources. In this work, we present a new public benchmark for topic classification in Kyrgyz, introducing a dataset based on collected and annotated data from the news site 24.KG and presenting several baseline models for news classification in the multilabel setting. We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
Submitted 30 August, 2023;
originally announced August 2023.
-
Machine Learning for SAT: Restricted Heuristics and New Graph Representations
Authors:
Mikhail Shirokikh,
Ilya Shenbin,
Anton Alekseev,
Sergey Nikolenko
Abstract:
Boolean satisfiability (SAT) is a fundamental NP-complete problem with many applications, including automated planning and scheduling. To solve large instances, SAT solvers have to rely on heuristics, e.g., choosing a branching variable in DPLL and CDCL solvers. Such heuristics can be improved with machine learning (ML) models; they can reduce the number of steps but usually hinder the running time because useful models are relatively large and slow. We suggest the strategy of making a few initial steps with a trained ML model and then releasing control to classical heuristics; this simplifies cold start for SAT solving and can decrease both the number of steps and overall runtime, but requires a separate decision of when to release control to the solver. Moreover, we introduce a modification of Graph-Q-SAT tailored to SAT problems converted from other domains, e.g., open shop scheduling problems. We validate the feasibility of our approach with random and industrial SAT problems.
Submitted 18 July, 2023;
originally announced July 2023.
-
Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts
Authors:
Eduard Tulchinskii,
Kristian Kuznetsov,
Laida Kushnareva,
Daniil Cherniavskii,
Serguei Barannikov,
Irina Piontkovskaya,
Sergey Nikolenko,
Evgeny Burnaev
Abstract:
Rapidly increasing quality of AI-generated content makes it difficult to distinguish between human and AI-generated texts, which may lead to undesirable consequences for society. Therefore, it becomes increasingly important to study the properties of human texts that are invariant over different text domains and varying proficiency of human writers, can be easily calculated for any language, and can robustly separate natural and AI-generated texts regardless of the generation model and sampling method. In this work, we propose such an invariant for human-written texts, namely the intrinsic dimensionality of the manifold underlying the set of embeddings for a given text sample. We show that the average intrinsic dimensionality of fluent texts in a natural language is hovering around the value $9$ for several alphabet-based languages and around $7$ for Chinese, while the average intrinsic dimensionality of AI-generated texts for each language is $\approx 1.5$ lower, with a clear statistical separation between human-generated and AI-generated distributions. This property allows us to build a score-based artificial text detector. The proposed detector's accuracy is stable over text domains, generator models, and human writer proficiency levels, outperforming SOTA detectors in model-agnostic and cross-domain scenarios by a significant margin.
Submitted 31 October, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Authors:
Nikita Sorokin,
Dmitry Abulkhanov,
Sergey Nikolenko,
Valentin Malykh
Abstract:
We consider the clone detection and information retrieval problems for source code, well-known tasks important for any programming language. Although it is also an important and interesting problem to find code snippets that operate identically but are written in different programming languages, to the best of our knowledge multilingual clone detection has not been studied in the literature. In this work, we formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset. Moreover, we present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages. The resulting CCT-LM model, initialized with GraphCodeBERT and fine-tuned with CCT, achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67\% MAP and on the AdvTest code search benchmark with 47.18\% MRR; it also shows the best results on the newly created multilingual clone detection benchmark XCD across all programming languages.
Submitted 19 May, 2023;
originally announced May 2023.
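The code search result in the abstract above is reported as MRR (mean reciprocal rank). As a reminder of what that metric measures, here is a minimal sketch under a common definition: for each query, take the reciprocal of the rank of the first relevant item, count 0 if none is retrieved, then average over queries. The function name and data layout are illustrative assumptions, and the benchmark's official scorer may differ in details.

```python
def mean_reciprocal_rank(rankings, relevant):
    """MRR sketch (hypothetical helper).

    rankings: one ranked list of item ids per query, best first.
    relevant: one set of relevant item ids per query.
    """
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for position, item in enumerate(ranked, start=1):
            if item in rel:
                total += 1.0 / position  # only the first relevant hit counts
                break
    return total / len(rankings)


if __name__ == "__main__":
    # Made-up toy data: the first query hits at rank 2, the second never hits.
    rankings = [["a", "b", "c"], ["d", "e", "f"]]
    relevant = [{"b"}, {"x"}]
    print(mean_reciprocal_rank(rankings, relevant))
```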
-
Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets
Authors:
Ivan Sedykh,
Dmitry Abulkhanov,
Nikita Sorokin,
Sergey Nikolenko,
Valentin Malykh
Abstract:
Code search is an important and well-studied task, but it usually means searching for code by a text query. We argue that using a code snippet (and possibly an error traceback) as a query while looking for bugfixing instructions and code samples is a natural use case not covered by prior art. Moreover, existing datasets use code comments rather than full-text descriptions as text, making them unsuitable for this use case. We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data; we show that on SearchBySnippet, existing architectures fall short of a simple BM25 baseline even after fine-tuning. We present a new single encoder model SnippeR that outperforms several strong baselines on SearchBySnippet with a result of 0.451 Recall@10; we propose the SearchBySnippet dataset and SnippeR as a new important benchmark for code search evaluation.
Submitted 27 May, 2024; v1 submitted 19 May, 2023;
originally announced May 2023.
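The abstract above reports its retrieval result as Recall@10. The sketch below assumes one common reading of that metric for search tasks: the fraction of queries for which at least one relevant item appears among the top k results. The function name and data layout are illustrative assumptions, and the dataset's own evaluation script may define the metric differently.

```python
def recall_at_k(rankings, relevant, k=10):
    """Recall@k sketch (hypothetical helper, "any relevant item in top k" variant).

    rankings: one ranked list of item ids per query, best first.
    relevant: one set of relevant item ids per query.
    """
    hits = 0
    for ranked, rel in zip(rankings, relevant):
        # A query counts as a hit if any of its top-k items is relevant.
        if any(item in rel for item in ranked[:k]):
            hits += 1
    return hits / len(rankings)


if __name__ == "__main__":
    # Made-up toy data: only the first query has a relevant item in its top 2.
    rankings = [["a", "b", "c"], ["d", "e", "f"]]
    relevant = [{"b"}, {"f"}]
    print(recall_at_k(rankings, relevant, k=2))
```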
-
STIR: Siamese Transformer for Image Retrieval Postprocessing
Authors:
Aleksei Shabanov,
Aleksei Tarasov,
Sergey Nikolenko
Abstract:
Current metric learning approaches for image retrieval are usually based on learning a space of informative latent representations where simple approaches such as the cosine distance will work well. Recent state-of-the-art methods such as HypViT move to more complex embedding spaces that may yield better results but are harder to scale to production environments. In this work, we first construct a simpler model based on triplet loss with hard negative mining that performs at the state-of-the-art level but does not have these drawbacks. Second, we introduce a novel approach for image retrieval postprocessing called Siamese Transformer for Image Retrieval (STIR) that reranks several top outputs in a single forward pass. Unlike previously proposed Reranking Transformers, STIR does not rely on global/local feature extraction and directly compares a query image and a retrieved candidate at the pixel level using an attention mechanism. The resulting approach defines a new state of the art on standard image retrieval datasets: Stanford Online Products and DeepFashion In-shop. We also release the source code at https://github.com/OML-Team/open-metric-learning/tree/main/pipelines/postprocessing/ and an interactive demo of our approach at https://dapladoc-oml-postprocessing-demo-srcappmain-pfh2g0.streamlit.app/
Submitted 27 April, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
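The abstract above mentions a baseline trained with triplet loss and hard negative mining over a cosine-distance embedding space. The abstract does not specify the mining scheme, so the sketch below assumes one simple variant: for every anchor-positive pair in a batch, the negative closest to the anchor is used. The function names and the `margin` value are illustrative assumptions, not the authors' settings.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def hard_negative_triplet_loss(embeddings, labels, margin=0.2):
    """Triplet loss sketch with in-batch hardest negatives (hypothetical helper)."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        # Distances from the anchor to every sample with a different label.
        negatives = [cosine_distance(embeddings[a], embeddings[j])
                     for j in range(n) if labels[j] != labels[a]]
        if not negatives:
            continue  # no negatives for this anchor in the batch
        hardest_negative = min(negatives)  # the closest negative is the hardest
        for p in range(n):
            if p != a and labels[p] == labels[a]:
                d_pos = cosine_distance(embeddings[a], embeddings[p])
                # Hinge: the positive should be closer than the negative by the margin.
                losses.append(max(0.0, d_pos - hardest_negative + margin))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Made-up toy batch: two samples of class 0 and one of class 1.
    batch = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(hard_negative_triplet_loss(batch, labels=[0, 0, 1]))
```

Averaging over anchor-positive pairs is one of several reasonable reductions; batch-hard variants that also pick the farthest positive per anchor are a common alternative.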
-
Topological Data Analysis for Speech Processing
Authors:
Eduard Tulchinskii,
Kristian Kuznetsov,
Laida Kushnareva,
Daniil Cherniavskii,
Serguei Barannikov,
Irina Piontkovskaya,
Sergey Nikolenko,
Evgeny Burnaev
Abstract:
We apply topological data analysis (TDA) to speech classification problems and to the introspection of a pretrained speech model, HuBERT. To this end, we introduce a number of topological and algebraic features derived from Transformer attention maps and embeddings. We show that a simple linear classifier built on top of such features outperforms a fine-tuned classification head. In particular, we achieve an improvement of about $9\%$ accuracy and $5\%$ ERR on four common datasets; on CREMA-D, the proposed feature set reaches a new state-of-the-art performance with accuracy $80.155$. We also show that topological features are able to reveal functional roles of speech Transformer heads; e.g., we find heads capable of distinguishing between pairs of sample sources (natural/synthetic) or voices without any downstream fine-tuning. Our results demonstrate that TDA is a promising new approach for speech analysis, especially for tasks that require structural prediction. Appendices, an introduction to TDA, and other additional materials are available at https://topohubert.github.io/speech-topology-webpages/
Submitted 6 June, 2023; v1 submitted 30 November, 2022;
originally announced November 2022.
-
Personality-Driven Social Multimedia Content Recommendation
Authors:
Qi Yang,
Sergey Nikolenko,
Alfred Huang,
Aleksandr Farseev
Abstract:
Social media marketing plays a vital role in promoting brand and product values to wide audiences. In order to boost their advertising revenues, global media buying platforms such as Facebook Ads constantly reduce the reach of branded organic posts, pushing brands to spend more on paid media ads. In order to run organic and paid social media marketing efficiently, it is necessary to understand the audience, tailoring the content to fit their interests and online behaviours, which is impossible to do manually at a large scale. At the same time, various personality type categorization schemes such as the Myers-Briggs Personality Type indicator make it possible to reveal the dependencies between personality traits and user content preferences on a wider scale by categorizing audience behaviours in a unified and structured manner. This problem is yet to be studied in depth by the research community, while the level of impact of different personality traits on content recommendation accuracy has not been widely utilised and comprehensively evaluated so far. Specifically, in this work we investigate the impact of human personality traits on the content recommendation model by applying a novel personality-driven multi-view content recommender system called Personality Content Marketing Recommender Engine, or PersiC. Our experimental results and real-world case study demonstrate not just PersiC's ability to perform efficient human personality-driven multi-view content recommendation, but also allow for actionable digital ad strategy recommendations, which when deployed are able to improve digital advertising efficiency by over 420% as compared to the original human-guided approach.
Submitted 25 July, 2022;
originally announced July 2022.
-
DetIE: Multilingual Open Information Extraction Inspired by Object Detection
Authors:
Michael Vasilkovsky,
Anton Alekseev,
Valentin Malykh,
Ilya Shenbin,
Elena Tutubalina,
Dmitriy Salikhov,
Mikhail Stepnov,
Andrey Chertok,
Sergey Nikolenko
Abstract:
State-of-the-art neural methods for open information extraction (OpenIE) usually extract triplets (or tuples) iteratively in an autoregressive or predicate-based manner in order not to produce duplicates. In this work, we propose a different approach to the problem that can be equally or more successful. Namely, we present a novel single-pass method for OpenIE inspired by object detection algorithms from computer vision. We use an order-agnostic loss based on bipartite matching that forces unique predictions and a Transformer-based encoder-only architecture for sequence labeling. The proposed approach is faster and shows superior or similar performance in comparison with state-of-the-art models on standard benchmarks in terms of both quality metrics and inference time. Our model sets the new state-of-the-art performance of 67.7% F1 on CaRB evaluated as OIE2016 while being 3.35x faster at inference than the previous state of the art. We also evaluate the multilingual version of our model in the zero-shot setting for two languages and introduce a strategy for generating synthetic multilingual data to fine-tune the model for each specific language. In this setting, we show a performance improvement of 15% on multilingual Re-OIE2016, reaching 75% F1 for both Portuguese and Spanish. Code and models are available at https://github.com/sberbank-ai/DetIE.
Submitted 24 June, 2022;
originally announced June 2022.
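The order-agnostic loss based on bipartite matching mentioned in the DetIE abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cost matrix, the function name `match_predictions`, and the brute-force search are assumptions made for illustration, and a practical system would more likely use the Hungarian algorithm. The idea shown is only that each prediction is paired with exactly one reference extraction so that the summed cost is minimal, regardless of the order in which predictions are produced.

```python
from itertools import permutations


def match_predictions(cost):
    """Return (assignment, total) minimising the summed matching cost.

    cost[i][j] is a hypothetical per-pair loss between prediction i and
    reference extraction j. The matrix is assumed to be square.
    Brute force over permutations is used for clarity only; it is
    exponential in the number of predictions.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        # perm[i] is the reference assigned to prediction i.
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


if __name__ == "__main__":
    # Swapping the rows of the cost matrix swaps the assignment,
    # but the minimal total (the loss value) stays the same.
    print(match_predictions([[1, 5], [4, 2]]))
    print(match_predictions([[5, 1], [2, 4]]))
```

Because every reference is assigned to exactly one prediction, two predictions cannot both be rewarded for the same extraction, which is how such a loss discourages duplicates.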
-
Near-Zero-Shot Suggestion Mining with a Little Help from WordNet
Authors:
Anton Alekseev,
Elena Tutubalina,
Sejeong Kwon,
Sergey Nikolenko
Abstract:
In this work, we explore the constructive side of online reviews: advice, tips, requests, and suggestions that users provide about goods, venues, services, and other items of interest. To reduce training costs and annotation efforts needed to build a classifier for a specific label set, we present and evaluate several entailment-based zero-shot approaches to suggestion classification in a label-fully-unseen fashion. In particular, we introduce the strategy of assigning target class labels to sentences in English language with user intentions, which significantly improves prediction quality. The proposed strategies are evaluated with a comprehensive experimental study that validated our results both quantitatively and qualitatively.
Submitted 25 November, 2021;
originally announced November 2021.
-
Paraelectric KH$_2$PO$_4$ Nanocrystals in Monolithic Mesoporous Silica: Structure and Lattice Dynamics
Authors:
Yaroslav Shchur,
Andriy V. Kityk,
Viktor V. Strelchuk,
Andrii S. Nikolenko,
Nazariy A. Andrushchak,
Patrick Huber,
Anatolii S. Andrushchak
Abstract:
Combining dielectric crystals with mesoporous solids allows a versatile design of functional nanomaterials, where the porous host provides a mechanically rigid scaffold structure and the molecular filling adds the functionalization. Here, we report a study of the complex lattice dynamics of a SiO$_2$:KH$_2$PO$_4$ nanocomposite consisting of a monolithic, mesoporous silica glass host with KH$_2$PO$_4$ nanocrystals embedded in its tubular channels $\sim$12 nm across. A micro-Raman investigation performed in the spectral range of 70-1600 cm$^{-1}$ reveals the complex lattice dynamics of the confined crystals. Their Raman spectrum resembles the one taken from bulk KH$_2$PO$_4$ crystals and thus, along with X-ray diffraction experiments, corroborates the successful solution-based synthesis of KH$_2$PO$_4$ nanocrystals with a structure analogous to the bulk material. We succeeded in observing not only the high-frequency internal modes ($\sim$900-1200 cm$^{-1}$), typical of internal vibrations of the PO$_4$ tetrahedra, but, more importantly, also the lowest frequency modes typical of bulk KH$_2$PO$_4$ crystals. The experimental Raman spectrum was interpreted with a group theory analysis and first-principles lattice dynamics calculations. The analysis of the calculated eigenvectors indicates the involvement of hydrogen atoms in most phonon modes, corroborating the substantial significance of the hydrogen subsystem in the lattice dynamics of paraelectric bulk KH$_2$PO$_4$ and of KH$_2$PO$_4$ crystals in extreme spatial confinement. A marginal redistribution of relative Raman intensities of the confined compared to unconfined crystals presumably originates in slightly changed crystal fields and interatomic interactions, in particular for the parts of the nanocrystals in close proximity to the silica pore surfaces.
Submitted 11 February, 2021;
originally announced February 2021.
-
Towards General Purpose Geometry-Preserving Single-View Depth Estimation
Authors:
Mikhail Romanov,
Nikolay Patatkin,
Anna Vorontsova,
Sergey Nikolenko,
Anton Konushin,
Dmitry Senyushkin
Abstract:
Single-view depth estimation (SVDE) plays a crucial role in scene understanding for AR applications, 3D modeling, and robotics, providing the geometry of a scene based on a single image. Recent works have shown that a successful solution strongly relies on the diversity and volume of training data. This data can be sourced from stereo movies and photos. However, they do not provide geometrically complete depth maps (as disparities contain unknown shift value). Therefore, existing models trained on this data are not able to recover correct 3D representations. Our work shows that a model trained on this data along with conventional datasets can gain accuracy while predicting correct scene geometry. Surprisingly, only a small portion of geometrically correct depth maps are required to train a model that performs equally to a model trained on the full geometrically correct dataset. After that, we train computationally efficient models on a mixture of datasets using the proposed method. Through quantitative comparison on completely unseen datasets and qualitative comparison of 3D point clouds, we show that our model defines the new state of the art in general-purpose SVDE.
Submitted 9 February, 2021; v1 submitted 25 September, 2020;
originally announced September 2020.
-
Improving unsupervised neural aspect extraction for online discussions using out-of-domain classification
Authors:
Anton Alekseev,
Elena Tutubalina,
Valentin Malykh,
Sergey Nikolenko
Abstract:
Deep learning architectures based on self-attention have recently achieved and surpassed state of the art results in the task of unsupervised aspect extraction and topic modeling. While models such as neural attention-based aspect extraction (ABAE) have been successfully applied to user-generated texts, they are less coherent when applied to traditional data sources such as news articles and newsgroup documents. In this work, we introduce a simple approach based on sentence filtering in order to improve topical aspects learned from newsgroups-based content without modifying the basic mechanism of ABAE. We train a probabilistic classifier to distinguish between out-of-domain texts (outer dataset) and in-domain texts (target dataset). Then, during data preparation we filter out sentences that have a low probability of being in-domain and train the neural model on the remaining sentences. The positive effect of sentence filtering on topic coherence is demonstrated in comparison to aspect extraction models trained on unfiltered texts.
Submitted 17 June, 2020;
originally announced June 2020.
-
The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews
Authors:
Elena Tutubalina,
Ilseyar Alimova,
Zulfat Miftahutdinov,
Andrey Sakhovskiy,
Valentin Malykh,
Sergey Nikolenko
Abstract:
The Russian Drug Reaction Corpus (RuDReC) is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. The corpus itself consists of two parts, the raw one and the labelled one. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Labels for sentences indicate the presence or absence of health-related issues. Sentences that contain such an issue are additionally labelled at the expression level for identification of fine-grained subtypes such as drug classes and drug forms, drug indications, and drug reactions. Further, we present a baseline model for named entity recognition (NER) and multi-label sentence classification tasks on this corpus. The macro F1 score of 74.85% in the NER task was achieved by our RuDR-BERT model. For the sentence classification task, our model achieves the macro F1 score of 68.82%, gaining 7.47% over the score of the BERT model trained on Russian data. We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC
Submitted 7 April, 2020;
originally announced April 2020.
-
High-Resolution Daytime Translation Without Domain Labels
Authors:
Ivan Anokhin,
Pavel Solovev,
Denis Korzhenkov,
Alexey Kharlamov,
Taras Khakhulin,
Alexey Silvestrov,
Sergey Nikolenko,
Victor Lempitsky,
Gleb Sterkin
Abstract:
Modeling daytime changes in high resolution photographs, e.g., re-rendering the same scene under different illuminations typical for day, night, or dawn, is a challenging image manipulation task. We present the high-resolution daytime translation (HiDT) model for this task. HiDT combines a generative image-to-image model and a new upsampling scheme that allows image translation to be applied at high resolution. The model demonstrates competitive results in terms of both commonly used GAN metrics and human evaluation. Importantly, this good performance comes as a result of training on a dataset of still landscape images with no daytime labels available. Our results are available at https://saic-mdal.github.io/HiDT/.
Submitted 23 March, 2020; v1 submitted 19 March, 2020;
originally announced March 2020.
-
KH2PO4 + Host Matrix (Alumina / SiO$_2$) Nanocomposite: Raman Scattering Insight
Authors:
Ya. Shchur,
A. S. Andrushchak,
V. V. Strelchuk,
A. S. Nikolenko,
V. T. Adamiv,
N. A. Andrushchak,
P. Göring,
P. Huber,
A. V. Kityk
Abstract:
We report on the synthesis and Raman scattering characterization of composite materials based on host nanoporous matrices filled with nanostructured KH2PO4 (KDP) crystal. Silica (SiO2) and anodized aluminium oxide (AAO) were used as host matrices with various pore diameters, inter-pore spacing and morphology. The structure of the nanocomposites was investigated by X-ray diffraction and scanning electron microscopy. Raman scattering reveals the creation of one-dimensional nanostructured KDP inside the SiO2 matrix. We clearly observed the stretching ν1, ν3 and bending ν2 vibrations of PO4 tetrahedral groups in the Raman spectrum of SiO2 + KDP. In the Raman scattering spectra of the AAO + KDP nanocomposite, the broad fluorescence background of the AAO matrix dominates to a great extent, thus hindering detection of the spectral response of the KDP compound.
Submitted 19 February, 2020;
originally announced February 2020.
-
RecVAE: a New Variational Autoencoder for Top-N Recommendations with Implicit Feedback
Authors:
Ilya Shenbin,
Anton Alekseev,
Elena Tutubalina,
Valentin Malykh,
Sergey I. Nikolenko
Abstract:
Recent research has shown the advantages of using autoencoders based on deep neural networks for collaborative filtering. In particular, the recently proposed Mult-VAE model, which used the multinomial likelihood variational autoencoders, has shown excellent results for top-N recommendations. In this work, we propose the Recommender VAE (RecVAE) model that originates from our research on regularization techniques for variational autoencoders. RecVAE introduces several novel ideas to improve Mult-VAE, including a novel composite prior distribution for the latent codes, a new approach to setting the $β$ hyperparameter for the $β$-VAE framework, and a new approach to training based on alternating updates. In experimental evaluation, we show that RecVAE significantly outperforms previously proposed autoencoder-based models, including Mult-VAE and RaCT, across classical collaborative filtering datasets, and present a detailed ablation study to assess our new developments. Code and models are available at https://github.com/ilya-shenbin/RecVAE.
Submitted 23 December, 2019;
originally announced December 2019.
-
Synthetic Data for Deep Learning
Authors:
Sergey I. Nikolenko
Abstract:
Synthetic data is an increasingly popular tool for training deep learning models, especially in computer vision but also in other areas. In this work, we attempt to provide a comprehensive survey of the various directions in the development and application of synthetic data. First, we discuss synthetic datasets for basic computer vision problems, both low-level (e.g., optical flow estimation) and high-level (e.g., semantic segmentation), synthetic environments and datasets for outdoor and urban scenes (autonomous driving), indoor scenes (indoor navigation), aerial navigation, simulation environments for robotics, applications of synthetic data outside computer vision (in neural programming, bioinformatics, NLP, and more); we also survey the work on improving synthetic data development and alternative ways to produce it such as GANs. Second, we discuss in detail the synthetic-to-real domain adaptation problem that inevitably arises in applications of synthetic data, including synthetic-to-real refinement with GAN-based models and domain adaptation at the feature/model level without explicit data transformations. Third, we turn to privacy-related applications of synthetic data and review the work on generating synthetic datasets with differential privacy guarantees. We conclude by highlighting the most promising directions for further work in synthetic data studies.
Submitted 25 September, 2019;
originally announced September 2019.
-
CommentsRadar: Dive into Unique Data on All Comments on the Web
Authors:
Sergey Nikolenko,
Elena Tutubalina,
Zulfat Miftahutdinov,
Eugene Beloded
Abstract:
We introduce an entity-centric search engine, CommentsRadar, that pairs entity queries with articles and user opinions covering a wide range of topics from top commented sites. The engine aggregates articles and comments for these articles, extracts named entities, links them together and with knowledge base entries, performs sentiment analysis, and aggregates the results, aiming to mine for temporal trends and other insights. In this work, we present the general engine, discuss the models used for all steps of this pipeline, and introduce several case studies that discover important insights from online commenting data.
Submitted 16 August, 2019;
originally announced August 2019.
-
Free-Lunch Saliency via Attention in Atari Agents
Authors:
Dmitry Nikulin,
Anastasia Ianina,
Vladimir Aliev,
Sergey Nikolenko
Abstract:
We propose a new approach to visualize saliency maps for deep neural network models and apply it to deep reinforcement learning agents trained on Atari environments. Our method adds an attention module that we call FLS (Free Lunch Saliency) to the feature extractor from an established baseline (Mnih et al., 2015). This addition results in a trainable model that can produce saliency maps, i.e., visualizations of the importance of different parts of the input for the agent's current decision making. We show experimentally that a network with an FLS module exhibits performance similar to the baseline (i.e., it is "free", with no performance cost) and can be used as a drop-in replacement for reinforcement learning agents. We also design another feature extractor that scores slightly lower but provides higher-fidelity visualizations. In addition to attained scores, we report saliency metrics evaluated on the Atari-HEAD dataset of human gameplay.
Submitted 30 October, 2019; v1 submitted 7 August, 2019;
originally announced August 2019.
-
New Competitiveness Bounds for the Shared Memory Switch
Authors:
Ivan Bochkov,
Alex Davydow,
Nikita Gaevoy,
Sergey I. Nikolenko
Abstract:
We consider one of the simplest and best known buffer management architectures: the shared memory switch with multiple output queues and uniform packets. It was one of the first models studied by competitive analysis, with the Longest Queue Drop (LQD) buffer management policy shown to be at least $\sqrt{2}$- and at most $2$-competitive; a general lower bound of $4/3$ has been proven for all deterministic online algorithms. Closing the gap between $\sqrt{2}$ and $2$ has remained an open problem in competitive analysis for more than a decade, with only marginal success in reducing the upper bound of $2$. In this work, we first present a simplified proof for the $\sqrt{2}$ lower bound for LQD and then, using a reduction to the continuous case, improve the general lower bound for all deterministic online algorithms from $\frac 43$ to $\sqrt{2}$. Then, we proceed to improve the lower bound of $\sqrt{2}$ specifically for LQD, showing that LQD is at least $1.44546086$-competitive. We are able to prove the bound by presenting an explicit construction of the optimal clairvoyant algorithm which then allows for two different ways to prove lower bounds: by direct computer simulations and by proving lower bounds via linear programming. The linear programming approach yields a lower bound for LQD of $1.4427902$ (still larger than $\sqrt{2}$).
Submitted 9 July, 2019;
originally announced July 2019.
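As a rough illustration of the Longest Queue Drop (LQD) policy discussed in the abstract above, the sketch below simulates one time slot of a shared memory switch with uniform packets. The function name `lqd_step`, the tie-breaking rule (lowest queue index), and the "arrivals first, then one transmission per non-empty queue" slot structure are assumptions made for this example, not details taken from the paper.

```python
def lqd_step(queues, arrivals, capacity):
    """Simulate one time slot under Longest Queue Drop (LQD).

    queues   -- list of current queue lengths, modified in place
    arrivals -- iterable of destination queue indices for arriving packets
    capacity -- total shared buffer size (in packets)
    Returns the number of packets transmitted in this slot.
    """
    for dest in arrivals:
        # Tentatively admit the packet to its destination queue.
        queues[dest] += 1
        if sum(queues) > capacity:
            # Buffer overflow: push out one packet from the longest queue.
            # Ties are broken by lowest index (an assumption of this sketch).
            longest = max(range(len(queues)), key=lambda i: queues[i])
            queues[longest] -= 1
    # Transmission phase: each non-empty output queue sends one packet.
    sent = 0
    for i, length in enumerate(queues):
        if length > 0:
            queues[i] -= 1
            sent += 1
    return sent


if __name__ == "__main__":
    q = [0, 0]
    # Four packets fill the buffer via queue 0; the fifth, for queue 1,
    # is admitted by pushing out a packet from the longer queue 0.
    print(lqd_step(q, [0, 0, 0, 0, 1], capacity=4), q)
```

The competitive ratio studied in the paper compares the number of packets such an online policy transmits with that of an optimal clairvoyant algorithm on the same arrival sequence; this sketch only covers the online side.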
-
Breast Tumor Cellularity Assessment using Deep Neural Networks
Authors:
Alexander Rakhlin,
Aleksei Tiulpin,
Alexey A. Shvets,
Alexandr A. Kalinin,
Vladimir I. Iglovikov,
Sergey Nikolenko
Abstract:
Breast cancer is one of the main causes of death worldwide. Histopathological cellularity assessment of residual tumors in post-surgical tissues is used to analyze a tumor's response to a therapy. Correct cellularity assessment increases the chances of getting an appropriate treatment and facilitates the patient's survival. In current clinical practice, tumor cellularity is manually estimated by pathologists; this process is tedious and prone to errors or low agreement rates between assessors. In this work, we evaluated three strong novel Deep Learning-based approaches for automatic assessment of tumor cellularity from post-treated breast surgical specimens stained with hematoxylin and eosin. We validated the proposed methods on the BreastPathQ SPIE challenge dataset that consisted of 2395 image patches selected from whole slide images acquired from 64 patients. Compared to expert pathologist scoring, our best performing method yielded the Cohen's kappa coefficient of 0.70 (vs. 0.42 previously known in literature) and the intra-class correlation coefficient of 0.89 (vs. 0.83). Our results suggest that Deep Learning-based methods have a significant potential to alleviate the burden on pathologists, enhance the diagnostic workflow, and, thereby, facilitate better clinical outcomes in breast cancer treatment.
Submitted 3 September, 2019; v1 submitted 5 May, 2019;
originally announced May 2019.
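For readers unfamiliar with the agreement statistic quoted in the abstract above, the following is a minimal sketch of the unweighted Cohen's kappa for two raters assigning categorical labels. It is a generic textbook formulation, not the evaluation code used in the paper; in particular, how continuous cellularity scores were turned into categories is not stated in the abstract, so the integer labels below are purely hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the raters' marginal
    label frequencies. Undefined (division by zero) when p_e == 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of marginal frequencies, summed over labels.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels: the raters agree on 3 of 4 items.
    print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which gives some context for the reported improvement from 0.42 to 0.70.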
-
AspeRa: Aspect-based Rating Prediction Model
Authors:
Sergey I. Nikolenko,
Elena Tutubalina,
Valentin Malykh,
Ilya Shenbin,
Anton Alekseev
Abstract:
We propose a novel end-to-end Aspect-based Rating Prediction model (AspeRa) that estimates user rating based on review texts for the items and at the same time discovers coherent aspects of reviews that can be used to explain predictions or profile users. The AspeRa model uses max-margin losses for joint item and user embedding learning and a dual-headed architecture; it significantly outperforms recently proposed state-of-the-art models such as DeepCoNN, HFT, NARRE, and TransRev on two real world data sets of user reviews. With qualitative examination of the aspects and quantitative evaluation of rating prediction models based on these aspects, we show how aspect embeddings can be used in a recommender system.
Submitted 23 January, 2019;
originally announced January 2019.
-
Adapting Convolutional Neural Networks for Geographical Domain Shift
Authors:
Pavel Ostyakov,
Sergey I. Nikolenko
Abstract:
We present the winning solution for the Inclusive Images Competition organized as part of the Conference on Neural Information Processing Systems (NeurIPS 2018) Competition Track. The competition was organized to study ways to cope with domain shift in image processing, specifically geographical shift: the training and two test sets in the competition had different geographical distributions. Our solution has proven to be relatively straightforward and simple: it is an ensemble of several CNNs where only the last layer is fine-tuned with the help of a small labeled set of tuning labels made available by the organizers. We believe that while domain shift remains a formidable problem, our approach opens up new possibilities for alleviating this problem in practice, where small labeled datasets from the target domain are usually either available or can be obtained and labeled cheaply.
Submitted 18 January, 2019;
originally announced January 2019.
-
Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models
Authors:
Daniil Polykovskiy,
Alexander Zhebrak,
Benjamin Sanchez-Lengeling,
Sergey Golovanov,
Oktai Tatanov,
Stanislav Belyaev,
Rauf Kurbanov,
Aleksey Artamonov,
Vladimir Aladinskiy,
Mark Veselov,
Artur Kadurin,
Simon Johansson,
Hongming Chen,
Sergey Nikolenko,
Alan Aspuru-Guzik,
Alex Zhavoronkov
Abstract:
Generative models are becoming a tool of choice for exploring the molecular space. These models learn on a large training dataset and produce novel molecular structures with similar properties. Generated structures can be utilized for virtual screening or training semi-supervised predictive models in the downstream tasks. While there are plenty of generative models, it is unclear how to compare and rank them. In this work, we introduce a benchmarking platform called Molecular Sets (MOSES) to standardize training and comparison of molecular generative models. MOSES provides training and testing datasets, and a set of metrics to evaluate the quality and diversity of generated structures. We have implemented and compared several molecular generation models and suggest using our results as reference points for further advancements in generative chemistry research. The platform and source code are available at https://github.com/molecularsets/moses.
Submitted 28 October, 2020; v1 submitted 29 November, 2018;
originally announced November 2018.
-
Sequence Learning with RNNs for Medical Concept Normalization in User-Generated Texts
Authors:
Elena Tutubalina,
Zulfat Miftahutdinov,
Sergey Nikolenko,
Valentin Malykh
Abstract:
In this work, we consider the medical concept normalization problem, i.e., the problem of mapping a disease mention in free-form text to a concept in a controlled vocabulary, usually to the standard thesaurus in the Unified Medical Language System (UMLS). This task is challenging since medical terminology is very different when coming from health care professionals or from the general public in the form of social media texts. We approach it as a sequence learning problem, with recurrent neural networks trained to obtain semantic representations of one- and multi-word expressions. We develop end-to-end neural architectures tailored specifically to medical concept normalization, including bidirectional LSTM and GRU with an attention mechanism and additional semantic similarity features based on UMLS. Our evaluation over a standard benchmark shows that our model improves over a state of the art baseline for classification based on CNNs.
Submitted 29 November, 2018; v1 submitted 28 November, 2018;
originally announced November 2018.
-
Learning State Representations in Complex Systems with Multimodal Data
Authors:
Pavel Solovev,
Vladimir Aliev,
Pavel Ostyakov,
Gleb Sterkin,
Elizaveta Logacheva,
Stepan Troeshestov,
Roman Suvorov,
Anton Mashikhin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Representation learning becomes especially important for complex systems with multimodal data sources such as cameras or sensors. Recent advances in reinforcement learning and optimal control make it possible to design control algorithms on these latent representations, but the field still lacks a large-scale standard dataset for unified comparison. In this work, we present a large-scale dataset and evaluation framework for representation learning for the complex task of landing an airplane. We implement and compare several approaches to representation learning on this dataset in terms of the quality of simple supervised learning tasks and disentanglement scores. The resulting representations can be used for further tasks such as anomaly detection, optimal control, model-based reinforcement learning, and other applications.
Submitted 15 January, 2019; v1 submitted 27 November, 2018;
originally announced November 2018.
-
SEIGAN: Towards Compositional Image Generation by Simultaneously Learning to Segment, Enhance, and Inpaint
Authors:
Pavel Ostyakov,
Roman Suvorov,
Elizaveta Logacheva,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
We present a novel approach to image manipulation and understanding by simultaneously learning to segment object masks, paste objects to another background image, and remove them from original images. For this purpose, we develop a novel generative model for compositional image generation, SEIGAN (Segment-Enhance-Inpaint Generative Adversarial Network), which learns these three operations together in an adversarial architecture with additional cycle consistency losses. To train, SEIGAN needs only bounding box supervision and does not require pairing or ground truth masks. SEIGAN produces better generated images (evaluated by human assessors) than other approaches and produces high-quality segmentation masks, improving over other adversarially trained approaches and getting closer to the results of fully supervised training.
Submitted 15 January, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Label Denoising with Large Ensembles of Heterogeneous Neural Networks
Authors:
Pavel Ostyakov,
Elizaveta Logacheva,
Roman Suvorov,
Vladimir Aliev,
Gleb Sterkin,
Oleg Khomenko,
Sergey I. Nikolenko
Abstract:
Despite recent advances in computer vision based on various convolutional architectures, video understanding remains an important challenge. In this work, we present and discuss a top solution for the large-scale video classification (labeling) problem introduced as a Kaggle competition based on the YouTube-8M dataset. We show and compare different approaches to preprocessing, data augmentation, model architectures, and model combination. Our final model is based on a large ensemble of video- and frame-level models but fits into rather limiting hardware constraints. We apply an approach based on knowledge distillation to deal with noisy labels in the original dataset and the recently developed mixup technique to improve the basic models.
Submitted 15 January, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
BASEL (Buffering Architecture SpEcification Language)
Authors:
Kirill Kogan,
Danushka Menikkumbura,
Gustavo Petri,
Youngtae Noh,
Sergey Nikolenko,
Patrick Eugster
Abstract:
Buffering architectures and policies for their efficient management constitute one of the core ingredients of a network architecture. In this work we introduce a new specification language, BASEL, that allows one to express virtual buffering architectures and management policies representing a variety of economic models. BASEL does not require the user to implement policies in a high-level language; rather, the entire buffering architecture and its policy are reduced to several comparators and simple functions. We show examples of buffering architectures in BASEL and demonstrate empirically the impact of various settings on performance.
Submitted 14 October, 2015;
originally announced October 2015.
-
BayesHammer: Bayesian clustering for error correction in single-cell sequencing
Authors:
Sergey I. Nikolenko,
Anton I. Korobeynikov,
Max A. Alekseyev
Abstract:
Error correction of sequenced reads remains a difficult task, especially in single-cell sequencing projects with extremely non-uniform coverage. While existing error correction tools designed for standard (multi-cell) sequencing data usually come up short in single-cell sequencing projects, algorithms actually used for single-cell error correction have been so far very simplistic.
We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our new error correction tool BayesHammer. While BayesHammer was designed for single-cell sequencing, we demonstrate that it also improves on existing error correction tools for multi-cell sequencing data while working much faster on real-life datasets. We benchmark BayesHammer on both $k$-mer counts and actual assembly results with the SPAdes genome assembler.
Submitted 12 November, 2012;
originally announced November 2012.
-
FIFO Queueing Policies for Packets with Heterogeneous Processing
Authors:
Kirill Kogan,
Alejandro López-Ortiz,
Sergey I. Nikolenko,
Alexander V. Sirotkin,
Denis Tugaryov
Abstract:
We consider the problem of managing a bounded size First-In-First-Out (FIFO) queue buffer, where each incoming unit-sized packet requires several rounds of processing before it can be transmitted out. Our objective is to maximize the total number of successfully transmitted packets. We consider both push-out (when the policy is permitted to drop already admitted packets) and non-push-out cases. In particular, we provide analytical guarantees for the throughput performance of our algorithms. We further conduct a comprehensive simulation study which experimentally validates the predicted theoretical behaviour.
Submitted 24 April, 2012;
originally announced April 2012.
-
Balancing Work and Size with Bounded Buffers
Authors:
Kirill Kogan,
Alejandro Lopez-Ortiz,
Sergey I. Nikolenko,
Gabriel Scalosub,
Michael Segal
Abstract:
We consider the fundamental problem of managing a bounded size queue buffer where traffic consists of packets of varying size, where each packet requires several rounds of processing before it can be transmitted from the queue buffer. The goal in such an environment is to maximize the overall size of packets that are successfully transmitted. This model is motivated by the ever-growing ubiquity of network processor architectures, which must deal with heterogeneously sized traffic with heterogeneous processing requirements. Our work addresses the tension between two conflicting algorithmic approaches in such settings: the tendency to favor packets with fewer processing requirements, thus leading to fast contributions to the accumulated throughput, as opposed to preferring packets of larger size, which imply a large increase in throughput at each step. We present a model for studying such systems, and present competitive algorithms whose performance depends on the maximum size a packet may have and the maximum amount of processing a packet may require. We further provide lower bounds on algorithm performance in such settings.
Submitted 5 September, 2013; v1 submitted 26 February, 2012;
originally announced February 2012.
-
New Combinatorial Complete One-Way Functions
Authors:
Arist Kojevnikov,
Sergey I. Nikolenko
Abstract:
In 2003, Leonid A. Levin presented the idea of a combinatorial complete one-way function and a sketch of the proof that Tiling represents such a function. In this paper, we present two new one-way functions based on semi-Thue string rewriting systems and a version of the Post Correspondence Problem and prove their completeness. Besides, we present an alternative proof of Levin's result. We also discuss the properties a combinatorial problem should have in order to hold a complete one-way function.
Submitted 20 February, 2008;
originally announced February 2008.
-
Chow ring structure made simple
Authors:
S. Nikolenko,
N. Semenov
Abstract:
We show how to translate the task of computing the multiplicative structure of a Chow ring of a projective homogeneous variety into an easily understandable combinatorial task of calculating in the corresponding polynomial ring. The algorithms are also presented as a Maple package. Then we proceed to compute the multiplicative structure of the Chow rings for projective homogeneous varieties E6/P1, E7/P7, and E8/P8.
Submitted 14 June, 2006;
originally announced June 2006.
-
Motivic decomposition of anisotropic varieties of type F_4 into generalized Rost motives
Authors:
S. Nikolenko,
N. Semenov,
K. Zainoulline
Abstract:
This is an extended version of the previous preprint dated February 2005.
We prove that the Chow motive of an anisotropic projective homogeneous variety of type F4 is isomorphic to the direct sum of twisted copies of a generalized Rost motive. In particular, we provide an explicit construction of a generalized Rost motive for a generically splitting variety for a symbol in K_3^M(k)/3. We also establish a motivic isomorphism between two anisotropic non-isomorphic projective homogeneous varieties of type F4. All our results hold for Chow motives with integral coefficients.
Submitted 22 September, 2005; v1 submitted 17 February, 2005;
originally announced February 2005.
-
Hard satisfiable formulas for DPLL-type algorithms
Authors:
Sergey I. Nikolenko
Abstract:
We address lower bounds on the time complexity of algorithms solving the propositional satisfiability problem. Namely, we consider two DPLL-type algorithms, enhanced with the unit clause and pure literal heuristics. Exponential lower bounds for solving satisfiability on provably satisfiable formulas are proven.
Submitted 15 January, 2003;
originally announced January 2003.