Search | arXiv e-print repository

Image-GS: Content-Adaptive Image Representation via 2D Gaussians

Authors: Yunxiang Zhang, Alexandr Kuznetsov, Akshay **dal, Kenneth Chen, Anton Sochenov, Anton Kaplanyan, Qi Sun

Abstract: Neural image representations have recently emerged as a promising technique for storing, streaming, and rendering visual data. Coupled with learning-based workflows, these novel representations have demonstrated remarkable visual fidelity and memory efficiency. However, existing neural image representations often rely on explicit uniform data structures without content adaptivity or computation-in… ▽ More Neural image representations have recently emerged as a promising technique for storing, streaming, and rendering visual data. Coupled with learning-based workflows, these novel representations have demonstrated remarkable visual fidelity and memory efficiency. However, existing neural image representations often rely on explicit uniform data structures without content adaptivity or computation-intensive implicit models, limiting their adoption in real-time graphics applications. Inspired by recent advances in radiance field rendering, we propose Image-GS, a content-adaptive image representation. Using anisotropic 2D Gaussians as the basis, Image-GS shows high memory efficiency, supports fast random access, and offers a natural level of detail stack. Leveraging a tailored differentiable renderer, Image-GS fits a target image by adaptively allocating and progressively optimizing a set of 2D Gaussians. The generalizable efficiency and fidelity of Image-GS are validated against several recent neural image representations and industry-standard texture compressors on a diverse set of images. Notably, its memory and computation requirements solely depend on and linearly scale with the number of 2D Gaussians, providing flexible controls over the trade-off between visual fidelity and run-time efficiency. We hope this research offers insights for develo** new applications that require adaptive quality and resource control, such as machine perception, asset streaming, and content generation. △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.07906 [pdf, other]

Hybrid Rendering for Dynamic Scenes

Authors: Alexandr Kuznetsov, Stavros Diolatzis, Anton Sochenov, Anton Kaplanyan

Abstract: Despite significant advances in algorithms and hardware, global illumination continues to be a challenge in the real-time domain. Time constraints often force developers to either compromise on the quality of global illumination or disregard it altogether. We take advantage of a common setup in modern games: having a set of a level, which is a static scene with dynamic characters and lighting. We… ▽ More Despite significant advances in algorithms and hardware, global illumination continues to be a challenge in the real-time domain. Time constraints often force developers to either compromise on the quality of global illumination or disregard it altogether. We take advantage of a common setup in modern games: having a set of a level, which is a static scene with dynamic characters and lighting. We introduce a novel method for efficiently and accurately rendering global illumination in dynamic scenes. Our hybrid technique leverages precomputation and neural networks to capture the light transport of a static scene. Then, we introduce a method to compute the difference between the current scene and the static scene, which we already precomputed. By handling the bulk of the light transport through precomputation, our method only requires the rendering of a minimal difference, reducing the noise and increasing the quality. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.20067 [pdf, other]

doi 10.1145/3641519.3657502

N-Dimensional Gaussians for Fitting of High Dimensional Functions

Authors: Stavros Diolatzis, Tobias Zirr, Alexandr Kuznetsov, Georgios Kopanas, Anton Kaplanyan

Abstract: In the wake of many new ML-inspired approaches for reconstructing and representing high-quality 3D content, recent hybrid and explicitly learned representations exhibit promising performance and quality characteristics. However, their scaling to higher dimensions is challenging, e.g. when accounting for dynamic content with respect to additional parameters such as material properties, illumination… ▽ More In the wake of many new ML-inspired approaches for reconstructing and representing high-quality 3D content, recent hybrid and explicitly learned representations exhibit promising performance and quality characteristics. However, their scaling to higher dimensions is challenging, e.g. when accounting for dynamic content with respect to additional parameters such as material properties, illumination, or time. In this paper, we tackle these challenges for an explicit representations based on Gaussian mixture models. With our solutions, we arrive at efficient fitting of compact N-dimensional Gaussian mixtures and enable efficient evaluation at render time: For fast fitting and evaluation, we introduce a high-dimensional culling scheme that efficiently bounds N-D Gaussians, inspired by Locality Sensitive Hashing. For adaptive refinement yet compact representation, we introduce a loss-adaptive density control scheme that incrementally guides the use of additional capacity towards missing details. With these tools we can for the first time represent complex appearance that depends on many input dimensions beyond position or viewing angle within a compact, explicit representation optimized in minutes and rendered in milliseconds. △ Less

Submitted 31 May, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

Comments: https://www.sdiolatz.info/ndg-fitting/

arXiv:2405.12250 [pdf, other]

Your Transformer is Secretly Linear

Authors: Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

Abstract: This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed due to a consistently low output norm o… ▽ More This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed due to a consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not affect significantly the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization, aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks like Tiny Stories and SuperGLUE and as well successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed. △ Less

Submitted 19 May, 2024; originally announced May 2024.

Comments: 9 pages, 9 figures

arXiv:2404.06212 [pdf, other]

OmniFusion Technical Report

Authors: Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

Abstract: Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLM). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various… ▽ More Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLM). We propose an \textit{OmniFusion} model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternVIT, etc.), and their fusing approach, image encoding method (whole image or tiles encoding) and two 7B LLMs (the proprietary one and open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup in terms of different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also propose a variety of situations, where OmniFusion provides highly-detailed answers in different domains: housekee**, sightseeing, culture, medicine, handwritten and scanned equations recognition, etc. Mistral-based OmniFusion model is an open-source solution with weights, training and inference scripts available at https://github.com/AIRI-Institute/OmniFusion. △ Less

Submitted 9 April, 2024; originally announced April 2024.

Comments: 17 pages, 4 figures, 9 tables, 2 appendices

MSC Class: 6804; 68T50 (Primary) ACM Class: I.2.7; I.2.10; I.4.9

arXiv:2312.03511 [pdf, other]

Kandinsky 3.0 Technical Report

Authors: Vladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev, Anastasia Maltseva, Said Azizov, Igor Pavlov, Julia Agafonova, Andrey Kuznetsov, Denis Dimitrov

Abstract: We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction… ▽ More We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3 △ Less

Submitted 28 June, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

Comments: Project page: https://ai-forever.github.io/Kandinsky-3

arXiv:2311.13073 [pdf, other]

FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline

Authors: Vladimir Arkhipkin, Zein Shaheen, Viacheslav Vasilev, Elizaveta Dakhova, Andrey Kuznetsov, Denis Dimitrov

Abstract: Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyfram… ▽ More Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframes synthesis to figure the storyline of a video, while the second one is devoted to interpolation frames generation to make movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframes generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/ △ Less

Submitted 20 December, 2023; v1 submitted 21 November, 2023; originally announced November 2023.

Comments: Project page: https://ai-forever.github.io/kandinsky-video/

arXiv:2311.05928 [pdf, other]

The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models

Authors: Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

Abstract: In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from… ▽ More In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. Which is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties. △ Less

Submitted 26 February, 2024; v1 submitted 10 November, 2023; originally announced November 2023.

Comments: Accepted to EACL-2024

arXiv:2310.03502 [pdf, other]

Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

Authors: Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov

Abstract: Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel explo… ▽ More Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality. △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2309.14471 [pdf, other]

Adapting Double Q-Learning for Continuous Reinforcement Learning

Authors: Arsenii Kuznetsov

Abstract: Majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques rooted in heuristics, primarily addressing the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to the bias correction, similar in spirit to Double Q-Learning. We propose using a policy in form of a mixture with tw… ▽ More Majority of off-policy reinforcement learning algorithms use overestimation bias control techniques. Most of these techniques rooted in heuristics, primarily addressing the consequences of overestimation rather than its fundamental origins. In this work we present a novel approach to the bias correction, similar in spirit to Double Q-Learning. We propose using a policy in form of a mixture with two components. Each policy component is maximized and assessed by separate networks, which removes any basis for the overestimation bias. Our approach shows promising near-SOTA results on a small set of MuJoCo environments. △ Less

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2306.04288 [pdf, other]

Revising deep learning methods in parking lot occupancy detection

Authors: Anastasia Martynova, Mikhail Kuznetsov, Vadim Porvatov, Vladislav Tishin, Andrey Kuznetsov, Natalia Semenova, Ksenia Kuznetsova

Abstract: Parking guidance systems have recently become a popular trend as a part of the smart cities' paradigm of development. The crucial part of such systems is the algorithm allowing drivers to search for available parking lots across regions of interest. The classic approach to this task is based on the application of neural network classifiers to camera records. However, existing systems demonstrate a… ▽ More Parking guidance systems have recently become a popular trend as a part of the smart cities' paradigm of development. The crucial part of such systems is the algorithm allowing drivers to search for available parking lots across regions of interest. The classic approach to this task is based on the application of neural network classifiers to camera records. However, existing systems demonstrate a lack of generalization ability and appropriate testing regarding specific visual conditions. In this study, we extensively evaluate state-of-the-art parking lot occupancy detection algorithms, compare their prediction quality with the recently emerged vision transformers, and propose a new pipeline based on EfficientNet architecture. Performed computational experiments have demonstrated the performance increase in the case of our model, which was evaluated on 5 different datasets. △ Less

Submitted 12 February, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: 15 pages, 7 figures

arXiv:2305.19000 [pdf, other]

Independent Component Alignment for Multi-Task Learning

Authors: Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, Anton Konushin

Abstract: In a multi-task learning (MTL) setting, a single model is trained to tackle a diverse set of tasks jointly. Despite rapid progress in the field, MTL remains challenging due to optimization issues such as conflicting and dominating gradients. In this work, we propose using a condition number of a linear system of gradients as a stability criterion of an MTL optimization. We theoretically demonstrat… ▽ More In a multi-task learning (MTL) setting, a single model is trained to tackle a diverse set of tasks jointly. Despite rapid progress in the field, MTL remains challenging due to optimization issues such as conflicting and dominating gradients. In this work, we propose using a condition number of a linear system of gradients as a stability criterion of an MTL optimization. We theoretically demonstrate that a condition number reflects the aforementioned optimization issues. Accordingly, we present Aligned-MTL, a novel MTL optimization approach based on the proposed criterion, that eliminates instability in the training process by aligning the orthogonal components of the linear system of gradients. While many recent MTL approaches guarantee convergence to a minimum, task trade-offs cannot be specified in advance. In contrast, Aligned-MTL provably converges to an optimal point with pre-defined task-specific weights, which provides more control over the optimization result. Through experiments, we show that the proposed approach consistently improves performance on a diverse set of MTL benchmarks, including semantic and instance segmentation, depth estimation, surface normal estimation, and reinforcement learning. The source code is publicly available at https://github.com/SamsungLabs/MTL . △ Less

Submitted 30 May, 2023; originally announced May 2023.

Journal ref: CVPR2023

arXiv:2303.16531 [pdf, other]

RusTitW: Russian Language Text Dataset for Visual Text in-the-Wild Recognition

Authors: Igor Markov, Sergey Nesteruk, Andrey Kuznetsov, Denis Dimitrov

Abstract: Information surrounds people in modern life. Text is a very efficient type of information that people use for communication for centuries. However, automated text-in-the-wild recognition remains a challenging problem. The major limitation for a DL system is the lack of training data. For the competitive performance, training set must contain many samples that replicate the real-world cases. While… ▽ More Information surrounds people in modern life. Text is a very efficient type of information that people use for communication for centuries. However, automated text-in-the-wild recognition remains a challenging problem. The major limitation for a DL system is the lack of training data. For the competitive performance, training set must contain many samples that replicate the real-world cases. While there are many high-quality datasets for English text recognition; there are no available datasets for Russian language. In this paper, we present a large-scale human-labeled dataset for Russian text recognition in-the-wild. We also publish a synthetic dataset and code to reproduce the generation process △ Less

Submitted 29 March, 2023; originally announced March 2023.

Comments: 5 pages, 6 figures, 2 tables

arXiv:2208.14861 [pdf, other]

doi 10.1145/3526113.3545693

Fuse: In-Situ Sensemaking Support in the Browser

Authors: Andrew Kuznetsov, Joseph Chee Chang, Nathan Hahn, Napol Rachatasumrit, Bradley Breneisen, Julina Coupland, Aniket Kittur

Abstract: People spend a significant amount of time trying to make sense of the internet, collecting content from a variety of sources and organizing it to make decisions and achieve their goals. While humans are able to fluidly iterate on collecting and organizing information in their minds, existing tools and approaches introduce significant friction into the process. We introduce Fuse, a browser extensio… ▽ More People spend a significant amount of time trying to make sense of the internet, collecting content from a variety of sources and organizing it to make decisions and achieve their goals. While humans are able to fluidly iterate on collecting and organizing information in their minds, existing tools and approaches introduce significant friction into the process. We introduce Fuse, a browser extension that externalizes users' working memory by combining low-cost collection with lightweight organization of content in a compact card-based sidebar that is always available. Fuse helps users simultaneously extract key web content and structure it in a lightweight and visual way. We discuss how these affordances help users externalize more of their mental model into the system (e.g., saving, annotating, and structuring items) and support fast reviewing and resumption of task contexts. Our 22-month public deployment and follow-up interviews provide longitudinal insights into the structuring behaviors of real-world users conducting information foraging tasks. △ Less

Submitted 31 August, 2022; originally announced August 2022.

arXiv:2208.00496 [pdf, other]

doi 10.1145/3526113.3545661

Wigglite: Low-cost Information Collection and Triage

Authors: Michael Xieyang Liu, Andrew Kuznetsov, Yongsung Kim, Joseph Chee Chang, Aniket Kittur, Brad A. Myers

Abstract: Consumers conducting comparison shop**, researchers making sense of competitive space, and developers looking for code snippets online all face the challenge of capturing the information they find for later use without interrupting their current flow. In addition, during many learning and exploration tasks, people need to externalize their mental context, such as estimating how urgent a topic is… ▽ More Consumers conducting comparison shop**, researchers making sense of competitive space, and developers looking for code snippets online all face the challenge of capturing the information they find for later use without interrupting their current flow. In addition, during many learning and exploration tasks, people need to externalize their mental context, such as estimating how urgent a topic is to follow up on, or rating a piece of evidence as a "pro" or "con," which helps scaffold subsequent deeper exploration. However, current approaches incur a high cost, often requiring users to select, copy, context switch, paste, and annotate information in a separate document without offering specific affordances that capture their mental context. In this work, we explore a new interaction technique called "wiggling," which can be used to fluidly collect, organize, and rate information during early sensemaking stages with a single gesture. Wiggling involves rapid back-and-forth movements of a pointer or up-and-down scrolling on a smartphone, which can indicate the information to be collected and its valence, using a single, light-weight gesture that does not interfere with other interactions that are already available. Through implementation and user evaluation, we found that wiggling helped participants accurately collect information and encode their mental context with a 58% reduction in operational cost while being 24% faster compared to a common baseline. △ Less

Submitted 31 July, 2022; originally announced August 2022.

arXiv:2202.10784 [pdf, other]

RuCLIP -- new models and experiments: a technical report

Authors: Alex Shonenkov, Andrey Kuznetsov, Denis Dimitrov, Tatyana Shavrina, Daniil Chesakov, Anastasia Maltseva, Alena Fenogenova, Igor Pavlov, Anton Emelyanov, Sergey Markov, Daria Bakshandaeva, Vera Shybaeva, Andrey Chertok

Abstract: In the report we propose six new implementations of ruCLIP model trained on our 240M pairs. The accuracy results are compared with original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform CLIP + OPUS-MT solution on most of the datasets in few-show and zero-shot tasks. In the report we briefly describe the implementations and co… ▽ More In the report we propose six new implementations of ruCLIP model trained on our 240M pairs. The accuracy results are compared with original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform CLIP + OPUS-MT solution on most of the datasets in few-show and zero-shot tasks. In the report we briefly describe the implementations and concentrate on the conducted experiments. Inference execution time comparison is also presented in the report. △ Less

Submitted 22 February, 2022; originally announced February 2022.

arXiv:2202.03046 [pdf, other]

A new face swap method for image and video domains: a technical report

Authors: Daniil Chesakov, Anastasia Maltseva, Alexander Groshev, Andrey Kuznetsov, Denis Dimitrov

Abstract: Deep fake technology became a hot field of research in the last few years. Researchers investigate sophisticated Generative Adversarial Networks (GAN), autoencoders, and other approaches to establish precise and robust algorithms for face swap**. Achieved results show that the deep fake unsupervised synthesis task has problems in terms of the visual quality of generated data. These problems usua… ▽ More Deep fake technology became a hot field of research in the last few years. Researchers investigate sophisticated Generative Adversarial Networks (GAN), autoencoders, and other approaches to establish precise and robust algorithms for face swap**. Achieved results show that the deep fake unsupervised synthesis task has problems in terms of the visual quality of generated data. These problems usually lead to high fake detection accuracy when an expert analyzes them. The first problem is that existing image-to-image approaches do not consider video domain specificity and frame-by-frame processing leads to face jittering and other clearly visible distortions. Another problem is the generated data resolution, which is low for many existing methods due to high computational complexity. The third problem appears when the source face has larger proportions (like bigger cheeks), and after replacement it becomes visible on the face border. Our main goal was to develop such an approach that could solve these problems and outperform existing solutions on a number of clue metrics. We introduce a new face swap pipeline that is based on FaceShifter architecture and fixes the problems stated above. With a new eye loss function, super-resolution block, and Gaussian-based face mask generation leads to improvements in quality which is confirmed during evaluation. △ Less

Submitted 7 February, 2022; originally announced February 2022.

arXiv:2111.10974 [pdf, other]

Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture

Authors: Daria Bakshandaeva, Denis Dimitrov, Vladimir Arkhipkin, Alex Shonenkov, Mark Potanin, Denis Karachev, Andrey Kuznetsov, Anton Voronov, Vera Davydova, Elena Tutubalina, Aleksandr Petiushko

Abstract: Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called Fusion Brain, the first competition which is targeted to make the universal architecture which could process different modalities (in this case, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge combines the following specific tasks: Code2code Transl… ▽ More Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called Fusion Brain, the first competition which is targeted to make the universal architecture which could process different modalities (in this case, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge combines the following specific tasks: Code2code Translation, Handwritten Text recognition, Zero-shot Object Detection, and Visual Question Answering. We have created datasets for each task to test the participants' submissions on it. Moreover, we have collected and made publicly available a new handwritten dataset in both English and Russian, which consists of 94,128 pairs of images and texts. We also propose a multimodal and multitask architecture - a baseline solution, in the center of which is a frozen foundation model and which has been trained in Fusion mode along with Single-task mode. The proposed Fusion approach proves to be competitive and more energy-efficient compared to the task-specific one. △ Less

Submitted 28 December, 2022; v1 submitted 21 November, 2021; originally announced November 2021.

arXiv:2110.13523 [pdf, other]

Automating Control of Overestimation Bias for Reinforcement Learning

Authors: Arsenii Kuznetsov, Alexander Grishin, Artem Tsypin, Arsenii Ashukha, Artur Kadurin, Dmitry Vetrov

Abstract: Overestimation bias control techniques are used by the majority of high-performing off-policy reinforcement learning algorithms. However, most of these techniques rely on pre-defined bias correction policies that are either not flexible enough or require environment-specific tuning of hyperparameters. In this work, we present a general data-driven approach for the automatic selection of bias contr… ▽ More Overestimation bias control techniques are used by the majority of high-performing off-policy reinforcement learning algorithms. However, most of these techniques rely on pre-defined bias correction policies that are either not flexible enough or require environment-specific tuning of hyperparameters. In this work, we present a general data-driven approach for the automatic selection of bias control hyperparameters. We demonstrate its effectiveness on three algorithms: Truncated Quantile Critics, Weighted Delayed DDPG, and Maxmin Q-learning. The proposed technique eliminates the need for an extensive hyperparameter search. We show that it leads to a significant reduction of the actual number of interactions while preserving the performance. △ Less

Submitted 28 January, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

arXiv:2105.05977 [pdf, other]

Spelling Correction with Denoising Transformer

Authors: Alex Kuznetsov, Hector Urdiales

Abstract: We present a novel method of performing spelling correction on short input strings, such as search queries or individual words. At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans. This procedure is used to train the production spelling correction model based on a transformer architecture. This model is currently served in the H… ▽ More We present a novel method of performing spelling correction on short input strings, such as search queries or individual words. At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans. This procedure is used to train the production spelling correction model based on a transformer architecture. This model is currently served in the HubSpot product search. We show that our approach to typo generation is superior to the widespread practice of adding noise, which ignores human patterns. We also demonstrate how our approach may be extended to resource-scarce settings and train spelling correction models for Arabic, Greek, Russian, and Setswana languages, without using any labeled data. △ Less

Submitted 12 May, 2021; originally announced May 2021.

Comments: 9 pages, 3 figures

arXiv:2104.02789 [pdf, other]

NeuMIP: Multi-Resolution Neural Materials

Authors: Alexandr Kuznetsov, Krishna Mullia, Zexiang Xu, Miloš Hašan, Ravi Ramamoorthi

Abstract: We propose NeuMIP, a neural method for representing and rendering a variety of material appearances at different scales. Classical prefiltering (mipmap**) methods work well on simple material properties such as diffuse color, but fail to generalize to normals, self-shadowing, fibers or more complex microstructures and reflectances. In this work, we generalize traditional mipmap pyramids to pyram… ▽ More We propose NeuMIP, a neural method for representing and rendering a variety of material appearances at different scales. Classical prefiltering (mipmap**) methods work well on simple material properties such as diffuse color, but fail to generalize to normals, self-shadowing, fibers or more complex microstructures and reflectances. In this work, we generalize traditional mipmap pyramids to pyramids of neural textures, combined with a fully connected network. We also introduce neural offsets, a novel method which allows rendering materials with intricate parallax effects without any tessellation. This generalizes classical parallax map**, but is trained without supervision by any explicit heightfield. Neural materials within our system support a 7-dimensional query, including position, incoming and outgoing direction, and the desired filter kernel size. The materials have small storage (on the order of standard mipmap** except with more texture channels), and can be integrated within common Monte-Carlo path tracing systems. We demonstrate our method on a variety of materials, resulting in complex appearance across levels of detail, with accurate parallax, self-shadowing, and other effects. △ Less

Submitted 6 April, 2021; originally announced April 2021.

arXiv:2010.01775 [pdf, other]

Photon-Driven Neural Path Guiding

Authors: Shilin Zhu, Zexiang Xu, Tiancheng Sun, Alexandr Kuznetsov, Mark Meyer, Henrik Wann Jensen, Hao Su, Ravi Ramamoorthi

Abstract: Although Monte Carlo path tracing is a simple and effective algorithm to synthesize photo-realistic images, it is often very slow to converge to noise-free results when involving complex global illumination. One of the most successful variance-reduction techniques is path guiding, which can learn better distributions for importance sampling to reduce pixel noise. However, previous methods require… ▽ More Although Monte Carlo path tracing is a simple and effective algorithm to synthesize photo-realistic images, it is often very slow to converge to noise-free results when involving complex global illumination. One of the most successful variance-reduction techniques is path guiding, which can learn better distributions for importance sampling to reduce pixel noise. However, previous methods require a large number of path samples to achieve reliable path guiding. We present a novel neural path guiding approach that can reconstruct high-quality sampling distributions for path guiding from a sparse set of samples, using an offline trained neural network. We leverage photons traced from light sources as the input for sampling density reconstruction, which is highly effective for challenging scenes with strong global illumination. To fully make use of our deep neural network, we partition the scene space into an adaptive hierarchical grid, in which we apply our network to reconstruct high-quality sampling distributions for any local region in the scene. This allows for highly efficient path guiding for any path bounce at any location in path tracing. We demonstrate that our photon-driven neural path guiding method can generalize well on diverse challenging testing scenes that are not seen in training. Our approach achieves significantly better rendering results of testing scenes than previous state-of-the-art path guiding methods. △ Less

Submitted 5 October, 2020; originally announced October 2020.

Comments: Keywords: computer graphics, rendering, path tracing, path guiding, machine learning, neural networks, denoising, reconstruction

arXiv:2005.04269 [pdf, other]

Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

Authors: Arsenii Kuznetsov, Pavel Shvechikov, Alexander Grishin, Dmitry Vetrov

Abstract: The overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting. Our method---Truncated Quantile Critics, TQC,---blends three ideas: distributional representation of a critic, truncation of critics prediction, and ensembling of multiple critics. Distributional represent… ▽ More The overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting. Our method---Truncated Quantile Critics, TQC,---blends three ideas: distributional representation of a critic, truncation of critics prediction, and ensembling of multiple critics. Distributional representation and truncation allow for arbitrary granular overestimation control, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating 25% improvement on the most challenging Humanoid environment. △ Less

Submitted 8 May, 2020; originally announced May 2020.

Comments: Under review by the International Conference on Machine Learning

arXiv:2004.08606 [pdf, other]

Feathers dataset for Fine-Grained Visual Categorization

Authors: Alina Belko, Konstantin Dobratulin, Andrey Kuznetsov

Abstract: This paper introduces a novel dataset FeatherV1, containing 28,272 images of feathers categorized by 595 bird species. It was created to perform taxonomic identification of bird species by a single feather, which can be applied in amateur and professional ornithology. FeatherV1 is the first publicly available bird's plumage dataset for machine learning, and it can raise interest for a new task in… ▽ More This paper introduces a novel dataset FeatherV1, containing 28,272 images of feathers categorized by 595 bird species. It was created to perform taxonomic identification of bird species by a single feather, which can be applied in amateur and professional ornithology. FeatherV1 is the first publicly available bird's plumage dataset for machine learning, and it can raise interest for a new task in fine-grained visual recognition domain. The latest version of the dataset can be downloaded at https://github.com/feathers-dataset/feathersv1-dataset. We also present feathers classification task results. We selected several deep learning architectures (DenseNet based) for categorical crossentropy values comparison on the provided dataset. △ Less

Submitted 18 April, 2020; originally announced April 2020.

Comments: 6 pages, 6 figures, 3 tables

arXiv:1811.03346 [pdf]

System of Bibliometric Monitoring Sciences in Ukraine

Authors: Leonid Kostenko, Alexander Zhabin, Alexander Kuznetsov, Elizabeth Kukharchuk, Tatiana Simonenko

Abstract: The origins of scientometrics (research metrics) were analysed and the lack of attention to elaboration of its methodology was emphasized. The approaches to evaluation of scientific activity outcome were considered and the tendency of transition from formal quantitative indicators to receiving expert conclusion on the basis of bibliometric indicators was noted. The principles of the Leiden manifes… ▽ More The origins of scientometrics (research metrics) were analysed and the lack of attention to elaboration of its methodology was emphasized. The approaches to evaluation of scientific activity outcome were considered and the tendency of transition from formal quantitative indicators to receiving expert conclusion on the basis of bibliometric indicators was noted. The principles of the Leiden manifesto of scientometrics were set out, kee** to which provides clear monitoring and support of science development, and favours establishing of the constructive dialog between scientific environment and society as well. The conceptual statements and peculiarities of practical realization of the informative and analytical system "Bibliometryka Ukraynskoyi Nauky" ("Bibliometrics of the Ukrainian Science") elaborated by the Vernadsky National Library of Ukraine, are being represented. The proposals on the formation of advisory councils, which are aimed to adopt conclusions on the effectiveness of research activity of institutions, were considered. The feasibility of building a common platform for expert evaluation of scientific studies for countries of the Eastern Partnership by initiating similar bibliometric projects in these countries and their further convergence is proved. △ Less

Submitted 8 November, 2018; originally announced November 2018.

Comments: 13 pages

arXiv:1808.06199 [pdf, other]

Lower bound for the cost of connecting tree with given vertex degree sequence

Authors: Mikhail Goubko, Alexander Kuznetsov

Abstract: The optimal connecting network problem generalizes many models of structure optimization known from the literature, including communication and transport network topology design, graph cut and graph clustering, structure identification from data, etc. For the case of connecting trees with the given sequence of vertex degrees, the cost of the optimal tree is shown to be bounded from below by the so… ▽ More The optimal connecting network problem generalizes many models of structure optimization known from the literature, including communication and transport network topology design, graph cut and graph clustering, structure identification from data, etc. For the case of connecting trees with the given sequence of vertex degrees, the cost of the optimal tree is shown to be bounded from below by the solution of a semidefinite optimization program with bilinear matrix constraints, which is reduced to the solution of a series of convex programs with linear matrix inequality constraints. The proposed lower bound estimate is used to construct several heuristic algorithms and to evaluate their quality on a variety of generated and real-life data sets. Keywords: Optimal communication network, generalized Wiener index, origin-destination matrix, semidefinite programming, quadratic matrix inequality. △ Less

Submitted 20 August, 2018; v1 submitted 19 August, 2018; originally announced August 2018.

Comments: 29 pages, 6 figures, 2 tables

MSC Class: 05C05; 05C07; 05C12; 05C35; 05C50; 68R10; 90C06; 90C22; 90C35; 90C59; 94C15

arXiv:1804.03635 [pdf, other]

Semantic embeddings for program behavior patterns

Authors: Alexander Chistyakov, Ekaterina Lobacheva, Arseny Kuznetsov, Alexey Romanenko

Abstract: In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable s… ▽ More In this paper, we propose a new feature extraction technique for program execution logs. First, we automatically extract complex patterns from a program's behavior graph. Then, we embed these patterns into a continuous space by training an autoencoder. We evaluate the proposed features on a real-world malicious software detection task. We also find that the embedding space captures interpretable structures in the space of pattern parts. △ Less

Submitted 10 April, 2018; originally announced April 2018.

Comments: Published at Workshop track of ICLR 2017

arXiv:1601.02034 [pdf, other]

It's just a matter of perspective(s): Crowd-Powered Consensus Organization of Corpora

Authors: Ayush Jain, Joon Young Seo, Karan Goel, Andrew Kuznetsov, Aditya Parameswaran, Hari Sundaram

Abstract: We study the problem of organizing a collection of objects - images, videos - into clusters, using crowdsourcing. This problem is notoriously hard for computers to do automatically, and even with crowd workers, is challenging to orchestrate: (a) workers may cluster based on different latent hierarchies or perspectives; (b) workers may cluster at different granularities even when clustering using t… ▽ More We study the problem of organizing a collection of objects - images, videos - into clusters, using crowdsourcing. This problem is notoriously hard for computers to do automatically, and even with crowd workers, is challenging to orchestrate: (a) workers may cluster based on different latent hierarchies or perspectives; (b) workers may cluster at different granularities even when clustering using the same perspective; and (c) workers may only see a small portion of the objects when deciding how to cluster them (and therefore have limited understanding of the "big picture"). We develop cost-efficient, accurate algorithms for identifying the consensus organization (i.e., the organizing perspective most workers prefer to employ), and incorporate these algorithms into a cost-effective workflow for organizing a collection of objects, termed ORCHESTRA. We compare our algorithms with other algorithms for clustering, on a variety of real-world datasets, and demonstrate that ORCHESTRA organizes items better and at significantly lower costs. △ Less

Submitted 8 January, 2016; originally announced January 2016.

Showing 1–28 of 28 results for author: Kuznetsov, A