-
Kandinsky 3.0 Technical Report
Authors:
Vladimir Arkhipkin,
Andrei Filatov,
Viacheslav Vasilev,
Anastasia Maltseva,
Said Azizov,
Igor Pavlov,
Julia Agafonova,
Andrey Kuznetsov,
Denis Dimitrov
Abstract:
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3
Submitted 28 June, 2024; v1 submitted 6 December, 2023;
originally announced December 2023.
-
FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Authors:
Vladimir Arkhipkin,
Zein Shaheen,
Viacheslav Vasilev,
Elizaveta Dakhova,
Andrey Kuznetsov,
Denis Dimitrov
Abstract:
Multimedia generation approaches occupy a prominent place in artificial intelligence research. Text-to-image models achieved high-quality results over the last few years. However, video synthesis methods recently started to develop. This paper presents a new two-stage latent diffusion text-to-video generation architecture based on the text-to-image diffusion model. The first stage concerns keyframes synthesis to figure the storyline of a video, while the second one is devoted to interpolation frames generation to make movements of the scene and objects smooth. We compare several temporal conditioning approaches for keyframes generation. The results show the advantage of using separate temporal blocks over temporal layers in terms of metrics reflecting video generation quality aspects and human preference. The design of our interpolation model significantly reduces computational costs compared to other masked frame interpolation approaches. Furthermore, we evaluate different configurations of MoVQ-based video decoding scheme to improve consistency and achieve higher PSNR, SSIM, MSE, and LPIPS scores. Finally, we compare our pipeline with existing solutions and achieve top-2 scores overall and top-1 among open-source solutions: CLIPSIM = 0.2976 and FVD = 433.054. Project page: https://ai-forever.github.io/kandinsky-video/
Submitted 20 December, 2023; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion
Authors:
Anton Razzhigaev,
Arseniy Shakhmatov,
Anastasia Maltseva,
Vladimir Arkhipkin,
Igor Pavlov,
Ilya Ryabov,
Angelina Kuts,
Alexander Panchenko,
Andrey Kuznetsov,
Denis Dimitrov
Abstract:
Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
Submitted 5 October, 2023;
originally announced October 2023.
-
Star-Shaped Denoising Diffusion Probabilistic Models
Authors:
Andrey Okhotin,
Dmitry Molchanov,
Vladimir Arkhipkin,
Grigory Bartosh,
Viktor Ohanesian,
Aibek Alanov,
Dmitry Vetrov
Abstract:
Denoising Diffusion Probabilistic Models (DDPMs) provide the foundation for the recent breakthroughs in generative modeling. Their Markovian structure makes it difficult to define DDPMs with distributions other than Gaussian or discrete. In this paper, we introduce Star-Shaped DDPM (SS-DDPM). Its star-shaped diffusion process allows us to bypass the need to define the transition probabilities or compute posteriors. We establish duality between star-shaped and specific Markovian diffusions for the exponential family of distributions and derive efficient algorithms for training and sampling from SS-DDPMs. In the case of Gaussian distributions, SS-DDPM is equivalent to DDPM. However, SS-DDPMs provide a simple recipe for designing diffusion models with distributions such as Beta, von Mises–Fisher, Dirichlet, Wishart and others, which can be especially useful when data lies on a constrained manifold. We evaluate the model in different settings and find it competitive even on image data, where Beta SS-DDPM achieves results comparable to a Gaussian DDPM. Our implementation is available at https://github.com/andrey-okhotin/star-shaped .
Submitted 28 October, 2023; v1 submitted 10 February, 2023;
originally announced February 2023.
-
Eco2AI: carbon emissions tracking of machine learning models as the first step towards sustainable AI
Authors:
Semen Budennyy,
Vladimir Lazarev,
Nikita Zakharenko,
Alexey Korovin,
Olga Plosskaya,
Denis Dimitrov,
Vladimir Arkhipkin,
Ivan Oseledets,
Ivan Barsola,
Ilya Egorov,
Aleksandra Kosterina,
Leonid Zhukov
Abstract:
The size and complexity of deep neural networks continue to grow exponentially, significantly increasing energy consumption for training and inference by these models. We introduce an open-source package eco2AI to help data scientists and researchers to track energy consumption and equivalent CO2 emissions of their models in a straightforward way. In eco2AI we put emphasis on accuracy of energy consumption tracking and correct regional CO2 emissions accounting. We encourage research community to search for new optimal Artificial Intelligence (AI) architectures with a lower computational cost. The motivation also comes from the concept of AI-based green house gases sequestrating cycle with both Sustainable AI and Green AI pathways.
Submitted 3 August, 2022; v1 submitted 31 July, 2022;
originally announced August 2022.
-
Many Heads but One Brain: Fusion Brain -- a Competition and a Single Multimodal Multitask Architecture
Authors:
Daria Bakshandaeva,
Denis Dimitrov,
Vladimir Arkhipkin,
Alex Shonenkov,
Mark Potanin,
Denis Karachev,
Andrey Kuznetsov,
Anton Voronov,
Vera Davydova,
Elena Tutubalina,
Aleksandr Petiushko
Abstract:
Supporting the current trend in the AI community, we present the AI Journey 2021 Challenge called Fusion Brain, the first competition which is targeted to make the universal architecture which could process different modalities (in this case, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge combines the following specific tasks: Code2code Translation, Handwritten Text recognition, Zero-shot Object Detection, and Visual Question Answering. We have created datasets for each task to test the participants' submissions on it. Moreover, we have collected and made publicly available a new handwritten dataset in both English and Russian, which consists of 94,128 pairs of images and texts. We also propose a multimodal and multitask architecture - a baseline solution, in the center of which is a frozen foundation model and which has been trained in Fusion mode along with Single-task mode. The proposed Fusion approach proves to be competitive and more energy-efficient compared to the task-specific one.
Submitted 28 December, 2022; v1 submitted 21 November, 2021;
originally announced November 2021.
-
Chiral Optical Tamm States: Temporal Coupled-Mode Theory
Authors:
Ivan V. Timofeev,
Pavel S. Pankin,
Stepan Ya. Vetrov,
Vasily G. Arkhipkin,
Wei Lee,
Victor Ya. Zyryanov
Abstract:
The chiral optical Tamm state (COTS) is a special localized state at the interface of a handedness-preserving mirror and a structurally chiral medium such as a cholesteric liquid crystal or a chiral sculptured thin film. The spectral behavior of COTS, observed as reflection resonances, is described by the temporal coupled-mode theory. Mode coupling is different for two circular light polarizations because COTS has a helix structure replicating that of the cholesteric. The mode coupling for co-handed circularly polarized light exponentially attenuates with the cholesteric layer thickness since the COTS frequency falls into the stop band. Cross-handed circularly polarized light freely goes through the cholesteric layer and can excite COTS when reflected from the handedness-preserving mirror. The coupling in this case is proportional to anisotropy of the cholesteric and theoretically it is only anisotropy of magnetic permittivity that can ultimately cancel this coupling. These two couplings being equal results in a polarization crossover (the Kopp–Genack effect) for which a linear polarization is optimal to excite COTS. The corresponding cholesteric thickness and scattering matrix for COTS are generally described by simple expressions.
Submitted 30 March, 2017; v1 submitted 1 March, 2017;
originally announced March 2017.
-
Quantum properties of a parametric four-wave mixing in a Raman-type atomic system
Authors:
A. V. Sharypov,
Bing He,
V. G. Arkhipkin,
S. A. Myslivets
Abstract:
We present a study of the quantum properties of two light fields used in parametric four-wave mixing in a Raman-type atomic system. The system realizes an effective Hamiltonian of beamsplitter-type coupling between the light fields, which allows control of the squeezing and amplitude distribution of the light fields, as well as their entanglement. The scheme can be feasibly applied to engineer the quantum properties of two single-mode light fields in properly chosen input states.
Submitted 9 December, 2016;
originally announced December 2016.
-
Coherently controlling Raman-induced grating in atomic media
Authors:
V. G. Arkhipkin,
S. A. Myslivets,
I. V. Timofeev
Abstract:
We consider dynamically controllable periodic structures, called Raman induced gratings, in three- and four-level atomic media, resulting from Raman interaction in a standing-wave pump. These gratings are due to periodic spatial modulation of the Raman nonlinearity and fundamentally differ from the ones based on electromagnetically induced transparency. The transmission and reflection spectra of such gratings can be simultaneously amplified and controlled by varying the pump field intensity. It is shown that a transparent medium with periodic spatial modulation of the Raman gain can be opaque near the Raman resonance and yet at the same time it can be a non-linear amplifying mirror. We also show that spectral properties of the Raman induced grating can be controlled with the help of an additional weak control field.
Submitted 26 November, 2015;
originally announced November 2015.
-
Geometric phase and o-mode blue shift in a chiral anisotropic medium inside a Fabry-Pérot cavity
Authors:
I. V. Timofeev,
V. A. Gunyakov,
V. S. Sutormin,
S. A. Myslivets,
V. G. Arkhipkin,
S. Ya. Vetrov,
W. Lee,
V. Ya. Zyryanov
Abstract:
Anomalous spectral shift of transmission peaks is observed in a Fabry–Pérot cavity filled with a chiral anisotropic medium. The effective refractive index value resides out of the interval between the ordinary and the extraordinary refractive indices. The spectral shift is explained by contribution of a geometric phase. The problem is solved analytically using the approximate Jones matrix method, numerically using the accurate Berreman method and geometrically using the generalized Mauguin–Poincaré rolling cone method. The $o$-mode blue shift is measured for a 4-methoxybenzylidene-4'-$n$-butylaniline twisted-nematic layer inside the Fabry–Pérot cavity. The twist is electrically induced due to the homeoplanar–twisted configuration transition in an ionic-surfactant-doped liquid crystal layer. Experimental evidence confirms the validity of the theoretical model.
Submitted 19 September, 2015;
originally announced September 2015.
-
Voltage-induced defect mode interaction in a one-dimensional photonic crystal with a twisted-nematic defect layer
Authors:
Ivan V. Timofeev,
Yu-Ting Lin,
Vladimir A. Gunyakov,
Sergey A. Myslivets,
Vasily G. Arkhipkin,
Stepan Ya. Vetrov,
Wei Lee,
Victor Ya. Zyryanov
Abstract:
Defect modes are investigated in a band gap of an electrically tunable one-dimensional photonic crystal infiltrated with a twisted-nematic liquid crystal (1D PC/TN). Their frequency shift and interference under applied voltage are studied both experimentally and theoretically. We deal with the case where the defect layer thickness is much larger than the wavelength (Mauguin condition). It is shown theoretically that the defect modes could have a complex structure with the elliptic polarization. Two series of polarized modes interact with each other and exhibit an avoided crossing phenomenon in the case of opposite parity.
Submitted 21 October, 2011;
originally announced October 2011.
-
Ultranarrow resonance peaks in the transmission and reflection spectra of a photonic crystal cavity with Raman gain
Authors:
V. G. Arkhipkin,
S. A. Myslivets
Abstract:
The Raman gain of a probe light in a three-state $Λ$-scheme placed into a defect of a one-dimensional photonic crystal is studied theoretically. We show that there exists a pump intensity range, where the transmission and reflection spectra of the probe field exhibit simultaneously occurring narrow peaks (resonances) whose position is determined by the Raman resonance. Transmission and reflection coefficients can be larger than unity at pump intensities of order tens of $μ$W/cm$^{2}$. When the pump intensity is outside this region, the peak in the transmission spectrum turns into a narrow dip. The nature of narrow resonances is attributed to a drastic dispersion of the nonlinear refractive index in the vicinity of the Raman transition, which leads to a significant reduction of the group velocity of the probe wave.
Submitted 1 September, 2009;
originally announced September 2009.
-
Temporal shape manipulation of adiabatons
Authors:
V. G. Arkhipkin,
I. V. Timofeev
Abstract:
We describe how to control the temporal shape of adiabaton using peculiarities of propagation dynamics under coherent population trapping. Temporal compression is demonstrated as a special case of pulse shaping. The general case of unequal oscillator strengths of two optical transitions in atom is considered.
Submitted 6 August, 2005; v1 submitted 23 June, 2005;
originally announced June 2005.
-
Spatial evolution of short pulses under coherent population trapping
Authors:
V. G. Arkhipkin,
I. V. Timofeev
Abstract:
Spatial and temporal evolution is studied of two powerful short laser pulses having different wavelengths and interacting with a dense three-level Lambda-type optical medium under coherent population trapping. A general case of unequal oscillator strengths of the transitions is considered. Durations of the probe pulse and the coupling pulse $T_{1,2}$ ($T_2>T_1$) are assumed to be shorter than any of the relevant atomic relaxation times. We propose analytical and numerical solutions of a self-consistent set of coupled Schrödinger equations and reduced wave equations in the adiabatic limit with the account of the first non-adiabatic correction. The adiabaticity criterion is also discussed with the account of the pulse propagation. The dynamics of propagation is found to be strongly dependent on the ratio of the transition oscillator strengths. It is shown that envelopes of the pulses slightly change throughout the medium length at the initial stage of propagation. This distance can be large compared to the one-photon resonant absorption length. Eventually, the probe pulse is completely reemitted into the coupling pulse during propagation. The effect of localization of the atomic coherence has been observed similar to the one predicted by Fleischhauer and Lukin (PRL 84, 5094 (2000)).
Submitted 30 July, 2001; v1 submitted 23 March, 2001;
originally announced March 2001.