Search | arXiv e-print repository

Improving Diffusion Models's Data-Corruption Resistance using Scheduled Pseudo-Huber Loss

Authors: Artem Khrapov, Vadim Popov, Tasnima Sadekova, Assel Yermekova, Mikhail Kudinov

Abstract: Diffusion models are known to be vulnerable to outliers in training data. In this paper we study an alternative diffusion loss function, which can preserve the high quality of generated data like the original squared $L_{2}$ loss while at the same time being robust to outliers. We propose to use pseudo-Huber loss function with a time-dependent parameter to allow for the trade-off between robustness on the most vulnerable early reverse-diffusion steps and fine details restoration on the final steps. We show that pseudo-Huber loss with the time-dependent parameter exhibits better performance on corrupted datasets in both image and audio domains. In addition, the loss function we propose can potentially help diffusion models to resist dataset corruption while not requiring data filtering or purification compared to conventional training algorithms.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 13 pages, 16 figures

arXiv:2403.14561 [pdf, other]

doi 10.1145/3613904.3641929

Looking Together $\neq$ Seeing the Same Thing: Understanding Surgeons' Visual Needs During Intra-operative Coordination and Instruction

Authors: Vitaliy Popov, Xinyue Chen, Jingying Wang, Michael Kemp, Gurjit Sandhu, Taylor Kantor, Natalie Mateju, Xu Wang

Abstract: Shared gaze visualizations have been found to enhance collaboration and communication outcomes in diverse HCI scenarios including computer supported collaborative work and learning contexts. Given the importance of gaze in surgery operations, especially when a surgeon trainer and trainee need to coordinate their actions, research on the use of gaze to facilitate intra-operative coordination and instruction has been limited and shows mixed implications. We performed a field observation of 8 surgeries and an interview study with 14 surgeons to understand their visual needs during operations, informing ways to leverage and augment gaze to enhance intra-operative coordination and instruction. We found that trainees have varying needs in receiving visual guidance which are often unfulfilled by the trainers' instructions. It is critical for surgeons to control the timing of the gaze-based visualizations and effectively interpret gaze data. We suggest overlay technologies, e.g., gaze-based summaries and depth sensing, to augment raw gaze in support of surgical coordination and instruction.

Submitted 21 March, 2024; originally announced March 2024.

Journal ref: CHI'2024

arXiv:2402.17903 [pdf, other]

doi 10.1145/3613904.3642587

Surgment: Segmentation-enabled Semantic Search and Creation of Visual Question and Feedback to Support Video-Based Surgery Learning

Authors: Jingying Wang, Haoran Tang, Taylor Kantor, Tandis Soltani, Vitaliy Popov, Xu Wang

Abstract: Videos are prominent learning materials to prepare surgical trainees before they enter the operating room (OR). In this work, we explore techniques to enrich the video-based surgery learning experience. We propose Surgment, a system that helps expert surgeons create exercises with feedback based on surgery recordings. Surgment is powered by a few-shot-learning-based pipeline (SegGPT+SAM) to segment surgery scenes, achieving an accuracy of 92%. The segmentation pipeline enables functionalities to create visual questions and feedback desired by surgeons from a formative study. Surgment enables surgeons to 1) retrieve frames of interest through sketches, and 2) design exercises that target specific anatomical components and offer visual feedback. In an evaluation study with 11 surgeons, participants applauded the search-by-sketch approach for identifying frames of interest and found the resulting image-based questions and feedback to be of high educational value.

Submitted 27 February, 2024; originally announced February 2024.

Journal ref: CHI'2024

arXiv:2312.03759 [pdf, ps, other]

How should the advent of large language models affect the practice of science?

Authors: Marcel Binz, Stephan Alaniz, Adina Roskies, Balazs Aczel, Carl T. Bergstrom, Colin Allen, Daniel Schad, Dirk Wulff, Jevin D. West, Qiong Zhang, Richard M. Shiffrin, Samuel J. Gershman, Ven Popov, Emily M. Bender, Marco Marelli, Matthew M. Botvinick, Zeynep Akata, Eric Schulz

Abstract: Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advent of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and over-hyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2109.13821 [pdf, other]

Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme

Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, Jiansheng Wei

Abstract: Voice conversion is a common speech synthesis task which can be solved in different ways depending on a particular real-world scenario. The most challenging one often referred to as one-shot many-to-many voice conversion consists in copying the target voice from only one reference utterance in the most general case when both source and target speakers do not belong to the training dataset. We present a scalable high-quality solution based on diffusion probabilistic modeling and demonstrate its superior quality compared to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on real-time applications, we investigate general principles which can make diffusion models faster while keeping synthesis quality at a high level. As a result, we develop a novel Stochastic Differential Equations solver suitable for various diffusion model types and generative tasks as shown through empirical studies and justify it by theoretical analysis.

Submitted 4 August, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

arXiv:2105.06337 [pdf, other]

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Authors: Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov

Abstract: Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. We will make the code publicly available shortly.

Submitted 5 August, 2021; v1 submitted 13 May, 2021; originally announced May 2021.

arXiv:1811.04623 [pdf, ps, other]

Fine-tuning of Language Models with Discriminator

Authors: Vadim Popov, Mikhail Kudinov

Abstract: Cross-entropy loss is a common choice when it comes to multiclass classification tasks and language modeling in particular. Minimizing this loss results in language models of very good quality. We show that it is possible to fine-tune these models and make them perform even better if they are fine-tuned with sum of cross-entropy loss and reverse Kullback-Leibler divergence. The latter is estimated using discriminator network that we train in advance. During fine-tuning probabilities of rare words that are usually underestimated by language models become bigger. The novel approach that we propose allows us to reach state-of-the-art quality on Penn Treebank: perplexity decreases from 52.4 to 52.1. Our fine-tuning algorithm is rather fast, scales well to different architectures and datasets and requires almost no hyperparameter tuning: the only hyperparameter that needs to be tuned is learning rate.

Submitted 15 January, 2019; v1 submitted 12 November, 2018; originally announced November 2018.

arXiv:1712.07473 [pdf, ps, other]

Differentially Private Distributed Learning for Language Modeling Tasks

Authors: Vadim Popov, Mikhail Kudinov, Irina Piontkovskaya, Petr Vytovtov, Alex Nevidomsky

Abstract: One of the big challenges in machine learning applications is that training data can be different from the real-world data faced by the algorithm. In language modeling, users' language (e.g. in private messaging) could change in a year and be completely different from what we observe in publicly available data. At the same time, public data can be used for obtaining general knowledge (i.e. general model of English). We study approaches to distributed fine-tuning of a general model on user private data with the additional requirements of maintaining the quality on the general data and minimization of communication costs. We propose a novel technique that significantly improves prediction quality on users' language compared to a general model and outperforms gradient compression methods in terms of communication efficiency. The proposed procedure is fast and leads to an almost 70% perplexity reduction and 8.7 percentage point improvement in keystroke saving rate on informal English texts. We also show that the range of tasks our approach is applicable to is not limited by language modeling only. Finally, we propose an experimental framework for evaluating differential privacy of distributed training of language models and show that our approach has good privacy guarantees.

Submitted 6 March, 2018; v1 submitted 20 December, 2017; originally announced December 2017.

arXiv:1412.4316 [pdf]

Hasq Hash Chains

Authors: Oleg Mazonka, Vlad Popov

Abstract: This paper describes a particular hash-based records linking chain scheme. This scheme is simple conceptually and easy to implement in software. It allows for a simple and secure way to transfer ownership of digital objects between peers.

Submitted 14 December, 2014; originally announced December 2014.

arXiv:1309.4507 [pdf]

Faster Fair Solution for the Reader-Writer Problem

Authors: Vlad Popov, Oleg Mazonka

Abstract: A fast fair solution for Reader-Writer Problem is presented.

Submitted 17 September, 2013; originally announced September 2013.

arXiv:1104.4433 [pdf, ps, other]

Arc-preserving subsequences of arc-annotated sequences

Authors: Vladimir Yu. Popov

Abstract: Arc-annotated sequences are useful in representing the structural information of RNA and protein sequences. The longest arc-preserving common subsequence problem has been introduced as a framework for studying the similarity of arc-annotated sequences. In this paper, we consider arc-annotated sequences with various arc structures. We consider the longest arc preserving common subsequence problem. In particular, we show that the decision version of the 1-{\sc fragment LAPCS(crossing,chain)} and the decision version of the 0-{\sc diagonal LAPCS(crossing,chain)} are {\bf NP}-complete for some fixed alphabet $Σ$ such that $|Σ| = 2$. Also we show that if $|Σ| = 1$, then the decision version of the 1-{\sc fragment LAPCS(unlimited, plain)} and the decision version of the 0-{\sc diagonal LAPCS(unlimited, plain)} are {\bf NP}-complete.

Submitted 22 April, 2011; originally announced April 2011.

MSC Class: 68Q15 ACM Class: F.1.3

Journal ref: Acta Univ. Sapientiae, Informatica 3, 1 (2011) 35--47

arXiv:1011.3257 [pdf]

Integration of Flexible Web Based GUI in I-SOAS

Authors: Zeeshan Ahmed, Vasil Popov

Abstract: It is necessary to improve the concepts of the present web based graphical user interface for the development of more flexible and intelligent interface to provide ease and increase the level of comfort at user end like most of the desktop based applications. This research is conducted targeting the goal of implementing flexible GUI consisting of a visual component manager with different components by functionality, design and purpose. In this research paper we present a Rich Internet Application (RIA) based graphical user interface for web based product development, and going into the details we present a comparison between existing RIA Technologies, adopted methodology in the GUI development and developed prototype.

Submitted 14 November, 2010; originally announced November 2010.

Comments: In the proceedings of 6th I*PROMS Virtual International Conference on Innovative Production Machines and Systems (IPROMS 2010), Session Production Organisation and Management, Cardiff University, Whittles Publishing, Scotland UK, 15-26 November, 2010

Showing 1–12 of 12 results for author: Popov, V