Search | arXiv e-print repository

Cross-view Masked Diffusion Transformers for Person Image Synthesis

Authors: Trung X. Pham, Zhang Kang, Chang D. Yoo

Abstract: We present X-MDPT ($\underline{Cross}$-view $\underline{M}$asked $\underline{D}$iffusion $\underline{P}$rediction $\underline{T}$ransformers), a novel diffusion model designed for pose-guided human image generation. X-MDPT distinguishes itself by employing masked diffusion transformers that operate on latent patches, a departure from the commonly used Unet structures in existing works. The model comprises three key modules: 1) a denoising diffusion Transformer, 2) an aggregation network that consolidates conditions into a single vector for the diffusion process, and 3) a mask cross-prediction module that enhances representation learning with semantic information from the reference image. X-MDPT demonstrates scalability, improving FID, SSIM, and LPIPS with larger models. Despite its simple design, our model outperforms state-of-the-art approaches on the DeepFashion dataset while exhibiting efficiency in terms of training parameters, training time, and inference speed. Our compact 33MB model achieves an FID of 7.42, surpassing a prior Unet latent diffusion approach (FID 8.07) with $11\times$ fewer parameters. Our best model surpasses the pixel-based diffusion with $\frac{2}{3}$ of the parameters and achieves $5.43 \times$ faster inference. The code is available at https://github.com/trungpx/xmdpt.

Submitted 3 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

Comments: ICML 2024

arXiv:2311.18508 [pdf, other]

DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution

Authors: Axi Niu, Kang Zhang, Joshua Tian Jin Tee, Trung X. Pham, Jinqiu Sun, Chang D. Yoo, In So Kweon, Yanning Zhang

Abstract: It is well known that the adversarial optimization of GAN-based image super-resolution (SR) methods makes the preceding SR model generate unpleasant and undesirable artifacts, leading to large distortion. We attribute the cause of such distortions to the poor calibration of the discriminator, which hampers its ability to provide meaningful feedback to the generator for learning high-quality images. To address this problem, we propose a simple but non-trivial diffusion-style data augmentation scheme for current GAN-based SR methods, known as DifAugGAN. It adapts the diffusion process of generative diffusion models to improve the calibration of the discriminator during training, motivated by the success of data augmentation schemes in achieving good calibration. Our DifAugGAN can be a Plug-and-Play strategy for current GAN-based SISR methods to improve the calibration of the discriminator and thus improve SR performance. Extensive experimental evaluations demonstrate the superiority of DifAugGAN over state-of-the-art GAN-based SISR methods across both synthetic and real-world datasets, showcasing notable advancements in both qualitative and quantitative results.

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2305.18547 [pdf, other]

Learning from Multi-Perception Features for Real-Word Image Super-resolution

Authors: Axi Niu, Kang Zhang, Trung X. Pham, Pei Wang, Jinqiu Sun, In So Kweon, Yanning Zhang

Abstract: Currently, there are two popular approaches for addressing real-world image super-resolution problems: degradation-estimation-based and blind-based methods. However, degradation-estimation-based methods may be inaccurate in estimating the degradation, making them less applicable to real-world LR images. On the other hand, blind-based methods are often limited by their fixed single perception information, which hinders their ability to handle diverse perceptual characteristics. To overcome this limitation, we propose a novel SR method called MPF-Net that leverages multiple perceptual features of input images. Our method incorporates a Multi-Perception Feature Extraction (MPFE) module to extract diverse perceptual information and a series of newly-designed Cross-Perception Blocks (CPB) to combine this information for effective super-resolution reconstruction. Additionally, we introduce a contrastive regularization term (CR) that improves the model's learning capability by using newly generated HR and LR images as positive and negative samples for ground truth HR. Experimental results on challenging real-world SR datasets demonstrate that our approach significantly outperforms existing state-of-the-art methods in both qualitative and quantitative measures.

Submitted 26 May, 2023; originally announced May 2023.

arXiv:2302.12831 [pdf, other]

CDPMSR: Conditional Diffusion Probabilistic Models for Single Image Super-Resolution

Authors: Axi Niu, Kang Zhang, Trung X. Pham, Jinqiu Sun, Yu Zhu, In So Kweon, Yanning Zhang

Abstract: Diffusion probabilistic models (DPM) have been widely adopted in image-to-image translation to generate high-quality images. Prior attempts at applying the DPM to image super-resolution (SR) have shown that iteratively refining pure Gaussian noise with a conditional image, using a U-Net trained on denoising at various noise levels, can help obtain a satisfactory high-resolution image for the low-resolution one. To further improve the performance and simplify current DPM-based super-resolution methods, we propose a simple but non-trivial DPM-based super-resolution post-process framework, i.e., cDPMSR. After applying a pre-trained SR model on the LR image under test to provide the conditional input, we adapt the standard DPM to conduct conditional image generation and perform super-resolution through a deterministic iterative denoising process. Our method surpasses prior attempts on both qualitative and quantitative results and can generate more photo-realistic counterparts for the low-resolution images with various benchmark datasets including Set5, Set14, Urban100, BSD100, and Manga109. Code will be published after acceptance.

Submitted 14 February, 2023; originally announced February 2023.

Comments: 4 pages, 4 figures

arXiv:2211.09861 [pdf, other]

Self-Supervised Visual Representation Learning via Residual Momentum

Authors: Trung X. Pham, Axi Niu, Zhang Kang, Sultan Rizky Madjid, Ji Woo Hong, Daehyeok Kim, Joshua Tian Jin Tee, Chang D. Yoo

Abstract: Self-supervised learning (SSL) approaches have shown promising capabilities in learning the representation from unlabeled data. Amongst them, momentum-based frameworks have attracted significant attention. Despite being a great success, these momentum-based SSL frameworks suffer from a large gap in representation between the online encoder (student) and the momentum encoder (teacher), which hinders performance on downstream tasks. This paper is the first to investigate and identify this invisible gap as a bottleneck that has been overlooked in the existing SSL frameworks, potentially preventing the models from learning good representation. To solve this problem, we propose "residual momentum" to directly reduce this gap to encourage the student to learn the representation as close to that of the teacher as possible, narrow the performance gap with the teacher, and significantly improve the existing SSL. Our method is straightforward, easy to implement, and can be easily plugged into other SSL frameworks. Extensive experimental results on numerous benchmark datasets and diverse network architectures have demonstrated the effectiveness of our method over the state-of-the-art contrastive learning baselines.

Submitted 21 November, 2022; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 18 pages, 16 figures
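
The momentum (teacher) encoder mentioned in the abstract above is commonly maintained as an exponential moving average of the online (student) encoder. The following is a minimal sketch of that standard update on toy weight vectors, not the paper's residual-momentum loss. The function names `ema_update` and `l2_gap`, the momentum value, and the use of a weight-space distance as a stand-in for the representation gap are all assumptions made for illustration.

```python
def ema_update(teacher, student, momentum=0.99):
    """Standard momentum update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def l2_gap(teacher, student):
    """Euclidean distance between two weight vectors (illustrative proxy only)."""
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) ** 0.5


# Toy example (hypothetical values): the student is held fixed, so the teacher
# drifts toward it and the gap shrinks geometrically with each update.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
gaps = []
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    gaps.append(l2_gap(teacher, student))
print(teacher, gaps)
```

With a momentum close to 1 the teacher lags far behind the student, which is one intuitive way to see why a student-teacher gap can persist during training.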

arXiv:2210.08282 [pdf, other]

LAD: A Hybrid Deep Learning System for Benign Paroxysmal Positional Vertigo Disorders Diagnostic

Authors: Trung Xuan Pham, Jin Woong Choi, Rusty John Lloyd Mina, Thanh Nguyen, Sultan Rizky Madjid, Chang Dong Yoo

Abstract: Herein, we introduce "Look and Diagnose" (LAD), a hybrid deep learning-based system that aims to support doctors in effectively diagnosing the Benign Paroxysmal Positional Vertigo (BPPV) disorder. Given the body postures of the patient in the Dix-Hallpike and lateral head turns test, the visual information of both eyes is captured and fed into LAD, which analyzes it and classifies it into one of six possible disorders the patient might be suffering from. The proposed system consists of two streams: (1) an RNN-based stream that takes raw RGB images of both eyes to extract visual features and optical flow of each eye, followed by ternary classification to determine left/right posterior canal (PC) or other; and (2) a pupil detector stream that detects the pupil when the input is classified as Non-PC and classifies the direction and strength of the beating to categorize the Non-PC types into the remaining four classes: Geotropic BPPV (left and right) and Apogeotropic BPPV (left and right). Experimental results show that with the patient's body postures, the system can accurately classify a given BPPV disorder into the six types of disorders with an accuracy of 91% on the validation set. The proposed method can successfully classify disorders with an accuracy of 93% for the Posterior Canal disorder and 95% for the Geotropic and Apogeotropic disorder, paving a potential direction for research with medical data.

Submitted 15 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE Access 2022, 13 pages, 14 figures

arXiv:2206.02193 [pdf, other]

doi 10.1142/S0129055X2350023X

Peeling for tensorial wave equations on Schwarzschild spacetime

Authors: Truong Xuan Pham

Abstract: In this paper, we establish the asymptotic behaviour along outgoing and incoming radial geodesics, i.e., the peeling property for the tensorial Fackrell-Ipser and spin $\pm 1$ Teukolsky equations on Schwarzschild spacetime. Our method combines a conformal compactification with vector field techniques to prove the two-sided estimates of the energies of tensorial fields through the future and past null infinity $\mathscr{I}^\pm$ and the initial Cauchy hypersurface $\Sigma_0 = \left\{ t=0 \right\}$ in a neighbourhood of spacelike infinity $i_0$ far away from the horizon and future timelike infinity. Our results obtain the optimal initial data which guarantees the peeling at all orders.

Submitted 25 September, 2023; v1 submitted 5 June, 2022; originally announced June 2022.

Comments: 22 pages, Reviews in Mathematical Physics, 2023. arXiv admin note: text overlap with arXiv:2006.02888

Journal ref: Reviews in Mathematical Physics, 2023

arXiv:2203.17248 [pdf, other]

Dual Temperature Helps Contrastive Learning Without Many Negative Samples: Towards Understanding and Simplifying MoCo

Authors: Chaoning Zhang, Kang Zhang, Trung X. Pham, Axi Niu, Zhinan Qiao, Chang D. Yoo, In So Kweon

Abstract: Contrastive learning (CL) is widely known to require many negative samples, 65536 in MoCo for instance, for which the performance of a dictionary-free framework is often inferior because the negative sample size (NSS) is limited by its mini-batch size (MBS). To decouple the NSS from the MBS, a dynamic dictionary has been adopted in a large volume of CL frameworks, among which arguably the most popular one is the MoCo family. In essence, MoCo adopts a momentum-based queue dictionary, for which we perform a fine-grained analysis of its size and consistency. We point out that the InfoNCE loss used in MoCo implicitly attracts anchors to their corresponding positive samples with various strengths of penalty, and we identify this inter-anchor hardness-awareness property as a major reason for the necessity of a large dictionary. Our findings motivate us to simplify MoCo v2 via the removal of its dictionary as well as momentum. Based on an InfoNCE with the proposed dual temperature, our simplified frameworks, SimMoCo and SimCo, outperform MoCo v2 by a visible margin. Moreover, our work bridges the gap between CL and non-CL frameworks, contributing to a more unified understanding of these two mainstream frameworks in SSL. Code is available at: https://bit.ly/3LkQbaT.

Submitted 30 March, 2022; originally announced March 2022.

Comments: Accepted by CVPR2022
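
The InfoNCE loss that the abstract above analyses can be written, for one anchor, as the negative log-softmax of the positive similarity over the positive and negative similarities, each divided by a temperature. The following is a minimal sketch of that standard single-temperature form only; it does not implement the paper's dual-temperature variant. The function name `info_nce` and the example similarity values are assumptions chosen for illustration.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.2):
    """Standard InfoNCE for one anchor.

    sim_pos:  similarity between the anchor and its positive sample.
    sim_negs: similarities between the anchor and each negative sample.
    Returns -log( exp(s+/T) / (exp(s+/T) + sum_k exp(s_k/T)) ).
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Hypothetical cosine similarities: one positive and three negatives.
print(info_nce(0.8, [0.1, 0.3, -0.2], temperature=0.2))
```

Lowering the temperature sharpens the softmax, so a negative that is more similar to the anchor than the positive is penalised more heavily; this is the kind of hardness sensitivity the abstract discusses.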

arXiv:2203.16262 [pdf, other]

How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning

Authors: Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, In So Kweon

Abstract: To avoid collapse in self-supervised learning (SSL), a contrastive loss is widely used but often requires a large number of negative samples. Without negative samples yet achieving competitive performance, a recent work has attracted significant attention for providing a minimalist simple Siamese (SimSiam) method to avoid collapse. However, how it avoids collapse without negative samples is still not fully clear, and our investigation starts by revisiting the explanatory claims in the original SimSiam. After refuting their claims, we introduce vector decomposition for analyzing the collapse based on the gradient analysis of the $l_2$-normalized representation vector. This yields a unified perspective on how negative samples and SimSiam alleviate collapse. Such a unified perspective is timely for understanding the recent progress in SSL.

Submitted 30 March, 2022; originally announced March 2022.

Comments: Accepted at ICLR 2022
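
SimSiam, the method examined in the entry above, trains on the negative cosine similarity between a predictor output and a target representation, both $l_2$-normalized. The following is a minimal sketch of that loss on plain Python lists. It omits the stop-gradient on the target branch, which needs an autodiff framework, and it is not the paper's vector-decomposition analysis. The function names `l2_normalize` and `negative_cosine` are assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def negative_cosine(p, z):
    """Negative cosine similarity between prediction p and target z."""
    p_hat, z_hat = l2_normalize(p), l2_normalize(z)
    return -sum(a * b for a, b in zip(p_hat, z_hat))


# Hypothetical vectors: aligned, orthogonal, and opposite directions.
print(negative_cosine([1.0, 0.0], [2.0, 0.0]))
print(negative_cosine([1.0, 0.0], [0.0, 3.0]))
print(negative_cosine([1.0, 0.0], [-1.0, 0.0]))
```

Because both vectors are normalized, the loss depends only on direction, which is why an analysis of the $l_2$-normalized representation vector is a natural starting point.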

arXiv:2108.00475 [pdf, other]

Self-supervised Learning with Local Attention-Aware Feature

Authors: Trung X. Pham, Rusty John Lloyd Mina, Dias Issa, Chang D. Yoo

Abstract: In this work, we propose a novel methodology for self-supervised learning for generating global and local attention-aware visual features. Our approach is based on training a model to differentiate between specific image transformations of an input sample and the patched images. Utilizing this approach, the proposed method is able to outperform the previous best competitor by 1.03% on the Tiny-ImageNet dataset and by 2.32% on the STL-10 dataset. Furthermore, our approach outperforms the fully-supervised learning method on the STL-10 dataset. Experimental results and visualizations show the capability of successfully learning global and local attention-aware visual representations.

Submitted 1 August, 2021; originally announced August 2021.

Comments: 5 pages, 4 figures

arXiv:2107.02969 [pdf, ps, other]

doi 10.1063/1.5121433

Peeling of Dirac fields on Kerr spacetimes

Authors: Truong Xuan Pham

Abstract: In a recent paper with J.-P. Nicolas [J.-P. Nicolas and P.T. Xuan, Annales Henri Poincare 2019], we studied the peeling for scalar fields on Kerr metrics. The present work extends these results to Dirac fields on the same geometrical background. We follow the approach initiated by L.J. Mason and J.-P. Nicolas [L. Mason and J.-P. Nicolas, J.Inst.Math.Jussieu 2009; L. Mason and J.-P. Nicolas, J.Geom.Phys 2012] on the Schwarzschild spacetime and extended to Kerr metrics for scalar fields. The method combines the Penrose conformal compactification and geometric energy estimates in order to work out a definition of the peeling at all orders in terms of Sobolev regularity near $\mathscr{I}$, instead of ${\mathcal C}^k$ regularity at $\mathscr{I}$, then provides the optimal spaces of initial data such that the associated solution satisfies the peeling at a given order. The results confirm that the analogous decay and regularity assumptions on initial data in Minkowski and in Kerr produce the same regularity across null infinity. Our results are local near spacelike infinity and are valid for all values of the angular momentum of the spacetime, including for fast Kerr metrics.

Submitted 25 September, 2023; v1 submitted 6 July, 2021; originally announced July 2021.

Comments: 29 pages, Journal of mathematical physics, 2020

arXiv:1909.06720 [pdf, other]

Cascade RPN: Delving into High-Quality Region Proposal Network with Adaptive Convolution

Authors: Thang Vu, Hyunjun Jang, Trung X. Pham, Chang D. Yoo

Abstract: This paper considers an architecture referred to as Cascade Region Proposal Network (Cascade RPN) for improving the region-proposal quality and detection performance by systematically addressing the limitation of the conventional RPN that heuristically defines the anchors and aligns the features to the anchors. First, instead of using multiple anchors with predefined scales and aspect ratios, Cascade RPN relies on a single anchor per location and performs multi-stage refinement. Each stage is progressively more stringent in defining positive samples by starting out with an anchor-free metric followed by anchor-based metrics in the ensuing stages. Second, to attain alignment between the features and the anchors throughout the stages, adaptive convolution is proposed that takes the anchors in addition to the image features as its input and learns the sampled features guided by the anchors. A simple implementation of a two-stage Cascade RPN achieves AR 13.4 points higher than that of the conventional RPN, surpassing any existing region proposal methods. When adopted in Fast R-CNN and Faster R-CNN, Cascade RPN can improve the detection mAP by 3.1 and 3.5 points, respectively. The code is made publicly available at https://github.com/thangvubk/Cascade-RPN.git.

Submitted 4 December, 2019; v1 submitted 14 September, 2019; originally announced September 2019.

Comments: To appear in NeurIPS 2019 (spotlight)

arXiv:1810.01641 [pdf, other]

PIRM Challenge on Perceptual Image Enhancement on Smartphones: Report

Authors: Andrey Ignatov, Radu Timofte, Thang Van Vu, Tung Minh Luu, Trung X Pham, Cao Van Nguyen, Yongwoo Kim, Jae-Seok Choi, Munchurl Kim, Jie Huang, Jiewen Ran, Chen Xing, Xingguang Zhou, Pengfei Zhu, Mingrui Geng, Yawei Li, Eirikur Agustsson, Shuhang Gu, Luc Van Gool, Etienne de Stoutz, Nikolay Kobyshev, Kehui Nie, Yan Zhao, Gen Li, Tong Tong, et al. (23 additional authors not shown)

Abstract: This paper reviews the first challenge on efficient perceptual image enhancement with the focus on deploying deep learning models on smartphones. The challenge consisted of two tracks. In the first one, participants solved the classical image super-resolution problem with a bicubic downscaling factor of 4. The second track was aimed at real-world photo enhancement, and the goal was to map low-quality photos from the iPhone 3GS device to the same photos captured with a DSLR camera. The target metric used in this challenge combined the runtime, PSNR scores and solutions' perceptual results measured in the user study. To ensure the efficiency of the submitted models, we additionally measured their runtime and memory requirements on Android smartphones. The proposed solutions significantly improved baseline results, defining the state-of-the-art for image enhancement on smartphones.

Submitted 3 October, 2018; originally announced October 2018.
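
PSNR, one component of the challenge's target metric in the entry above, is defined as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MSE is the mean squared error between the reference and the evaluated image. The following is a minimal sketch on flat lists of pixel values, assuming 8-bit images (peak value 255). The function name `psnr` and the sample pixel values are assumptions for illustration, and how the challenge weighted PSNR against runtime and user-study scores is not reproduced here.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(distorted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel grayscale images that differ slightly.
print(psnr([10, 20, 30, 40], [12, 18, 33, 40]))
```

Higher values indicate a closer match to the reference, and identical inputs give an infinite score because the error term vanishes.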

arXiv:1801.08996 [pdf, ps, other]

doi 10.1007/s00023-019-00832-0

Peeling on Kerr spacetime: linear and non linear scalar fields

Authors: Jean-Philippe Nicolas, Truong Xuan Pham

Abstract: We study the peeling on Kerr spacetime for fields satisfying conformally invariant linear and nonlinear scalar wave equations. We follow an approach initiated by L.J. Mason and the first author for the Schwarzschild metric, based on a Penrose compactification and energy estimates. This approach provides a definition of the peeling at all orders in terms of Sobolev regularity near ${\mathscr I}$ instead of ${\cal C}^k$ regularity at ${\mathscr I}$, allowing one to characterise completely and without loss the classes of initial data ensuring a certain order of peeling at ${\mathscr I}$. This paper extends the construction to the Kerr metric, confirms the validity and optimality of the flat spacetime model (in the sense that the same regularity and fall-off assumptions on the data guarantee the peeling behaviour in flat spacetime and on the Kerr metric) and does so for the first time for a nonlinear equation. Our results are local near spacelike infinity and are valid for all values of the angular momentum of the spacetime, including for fast Kerr metrics.

Submitted 26 January, 2018; originally announced January 2018.

Comments: 51 pages

MSC Class: 35L05; 35Q75; 83C57

Showing 1–14 of 14 results for author: Pham, T X