Search | arXiv e-print repository

SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing

Authors: Ruihuang Li, Liyi Chen, Zhengqiang Zhang, Varun Jampani, Vishal M. Patel, Lei Zhang

Abstract: Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Meanwhile, the 2D diffusion models also exhibit substantial potential for 3D editing tasks. However, how to achieve consistent edits across multiple viewpoints remains a challenge. While the iterative dataset update method is capable of achieving global consistency, it suffers from slow convergence and over-smoothed textures. We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing. SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent, which ensures global consistency in both semantic structure and low-frequency appearance. To further enhance local consistency in high-frequency details, we set a group of anchor views and propagate them to their neighboring frames through cross-view reprojection. To improve the reliability of multi-view correspondences, we introduce depth supervision during training to enhance the reconstruction of precise geometries. Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures, by enhancing geometric consistency at the noise and pixel levels.

Submitted 25 June, 2024; originally announced June 2024.

Comments: 16 pages, 13 figures

arXiv:2406.13237 [pdf, other]

ModelMix: A New Model-Mixup Strategy to Minimize Vicinal Risk across Tasks for Few-scribble based Cardiac Segmentation

Authors: Ke Zhang, Vishal M. Patel

Abstract: Pixel-level dense labeling is both resource-intensive and time-consuming, whereas weak labels such as scribbles present a more feasible alternative to full annotations. However, training segmentation networks with weak supervision from scribbles remains challenging. Inspired by the fact that different segmentation tasks can be correlated with each other, we introduce a new approach to few-scribble supervised segmentation based on model parameter interpolation, termed ModelMix. Leveraging the prior knowledge that linearly interpolating convolution kernels and bias terms should result in linear interpolations of the corresponding feature vectors, ModelMix constructs virtual models using convex combinations of convolutional parameters from separate encoders. We then regularize the model set to minimize vicinal risk across tasks in both an unsupervised and a scribble-supervised way. Validated on three open datasets, i.e., ACDC, MSCMRseg, and MyoPS, our few-scribble guided ModelMix significantly surpasses the performance of the state-of-the-art scribble supervised methods.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 10 pages, 3 figures
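
The interpolation premise stated in the ModelMix abstract above (mixing convolution kernels and biases linearly mixes the resulting features) can be checked numerically for a single linear convolution. The following is only a minimal sketch under that assumption: the helper names `conv1d_valid` and `mix`, the 1D setting, and the random data are illustrative choices and are not taken from the paper. Once a nonlinearity follows the convolution, the identity holds only approximately, which is presumably why the abstract says "should result in".

```python
import random


def conv1d_valid(signal, kernel, bias):
    """Valid-mode 1D convolution (cross-correlation form) plus a scalar bias."""
    n, k = len(signal), len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(n - k + 1)
    ]


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b, element by element."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(16)]

# Two hypothetical "encoders", each reduced to one kernel and one bias.
kernel_a = [rng.gauss(0.0, 1.0) for _ in range(3)]
kernel_b = [rng.gauss(0.0, 1.0) for _ in range(3)]
bias_a, bias_b = 0.5, -1.0
lam = 0.3

# Virtual model: convex combination of the parameters.
mixed_kernel = mix(kernel_a, kernel_b, lam)
mixed_bias = lam * bias_a + (1.0 - lam) * bias_b
from_mixed_params = conv1d_valid(signal, mixed_kernel, mixed_bias)

# Same convex combination applied to the two feature vectors instead.
from_mixed_features = mix(
    conv1d_valid(signal, kernel_a, bias_a),
    conv1d_valid(signal, kernel_b, bias_b),
    lam,
)

# The two routes agree up to floating-point rounding.
max_gap = max(abs(p - q) for p, q in zip(from_mixed_params, from_mixed_features))
```

Here `max_gap` stays at rounding-error level because convolution is linear in its kernel and bias.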

arXiv:2406.10373 [pdf, other]

Wild-GS: Real-Time Novel View Synthesis from Unconstrained Photo Collections

Authors: Jiacong Xu, Yiqun Mei, Vishal M. Patel

Abstract: Photographs captured in unstructured tourist environments frequently exhibit variable appearances and transient occlusions, challenging accurate scene reconstruction and inducing artifacts in novel view synthesis. Although prior approaches have integrated the Neural Radiance Field (NeRF) with additional learnable modules to handle the dynamic appearances and eliminate transient objects, their extensive training demands and slow rendering speeds limit practical deployments. Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising alternative to NeRF, offering superior training and inference efficiency along with better rendering quality. This paper presents Wild-GS, an innovative adaptation of 3DGS optimized for unconstrained photo collections while preserving its efficiency benefits. Wild-GS determines the appearance of each 3D Gaussian by its inherent material attributes, global illumination and camera properties per image, and point-level local variance of reflectance. Unlike previous methods that model reference features in image space, Wild-GS explicitly aligns the pixel appearance features to the corresponding local Gaussians by sampling the triplane extracted from the reference image. This novel design effectively transfers the high-frequency detailed appearance of the reference view to 3D space and significantly expedites the training process. Furthermore, 2D visibility maps and depth regularization are leveraged to mitigate the transient effects and constrain the geometry, respectively. Extensive experiments demonstrate that Wild-GS achieves state-of-the-art rendering performance and the highest efficiency in both training and inference among all the existing techniques.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 15 pages, 7 figures

arXiv:2406.02549 [pdf, other]

Dreamguider: Improved Training free Diffusion-based Conditional Generation

Authors: Nithin Gopalakrishnan Nair, Vishal M Patel

Abstract: Diffusion models have emerged as a formidable tool for training-free conditional generation. However, a key hurdle in inference-time guidance techniques is the need for compute-heavy backpropagation through the diffusion network for estimating the guidance direction. Moreover, these techniques often require handcrafted parameter tuning on a case-by-case basis. Although some recent works have introduced minimal compute methods for linear inverse problems, a generic lightweight guidance solution to both linear and non-linear guidance problems is still missing. To this end, we propose Dreamguider, a method that enables inference-time guidance without compute-heavy backpropagation through the diffusion network. The key idea is to regulate the gradient flow through a time-varying factor. Moreover, we propose an empirical guidance scale that works for a wide variety of tasks, hence removing the need for handcrafted parameter tuning. We further introduce an effective lightweight augmentation strategy that significantly boosts the performance during inference-time guidance. We present experiments using Dreamguider on multiple tasks across multiple datasets and models to show the effectiveness of the proposed modules. To facilitate further research, we will make the code public after the review process.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.11708 [pdf, other]

Adaptive Batch Normalization Networks for Adversarial Robustness

Authors: Shao-Yuan Lo, Vishal M. Patel

Abstract: Deep networks are vulnerable to adversarial examples. Adversarial Training (AT) has been a standard foundation of modern adversarial defense approaches due to its remarkable effectiveness. However, AT is extremely time-consuming, which prevents its wide deployment in practical applications. In this paper, we aim at a non-AT defense: How to design a defense method that gets rid of AT but is still robust against strong adversarial attacks? To answer this question, we resort to adaptive Batch Normalization (BN), inspired by the recent advances in test-time domain adaptation. We propose a novel defense accordingly, referred to as the Adaptive Batch Normalization Network (ABNN). ABNN employs a pre-trained substitute model to generate clean BN statistics and sends them to the target model. The target model is exclusively trained on clean data and learns to align the substitute model's BN statistics. Experimental results show that ABNN consistently improves adversarial robustness against both digital and physically realizable attacks on both image and video datasets. Furthermore, ABNN can achieve higher clean data performance and significantly lower training time complexity compared to AT-based approaches.

Submitted 26 May, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

Comments: Accepted at IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS) 2024

arXiv:2405.10913 [pdf, other]

Blackbox Adaptation for Medical Image Segmentation

Authors: Jay N. Paranjape, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel

Abstract: In recent years, various large foundation models have been proposed for image segmentation. These models are often trained on large amounts of data corresponding to general computer vision tasks. Hence, these models do not perform well on medical data. There have been some attempts in the literature to perform parameter-efficient finetuning of such foundation models for medical image segmentation. However, these approaches assume that all the parameters of the model are available for adaptation. But, in many cases, these models are released as APIs or blackboxes, with no or limited access to the model parameters and data. In addition, finetuning methods also require a significant amount of compute, which may not be available for the downstream task. At the same time, medical data can't be shared with third-party agents for finetuning due to privacy reasons. To tackle these challenges, we pioneer a blackbox adaptation technique for prompted medical image segmentation, called BAPS. BAPS has two components: (i) an Image-Prompt decoder (IP decoder) module that generates visual prompts given an image and a prompt, and (ii) a Zero Order Optimization (ZOO) method, called SPSA-GC, that is used to update the IP decoder without the need for backpropagating through the foundation model. Thus, our method does not require any knowledge about the foundation model's weights or gradients. We test BAPS on four different modalities and show that our method can improve the original model's performance by around 4%.

Submitted 17 May, 2024; originally announced May 2024.

Comments: Accepted early at MICCAI 2024
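
The zero-order update mentioned in the BAPS abstract above builds on simultaneous perturbation stochastic approximation (SPSA), which estimates a gradient from two loss evaluations and never backpropagates through the model. The sketch below shows plain SPSA only, not the paper's SPSA-GC variant, whose details the abstract does not give. The quadratic `quadratic_loss` is a hypothetical stand-in for a blackbox model, and the constants `a`, `c`, and `steps` are arbitrary illustrative values.

```python
import random


def spsa_gradient(loss, theta, c, rng):
    """Estimate the gradient of `loss` at `theta` from two function evaluations."""
    # Random +/-1 (Rademacher) perturbation applied to all coordinates at once.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = (loss(plus) - loss(minus)) / (2.0 * c)
    return [diff / d for d in delta]


def quadratic_loss(theta):
    """Hypothetical blackbox objective; only its values are used, never its gradient."""
    return sum(t * t for t in theta)


def run_spsa(theta, steps=200, a=0.05, c=0.1, seed=0):
    """Plain SPSA descent with constant gains `a` (step size) and `c` (probe size)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = spsa_gradient(quadratic_loss, theta, c, rng)
        theta = [t - a * g for t, g in zip(theta, grad)]
    return theta


start = [1.0, -2.0, 0.5, 3.0, -1.5]
final = run_spsa(start)
start_loss = quadratic_loss(start)
final_loss = quadratic_loss(final)
```

Each iteration costs two loss evaluations regardless of the number of parameters, which is what makes this family of methods usable when only an API is exposed.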

arXiv:2405.05033 [pdf, other]

Multi-fidelity Hamiltonian Monte Carlo

Authors: Dhruv V. Patel, Jonghyun Lee, Matthew W. Farthing, Peter K. Kitanidis, Eric F. Darve

Abstract: Numerous applications in biology, statistics, science, and engineering require generating samples from high-dimensional probability distributions. In recent years, the Hamiltonian Monte Carlo (HMC) method has emerged as a state-of-the-art Markov chain Monte Carlo technique, exploiting the shape of such high-dimensional target distributions to efficiently generate samples. Despite its impressive empirical success and increasing popularity, its wide-scale adoption remains limited due to the high computational cost of gradient calculation. Moreover, applying this method is impossible when the gradient of the posterior cannot be computed (for example, with black-box simulators). To overcome these challenges, we propose a novel two-stage Hamiltonian Monte Carlo algorithm with a surrogate model. In this multi-fidelity algorithm, the acceptance probability is computed in the first stage via a standard HMC proposal using an inexpensive differentiable surrogate model, and if the proposal is accepted, the posterior is evaluated in the second stage using the high-fidelity (HF) numerical solver. Splitting the standard HMC algorithm into these two stages allows for approximating the gradient of the posterior efficiently, while producing accurate posterior samples by using HF numerical solvers in the second stage. We demonstrate the effectiveness of this algorithm for a range of problems, including linear and nonlinear Bayesian inverse problems with in-silico data and experimental data. The proposed algorithm is shown to seamlessly integrate with various low-fidelity and HF models, priors, and datasets. Remarkably, our proposed method outperforms the traditional HMC algorithm in both computational and statistical efficiency by several orders of magnitude, all while retaining or improving the accuracy in computed posterior statistics.

Submitted 8 May, 2024; originally announced May 2024.
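
The two-stage scheme described in the Multi-fidelity HMC abstract above can be illustrated with a generic delayed-acceptance construction: stage one runs ordinary HMC entirely on a cheap surrogate, and stage two corrects with the expensive density only for proposals that survive stage one. This is a sketch under stated assumptions and not the authors' algorithm. The one-dimensional Gaussian target, the deliberately mis-scaled surrogate, all function names, and the stage-two ratio (the standard delayed-acceptance correction) are choices made here for illustration.

```python
import math
import random


def log_post_hf(x):
    """Stand-in for the expensive high-fidelity log posterior (standard normal)."""
    return -0.5 * x * x


def log_post_lf(x):
    """Cheap differentiable surrogate with a deliberately wrong variance (1.44)."""
    return -0.5 * x * x / 1.44


def grad_log_post_lf(x):
    return -x / 1.44


def leapfrog(x, p, step, n_steps):
    """Leapfrog integration driven by surrogate gradients only."""
    p += 0.5 * step * grad_log_post_lf(x)
    for i in range(n_steps):
        x += step * p
        if i < n_steps - 1:
            p += step * grad_log_post_lf(x)
    p += 0.5 * step * grad_log_post_lf(x)
    return x, p


def two_stage_hmc(n_samples, step=0.3, n_steps=10, seed=0):
    rng = random.Random(seed)
    x, hf_x = 0.0, log_post_hf(0.0)
    samples, hf_calls = [], 0
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)
        x_new, p_new = leapfrog(x, p, step, n_steps)
        # Stage 1: ordinary HMC accept/reject on the surrogate Hamiltonian.
        h_old = -log_post_lf(x) + 0.5 * p * p
        h_new = -log_post_lf(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            # Stage 2: the expensive density is evaluated only for survivors.
            hf_new = log_post_hf(x_new)
            hf_calls += 1
            log_ratio = (hf_new - hf_x) - (log_post_lf(x_new) - log_post_lf(x))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                x, hf_x = x_new, hf_new
        samples.append(x)
    return samples, hf_calls


samples, hf_calls = two_stage_hmc(5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

In this toy setting the chain targets the high-fidelity density even though every gradient comes from the surrogate, and `hf_calls` counts how often the expensive evaluation was actually needed.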

arXiv:2404.14406 [pdf, other]

Hyp-OC: Hyperbolic One Class Classification for Face Anti-Spoofing

Authors: Kartik Narayan, Vishal M. Patel

Abstract: Face recognition technology has become an integral part of modern security systems and user authentication processes. However, these systems are vulnerable to spoofing attacks and can easily be circumvented. Most prior research in face anti-spoofing (FAS) approaches it as a two-class classification task where models are trained on real samples and known spoof attacks and tested for detection performance on unknown spoof attacks. However, in practice, FAS should be treated as a one-class classification task where, while training, one cannot assume any knowledge regarding the spoof samples a priori. In this paper, we reformulate the face anti-spoofing task from a one-class perspective and propose a novel hyperbolic one-class classification framework. To train our network, we use a pseudo-negative class sampled from the Gaussian distribution with a weighted running mean and propose two novel loss functions: (1) Hyp-PC: Hyperbolic Pairwise Confusion loss, and (2) Hyp-CE: Hyperbolic Cross Entropy loss, which operate in the hyperbolic space. Additionally, we employ Euclidean feature clipping and gradient clipping to stabilize the training in the hyperbolic space. To the best of our knowledge, this is the first work extending hyperbolic embeddings for face anti-spoofing in a one-class manner. With extensive experiments on five benchmark datasets: Rose-Youtu, MSU-MFSD, CASIA-MFSD, Idiap Replay-Attack, and OULU-NPU, we demonstrate that our method significantly outperforms the state-of-the-art, achieving better spoof detection performance.

Submitted 22 April, 2024; originally announced April 2024.

Comments: Accepted in FG2024, Project Page - https://kartik-3004.github.io/hyp-oc/

arXiv:2404.12368 [pdf, other]

Gradient-Regularized Out-of-Distribution Detection

Authors: Sina Sharifi, Taha Entesari, Bardia Safaei, Vishal M. Patel, Mahyar Fazlyab

Abstract: One of the challenges for neural networks in real-life applications is the overconfident errors these models make when the data is not from the original training distribution. Addressing this issue is known as Out-of-Distribution (OOD) detection. Many state-of-the-art OOD methods employ an auxiliary dataset as a surrogate for OOD data during training to achieve improved performance. However, these methods fail to fully exploit the local information embedded in the auxiliary dataset. In this work, we propose the idea of leveraging the information embedded in the gradient of the loss function during training to enable the network to not only learn a desired OOD score for each sample but also to exhibit similar behavior in a local neighborhood around each sample. We also develop a novel energy-based sampling method to allow the network to be exposed to more informative OOD samples during the training phase. This is especially important when the auxiliary dataset is large. We demonstrate the effectiveness of our method through extensive experiments on several OOD benchmarks, improving the existing state-of-the-art FPR95 by 4% on our ImageNet experiment. We further provide a theoretical analysis through the lens of certified robustness and Lipschitz analysis to showcase the theoretical foundation of our work. We will publicly release our code after the review process.

Submitted 22 April, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

Comments: Under review

arXiv:2404.11764 [pdf, other]

Multimodal 3D Object Detection on Unseen Domains

Authors: Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel

Abstract: LiDAR datasets for autonomous driving exhibit biases in properties such as point cloud density, range, and object dimensions. As a result, object detection networks trained and evaluated in different environments often experience performance degradation. Domain adaptation approaches assume access to unannotated samples from the test distribution to address this problem. However, in the real world, the exact conditions of deployment and access to samples representative of the test dataset may be unavailable while training. We argue that the more realistic and challenging formulation is to require robustness in performance to unseen target domains. We propose to address this problem in a two-pronged manner. First, we leverage paired LiDAR-image data present in most autonomous driving datasets to perform multimodal object detection. We suggest that working with multimodal features by leveraging both images and LiDAR point clouds for scene understanding tasks results in object detectors more robust to unseen domain shifts. Second, we train a 3D object detector to learn multimodal object features across different distributions and promote feature invariance across these source domains to improve generalizability to unseen target domains. To this end, we propose CLIX$^\text{3D}$, a multimodal fusion and supervised contrastive learning framework for 3D object detection that performs alignment of object features from same-class samples of different domains while pushing the features from different classes apart. We show that CLIX$^\text{3D}$ yields state-of-the-art domain generalization performance under multiple dataset shifts.

Submitted 17 April, 2024; originally announced April 2024.

Comments: technical report

arXiv:2404.11737 [pdf, other]

Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection

Authors: Deepti Hegde, Suhas Lohit, Kuan-Chuan Peng, Michael J. Jones, Vishal M. Patel

Abstract: Popular representation learning methods encourage feature invariance under transformations applied at the input. However, in 3D perception tasks like object localization and segmentation, outputs are naturally equivariant to some transformations, such as rotation. Using pre-training loss functions that encourage equivariance of features under certain transformations provides a strong self-supervision signal while also retaining information of geometric relationships between transformed feature representations. This can enable improved performance in downstream tasks that are equivariant to such transformations. In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. Our experiments show that the best performance arises with a pre-training approach that encourages equivariance to translation, scaling, flip, rotation, and scene flow. For spatial augmentations, we find that depending on the transformation, either a contrastive objective or an equivariance-by-classification objective yields best results. To leverage real-world object deformations and motion, we consider sequential LiDAR scene pairs and develop a novel 3D scene flow-based equivariance objective that leads to improved performance overall. We show that our pre-training method for 3D object detection outperforms existing equivariant and invariant approaches in many settings.

Submitted 17 April, 2024; originally announced April 2024.

Comments: technical report

arXiv:2404.09977 [pdf, other]

MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

Authors: Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M Patel

Abstract: Large diffusion-based Text-to-Image (T2I) models have shown impressive generative powers for text-to-image generation as well as spatially conditioned image generation. For most applications, we can train the model end-to-end with paired data to obtain photorealistic generation quality. However, to add an additional task, one often needs to retrain the model from scratch using paired data across all modalities to retain good generation performance. In this paper, we tackle this issue and propose a novel strategy to scale a generative model across new tasks with minimal compute. During our experiments, we discovered that the variance maps of intermediate feature maps of diffusion models capture the intensity of conditioning. Utilizing this prior information, we propose MaxFusion, an efficient strategy to scale up text-to-image generation models to accommodate new modality conditions. Specifically, we combine aligned features of multiple models, hence bringing a compositional effect. Our fusion strategy can be integrated into off-the-shelf models to enhance their generative prowess.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.09976 [pdf, other]

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Authors: Nithin Gopalakrishnan Nair, Jeya Maria Jose Valanarasu, Vishal M. Patel

Abstract: Recently, diffusion transformers have gained wide attention for their excellent performance in text-to-image and text-to-video models, emphasizing the need for transformers as backbone for diffusion models. Transformer-based models have shown better generalization capability compared to CNN-based models for general vision tasks. However, much less has been explored in the existing literature regarding the capabilities of transformer-based diffusion backbones and expanding their generative prowess to other datasets. This paper focuses on enabling a single pre-trained diffusion transformer model to scale across multiple datasets swiftly, allowing for the completion of diverse generative tasks using just one model. To this end, we propose DiffScaler, an efficient scaling strategy for diffusion models where we train a minimal number of parameters to adapt to different tasks. In particular, we learn task-specific transformations at each layer by incorporating the ability to utilize the learned subspaces of the pre-trained model, as well as the ability to learn additional task-specific subspaces, which may be absent in the pre-training dataset. As these parameters are independent, a single diffusion model with these task-specific parameters can be used to perform multiple tasks simultaneously. Moreover, we find that transformer-based diffusion models significantly outperform CNN-based diffusion models while performing fine-tuning over smaller datasets. We perform experiments on four unconditional image generation datasets. We show that using our proposed method, a single pre-trained model can scale up to perform these conditional and unconditional tasks, respectively, with minimal parameter tuning while performing close to fine-tuning an entire diffusion model for that particular task.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2404.09810 [pdf, other]

The Challenges of Optimization For Data Science

Authors: Christian Varner, Vivak Patel

Abstract: Optimization problems arising in data science have given rise to a number of new derivative-based optimization methods. Such methods often use standard smoothness assumptions -- namely, global Lipschitz continuity of the gradient function -- to establish a convergence theory. Unfortunately, in this work, we show that common optimization problems from data science applications are not globally Lipschitz smooth, nor do they satisfy some more recently developed smoothness conditions in literature. Instead, we show that such optimization problems are better modeled as having locally Lipschitz continuous gradients. We then construct explicit examples satisfying this assumption on which existing classes of optimization methods are either unreliable or experience an explosion in evaluation complexity. In summary, we show that optimization problems arising in data science are particularly difficult to solve, and that there is a need for methods that can reliably and practically solve these problems.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 24 pages, 3 tables, 2 figures, 10 algorithms

MSC Class: 90C30; 65K05; 68T09

arXiv:2404.01562 [pdf]

Efficient, indistinguishable telecom C-band photons using a tapered nanobeam

Authors: Mohammad Habibur Rahaman, Samuel Harper, Chang-Min Lee, Kyu-Young Kim, Mustafa Atabey Buyukkaya, Victor J. Patel, Samuel D. Hawkins, Je-Hyung Kim, Sadhvikas Addamane, Edo Waks

Abstract: Telecom C-band single photons exhibit the lowest attenuation in optical fibers, enabling long-haul quantum-secured communication. However, efficient coupling with optical fibers is crucial for these single photons to be effective carriers in long-distance transmission. In this work, we demonstrate an efficient fiber-coupled single photon source at the telecom C-band using InAs/InP quantum dots coupled to a tapered nanobeam. The tapered nanobeam structure facilitates directional emission that is mode-matched to a lensed fiber, resulting in a collection efficiency of up to 65% from the nanobeam to a single-mode fiber. Using this approach, we demonstrate single photon count rates of 575 $\pm$ 5 kcps and a single photon purity of $g^{(2)}(0)$ = 0.015 $\pm$ 0.003. Additionally, we demonstrate Hong-Ou-Mandel interference from the emitted photons with a visibility of 0.84 $\pm$ 0.06. From these measurements, we determine a photon coherence time of 450 $\pm$ 20 ps, a factor of just 8.3 away from the lifetime limit. This work represents an important step towards the development of telecom C-band single-photon sources emitting bright, pure, and indistinguishable photons, which are necessary to realize fiber-based long-distance quantum networks.

Submitted 5 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.01367 [pdf, other]

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

Authors: Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar

Abstract: We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have been shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established text-to-image diffusion models, we conduct an in-depth investigation into how model size influences sampling efficiency across varying sampling steps. Our findings unveil a surprising trend: when operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results. Moreover, we extend our study to demonstrate the generalizability of these findings by applying various diffusion samplers, exploring diverse downstream tasks, evaluating post-distilled models, as well as comparing performance relative to training compute. These findings open up new pathways for the development of LDM scaling strategies which can be employed to enhance generative capabilities within limited inference budgets.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2404.00744 [pdf, other]

The distribution of Bayes' ratio

Authors: Luca Amendola, Vrund Patel, Ziad Sakr, Elena Sellentin, Kevin Wolz

Abstract: The ratio of Bayesian evidences is a popular tool in cosmology to compare different models. There are, however, several issues with this method: Bayes' ratio depends on the prior even in the limit of non-informative priors, and Jeffreys' scale, used to assess the test, is arbitrary. Moreover, the standard use of Bayes' ratio is often criticized for being unable to reject models. In this paper, we address these shortcomings by promoting evidences and evidence ratios to frequentist statistics and deriving their sampling distributions. By comparing the evidence ratios to their sampling distributions, poorly fitting models can now be rejected. Our method additionally does not depend on the prior in the limit of very weak priors, thereby safeguarding the experimenter against premature rejection of a theory with an uninformative prior, and replaces the arbitrary Jeffreys' scale by probability thresholds for rejection. We provide analytical solutions for some simplified cases (Gaussian data, linear parameters, and nested models), and we apply the method to cosmological supernovae Ia data. We dub our method the FB method, for Frequentist-Bayesian.

Submitted 31 March, 2024; originally announced April 2024.

Comments: 20 pages

arXiv:2403.19593 [pdf, other]

Frame by Familiar Frame: Understanding Replication in Video Diffusion Models

Authors: Aimon Rahman, Malsha V. Perera, Vishal M. Patel

Abstract: Building on the momentum of image generation diffusion models, there is an increasing interest in video-based diffusion models. However, video generation poses greater challenges due to its higher-dimensional nature, the scarcity of training data, and the complex spatiotemporal relationships involved. Image generation models, due to their extensive data requirements, have already strained computational resources to their limits. There have been instances of these models reproducing elements from the training samples, leading to concerns and even legal disputes over sample replication. Video diffusion models, which operate with even more constrained datasets and are tasked with generating both spatial and temporal content, may be more prone to replicating samples from their training sets. Compounding the issue, these models are often evaluated using metrics that inadvertently reward replication. In our paper, we present a systematic investigation into the phenomenon of sample replication in video diffusion models. We scrutinize various recent diffusion models for video synthesis, assessing their tendency to replicate spatial and temporal content in both unconditional and conditional generation scenarios. Our study identifies strategies that are less likely to lead to replication. Furthermore, we propose new evaluation strategies that take replication into account, offering a more accurate measure of a model's ability to generate the original content.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.14513 [pdf, other]

View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network

Authors: Quan Zhang, Lei Wang, Vishal M. Patel, Xiaohua Xie, Jianhuang Lai

Abstract: Existing person re-identification methods have achieved remarkable advances in appearance-based identity association across homogeneous cameras, such as ground-ground matching. However, as a more practical scenario, aerial-ground person re-identification (AGPReID) among heterogeneous cameras has received minimal attention. To alleviate the disruption of discriminative identity representation by dramatic view discrepancy as the most significant challenge in AGPReID, the view-decoupled transformer (VDT) is proposed as a simple yet effective framework. Two major components are designed in VDT to decouple view-related and view-unrelated features, namely hierarchical subtractive separation and orthogonal loss, where the former separates these two features inside the VDT, and the latter constrains these two to be independent. In addition, we contribute a large-scale AGPReID dataset called CARGO, consisting of five/eight aerial/ground cameras, 5,000 identities, and 108,563 images. Experiments on two datasets show that VDT is a feasible and effective solution for AGPReID, surpassing the previous method on mAP/Rank1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID, while keeping the same magnitude of computational complexity. Our project is available at https://github.com/LinlyAC/VDT-AGPReID

Submitted 21 March, 2024; originally announced March 2024.

Comments: CVPR 2024

arXiv:2403.14053 [pdf, other]

Leveraging Thermal Modality to Enhance Reconstruction in Low-Light Conditions

Authors: Jiacong Xu, Mingqian Liao, K Ram Prabhakar, Vishal M. Patel

Abstract: Neural Radiance Fields (NeRF) accomplishes photo-realistic novel view synthesis by learning the implicit volumetric representation of a scene from multi-view images, which faithfully convey the colorimetric information. However, sensor noise contaminates low-value pixel signals, and the lossy camera image signal processor further removes near-zero intensities in extremely dark situations, deteriorating the synthesis performance. Existing approaches reconstruct low-light scenes from raw images but struggle to recover texture and boundary details in dark regions. Additionally, they are unsuitable for high-speed models relying on explicit representations. To address these issues, we present Thermal-NeRF, which takes thermal and visible raw images as inputs to accomplish visible and thermal view synthesis simultaneously, considering that the thermal camera is robust to illumination variation and that raw images preserve any possible clues in the dark. Also, the first multi-view thermal and visible dataset (MVTV) is established to support the research on multimodal NeRF. Thermal-NeRF achieves the best trade-off between detail preservation and noise smoothing and provides better synthesis performance than previous work. Finally, we demonstrate that both modalities are beneficial to each other in 3D reconstruction.

Submitted 20 March, 2024; originally announced March 2024.

Comments: 25 pages, 13 figures

arXiv:2403.12960 [pdf, other]

FaceXFormer: A Unified Transformer for Facial Analysis

Authors: Kartik Narayan, Vibashan VS, Rama Chellappa, Vishal M. Patel

Abstract: In this work, we introduce FaceXformer, an end-to-end unified transformer model for a comprehensive range of facial analysis tasks such as face parsing, landmark detection, head pose estimation, attributes recognition, and estimation of age, gender, race, and landmarks visibility. Conventional methods in face analysis have often relied on task-specific designs and preprocessing techniques, which limit their approach to a unified architecture. Unlike these conventional methods, our FaceXformer leverages a transformer-based encoder-decoder architecture where each task is treated as a learnable token, enabling the integration of multiple tasks within a single framework. Moreover, we propose a parameter-efficient decoder, FaceX, which jointly processes face and task tokens, thereby learning generalized and robust face representations across different tasks. To the best of our knowledge, this is the first work to propose a single model capable of handling all these facial analysis tasks using transformers. We conducted a comprehensive analysis of effective backbones for unified face task processing and evaluated different task queries and the synergy between them. We conduct experiments against state-of-the-art specialized models and previous multi-task models in both intra-dataset and cross-dataset evaluations across multiple benchmarks. Additionally, our model effectively handles images "in-the-wild," demonstrating its robustness and generalizability across eight different tasks, all while maintaining the real-time performance of 37 FPS.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Project page: https://kartik-3004.github.io/facexformer_web/

arXiv:2403.09632 [pdf, other]

Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image

Authors: Yiqun Mei, Yu Zeng, He Zhang, Zhixin Shu, Xuaner Zhang, Sai Bi, Jianming Zhang, HyunJoon Jung, Vishal M. Patel

Abstract: At the core of portrait photography is the search for ideal lighting and viewpoint. The process often requires advanced knowledge in photography and an elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric relighting method that is capable of synthesizing novel viewpoints and novel lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN (EG3D) to reconstruct geometry and appearance from an input portrait as a set of 3D-aware features. We design a relighting module conditioned on a given lighting to process these features, and predict a relit 3D representation in the form of a tri-plane, which can render to an arbitrary viewpoint through volume rendering. Besides viewpoint and lighting control, Holo-Relighting also takes the head pose as a condition to enable head-pose-dependent lighting effects. With these novel designs, Holo-Relighting can generate complex non-Lambertian lighting effects (e.g., specular highlights and cast shadows) without using any explicit physical lighting priors. We train Holo-Relighting with data captured with a light stage, and propose two data-rendering techniques to improve the data quality for training the volumetric relighting system. Through quantitative and qualitative experiments, we demonstrate Holo-Relighting can achieve state-of-the-art relighting quality with better photorealism, 3D consistency and controllability.

Submitted 14 March, 2024; originally announced March 2024.

Comments: CVPR2024

arXiv:2403.09620 [pdf, other]

PosSAM: Panoptic Open-vocabulary Segment Anything

Authors: Vibashan VS, Shubhankar Borse, Hyojin Park, Debasmit Das, Vishal Patel, Munawar Hayat, Fatih Porikli

Abstract: In this paper, we introduce an open-vocabulary panoptic segmentation model that effectively unifies the strengths of the Segment Anything Model (SAM) with the vision-language CLIP model in an end-to-end framework. While SAM excels in generating spatially-aware masks, its decoder falls short in recognizing object class information and tends to oversegment without additional guidance. Existing approaches address this limitation by using multi-stage techniques and employing separate models to generate class-aware prompts, such as bounding boxes or segmentation masks. Our proposed method, PosSAM, is an end-to-end model which leverages SAM's spatially rich features to produce instance-aware masks and harnesses CLIP's semantically discriminative features for effective instance classification. Specifically, we address the limitations of SAM and propose a novel Local Discriminative Pooling (LDP) module leveraging class-agnostic SAM and class-aware CLIP features for unbiased open-vocabulary classification. Furthermore, we introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image. We conducted extensive experiments to demonstrate our method's strong generalization properties across multiple datasets, achieving state-of-the-art performance with substantial improvements over SOTA open-vocabulary panoptic segmentation methods. In both COCO to ADE20K and ADE20K to COCO settings, PosSAM outperforms the previous state-of-the-art methods by a large margin, 2.4 PQ and 4.6 PQ, respectively. Project Website: https://vibashan.github.io/possam-web/

Submitted 14 March, 2024; originally announced March 2024.

arXiv:2403.06978 [pdf, other]

Attention Prompt Tuning: Parameter-efficient Adaptation of Pre-trained Models for Spatiotemporal Modeling

Authors: Wele Gedara Chaminda Bandara, Vishal M. Patel

Abstract: In this paper, we introduce Attention Prompt Tuning (APT) - a computationally efficient variant of prompt tuning for video-based applications such as action recognition. Prompt tuning approaches involve injecting a set of learnable prompts along with data tokens during fine-tuning while keeping the backbone frozen. This approach greatly reduces the number of learnable parameters compared to full tuning. For image-based downstream tasks, normally a couple of learnable prompts achieve results close to those of full tuning. However, videos, which contain more complex spatiotemporal information, require hundreds of tunable prompts to achieve reasonably good results. This reduces the parameter efficiency observed in images and significantly increases latency and the number of floating-point operations (FLOPs) during inference. To tackle these issues, we directly inject the prompts into the keys and values of the non-local attention mechanism within the transformer block. Additionally, we introduce a novel prompt reparameterization technique to make APT more robust against hyperparameter selection. The proposed APT approach greatly reduces the number of FLOPs and latency while achieving a significant performance boost over the existing parameter-efficient tuning methods on UCF101, HMDB51, and SSv2 datasets for action recognition. The code and pre-trained models are available at https://github.com/wgcban/apt

Submitted 11 March, 2024; originally announced March 2024.

Comments: Accepted at the 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG'24); code available at https://github.com/wgcban/apt ; 12 pages, 8 figures, 6 tables

arXiv:2402.17207 [pdf, other]

Deployment Prior Injection for Run-time Calibratable Object Detection

Authors: Mo Zhou, Yiding Yang, Haoxiang Li, Vishal M. Patel, Gang Hua

Abstract: With a strong alignment between the training and test distributions, object relation as a context prior facilitates object detection. Yet, it turns into a harmful but inevitable training set bias upon test distributions that shift differently across space and time. Nevertheless, the existing detectors cannot incorporate deployment context prior during the test phase without parameter update. Such a capability requires the model to explicitly learn disentangled representations with respect to context prior. To achieve this, we introduce an additional graph input to the detector, where the graph represents the deployment context prior, and its edge values represent object relations. Then, the detector behavior is trained to be bound to the graph with a modified training objective. As a result, during the test phase, any suitable deployment context prior can be injected into the detector via graph edits, hence calibrating, or "re-biasing" the detector towards the given prior at run-time without parameter update. Even if the deployment prior is unknown, the detector can self-calibrate using deployment prior approximated using its own predictions. Comprehensive experimental results on the COCO dataset, as well as cross-dataset testing on the Objects365 dataset, demonstrate the effectiveness of the run-time calibratable detector.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.02263 [pdf, other]

MixedNUTS: Training-Free Accuracy-Robustness Balance via Nonlinearly Mixed Classifiers

Authors: Yatong Bai, Mo Zhou, Vishal M. Patel, Somayeh Sojoudi

Abstract: Adversarial robustness often comes at the cost of degraded accuracy, impeding the real-life application of robust classification models. Training-based solutions for better trade-offs are limited by incompatibilities with already-trained high-performance large models, necessitating the exploration of training-free ensemble approaches. Observing that robust models are more confident in correct predictions than in incorrect ones on clean and adversarial data alike, we speculate that amplifying this "benign confidence property" can reconcile accuracy and robustness in an ensemble setting. To achieve this, we propose "MixedNUTS", a training-free method where the output logits of a robust classifier and a standard non-robust classifier are processed by nonlinear transformations with only three parameters, which are optimized through an efficient algorithm. MixedNUTS then converts the transformed logits into probabilities and mixes them as the overall output. On CIFAR-10, CIFAR-100, and ImageNet datasets, experimental results with custom strong adaptive attacks demonstrate MixedNUTS's vastly improved accuracy and near-SOTA robustness -- it boosts CIFAR-100 clean accuracy by 7.86 points, sacrificing merely 0.87 points in robust accuracy.

Submitted 12 April, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

MSC Class: 68T07

arXiv:2401.02158 [pdf, other]

Shayona@SMM4H23: COVID-19 Self diagnosis classification using BERT and LightGBM models

Authors: Rushi Chavda, Darshan Makwana, Vraj Patel, Anupam Shukla

Abstract: This paper describes approaches and results for shared Tasks 1 and 4 of SMM4H-23 by Team Shayona. Shared Task 1 was binary classification of English tweets self-reporting a COVID-19 diagnosis, and Shared Task 4 was binary classification of English Reddit posts self-reporting a social anxiety disorder diagnosis. Our team achieved the highest F1-score, 0.94, in Task 1 among all participants. We have leveraged the Transformer model (BERT) in combination with the LightGBM model for both tasks.

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2312.14126 [pdf, other]

Entropic Open-set Active Learning

Authors: Bardia Safaei, Vibashan VS, Celso M. de Melo, Vishal M. Patel

Abstract: Active Learning (AL) aims to enhance the performance of deep models by selecting the most informative samples for annotation from a pool of unlabeled data. Despite impressive performance in closed-set settings, most AL methods fail in real-world scenarios where the unlabeled data contains unknown categories. Recently, a few studies have attempted to tackle the AL problem for the open-set setting. However, these methods focus more on selecting known samples and do not efficiently utilize unknown samples obtained during AL rounds. In this work, we propose an Entropic Open-set AL (EOAL) framework which leverages both known and unknown distributions effectively to select informative samples during AL rounds. Specifically, our approach employs two different entropy scores. One measures the uncertainty of a sample with respect to the known-class distributions. The other measures the uncertainty of the sample with respect to the unknown-class distributions. By utilizing these two entropy scores we effectively separate the known and unknown samples from the unlabeled data resulting in better sampling. Through extensive experiments, we show that the proposed method outperforms existing state-of-the-art methods on CIFAR-10, CIFAR-100, and TinyImageNet datasets. Code is available at https://github.com/bardisafa/EOAL

Submitted 21 December, 2023; originally announced December 2023.

Comments: Accepted in AAAI 2024

arXiv:2312.02156 [pdf, other]

Latent Feature-Guided Diffusion Models for Shadow Removal

Authors: Kangfu Mei, Luis Figueroa, Zhe Lin, Zhihong Ding, Scott Cohen, Vishal M. Patel

Abstract: Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that inherits the characteristics of shadow-free images, thus avoiding the limitation of conventional methods that condition on degraded images only. Additionally, we propose to alleviate potential local optima during training by fusing noise features with the diffusion network. We demonstrate the effectiveness of our approach which outperforms the previous best method by 13% in terms of RMSE on the AISTD dataset. Further, we explore instance-level shadow removal, where our model outperforms the previous best method by 82% in terms of RMSE on the DESOBA dataset.

Submitted 4 December, 2023; originally announced December 2023.

Comments: project page see https://kfmei.page/shadow-diffusion/index.html

arXiv:2312.02151 [pdf, other]

Guarding Barlow Twins Against Overfitting with Mixed Samples

Authors: Wele Gedara Chaminda Bandara, Celso M. De Melo, Vishal M. Patel

Abstract: Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins' proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples, unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn't necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets. The code and checkpoints are available at: https://github.com/wgcban/mix-bt.git

Submitted 4 December, 2023; originally announced December 2023.

Comments: Code and checkpoints are available at: https://github.com/wgcban/mix-bt.git

arXiv:2311.05574 [pdf, ps, other]

A near-optimal zero-free disk for the Ising model

Authors: Viresh Patel, Guus Regts, Ayla Stam

Abstract: The partition function of the Ising model of a graph $G=(V,E)$ is defined as $Z_{\text{Ising}}(G;b)=\sum_{\sigma:V\to \{0,1\}} b^{m(\sigma)}$, where $m(\sigma)$ denotes the number of edges $e=\{u,v\}$ such that $\sigma(u)=\sigma(v)$. We show that for any positive integer $\Delta$ and any graph $G$ of maximum degree at most $\Delta$, $Z_{\text{Ising}}(G;b)\neq 0$ for all $b\in \mathbb{C}$ satisfying $|\frac{b-1}{b+1}| \leq \frac{1-o_\Delta(1)}{\Delta-1}$ (where $o_\Delta(1) \to 0$ as $\Delta\to \infty$). This is optimal in the sense that $\tfrac{1-o_\Delta(1)}{\Delta-1}$ cannot be replaced by $\tfrac{c}{\Delta-1}$ for any constant $c > 1$ subject to a complexity theoretic assumption. To prove our result we use a standard reformulation of the partition function of the Ising model as the generating function of even sets. We establish a zero-free disk for this generating function inspired by techniques from statistical physics on partition functions of polymer models. Our approach is quite general and we discuss extensions of it to certain types of polymer models.

Submitted 23 April, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

Comments: 12 pages; we have added a few propositions in Section 2 and reorganized the section to clarify the proof of Lemma 3.1. Some other small modifications have also been made as per suggestion of two referees
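
The definition of $Z_{\text{Ising}}(G;b)$ quoted in the abstract above can be checked by hand on very small graphs. The following is a minimal brute-force sketch in Python, not the authors' method: the function name `ising_partition` and the vertex labelling 0..n-1 are assumptions made for illustration, and the enumeration over all $2^{|V|}$ assignments is only practical for tiny graphs.

```python
from itertools import product


def ising_partition(num_vertices, edges, b):
    """Evaluate Z(G; b) = sum over sigma: V -> {0,1} of b^{m(sigma)}.

    m(sigma) is the number of edges {u, v} with sigma(u) == sigma(v).
    Vertices are assumed to be labelled 0 .. num_vertices - 1 (a convention
    chosen for this sketch). b may be an int, float, or complex number.
    """
    total = 0
    # Enumerate every assignment of 0/1 to the vertices.
    for sigma in product((0, 1), repeat=num_vertices):
        # Count monochromatic edges under this assignment.
        m = sum(1 for u, v in edges if sigma[u] == sigma[v])
        total += b ** m
    return total


# Single edge: Z = 2b + 2, so b = 3 gives 8 and b = -1 is a zero.
single_edge = [(0, 1)]
# Triangle: Z = 2b^3 + 6b.
triangle = [(0, 1), (1, 2), (0, 2)]
```

For a single edge the sum is $2b+2$, which vanishes at $b=-1$; there $|\frac{b-1}{b+1}|$ is unbounded, so this zero is consistent with lying outside any disk of the form stated in the abstract.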

arXiv:2310.06212 [pdf, other]

Comparison of deep-learning data fusion strategies in mandibular osteoradionecrosis prediction modelling using clinical variables and radiation dose distribution volumes

Authors: Laia Humbert-Vidan, Vinod Patel, Andrew P King, Teresa Guerrero Urbano

Abstract: Purpose. NTCP modelling is rapidly embracing DL methods as the need to include spatial dose information is acknowledged. Finding the most appropriate way of combining radiation dose distribution images and clinical data involves technical challenges and requires domain knowledge. We propose different data fusion strategies that we hope will serve as a starting point for future DL NTCP studies. Methods. Early, joint and late DL multi-modality fusion strategies were compared using clinical variables and mandibular radiation dose distribution volumes. The discriminative performance of the multi-modality models was compared to that of single-modality models. All the experiments were conducted on a control-case matched cohort of 92 ORN cases and 92 controls from a single institution. Results. The highest ROC AUC score was obtained with the late fusion model (0.70), but no statistically significant differences in discrimination performance were observed between strategies. While late fusion was the least technically complex strategy, its design did not model the inter-modality interactions that are required for NTCP modelling. Joint fusion involved the most complex design but resulted in a single network training process which included intra- and inter-modality interactions in its model parameter optimisation. Conclusions. This is the first study that compares different strategies for including image data into DL NTCP models in combination with lower dimensional data such as clinical variables. The discrimination performance of such multi-modality NTCP models and the choice of fusion strategy will depend on the distribution and quality of both types of data. We encourage future DL NTCP studies to report on different fusion strategies to better justify their choice of DL pipeline.

Submitted 9 October, 2023; originally announced October 2023.

Comments: 10 pages, 4 figures, 3 tables

arXiv:2310.04690 [pdf, other]

A dimension-reduced variational approach for solving physics-based inverse problems using generative adversarial network priors and normalizing flows

Authors: Agnimitra Dasgupta, Dhruv V Patel, Deep Ray, Erik A Johnson, Assad A Oberai

Abstract: We propose a novel modular inference approach combining two different generative models -- generative adversarial networks (GAN) and normalizing flows -- to approximate the posterior distribution of physics-based Bayesian inverse problems framed in high-dimensional ambient spaces. We dub the proposed framework GAN-Flow. The proposed method leverages the intrinsic dimension reduction and superior sample generation capabilities of GANs to define a low-dimensional data-driven prior distribution. Once a trained GAN-prior is available, the inverse problem is solved entirely in the latent space of the GAN using variational Bayesian inference with normalizing flow-based variational distribution, which approximates low-dimensional posterior distribution by transforming realizations from the low-dimensional latent prior (Gaussian) to corresponding realizations of a low-dimensional variational posterior distribution. The trained GAN generator then maps realizations from this approximate posterior distribution in the latent space back to the high-dimensional ambient space. We also propose a two-stage training strategy for GAN-Flow wherein we train the two generative models sequentially. Thereafter, GAN-Flow can estimate the statistics of posterior-predictive quantities of interest at virtually no additional computational cost. The synergy between the two types of generative models allows us to overcome many challenges associated with the application of Bayesian inference to large-scale inverse problems, chief among which are describing an informative prior and sampling from the high-dimensional posterior. We demonstrate the efficacy and flexibility of GAN-Flow on various physics-based inverse problems of varying ambient dimensionality and prior knowledge using different types of GANs and normalizing flows.

Submitted 7 October, 2023; originally announced October 2023.

arXiv:2310.01407 [pdf, other]

CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation

Authors: Kangfu Mei, Mauricio Delbracio, Hossein Talebi, Zhengzhong Tu, Vishal M. Patel, Peyman Milanfar

Abstract: Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement, restoration, editing, and compositing. However, their widespread adoption is hindered by the high computational cost, which limits their real-time application. To address this challenge, we introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally, a conditional consistency loss enforces consistent predictions across diffusion steps, effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods, achieving a new state-of-the-art in producing high-quality images with very few steps (e.g., 1-4) across multiple tasks, including super-resolution, text-guided image editing, and depth-to-image generation.

Submitted 17 February, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

arXiv:2310.00224 [pdf, other]

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

Authors: Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks

Abstract: Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidance is typically useful only towards synthesizing high-level semantics rather than editing fine-grained details as in image-to-image translation tasks. To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation. The key idea is to steer the image generation of the diffusion model at inference time via designing a loss using a pre-trained inverse model that characterizes the conditional task. This loss modulates the sampling trajectory of the diffusion process. Our framework allows for easy incorporation of multiple conditions during inference. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution. Our results demonstrate clear qualitative and quantitative improvements over state-of-the-art diffusion-based plug-and-play models while adding negligible additional computational cost.

Submitted 29 September, 2023; originally announced October 2023.

Comments: Accepted at ICCV 2023

arXiv:2309.11677 [pdf, ps, other]

Cycle Partitions in Dense Regular Digraphs and Oriented Graphs

Authors: Allan Lo, Viresh Patel, Mehmet Akif Yıldız

Abstract: A conjecture of Jackson from 1981 states that every $d$-regular oriented graph on $n$ vertices with $n\leq 4d+1$ is Hamiltonian. We prove this conjecture for sufficiently large $n$. In fact we prove a more general result that for all $α>0$, there exists $n_0=n_0(α)$ such that every $d$-regular digraph on $n\geq n_0$ vertices with $d \geq αn $ can be covered by at most $n/(d+1)$ vertex-disjoint cycles, and moreover that if $G$ is an oriented graph, then at most $n/(2d+1)$ cycles suffice.
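
Reader's note: the abstract does not spell out why the cycle-cover bound yields Jackson's conjecture for large $n$. The short derivation below is a sketch reconstructed from the statement above, not taken from the paper. It assumes that "covered" means every vertex lies on one of the cycles, writes $k$ (a symbol introduced here) for the number of cycles used, and takes $α = 1/5$ as one admissible choice.

```latex
\begin{align*}
n \le 4d+1 \;&\Longrightarrow\; d \ge \frac{n-1}{4} \ge \frac{n}{5} \quad (n \ge 5), \\
k \;\le\; \frac{n}{2d+1} \;&\le\; \frac{4d+1}{2d+1} \;=\; 2 - \frac{1}{2d+1} \;<\; 2 .
\end{align*}
```

Since $k$ is a positive integer, $k = 1$: under these assumptions a single cycle covers all $n$ vertices, which is a Hamiltonian cycle.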

Submitted 7 June, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: 33 pages, 1 figure

MSC Class: 05C35; 05C38; 05C20; 05C70

arXiv:2309.10928 [pdf, ps, other]

Improved bounds for the zeros of the chromatic polynomial via Whitney's Broken Circuit Theorem

Authors: Matthew Jenssen, Viresh Patel, Guus Regts

Abstract: We prove that for any graph $G$ of maximum degree at most $Δ$, the zeros of its chromatic polynomial $χ_G(x)$ (in $\mathbb{C}$) lie inside the disc of radius $5.94 Δ$ centered at $0$. This improves on the previously best known bound of approximately $6.91Δ$. We also obtain improved bounds for graphs of high girth. We prove that for every $g$ there is a constant $K_g$ such that for any graph $G$ of maximum degree at most $Δ$ and girth at least $g$, the zeros of its chromatic polynomial $χ_G(x)$ lie inside the disc of radius $K_g Δ$ centered at $0$, where $K_g$ is the solution to a certain optimization problem. In particular, $K_g < 5$ when $g \geq 5$ and $K_g < 4$ when $g \geq 25$ and $K_g$ tends to approximately $3.86$ as $g \to \infty$. Key to the proof is a classical theorem of Whitney which allows us to relate the chromatic polynomial of a graph $G$ to the generating function of so-called broken-circuit-free forests in $G$. We also establish a zero-free disc for the generating function of all forests in $G$ (aka the partition function of the arboreal gas) which may be of independent interest.

Submitted 19 September, 2023; originally announced September 2023.

Comments: 16 pages

arXiv:2309.10894 [pdf, other]

A Novel Gradient Methodology with Economical Objective Function Evaluations for Data Science Applications

Authors: Christian Varner, Vivak Patel

Abstract: Gradient methods are experiencing a growth in methodological and theoretical developments owing to the challenges posed by optimization problems arising in data science. However, such gradient methods face diverging optimality gaps or exploding objective evaluations when applied to optimization problems with realistic properties for data science applications. In this work, we address this gap by developing a generic methodology that economically uses objective function evaluations in a problem-driven manner to prevent optimality gap divergence and avoid explosions in objective evaluations. Our methodology allows for a variety of step size routines and search direction strategies. Furthermore, we develop a particular, novel step size selection methodology that is well-suited to our framework. We show that our specific procedure is highly competitive with standard optimization methods on CUTEst test problems. We then show our specific procedure is highly favorable relative to standard optimization methods on a particularly tough data science problem: learning the parameters in a generalized estimating equation model. Thus, we provide a novel gradient methodology that is better suited to optimization problems from this important class of data science applications.

Submitted 16 April, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: 24 pages, 7 figures, 3 tables, 3 algorithms

MSC Class: 90C30; 65K05; 68T09

arXiv:2309.05213 [pdf, other]

Towards Federated Learning Under Resource Constraints via Layer-wise Training and Depth Dropout

Authors: Pengfei Guo, Warren Richard Morningstar, Raviteja Vemulapalli, Karan Singhal, Vishal M. Patel, Philip Andrew Mansfield

Abstract: Large machine learning models trained on diverse data have recently seen unprecedented success. Federated learning enables training on private data that may otherwise be inaccessible, such as domain-specific datasets decentralized across many clients. However, federated learning can be difficult to scale to large models when clients have limited resources. This challenge often results in a trade-off between model size and access to diverse data. To mitigate this issue and facilitate training of large models on edge devices, we introduce a simple yet effective strategy, Federated Layer-wise Learning, to simultaneously reduce per-client memory, computation, and communication costs. Clients train just a single layer each round, reducing resource costs considerably with minimal performance degradation. We also introduce Federated Depth Dropout, a complementary technique that randomly drops frozen layers during training, to further reduce resource usage. Coupling these two techniques enables us to effectively train significantly larger models on edge devices. Specifically, we reduce training memory usage by 5x or more in federated self-supervised representation learning and demonstrate that performance in downstream tasks is comparable to conventional federated self-supervised learning.

Submitted 10 September, 2023; originally announced September 2023.

arXiv:2308.15615 [pdf]

Nitrogen Precooling Heat Exchanger replacement and control system upgrade in Superfluid Cryoplant at CMTF

Authors: J. Subedi, B. Hansen, M. White, V. Patel, J. Makara, O. Atassi, G. Johnson

Abstract: Liquid Nitrogen precooling is used in most Cryoplants to achieve cooldown to 80 K temperature range. In one such system at Fermilab's CMTF Superfluid Cryoplant, where the Helium supply directly exchanges heat with liquid Nitrogen, freezing of Nitrogen occurred inside the heat exchanger due to heat exchanger imbalance during a Cryoplant trip. Trapped vapor pockets of N2 within the frozen heat exchanger channels were formed while warming up the heat exchanger, creating high localized pressure and subsequent damage/rupture of the heat exchanger. Replacement of the heat exchanger was done, and modifications were made in the system to rectify future occurrences. The control system was updated to bypass the heat exchanger entirely if the incoming Helium stream temperature drops below 76 K. This was done by repurposing two control valves as heat exchanger bypass valves that were previously used for a redundant 80 K adsorber in the coldbox. Additional modifications were made to further prevent return of large amount of cold Helium gas from cold end during abrupt Cryoplant shutdown. This modification has ensured high reliability of heat exchanger with prevention of freezing of Nitrogen which can damage the heat exchanger.

Submitted 29 August, 2023; originally announced August 2023.

Comments: Cryogenic Eng Conf and Intnl Cryo Materials Conf (CEC/ICMC 2023)

Report number: FERMILAB-CONF-23-379-TD

arXiv:2308.04035 [pdf, other]

Cross-Dataset Adaptation for Instrument Classification in Cataract Surgery Videos

Authors: Jay N. Paranjape, Shameema Sikder, Vishal M. Patel, S. Swaroop Vedula

Abstract: Surgical tool presence detection is an important part of the intra-operative and post-operative analysis of a surgery. State-of-the-art models, which perform this task well on a particular dataset, however, perform poorly when tested on another dataset. This occurs due to a significant domain shift between the datasets resulting from the use of different tools, sensors, data resolution etc. In this paper, we highlight this domain shift in the commonly performed cataract surgery and propose a novel end-to-end Unsupervised Domain Adaptation (UDA) method called the Barlow Adaptor that addresses the problem of distribution shift without requiring any labels from another domain. In addition, we introduce a novel loss called the Barlow Feature Alignment Loss (BFAL) which aligns features across different domains while reducing redundancy and the need for higher batch sizes, thus improving cross-dataset performance. The use of BFAL is a novel approach to address the challenge of domain shift in cataract surgery data. Extensive experiments are conducted on two cataract surgery datasets and it is shown that the proposed method outperforms the state-of-the-art UDA methods by 6%. The code can be found at https://github.com/JayParanjape/Barlow-Adaptor

Submitted 31 July, 2023; originally announced August 2023.

Comments: MICCAI 2023

arXiv:2308.03726 [pdf, other]

AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation

Authors: Jay N. Paranjape, Nithin Gopalakrishnan Nair, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel

Abstract: Segmentation is a fundamental problem in surgical scene analysis using artificial intelligence. However, the inherent data scarcity in this domain makes it challenging to adapt traditional segmentation techniques for this task. To tackle this issue, current research employs pretrained models and finetunes them on the given data. Even so, these require training deep networks with millions of parameters every time new data becomes available. A recently published foundation model, Segment-Anything (SAM), generalizes well to a large variety of natural images, hence tackling this challenge to a reasonable extent. However, SAM does not generalize well to the medical domain as is without utilizing a large amount of compute resources for fine-tuning and using task-specific prompts. Moreover, these prompts are in the form of bounding-boxes or foreground/background points that need to be annotated explicitly for every image, making this solution increasingly tedious with higher data size. In this work, we propose AdaptiveSAM - an adaptive modification of SAM that can adjust to new datasets quickly and efficiently, while enabling text-prompted segmentation. For finetuning AdaptiveSAM, we propose an approach called bias-tuning that requires a significantly smaller number of trainable parameters than SAM (less than 2%). At the same time, AdaptiveSAM requires negligible expert intervention since it uses free-form text as prompt and can segment the object of interest with just the label name as prompt. Our experiments show that AdaptiveSAM outperforms current state-of-the-art methods on various medical imaging datasets including surgery, ultrasound and X-ray. Code is available at https://github.com/JayParanjape/biastuning

Submitted 7 August, 2023; originally announced August 2023.

Comments: 10 pages, 6 figures, 5 tables

arXiv:2307.16896 [pdf, other]

Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training

Authors: Jeya Maria Jose Valanarasu, Yucheng Tang, Dong Yang, Ziyue Xu, Can Zhao, Wenqi Li, Vishal M. Patel, Bennett Landman, Daguang Xu, Yufan He, Vishwesh Nath

Abstract: Harnessing the power of pre-training on large-scale datasets like ImageNet forms a fundamental building block for the progress of representation learning-driven solutions in computer vision. Medical images are inherently different from natural images as they are acquired in the form of many modalities (CT, MR, PET, Ultrasound etc.) and contain granulated information like tissue, lesion, organs etc. These characteristics of medical images require special attention towards learning features representative of local context. In this work, we focus on designing an effective pre-training framework for 3D radiology images. First, we propose a new masking strategy called local masking where the masking is performed across channel embeddings instead of tokens to improve the learning of local feature representations. We combine this with classical low-level perturbations like adding noise and downsampling to further enable low-level representation learning. To this end, we introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations. Additionally, we also devise a cross-modal contrastive loss (CMCL) to accommodate the pre-training of multiple modalities in a single framework. We curate a large-scale dataset to enable pre-training of 3D medical radiology images (MRI and CT). The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance. Notably, our proposed method tops the public test leaderboard of BTCV multi-organ segmentation challenge.

Submitted 31 July, 2023; originally announced July 2023.

Comments: Preprint

arXiv:2307.11081 [pdf, other]

GLSFormer: Gated - Long, Short Sequence Transformer for Step Recognition in Surgical Videos

Authors: Nisarg A. Shah, Shameema Sikder, S. Swaroop Vedula, Vishal M. Patel

Abstract: Automated surgical step recognition is an important task that can significantly improve patient safety and decision-making during surgeries. Existing state-of-the-art methods for surgical step recognition either rely on separate, multi-stage modeling of spatial and temporal information or operate on short-range temporal resolution when learned jointly. However, the benefits of joint modeling of spatio-temporal features and long-range information are not taken into account. In this paper, we propose a vision transformer-based approach to jointly learn spatio-temporal features directly from a sequence of frame-level patches. Our method incorporates a gated-temporal attention mechanism that intelligently combines short-term and long-term spatio-temporal feature representations. We extensively evaluate our approach on two cataract surgery video datasets, namely Cataract-101 and D99, and demonstrate superior performance compared to various state-of-the-art methods. These results validate the suitability of our proposed approach for automated surgical step recognition. Our code is released at: https://github.com/nisargshah1999/GLSFormer

Submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted to MICCAI 2023 (Early Accept)

arXiv:2307.01815 [pdf, ps, other]

On perfect powers that are sums of cubes of a nine term arithmetic progression

Authors: Nirvana Coppola, Mar Curcó-Iranzo, Maleeha Khawaja, Vandita Patel, Özge Ülkem

Abstract: We study the equation $(x-4r)^3 + (x-3r)^3 + (x-2r)^3+(x-r)^3 + x^3 + (x+r)^3+(x+2r)^3 + (x+3r)^3 + (x+4r)^3 = y^p$, which is a natural continuation of previous works carried out by A. Argáez-García and the fourth author (perfect powers that are sums of cubes of a three, five and seven term arithmetic progression). Under the assumptions $0 < r \leq 10^6$, $p \geq 5 $ a prime and $\gcd(x, r) = 1$, we show that solutions must satisfy $xy=0$. Moreover, we study the equation for prime exponents $2$ and $3$ in greater detail. Under the assumptions $r>0$ a positive integer and $\gcd(x, r) = 1$ we show that there are infinitely many solutions for $p=2$ and $p=3$ via explicit constructions using integral points on elliptic curves. We use an amalgamation of methods in computational and algebraic number theory to overcome the increased computational challenge. Most notable is a significant computational efficiency obtained through appealing to Bilu, Hanrot and Voutier's Primitive Divisor Theorem and the method of Chabauty, as well as employing a Thue equation solver earlier on.
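
Reader's note: the nine-term left-hand side collapses to a compact form, which may make the equation easier to read. The expansion below is an elementary identity worked out here for illustration, not quoted from the paper; it pairs the terms $x \pm kr$ so that the odd powers of $r$ cancel.

```latex
\begin{align*}
(x+kr)^3 + (x-kr)^3 &= 2x^3 + 6k^2 x r^2, \\
\sum_{k=-4}^{4} (x+kr)^3 &= x^3 + \sum_{k=1}^{4}\bigl(2x^3 + 6k^2 x r^2\bigr)
  = 9x^3 + 6(1+4+9+16)\,x r^2 \\
&= 9x^3 + 180\,x r^2 = 9x\,(x^2 + 20 r^2).
\end{align*}
```

The equation under study is therefore equivalent to $9x(x^2 + 20r^2) = y^p$; in particular $x = 0$ forces $y = 0$, consistent with the solutions satisfying $xy = 0$.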

Submitted 19 September, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

Comments: 12 pages

MSC Class: Primary 11D61; Secondary 11D41; 11D59; 11J86; 14H52

arXiv:2306.16654 [pdf, other]

Self-Supervised MRI Reconstruction with Unrolled Diffusion Models

Authors: Yilmaz Korkmaz, Tolga Cukur, Vishal M. Patel

Abstract: Magnetic Resonance Imaging (MRI) produces excellent soft tissue contrast, albeit it is an inherently slow imaging modality. Promising deep learning methods have recently been proposed to reconstruct accelerated MRI scans. However, existing methods still suffer from various limitations regarding image fidelity, contextual sensitivity, and reliance on fully-sampled acquisitions for model training. To comprehensively address these limitations, we propose a novel self-supervised deep reconstruction model, named Self-Supervised Diffusion Reconstruction (SSDiffRecon). SSDiffRecon expresses a conditional diffusion process as an unrolled architecture that interleaves cross-attention transformers for reverse diffusion steps with data-consistency blocks for physics-driven processing. Unlike recent diffusion methods for MRI reconstruction, a self-supervision strategy is adopted to train SSDiffRecon using only undersampled k-space data. Comprehensive experiments on public brain MR datasets demonstrate the superiority of SSDiffRecon against state-of-the-art supervised and self-supervised baselines in terms of reconstruction speed and quality. Implementation will be available at https://github.com/yilmazkorkmaz1/SSDiffRecon.

Submitted 15 April, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.05168 [pdf, ps, other]

Power values of power sums: a survey

Authors: Nirvana Coppola, Mar Curcó-Iranzo, Maleeha Khawaja, Vandita Patel, Özge Ülkem

Abstract: Research on power values of power sums has gained much attention of late, partially due to the explosion of refinements in multiple advanced tools in (computational) Number Theory in recent years. In this survey, we present the key tools and techniques employed thus far in the (explicit) resolution of Diophantine problems, as well as an overview of existing results. We also state some open problems that naturally arise in the process.

Submitted 27 July, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Added additional references and open problems. This collaboration was formed from the Women in Numbers Europe 4 workshop

arXiv:2305.16310 [pdf, other]

Securing Deep Generative Models with Universal Adversarial Signature

Authors: Yu Zeng, Mo Zhou, Yuan Xue, Vishal M. Patel

Abstract: Recent advances in deep generative models have led to the development of methods capable of synthesizing high-quality, realistic images. These models pose threats to society due to their potential misuse. Prior research attempted to mitigate these threats by detecting generated images, but the varying traces left by different generative models make it challenging to create a universal detector capable of generalizing to new, unseen generative models. In this paper, we propose to inject a universal adversarial signature into an arbitrary pre-trained generative model, in order to make its generated contents more detectable and traceable. First, the imperceptible optimal signature for each image can be found by a signature injector through adversarial training. Subsequently, the signature can be incorporated into an arbitrary generator by fine-tuning it with the images processed by the signature injector. In this way, the detector corresponding to the signature can be reused for any fine-tuned generator for tracking the generator identity. The proposed method is validated on the FFHQ and ImageNet datasets with various state-of-the-art generative models, consistently showing a promising detection rate. Code will be made publicly available at https://github.com/zengxianyu/genwm.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.14674 [pdf, other]

T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities

Authors: Kangfu Mei, Mo Zhou, Vishal M. Patel

Abstract: Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation of various modalities including images, videos, and 3D geometry, it does not scale to a higher data resolution. This can be attributed to the "scaling property", where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising a view-wise sampling algorithm to focus on local structure learning, and incorporating additional guidance, e.g., text description, to complement the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation.

Submitted 23 May, 2023; originally announced May 2023.

Comments: for project page, see https://t1-diffusion-model.github.io

arXiv:2305.06402 [pdf, ps, other]

Analyzing Bias in Diffusion-based Face Generation Models

Authors: Malsha V. Perera, Vishal M. Patel

Abstract: Diffusion models are becoming increasingly popular in synthetic data generation and image editing applications. However, these models can amplify existing biases and propagate them to downstream applications. Therefore, it is crucial to understand the sources of bias in their outputs. In this paper, we investigate the presence of bias in diffusion-based face generation models with respect to attributes such as gender, race, and age. Moreover, we examine how dataset size affects the attribute composition and perceptual quality of both diffusion and Generative Adversarial Network (GAN) based face generation models across various attribute classes. Our findings suggest that diffusion models tend to worsen distribution bias in the training data for various attributes, which is heavily influenced by the size of the dataset. Conversely, GAN models trained on balanced datasets with a larger number of samples show less bias across different attributes.

Submitted 10 May, 2023; originally announced May 2023.
