Search | arXiv e-print repository

arXiv:2407.01926 [pdf]

Chemical Shift Encoding based Double Bonds Quantification in Triglycerides using Deep Image Prior

Authors: Chaoxing Huang, Ziqiang Yu, Zijian Gao, Qiuyi Shen, Queenie Chan, Vincent Wai-Sun Wong, Winnie Chiu-Wing Chu, Weitian Chen

Abstract: This study evaluated a deep learning-based method using Deep Image Prior (DIP) to quantify triglyceride double bonds from chemical-shift encoded multi-echo gradient echo images without network training. We employed a cost function based on signal constraints to iteratively update the neural network on a single dataset. The method was validated using phantom experiments and in vivo scans. Results showed close alignment between measured and reference double bond values, with phantom experiments yielding a Pearson correlation coefficient of 0.96 (p = .0005). In vivo results demonstrated good agreement in subcutaneous fat. We conclude that Deep Image Prior shows feasibility for quantifying double bonds and fatty acid content from chemical-shift encoded multi-echo MRI.

Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.01521 [pdf, other]

Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

Authors: Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar, Yang Song

Abstract: Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems. For example, we achieve a PSNR of 30.72 dB on the FFHQ 256 dataset for phase retrieval, which is an improvement of 9.12 dB compared to existing methods.

Submitted 1 July, 2024; originally announced July 2024.
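The PSNR figures quoted in the abstract above follow the usual definition, 10·log10(MAX² / MSE). The sketch below is only an illustration of that metric, not code from the paper; the function name `psnr`, the flat pixel lists, and the peak value of 1.0 are assumptions made for the example.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `max_value` is the assumed peak intensity (1.0 for images scaled to [0, 1]).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01.
    print(psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1]))
```

Because the scale is logarithmic, a gain of 9.12 dB at a fixed peak value corresponds to the mean squared error shrinking by a factor of about 10^0.912, roughly eightfold.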

arXiv:2406.12002 [pdf, other]

Modeling, Inference, and Prediction in Mobility-Based Compartmental Models for Epidemiology

Authors: Ning Jiang, Weiqi Chu, Yao Li

Abstract: Classical compartmental models in epidemiology often struggle to accurately capture real-world dynamics due to their inability to address the inherent heterogeneity of populations. In this paper, we introduce a novel approach that incorporates heterogeneity through a mobility variable, transforming the traditional ODE system into a system of integro-differential equations that describe the dynamics of population densities across different compartments. Our results show that, for the same basic reproduction number, our mobility-based model predicts a smaller final pandemic size compared to classic compartmental models, whose population densities are represented as Dirac delta functions in our density-based framework. This addresses the overestimation issue common in many classical models. Additionally, we demonstrate that the time series of the infected population is sufficient to uniquely identify the mobility distribution. We reconstruct this distribution using a machine-learning-based framework, providing both theoretical and algorithmic support to effectively constrain the mobility-based model with real-world data.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 16 pages, 7 figures
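For context on the "final pandemic size" comparison in the abstract above, the classical homogeneous SIR model has a textbook final-size relation, z = 1 − exp(−R0·z), where z is the fraction of the population ever infected. The sketch below solves it by fixed-point iteration. It illustrates only the classical baseline, not the paper's mobility-based model, and it assumes a fully susceptible population with a negligible initial infected fraction; the function name and tolerances are illustrative.

```python
import math


def classical_final_size(r0, tol=1e-12, max_iter=10_000):
    """Solve z = 1 - exp(-r0 * z) for the classical SIR attack rate z."""
    if r0 <= 1.0:
        # Below the epidemic threshold the only root in [0, 1] is z = 0.
        return 0.0
    z = 0.5  # any positive start converges to the positive root when r0 > 1
    for _ in range(max_iter):
        z_next = 1.0 - math.exp(-r0 * z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    # Convergence is slow when r0 is only slightly above 1.
    raise RuntimeError("fixed-point iteration did not converge")


if __name__ == "__main__":
    for r0 in (1.5, 2.0, 3.0):
        print(r0, classical_final_size(r0))
```

The abstract's claim is that, for the same basic reproduction number, the mobility-based model yields a final size smaller than the value this classical relation gives.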

arXiv:2405.17401 [pdf, other]

RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control

Authors: Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu

Abstract: We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play solution for training-free personalization of diffusion models. Existing training-free approaches exhibit difficulties in (a) style extraction from reference images in the absence of additional style or content text descriptions, (b) unwanted content leakage from reference style images, and (c) effective composition of style and content. RB-Modulation is built on a novel stochastic optimal controller where a style descriptor encodes the desired attributes through a terminal cost. The resulting drift not only overcomes the difficulties above, but also ensures high fidelity to the reference style and adheres to the given text prompt. We also introduce a cross-attention-based feature aggregation scheme that allows RB-Modulation to decouple content and style from the reference image. With theoretical justification and empirical evidence, our framework demonstrates precise extraction and control of content and style in a training-free manner. Further, our method allows a seamless composition of content and style, which marks a departure from the dependency on external adapters or ControlNets.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Preprint. Under review

arXiv:2405.03141 [pdf, other]

Automatic Ultrasound Curve Angle Measurement via Affinity Clustering for Adolescent Idiopathic Scoliosis Evaluation

Authors: Yihao Zhou, Timothy Tin-Yan Lee, Kelly Ka-Lee Lai, Chonglin Wu, Hin Ting Lau, De Yang, Chui-Yi Chan, Winnie Chiu-Wing Chu, Jack Chun-Yiu Cheng, Tsz-Ping Lam, Yong-Ping Zheng

Abstract: The current clinical gold standard for evaluating adolescent idiopathic scoliosis (AIS) is X-ray radiography, using Cobb angle measurement. However, the frequent monitoring of the AIS progression using X-rays poses a challenge due to the cumulative radiation exposure. Although 3D ultrasound has been validated as a reliable and radiation-free alternative for scoliosis assessment, the process of measuring spinal curvature is still carried out manually. Consequently, there is a considerable demand for a fully automatic system that can locate bony landmarks and perform angle measurements. To this end, we introduce an estimation model for automatic ultrasound curve angle (UCA) measurement. The model employs a dual-branch network to detect candidate landmarks and perform vertebra segmentation on ultrasound coronal images. An affinity clustering strategy is utilized within the vertebral segmentation area to illustrate the affinity relationship between candidate landmarks. Subsequently, we can efficiently perform line delineation from a clustered affinity map for UCA measurement. As our method is specifically designed for UCA calculation, this method outperforms other state-of-the-art methods for landmark and line detection tasks. The high correlation between the automatic UCA and Cobb angle (R$^2$=0.858) suggests that our proposed method can potentially replace manual UCA measurement in ultrasound scoliosis assessment.

Submitted 6 May, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.02280 [pdf, other]

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

Authors: Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki

Abstract: View-predictive generative models provide strong priors for lifting object-centric images and videos into 3D and 4D through rendering and score distillation objectives. A question then remains: what about lifting complete multi-object dynamic scenes? There are two challenges in this direction: first, rendering error gradients are often insufficient to recover fast object motion, and second, view-predictive generative models work much better for objects than whole scenes, so score distillation objectives cannot currently be applied at the scene level directly. We present DreamScene4D, the first approach to generate 3D dynamic scenes of multiple objects from monocular videos via 360-degree novel view synthesis. Our key insight is a "decompose-recompose" approach that factorizes the video scene into the background and object tracks, while also factorizing object motion into 3 components: object-centric deformation, object-to-world-frame transformation, and camera motion. Such decomposition permits rendering error gradients and object view-predictive models to recover object 3D completions and deformations while bounding box tracks guide the large object movements in the scene. We show extensive results on challenging DAVIS, Kubric, and self-captured videos with quantitative comparisons and a user preference study. Besides 4D scene generation, DreamScene4D obtains accurate 2D persistent point tracks by projecting the inferred 3D trajectories to 2D. We will release our code and hope our work will stimulate more research on fine-grained 4D understanding from videos.

Submitted 23 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: Project page: https://dreamscene4d.github.io/

arXiv:2404.09891 [pdf, ps, other]

Convolution Identities of Stirling Numbers

Authors: Nadia Na Li, Wenchang Chu

Abstract: By means of the generating function method, a linear recurrence relation is explicitly resolved. The solution is expressed in terms of the Stirling numbers of both the first and the second kind. Two remarkable pairs of combinatorial identities are established as applications, that contain some well-known convolution formulae on Stirling numbers as special cases.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 7 pages

MSC Class: 05A10; 11B65
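As background for the Stirling numbers mentioned in the abstract above, the sketch below computes both kinds from their standard recurrences and checks the textbook orthogonality relation, sum over k of s(n,k)·S(k,m) = δ(n,m), where s is the signed first kind and S the second kind. This is a classical identity chosen for illustration, not one of the identities established in the paper, and all function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling1_unsigned(n, k):
    """Unsigned Stirling number of the first kind: permutations of n items with k cycles."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return stirling1_unsigned(n - 1, k - 1) + (n - 1) * stirling1_unsigned(n - 1, k)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n items into k non-empty blocks."""
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0:
        return 0
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def stirling1_signed(n, k):
    """Signed first-kind number, s(n, k) = (-1)^(n-k) * c(n, k)."""
    return (-1) ** (n - k) * stirling1_unsigned(n, k)


def orthogonality_sum(n, m):
    """Sum over k of s(n, k) * S(k, m); equals 1 if n == m and 0 otherwise."""
    return sum(stirling1_signed(n, k) * stirling2(k, m) for k in range(m, n + 1))


if __name__ == "__main__":
    print(stirling1_unsigned(4, 2), stirling2(4, 2))
    print([[orthogonality_sum(n, m) for m in range(5)] for n in range(5)])
```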

arXiv:2403.18270 [pdf, other]

Image Deraining via Self-supervised Reinforcement Learning

Authors: He-Hao Liao, Yan-Tsung Peng, Wen-Tao Chu, Ping-Chun Hsieh, Chung-Chi Tsai

Abstract: The quality of images captured outdoors is often affected by the weather. One factor that interferes with sight is rain, which can obstruct the view of observers and computer vision applications that rely on those images. The work aims to recover rain images by removing rain streaks via Self-supervised Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain streak pixels from the input rain image via dictionary learning and use pixel-wise RL agents to take multiple inpainting actions to remove rain progressively. To our knowledge, this work is the first attempt where self-supervised RL is applied to image deraining. Experimental results on several benchmark image-deraining datasets show that the proposed SRL-Derain performs favorably against state-of-the-art few-shot and self-supervised deraining and denoising methods.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2403.02329 [pdf, other]

COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems against Semantic Attacks

Authors: Zijian Huang, Wenda Chu, Linyi Li, Chejian Xu, Bo Li

Abstract: Multi-sensor fusion systems (MSFs) play a vital role as the perception module in modern autonomous vehicles (AVs). Therefore, ensuring their robustness against common and realistic adversarial semantic transformations, such as rotation and shifting in the physical world, is crucial for the safety of AVs. While empirical evidence suggests that MSFs exhibit improved robustness compared to single-modal models, they are still vulnerable to adversarial semantic transformations. Despite the proposal of empirical defenses, several works show that these defenses can be attacked again by new adaptive attacks. So far, there is no certified defense proposed for MSFs. In this work, we propose the first robustness certification framework, COMMIT, to certify robustness of multi-sensor fusion systems against semantic attacks. In particular, we propose a practical anisotropic noise mechanism that leverages randomized smoothing with multi-modal data and performs a grid-based splitting method to characterize complex semantic transformations. We also propose efficient algorithms to compute the certification in terms of object detection accuracy and IoU for large-scale MSF models. Empirically, we evaluate the efficacy of COMMIT in different settings and provide a comprehensive benchmark of certified robustness for different MSF models using the CARLA simulation platform. We show that the certification for MSF models is at most 48.39% higher than that of single-modal models, which validates the advantages of MSF models. We believe our certification framework and benchmark will contribute an important step towards certifiably robust AVs in practice.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.16124 [pdf, other]

AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation

Authors: Yasheng Sun, Wenqing Chu, Hang Zhou, Kaisiyuan Wang, Hideki Koike

Abstract: While considerable progress has been made in achieving accurate lip synchronization for 3D speech-driven talking face generation, the task of incorporating expressive facial detail synthesis aligned with the speaker's speaking status remains challenging. Our goal is to directly leverage the inherent style information conveyed by human speech for generating an expressive talking face that aligns with the speaking status. In this paper, we propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking face generation. This system harnesses the robust contextual reasoning and hallucination capability offered by Large Language Models (LLMs) to instruct the realistic synthesis of 3D talking faces. Instead of directly learning facial movements from human speech, our two-stage strategy involves the LLMs first comprehending audio information and generating instructions implying expressive facial details seamlessly corresponding to the speech. Subsequently, a diffusion-based generative network executes these instructions. This two-stage process, coupled with the incorporation of LLMs, enhances model interpretability and provides users with flexibility to comprehend instructions and specify desired operations or modifications. Extensive experiments showcase the effectiveness of our approach in producing vivid talking faces with expressive facial movements and consistent emotional status.

Submitted 25 February, 2024; originally announced February 2024.

arXiv:2402.13297 [pdf, other]

Integrating Deep Learning and Synthetic Biology: A Co-Design Approach for Enhancing Gene Expression via N-terminal Coding Sequences

Authors: Zhanglu Yan, Weiran Chu, Yuhua Sheng, Kaiwen Tang, Shida Wang, Yanfeng Liu, Weng-Fai Wong

Abstract: N-terminal coding sequence (NCS) influences gene expression by impacting the translation initiation rate. The NCS optimization problem is to find an NCS that maximizes gene expression. The problem is important in genetic engineering. However, current methods for NCS optimization such as rational design and statistics-guided approaches are labor-intensive and yield only relatively small improvements. This paper introduces a deep learning/synthetic biology co-designed few-shot training workflow for NCS optimization. Our method utilizes k-nearest encoding followed by word2vec to encode the NCS, then performs feature extraction using attention mechanisms, before constructing a time-series network for predicting gene expression intensity, and finally a direct search algorithm identifies the optimal NCS with limited training data. We took green fluorescent protein (GFP) expressed by Bacillus subtilis as a reporting protein of NCSs, and employed the fluorescence enhancement factor as the metric of NCS optimization. Within just six iterative experiments, our model generated an NCS (MLD62) that increased average GFP expression by 5.41-fold, outperforming the state-of-the-art NCS designs. Extending our findings beyond GFP, we showed that our engineered NCS (MLD62) can effectively boost the production of N-acetylneuraminic acid by enhancing the expression of the crucial rate-limiting GNA1 gene, demonstrating its practical utility. We have open-sourced our NCS expression database and experimental procedures for public use.

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.06599 [pdf, other]

On the Out-Of-Distribution Generalization of Multimodal Large Language Models

Authors: Xingxuan Zhang, Jiansheng Li, Wenjing Chu, Junjia Hai, Renzhe Xu, Yuqing Yang, Shikai Guan, Jiazheng Xu, Peng Cui

Abstract: We investigate the generalization boundaries of current Multimodal Large Language Models (MLLMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. We evaluate their zero-shot generalization across synthetic images, real-world distributional shifts, and specialized datasets like medical and molecular imagery. Empirical results indicate that MLLMs struggle with generalization beyond common training domains, limiting their direct application without adaptation. To understand the cause of unreliable performance, we analyze three hypotheses: semantic misinterpretation, visual feature extraction insufficiency, and mapping deficiency. Results identify mapping deficiency as the primary hurdle. To address this problem, we show that in-context learning (ICL) can significantly enhance MLLMs' generalization, opening new avenues for overcoming generalization barriers. We further explore the robustness of ICL under distribution shifts and show its vulnerability to domain shifts, label shifts, and spurious correlation shifts between in-context examples and test data.

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2401.17773 [pdf, other]

doi 10.1109/TCSVT.2023.3303945

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

Authors: Xingning Dong, Qingpei Guo, Tian Gan, Qing Wang, Jianlong Wu, Xiangyuan Ren, Yuan Cheng, Wei Chu

Abstract: We present a framework for learning cross-modal video representations by directly pre-training on raw data to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and proxy tasks. First, based on the shortcomings of two mainstream pixel-level pre-training architectures (limited applications or less efficient), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and could support various downstream applications. Second, based on the intuition that people always pay attention to several "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to promote the pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that we establish a new state-of-the-art in pixel-level video-text pre-training; we also achieve a satisfactory balance between the pre-training efficiency and the fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.

Submitted 31 January, 2024; originally announced January 2024.

Comments: Accepted by TCSVT (IEEE Transactions on Circuits and Systems for Video Technology)

arXiv:2401.15362 [pdf, other]

Transformer-based Clipped Contrastive Quantization Learning for Unsupervised Image Retrieval

Authors: Ayush Dubey, Shiv Ram Dubey, Satish Kumar Singh, Wei-Ta Chu

Abstract: Unsupervised image retrieval aims to learn the important visual characteristics without any given label, in order to retrieve similar images for a given query image. Convolutional Neural Network (CNN)-based approaches have been extensively exploited with self-supervised contrastive learning for image hashing. However, the existing approaches suffer from a lack of effective utilization of global features by CNNs and from the bias created by false negative pairs in contrastive learning. In this paper, we propose a TransClippedCLR model that encodes the global context of an image using a Transformer with local context through patch-based processing, generates the hash codes through product quantization, and avoids the potential false negative pairs through clipped contrastive learning. The proposed model shows superior performance for unsupervised image retrieval on benchmark datasets, including CIFAR10, NUS-Wide and Flickr25K, as compared to recent state-of-the-art deep models. The results using the proposed clipped contrastive learning are greatly improved on all datasets as compared to the same backbone network with vanilla contrastive learning.

Submitted 27 January, 2024; originally announced January 2024.

arXiv:2401.09625 [pdf]

doi 10.1016/j.mtphys.2024.101338

Pressure-induced superconductivity in a novel germanium allotrope

Authors: Liangzi Deng, Jianbo Zhang, Yuki Sakai, Zhongjia Tang, Moein Adnani, Rabin Dahal, Alexander P. Litvinchuk, James R. Chelikowsky, Marvin L. Cohen, Russell J. Hemley, Arnold Guloy, Yang Ding, Ching-Wu Chu

Abstract: High-pressure studies on elements play an essential role in superconductivity research, with implications for both fundamental science and applications. Here we report the experimental discovery of surprisingly low pressure driving a novel germanium allotrope into a superconducting state in comparison to that for alpha-Ge. Raman measurements revealed structural phase transitions and possible electronic topological transitions under pressure up to 58 GPa. Based on pressure-dependent resistivity measurements, superconductivity was induced above 2 GPa and the maximum Tc of 6.8 K was observed under 4.6 GPa. Interestingly, a superconductivity enhancement was discovered during decompression, indicating the possibility of maintaining pressure-induced superconductivity at ambient pressure with better superconducting performance. Density functional theory analysis further suggested that the electronic structure of Ge (oP32) is sensitive to its detailed geometry and revealed that disorder in the beta-tin structure leads to a higher Tc in comparison to the perfect beta-tin Ge.

Submitted 17 January, 2024; originally announced January 2024.

Comments: 30 pages, 13 figures

arXiv:2401.04354 [pdf, other]

Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition

Authors: Xuzheng Yu, Chen Jiang, Wei Zhang, Tian Gan, Linlin Chao, Jianan Zhao, Yuan Cheng, Qingpei Guo, Wei Chu

Abstract: With the explosive growth of video data in real-world applications, a comprehensive representation of videos becomes increasingly important. In this paper, we address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos. Due to the diversity and complexity of video contents in realistic scenarios, this task remains a challenge. Most existing works identify scenes for videos only from visual or textual information in a temporal perspective, ignoring the valuable information hidden in single frames, while several earlier studies only recognize scenes for separate images in a non-temporal perspective. We argue that these two perspectives are both meaningful for this task and complementary to each other; meanwhile, externally introduced knowledge can also promote the comprehension of videos. We propose a novel two-stream framework to model video representations from multiple perspectives, i.e. temporal and non-temporal perspectives, and integrate the two perspectives in an end-to-end manner by self-distillation. Besides, we design a knowledge-enhanced feature fusion and label prediction method that contributes to naturally introducing knowledge into the task of video scene recognition. Experiments conducted on a real-world dataset demonstrate the effectiveness of our proposed method.

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2312.00852 [pdf, other]

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

Authors: Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu

Abstract: Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires $O(1)$ compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions.

Submitted 1 December, 2023; originally announced December 2023.

Comments: Preprint

arXiv:2311.08430 [pdf, other]

Rankitect: Ranking Architecture Search Battling World-class Engineers at Meta Scale

Authors: Wei Wen, Kuang-Hung Liu, Igor Fedorov, Xin Zhang, Hang Yin, Weiwei Chu, Kaveh Hassani, Mengying Sun, Jiang Liu, Xu Wang, Lin Jiang, Yuxin Chen, Buyun Zhang, Xi Liu, Dehua Cheng, Zhengxing Chen, Guang Zhao, Fangqiu Han, Jiyan Yang, Yuchen Hao, Liang Xiong, Wen-Yen Chen

Abstract: Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and potential for ranking systems. However, prior work focused on academic problems, which are evaluated at small scale under well-controlled fixed baselines. In industry systems, such as the ranking systems at Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta ranking systems serve billions of users, (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning, (3) dynamic baselines - engineers may have established new and stronger baselines during NAS search, and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch achieving competitive tradeoff between Normalized Entropy loss and FLOPs. When utilizing search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test at Meta scale.

Submitted 13 November, 2023; originally announced November 2023.

Comments: Wei Wen and Kuang-Hung Liu contribute equally

arXiv:2311.06791 [pdf, other]

InfMLLM: A Unified Framework for Visual-Language Tasks

Authors: Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi

Abstract: Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks, particularly image captioning, visual question answering (VQA), and visual grounding. To this end, we implemented a three-stage training scheme: starting with lightweight alignment pretraining, then moderate-weight multitask hybrid training, and finally, LLM fine-tuning to improve instruction following capability. Throughout the training process, the requirements on GPU memory gradually increase. To effectively manage the number of visual embeddings passed to the LLM while preserving their positional information, we introduce a straightforward visual adapter module dubbed pool-adapter. Our experiments demonstrate that preserving the positional information of visual embeddings through the pool-adapter is particularly beneficial for tasks like visual grounding. We name our proposed approach InfMLLM and have evaluated it extensively on various benchmark datasets. Our results demonstrate that InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs. The code and model will be made open-source at: https://github.com/mightyzau/InfMLLM.

Submitted 6 December, 2023; v1 submitted 12 November, 2023; originally announced November 2023.

Comments: 8

arXiv:2311.03558 [pdf]

Replication and study of anomalies in LK-99--the alleged ambient-pressure, room-temperature superconductor

Authors: T. Habamahoro, T. Bontke, M. Chirom, Z. Wu, J. M. Bao, L. Z. Deng, C. W. Chu

Abstract: We have studied LK-99 [Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O], alleged by Lee et al. to exhibit superconductivity above room temperature and at ambient pressure, and have reproduced all anomalies in electric and magnetic measurements that they reported as evidence for the claim of LK-99 being an ambient-pressure, room-temperature superconductor. We found that these anomalies are associated with the structural transition of the Cu$_2$S impurity in their sample and not with superconductivity.

Submitted 6 November, 2023; originally announced November 2023.

Comments: 15 pages, 7 figures

arXiv:2310.19847 [pdf, ps, other]

Integrals of Hyperbolic Tangent Function

Authors: Jing Li, Wenchang Chu

Abstract: By means of the contour integration method, we evaluate, in closed form, a class of definite integrals involving hyperbolic tangent function.

Submitted 30 October, 2023; originally announced October 2023.

MSC Class: 33E20; 11M32

arXiv:2310.06992 [pdf, other]

Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models

Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas Guibas, Katerina Fragkiadaki

Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This begs the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator, into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tractor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box regression module of the visual detector, and prompt an open-world segmenter with the refined box to segment the objects. We decide the termination of an object track based on the objectness score of the propagated boxes, as well as forward-backward optical flow consistency. We re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms previous state-of-the-art in UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.

Submitted 25 January, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: Project page available at https://wenhsuanchu.github.io/ovtracktor/

arXiv:2309.15458 [pdf, other]

LogicMP: A Neuro-symbolic Approach for Encoding First-order Logic Constraints

Authors: Weidi Xu, Jingwei Wang, Lele Xie, Jianshan He, Hongting Zhou, Taifeng Wang, Xiaopei Wan, Jingdong Chen, Chao Qu, Wei Chu

Abstract: Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, whose layers perform mean-field variational inference over an MLN. It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency. By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations effectively mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations. Empirical results in three kinds of tasks over graphs, images, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.

Submitted 16 April, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: 28 pages, 14 figures, 12 tables

arXiv:2309.11091 [pdf, other]

doi 10.1145/3474085.3475301

Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

Authors: Chen Jiang, Kaiming Huang, Sifeng He, Xudong Yang, Wei Zhang, Xiaobo Zhang, Yuan Cheng, Lei Yang, Qing Wang, Furong Xu, Tan Pan, Wei Chu

Abstract: With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) becomes increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end time of similar segments in finer granularity, which is beneficial for user browsing efficiency and infringement detection especially in long video scenarios. The challenge of S-CBVR task is how to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) in dealing with the challenge which is firstly trained end-to-end in S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) An efficient Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features, (2) A robust Similarity Pattern Detection (SPD) module for temporal alignment. In comparison with uniform frame extraction, SKE not only saves feature storage and search time, but also introduces comparable accuracy and limited extra computation time. In terms of temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key modules SKE and SPD can also be effectively inserted into other video retrieval pipelines and gain considerable performance improvements. Experimental results on public datasets show that SSAN can obtain higher alignment accuracy while saving storage and online query computational cost compared to existing methods.

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted by ACM MM 2021

arXiv:2309.11082 [pdf, other]

doi 10.1145/3581783.3612006

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Authors: Chen Jiang, Hong Liu, Xuzheng Yu, Qing Wang, Yuan Cheng, Jia Xu, Zhongyi Liu, Qingpei Guo, Wei Chu, Ming Yang, Yuan Qi

Abstract: In recent years, the explosion of web videos makes text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.

Submitted 26 January, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted by ACM MM 2023

arXiv:2309.08825 [pdf, other]

Distributionally Robust Post-hoc Classifiers under Prior Shifts

Authors: Jiaheng Wei, Harikrishna Narasimhan, Ehsan Amid, Wen-Sheng Chu, Yang Liu, Abhishek Kumar

Abstract: The generalization ability of machine learning models degrades significantly when the test distribution shifts away from the training distribution. We investigate the problem of training models that are robust to shifts caused by changes in the distribution of class-priors or group-priors. The presence of skewed training priors can often lead to the models overfitting to spurious features. Unlike existing methods, which optimize for either the worst or the average performance over classes or groups, our work is motivated by the need for finer control over the robustness properties of the model. We present an extremely lightweight post-hoc approach that performs scaling adjustments to predictions from a pre-trained model, with the goal of minimizing a distributionally robust loss around a chosen target distribution. These adjustments are computed by solving a constrained optimization problem on a validation set and applied to the model during test time. Our constrained optimization objective is inspired by a natural notion of robustness to controlled distribution shifts. Our method comes with provable guarantees and empirically makes a strong case for distributional robust post-hoc classifiers. An empirical implementation is available at https://github.com/weijiaheng/Drops.

Submitted 15 September, 2023; originally announced September 2023.

Comments: Camera ready version, accepted at ICLR 2023

arXiv:2309.03508 [pdf, other]

doi 10.1109/TIP.2023.3315151

Dynamic Frame Interpolation in Wavelet Domain

Authors: Lingtong Kong, Boyuan Jiang, Donghao Luo, Wenqing Chu, Ying Tai, Chengjie Wang, Jie Yang

Abstract: Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, that can result in lots of inefficient computation. On the other hand, the computation compression degree in frame interpolation is highly dependent on both texture distribution and scene motion, which demands to understand the spatial-temporal information of each input frame pair for a better compression degree selection. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation in each scale are determined by a crucial threshold ratio. Instead of setting a fixed value like previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On the common high resolution and animation frame interpolation benchmarks, proposed WaveletVFI can reduce computation up to 40% while maintaining similar accuracy, making it perform more efficiently against other state-of-the-arts. Code is available at https://github.com/ltkong218/WaveletVFI.

Submitted 20 September, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: Accepted by IEEE TIP

arXiv:2309.00398 [pdf, other]

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

Authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang

Abstract: In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality easily-available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation. See https://videogen.github.io/VideoGen/ for more samples.

Submitted 7 September, 2023; v1 submitted 1 September, 2023; originally announced September 2023.

Comments: 8 pages, 8 figures, project page: https://videogen.github.io/VideoGen/

arXiv:2307.02736 [pdf]

An Uncertainty Aided Framework for Learning based Liver $T_1ρ$ Mapping and Analysis

Authors: Chaoxing Huang, Vincent Wai Sun Wong, Queenie Chan, Winnie Chiu Wing Chu, Weitian Chen

Abstract: Objective: Quantitative $T_1ρ$ imaging has potential for assessment of biochemical alterations of liver pathologies. Deep learning methods have been employed to accelerate quantitative $T_1ρ$ imaging. To employ artificial intelligence-based quantitative imaging methods in complicated clinical environment, it is valuable to estimate the uncertainty of the predicted $T_1ρ$ values to provide the confidence level of the quantification results. The uncertainty should also be utilized to aid the post-hoc quantitative analysis and model learning tasks. Approach: To address this need, we propose a parametric map refinement approach for learning-based $T_1ρ$ mapping and train the model in a probabilistic way to model the uncertainty. We also propose to utilize the uncertainty map to spatially weight the training of an improved $T_1ρ$ mapping network to further improve the mapping performance and to remove pixels with unreliable $T_1ρ$ values in the region of interest. The framework was tested on a dataset of 51 patients with different liver fibrosis stages. Main results: Our results indicate that the learning-based map refinement method leads to a relative mapping error of less than 3% and provides uncertainty estimation simultaneously. The estimated uncertainty reflects the actual error level, and it can be used to further reduce relative $T_1ρ$ mapping error to 2.60% as well as removing unreliable pixels in the region of interest effectively. Significance: Our studies demonstrate the proposed approach has potential to provide a learning-based quantitative MRI system for trustworthy $T_1ρ$ mapping of the liver.

Submitted 9 October, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

arXiv:2307.01778 [pdf, other]

Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling

Authors: Zhanhao Hu, Wenda Chu, Xiaopei Zhu, Hui Zhang, Bo Zhang, Xiaolin Hu

Abstract: Recent works have proposed to craft adversarial clothes for evading person detectors, while they are either only effective at limited viewing angles or very conspicuous to humans. We aim to craft adversarial texture for clothes based on 3D modeling, an idea that has been used to craft rigid adversarial objects such as a 3D-printed turtle. Unlike rigid objects, humans and clothes are non-rigid, leading to difficulties in physical realization. In order to craft natural-looking adversarial clothes that can evade person detectors at multiple viewing angles, we propose adversarial camouflage textures (AdvCaT) that resemble one kind of the typical textures of daily clothes, camouflage textures. We leverage the Voronoi diagram and Gumbel-softmax trick to parameterize the camouflage textures and optimize the parameters via 3D modeling. Moreover, we propose an efficient augmentation pipeline on 3D meshes combining topologically plausible projection (TopoProj) and Thin Plate Spline (TPS) to narrow the gap between digital and real-world objects. We printed the developed 3D texture pieces on fabric materials and tailored them into T-shirts and trousers. Experiments show high attack success rates of these clothes against multiple detectors.

Submitted 4 July, 2023; originally announced July 2023.

Comments: Accepted by CVPR 2023

arXiv:2307.01490 [pdf]

Bistable scattering of nano-silicon for super-linear super-resolution imaging

Authors: Po-Hsueh Tseng, Kentaro Nishida, Pang-Han Wu, Yu-Lung Tang, Yu-Chieh Chen, Chi-Yin Yang, Jhen-Hong Yang, Wei-Ruei Chen, Olesiya Pashina, Mihail Petrov, Kuo-Ping Chen, Shi-Wei Chu

Abstract: Optical bistability is fundamental for all-optical switches, but typically requires high-Q cavities with micrometer sizes. Through boosting nonlinearity with photo-thermo-optical effects, we achieve bistability in a silicon Mie resonator with a volume size of $10^{-3}$ µm$^3$ and Q-factor < 10, both are record-low. Furthermore, bistable scattering naturally leads to large super-linear emission-excitation power dependence, which we applied to enhance optical resolution by more than 3 times. Our work paves the way toward nanoscale photonics computation and label-free semiconductor nano-inspection.

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2306.16602 [pdf, other]

An electro-hydrodynamics modeling of droplet actuation on solid surface by surfactant-mediated electro-dewetting

Authors: Weiqi Chu, Hangjie Ji, Qining Wang, Chang-Jin "CJ" Kim, Andrea L. Bertozzi

Abstract: We propose an electro-hydrodynamics model to describe the dynamic evolution of a slender drop containing a dilute ionic surfactant on a naturally wettable surface, with a varying external electric field. This unified model reproduces fundamental microfluidic operations controlled by electrical signals, including dewetting, rewetting, and droplet shifting. In this paper, lubrication theory analysis and numerical simulations illustrate how to electrically control the wettability of surface via the charged surfactant. Our numerical results show that electric field promotes dewetting by attracting ionic surfactants onto the transition thin-film region and promotes rewetting by attracting them away from the region.

Submitted 28 June, 2023; originally announced June 2023.

Comments: 16 pages, 13 figures

arXiv:2306.14182 [pdf, other]

Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input

Authors: Qingpei Guo, Kaisheng Yao, Wei Chu

Abstract: The ability to model intra-modal and inter-modal interactions is fundamental in multimodal machine learning. The current state-of-the-art models usually adopt deep learning models with fixed structures. They can achieve exceptional performances on specific tasks, but face a particularly challenging problem of modality mismatch because of diversity of input modalities and their fixed structures. In this paper, we present Switch-BERT for joint vision and language representation learning to address this problem. Switch-BERT extends BERT architecture by introducing learnable layer-wise and cross-layer interactions. It learns to optimize attention from a set of attention modes representing these interactions. One specific property of the model is that it learns to attend outputs from various depths, therefore mitigates the modality mismatch problem. We present extensive experiments on visual question answering, image-text retrieval and referring expression comprehension experiments. Results confirm that, whereas alternative architectures including ViLBERT and UNITER may excel in particular tasks, Switch-BERT can consistently achieve better or comparable performances than the current state-of-the-art models in these tasks. Ablation studies indicate that the proposed model achieves superior performances due to its ability in learning task-specific multimodal interactions.

Submitted 25 June, 2023; originally announced June 2023.

Comments: Accepted by ECCV2022

arXiv:2305.04732 [pdf]

doi 10.1103/PhysRevB.108.014312

Photo-accelerated hot carrier transfer at MoS2/WS2: a first-principles study

Authors: Zhi-Guo Tao, Guo-Jun Zhu, Weibin Chu, Xin-Gao Gong, Ji-Hui Yang

Abstract: Charge transfer in type-II heterostructures plays important roles in determining device performance for photovoltaic and photocatalytic applications. However, current theoretical studies of charge transfer process don't consider the effects of operating conditions such as illuminations and yield systemically larger interlayer transfer time of hot electrons in MoS2/WS2 compared to experimental results. Here in this work, we propose a general picture that, illumination can induce interfacial dipoles in type-II heterostructures, which can accelerate hot carrier transfer by reducing the energy difference between the electronic states in separate materials and enhancing the nonadiabatic couplings. Using the first-principles calculations and the ab-initio nonadiabatic molecular dynamics, we demonstrate this picture using MoS2/WS2 as a prototype. The calculated characteristic time for the interlayer transfer (60 fs) and the overall relaxation (700 fs) processes of hot electrons is in good agreement with the experiments. We further find that illumination mainly affects the ultrafast interlayer transfer process but has little effects on the relatively slow intralayer relaxation process. Therefore, the overall relaxation process of hot electrons has a saturated time with increased illumination strengths. The illumination-accelerated charge transfer is expected to universally exist in type-II heterostructures.

Submitted 8 May, 2023; originally announced May 2023.

arXiv:2305.02610 [pdf, other]

Boundary-aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval

Authors: Tan Pan, Furong Xu, Xudong Yang, Sifeng He, Chen Jiang, Qingpei Guo, Feng Qian, Xiaobo Zhang, Yuan Cheng, Lei Yang, Wei Chu

Abstract: Image retrieval plays an important role in the Internet world. Usually, the core parts of mainstream visual retrieval systems include an online service of the embedding model and a large-scale vector database. For traditional model upgrades, the old model will not be replaced by the new one until the embeddings of all the images in the database are re-computed by the new model, which takes days or weeks for a large amount of data. Recently, backward-compatible training (BCT) enables the new model to be immediately deployed online by making the new embeddings directly comparable to the old ones. For BCT, improving the compatibility of two models with less negative impact on retrieval performance is the key challenge. In this paper, we introduce AdvBCT, an Adversarial Backward-Compatible Training method with an elastic boundary constraint that takes both compatibility and discrimination into consideration. We first employ adversarial learning to minimize the distribution disparity between embeddings of the new model and the old model. Meanwhile, we add an elastic boundary constraint during training to improve compatibility and discrimination efficiently. Extensive experiments on GLDv2, Revisited Oxford (ROxford), and Revisited Paris (RParis) demonstrate that our method outperforms other BCT methods on both compatibility and discrimination. The implementation of AdvBCT will be publicly available at https://github.com/Ashespt/AdvBCT.

Submitted 4 May, 2023; originally announced May 2023.

Comments: accepted by CVPR 2023

arXiv:2305.02572 [pdf, other]

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

Authors: Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, Yong Liu

Abstract: Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. A followed style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis.

Submitted 30 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

arXiv:2304.07611 [pdf, other]

doi 10.1109/TASLP.2023.3263789

A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Authors: Ruchao Fan, Wei Chu, Peng Chang, Abeer Alwan

Abstract: Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, variants of AED, adopt an autoregressive mechanism for token generation and thus are relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, word embeddings in the autoregressive transformer (AT) are substituted with token-level acoustic embeddings (TAE) that are extracted from encoder outputs with the acoustical boundary information offered by the CTC alignment. TAE can be obtained in parallel, resulting in a parallel generation of output tokens. During training, Viterbi-alignment is used for TAE generation, and multiple training strategies are further explored to improve the word error rate (WER) performance. During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch in the training and testing processes. Experimental results show that the CASS-NAT has a WER that is close to AT on various ASR tasks, while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs have similar functionality to word embeddings for grammatical structures, which might indicate the possibility of learning some semantic information from TAEs without a language model.

Submitted 15 April, 2023; originally announced April 2023.

Comments: Published in IEEE Transactions on Audio, Speech, and Language Processing

arXiv:2304.06662 [pdf, other]

Deep Learning in Breast Cancer Imaging: A Decade of Progress and Future Directions

Authors: Luyang Luo, Xi Wang, Yi Lin, Xiaoqi Ma, Andong Tan, Ronald Chan, Varut Vardhanabhuti, Winnie CW Chu, Kwang-Ting Cheng, Hao Chen

Abstract: Breast cancer has reached the highest incidence rate worldwide among all malignancies since 2020. Breast imaging plays a significant role in early diagnosis and intervention to improve the outcome of breast cancer patients. In the past decade, deep learning has shown remarkable progress in breast cancer imaging analysis, holding great promise in interpreting the rich information and complex context of breast imaging modalities. Considering the rapid improvement in deep learning technology and the increasing severity of breast cancer, it is critical to summarize past progress and identify future challenges to be addressed. This paper provides an extensive review of deep learning-based breast cancer imaging research, covering studies on mammogram, ultrasound, magnetic resonance imaging, and digital pathology images over the past decade. The major deep learning methods and applications on imaging-based screening, diagnosis, treatment response prediction, and prognosis are elaborated and discussed. Drawn from the findings of this survey, we present a comprehensive discussion of the challenges and potential avenues for future research in deep learning-based breast cancer imaging.

Submitted 20 January, 2024; v1 submitted 13 April, 2023; originally announced April 2023.

Comments: IEEE RBME 2024

arXiv:2304.03592 [pdf]

Role of electrodes in study of hydrovoltaic effects

Authors: Chunxiao Zheng, Sunmiao Fang, Weicun Chu, ** Tan, Bingkun Tian, Xiaofeng Jiang, Wanlin Guo

Abstract: The last decade has witnessed the emergence of hydrovoltaic technology, which can harvest electricity from different forms of water movement, such as raindrops, waves, flows, moisture, and natural evaporation. In particular, the evaporation-induced hydrovoltaic effect received great attention since its discovery in 2017 due to its negative heat emission property. Nevertheless, the influence of electrode reactions in evaporation-induced power generation is not negligible due to the chemical reaction between active metal electrodes and water, which leads to "exceptional" power generation. Herein, we designed a series of experiments based on air-laid paper devices with electrodes of different activities as the top and bottom electrodes. To verify the contribution of electrodes, we compared the output performance of different electrode combinations when the device is partially-wetted and fully-wetted. The device hydrophilicity, salt concentration, and acidity or basicity of solutions are also comprehensively investigated. It is demonstrated that the chemical reaction of active metals (Zn, Cu, Ag, etc.) with different aqueous solutions can generate considerable electrical energy and significantly distort the device performance, especially for Zn electrodes with an output voltage from ~1.26 to ~1.52 V and current from ~1.24 to ~75.69 μA. To promote the long-term development of hydrovoltaic technology, we recommend use of inert electrodes in hydrovoltaic studies, such as Au and Pt, especially in water and moisture environment.

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2303.18167 [pdf, other]

Accounting for Vibration Noise in Stochastic Measurement Errors

Authors: Lionel Voirol, Davide A. Cucci, Mucyo Karemera, Wenfei Chu, Roberto Molinari, Stéphane Guerrier

Abstract: The measurement of data over time and/or space is of utmost importance in a wide range of domains from engineering to physics. Devices that perform these measurements therefore need to be extremely precise to obtain correct system diagnostics and accurate predictions, consequently requiring a rigorous calibration procedure which models their errors before being employed. While the deterministic components of these errors do not represent a major modelling challenge, most of the research over the past years has focused on delivering methods that can explain and estimate the complex stochastic components of these errors. This effort has allowed to greatly improve the precision and uncertainty quantification of measurement devices but has thus far not accounted for a significant stochastic noise that arises for many of these devices: vibration noise. Indeed, having filtered out physical explanations for this noise, a residual stochastic component often carries over which can drastically affect measurement precision. This component can originate from different sources, including the internal mechanics of the measurement devices as well as the movement of these devices when placed on moving objects or vehicles. To remove this disturbance from signals, this work puts forward a modelling framework for this specific type of noise and adapts the Generalized Method of Wavelet Moments to estimate these models. We deliver the asymptotic properties of this method when applied to processes that include vibration noise and show the considerable practical advantages of this approach in simulation and applied case studies.

Submitted 31 March, 2023; originally announced March 2023.

Comments: 30 pages, 9 figures

arXiv:2303.13662 [pdf, other]

Rethinking Domain Generalization for Face Anti-spoofing: Separability and Alignment

Authors: Yiyou Sun, Yaojie Liu, Xiaoming Liu, Yixuan Li, Wen-Sheng Chu

Abstract: This work studies the generalization issue of face anti-spoofing (FAS) models on domain gaps, such as image resolution, blurriness and sensor variations. Most prior works regard domain-specific signals as a negative impact, and apply metric learning or adversarial losses to remove them from feature representation. Though learning a domain-invariant feature space is viable for the training data, we show that the feature shift still exists in an unseen test domain, which backfires on the generalizability of the classifier. In this work, instead of constructing a domain-invariant feature space, we encourage domain separability while aligning the live-to-spoof transition (i.e., the trajectory from live to spoof) to be the same for all domains. We formulate this FAS strategy of separability and alignment (SA-FAS) as a problem of invariant risk minimization (IRM), and learn domain-variant feature representation but domain-invariant classifier. We demonstrate the effectiveness of SA-FAS on challenging cross-domain FAS datasets and establish state-of-the-art performance.

Submitted 23 March, 2023; originally announced March 2023.

Comments: Accepted in CVPR2023

arXiv:2303.07623 [pdf, other]

Uncertainty-weighted Multi-tasking for $T_{1ρ}$ and $T_2$ Mapping in the Liver with Self-supervised Learning

Authors: Chaoxing Huang, Yurui Qian, Jian Hou, Baiyan Jiang, Queenie Chan, Vincent WS Wong, Winnie CW Chu, Weitian Chen

Abstract: Multi-parametric mapping of MRI relaxations in liver has the potential of revealing pathological information of the liver. A self-supervised learning based multi-parametric mapping method is proposed to map $T_{1ρ}$ and $T_2$ simultaneously, by utilising the relaxation constraint in the learning process. Data noise of different mapping tasks is utilised to make the model uncertainty-aware, which adaptively weights different mapping tasks during learning. The method was examined on a dataset of 51 patients with non-alcoholic fatty liver disease. Results showed that the proposed method can produce comparable parametric maps to the traditional multi-contrast pixel wise fitting method, with a reduced number of images and less computation time. The uncertainty weighting also improves the model performance. It has the potential of accelerating MRI quantitative imaging.

Submitted 14 March, 2023; originally announced March 2023.

arXiv:2303.01023 [pdf, other]

Adiabatic quantum learning

Authors: Nannan Ma, Wenhao Chu, Jiangbin Gong

Abstract: Adiabatic quantum control protocols have been of wide interest to quantum computation due to their robustness and insensitivity to their actual duration of execution. As an extension of previous quantum learning algorithms, this work proposes to execute some quantum learning protocols based entirely on adiabatic quantum evolution, hence dubbed as "adiabatic quantum learning". In a conventional quantum machine learning protocol, the output is usually the expectation value of a pre-selected observable and the projective measurement of which forces a quantum circuit to run many times to obtain the output with a reasonable precision. By contrast, the proposed adiabatic quantum learning here may be integrated with future adiabatic weak measurement protocols, where a single measurement of the system allows to extract the expectation value of observables of interest without disrupting the concerned quantum states. Our main idea is illustrated with simple examples.

Submitted 2 March, 2023; originally announced March 2023.

Comments: 9 pages, 3 figures

arXiv:2302.14335 [pdf, other]

DC-Former: Diverse and Compact Transformer for Person Re-Identification

Authors: Wen Li, Cheng Zou, Meng Wang, Furong Xu, Jianan Zhao, Ruobing Zheng, Yuan Cheng, Wei Chu

Abstract: In person re-identification (re-ID) task, it is still challenging to learn discriminative representation by deep learning, due to limited data. Generally speaking, the model will get better performance when increasing the amount of data. The addition of similar classes strengthens the ability of the classifier to identify similar identities, thereby improving the discrimination of representation. In this paper, we propose a Diverse and Compact Transformer (DC-Former) that can achieve a similar effect by splitting embedding space into multiple diverse and compact subspaces. Compact embedding subspace helps model learn more robust and discriminative embedding to identify similar classes. And the fusion of these diverse embeddings containing more fine-grained information can further improve the effect of re-ID. Specifically, multiple class tokens are used in vision transformer to represent multiple embedding spaces. Then, a self-diverse constraint (SDC) is applied to these spaces to push them away from each other, which makes each embedding space diverse and compact. Further, a dynamic weight controller (DWC) is designed for balancing the relative importance among them during training. The experimental results of our method are promising, which surpass previous state-of-the-art methods on several commonly used person re-ID benchmarks.

Submitted 28 February, 2023; originally announced February 2023.

Comments: Accepted by AAAI23

arXiv:2302.13300 [pdf]

Low-temperature thermal Hall conductivity of Pr2Zr2O7 single crystal

Authors: Wenjun Chu, Xuefeng Sun

Abstract: To probe the peculiar excitations spinons and magnetic monopoles in the quantum spin ice candidate Pr2Zr2O7, we studied the low-temperature thermal Hall conductivity (κxy) and thermal conductivity (κxx) of Pr2Zr2O7 single crystal with magnetic fields applied along the [111] axis. The magnetic field dependencies of κxx suggest the roles of magnetic excitations in thermal conductivity, that is, the emergent magnetic monopoles can transport heat at T > 1.4 K and spinons mainly scatter phonons at lower temperatures. The finite κxy was observed at low fields of several Tesla and was discussed to be related to the magnetic excitations, including magnetic monopoles as well as spinons.

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2302.06637 [pdf, other]

PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees

Authors: Chulin Xie, De-An Huang, Wenda Chu, Daguang Xu, Chaowei Xiao, Bo Li, Anima Anandkumar

Abstract: Personalized Federated Learning (pFL) has emerged as a promising solution to tackle data heterogeneity across clients in FL. However, existing pFL methods either (1) introduce high communication and computation costs or (2) overfit to local data, which can be limited in scope, and are vulnerable to evolved test samples with natural shifts. In this paper, we propose PerAda, a parameter-efficient pFL framework that reduces communication and computational costs and exhibits superior generalization performance, especially under test-time distribution shifts. PerAda reduces the costs by leveraging the power of pretrained models and only updates and communicates a small number of additional parameters from adapters. PerAda has good generalization since it regularizes each client's personalized adapter with a global adapter, while the global adapter uses knowledge distillation to aggregate generalized information from all clients. Theoretically, we provide generalization bounds to explain why PerAda improves generalization, and we prove its convergence to stationary points under non-convex settings. Empirically, PerAda demonstrates competitive personalized performance (+4.85% on CheXpert) and enables better out-of-distribution generalization (+5.23% on CIFAR-10-C) on different datasets across natural and medical domains compared with baselines, while only updating 12.6% of parameters per model based on the adapter. Our code is available at https://github.com/NVlabs/PerAda.

Submitted 6 April, 2024; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: CVPR 2024

arXiv:2302.05083 [pdf, other]

DRGCN: Dynamic Evolving Initial Residual for Deep Graph Convolutional Networks

Authors: Lei Zhang, Xiaodong Yan, Jianshan He, Ruopeng Li, Wei Chu

Abstract: Graph convolutional networks (GCNs) have been proved to be very practical to handle various graph-related tasks. It has attracted considerable research interest to study deep GCNs, due to their potential superior performance compared with shallow ones. However, simply increasing network depth will, on the contrary, hurt the performance due to the over-smoothing problem. Adding residual connection is proved to be effective for learning deep convolutional neural networks (deep CNNs), it is not trivial when applied to deep GCNs. Recent works proposed an initial residual mechanism that did alleviate the over-smoothing problem in deep GCNs. However, according to our study, their algorithms are quite sensitive to different datasets. In their setting, the personalization (dynamic) and correlation (evolving) of how residual applies are ignored. To this end, we propose a novel model called Dynamic evolving initial Residual Graph Convolutional Network (DRGCN). Firstly, we use a dynamic block for each node to adaptively fetch information from the initial representation. Secondly, we use an evolving block to model the residual evolving pattern between layers. Our experimental results show that our model effectively relieves the problem of over-smoothing in deep GCNs and outperforms the state-of-the-art (SOTA) methods on various benchmark datasets. Moreover, we develop a mini-batch version of DRGCN which can be applied to large-scale data. Coupling with several fair training techniques, our model reaches new SOTA results on the large-scale ogbn-arxiv dataset of Open Graph Benchmark (OGB). Our reproducible code is available on GitHub.

Submitted 10 February, 2023; originally announced February 2023.

Comments: 8 pages, Accept by Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023)

arXiv:2302.00848 [pdf, other]

Causal Effect Estimation: Recent Advances, Challenges, and Opportunities

Authors: Zhixuan Chu, Jianmin Huang, Ruopeng Li, Wei Chu, Sheng Li

Abstract: Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not well handle large-scale and high-dimensional heterogeneous data. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighing) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, it also comes with a lot of new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.

Submitted 1 February, 2023; originally announced February 2023.

arXiv:2302.00439 [pdf]

Accelerating the calculation of electron-phonon coupling by machine learning methods

Authors: Yang Zhong, Zhiguo Tao, Weibin Chu, Xingao Gong, Hongjun Xiang

Abstract: Electron-phonon coupling (EPC) plays an important role in many fundamental physical phenomena, but the high computational cost of the EPC matrix hinders the theoretical research on them. In this paper, an analytical formula is derived to calculate the EPC matrix in terms of the Hamiltonian and its gradient in the nonorthogonal atomic orbital bases. The recently-developed E(3) equivariant neural network is used to directly predict the Hamiltonian and its gradient needed by the formula, thus bypassing the expensive self-consistent iterations in DFT. The correctness of the proposed EPC calculation formula and the accuracy of the predicted EPC values of the network are illustrated by the tests on a water molecule and a MoS2 crystal.

Submitted 1 February, 2023; originally announced February 2023.

Comments: 11 pages, 2 figures, 2 tables

arXiv:2212.14489 [pdf, other]

Inference of interaction kernels in mean-field models of opinion dynamics

Authors: Weiqi Chu, Qin Li, Mason A. Porter

Abstract: In models of opinion dynamics, many parameters -- either in the form of constants or in the form of functions -- play a critical role in describing, calibrating, and forecasting how opinions change with time. When examining a model of opinion dynamics, it is beneficial to infer its parameters using empirical data. In this paper, we study an example of such an inference problem. We consider a mean-field bounded-confidence model with an unknown interaction kernel between individuals. This interaction kernel encodes how individuals with different opinions interact and affect each other's opinions. Because it is often difficult to quantitatively measure opinions as empirical data from observations or experiments, we assume that the available data takes the form of partial observations of a cumulative distribution function of opinions. We prove that certain measurements guarantee a precise and unique inference of the interaction kernel and propose a numerical method to reconstruct an interaction kernel from a limited number of data points. Our numerical results suggest that the error of the inferred interaction kernel decays exponentially as we strategically enlarge the data set.

Submitted 26 October, 2023; v1 submitted 29 December, 2022; originally announced December 2022.

Comments: 20 pages, 3 figures

MSC Class: 91D30; 35R30; 45Q05; 65K10

Showing 1–50 of 391 results for author: Chu, W