Search | arXiv e-print repository

AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation

Authors: **peng Lin, Min Zhou, Ye Ma, Yifan Gao, Chenxi Fei, Yangjian Chen, Zhang Yu, Tiezheng Ge

Abstract: Advertising posters, a form of information presentation, combine visual and linguistic modalities. Creating a poster involves multiple steps and necessitates design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. With only product images and titles as inputs, AutoPoster can automatically produce posters of varying sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony of posters, two content-aware models are incorporated for layout and tagline generation. Moreover, we propose a novel multi-task Style Attribute Predictor (SAP) to jointly predict visual style attributes. Meanwhile, to our knowledge, we propose the first poster generation dataset that includes visual attribute annotations for over 76k posters. Qualitative and quantitative outcomes from user studies and experiments substantiate the efficacy of our system and the aesthetic superiority of the generated posters compared to other poster generation methods.

Submitted 23 August, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

Comments: Accepted for ACM MM 2023

arXiv:2308.00499 [pdf, ps, other]

Stochastic Geometry Based Modeling and Analysis on Network NOMA in Downlink CoMP Systems

Authors: Yanshi Sun, Zhiguo Ding, Xuchu Dai, Momiao Zhou, Zhizhong Ding

Abstract: This paper investigates the performance of network non-orthogonal multiple access (N-NOMA) in a downlink coordinated multi-point (CoMP) system. In the considered N-NOMA scheme, multiple base stations (BSs) cooperatively serve a CoMP user, while each BS serves additional NOMA users by occupying the same resource block allocated to the CoMP user. The locations of the BSs and users are modeled by stochastic geometric models and the interference from the whole network is considered. Through rigorous derivations, the outage probabilities achieved by the CoMP and NOMA users are obtained, respectively. Numerical results are provided to verify the accuracy of the analytical results and also demonstrate the superior performance of N-NOMA compared to the orthogonal multiple access (OMA) based CoMP scheme.

Submitted 1 August, 2023; originally announced August 2023.

arXiv:2307.14901 [pdf, other]

Text-guided Foundation Model Adaptation for Pathological Image Classification

Authors: Yunkun Zhang, ** Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, Dequan Wang

Abstract: The recent surge of foundation models in computer vision and natural language processing opens up perspectives in utilizing multi-modal clinical data to train large models with strong generalizability. Yet pathological image datasets often lack biomedical text annotation and enrichment. Guiding data-efficient image diagnosis from the use of biomedical text knowledge becomes a substantial interest. In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification. CITE injects text insights gained from language models pre-trained with a broad range of biomedical texts, thereby adapting foundation models towards pathological image understanding. Through extensive experiments on the PatchGastric stomach tumor pathological image dataset, we demonstrate that CITE achieves leading performance compared with various baselines, especially when training data is scarce. CITE offers insights into leveraging in-domain text knowledge to reinforce data-efficient pathological image classification. Code is available at https://github.com/Yunkun-Zhang/CITE.

Submitted 27 July, 2023; originally announced July 2023.

Comments: Accepted to MICCAI2023

arXiv:2307.14723 [pdf, other]

doi 10.1109/TGRS.2024.3365677

EFLNet: Enhancing Feature Learning for Infrared Small Target Detection

Authors: Bo Yang, Xinyu Zhang, Jian Zhang, Jun Luo, Mingliang Zhou, Yangjun Pi

Abstract: Single-frame infrared small target detection is considered a challenging task because of the extreme imbalance between target and background, the extreme sensitivity of bounding box regression to infrared small targets, and the ease with which target information is lost in the high-level semantic layer. In this article, we propose an enhancing feature learning network (EFLNet) to address these problems. First, we notice that there is an extreme imbalance between the target and the background in the infrared image, which makes the model pay more attention to the background features rather than target features. To address this problem, we propose a new adaptive threshold focal loss (ATFL) function that decouples the target and the background, and utilizes the adaptive mechanism to adjust the loss weight to force the model to allocate more attention to target features. Second, we introduce the normalized Gaussian Wasserstein distance (NWD) to alleviate the difficulty of convergence caused by the extreme sensitivity of the bounding box regression to infrared small targets. Finally, we incorporate a dynamic head mechanism into the network to enable adaptive learning of the relative importance of each semantic layer. Experimental results demonstrate that our method achieves better detection performance on infrared small targets than the state-of-the-art (SOTA) deep-learning-based methods. The source codes and bounding box annotated datasets are available at https://github.com/YangBo0411/infrared-small-target.

Submitted 27 February, 2024; v1 submitted 27 July, 2023; originally announced July 2023.

Journal ref: IEEE Transactions on Geoscience and Remote Sensing 19 February 2024

arXiv:2307.13693 [pdf, other]

Evaluating Large Language Models for Radiology Natural Language Processing

Authors: Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, Yi Pan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, Yaonai Wei, Zihao Wu, Chong Ma, Jiaqi Wang, Sheng Wang, Mengyue Zhou, Zuowei Jiang, Chunlin Li, Jason Holmes, Shaochen Xu, Lu Zhang, Haixing Dai, Kai Zhang, Lin Zhao, Yuanhao Chen, et al. (20 additional authors not shown)

Abstract: The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty-two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

Submitted 27 July, 2023; v1 submitted 25 July, 2023; originally announced July 2023.

arXiv:2307.12755 [pdf, other]

doi 10.22323/1.447.0016

Testing General Relativity with Black Hole X-Ray Data and ABHModels

Authors: Cosimo Bambi, Askar B. Abdikamalov, Honghui Liu, Shafqat Riaz, Swarnim Shashank, Menglei Zhou

Abstract: The past 10 years have seen tremendous progress in our capability of testing General Relativity in the strong field regime with black hole observations. 10 years ago, the theory of General Relativity was almost completely unexplored in the strong field regime. Today, we have gravitational wave data of the coalescence of stellar-mass black holes, radio images of the supermassive black holes SgrA$^*$ and M87$^*$, and high-quality X-ray data of stellar-mass black holes in X-ray binaries and supermassive black holes in active galactic nuclei. In this manuscript, we will review current efforts to test General Relativity with black hole X-ray data and we will provide a detailed description of the public codes available on ABHModels.

Submitted 23 April, 2024; v1 submitted 24 July, 2023; originally announced July 2023.

Comments: 31 pages, 5 figures. Talk given at the Frascati Workshop 2023 "Multifrequency Behaviour of High Energy Cosmic Sources - XIV" (Palermo, Italy, 12-17 June 2023). v2: refereed version

Journal ref: PoS MULTIF2023, 016 (2024)

arXiv:2307.11952 [pdf, other]

Pathology-and-genomics Multimodal Transformer for Survival Outcome Prediction

Authors: Kexin Ding, Mu Zhou, Dimitris N. Metaxas, Shaoting Zhang

Abstract: Survival outcome assessment is challenging and inherently associated with multiple clinical factors (e.g., imaging and genomics biomarkers) in cancer. Enabling multimodal analytics promises to reveal novel predictive patterns of patient outcomes. In this study, we propose a multimodal transformer (PathOmics) integrating pathology and genomics insights into colon-related cancer survival prediction. We emphasize the unsupervised pretraining to capture the intrinsic interaction between tissue microenvironments in gigapixel whole slide images (WSIs) and a wide range of genomics data (e.g., mRNA-sequence, copy number variant, and methylation). After the multimodal knowledge aggregation in pretraining, our task-specific model finetuning could expand the scope of data utility applicable to both multi- and single-modal data (e.g., image- or genomics-only). We evaluate our approach on both TCGA colon and rectum cancer cohorts, showing that the proposed approach is competitive and outperforms state-of-the-art studies. Finally, our approach is desirable to utilize the limited number of finetuned samples towards data-efficient analytics for survival outcome prediction. The code is available at https://github.com/Cassie07/PathOmics.

Submitted 21 July, 2023; originally announced July 2023.

Comments: Accepted to MICCAI2023 (Top14%)

arXiv:2307.10636 [pdf, other]

doi 10.1145/3581783.3612831

Learning and Evaluating Human Preferences for Conversational Head Generation

Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

Abstract: A reliable and comprehensive evaluation metric that aligns with manual preference assessments is crucial for conversational head video synthesis methods development. Existing quantitative evaluations often fail to capture the full complexity of human preference, as they only consider limited evaluation dimensions. Qualitative evaluations and user studies offer a solution but are time-consuming and labor-intensive. This limitation hinders the advancement of conversational head generation algorithms and systems. In this paper, we propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preference according to the quantitative evaluations across different dimensions. PS can serve as a quantitative evaluation without the need for human annotation. Experimental results validate the superiority of Preference Score in aligning with human perception, and also demonstrate robustness and generalizability to unseen data, making it a valuable tool for advancing conversational head generation. We expect this metric could facilitate new advances in conversational head generation. Project Page: https://github.com/dc3ea9f/PreferenceScore.

Submitted 2 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted by ACM Multimedia 2023

arXiv:2307.09066 [pdf, other]

PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification

Authors: Miaoge Li, Dongsheng Wang, Xinyang Liu, Zequn Zeng, Ruiying Lu, Bo Chen, Mingyuan Zhou

Abstract: Multi-label image classification is a prediction task that aims to identify more than one label from a given image. This paper considers the semantic consistency of the latent space between the visual patch and linguistic label domains and introduces the conditional transport (CT) theory to bridge the acknowledged gap. While recent cross-modal attention-based studies have attempted to align these two representations and achieved impressive performance, they required carefully designed alignment modules and extra complex operations in the attention computation. We find that by formulating the multi-label classification as a CT problem, we can exploit the interactions between the image and label efficiently by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into the modality-specific encoders, we view each image as a mixture of patch embeddings and a mixture of label embeddings, which capture the local region features and the class prototypes, respectively. CT is then employed to learn and align those two semantic sets by defining the forward and backward navigators. Importantly, the defined navigators in CT distance model the similarities between patches and labels, which provides an interpretable tool to visualize the learned prototypes. Extensive experiments on three public image benchmarks show that the proposed model consistently outperforms the previous methods.

Submitted 18 August, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: accepted by ICCV23

arXiv:2307.08973 [pdf, other]

doi 10.1051/0004-6361/202346511

Spin measurement of 4U 1543-47 with Insight-HXMT and NICER from its 2021 outburst: A test of accretion disk models at high luminosities

Authors: E. S. Yorgancioglu, Q. C. Bu, A. Santangelo, L. Tao, S. W. Davis, A. Vahdat, L. D. Kong, S. Piraino, M. Zhou, S. N. Zhang

Abstract: 4U 1543--47 is one of a handful of known black hole candidates located in the Milky Way Galaxy, and has undergone a very bright outburst in 2021, reaching a total of $\sim$9 Crab, as observed by the Monitor of All-sky Image (MAXI), and exceeding twice its Eddington luminosity. The unprecedented bright outburst of 4U 1543--47 provides a unique opportunity to test the behavior of accretion disk models at high luminosities and accretion rates. In addition, we explore the possibility of constraining the spin of the source at high accretion rates, given that previous spin measurements of 4U 1543--47 have been largely inconsistent with each other. We measure the spectral evolution of the source throughout its outburst as observed by Insight-HXMT, and compare the behavior of both the thin disk model kerrbb2, as well as the slim disk model slimbh up to the Eddington limit for two different values of disk $α$-viscosity. In addition, given the behavior of these two models, we identify two `golden' epochs for which it is most suitable to measure the spin with continuum fitting.

Submitted 21 July, 2023; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: 10 pages, 6 figures

Journal ref: A&A 677, A79 (2023)

arXiv:2307.05633 [pdf, other]

Transaction Fraud Detection via an Adaptive Graph Neural Network

Authors: Yue Tian, Guanjun Liu, Jiacun Wang, Mengchu Zhou

Abstract: Many machine learning methods have been proposed to achieve accurate transaction fraud detection, which is essential to the financial security of individuals and banks. However, most existing methods leverage original features only or require manual feature engineering. They lack the ability to learn discriminative representations from transaction data. Moreover, criminals often commit fraud by imitating cardholders' behaviors, which causes the poor performance of existing detection models. In this paper, we propose an Adaptive Sampling and Aggregation-based Graph Neural Network (ASA-GNN) that learns discriminative representations to improve the performance of transaction fraud detection. A neighbor sampling strategy is performed to filter noisy nodes and supplement information for fraudulent nodes. Specifically, we leverage cosine similarity and edge weights to adaptively select neighbors with similar behavior patterns for target nodes and then find multi-hop neighbors for fraudulent nodes. A neighbor diversity metric is designed by calculating the entropy among neighbors to tackle the camouflage issue of fraudsters and explicitly alleviate the over-smoothing phenomena. Extensive experiments on three real financial datasets demonstrate that the proposed method ASA-GNN outperforms state-of-the-art ones.

Submitted 11 July, 2023; originally announced July 2023.

arXiv:2307.04858 [pdf, other]

AmadeusGPT: a natural language interface for interactive animal behavioral analysis

Authors: Shaokai Ye, Jessy Lauer, Mu Zhou, Alexander Mathis, Mackenzie W. Mathis

Abstract: The process of quantifying and analyzing animal behavior involves translating the naturally occurring descriptive language of their actions into machine-readable code. Yet, codifying behavior analysis is often challenging without deep understanding of animal behavior and technical machine learning knowledge. To limit this gap, we introduce AmadeusGPT: a natural language interface that turns natural language descriptions of behaviors into machine-executable code. Large-language models (LLMs) such as GPT3.5 and GPT4 allow for interactive language-based queries that are potentially well suited for making interactive behavior analysis. However, the comprehension capability of these LLMs is limited by the context window size, which prevents them from remembering distant conversations. To overcome the context window limitation, we implement a novel dual-memory mechanism to allow communication between short-term and long-term memory using symbols as context pointers for retrieval and saving. Concretely, users directly use language-based definitions of behavior and our augmented GPT develops code based on the core AmadeusGPT API, which contains machine learning, computer vision, spatio-temporal reasoning, and visualization modules. Users then can interactively refine results, and seamlessly add new behavioral modules as needed. We benchmark AmadeusGPT and show we can produce state-of-the-art performance on the MABE 2022 behavior challenge tasks. Note, an end-user would not need to write any code to achieve this. Thus, collectively AmadeusGPT presents a novel way to merge deep biological knowledge, large-language models, and core computer vision modules into a more naturally intelligent system. Code and demos can be found at: https://github.com/AdaptiveMotorControlLab/AmadeusGPT.

Submitted 10 July, 2023; originally announced July 2023.

Comments: demo available https://github.com/AdaptiveMotorControlLab/AmadeusGPT

Journal ref: Published in Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) 2023

arXiv:2307.03983 [pdf, ps, other]

Hybrid Successive Interference Cancellation and Power Adaptation: a Win-Win Strategy for Robust Uplink NOMA Transmission

Authors: Yanshi Sun, Wei Cao, Momiao Zhou, Zhiguo Ding

Abstract: The aim of this paper is to reveal the importance of hybrid successive interference cancellation (SIC) and power adaptation (PA) for improving transmission robustness of uplink non-orthogonal multiple access (NOMA). Particularly, a cognitive radio inspired uplink NOMA communication scenario is considered, where one primary user is allocated one dedicated resource block, while M secondary users compete with each other to be opportunistically served by using the same resource block of the primary user. Two novel schemes are proposed for the considered scenario, namely hybrid SIC with PA (HSIC-PA) scheme and fixed SIC with PA (FSIC-PA) scheme. Both schemes can ensure that the secondary users are served without degrading the transmission reliability of the primary user compared to conventional orthogonal multiple access (OMA) based schemes. Rigorous analytical results are presented to evaluate the performance of the proposed two schemes. It is shown that both schemes can avoid outage probability error floors without any constraints on users' target rates in the high SNR regime. Furthermore, it is shown that the diversity gain achieved by the HSIC-PA scheme is M, while that of the FSIC-PA scheme is only 1. Numerical results are provided to verify the developed analytical results and also demonstrate the superior performance achieved by the proposed schemes by comparing with the existing HSIC without PA (HSIC-NPA) scheme. The presented simulation results also show that the HSIC-PA scheme performs the best among the three schemes, which indicates the importance of the combination of HSIC and PA for improving transmission robustness.

Submitted 8 July, 2023; originally announced July 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2307.01517

arXiv:2307.03705 [pdf, other]

Intelligent Robotic Sonographer: Mutual Information-based Disentangled Reward Learning from Few Demonstrations

Authors: Zhongliang Jiang, Yuan Bi, Mingchuan Zhou, Ying Hu, Michael Burke, Nassir Navab

Abstract: Ultrasound (US) imaging is widely used for biometric measurement and diagnosis of internal organs due to the advantages of being real-time and radiation-free. However, due to inter-operator variations, resulting images highly depend on the experience of sonographers. This work proposes an intelligent robotic sonographer to autonomously "explore" target anatomies and navigate a US probe to a relevant 2D plane by learning from the expert. The underlying high-level physiological knowledge from experts is inferred by a neural reward function, using a ranked pairwise image comparisons approach in a self-supervised fashion. This process can be referred to as understanding the "language of sonography". Considering the generalization capability to overcome inter-patient variations, mutual information is estimated by a network to explicitly disentangle the task-related and domain features in latent space. The robotic localization is carried out in coarse-to-fine mode based on the predicted reward associated with B-mode images. To validate the effectiveness of the proposed reward inference network, representative experiments were performed on vascular phantoms ("line" target), two types of ex-vivo animal organ phantoms (chicken heart and lamb kidney; "point" target), and in-vivo human carotids, respectively. To further validate the performance of the autonomous acquisition framework, physical robotic acquisitions were performed on three phantoms (vascular, chicken heart, and lamb kidney). The results demonstrated that the proposed advanced framework can robustly work on a variety of seen and unseen phantoms as well as in-vivo human carotid data.

Submitted 29 November, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

arXiv:2307.02090 [pdf, other]

Interactive Conversational Head Generation

Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao

Abstract: We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation. The capability to automatically synthesize interlocutors which can participate in long and multi-turn conversations is vital and offers benefits for various applications, including digital humans, virtual agents, and social robots. Existing research primarily focuses on talking head generation (one-way interaction), which hinders the ability to create a digital human for conversation (two-way) interaction due to the absence of listening and interaction parts. In this work, we construct two datasets to address this issue, ``ViCo'' for independent talking and listening head generation tasks at the sentence level, and ``ViCo-X'', for synthesizing interlocutors in multi-turn conversational scenarios. Based on ViCo and ViCo-X, we define three novel tasks targeting the interaction modeling during the face-to-face conversation: 1) responsive listening head generation making listeners respond actively to the speaker with non-verbal signals, 2) expressive talking head generation guiding speakers to be aware of listeners' behaviors, and 3) conversational head generation to integrate the talking/listening ability in one interlocutor. Along with the datasets, we also propose corresponding baseline solutions to the three aforementioned tasks. Experimental results show that our baseline method could generate responsive and vivid agents that can collaborate with a real person to fulfil the whole conversation. Project page: https://vico.solutions/.

Submitted 5 July, 2023; originally announced July 2023.

Comments: arXiv admin note: text overlap with arXiv:2112.13548

arXiv:2307.01517 [pdf, ps, other]

New Designs of Robust Uplink NOMA in Cognitive Radio Inspired Communications

Authors: Yanshi Sun, Wei Cao, Momiao Zhou, Zhiguo Ding

Abstract: This paper considers a cognitive radio inspired uplink communication scenario, where one primary user is allocated one dedicated resource block, while $M$ secondary users compete with each other to opportunistically access the primary user's channel. Two new designs of NOMA schemes, namely hybrid successive interference cancellation with power adaptation (HSIC-PA) and fixed successive interference cancellation with power adaptation (FSIC-PA), are proposed. The significant advantages of the proposed schemes are twofold. First, the proposed two schemes can ensure that the secondary users are opportunistically served without degrading the transmission reliability of the primary user. Second, the transmission robustness of the served secondary users can be guaranteed. Specifically, the outage probability error floors can be avoided for the secondary users, which is proved by asymptotic analysis in the paper. Extensive simulation results are also provided to demonstrate the superior performance of the proposed schemes.

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2307.00479 [pdf, other]

Domain Transfer Through Image-to-Image Translation for Uncertainty-Aware Prostate Cancer Classification

Authors: Meng Zhou, Amoon Jamzad, Jason Izard, Alexandre Menard, Robert Siemens, Parvin Mousavi

Abstract: Prostate Cancer (PCa) is a prevalent disease among men, and multi-parametric MRIs offer a non-invasive method for its detection. While MRI-based deep learning solutions have shown promise in supporting PCa diagnosis, acquiring sufficient training data, particularly in local clinics, remains challenging. One potential solution is to take advantage of publicly available datasets to pre-train deep models and fine-tune them on the local data, but multi-source MRIs can pose challenges due to cross-domain distribution differences. These limitations hinder the adoption of explainable and reliable deep-learning solutions in local clinics for PCa diagnosis. In this work, we present a novel approach for unpaired image-to-image translation of prostate multi-parametric MRIs and an uncertainty-aware training approach for classifying clinically significant PCa, to be applied in data-constrained settings such as local and small clinics. Our approach involves a novel pipeline for translating unpaired 3.0T multi-parametric prostate MRIs to 1.5T, thereby augmenting the available training data. Additionally, we introduce an evidential deep learning approach to estimate model uncertainty and employ dataset filtering techniques during training. Furthermore, we propose a simple yet efficient Evidential Focal Loss, combining focal loss with evidential uncertainty, to train our model effectively. Our experiments demonstrate that the proposed method significantly improves the Area Under ROC Curve (AUC) by over 20% compared to the previous work.
Our code is available at https://github.com/med-i-lab/DT_UE_PCa

Submitted 3 June, 2024; v1 submitted 2 July, 2023; originally announced July 2023.

Comments: Preprint. In Submission

arXiv:2306.16307 [pdf, other]

Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement

Authors: Kai Gao, Runzhi He, Bing Xie, Minghui Zhou

Abstract: Deep learning (DL) package supply chains (SCs) are critical for DL frameworks to remain competitive. However, vital knowledge on the nature of DL package SCs is still lacking. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. The Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs have developed specializations in Infrastructure and Applications packages, respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest, with increasing dependency complexity. Most clusters are Arrow or Star, but Tree and Forest clusters account for most packages (TensorFlow SC: 70%, PyTorch SC: 90%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common disengagement reasons in the two SCs differ.
Our study provides rich implications for the maintenance and dependency management practices of PyPI DL SCs.

Submitted 20 December, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: Preprint of paper accepted by ACM Transactions on Software Engineering and Methodology (TOSEM)

arXiv:2306.14274 [pdf, other]

MEPNet: A Model-Driven Equivariant Proximal Network for Joint Sparse-View Reconstruction and Metal Artifact Reduction in CT Images

Authors: Hong Wang, Minghao Zhou, Dong Wei, Yuexiang Li, Yefeng Zheng

Abstract: Sparse-view computed tomography (CT) has been adopted as an important technique for speeding up data acquisition and decreasing radiation dose. However, due to the lack of sufficient projection data, the reconstructed CT images often present severe artifacts, which will be further amplified when patients carry metallic implants. For this joint sparse-view reconstruction and metal artifact reduction task, most existing methods face two main limitations: 1) They are mostly built on common network modules without fully embedding the physical imaging geometry constraint of this specific task into the dual-domain learning; 2) Some important prior knowledge is not deeply explored and sufficiently utilized. To address these issues, we construct a dual-domain reconstruction model and propose a model-driven equivariant proximal network, called MEPNet. The main characteristics of MEPNet are: 1) It is optimization-inspired and has a clear working mechanism; 2) The involved proximal operator is modeled via a rotation equivariant convolutional neural network, which finely represents the inherent rotational prior underlying CT scanning, namely that the same organ can be imaged at different angles. Extensive experiments conducted on several datasets substantiate that, compared with the conventional convolution-based proximal network, such a rotation equivariance mechanism enables our proposed method to achieve better reconstruction performance with fewer network parameters.
We will release the code at https://github.com/hongwang01/MEPNet.

Submitted 25 June, 2023; originally announced June 2023.

Comments: MICCAI 2023

arXiv:2306.12657 [pdf, other]

doi 10.18653/v1/2023.acl-long.4

Explainable Recommendation with Personalized Review Retrieval and Aspect Learning

Authors: Hao Cheng, Shuo Wang, Wensheng Lu, Wei Zhang, Mingyang Zhou, Kezhong Lu, Hao Liao

Abstract: Explainable recommendation is a technique that combines prediction and generation tasks to produce more persuasive results. Among these tasks, textual generation demands large amounts of data to achieve satisfactory accuracy. However, historical user reviews of items are often insufficient, making it challenging to ensure the precision of generated explanation text. To address this issue, we propose a novel model, ERRA (Explainable Recommendation by personalized Review retrieval and Aspect learning). With retrieval enhancement, ERRA can obtain additional information from the training sets. With this additional information, we can generate more accurate and informative explanations. Furthermore, to better capture users' preferences, we incorporate an aspect enhancement component into our model. By selecting the top-n aspects that users are most concerned about for different items, we can model user representation with more relevant details, making the explanation more persuasive. Extensive experiments on three datasets verify the effectiveness of our model and show that it outperforms state-of-the-art baselines (for example, 3.4% improvement in prediction and 15.8% improvement in explanation for TripAdvisor).

Submitted 21 June, 2023; originally announced June 2023.

Journal ref: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2023

arXiv:2306.12020 [pdf, other]

doi 10.1109/ICASSP49357.2023.10095084

Visual-Aware Text-to-Speech

Authors: Mohan Zhou, Yalong Bai, Wei Zhang, Ting Yao, Tiejun Zhao, Tao Mei

Abstract: Dynamically synthesizing talking speech that actively responds to a listening head is critical during face-to-face interaction. For example, the speaker could take advantage of the listener's facial expression to adjust the tones, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and sequential visual feedback (e.g., nod, smile) of the listener in face-to-face communication. Different from traditional text-to-speech, VA-TTS highlights the impact of the visual modality. On this newly-minted task, we devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X verify our proposal for generating more natural audio with scenario-appropriate rhythm and prosody.

Submitted 21 June, 2023; originally announced June 2023.

Comments: accepted as oral and top 3% paper by ICASSP 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, 1-5

arXiv:2306.10147 [pdf, other]

doi 10.1145/3571884.3604308

Democratizing Chatbot Debugging: A Computational Framework for Evaluating and Explaining Inappropriate Chatbot Responses

Authors: Xu Han, Michelle Zhou, Yichen Wang, Wenxi Chen, Tom Yeh

Abstract: Evaluating and understanding the inappropriateness of chatbot behaviors can be challenging, particularly for chatbot designers without technical backgrounds. To democratize the debugging process of chatbot misbehaviors for non-technical designers, we propose a framework that leverages dialogue act (DA) modeling to automate the evaluation and explanation of chatbot response inappropriateness. The framework first produces characterizations of context-aware DAs based on discourse analysis theory and real-world human-chatbot transcripts. It then automatically extracts features to identify the appropriateness level of a response and can explain the causes of the inappropriate response by examining the DA mismatch between the response and its conversational context. Using interview chatbots as a testbed, our framework achieves comparable classification accuracy with higher explainability and fewer computational resources than the deep learning baseline, making it the first step in utilizing DAs for chatbot response appropriateness evaluation and explanation.

Submitted 16 June, 2023; originally announced June 2023.

Comments: 7 pages, 4 figures, accepted to CUI 2023 poster track

arXiv:2306.09118 [pdf, other]

Hyperbolic Representation Learning: Revisiting and Advancing

Authors: Menglin Yang, Min Zhou, Rex Ying, Yankai Chen, Irwin King

Abstract: The non-Euclidean geometry of hyperbolic spaces has recently garnered considerable attention in the realm of representation learning. Current endeavors in hyperbolic representation largely presuppose that the underlying hierarchies can be automatically inferred and preserved through the adaptive optimization process. This assumption, however, is questionable and requires further validation. In this work, we first introduce a position-tracking mechanism to scrutinize existing prevalent hyperbolic learning models (HLMs), revealing that the learned representations are sub-optimal and unsatisfactory. To address this, we propose a simple yet effective method, hyperbolic informed embedding (HIE), by incorporating cost-free hierarchical information deduced from the hyperbolic distance of the node to origin (i.e., induced hyperbolic norm) to advance existing HLMs. The proposed method HIE is both task-agnostic and model-agnostic, enabling its seamless integration with a broad spectrum of models and tasks. Extensive experiments across various models and different tasks demonstrate the versatility and adaptability of the proposed method. Notably, our method achieves an improvement of up to 21.4% compared to the competing baselines.

Submitted 15 June, 2023; originally announced June 2023.

Comments: ICML 2023

arXiv:2306.07879 [pdf, other]

Rethinking pose estimation in crowds: overcoming the detection information-bottleneck and ambiguity

Authors: Mu Zhou, Lucas Stoffl, Mackenzie Weygandt Mathis, Alexander Mathis

Abstract: Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as a condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, an improvement of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves the performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD

Submitted 30 September, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: Published at ICCV 2023; Code at https://github.com/amathislab/BUCTD; Video at https://www.youtube.com/watch?v=BHZnA-CZeZY

Journal ref: ICCV Link: https://openaccess.thecvf.com/content/ICCV2023/papers/Zhou_Rethinking_Pose_Estimation_in_Crowds_Overcoming_the_Detection_Information_Bottleneck_ICCV_2023_paper.pdf

arXiv:2306.06112 [pdf, other]

doi 10.1145/3597926.3598113

ModelObfuscator: Obfuscating Model Information to Protect Deployed ML-based Systems

Authors: Mingyi Zhou, Xiang Gao, Jing Wu, John Grundy, Xiao Chen, Chunyang Chen, Li Li

Abstract: More and more edge devices and mobile apps are leveraging deep learning (DL) capabilities. Deploying such models on devices -- referred to as on-device models -- rather than as remote cloud-hosted services has gained popularity because it avoids transmitting user data off the device and achieves fast response times. However, on-device models can be easily attacked, as they can be accessed by unpacking the corresponding apps, leaving the model fully exposed to attackers. Recent studies show that attackers can easily generate white-box-like attacks for an on-device model or even invert its training data. To protect on-device models from white-box attacks, we propose a novel technique called model obfuscation. Specifically, model obfuscation hides and obfuscates the key information -- structure, parameters and attributes -- of models by renaming, parameter encapsulation, neural structure obfuscation, shortcut injection, and extra layer injection. We have developed a prototype tool, ModelObfuscator, to automatically obfuscate on-device TFLite models. Our experiments show that the proposed approach can dramatically improve model security by significantly increasing the difficulty of parsing models' inner information, without increasing the latency of DL models. Our proposed on-device model obfuscation has the potential to be a fundamental technique for on-device model deployment. Our prototype tool is publicly available at: https://github.com/zhoumingyi/ModelObfuscator.

Submitted 29 February, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: Published In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA23)

Journal ref: In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023), 2023, 1005-1017

arXiv:2306.03498 [pdf, other]

Boundary regularity of uniformly rotating vortex patches and an unstable elliptic free boundary problem

Authors: Yuchen Wang, Guanghui Zhang, Maolin Zhou

Abstract: In this paper, we consider a sign-changing free boundary problem that comes from the boundary regularity of rotating vortex patches of the two-dimensional incompressible Euler equations. A complete classification of singular points is obtained by establishing a new Weiss-type monotonicity formula. Building on these results, we prove that only $90^\circ$ corner-type singularities can occur at the boundary of a Lipschitz rotating vortex patch, while the other parts are $C^\infty$ smooth.

Submitted 17 March, 2024; v1 submitted 6 June, 2023; originally announced June 2023.

arXiv:2306.02416 [pdf, other]

Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation

Authors: Yunhe Gao, Zhuowei Li, Di Liu, Mu Zhou, Shaoting Zhang, Dimitris N. Metaxas

Abstract: A major focus of clinical imaging workflow is disease diagnosis and management, leading to medical imaging datasets strongly tied to specific clinical objectives. This scenario has led to the prevailing practice of developing task-specific segmentation models, without gaining insights from widespread imaging cohorts. Inspired by the training program of medical radiology residents, we propose a shift towards universal medical image segmentation, a paradigm aiming to build medical image understanding foundation models by leveraging the diversity and commonality across clinical targets, body regions, and imaging modalities. Towards this goal, we develop Hermes, a novel context-prior learning approach to address the challenges of data heterogeneity and annotation differences in medical image segmentation. In a large collection of eleven diverse datasets (2,438 3D images) across five modalities (CT, PET, T1, T2 and cine MRI) and multiple body regions, we demonstrate the merit of the universal paradigm over the traditional paradigm in addressing multiple tasks within a single model. By exploiting the synergy across tasks, Hermes achieves state-of-the-art performance on all testing datasets and shows superior model scalability. Results on two additional datasets reveal Hermes' strong performance for transfer learning, incremental learning, and generalization to downstream tasks.
Hermes' learned priors reflect the intricate relations among tasks and modalities, which aligns with the established anatomical and imaging principles in radiology. The code is available at https://github.com/yhygao/universal-medical-image-segmentation.

Submitted 6 April, 2024; v1 submitted 4 June, 2023; originally announced June 2023.

Comments: Accepted by CVPR 2024

arXiv:2306.00398 [pdf, other]

Preference-grounded Token-level Guidance for Language Model Fine-tuning

Authors: Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, Mingyuan Zhou

Abstract: Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the *sequence level* while LM training and generation both occur at the *token level*. There is, therefore, a *granularity mismatch* between the preference and the LM training losses, which may complicate the learning problem. In this paper, we address this issue by developing an alternate training process, where we iterate between grounding the sequence-level preference into token-level training guidance, and improving the LM with the learned guidance. For guidance learning, we design a framework that extends the pairwise-preference learning in imitation learning to both variable-length LM generation and the utilization of the preference among multiple generations. For LM training, based on the amount of supervised data, we present two *minimalist* learning objectives that utilize the learned guidance. In experiments, our method performs competitively on two distinct representative LM tasks -- discrete-prompt generation and text summarization.

Submitted 9 October, 2023; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv:2305.18641 [pdf, other]

Enhanced Chart Understanding in Vision and Language Task via Cross-modal Pre-training on Plot Table Pairs

Authors: Mingyang Zhou, Yi R. Fung, Long Chen, Christopher Thomas, Heng Ji, Shih-Fu Chang

Abstract: Building cross-modal intelligence that can understand charts and communicate the salient information hidden behind them is an appealing challenge in the vision and language (V+L) community. The capability to uncover the underlying table data of chart figures is key to automatic chart understanding. We introduce ChartT5, a V+L model that learns how to interpret table information from chart images via cross-modal pre-training on plot table pairs. Specifically, we propose two novel pre-training objectives, Masked Header Prediction (MHP) and Masked Value Prediction (MVP), to equip the model with different skills for interpreting the table information. We have conducted extensive experiments on chart question answering and chart summarization to verify the effectiveness of the proposed pre-training strategies. In particular, on the ChartQA benchmark, our ChartT5 outperforms the state-of-the-art non-pretraining methods by over 8%.

Submitted 29 May, 2023; originally announced May 2023.

Comments: Accepted by Findings of ACL 2023

arXiv:2305.18375 [pdf, other]

Learning to Jump: Thinning and Thickening Latent Counts for Generative Modeling

Authors: Tianqi Chen, Mingyuan Zhou

Abstract: Learning to denoise has emerged as a prominent paradigm to design state-of-the-art deep generative models for natural images. How to use it to model the distributions of both continuous real-valued data and categorical data has been well studied in recently proposed diffusion models. However, it is found in this paper to have limited ability in modeling some other types of data, such as count and non-negative continuous data, that are often highly sparse, skewed, heavy-tailed, and/or overdispersed. To this end, we propose learning to jump as a general recipe for generative modeling of various types of data. Using a forward count thinning process to construct learning objectives to train a deep neural network, it employs a reverse count thickening process to iteratively refine its generation through that network. We demonstrate when learning to jump is expected to perform comparably to learning to denoise, and when it is expected to perform better. For example, learning to jump is recommended when the training data is non-negative and exhibits strong sparsity, skewness, heavy-tailedness, and/or heterogeneity.

Submitted 28 May, 2023; originally announced May 2023.

Comments: ICML 2023

arXiv:2305.17030 [pdf, other]

doi 10.3847/1538-4365/acfd29

The First LHAASO Catalog of Gamma-Ray Sources

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen, et al. (255 additional authors not shown)

Abstract: We present the first catalog of very-high energy and ultra-high energy gamma-ray sources detected by the Large High Altitude Air Shower Observatory (LHAASO). The catalog was compiled using 508 days of data collected by the Water Cherenkov Detector Array (WCDA) from March 2021 to September 2022 and 933 days of data recorded by the Kilometer Squared Array (KM2A) from January 2020 to September 2022. This catalog represents the main result from the most sensitive large coverage gamma-ray survey of the sky above 1 TeV, covering declination from $-20^{\circ}$ to $80^{\circ}$. In total, the catalog contains 90 sources with an extended size smaller than $2^\circ$ and a significance of detection at $> 5\sigma$. Based on our source association criteria, 32 new TeV sources are proposed in this study. Among the 90 sources, 43 sources are detected with ultra-high energy ($E > 100$ TeV) emission at $> 4\sigma$ significance level. We provide the position, extension, and spectral characteristics of all the sources in this catalog.

Submitted 27 November, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

Comments: 40 pages, 13 figures, 4 tables

Journal ref: The Astrophysical Journal Supplement Series, 271 (2024) 25

arXiv:2305.16310 [pdf, other]

Securing Deep Generative Models with Universal Adversarial Signature

Authors: Yu Zeng, Mo Zhou, Yuan Xue, Vishal M. Patel

Abstract: Recent advances in deep generative models have led to the development of methods capable of synthesizing high-quality, realistic images. These models pose threats to society due to their potential misuse. Prior research attempted to mitigate these threats by detecting generated images, but the varying traces left by different generative models make it challenging to create a universal detector capable of generalizing to new, unseen generative models. In this paper, we propose to inject a universal adversarial signature into an arbitrary pre-trained generative model, in order to make its generated contents more detectable and traceable. First, the imperceptible optimal signature for each image can be found by a signature injector through adversarial training. Subsequently, the signature can be incorporated into an arbitrary generator by fine-tuning it with the images processed by the signature injector. In this way, the detector corresponding to the signature can be reused for any fine-tuned generator for tracking the generator identity. The proposed method is validated on the FFHQ and ImageNet datasets with various state-of-the-art generative models, consistently showing a promising detection rate. Code will be made publicly available at https://github.com/zengxianyu/genwm.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.15066 [pdf, other]

GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking

Authors: Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, Shi Han

Abstract: Large language models (LLM) like ChatGPT have become indispensable to artificial general intelligence (AGI), demonstrating excellent performance in various natural language processing tasks. In the real world, graph data is ubiquitous and an essential part of AGI and prevails in domains like social network analysis, bioinformatics and recommender systems. The training corpus of large language models often includes some algorithmic components, which allows them to achieve certain effects on some graph data-related problems. However, there is still little research on their performance on a broader range of graph-structured data. In this study, we conduct an extensive investigation to assess the proficiency of LLMs in comprehending graph data, employing a diverse range of structural and semantic-related tasks. Our analysis encompasses 10 distinct tasks that evaluate the LLMs' capabilities in graph understanding. Through our study, we not only uncover the current limitations of language models in comprehending graph structures and performing associated reasoning tasks but also emphasize the necessity for further advancements and novel approaches to enhance their graph processing capabilities. Our findings contribute valuable insights towards bridging the gap between language models and graph understanding, paving the way for more effective graph mining and knowledge extraction.

Submitted 11 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.14674 [pdf, other]

T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities

Authors: Kangfu Mei, Mo Zhou, Vishal M. Patel

Abstract: Diffusion Probabilistic Field (DPF) models the distribution of continuous functions defined over metric spaces. While DPF shows great potential for unifying data generation of various modalities including images, videos, and 3D geometry, it does not scale to a higher data resolution. This can be attributed to the "scaling property", where it is difficult for the model to capture local structures through uniform sampling. To this end, we propose a new model comprising a view-wise sampling algorithm to focus on local structure learning, and incorporating additional guidance, e.g., text description, to complement the global geometry. The model can be scaled to generate high-resolution data while unifying multiple modalities. Experimental results on data generation in various modalities demonstrate the effectiveness of our model, as well as its potential as a foundation framework for scalable modality-unified visual content generation.

Submitted 23 May, 2023; originally announced May 2023.

Comments: for project page, see https://t1-diffusion-model.github.io

arXiv:2305.13062 [pdf, other]

Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Authors: Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, Dongmei Zhang

Abstract: Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. Although tables can be used as input to LLMs with serialization, there is a lack of comprehensive studies that examine whether LLMs can truly comprehend such data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities (SUC) of LLMs. The benchmark we create includes seven tasks, each with its own unique challenges, e.g., cell lookup, row retrieval, and size detection. We perform a series of evaluations on GPT-3.5 and GPT-4. We find that performance varied depending on several input choices, including table input format, content order, role prompting, and partition marks. Drawing from the insights gained through the benchmark evaluations, we propose self-augmentation for effective structural prompting, such as critical value / range identification using internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact ($\uparrow2.31\%$), HybridQA ($\uparrow2.13\%$), SQA ($\uparrow2.72\%$), Feverous ($\uparrow0.84\%$), and ToTTo ($\uparrow5.68\%$). We believe that our open source benchmark and proposed prompting methods can serve as a simple yet generic selection for future research.

Submitted 17 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: This paper has been accepted as a full paper at WSDM 2024. The code will be released at https://github.com/microsoft/TableProvider

arXiv:2305.09967 [pdf, other]

Variable Length Embeddings

Authors: Johnathan Chiu, Andi Gu, Matt Zhou

Abstract: In this work, we introduce a novel deep learning architecture, Variable Length Embeddings (VLEs), an autoregressive model that can produce a latent representation composed of an arbitrary number of tokens. As a proof of concept, we demonstrate the capabilities of VLEs on tasks that involve reconstruction and image decomposition. We evaluate our experiments on a mix of the iNaturalist and ImageNet datasets and find that VLEs achieve comparable reconstruction results to a state-of-the-art VAE, using less than a tenth of the parameters.

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.09135 [pdf, ps, other]

Frobenius splitting of moduli spaces of parabolic bundles

Authors: Xiaotao Sun, Mingshuo Zhou

Abstract: Let $C$ be a nonsingular projective curve over an algebraically closed field of characteristic $p>0$ and $I\subset C$ be a finite set. If $\mathcal{U}_{C,\,ω}$ denotes the moduli space of semistable parabolic bundles of rank $r$ and degree $d$ on $C$ with parabolic structures determined by $ω=(k,\{\vec n(x),\vec a(x)\}_{x\in I})$, we prove that $\mathcal{U}_{C,\,ω}$ is $F$-split for generic $C$ and generic choice of $I$ when $p>3r$.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 42 pages

MSC Class: Algebraic Geometry; 14H60; 14D20

arXiv:2305.08544 [pdf, other]

doi 10.34133/research.0134

Quantum Neural Network for Quantum Neural Computing

Authors: Min-Gang Zhou, Zhi-Ping Liu, Hua-Lei Yin, Chen-Long Li, Tong-Kai Xu, Zeng-Bing Chen

Abstract: Neural networks have achieved impressive breakthroughs in both industry and academia. How to effectively develop neural networks on quantum computing devices is a challenging open problem. Here, we propose a new quantum neural network model for quantum neural computing using (classically-controlled) single-qubit operations and measurements on real-world quantum systems with naturally occurring environment-induced decoherence, which greatly reduces the difficulties of physical implementations. Our model circumvents the problem that the state-space size grows exponentially with the number of neurons, thereby greatly reducing memory requirements and allowing for fast optimization with traditional optimization algorithms. We benchmark our model for handwritten digit recognition and other nonlinear classification tasks. The results show that our model has an amazing nonlinear classification ability and robustness to noise. Furthermore, our model allows quantum computing to be applied in a wider context and inspires the earlier development of a quantum neural computer than standard quantum computers.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 10 pages, 6 figures

Journal ref: Research 6, 0134 (2023)

arXiv:2305.07835 [pdf, other]

Multi-Scenario Broadband Channel Measurement and Modeling for Sub-6 GHz RIS-Assisted Wireless Communication Systems

Authors: Jian Sang, Mingyong Zhou, Jifeng Lan, Boning Gao, Wankai Tang, Xiao Li, Shi Jin, Ertugrul Basar, Cen Li, Qiang Cheng, Tie Jun Cui

Abstract: Reconfigurable intelligent surface (RIS)-empowered communication has been widely considered one of the revolutionary technologies for next generation networks. However, due to the novel propagation characteristics of RISs, underlying RIS channel modeling and measurement research is still in its infancy and not fully investigated. In this paper, we conduct multi-scenario broadband channel measurements and modeling for RIS-assisted communications at the sub-6 GHz band. The measurements are carried out in three scenarios covering outdoor, indoor, and outdoor-to-indoor (O2I) environments, which suffer from non-line-of-sight (NLOS) propagation inherently. Three propagation modes, including intelligent reflection with RIS, specular reflection with RIS, and the mode without RIS, are taken into account in each scenario. In addition, considering the cascaded characteristics of RIS-assisted channel by nature, two modified empirical models including floating-intercept (FI) and close-in (CI) are proposed, which cover distance and angle domains. The measurement results rooted in 2096 channel acquisitions verify the prediction accuracy of these proposed models. Moreover, the propagation characteristics for RIS-assisted channels, including path loss (PL) gain, PL exponent, spatial consistency, time dispersion, frequency stationarity, etc., are compared and analyzed comprehensively. These channel measurement and modeling results may lay the groundwork for future applications of RIS-assisted communication systems in practice.

Submitted 13 May, 2023; originally announced May 2023.

arXiv:2305.07774 [pdf, other]

PanFlowNet: A Flow-Based Deep Network for Pan-sharpening

Authors: Gang Yang, Xiangyong Cao, Wenzhe Xiao, Man Zhou, Aiping Liu, Xun Chen, Deyu Meng

Abstract: Pan-sharpening aims to generate a high-resolution multispectral (HRMS) image by integrating the spectral information of a low-resolution multispectral (LRMS) image with the texture details of a high-resolution panchromatic (PAN) image. It essentially inherits the ill-posed nature of the super-resolution (SR) task that diverse HRMS images can degrade into an LRMS image. However, existing deep learning-based methods recover only one HRMS image from the LRMS image and PAN image using a deterministic mapping, thus ignoring the diversity of the HRMS image. In this paper, to alleviate this ill-posed issue, we propose a flow-based pan-sharpening network (PanFlowNet) to directly learn the conditional distribution of HRMS image given LRMS image and PAN image instead of learning a deterministic mapping. Specifically, we first transform this unknown conditional distribution into a given Gaussian distribution by an invertible network, and the conditional distribution can thus be explicitly defined. Then, we design an invertible Conditional Affine Coupling Block (CACB) and further build the architecture of PanFlowNet by stacking a series of CACBs. Finally, the PanFlowNet is trained by maximizing the log-likelihood of the conditional distribution given a training set and can then be used to predict diverse HRMS images. The experimental results verify that the proposed PanFlowNet can generate various HRMS images given an LRMS image and a PAN image. Additionally, the experimental results on different kinds of satellite datasets also demonstrate the superiority of our PanFlowNet compared with other state-of-the-art methods both visually and quantitatively.

Submitted 16 May, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

arXiv:2305.05372 [pdf, other]

doi 10.1103/PhysRevLett.131.151001

Measurement of ultra-high-energy diffuse gamma-ray emission of the Galactic plane from 10 TeV to 1 PeV with LHAASO-KM2A

Authors: Zhen Cao, F. Aharonian, Q. An, Axikegu, Y. X. Bai, Y. W. Bao, D. Bastieri, X. J. Bi, Y. J. Bi, J. T. Cai, Q. Cao, W. Y. Cao, Zhe Cao, J. Chang, J. F. Chang, A. M. Chen, E. S. Chen, Liang Chen, Lin Chen, Long Chen, M. J. Chen, M. L. Chen, Q. H. Chen, S. H. Chen, S. Z. Chen , et al. (255 additional authors not shown)

Abstract: The diffuse Galactic $γ$-ray emission, mainly produced via interactions between cosmic rays and the interstellar medium and/or radiation field, is a very important probe of the distribution, propagation, and interaction of cosmic rays in the Milky Way. In this work we report the measurements of diffuse $γ$-rays from the Galactic plane between 10 TeV and 1 PeV energies, with the square kilometer array of the Large High Altitude Air Shower Observatory (LHAASO). Diffuse emissions from the inner ($15^{\circ}<l<125^{\circ}$, $|b|<5^{\circ}$) and outer ($125^{\circ}<l<235^{\circ}$, $|b|<5^{\circ}$) Galactic plane are detected with $29.1σ$ and $12.7σ$ significance, respectively. The outer Galactic plane diffuse emission is detected for the first time in the very- to ultra-high-energy domain ($E>10$ TeV). The energy spectrum in the inner Galaxy regions can be described by a power-law function with an index of $-2.99\pm0.04$, which is different from the curved spectrum as expected from hadronic interactions between locally measured cosmic rays and the line-of-sight integrated gas content. Furthermore, the measured flux is higher by a factor of $\sim3$ than the prediction. A similar spectrum with an index of $-2.99\pm0.07$ is found in the outer Galaxy region, and the absolute flux for $10\lesssim E\lesssim60$ TeV is again higher than the prediction for hadronic cosmic ray interactions. The latitude distributions of the diffuse emission are consistent with the gas distribution, while the longitude distributions show clear deviation from the gas distribution. The LHAASO measurements imply that either additional emission sources exist or cosmic ray intensities have spatial variations.

Submitted 19 August, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

Comments: 12 pages, 8 figures, 5 tables; accepted for publication in Physical Review Letters; source mask file provided as ancillary file

Journal ref: Phys. Rev. Lett. 131, 151001 (2023)

arXiv:2305.02838 [pdf]

doi 10.1088/1361-648X/acc3ec

Superconductivity in Mo$_4$Ga$_{20}$As with Endohedral Gallium Clusters

Authors: Bin-Bin Ruan, Le-Wei Chen, Yun-Qing Shi, Jun-Kun Yi, Qing-Song Yang, Meng-Hu Zhou, Ming-Wei Ma, Gen-Fu Chen, Zhi-An Ren

Abstract: We report the discovery and detailed investigation of superconductivity in Mo$_4$Ga$_{20}$As. Mo$_4$Ga$_{20}$As crystallizes in the space group of $I4/m$ (No. 87), with lattice parameters $a$ = 12.86352 Å and $c$ = 5.30031 Å. The resistivity, magnetization, and specific heat data reveal Mo$_4$Ga$_{20}$As to be a type-II superconductor with $T_c$ = 5.6 K. The upper and lower critical fields are estimated to be 2.78 T and 22.0 mT, respectively. In addition, electron-phonon coupling in Mo$_4$Ga$_{20}$As is possibly stronger than the BCS weak-coupling limit. First-principles calculations suggest the Fermi level being dominated by the Mo-4$d$ and Ga-4$p$ orbitals.

Submitted 4 May, 2023; originally announced May 2023.

Journal ref: Journal of Physics: Condensed Matter 2023 35, 214002

arXiv:2305.02499 [pdf, other]

AutoML-GPT: Automatic Machine Learning with GPT

Authors: Shujian Zhang, Chengyue Gong, Lemeng Wu, Xingchao Liu, Mingyuan Zhou

Abstract: AI tasks encompass a wide range of domains and fields. While numerous AI models have been designed for specific tasks and applications, they often require considerable human efforts in finding the right model architecture, optimization algorithm, and hyperparameters. Recent advances in large language models (LLMs) like ChatGPT show remarkable capabilities in various aspects of reasoning, comprehension, and interaction. Consequently, we propose developing task-oriented prompts and automatically utilizing LLMs to automate the training pipeline. To implement this concept, we present the AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyperparameters. AutoML-GPT dynamically takes user requests from the model and data cards and composes the corresponding prompt paragraph. Ultimately, with this prompt paragraph, AutoML-GPT will automatically conduct the experiments from data processing to model architecture, hyperparameter tuning, and predicted training log. By leveraging {\ours}'s robust language capabilities and the available AI models, AutoML-GPT can tackle numerous intricate AI tasks across various tasks and datasets. This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas. Extensive experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many AI tasks.

Submitted 3 May, 2023; originally announced May 2023.

arXiv:2305.01115 [pdf, other]

In-Context Learning Unlocked for Diffusion Models

Authors: Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, Mingyuan Zhou

Abstract: We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and a text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly over six different tasks using these prompts. The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation on the trained tasks and generalizes effectively to new, unseen vision tasks with their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.

Submitted 18 October, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

arXiv:2305.00562 [pdf, other]

Class-Balancing Diffusion Models

Authors: Yiming Qin, Huangjie Zheng, Jiangchao Yao, Mingyuan Zhou, Ya Zhang

Abstract: Diffusion-based models have shown the merits of generating high-quality visual data while preserving better diversity in recent studies. However, such observation is only justified with curated data distribution, where the data samples are nicely pre-processed to be uniformly distributed in terms of their labels. In practice, a long-tailed data distribution appears more common and how diffusion models perform on such class-imbalanced data remains unknown. In this work, we first investigate this problem and observe significant degradation in both diversity and fidelity when the diffusion model is trained on datasets with class-imbalanced distributions. Especially in tail classes, the generations largely lose diversity and we observe severe mode-collapse issues. To tackle this problem, we set from the hypothesis that the data distribution is not class-balanced, and propose Class-Balancing Diffusion Models (CBDM) that are trained with a distribution adjustment regularizer as a solution. Experiments show that images generated by CBDM exhibit higher diversity and quality in both quantitative and qualitative ways. Our method benchmarked the generation results on CIFAR100/CIFAR100LT dataset and shows outstanding performance on the downstream recognition task.

Submitted 14 June, 2023; v1 submitted 30 April, 2023; originally announced May 2023.

Comments: Accepted by CVPR2023

arXiv:2305.00350 [pdf, other]

POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models

Authors: Korawat Tanwisuth, Shujian Zhang, Huangjie Zheng, Pengcheng He, Mingyuan Zhou

Abstract: Through prompting, large-scale pre-trained models have become more expressive and powerful, gaining significant attention in recent years. Though these big models have zero-shot capabilities, in general, labeled data are still required to adapt them to downstream tasks. To overcome this critical limitation, we propose an unsupervised fine-tuning framework to directly fine-tune the model or prompt on the unlabeled target data. We demonstrate how to apply our method to both language-augmented vision and masked-language models by aligning the discrete distributions extracted from the prompts and target data. To verify our approach's applicability, we conduct extensive experiments on image classification, sentiment analysis, and natural language inference tasks. Across 13 image-related tasks and 15 language-related ones, the proposed approach achieves consistent improvements over the baselines.

Submitted 29 April, 2023; originally announced May 2023.

Comments: ICML 2023; PyTorch code is available at https://github.com/korawat-tanwisuth/POUF

arXiv:2304.14971 [pdf, ps, other]

doi 10.1145/3589309

Popularity Ratio Maximization: Surpassing Competitors through Influence Propagation

Authors: Hao Liao, Sheng Bi, Jiao Wu, Wei Zhang, Mingyang Zhou, Rui Mao, Wei Chen

Abstract: In this paper, we present an algorithmic study on how to surpass competitors in popularity by strategic promotions in social networks. We first propose a novel model, in which we integrate the Preferential Attachment (PA) model for popularity growth with the Independent Cascade (IC) model for influence propagation in social networks called PA-IC model. In PA-IC, a popular item and a novice item grab shares of popularity from the natural popularity growth via the PA model, while the novice item tries to gain extra popularity via influence cascade in a social network. The popularity ratio is defined as the ratio of the popularity measure between the novice item and the popular item. We formulate Popularity Ratio Maximization (PRM) as the problem of selecting seeds in multiple rounds to maximize the popularity ratio in the end. We analyze the popularity ratio and show that it is monotone but not submodular. To provide an effective solution, we devise a surrogate objective function and show that empirically it is very close to the original objective function while theoretically, it is monotone and submodular. We design two efficient algorithms, one for the overlapping influence and non-overlapping seeds (across rounds) setting and the other for the non-overlapping influence and overlapping seed setting, and further discuss how to deal with other models and problem variants. Our empirical evaluation further demonstrates that the proposed PRM-IMM method consistently achieves the best popularity promotion compared to other methods. Our theoretical and empirical analyses shed light on the interplay between influence maximization and preferential attachment in social networks.

Submitted 28 April, 2023; originally announced April 2023.

Comments: 22 pages, 8 figures, to appear at SIGMOD 2023

arXiv:2304.12526 [pdf, other]

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

Authors: Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou

Abstract: Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, e.g., as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.

Submitted 18 October, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

arXiv:2304.04968 [pdf, other]

Re-imagine the Negative Prompt Algorithm: Transform 2D Diffusion into 3D, alleviate Janus problem and Beyond

Authors: Mohammadreza Armandpour, Ali Sadeghian, Huangjie Zheng, Amir Sadeghian, Mingyuan Zhou

Abstract: Although text-to-image diffusion models have made significant strides in generating images from text, they are sometimes more inclined to generate images like the data on which the model was trained rather than the provided text. This limitation has hindered their usage in both 2D and 3D applications. To address this problem, we explored the use of negative prompts but found that the current implementation fails to produce desired results, particularly when there is an overlap between the main and negative prompts. To overcome this issue, we propose Perp-Neg, a new algorithm that leverages the geometrical properties of the score space to address the shortcomings of the current negative prompts algorithm. Perp-Neg does not require any training or fine-tuning of the model. Moreover, we experimentally demonstrate that Perp-Neg provides greater flexibility in generating images by enabling users to edit out unwanted concepts from the initially generated images in 2D cases. Furthermore, to extend the application of Perp-Neg to 3D, we conducted a thorough exploration of how Perp-Neg can be used in 2D to condition the diffusion model to generate desired views, rather than being biased toward the canonical views. Finally, we applied our 2D intuition to integrate Perp-Neg with the state-of-the-art text-to-3D (DreamFusion) method, effectively addressing its Janus (multi-head) problem. Our project page is available at https://Perp-Neg.github.io/

Submitted 26 April, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: Our project page is available at https://Perp-Neg.github.io/

arXiv:2304.04484 [pdf, other]

Quasi-Synchronous Random Access for Massive MIMO-Based LEO Satellite Constellations

Authors: Keke Ying, Zhen Gao, Sheng Chen, Mingyu Zhou, Dezhi Zheng, Symeon Chatzinotas, Björn Ottersten, H. Vincent Poor

Abstract: Low earth orbit (LEO) satellite constellation-enabled communication networks are expected to be an important part of many Internet of Things (IoT) deployments due to their unique advantage of providing seamless global coverage. In this paper, we investigate the random access problem in massive multiple-input multiple-output-based LEO satellite systems, where the multi-satellite cooperative processing mechanism is considered. Specifically, at edge satellite nodes, we conceive a training sequence padded multi-carrier system to overcome the issue of imperfect synchronization, where the training sequence is utilized to detect the devices' activity and estimate their channels. Considering the inherent sparsity of terrestrial-satellite links and the sporadic traffic feature of IoT terminals, we utilize the orthogonal approximate message passing-multiple measurement vector algorithm to estimate the delay coefficients and user terminal activity. To further utilize the structure of the receive array, a two-dimensional estimation of signal parameters via rotational invariance technique is performed for enhancing channel estimation. Finally, at the central server node, we propose a majority voting scheme to enhance activity detection by aggregating backhaul information from multiple satellites. Moreover, multi-satellite cooperative linear data detection and multi-satellite cooperative Bayesian dequantization data detection are proposed to cope with perfect and quantized backhaul, respectively. Simulation results verify the effectiveness of our proposed schemes in terms of channel estimation, activity detection, and data detection for quasi-synchronous random access in satellite systems.

Submitted 10 April, 2023; originally announced April 2023.

Comments: 38 pages, 16 figures. This paper has been accepted by IEEE JSAC SI on 3GPP Technologies: 5G-Advanced and Beyond. Copyright may be transferred without notice, after which this version may no longer be accessible

Showing 151–200 of 1,131 results for author: Zhou, M