Search | arXiv e-print repository

Advancing Process Verification for Large Language Models via Tree-Based Preference Learning

Authors: Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, Weiming Lu

Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales.Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, t… ▽ More Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales.Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing benchmarks. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%).Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2406.19417 [pdf, other]

"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models

Authors: Zhen Tan, Chengshuai Zhao, Raha Moraffah, Yifan Li, Song Wang, Jundong Li, Tianlong Chen, Huan Liu

Abstract: Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionall… ▽ More Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases, improving their performance in applications like fact-checking and information searching. In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases by injecting deceptive content into the retrieval database, intentionally changing the model's behavior. This threat is critical as it mirrors real-world usage scenarios where RAG systems interact with publicly accessible knowledge bases, such as web scra**s and user-contributed data pools. To be more realistic, we target a realistic setting where the adversary has no knowledge of users' queries, knowledge base data, and the LLM parameters. We demonstrate that it is possible to exploit the model successfully through crafted content uploads with access to the retriever. Our findings emphasize an urgent need for security measures in the design and deployment of RAG systems to prevent potential manipulation and ensure the integrity of machine-generated content. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.18981 [pdf, other]

Exact Fisher zeros and thermofield dynamics across a quantum critical point

Authors: Yang Liu, Songtai Lv, Yuchen Meng, Zefan Tan, Erhai Zhao, Haiyuan Zou

Abstract: By setting the inverse temperature $β$ loose to occupy the complex plane, Michael E. Fisher showed that the zeros of the complex partition function $Z$, if approaching the real $β$ axis, reveal a thermodynamic phase transition. More recently, Fisher zeros have been used to mark the dynamical phase transition in quench dynamics. The success of Fisher zeros however seems limited, and it is unclear h… ▽ More By setting the inverse temperature $β$ loose to occupy the complex plane, Michael E. Fisher showed that the zeros of the complex partition function $Z$, if approaching the real $β$ axis, reveal a thermodynamic phase transition. More recently, Fisher zeros have been used to mark the dynamical phase transition in quench dynamics. The success of Fisher zeros however seems limited, and it is unclear how they can be employed to shed light on quantum phase transitions or the non-unitary dynamics of open quantum systems. Here we answer this question by a comprehensive analysis of the (analytically continued) one-dimensional transverse field Ising model. We exhaust all the Fisher zeros to show that in the thermodynamic limit they congregate into a remarkably simple pattern in the form of continuous open or closed lines. These Fisher lines evolve smoothly as the coupling constant is tuned, and a qualitative change identifies the quantum critical point. By exploiting the connection between $Z$ and the thermofield double states, we obtain analytical expressions for the short- and long-time dynamics of the survival amplitude and the scaling of recurrence time at the quantum critical point. We further point out $Z$ can be realized and probed in monitored quantum circuits. The analytical results are corroborated by numerical tensor renormalization group which elevates the approach outlined here to a powerful tool for interacting quantum systems. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 6+4 pages; 3+4 figures

arXiv:2406.17992 [pdf, other]

Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models

Authors: Bohan Jiang, Chengshuai Zhao, Zhen Tan, Huan Liu

Abstract: Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a cons… ▽ More Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that \textit{DELD} significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: 10 pages, 5 figures

arXiv:2406.17841 [pdf, other]

Probing many-body Bell correlation depth with superconducting qubits

Authors: Ke Wang, Weikang Li, Shibo Xu, Mengyao Hu, Jiachen Chen, Yaozu Wu, Chuanyu Zhang, Feitong **, Xuhao Zhu, Yu Gao, Ziqi Tan, Aosai Zhang, Ning Wang, Yiren Zou, Tingting Li, Fanhao Shen, Jiarun Zhong, Zehang Bao, Zitian Zhu, Zixuan Song, **feng Deng, Hang Dong, Xu Zhang, Pengfei Zhang, Wenjie Jiang , et al. (10 additional authors not shown)

Abstract: Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing… ▽ More Quantum nonlocality describes a stronger form of quantum correlation than that of entanglement. It refutes Einstein's belief of local realism and is among the most distinctive and enigmatic features of quantum mechanics. It is a crucial resource for achieving quantum advantages in a variety of practical applications, ranging from cryptography and certified random number generation via self-testing to machine learning. Nevertheless, the detection of nonlocality, especially in quantum many-body systems, is notoriously challenging. Here, we report an experimental certification of genuine multipartite Bell correlations, which signal nonlocality in quantum many-body systems, up to 24 qubits with a fully programmable superconducting quantum processor. In particular, we employ energy as a Bell correlation witness and variationally decrease the energy of a many-body system across a hierarchy of thresholds, below which an increasing Bell correlation depth can be certified from experimental data. As an illustrating example, we variationally prepare the low-energy state of a two-dimensional honeycomb model with 73 qubits and certify its Bell correlations by measuring an energy that surpasses the corresponding classical bound with up to 48 standard deviations. In addition, we variationally prepare a sequence of low-energy states and certify their genuine multipartite Bell correlations up to 24 qubits via energies measured efficiently by parity oscillation and multiple quantum coherence techniques. Our results establish a viable approach for preparing and certifying multipartite Bell correlations, which provide not only a finer benchmark beyond entanglement for quantum devices, but also a valuable guide towards exploiting multipartite Bell correlation in a wide spectrum of practical applications. △ Less

Submitted 25 June, 2024; originally announced June 2024.

Comments: 11 pages,6 figures + 14 pages, 6 figures

arXiv:2406.16562 [pdf, other]

EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Authors: Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Meng** Yang, Cheng Zhang, Hao Li

Abstract: The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our ap… ▽ More The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment. △ Less

Submitted 26 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

Comments: Github Repository: https://github.com/SAIS-FUXI/EvalAlign

arXiv:2406.16260 [pdf, other]

Video-Infinity: Distributed Long Video Generation

Authors: Zhenxiong Tan, Xingyi Yang, Songhua Liu, Xinchao Wang

Abstract: Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A s… ▽ More Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across the devices. Together, the two mechanisms join forces to distribute the workload and enable the fast generation of long videos. Under an 8 x Nvidia 6000 Ada GPU (48G) setup, our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods. △ Less

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.16003 [pdf]

Unidirectional Chiral Emission via Twisted Bi-layer Metasurfaces

Authors: Dmitrii Gromyko, Shu An, Sergey Gorelik, Jiahui Xu, Li Jun Lim, Henry Yit Loong Lee, Febiana Tjiptoharsono, Zhi-Kuang Tan, Cheng-Wei Qiu, Zhaogang Dong, Lin Wu

Abstract: Controlling and channelling light emissions from unpolarized quantum dots into specific directions with chiral polarization remains a key challenge in modern photonics. Stacked metasurface designs offer a potential compact solution for chirality and directionality engineering. However, experimental observations of directional chiral radiation from resonant metasurfaces with quantum emitters remain… ▽ More Controlling and channelling light emissions from unpolarized quantum dots into specific directions with chiral polarization remains a key challenge in modern photonics. Stacked metasurface designs offer a potential compact solution for chirality and directionality engineering. However, experimental observations of directional chiral radiation from resonant metasurfaces with quantum emitters remain obscure. In this paper, we present experimental observations of unidirectional chiral emission from a twisted bi-layer metasurface via multi-dimensional control, including twist angle, interlayer distance, and lateral displacement between the top and bottom layers, as enabled by doublet alignment lithography (DAL). First, maintaining alignment, the metasurface demonstrates a resonant intrinsic optical chirality with near-unity circular dichroism of 0.94 and reflectance difference of 74%, where a high circular dichroism greater than 0.9 persists across a wide range of angles from -11 to 11 degrees. Second, engineered lateral displacement induces a unidirectional chiral resonance, resulting in unidirectional chiral emission from the quantum dots deposited onto the metasurface. Our bi-layer metasurfaces offer a universal compact platform for efficient radiation manipulation over a wide angular range, promising potential applications in miniaturized lasers, grating couplers, and chiral nanoantennas. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: 16 pages, 4 figures

arXiv:2406.15992 [pdf, other]

Can LLM Graph Reasoning Generalize beyond Pattern Memorization?

Authors: Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, Yulia Tsvetkov

Abstract: Large language models (LLMs) demonstrate great potential for problems with implicit graphical structures, while recent works seek to enhance the graph reasoning capabilities of LLMs through specialized instruction tuning. The resulting 'graph LLMs' are evaluated with in-distribution settings only, thus it remains underexplored whether LLMs are learning generalizable graph reasoning skills or merel… ▽ More Large language models (LLMs) demonstrate great potential for problems with implicit graphical structures, while recent works seek to enhance the graph reasoning capabilities of LLMs through specialized instruction tuning. The resulting 'graph LLMs' are evaluated with in-distribution settings only, thus it remains underexplored whether LLMs are learning generalizable graph reasoning skills or merely memorizing patterns in the synthetic training data. To this end, we propose the NLGift benchmark, an evaluation suite of LLM graph reasoning generalization: whether LLMs could go beyond semantic, numeric, structural, reasoning patterns in the synthetic training data and improve utility on real-world graph-based tasks. Extensive experiments with two LLMs across four graph reasoning tasks demonstrate that while generalization on simple patterns (semantic, numeric) is somewhat satisfactory, LLMs struggle to generalize across reasoning and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks with underlying network structures. We explore three strategies to improve LLM graph reasoning generalization, and we find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question. △ Less

Submitted 22 June, 2024; originally announced June 2024.

Comments: 16 pages, 6 figures, Code and data will be publicly available at https://github.com/MatthewYZhang/NLGift

ACM Class: I.2.7

arXiv:2406.13906 [pdf, other]

Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data

Authors: Ye Tian, Peng Wu, Zhiqiang Tan

Abstract: The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augment… ▽ More The accessibility of vast volumes of unlabeled data has sparked growing interest in semi-supervised learning (SSL) and covariate shift transfer learning (CSTL). In this paper, we present an inference framework for estimating regression coefficients in conditional mean models within both SSL and CSTL settings, while allowing for the misspecification of conditional mean models. We develop an augmented inverse probability weighted (AIPW) method, employing regularized calibrated estimators for both propensity score (PS) and outcome regression (OR) nuisance models, with PS and OR models being sequentially dependent. We show that when the PS model is correctly specified, the proposed estimator achieves consistency, asymptotic normality, and valid confidence intervals, even with possible OR model misspecification and high-dimensional data. Moreover, by suppressing detailed technical choices, we demonstrate that previous methods can be unified within our AIPW framework. Our theoretical findings are verified through extensive simulation studies and a real-world data application. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13779 [pdf, other]

doi 10.1145/3637528.3672065

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

Authors: Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, **jie Gu

Abstract: Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability of utilizing search engine to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the g… ▽ More Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability of utilizing search engine to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study on answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers and construct two datasets accordingly. Then we propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling in different levels of granularity. Our generic framework comprises conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed \textit{Factuality-optimized RAG (FoRAG)} method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B in terms of three commonly used metrics (i.e., coherence, helpfulness, and factuality), while the number of parameters is much smaller (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: https://huggingface.co/forag. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Report number: 30th

Journal ref: KDD 2024

arXiv:2406.12867 [pdf, ps, other]

Classification of quasi-affine Generalized Dynkin Diagrams with Rank $3$ and Rank $2$

Authors: Zhengtang Tan, Shouchuan Zhang

Abstract: All quasi-affine connected Generalized Dynkin Diagram with rank $= 3$ and $2$ are found. All quasi-affine Nichols (Lie braided) algebras with rank $ 3$ and $2$ are also found. All quasi-affine connected Generalized Dynkin Diagram with rank $= 3$ and $2$ are found. All quasi-affine Nichols (Lie braided) algebras with rank $ 3$ and $2$ are also found. △ Less

Submitted 9 April, 2024; originally announced June 2024.

Comments: 338 pages

MSC Class: 16W30; 16G10

arXiv:2406.12313 [pdf]

A framework for develo** a knowledge management platform

Authors: Marie Lisandra Zepeda Mendoza, Sonali Agarwal, James A. Blackshaw, Vanesa Bol, Audrey Fazzi, Filippo Fiorini, Amy Louise Foreman, Nancy George, Brett R. Johnson, Brian Martin, Dave McComb, Euphemia Mutasa-Gottgens, Helen Parkinson, Martin Romacker, Rolf Russell, Valérien Ségard, Shawn Zheng Kai Tan, Wei Kheng Teh, F. P. Winstanley, Benedict Wong, Adrian M. Smith

Abstract: Knowledge management (KM) involves collecting, organizing, storing, and disseminating information to improve decision-making, innovation, and performance. Implementing KM at scale has become essential for organizations to effectively leverage vast accessible data. This paper is a compilation of concepts that emerged from KM workshops hosted by EMBL-EBI, attended by SMEs and industry. We provide gu… ▽ More Knowledge management (KM) involves collecting, organizing, storing, and disseminating information to improve decision-making, innovation, and performance. Implementing KM at scale has become essential for organizations to effectively leverage vast accessible data. This paper is a compilation of concepts that emerged from KM workshops hosted by EMBL-EBI, attended by SMEs and industry. We provide guidance on envisioning, executing, evaluating, and evolving knowledge management platforms. We emphasize essential considerations such as setting knowledge domain boundaries and measuring success, as well as the importance of making knowledge accessible for downstream applications and non-computational users and highlights necessary personal and organizational skills for success. We stress the importance of collaboration and the need for convergence on shared principles and commitment to provide or seek resources to advance KM. The community is invited to join the journey of KM and contribute to the advancement of the field by applying and improving on the guidelines described. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 18 pages, 1 figure

arXiv:2406.10471 [pdf, other]

Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts

Authors: Zhaoxuan Tan, Zheyuan Liu, Meng Jiang

Abstract: Personalized large language models (LLMs) aim to tailor interactions, content, and recommendations to individual user preferences. While parameter-efficient fine-tuning (PEFT) methods excel in performance and generalization, they are costly and limit communal benefits when used individually. To this end, we introduce Personalized Pieces (Per-Pcs), a framework that allows users to safely share and… ▽ More Personalized large language models (LLMs) aim to tailor interactions, content, and recommendations to individual user preferences. While parameter-efficient fine-tuning (PEFT) methods excel in performance and generalization, they are costly and limit communal benefits when used individually. To this end, we introduce Personalized Pieces (Per-Pcs), a framework that allows users to safely share and assemble personalized PEFT efficiently with collaborative efforts. Per-Pcs involves selecting sharers, breaking their PEFT into pieces, and training gates for each piece. These pieces are added to a pool, from which target users can select and assemble personalized PEFT using their history data. This approach preserves privacy and enables fine-grained user modeling without excessive storage and computation demands. Experimental results show Per-Pcs outperforms non-personalized and PEFT retrieval baselines, offering performance comparable to OPPU with significantly lower resource use across six tasks. Further analysis highlights Per-Pcs's robustness concerning sharer count and selection strategy, pieces sharing ratio, and scalability in computation time and storage space. Per-Pcs's modularity promotes safe sharing, making LLM personalization more efficient, effective, and widely accessible through collaborative efforts. △ Less

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09899 [pdf, other]

Learning Solution-Aware Transformers for Efficiently Solving Quadratic Assignment Problem

Authors: Zhentao Tan, Yadong Mu

Abstract: Recently various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAPs), which stands as a formidable challenge in combinatorial optimization. While many instances of sim… ▽ More Recently various optimization problems, such as Mixed Integer Linear Programming Problems (MILPs), have undergone comprehensive investigation, leveraging the capabilities of machine learning. This work focuses on learning-based solutions for efficiently solving the Quadratic Assignment Problem (QAPs), which stands as a formidable challenge in combinatorial optimization. While many instances of simpler problems admit fully polynomial-time approximate solution (FPTAS), QAP is shown to be strongly NP-hard. Even finding a FPTAS for QAP is difficult, in the sense that the existence of a FPTAS implies $P = NP$. Current research on QAPs suffer from limited scale and computational inefficiency. To attack the aforementioned issues, we here propose the first solution of its kind for QAP in the learn-to-improve category. This work encodes facility and location nodes separately, instead of forming computationally intensive association graphs prevalent in current approaches. This design choice enables scalability to larger problem sizes. Furthermore, a \textbf{S}olution \textbf{AW}are \textbf{T}ransformer (SAWT) architecture integrates the incumbent solution matrix with the attention score to effectively capture higher-order information of the QAPs. Our model's effectiveness is validated through extensive experiments on self-generated QAP instances of varying sizes and the QAPLIB benchmark. △ Less

Submitted 19 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2406.08785 [pdf, other]

BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Authors: Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, **gdong Wang, Gaoang Wang, Xi Li

Abstract: Vision-based roadside 3D object detection has attracted rising attention in autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. While previous work mainly focuses on accurately estimating depth or height for 2D-to-3D map**, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we p… ▽ More Vision-based roadside 3D object detection has attracted rising attention in autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. While previous work mainly focuses on accurately estimating depth or height for 2D-to-3D map**, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves comparable inference time as the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP in vehicle, pedestrian and cyclist. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.06911 [pdf, other]

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Zhenxiong Tan, Xinchao Wang

Abstract: Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enable… ▽ More Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff. △ Less

Submitted 27 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Work in progress. Project Page: https://czg1225.github.io/asyncdiff_page/

arXiv:2406.06904 [pdf, other]

Person Transfer in the Field: Examining Real World Sequential Human-Robot Interaction Between Two Robots

Authors: Xiang Zhi Tan, Elizabeth J. Carter, Aaron Steinfeld

Abstract: With more robots being deployed in the world, users will likely interact with multiple robots sequentially when receiving services. In this paper, we describe an exploratory field study in which unsuspecting participants experienced a ``person transfer'' -- a scenario in which they first interacted with one stationary robot before another mobile robot joined to complete the interaction. In our 7-h… ▽ More With more robots being deployed in the world, users will likely interact with multiple robots sequentially when receiving services. In this paper, we describe an exploratory field study in which unsuspecting participants experienced a ``person transfer'' -- a scenario in which they first interacted with one stationary robot before another mobile robot joined to complete the interaction. In our 7-hour study spanning 4 days, we recorded 18 instances of person transfers with 40+ individuals. We also interviewed 11 participants after the interaction to further understand their experience. We used the recorded video and interview data to extract interesting insights about in-the-field sequential human-robot interaction, such as mobile robot handovers, trust in person transfer, and the importance of the robots' positions. Our findings expose pitfalls and present important factors to consider when designing sequential human-robot interaction. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted to RO-MAN 2024

arXiv:2406.06295 [pdf, other]

Zero-Shot Audio Captioning Using Soft and Hard Prompts

Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

Abstract: In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these model… ▽ More In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, which, however, has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space.In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP.We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, facilitating the model to generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

arXiv:2406.06160 [pdf, other]

The Effect of Training Dataset Size on Discriminative and Diffusion-Based Speech Enhancement Systems

Authors: Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Abstract: The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but… ▽ More The performance of deep neural network-based speech enhancement systems typically increases with the training dataset size. However, studies that investigated the effect of training dataset size on speech enhancement performance did not consider recent approaches, such as diffusion-based generative models. Diffusion models are typically trained with massive datasets for image generation tasks, but whether this is also required for speech enhancement is unknown. Moreover, studies that investigated the effect of training dataset size did not control for the data diversity. It is thus unclear whether the performance improvement was due to the increased dataset size or diversity. Therefore, we systematically investigate the effect of training dataset size on the performance of popular state-of-the-art discriminative and diffusion-based speech enhancement systems. We control for the data diversity by using a fixed set of speech utterances, noise segments and binaural room impulse responses to generate datasets of different sizes. We find that the diffusion-based systems do not benefit from increasing the training dataset size as much as the discriminative systems. They perform the best relative to the discriminative systems with datasets of 10 h or less, but they are outperformed by the discriminative systems with datasets of 100 h or more. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.06140 [pdf, other]

Can I understand what I create? Self-Knowledge Evaluation of Large Language Models

Authors: Zhiquan Tan, Lai Wei, **dong Wang, Xing Xie, Weiran Huang

Abstract: Large language models (LLMs) have achieved remarkable progress in linguistic tasks, necessitating robust evaluation frameworks to understand their capabilities and limitations. Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework that is easy to implement, evaluating models on their ability to comprehend and respond to self-generated… ▽ More Large language models (LLMs) have achieved remarkable progress in linguistic tasks, necessitating robust evaluation frameworks to understand their capabilities and limitations. Inspired by Feynman's principle of understanding through creation, we introduce a self-knowledge evaluation framework that is easy to implement, evaluating models on their ability to comprehend and respond to self-generated questions. Our findings, based on testing multiple models across diverse tasks, reveal significant gaps in the model's self-knowledge ability. Further analysis indicates these gaps may be due to misalignment with human attention mechanisms. Additionally, fine-tuning on self-generated math task may enhance the model's math performance, highlighting the potential of the framework for efficient and insightful model evaluation and may also contribute to the improvement of LLMs. △ Less

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.05270 [pdf]

fastMRI Breast: A publicly available radial k-space dataset of breast dynamic contrast-enhanced MRI

Authors: Eddy Solomon, Patricia M. Johnson, Zhengguo Tan, Radhika Tibrewala, Yvonne W. Lui, Florian Knoll, Linda Moy, Sungheon Gene Kim, Laura Heacock

Abstract: This data curation work introduces the first large-scale dataset of radial k-space and DICOM data for breast DCE-MRI acquired in diagnostic breast MRI exams. Our dataset includes case-level labels indicating patient age, menopause status, lesion status (negative, benign, and malignant), and lesion type for each case. The public availability of this dataset and accompanying reconstruction code will… ▽ More This data curation work introduces the first large-scale dataset of radial k-space and DICOM data for breast DCE-MRI acquired in diagnostic breast MRI exams. Our dataset includes case-level labels indicating patient age, menopause status, lesion status (negative, benign, and malignant), and lesion type for each case. The public availability of this dataset and accompanying reconstruction code will support research and development of fast and quantitative breast image reconstruction and machine learning methods. △ Less

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2406.03999 [pdf, other]

Unveiling the Dynamics of Information Interplay in Supervised Learning

Authors: Kun Song, Zhiquan Tan, Bochao Zou, Huimin Ma, Weiran Huang

Abstract: In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce matrix mutual information ratio (MIR) and matrix entropy difference ratio (HDR) to assess the interactions of… ▽ More In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce matrix mutual information ratio (MIR) and matrix entropy difference ratio (HDR) to assess the interactions of data representation and class classification heads in supervised learning, and we determine the theoretical optimal values for MIR and HDR when Neural Collapse happens. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, which is an intriguing phenomenon observed in supervised training, where the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method's effectiveness, demonstrating that the utilization of MIR and HDR not only aids in comprehending the dynamics throughout the training process but can also enhances the training procedure itself. △ Less

Submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2406.03799 [pdf]

Enhanced Semantic Segmentation Pipeline for WeatherProof Dataset Challenge

Authors: Nan Zhang, Xidan Zhang, Jianing Wei, Fangjun Wang, Zhiming Tan

Abstract: This report describes the winning solution to the WeatherProof Dataset Challenge (CVPR 2024 UG2+ Track 3). Details regarding the challenge are available at https://cvpr2024ug2challenge.github.io/track3.html. We propose an enhanced semantic segmentation pipeline for this challenge. Firstly, we improve semantic segmentation models, using backbone pretrained with Depth Anything to improve UperNet mod… ▽ More This report describes the winning solution to the WeatherProof Dataset Challenge (CVPR 2024 UG2+ Track 3). Details regarding the challenge are available at https://cvpr2024ug2challenge.github.io/track3.html. We propose an enhanced semantic segmentation pipeline for this challenge. Firstly, we improve semantic segmentation models, using backbone pretrained with Depth Anything to improve UperNet model and SETRMLA model, and adding language guidance based on both weather and category information to InternImage model. Secondly, we introduce a new dataset WeatherProofExtra with wider viewing angle and employ data augmentation methods, including adverse weather and super-resolution. Finally, effective training strategies and ensemble method are applied to improve final performance further. Our solution is ranked 1st on the final leaderboard. Code will be available at https://github.com/KaneiGi/WeatherProofChallenge. △ Less

Submitted 6 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.03777 [pdf, other]

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

Authors: Ruiyang Qin, Dancheng Liu, Zheyu Yan, Zhaoxuan Tan, Zixuan Pan, Zhenge Jia, Meng Jiang, Ahmed Abbasi, **jun Xiong, Yiyu Shi

Abstract: The scaling laws have become the de facto guidelines for designing large language models (LLMs), but they were studied under the assumption of unlimited computing resources for both training and inference. As LLMs are increasingly used as personalized intelligent assistants, their customization (i.e., learning through fine-tuning) and deployment onto resource-constrained edge devices will become m… ▽ More The scaling laws have become the de facto guidelines for designing large language models (LLMs), but they were studied under the assumption of unlimited computing resources for both training and inference. As LLMs are increasingly used as personalized intelligent assistants, their customization (i.e., learning through fine-tuning) and deployment onto resource-constrained edge devices will become more and more prevalent. An urging but open question is how a resource-constrained computing environment would affect the design choices for a personalized LLM. We study this problem empirically in this work. In particular, we consider the tradeoffs among a number of key design factors and their intertwined impacts on learning efficiency and accuracy. The factors include the learning methods for LLM customization, the amount of personalized data used for learning customization, the types and sizes of LLMs, the compression methods of LLMs, the amount of time afforded to learn, and the difficulty levels of the target use cases. Through extensive experimentation and benchmarking, we draw a number of surprisingly insightful guidelines for deploying LLMs onto resource-constrained devices. For example, an optimal choice between parameter learning and RAG may vary depending on the difficulty of the downstream task, the longer fine-tuning time does not necessarily help the model, and a compressed LLM may be a better choice than an uncompressed LLM to learn from limited personalized data. △ Less

Submitted 13 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: Benckmarking paper

arXiv:2406.02178 [pdf, other]

Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations

Authors: Sarthak Yadav, Zheng-Hua Tan

Abstract: Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. T… ▽ More Despite its widespread adoption as the prominent neural architecture, the Transformer has spurred several independent lines of work to address its limitations. One such approach is selective state space models, which have demonstrated promising results for language modelling. However, their feasibility for learning self-supervised, general-purpose audio representations is yet to be investigated. This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations from randomly masked spectrogram patches through self-supervision. Empirical results on ten diverse audio recognition downstream tasks show that the proposed models, pretrained on the AudioSet dataset, consistently outperform comparable self-supervised audio spectrogram transformer (SSAST) baselines by a considerable margin and demonstrate better performance in dataset size, sequence length and model size comparisons. △ Less

Submitted 7 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2406.01196 [pdf, other]

3D WholeBody Pose Estimation based on Semantic Graph Attention Network and Distance Information

Authors: Sihan Wen, Xiantan Zhu, Zhiming Tan

Abstract: In recent years, a plethora of diverse methods have been proposed for 3D pose estimation. Among these, self-attention mechanisms and graph convolutions have both been proven to be effective and practical methods. Recognizing the strengths of those two techniques, we have developed a novel Semantic Graph Attention Network which can benefit from the ability of self-attention to capture global contex… ▽ More In recent years, a plethora of diverse methods have been proposed for 3D pose estimation. Among these, self-attention mechanisms and graph convolutions have both been proven to be effective and practical methods. Recognizing the strengths of those two techniques, we have developed a novel Semantic Graph Attention Network which can benefit from the ability of self-attention to capture global context, while also utilizing the graph convolutions to handle the local connectivity and structural constraints of the skeleton. We also design a Body Part Decoder that assists in extracting and refining the information related to specific segments of the body. Furthermore, our approach incorporates Distance Information, enhancing our model's capability to comprehend and accurately predict spatial relationships. Finally, we introduce a Geometry Loss who makes a critical constraint on the structural skeleton of the body, ensuring that the model's predictions adhere to the natural limits of human posture. The experimental results validate the effectiveness of our approach, demonstrating that every element within the system is essential for improving pose estimation outcomes. With comparison to state-of-the-art, the proposed work not only meets but exceeds the existing benchmarks. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00777 [pdf, other]

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

Authors: Yuxiang Ji, Boyong He, Chenyuan Qu, Zhuoyue Tan, Chuan Qin, Liaoni Wu

Abstract: Pre-trained diffusion models have demonstrated remarkable proficiency in synthesizing images across a wide range of scenarios with customizable prompts, indicating their effective capacity to capture universal features. Motivated by this, our study delves into the utilization of the implicit knowledge embedded within diffusion models to address challenges in cross-domain semantic segmentation. Thi… ▽ More Pre-trained diffusion models have demonstrated remarkable proficiency in synthesizing images across a wide range of scenarios with customizable prompts, indicating their effective capacity to capture universal features. Motivated by this, our study delves into the utilization of the implicit knowledge embedded within diffusion models to address challenges in cross-domain semantic segmentation. This paper investigates the approach that leverages the sampling and fusion techniques to harness the features of diffusion models efficiently. Contrary to the simplistic migration applications characterized by prior research, our finding reveals that the multi-step diffusion process inherent in the diffusion model manifests more robust semantic features. We propose DIffusion Feature Fusion (DIFF) as a backbone use for extracting and integrating effective semantic representations through the diffusion process. By leveraging the strength of text-to-image generation capability, we introduce a new training framework designed to implicitly learn posterior knowledge from it. Through rigorous evaluation in the contexts of domain generalization semantic segmentation, we establish that our methodology surpasses preceding approaches in mitigating discrepancies across distinct domains and attains the state-of-the-art (SOTA) benchmark. Within the synthetic-to-real (syn-to-real) context, our method significantly outperforms ResNet-based and transformer-based backbone methods, achieving an average improvement of $3.84\%$ mIoU across various datasets. The implementation code will be released soon. △ Less

Submitted 2 June, 2024; originally announced June 2024.

arXiv:2406.00323 [pdf, other]

BeFA: A General Behavior-driven Feature Adapter for Multimedia Recommendation

Authors: Qile Fan, Penghang Yu, Zhiyi Tan, Bing-Kun Bao, Guanming Lu

Abstract: Multimedia recommender systems focus on utilizing behavioral information and content information to model user preferences. Typically, it employs pre-trained feature encoders to extract content features, then fuses them with behavioral features. However, pre-trained feature encoders often extract features from the entire content simultaneously, including excessive preference-irrelevant details. We… ▽ More Multimedia recommender systems focus on utilizing behavioral information and content information to model user preferences. Typically, it employs pre-trained feature encoders to extract content features, then fuses them with behavioral features. However, pre-trained feature encoders often extract features from the entire content simultaneously, including excessive preference-irrelevant details. We speculate that it may result in the extracted features not containing sufficient features to accurately reflect user preferences. To verify our hypothesis, we introduce an attribution analysis method for visually and intuitively analyzing the content features. The results indicate that certain products' content features exhibit the issues of information drift}and information omission,reducing the expressive ability of features. Building upon this finding, we propose an effective and efficient general Behavior-driven Feature Adapter (BeFA) to tackle these issues. This adapter reconstructs the content feature with the guidance of behavioral information, enabling content features accurately reflecting user preferences. Extensive experiments demonstrate the effectiveness of the adapter across all multimedia recommendation methods. The code will be publicly available upon the paper's acceptance. △ Less

Submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.20754 [pdf, ps, other]

Non-uniqueness of weak solutions to 2D generalized Navier-Stokes equations

Authors: Xinliang Li, Zhong Tan

Abstract: We study the non-uniqueness of weak solutions for the two-dimensional hyper-dissipative Navier-Stokes equations in the super-critical spaces $L_{t}^γL_{x}^{p}$ when $α\in[1,\frac{3}{2})$, and obtain the conclusion that the non-uniqueness of the weak solutions at the endpoint $(γ,p)=(\infty, \frac{2}{2α-1})$ is sharp in view of the generalized Ladyženskaja-Prodi-Serrin condition by using a differen… ▽ More We study the non-uniqueness of weak solutions for the two-dimensional hyper-dissipative Navier-Stokes equations in the super-critical spaces $L_{t}^γL_{x}^{p}$ when $α\in[1,\frac{3}{2})$, and obtain the conclusion that the non-uniqueness of the weak solutions at the endpoint $(γ,p)=(\infty, \frac{2}{2α-1})$ is sharp in view of the generalized Ladyženskaja-Prodi-Serrin condition by using a different spatial-temporal building block from [Cheskidov-Luo, Ann. PDE, 9:13 (2023)] and taking advantage of the intermittency of the temporal concentrated function $g_{(k)}$ in an almost optimal way. Our results recover the above 2D non-uniqueness conclusion and extend to the hyper-dissipative case $α\in(1,\frac{3}{2})$. △ Less

Submitted 31 May, 2024; originally announced May 2024.

Comments: 30 pages

MSC Class: 35A02; 35Q30; 76D05

arXiv:2405.20711 [pdf, other]

Revisiting Mutual Information Maximization for Generalized Category Discovery

Authors: Zhaorui Tan, Chengrui Zhang, Xi Yang, Jie Sun, Kaizhu Huang

Abstract: Generalized category discovery presents a challenge in a realistic scenario, which requires the model's generalization ability to recognize unlabeled samples from known and unknown categories. This paper revisits the challenge of generalized category discovery through the lens of information maximization (InfoMax) with a probabilistic parametric classifier. Our findings reveal that ensuring indepe… ▽ More Generalized category discovery presents a challenge in a realistic scenario, which requires the model's generalization ability to recognize unlabeled samples from known and unknown categories. This paper revisits the challenge of generalized category discovery through the lens of information maximization (InfoMax) with a probabilistic parametric classifier. Our findings reveal that ensuring independence between known and unknown classes while concurrently assuming a uniform probability distribution across all classes, yields an enlarged margin among known and unknown classes that promotes the model's performance. To achieve the aforementioned independence, we propose a novel InfoMax-based method, Regularized Parametric InfoMax (RPIM), which adopts pseudo labels to supervise unlabeled samples during InfoMax, while proposing a regularization to ensure the quality of the pseudo labels. Additionally, we introduce novel semantic-bias transformation to refine the features from the pre-trained model instead of direct fine-tuning to rescue the computational costs. Extensive experiments on six benchmark datasets validate the effectiveness of our method. RPIM significantly improves the performance regarding unknown classes, surpassing the state-of-the-art method by an average margin of 3.5%. △ Less

Submitted 31 May, 2024; originally announced May 2024.

Comments: Preprint version

arXiv:2405.18756 [pdf, other]

Provable Contrastive Continual Learning

Authors: Yichen Wen, Zhiquan Tan, Kaipeng Zheng, Chuanlong Xie, Weiran Huang

Abstract: Continual learning requires learning incremental tasks with dynamic data distributions. So far, it has been observed that employing a combination of contrastive loss and distillation loss for training in continual learning yields strong performance. To the best of our knowledge, however, this contrastive continual learning framework lacks convincing theoretical explanations. In this work, we fill… ▽ More Continual learning requires learning incremental tasks with dynamic data distributions. So far, it has been observed that employing a combination of contrastive loss and distillation loss for training in continual learning yields strong performance. To the best of our knowledge, however, this contrastive continual learning framework lacks convincing theoretical explanations. In this work, we fill this gap by establishing theoretical performance guarantees, which reveal how the performance of the model is bounded by training losses of previous tasks in the contrastive continual learning framework. Our theoretical explanations further support the idea that pre-training can benefit continual learning. Inspired by our theoretical analysis of these guarantees, we propose a novel contrastive continual learning algorithm called CILA, which uses adaptive distillation coefficients for different tasks. These distillation coefficients are easily computed by the ratio between average distillation losses and average contrastive losses from previous tasks. Our method shows great improvement on standard benchmarks and achieves new state-of-the-art performance. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: Accepted by ICML 2024

arXiv:2405.14278 [pdf, other]

SCMix: Stochastic Compound Mixing for Open Compound Domain Adaptation in Semantic Segmentation

Authors: Kai Yao, Zhaorui Tan, Zixian Su, Xi Yang, Jie Sun, Kaizhu Huang

Abstract: Open compound domain adaptation (OCDA) aims to transfer knowledge from a labeled source domain to a mix of unlabeled homogeneous compound target domains while generalizing to open unseen domains. Existing OCDA methods solve the intra-domain gaps by a divide-and-conquer strategy, which divides the problem into several individual and parallel domain adaptation (DA) tasks. Such approaches often conta… ▽ More Open compound domain adaptation (OCDA) aims to transfer knowledge from a labeled source domain to a mix of unlabeled homogeneous compound target domains while generalizing to open unseen domains. Existing OCDA methods solve the intra-domain gaps by a divide-and-conquer strategy, which divides the problem into several individual and parallel domain adaptation (DA) tasks. Such approaches often contain multiple sub-networks or stages, which may constrain the model's performance. In this work, starting from the general DA theory, we establish the generalization bound for the setting of OCDA. Built upon this, we argue that conventional OCDA approaches may substantially underestimate the inherent variance inside the compound target domains for model generalization. We subsequently present Stochastic Compound Mixing (SCMix), an augmentation strategy with the primary objective of mitigating the divergence between source and mixed target distributions. We provide theoretical analysis to substantiate the superiority of SCMix and prove that the previous methods are sub-groups of our methods. Extensive experiments show that our method attains a lower empirical risk on OCDA semantic segmentation tasks, thus supporting our theories. Combining the transformer architecture, SCMix achieves a notable performance boost compared to the SoTA results. △ Less

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.14092 [pdf, other]

Large Language Models Can Self-Correct with Minimal Effort

Authors: Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang

Abstract: Intrinsic self-correct was a method that instructed large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, the study concluded that the LLMs could not self-correct reasoning yet. We find that a simple yet effective verification method can unleash inherent capabilities of the LLMs. That is to mask a key condition in the question, add the current… ▽ More Intrinsic self-correct was a method that instructed large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, the study concluded that the LLMs could not self-correct reasoning yet. We find that a simple yet effective verification method can unleash inherent capabilities of the LLMs. That is to mask a key condition in the question, add the current response to construct a verification question, and predict the condition to verify the response. The condition can be an entity in an open-domain question or a numeric value in a math question, which requires minimal effort (via prompting) to identify. We propose an iterative verify-then-correct framework to progressively identify and correct (probably) false responses, named ProCo. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo as the backend LLM, yields $+6.8$ exact match on four open-domain question answering datasets, $+14.1$ accuracy on three arithmetic reasoning datasets, and $+9.6$ accuracy on a commonsense reasoning dataset, compared to Self-Correct. △ Less

Submitted 23 June, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

Comments: Work in Progress

arXiv:2405.12914 [pdf, other]

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Authors: Zhiyu Tan, Meng** Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li

Abstract: One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Langu… ▽ More One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation. Unfortunately, training text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual but also longer input context with superior image generation quality. △ Less

Submitted 21 May, 2024; originally announced May 2024.

Comments: Technical report. Project page: https://github.com/llm-conditioned-diffusion/llm-conditioned-diffusion

arXiv:2405.11405 [pdf, ps, other]

On the Rate-Distortion Function for Sampled Cyclostationary Gaussian Processes with Memory: Extended Version with Proofs

Authors: Zikun Tan, Ron Dabora, H. Vincent Poor

Abstract: In this work we study the rate-distortion function (RDF) for lossy compression of asynchronously-sampled continuous-time (CT) wide-sense cyclostationary (WSCS) Gaussian processes with memory. As the case of synchronous sampling, i.e., when the sampling interval is commensurate with the period of the cyclostationary statistics, has already been studied, we focus on discrete-time (DT) processes obta… ▽ More In this work we study the rate-distortion function (RDF) for lossy compression of asynchronously-sampled continuous-time (CT) wide-sense cyclostationary (WSCS) Gaussian processes with memory. As the case of synchronous sampling, i.e., when the sampling interval is commensurate with the period of the cyclostationary statistics, has already been studied, we focus on discrete-time (DT) processes obtained by asynchronous sampling, i.e., when the sampling interval is incommensurate with the period of the cyclostationary statistics of the CT WSCS source process. It is further assumed that the sampling interval is smaller than the maximal autocorrelation length of the CT source process, which implies that the DT process possesses memory. Thus, the sampled process is a DT wide-sense almost cyclostationary (WSACS) processes with memory. This problem is motivated by the fact that man-made communications signals are modelled as CT WSCS processes; hence, applications of such sampling include, e.g., compress-and-forward relaying and recording systems. The main challenge follows because, with asynchronous sampling, the DT sampled process is not information-stable, and hence the characterization of its RDF should be carried out within the information-spectrum framework instead of using conventional information-theoretic arguments. This work expands upon our previous work which addressed the special case in which the DT process is independent across time. The existence of dependence between the samples requires new tools to obtain the characterization of the RDF. △ Less

Submitted 23 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

Comments: 11 pages, 0 figures, accepted by the 2024 IEEE International Symposium on Information Theory (ISIT 2024)

arXiv:2405.07027 [pdf, other]

TD-NeRF: Novel Truncated Depth Prior for Joint Camera Pose and Neural Radiance Field Optimization

Authors: Zhen Tan, Zongtan Zhou, Yangbing Ge, Zi Wang, Xieyuanli Chen, Dewen Hu

Abstract: The reliance on accurate camera poses is a significant barrier to the widespread deployment of Neural Radiance Fields (NeRF) models for 3D reconstruction and SLAM tasks. The existing method introduces monocular depth priors to jointly optimize the camera poses and NeRF, which fails to fully exploit the depth priors and neglects the impact of their inherent noise. In this paper, we propose Truncate… ▽ More The reliance on accurate camera poses is a significant barrier to the widespread deployment of Neural Radiance Fields (NeRF) models for 3D reconstruction and SLAM tasks. The existing method introduces monocular depth priors to jointly optimize the camera poses and NeRF, which fails to fully exploit the depth priors and neglects the impact of their inherent noise. In this paper, we propose Truncated Depth NeRF (TD-NeRF), a novel approach that enables training NeRF from unknown camera poses - by jointly optimizing learnable parameters of the radiance field and camera poses. Our approach explicitly utilizes monocular depth priors through three key advancements: 1) we propose a novel depth-based ray sampling strategy based on the truncated normal distribution, which improves the convergence speed and accuracy of pose estimation; 2) to circumvent local minima and refine depth geometry, we introduce a coarse-to-fine training strategy that progressively improves the depth precision; 3) we propose a more robust inter-frame point constraint that enhances robustness against depth noise during training. The experimental results on three datasets demonstrate that TD-NeRF achieves superior performance in the joint optimization of camera pose and NeRF, surpassing prior works, and generates more accurate depth geometry. The implementation of our method has been released at https://github.com/nubot-nudt/TD-NeRF. △ Less

Submitted 11 May, 2024; originally announced May 2024.

arXiv:2405.06332 [pdf, ps, other]

Solving maximally comonotone inclusion problems via an implicit Newton-like inertial dynamical system and its discretization

Authors: Z. Z. Tan, R. Hu, Y. P. Fang

Abstract: This paper deals with an implicit Newton-like inertial dynamical system governed by a maximally comonotone inclusion problem in a Hilbert space. Under suitable conditions, we establish not only pointwise estimates and integral estimates for the velocity and the value of the associated Yosida regularization operator along the trajectory of the system, but also the weak convergence of the trajectory… ▽ More This paper deals with an implicit Newton-like inertial dynamical system governed by a maximally comonotone inclusion problem in a Hilbert space. Under suitable conditions, we establish not only pointwise estimates and integral estimates for the velocity and the value of the associated Yosida regularization operator along the trajectory of the system, but also the weak convergence of the trajectory to a zero of the maximally comonotone operator. Moreover, a new inertial algorithm is developed via a time discretization of the proposed system. Our analysis reveals that the resulting discrete algorithm exhibits fast convergence properties matching the ones of the continuous time counterpart. Finally, the theoretical results are illustrated by numerical experiments. △ Less

Submitted 10 May, 2024; originally announced May 2024.

arXiv:2405.06041 [pdf]

Gate Tunable Asymmetric Ozone Adsorption on Graphene

Authors: Zhen Qi, Wanlei Li, Jun Cheng, Zhongxin Guo, Chenglong Li, Shang Wang, Zuoquan Tan, Zhiting Gao, Yongchao Wang, Zichen Lian, Shanshan Chen, Yonglin He, Zhiyong Wang, Yapei Wang, **song Zhang, Yayu Wang, Peng Cai

Abstract: Molecular adsorption is pivotal in device fabrication and material synthesis for quantum technology. However, elucidating the behavior of physisorption poses technical challenges. Here graphene with ultrahigh sensitivity was utilized to detect ozone adsorption at cryogenic temperatures. Significant hole do** observed in graphene indicates a strong interaction between ozone and graphene. Interest… ▽ More Molecular adsorption is pivotal in device fabrication and material synthesis for quantum technology. However, elucidating the behavior of physisorption poses technical challenges. Here graphene with ultrahigh sensitivity was utilized to detect ozone adsorption at cryogenic temperatures. Significant hole do** observed in graphene indicates a strong interaction between ozone and graphene. Interestingly, the adsorption exhibits asymmetry with positive and negative gate voltages. The strong affinity of ozone provides a tool to modulate materials and devices, while the gate tunability of adsorption offers new insights into construction and manipulation of oxide quantum materials. △ Less

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.05413 [pdf]

Digital Evolution: Novo Nordisk's Shift to Ontology-Based Data Management

Authors: Shawn Zheng Kai Tan, Shounak Baksi, Thomas Gade Bjerregaard, Preethi Elangovan, Thrishna Kuttikattu Gopalakrishnan, Darko Hric, Joffrey Joumaa, Beidi Li, Kashif Rabbani, Santhosh Kannan Venkatesan, Joshua Daniel Valdez, Saritha Vettikunnel Kuriakose

Abstract: Biomedical data is growing exponentially, and managing it is increasingly challenging. While Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises like pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transf… ▽ More Biomedical data is growing exponentially, and managing it is increasingly challenging. While Findable, Accessible, Interoperable and Reusable (FAIR) data principles provide guidance, their adoption has proven difficult, especially in larger enterprises like pharmaceutical companies. In this manuscript, we describe how we leverage an Ontology-Based Data Management (OBDM) strategy for digital transformation in Novo Nordisk Research & Early Development. Here, we include both our technical blueprint and our approach for organizational change management. We further discuss how such an OBDM ecosystem plays a pivotal role in the organizations digital aspirations for data federation and discovery fuelled by artificial intelligence. Our aim for this paper is to share the lessons learned in order to foster dialogue with parties navigating similar waters while collectively advancing the efforts in the fields of data management, semantics and data driven drug discovery. △ Less

Submitted 10 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: 14 pages, 2 figures

arXiv:2405.04819 [pdf, other]

DALK: Dynamic Co-Augmentation of LLMs and KG to answer Alzheimer's Disease Questions with Scientific Literature

Authors: Dawei Li, Shu Yang, Zhen Tan, Jae Young Baik, Sukwon Yun, Joseph Lee, Aaron Chacko, Bojian Hou, Duy Duong-Tran, Ying Ding, Huan Liu, Li Shen, Tianlong Chen

Abstract: Recent advancements in large language models (LLMs) have achieved promising performances across various applications. Nonetheless, the ongoing challenge of integrating long-tail knowledge continues to impede the seamless adoption of LLMs in specialized domains. In this work, we introduce DALK, a.k.a. Dynamic Co-Augmentation of LLMs and KG, to address this limitation and demonstrate its ability on… ▽ More Recent advancements in large language models (LLMs) have achieved promising performances across various applications. Nonetheless, the ongoing challenge of integrating long-tail knowledge continues to impede the seamless adoption of LLMs in specialized domains. In this work, we introduce DALK, a.k.a. Dynamic Co-Augmentation of LLMs and KG, to address this limitation and demonstrate its ability on studying Alzheimer's Disease (AD), a specialized sub-field in biomedicine and a global health priority. With a synergized framework of LLM and KG mutually enhancing each other, we first leverage LLM to construct an evolving AD-specific knowledge graph (KG) sourced from AD-related scientific literature, and then we utilize a coarse-to-fine sampling method with a novel self-aware knowledge retrieval approach to select appropriate knowledge from the KG to augment LLM inference capabilities. The experimental results, conducted on our constructed AD question answering (ADQA) benchmark, underscore the efficacy of DALK. Additionally, we perform a series of detailed analyses that can offer valuable insights and guidelines for the emerging topic of mutually enhancing KG and LLM. We will release the code and data at https://github.com/David-Li0406/DALK. △ Less

Submitted 12 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: Under Review; Incorrect author name revised

arXiv:2405.03723 [pdf, other]

Generative adversarial learning with optimal input dimension and its adaptive generator architecture

Authors: Zhiyao Tan, Ling Zhou, Huazhen Lin

Abstract: We investigate the impact of the input dimension on the generalization error in generative adversarial networks (GANs). In particular, we first provide both theoretical and practical evidence to validate the existence of an optimal input dimension (OID) that minimizes the generalization error. Then, to identify the OID, we introduce a novel framework called generalized GANs (G-GANs), which include… ▽ More We investigate the impact of the input dimension on the generalization error in generative adversarial networks (GANs). In particular, we first provide both theoretical and practical evidence to validate the existence of an optimal input dimension (OID) that minimizes the generalization error. Then, to identify the OID, we introduce a novel framework called generalized GANs (G-GANs), which includes existing GANs as a special case. By incorporating the group penalty and the architecture penalty developed in the paper, G-GANs have several intriguing features. First, our framework offers adaptive dimensionality reduction from the initial dimension to a dimension necessary for generating the target distribution. Second, this reduction in dimensionality also shrinks the required size of the generator network architecture, which is automatically identified by the proposed architecture penalty. Both reductions in dimensionality and the generator network significantly improve the stability and the accuracy of the estimation and prediction. Theoretical support for the consistent selection of the input dimension and the generator network is provided. Third, the proposed algorithm involves an end-to-end training process, and the algorithm allows for dynamic adjustments between the input dimension and the generator network during training, further enhancing the overall performance of G-GANs. Extensive experiments conducted with simulated and benchmark data demonstrate the superior performance of G-GANs. In particular, compared to that of off-the-shelf methods, G-GANs achieves an average improvement of 45.68% in the CT slice dataset, 43.22% in the MNIST dataset and 46.94% in the FashionMNIST dataset in terms of the maximum mean discrepancy or Frechet inception distance. Moreover, the features generated based on the input dimensions identified by G-GANs align with visually significant features. △ Less

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.00974 [pdf, other]

On Ridge Estimation in High-dimensional Rotationally Sparse Linear Regression

Authors: Libin Liang, Zhiqiang Tan

Abstract: Recently, deep neural networks have been found to nearly interpolate training data but still generalize well in various applications. To help understand such a phenomenon, it has been of interest to analyze the ridge estimator and its interpolation limit in high-dimensional regression models. For this motivation, we study the ridge estimator in a rotationally sparse setting of high-dimensional lin… ▽ More Recently, deep neural networks have been found to nearly interpolate training data but still generalize well in various applications. To help understand such a phenomenon, it has been of interest to analyze the ridge estimator and its interpolation limit in high-dimensional regression models. For this motivation, we study the ridge estimator in a rotationally sparse setting of high-dimensional linear regression, where the signal of a response is aligned with a small number, $d$, of covariates with large or spiked variances, compared with the remaining covariates with small or tail variances, \textit{after} an orthogonal transformation of the covariate vector. We establish high-probability upper and lower bounds on the out-sample and in-sample prediction errors in two distinct regimes depending on the ratio of the effective rank of tail variances over the sample size $n$. The separation of the two regimes enables us to exploit relevant concentration inequalities and derive concrete error bounds without making any oracle assumption or independent components assumption on covariate vectors. Moreover, we derive sufficient and necessary conditions which indicate that the prediction errors of ridge estimation can be of the order $O(\frac{d}{n})$ if and only if the gap between the spiked and tail variances are sufficiently large. We also compare the orders of optimal out-sample and in-sample prediction errors and find that, remarkably, the optimal out-sample prediction error may be significantly smaller than the optimal in-sample one. Finally, we present numerical experiments which empirically confirm our theoretical findings. △ Less

Submitted 1 May, 2024; originally announced May 2024.

arXiv:2404.17125 [pdf, other]

doi 10.1109/ICPSAsia48933.2020.9208421

Misaka: Interactive Swarm Testbed for Smart Grid Distributed Algorithm Test and Evaluation

Authors: Tingliang Zhang, Haiwang Zhong, Zhenfei Tan, Xinfei Yan

Abstract: In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation, also an extendable open-source open-hardware platform for develo** tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed 3 omni-directional wheels robots each 10 cm in diameter, high accuracy localization through a microdot pattern overlaid on top of the activi… ▽ More In this paper, we present Misaka, a visualized swarm testbed for smart grid algorithm evaluation, also an extendable open-source open-hardware platform for develo** tabletop tangible swarm interfaces. The platform consists of a collection of custom-designed 3 omni-directional wheels robots each 10 cm in diameter, high accuracy localization through a microdot pattern overlaid on top of the activity sheets, and a software framework for application development and control, while remaining affordable (per unit cost about 30 USD at the prototype stage). We illustrate the potential of tabletop swarm user interfaces through a set of smart grid algorithm application scenarios developed with Misaka. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Journal ref: 2020 IEEE/IAS Industrial and Commercial Power System Asia (I&CPS Asia)

arXiv:2404.16831 [pdf, other]

The Third Monocular Depth Estimation Challenge

Authors: Jaime Spencer, Fabio Tosi, Matteo Poggi, Ripudaman Singh Arora, Chris Russell, Simon Hadfield, Richard Bowden, GuangYuan Zhou, ZhengXin Li, Qiang Rao, Yi** Bao, Xiao Liu, Dohyeong Kim, **seong Kim, Myunghyun Kim, Mykola Lavreniuk, Rui Li, Qing Mao, Jiang Wu, Yu Zhu, **qiu Sun, Yanning Zhang, Suraj Patni, Aradhye Agarwal, Chetan Arora , et al. (16 additional authors not shown)

Abstract: This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 su… ▽ More This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set: 10 among them submitted a report describing their approach, highlighting a diffused use of foundational models such as Depth Anything at the core of their method. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%. △ Less

Submitted 27 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

Comments: To appear in CVPRW2024

arXiv:2404.16339 [pdf, other]

Training-Free Unsupervised Prompt for Vision-Language Models

Authors: Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, Shengsheng Wang, **gdong Wang

Abstract: Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo labels easily misguide the tuning process and r… ▽ More Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached sample as the weight of the corresponding cached label to generate similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing the training-base method on multiple classification datasets. Based on our TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset. △ Less

Submitted 25 April, 2024; originally announced April 2024.

arXiv:2404.15878 [pdf, other]

Simulating unsteady fluid flows on a superconducting quantum processor

Authors: Zhaoyuan Meng, Jiarun Zhong, Shibo Xu, Ke Wang, Jiachen Chen, Feitong **, Xuhao Zhu, Yu Gao, Yaozu Wu, Chuanyu Zhang, Ning Wang, Yiren Zou, Aosai Zhang, Zhengyi Cui, Fanhao Shen, Zehang Bao, Zitian Zhu, Ziqi Tan, Tingting Li, Pengfei Zhang, Shiying Xiong, Hekang Li, Qiujiang Guo, Zhen Wang, Chao Song , et al. (2 additional authors not shown)

Abstract: Recent advancements of intermediate-scale quantum processors have triggered tremendous interest in the exploration of practical quantum advantage. The simulation of fluid dynamics, a highly challenging problem in classical physics but vital for practical applications, emerges as a good candidate for showing quantum utility. Here, we report an experiment on the digital simulation of unsteady flows,… ▽ More Recent advancements of intermediate-scale quantum processors have triggered tremendous interest in the exploration of practical quantum advantage. The simulation of fluid dynamics, a highly challenging problem in classical physics but vital for practical applications, emerges as a good candidate for showing quantum utility. Here, we report an experiment on the digital simulation of unsteady flows, which consists of quantum encoding, evolution, and detection of flow states, with a superconducting quantum processor. The quantum algorithm is based on the Hamiltonian simulation using the hydrodynamic formulation of the Schrödinger equation. With the median fidelities of 99.97% and 99.67% for parallel single- and two-qubit gates respectively, we simulate the dynamics of a two-dimensional (2D) compressible diverging flow and a 2D decaying vortex with ten qubits. The experimental results well capture the temporal evolution of averaged density and momentum profiles, and qualitatively reproduce spatial flow fields with moderate noises. This work demonstrates the potential of quantum computing in simulating more complex flows, such as turbulence, for practical applications. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.15381 [pdf, other]

Advances and Open Challenges in Federated Learning with Foundation Models

Authors: Chao Ren, Han Yu, Hongyi Peng, Xiaoli Tang, Anran Li, Yulan Gao, Alysa Ziying Tan, Bo Zhao, Xiaoxiao Li, Zengxiang Li, Qiang Yang

Abstract: The integration of Foundation Models (FMs) with Federated Learning (FL) presents a transformative paradigm in Artificial Intelligence (AI), offering enhanced capabilities while addressing concerns of privacy, data decentralization, and computational efficiency. This paper provides a comprehensive survey of the emerging field of Federated Foundation Models (FedFM), elucidating their synergistic rel… ▽ More The integration of Foundation Models (FMs) with Federated Learning (FL) presents a transformative paradigm in Artificial Intelligence (AI), offering enhanced capabilities while addressing concerns of privacy, data decentralization, and computational efficiency. This paper provides a comprehensive survey of the emerging field of Federated Foundation Models (FedFM), elucidating their synergistic relationship and exploring novel methodologies, challenges, and future directions that the FL research field needs to focus on in order to thrive in the age of foundation models. A systematic multi-tiered taxonomy is proposed, categorizing existing FedFM approaches for model training, aggregation, trustworthiness, and incentivization. Key challenges, including how to enable FL to deal with high complexity of computational demands, privacy considerations, contribution evaluation, and communication efficiency, are thoroughly discussed. Moreover, the paper explores the intricate challenges of communication, scalability and security inherent in training/fine-tuning FMs via FL, highlighting the potential of quantum computing to revolutionize the training, inference, optimization and data encryption processes. This survey underscores the importance of further research to propel innovation in FedFM, emphasizing the need for develo** trustworthy solutions. It serves as a foundational guide for researchers and practitioners interested in contributing to this interdisciplinary and rapidly advancing field. △ Less

Submitted 29 April, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

Comments: Survey of Federated Foundation Models (FedFM)

arXiv:2404.14453 [pdf, other]

EPI-SQL: Enhancing Text-to-SQL Translation with Error-Prevention Instructions

Authors: Xi** Liu, Zhao Tan

Abstract: The conversion of natural language queries into SQL queries, known as Text-to-SQL, is a critical yet challenging task. This paper introduces EPI-SQL, a novel methodological framework leveraging Large Language Models (LLMs) to enhance the performance of Text-to-SQL tasks. EPI-SQL operates through a four-step process. Initially, the method involves gathering instances from the Spider dataset on whic… ▽ More The conversion of natural language queries into SQL queries, known as Text-to-SQL, is a critical yet challenging task. This paper introduces EPI-SQL, a novel methodological framework leveraging Large Language Models (LLMs) to enhance the performance of Text-to-SQL tasks. EPI-SQL operates through a four-step process. Initially, the method involves gathering instances from the Spider dataset on which LLMs are prone to failure. These instances are then utilized to generate general error-prevention instructions (EPIs). Subsequently, LLMs craft contextualized EPIs tailored to the specific context of the current task. Finally, these context-specific EPIs are incorporated into the prompt used for SQL generation. EPI-SQL is distinguished in that it provides task-specific guidance, enabling the model to circumvent potential errors for the task at hand. Notably, the methodology rivals the performance of advanced few-shot methods despite being a zero-shot approach. An empirical assessment using the Spider benchmark reveals that EPI-SQL achieves an execution accuracy of 85.1\%, underscoring its effectiveness in generating accurate SQL queries through LLMs. The findings indicate a promising direction for future research, i.e. enhancing instructions with task-specific and contextualized rules, for boosting LLMs' performance in NLP tasks. △ Less

Submitted 20 April, 2024; originally announced April 2024.

arXiv:2404.14135 [pdf, other]

Text in the Dark: Extremely Low-Light Text Image Enhancement

Authors: Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach

Abstract: Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text t… ▽ More Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely-used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: The first two authors contributed equally to this work

Showing 1–50 of 702 results for author: Tan, Z