Search | arXiv e-print repository

Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection

Authors: Xingyu Peng, Yan Bai, Chen Gao, Lirong Yang, Fei Xia, Beipeng Mu, Xiaofei Wang, Si Liu

Abstract: Open-Vocabulary Detection (OVD) is the task of detecting all interesting objects in a given scene without predefined object classes. Extensive work has been done to address OVD for 2D RGB images, but the exploration of 3D OVD is still limited. Intuitively, lidar point clouds provide 3D information, both object level and scene level, to generate trustworthy detection results. However, previous lidar-based OVD methods only focus on the usage of object-level features, ignoring the essence of scene-level information. In this paper, we propose a Global-Local Collaborative Scheme (GLIS) for the lidar-based OVD task, which contains a local branch to generate object-level detection results and a global branch to obtain scene-level global features. With the global-local information, a Large Language Model (LLM) is applied for chain-of-thought inference, and the detection results can be refined accordingly. We further propose Reflected Pseudo Labels Generation (RPLG) to generate high-quality pseudo labels for supervision and Background-Aware Object Localization (BAOL) to select precise object proposals. Extensive experiments on ScanNetV2 and SUN RGB-D demonstrate the superiority of our methods. Code is released at https://github.com/GradiusTwinbee/GLIS.

Submitted 11 July, 2024; originally announced July 2024.

Comments: accepted by ECCV 2024

arXiv:2407.06042 [pdf, ps, other]

Near-Optimal MIMO Detection Using Gradient-Based MCMC in Discrete Spaces

Authors: Xingyu Zhou, Le Liang, Jing Zhang, Chao-Kai Wen, Shi Jin

Abstract: The discrete nature of transmitted symbols poses challenges for achieving optimal detection in multiple-input multiple-output (MIMO) systems associated with a large number of antennas. Recently, the combination of two powerful machine learning methods, Markov chain Monte Carlo (MCMC) sampling and gradient descent, has emerged as a highly efficient solution to address this issue. However, existing gradient-based MCMC detectors are heuristically designed and thus are theoretically untenable. To bridge this gap, we introduce a novel sampling algorithm tailored for discrete spaces. This algorithm leverages gradients from the underlying continuous spaces for acceleration while maintaining the validity of probabilistic sampling. We prove the convergence of this method and also analyze its convergence rate using both MCMC theory and empirical diagnostics. On this basis, we develop a MIMO detector that precisely samples from the target discrete distribution and generates posterior Bayesian estimates using these samples, whose performance is thereby theoretically guaranteed. Furthermore, our proposed detector is highly parallelizable and scalable to large MIMO dimensions, positioning it as a compelling candidate for next-generation wireless networks. Simulation results show that our detector achieves near-optimal performance, significantly outperforms state-of-the-art baselines, and showcases resilience to various system setups.

Submitted 8 July, 2024; originally announced July 2024.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2407.05286 [pdf, other]

Stability and Generalization for Stochastic Recursive Momentum-based Algorithms for (Strongly-)Convex One to $K$-Level Stochastic Optimizations

Authors: Xiaokang Pan, Xingyu Li, Jin Liu, Tao Sun, Kai Sun, Lixing Chen, Zhe Qu

Abstract: STOchastic Recursive Momentum (STORM)-based algorithms have been widely developed to solve one to $K$-level ($K \geq 3$) stochastic optimization problems. Specifically, they use estimators to mitigate the biased gradient issue and achieve near-optimal convergence results. However, there is relatively little work on understanding their generalization performance, particularly evident during the transition from one to $K$-level optimization contexts. This paper provides a comprehensive generalization analysis of three representative STORM-based algorithms: STORM, COVER, and SVMR, for one, two, and $K$-level stochastic optimizations under both convex and strongly convex settings based on algorithmic stability. Firstly, we define stability for $K$-level optimizations and link it to generalization. Then, we detail the stability results for three prominent STORM-based algorithms. Finally, we derive their excess risk bounds by balancing stability results with optimization errors. Our theoretical results provide strong evidence to complete STORM-based algorithms: (1) Each estimator may decrease their stability due to variance with its estimation target. (2) Every additional level might escalate the generalization error, influenced by the stability and the variance between its cumulative stochastic gradient and the true gradient. (3) Increasing the batch size for the initial computation of estimators presents a favorable trade-off, enhancing the generalization performance.

Submitted 7 July, 2024; originally announced July 2024.
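
For readers unfamiliar with the recursive-momentum estimator that the abstract above refers to, the following is a minimal single-level sketch. It is not taken from the paper: the toy objective, the constant step size `lr`, the constant momentum weight `a`, and the function names are all assumptions chosen for illustration (practical STORM variants use adaptive schedules). The key idea shown is that the same sample is evaluated at the current and the previous iterate, so the difference corrects the old estimate.

```python
import random


def storm_minimize(grad, x0, steps=2000, lr=0.05, a=0.1, seed=0):
    """Run a STORM-style recursive-momentum update on a scalar problem.

    grad(x, xi) is a stochastic gradient at x for noise sample xi.
    lr and a are illustrative constants, not tuned or adaptive schedules.
    """
    rng = random.Random(seed)
    x_prev = x0
    d = grad(x_prev, rng.gauss(0.0, 1.0))  # initial estimator from one sample
    x = x_prev - lr * d
    for _ in range(steps):
        xi = rng.gauss(0.0, 1.0)  # one fresh sample per step
        # Evaluate the same sample at both iterates; the difference
        # corrects the previous estimate d instead of discarding it.
        d = grad(x, xi) + (1.0 - a) * (d - grad(x_prev, xi))
        x_prev, x = x, x - lr * d
    return x


def noisy_quadratic_grad(x, xi):
    """Gradient of the toy loss 0.5 * (x - (3 + xi))**2; the mean minimizer is 3."""
    return x - (3.0 + xi)


if __name__ == "__main__":
    print(storm_minimize(noisy_quadratic_grad, x0=0.0))
```

On this toy problem the iterate should settle near 3, up to noise. The multi-level (COVER, SVMR) variants analyzed in the paper apply similar estimators to inner function values and Jacobians, which this sketch does not attempt to reproduce.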

arXiv:2407.05232 [pdf, other]

PAPM: A Physics-aware Proxy Model for Process Systems

Authors: Pengwei Liu, Zhongkai Hao, Xingyu Ren, Hangjie Yuan, Jiayang Ren, Dong Ni

Abstract: In the context of proxy modeling for process systems, traditional data-driven deep learning approaches frequently encounter significant challenges, such as substantial training costs induced by large amounts of data, and limited generalization capabilities. As a promising alternative, physics-aware models incorporate partial physics knowledge to ameliorate these challenges. Although demonstrating efficacy, they fall short in terms of exploration depth and universality. To address these shortcomings, we introduce a physics-aware proxy model (PAPM) that fully incorporates partial prior physics of process systems, which includes multiple input conditions and the general form of conservation relations, resulting in better out-of-sample generalization. Additionally, PAPM contains a holistic temporal-spatial stepping module for flexible adaptation across various process systems. Through systematic comparisons with state-of-the-art pure data-driven and physics-aware models across five two-dimensional benchmarks in nine generalization tasks, PAPM notably achieves an average performance improvement of 6.7%, while requiring fewer FLOPs, and just 1% of the parameters compared to the prior leading method. The code is available at https://github.com/pengwei07/PAPM.

Submitted 6 July, 2024; originally announced July 2024.

Comments: ICML 2024

arXiv:2407.04480 [pdf, other]

LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

Authors: Xingyu Xie, Zhijie Lin, Kim-Chuan Toh, Pan Zhou

Abstract: To efficiently train large-scale models, low-bit gradient communication compresses full-precision gradients on local GPU nodes into low-precision ones for higher gradient synchronization efficiency among GPU nodes. However, it often degrades training quality due to compression information loss. To address this, we propose the Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, ensuring efficient synchronization without compromising training quality. Specifically, LoCo designs a moving average of historical compensation errors to stably estimate concurrent compression error and then adopts it to compensate for the concurrent gradient compression, yielding a less lossy compression. This mechanism allows it to be compatible with general optimizers like Adam and sharding strategies like FSDP. Theoretical analysis shows that integrating LoCo into full-precision optimizers like Adam and SGD does not impair their convergence speed on nonconvex problems. Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency, e.g., improving Adam's training speed by 14% to 40% without performance degradation on large language models like LLAMAs and MoE.

Submitted 5 July, 2024; originally announced July 2024.
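
The compensation mechanism described in the abstract above can be illustrated with a generic error-feedback sketch. This is not the authors' implementation: the rounding quantizer, the `beta` weight, and the class and function names are hypothetical stand-ins, and real systems operate on GPU tensors rather than Python lists. The sketch only shows the general pattern of adding a moving average of past compression errors to a gradient before it is quantized.

```python
def quantize(values, step=0.25):
    """Round each value to the nearest multiple of step (a stand-in low-bit quantizer)."""
    return [round(v / step) * step for v in values]


class ErrorCompensator:
    """Keeps a moving average of past compression errors and adds it back
    to each new gradient before quantization (generic error feedback)."""

    def __init__(self, size, beta=0.5, step=0.25):
        self.avg_error = [0.0] * size  # running estimate of the compression error
        self.beta = beta               # weight on the historical average (assumed value)
        self.step = step

    def compress(self, grad):
        # Compensate first, then compress the compensated gradient.
        compensated = [g + e for g, e in zip(grad, self.avg_error)]
        sent = quantize(compensated, self.step)
        # Error introduced by this round of compression.
        error = [c - s for c, s in zip(compensated, sent)]
        self.avg_error = [
            self.beta * a + (1.0 - self.beta) * e
            for a, e in zip(self.avg_error, error)
        ]
        return sent


if __name__ == "__main__":
    comp = ErrorCompensator(size=3)
    for _ in range(4):
        print(comp.compress([0.1, 0.3, -0.6]))
```

Because the stored error never exceeds half a quantization step per coordinate, the state stays bounded while small gradient components that would otherwise always round away can still reach the receiver over several rounds.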

arXiv:2407.03233 [pdf, other]

Asynchronous Parallel Policy Gradient Methods for the Linear Quadratic Regulator

Authors: Xingyu Sha, Feiran Zhao, Keyou You

Abstract: Learning policies in an asynchronous parallel way is essential to the numerous successes of RL for solving large-scale problems. However, their convergence performance is still not rigorously evaluated. To this end, we adopt the asynchronous parallel zero-order policy gradient (AZOPG) method to solve the continuous-time linear quadratic regulation problem. Specifically, as in the celebrated A3C algorithm, there are multiple parallel workers to asynchronously estimate PGs which are then sent to a central master for policy updates. Via quantifying its convergence rate of policy iterations, we show the linear speedup property of the AZOPG, both in theory and simulation, which clearly reveals the advantages of using parallel workers for learning policies.

Submitted 3 July, 2024; originally announced July 2024.

Comments: This article was submitted to IEEE TAC on Jan. 10, 2024
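
As a rough illustration of what a zero-order policy gradient estimate looks like for the problem in the entry above, the sketch below runs a single-worker, two-point estimate on a scalar continuous-time LQR. It deliberately omits the asynchronous parallel workers that are the subject of the paper. The system parameters, step size, smoothing radius, and function names are assumptions made for this example; the cost uses the closed form for dx/dt = a*x + b*u with u = -k*x, which is finite only when b*k > a.

```python
import random


def lqr_cost(k, a=0.0, b=1.0, q=1.0, r=1.0, x0=1.0):
    """Infinite-horizon cost of u = -k*x for dx/dt = a*x + b*u.

    Integrating (q*x^2 + r*u^2) along x(t) = x0*exp((a - b*k)*t) gives
    (q + r*k^2) * x0^2 / (2*(b*k - a)), valid only for stabilizing gains.
    """
    if b * k <= a:
        raise ValueError("gain does not stabilize the system")
    return (q + r * k * k) * x0 * x0 / (2.0 * (b * k - a))


def zero_order_lqr_gain(k0=2.0, steps=200, lr=0.2, radius=0.05, seed=0):
    """Gradient descent on the gain using two-point zero-order estimates."""
    rng = random.Random(seed)
    k = k0
    for _ in range(steps):
        u = rng.choice((-1.0, 1.0))  # random unit direction (a sign in 1D)
        # Only cost evaluations are used; no analytic gradient is needed.
        diff = lqr_cost(k + radius * u) - lqr_cost(k - radius * u)
        grad_est = diff / (2.0 * radius) * u
        k -= lr * grad_est
    return k


if __name__ == "__main__":
    print(zero_order_lqr_gain())
```

With the default parameters the Riccati solution gives an optimal gain of 1, so the printed value should be close to 1. In the asynchronous setting described in the abstract, several workers would compute such estimates at possibly stale gains and a master would apply them as they arrive.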

arXiv:2407.03096 [pdf, ps, other]

Collective advantages in qubit reset: effect of coherent qubits

Authors: Yue Liu, Chenlong Huang, Xingyu Zhang, Dahai He

Abstract: The Landauer principle sets a lower bound on the thermodynamic cost of qubit reset, which is only attainable for the quasistatic process. In this Letter, we explore the collective advantage of qubit reset of coherent qubits in three aspects. First, for the quasistatic process, the thermodynamic cost of collective reset is remarkably lower than parallel reset because of the reduced Hilbert space dimension due to entanglement effects. Second, for the finite-time qubit reset, we prove that the error probability fades away and per-qubit heat production tends to the Landauer bound for initially continuous protocols in the thermodynamic limit. Third, we show that qubit reset performance enhances with the increase in the number of qubits. Our results, illustrated by different protocols, provide a blueprint for future quantum device fabrication.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 6 pages, 3 figures

arXiv:2407.02657 [pdf, other]

Large Scale Hierarchical Industrial Demand Time-Series Forecasting incorporating Sparsity

Authors: Harshavardhan Kamarthi, Aditya B. Sasanur, Xinjie Tong, Xingyu Zhou, James Peters, Joe Czyzyk, B. Aditya Prakash

Abstract: Hierarchical time-series forecasting (HTSF) is an important problem for many real-world business applications where the goal is to simultaneously forecast multiple time-series that are related to each other via a hierarchical relation. Recent works, however, do not address two important challenges that are typically observed in many demand forecasting applications at large companies. First, many time-series at lower levels of the hierarchy have high sparsity i.e., they have a significant number of zeros. Most HTSF methods do not address this varying sparsity across the hierarchy. Further, they do not scale well to the large size of the real-world hierarchy typically unseen in benchmarks used in literature. We resolve both these challenges by proposing HAILS, a novel probabilistic hierarchical model that enables accurate and calibrated probabilistic forecasts across the hierarchy by adaptively modeling sparse and dense time-series with different distributional assumptions and reconciling them to adhere to hierarchical constraints. We show the scalability and effectiveness of our methods by evaluating them against real-world demand forecasting datasets. We deploy HAILS at a large chemical manufacturing company for a product demand forecasting application with over ten thousand products and observe a significant 8.5% improvement in forecast accuracy and 23% better improvement for sparse time-series. The enhanced accuracy and scalability make HAILS a valuable tool for improved business planning and customer experience.

Submitted 2 July, 2024; originally announced July 2024.

Comments: Accepted at KDD 2024

arXiv:2407.01893 [pdf, other]

CausalPrism: A Visual Analytics Approach for Subgroup-based Causal Heterogeneity Exploration

Authors: Jiehui Zhou, Xumeng Wang, Wong Kam-Kwai, Wei Zhang, Xingyu Liu, Juntian Zhang, Minfeng Zhu, Wei Chen

Abstract: In causal inference, estimating Heterogeneous Treatment Effects (HTEs) from observational data is critical for understanding how different subgroups respond to treatments, with broad applications such as precision medicine and targeted advertising. However, existing work on HTE, subgroup discovery, and causal visualization is insufficient to address two challenges: first, the sheer number of potential subgroups and the necessity to balance multiple objectives (e.g., high effects and low variances) pose a considerable analytical challenge. Second, effective subgroup analysis has to follow the analysis goal specified by users and provide causal results with verification. To this end, we propose a visual analytics approach for subgroup-based causal heterogeneity exploration. Specifically, we first formulate causal subgroup discovery as a constrained multi-objective optimization problem and adopt a heuristic genetic algorithm to learn the Pareto front of optimal subgroups described by interpretable rules. Combining with this model, we develop a prototype system, CausalPrism, that incorporates tabular visualization, multi-attribute rankings, and uncertainty plots to support users in interactively exploring and sorting subgroups and explaining treatment effects. Quantitative experiments validate that the proposed model can efficiently mine causal subgroups that outperform state-of-the-art HTE and subgroup discovery methods, and case studies and expert interviews demonstrate the effectiveness and usability of the system.
Code is available at https://osf.io/jaqmf/?view_only=ac9575209945476b955bf829c85196e9.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 12 pages, 7 figures

arXiv:2407.01445 [pdf, other]

FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

Authors: Xiyuan Wei, Fanjiang Ye, Ori Yonay, Xingyu Chen, Baixi Sun, Dingwen Tao, Tianbao Yang

Abstract: Existing studies of training state-of-the-art Contrastive Language-Image Pretraining (CLIP) models on large-scale data involve hundreds of or even thousands of GPUs due to the requirement of a large batch size. However, such a large amount of resources is not accessible to most people. While advanced compositional optimization techniques for optimizing global contrastive losses have been demonstrated effective for removing the requirement of large batch size, their performance on large-scale data remains underexplored and not optimized. To bridge the gap, this paper explores several aspects of CLIP training with limited resources (e.g., up to tens of GPUs). First, we introduce FastCLIP, a general CLIP training framework built on advanced compositional optimization techniques while designed and optimized for the distributed setting. Our framework is equipped with an efficient gradient reduction strategy to reduce communication overhead. Second, to further boost training efficiency, we investigate three components of the framework from an optimization perspective: the schedule of the inner learning rate, the update rules of the temperature parameter and the model parameters, respectively. Experiments on different strategies for each component shed light on how to conduct CLIP training more efficiently.
Finally, we benchmark the performance of FastCLIP and the state-of-the-art training baseline (OpenCLIP) on different compute scales up to 32 GPUs on 8 nodes, and three data scales ranging from 2.7 million, 9.1 million to 315 million image-text pairs to demonstrate the significant improvement of FastCLIP in the resource-limited setting. We release the code of FastCLIP at https://github.com/Optimization-AI/fast_clip.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 23 pages

arXiv:2407.01004 [pdf, other]

CURLS: Causal Rule Learning for Subgroups with Significant Treatment Effect

Authors: Jiehui Zhou, Linxiao Yang, Xingyu Liu, Xinyue Gu, Liang Sun, Wei Chen

Abstract: In causal inference, estimating heterogeneous treatment effects (HTE) is critical for identifying how different subgroups respond to interventions, with broad applications in fields such as precision medicine and personalized advertising. Although HTE estimation methods aim to improve accuracy, how to provide explicit subgroup descriptions remains unclear, hindering data interpretation and strategic intervention management. In this paper, we propose CURLS, a novel rule learning method leveraging HTE, which can effectively describe subgroups with significant treatment effects. Specifically, we frame causal rule learning as a discrete optimization problem, finely balancing treatment effect with variance and considering the rule interpretability. We design an iterative procedure based on the minorize-maximization algorithm and solve a submodular lower bound as an approximation for the original. Quantitative experiments and qualitative case studies verify that compared with state-of-the-art methods, CURLS can find subgroups where the estimated and true effects are 16.1% and 13.8% higher and the variance is 12.0% smaller, while maintaining similar or better estimation accuracy and rule interpretability. Code is available at https://osf.io/zwp2k/.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 12 pages, 3 figures

arXiv:2406.16605 [pdf, other]

CLEAR: Can Language Models Really Understand Causal Graphs?

Authors: Sirui Chen, Mengying Xu, Kun Wang, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Chaochao Lu

Abstract: Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.15720 [pdf, other]

Scaling Laws for Fact Memorization of Large Language Models

Authors: Xingyu Lu, Xiaonan Li, Qinyuan Cheng, Kai Ding, Xuanjing Huang, Xipeng Qiu

Abstract: Fact knowledge memorization is crucial for Large Language Models (LLM) to generate factual and reliable responses. However, the behaviors of LLM fact memorization remain under-explored. In this paper, we analyze the scaling laws for LLM's fact knowledge and LLMs' behaviors of memorizing different types of facts. We find that LLMs' fact knowledge capacity has a linear and negative exponential law relationship with model size and training epochs, respectively. Estimated by the built scaling law, memorizing the whole Wikidata's facts requires training an LLM with 1000B non-embed parameters for 100 epochs, suggesting that using LLMs to memorize all public facts is almost implausible for a general pre-training setting. Meanwhile, we find that LLMs can generalize on unseen fact knowledge and its scaling law is similar to general pre-training. Additionally, we analyze the compatibility and preference of LLMs' fact memorization. For compatibility, we find LLMs struggle with memorizing redundant facts in a unified way. Only when correlated facts have the same direction and structure, the LLM can compatibly memorize them. This shows the inefficiency of LLM memorization for redundant facts. For preference, the LLM pays more attention to memorizing more frequent and difficult facts, and the subsequent facts can overwrite prior facts' memorization, which significantly hinders low-frequency facts memorization. Our findings reveal the capacity and characteristics of LLMs' fact knowledge learning, which provide directions for LLMs' fact knowledge augmentation.

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14401 [pdf, other]

Fair Streaming Feature Selection

Authors: Zhangling Duan, Tianci Li, Xingyu Wu, Zhaolong Ling, Jingye Yang, Zhaohong Jia

Abstract: Streaming feature selection techniques have become essential in processing real-time data streams, as they facilitate the identification of the most relevant attributes from continuously updating information. Despite their performance, current algorithms for streaming feature selection frequently fall short in managing biases and avoiding discrimination that could be perpetuated by sensitive attributes, potentially leading to unfair outcomes in the resulting models. To address this issue, we propose FairSFS, a novel algorithm for Fair Streaming Feature Selection, to uphold fairness in the feature selection process without compromising the ability to handle data in an online manner. FairSFS adapts to incoming feature vectors by dynamically adjusting the feature set and discerns the correlations between classification attributes and sensitive attributes from this revised set, thereby forestalling the propagation of sensitive data. Empirical evaluations show that FairSFS not only maintains accuracy that is on par with leading streaming feature selection methods and existing fair feature techniques but also significantly improves fairness metrics.

Submitted 20 June, 2024; originally announced June 2024.

Comments: 30 pages, 10 figures

arXiv:2406.14359 [pdf, other]

Learning to Transfer for Evolutionary Multitasking

Authors: Sheng-Hao Wu, Yuxiao Huang, Xingyu Wu, Liang Feng, Zhi-Hui Zhan, Kay Chen Tan

Abstract: Evolutionary multitasking (EMT) is an emerging approach for solving multitask optimization problems (MTOPs) and has garnered considerable research interest. The implicit EMT is a significant research branch that utilizes evolution operators to enable knowledge transfer (KT) between tasks. However, current approaches in implicit EMT face challenges in adaptability, due to the use of a limited number of evolution operators and insufficient utilization of evolutionary states for performing KT. This results in suboptimal exploitation of implicit KT's potential to tackle a variety of MTOPs. To overcome these limitations, we propose a novel Learning to Transfer (L2T) framework to automatically discover efficient KT policies for the MTOPs at hand. Our framework conceptualizes the KT process as a learning agent's sequence of strategic decisions within the EMT process. We propose an action formulation for deciding when and how to transfer, a state representation with informative features of evolution states, a reward formulation concerning convergence and transfer efficiency gain, and the environment for the agent to interact with MTOPs. We employ an actor-critic network structure for the agent and learn it via proximal policy optimization. This learned agent can be integrated with various evolutionary algorithms, enhancing their ability to address a range of new MTOPs.
Comprehensive empirical studies on both synthetic and real-world MTOPs, encompassing diverse inter-task relationships, function classes, and task distributions are conducted to validate the proposed L2T framework. The results show a marked improvement in the adaptability and performance of implicit EMT when solving a wide spectrum of unseen MTOPs.

Submitted 22 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

Comments: Under review

arXiv:2406.13149 [pdf, other]

High-Fidelity Facial Albedo Estimation via Texture Quantization

Authors: Zimin Ran, Xingyu Ren, Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jia Guo, Linchao Zhu, Jiankang Deng

Abstract: Recent 3D face reconstruction methods have made significant progress in shape estimation, but high-fidelity facial albedo reconstruction remains challenging. Existing methods depend on expensive light-stage captured data to learn facial albedo maps. However, a lack of diversity in subjects limits their ability to recover high-fidelity results. In this paper, we present a novel facial albedo reconstruction model, HiFiAlbedo, which recovers the albedo map directly from a single image without the need for captured albedo data. Our key insight is that the albedo map is the illumination invariant texture map, which enables us to use inexpensive texture data to derive an albedo estimation by eliminating illumination. To achieve this, we first collect large-scale ultra-high-resolution facial images and train a high-fidelity facial texture codebook. By using the FFHQ dataset and limited UV textures, we then fine-tune the encoder for texture reconstruction from the input image with adversarial supervision in both image and UV space. Finally, we train a cross-attention module and utilize group identity loss to learn the adaptation from facial texture to the albedo domain. Extensive experimentation has demonstrated that our method exhibits excellent generalizability and is capable of achieving high-fidelity results for in-the-wild facial albedo recovery. Our code, pre-trained weights, and training data will be made publicly available at https://hifialbedo.github.io/.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11243 [pdf, other]

FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

Authors: Bangzheng Li, Ben Zhou, Xingyu Fu, Fei Wang, Dan Roth, Muhao Chen

Abstract: Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity with the prompt. While showing consistent improvements on in-domain tasks, we found that familiarity metrics such as perplexity cannot accurately estimate performance in complicated situations such as task or domain transferring scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with \textit{complexity} -- the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performances, producing a 0.85 Spearman's correlation, versus 0.43 of familiarity-only ones'. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10189 [pdf, ps, other]

Topological rigidity of small RCD(K,N) spaces with maximal rank

Authors: Sergio Zamora, Xingyu Zhu

Abstract: For a polycyclic group $Λ$, rank$(Λ)$ is defined as the number of $\mathbb{Z}$ factors in a polycyclic decomposition of $Λ$. For a finitely generated group $G$, rank$(G)$ is defined as the infimum of rank$(Λ)$ among finite index polycyclic subgroups $Λ\leq G$. For a compact RCD$(K,N)$ space $(X,\mathsf{d}, \mathfrak{m} )$ with diam$(X) \leq \varepsilon (K,N)$, the rank of $π_1(X)$ is at most $N$. In this note we show that in case of equality, $X$ is homeomorphic to an infranilmanifold, generalizing to the non-smooth setting a result by Kapovitch--Wilking. We also fill a gap in the proof that if $π_1(X) = \mathbb{Z}^N$, then $X$ is bi-Hölder homeomorphic to a flat torus (diffeomorphic in the smooth case).

Submitted 14 June, 2024; originally announced June 2024.

Report number: MPIM-Bonn-2024 MSC Class: 53C23; 53C21

arXiv:2406.09411 [pdf, other]

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen

Abstract: We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.

Submitted 1 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: typos corrected, references added, Project Page: https://muirbench.github.io/

arXiv:2406.09403 [pdf, other]

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Authors: Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna

Abstract: Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in https://visualsketchpad.github.io/.

Submitted 10 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Project and codes url: https://visualsketchpad.github.io/

arXiv:2406.08948 [pdf, other]

Validity of the Lieb-Schultz-Mattis Theorem in Long-Range Interacting Systems

Authors: Yi-Neng Zhou, Xingyu Li

Abstract: The Lieb-Schultz-Mattis (LSM) theorem asserts that microscopic details of the system can impose non-trivial constraints on the system's low-energy properties. While traditionally applied to short-range interaction systems, where locality ensures a vanishing spectral gap in large system size limit, the impact of long-range interactions on the LSM theorem remains an open question. Long-range interactions are prevalent in experimental platforms such as Rydberg atoms, dipolar quantum gases, polar molecules, optical cavities, and trapped ions, where the interaction decay exponent can be experimentally tuned. We extend the LSM theorem in one dimension to long-range interacting systems and find that the LSM theorem holds for exponentially or power-law two-body interactions with a decay exponent $α> 2$. However, for power-law interactions with $α< 2$, the constraints of the LSM theorem on the ground state do not apply. Numerical simulations of long-range versions of the Heisenberg and Majumdar-Ghosh models, both satisfying the LSM symmetry requirements, are also provided. Our results suggest promising directions for experimental validation of the LSM theorem in systems with tunable long-range interactions.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 6 pages, 2 figures

arXiv:2406.07546 [pdf, other]

Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Authors: Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth

Abstract: We present a novel task and benchmark for evaluating the ability of text-to-image(T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" v.s. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit" correspondingly. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist analyzing model behavior. We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that, there is still a large gap between image synthesis and real life photos--even the DALL-E 3 model could only achieve 48.92% on Commonsense-T2I, and the stable diffusion XL model only achieves 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis about possible reasons for such deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real life image generation.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Text-to-Image Generation, Commonsense, Project Url: https://zeyofu.github.io/CommonsenseT2I/

arXiv:2406.07411 [pdf, other]

VersiCode: Towards Version-controllable Code Generation

Authors: Tongtong Wu, Weigang Wu, Xingyu Wang, Kang Xu, Suyu Ma, Bo Jiang, ** Yang, Zhenchang Xing, Yuan-Fang Li, Gholamreza Haffari

Abstract: Significant research has focused on improving the performance of large language model on code-related tasks due to their practical importance. Although performance is typically evaluated using public benchmark datasets, the existing datasets do not account for the concept of \emph{version}, which is crucial in professional software development. In this paper, we introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions. VersiCode encompasses 300 libraries across more than 2,000 versions spanning 9 years. We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE). Comprehensive experiments are conducted to benchmark the performance of LLMs, revealing the challenging nature of these tasks and VersiCode, that even state-of-the-art LLMs struggle to generate version-correct code. This dataset, together with the proposed tasks, sheds light on LLMs' capabilities and limitations in handling version-specific code generation, and opens up an important new area of research for further investigation. The resources can be found at https://github.com/wutong8023/VersiCode.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.03274 [pdf, other]

Enhancing CTC-based speech recognition with diverse modeling units

Authors: Shiyi Han, Zhihong Lei, Mingbin Xu, Xingyu Na, Zhen Huang

Abstract: In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures like transformer. On top of E2E systems, researchers have achieved substantial accuracy improvement by rescoring E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question about where the improvements come from other than the system combination effect. We examine the underlying mechanisms driving these gains and propose an efficient joint training approach, where E2E models are trained jointly with diverse modeling units. This methodology does not only align the strengths of both phoneme and grapheme-based models but also reveals that using these diverse modeling units in a synergistic way can significantly enhance model accuracy. Our findings offer new insights into the optimal integration of heterogeneous modeling units in the development of more robust and accurate ASR systems.

Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.01003 [pdf, other]

Uni-ISP: Unifying the Learning of ISPs from Multiple Cameras

Authors: Lingen Li, Mingde Yao, Xingyu Meng, Muquan Yu, Tianfan Xue, Jinwei Gu

Abstract: Modern end-to-end image signal processors (ISPs) can learn complex mappings from RAW/XYZ data to sRGB (or inverse), opening new possibilities in image processing. However, as the diversity of camera models continues to expand, developing and maintaining individual ISPs is not sustainable in the long term, which inherently lacks versatility, hindering the adaptability to multiple camera models. In this paper, we propose a novel pipeline, Uni-ISP, which unifies the learning of ISPs from multiple cameras, offering an accurate and versatile processor to multiple camera models. The core of Uni-ISP is leveraging device-aware embeddings through learning inverse/forward ISPs and its special training scheme. By doing so, Uni-ISP not only improves the performance of inverse/forward ISPs but also unlocks a variety of new applications inaccessible to existing learned ISPs. Moreover, since there is no dataset synchronously captured by multiple cameras for training, we construct a real-world 4K dataset, FiveCam, comprising more than 2,400 pairs of sRGB-RAW images synchronously captured by five smartphones. We conducted extensive experiments demonstrating Uni-ISP's accuracy in inverse/forward ISPs (with improvements of +1.5dB/2.4dB PSNR), its versatility in enabling new applications, and its adaptability to new camera models.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00440 [pdf, other]

Topo4D: Topology-Preserving Gaussian Splatting for High-Fidelity 4D Head Capture

Authors: Xuanchen Li, Yuhao Cheng, Xingyu Ren, Haozhe Jia, Di Xu, Wenhan Zhu, Yichao Yan

Abstract: 4D head capture aims to generate dynamic topological meshes and corresponding texture maps from videos, which is widely utilized in movies and games for its ability to simulate facial muscle movements and recover dynamic textures in pore-squeezing. The industry often adopts the method involving multi-view stereo and non-rigid alignment. However, this approach is prone to errors and heavily reliant on time-consuming manual processing by artists. To simplify this process, we propose Topo4D, a novel framework for automatic geometry and texture generation, which optimizes densely aligned 4D heads and 8K texture maps directly from calibrated multi-view time-series images. Specifically, we first represent the time-series faces as a set of dynamic 3D Gaussians with fixed topology in which the Gaussian centers are bound to the mesh vertices. Afterward, we perform alternative geometry and texture optimization frame-by-frame for high-quality geometry and texture learning while maintaining temporal topology stability. Finally, we can extract dynamic facial meshes in regular wiring arrangement and high-fidelity textures with pore-level details from the learned Gaussians. Extensive experiments show that our method achieves superior results than the current SOTA face reconstruction methods both in the quality of meshes and textures. Project page: https://xuanchenli.github.io/Topo4D/.

Submitted 1 July, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

arXiv:2405.21013 [pdf, other]

StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond

Authors: Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang

Abstract: Text-rich images have significant and extensive value, deeply integrated into various aspects of human life. Notably, both visual cues and linguistic symbols in text-rich images play crucial roles in information transmission but are accompanied by diverse challenges. Therefore, the efficient and effective understanding of text-rich images is a crucial litmus test for the capability of Vision-Language Models. We have crafted an efficient vision-language model, StrucTexTv3, tailored to tackle various intelligent tasks for text-rich images. The significant design of StrucTexTv3 is presented in the following aspects: Firstly, we adopt a combination of an effective multi-scale reduced visual transformer and a multi-granularity token sampler (MG-Sampler) as a visual token generator, successfully solving the challenges of high-resolution input and complex representation learning for text-rich images. Secondly, we enhance the perception and comprehension abilities of StrucTexTv3 through instruction learning, seamlessly integrating various text-oriented tasks into a unified framework. Thirdly, we have curated a comprehensive collection of high-quality text-rich images, abbreviated as TIM-30M, encompassing diverse scenarios like incidental scenes, office documents, web pages, and screenshots, thereby improving the robustness of our model. Our method achieved SOTA results in text-rich image perception tasks, and significantly improved performance in comprehension tasks. Among multimodal models with LLM decoder of approximately 1.8B parameters, it stands out as a leader, which also makes the deployment of edge devices feasible. In summary, the StrucTexTv3 model, featuring efficient structural design, outstanding performance, and broad adaptability, offers robust support for diverse intelligent application tasks involving text-rich images, thus exhibiting immense potential for widespread application.

Submitted 4 June, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

arXiv:2405.19765 [pdf, other]

Towards Unified Multi-granularity Text Detection with Interactive Attention

Authors: Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

Abstract: Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands. In this paper, we introduce "Detect Any Text" (DAT), an advanced paradigm that seamlessly unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. This design enables DAT to efficiently manage text instances at different granularities, including *word*, *line*, *paragraph* and *page*. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances at varying granularities by correlating structural information across different text queries. As a result, it enables the model to achieve mutually beneficial detection performances across multiple text granularities. Additionally, a prompt-based segmentation module refines detection outcomes for texts of arbitrary curvature and complex layouts, thereby improving DAT's accuracy and expanding its real-world applicability. Experimental results demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks, including multi-oriented/arbitrarily-shaped scene text detection, document layout analysis and page detection tasks.

Submitted 30 May, 2024; originally announced May 2024.

Comments: ICML 2024

arXiv:2405.19568 [pdf, other]

Organizing Background to Explore Latent Classes for Incremental Few-shot Semantic Segmentation

Authors: Lianlei Shan, Wenzhang Zhou, Wei Li, Xingyu Ding

Abstract: The goal of incremental Few-shot Semantic Segmentation (iFSS) is to extend pre-trained segmentation models to new classes via few annotated images without access to old training data. During incrementally learning novel classes, the data distribution of old classes will be destroyed, leading to catastrophic forgetting. Meanwhile, the novel classes have only few samples, making models impossible to learn the satisfying representations of novel classes. For the iFSS problem, we propose a network called OINet, i.e., the background embedding space \textbf{O}rganization and prototype \textbf{I}nherit Network. Specifically, when training base classes, OINet uses multiple classification heads for the background and sets multiple sub-class prototypes to reserve embedding space for the latent novel classes. During incrementally learning novel classes, we propose a strategy to select the sub-class prototypes that best match the current learning novel classes and make the novel classes inherit the selected prototypes' embedding space. This operation allows the novel classes to be registered in the embedding space using few samples without affecting the distribution of the base classes. Results on Pascal-VOC and COCO show that OINet achieves a new state of the art.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 10 pages, 5 figures

arXiv:2405.18663 [pdf, other]

Lifelong Learning and Selective Forgetting via Contrastive Strategy

Authors: Lianlei Shan, Wenzhang Zhou, Wei Li, Xingyu Ding

Abstract: Lifelong learning aims to train a model with good performance for new tasks while retaining the capacity of previous tasks. However, some practical scenarios require the system to forget undesirable knowledge due to privacy issues, which is called selective forgetting. The joint task of the two is dubbed Learning with Selective Forgetting (LSF). In this paper, we propose a new framework based on contrastive strategy for LSF. Specifically, for the preserved classes (tasks), we make features extracted from different samples within a same class compacted. And for the deleted classes, we make the features from different samples of a same class dispersed and irregular, i.e., the network does not have any regular response to samples from a specific deleted class as if the network has no training at all. Through maintaining or disturbing the feature distribution, the forgetting and memory of different classes can be or independent of each other. Experiments are conducted on four benchmark datasets, and our method achieves new state-of-the-art.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 10 pages, 5 figures

arXiv:2405.17776 [pdf, other]

The Binary Quantized Neural Network for Dense Prediction via Specially Designed Upsampling and Attention

Authors: Xingyu Ding, Lianlei Shan, Guiqin Zhao, Meiqi Wu, Wenzhang Zhou, Wei Li

Abstract: Deep learning-based information processing consumes long time and requires huge computing resources, especially for dense prediction tasks which require an output for each pixel, like semantic segmentation and salient object detection. There are mainly two challenges for quantization of dense prediction tasks. Firstly, directly applying the upsampling operation that dense prediction tasks require is extremely crude and causes unacceptable accuracy reduction. Secondly, the complex structure of dense prediction networks means it is difficult to maintain a fast speed as well as a high accuracy when performing quantization. In this paper, we propose an effective upsampling method and an efficient attention computation strategy to transfer the success of the binary neural networks (BNN) from single prediction tasks to dense prediction tasks. Firstly, we design a simple and robust multi-branch parallel upsampling structure to achieve the high accuracy. Then we further optimize the attention method which plays an important role in segmentation but has huge computation complexity. Our attention method can reduce the computational complexity by a factor of one hundred times but retain the original effect. Experiments on Cityscapes, KITTI road, and ECSSD fully show the effectiveness of our work.

Submitted 27 May, 2024; originally announced May 2024.

Comments: 30 pages, 6 figures

arXiv:2405.16778 [pdf, other]

Unusual switch from low-temperature T-quadratic resistivity in the underdoped pseudogap phase of cuprate superconductors to low-temperature T-linear resistivity in the overdoped strange-metal phase

Authors: Xingyu Ma, Minghuan Zeng, Huaiming Guo, Shiping Feng

Abstract: The transport experiments demonstrate a dramatic switch from the low-temperature T-linear resistivity in the overdoped strange-metal phase to the T-quadratic resistivity in the underdoped pseudogap phase of cuprate superconductors, however, a consensus on the origin of this switch is still lacking. Here the low-temperature resistivity in the underdoped pseudogap phase of cuprate superconductors is investigated using the Boltzmann transport equation. The low-temperature resistivity originates from the electron umklapp scattering mediated by the spin excitation. However, the dominant contribution to the resistivity mainly comes from the antinodal umklapp scattering. In particular, a low temperature $T_{scale}$ scales with $Δ^{2}_{p}$ in the underdoped regime due to the opening of a momentum-dependent spin pseudogap, where $Δ_{p}$ is the minimal umklapp vector at the antinode. Notably, this $T_{scale}$ as a function of doping presents a similar behavior of the antinodal spin pseudogap crossover temperature, i.e., $T_{scale}$ decreases with the increase of doping in the underdoped regime, and then is reduced to a very low temperature in the overdoped regime. In the underdoped regime, the resistivity is T-quadratic in the low temperatures below $T_{scale}$ with the strength of the T-quadratic resistivity that weakens as the doping is raised. However, in the overdoped regime, the resistivity is T-linear in the low temperatures above $T_{scale}$. The result in this paper together with the recent study on the electrical transport in the overdoped regime therefore show that the electron Umklapp scattering from a spin excitation responsible for the low-temperature T-linear resistivity in the overdoped strange-metal phase naturally produces the low-temperature T-quadratic resistivity in the underdoped pseudogap phase resulting from the opening of a momentum dependent spin pseudogap.

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.16455 [pdf, other]

On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization

Authors: Jiancong Xiao, Ziniu Li, Xingyu Xie, Emily Getzen, Cong Fang, Qi Long, Weijie J. Su

Abstract: Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF.

Submitted 26 May, 2024; originally announced May 2024.
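
Note on the entry above: the abstract describes a regularizer given by the negative logarithm of the policy probability. The following is a minimal sketch of why such a term can counter preference collapse, not the paper's implementation. It assumes a single prompt with a finite set of responses, a unit weight on the regularizer, and the objective E_{y~π}[r(y) - log π(y)]; the names `softmax` and `pm_objective` are hypothetical. Under these assumptions the maximizer is the softmax of the rewards, which keeps probability mass on lower-reward (minority) responses instead of collapsing onto the top one.

```python
import math

def softmax(rewards):
    """Return exp(r_i) / sum_j exp(r_j), computed stably."""
    m = max(rewards)
    exps = [math.exp(r - m) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

def pm_objective(policy, rewards):
    """Expected reward plus the negative-log-policy term (assumed form):
    sum_y policy(y) * (r(y) - log policy(y))."""
    return sum(p * (r - math.log(p)) for p, r in zip(policy, rewards) if p > 0)

# Toy rewards for two candidate responses (illustrative values only).
rewards = [1.0, 0.0]

# Policy that matches the reward-induced preference distribution.
matched = softmax(rewards)

# Near-deterministic policy that almost ignores the lower-reward response.
collapsed = [0.999, 0.001]

matched_value = pm_objective(matched, rewards)
collapsed_value = pm_objective(collapsed, rewards)
```

In this toy setting `matched_value` equals the log-sum-exp of the rewards, the largest value the assumed objective can take, so the near-deterministic policy scores strictly lower.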

arXiv:2405.15914 [pdf, other]

ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching

Authors: Yumin Zhang, Xingyu Miao, Haoran Duan, Bo Wei, Tejal Shah, Yang Long, Rajiv Ranjan

Abstract: Text-to-3D content creation is a rapidly evolving research area. Given the scarcity of 3D data, current approaches often adapt pre-trained 2D diffusion models for 3D synthesis. Among these approaches, Score Distillation Sampling (SDS) has been widely adopted. However, the issue of over-smoothing poses a significant limitation on the high-fidelity generation of 3D models. To address this challenge, LucidDreamer replaces the Denoising Diffusion Probabilistic Model (DDPM) in SDS with the Denoising Diffusion Implicit Model (DDIM) to construct Interval Score Matching (ISM). However, ISM inevitably inherits inconsistencies from DDIM, causing reconstruction errors during the DDIM inversion process. This results in poor performance in the detailed generation of 3D objects and loss of content. To alleviate these problems, we propose a novel method named Exact Score Matching (ESM). Specifically, ESM leverages auxiliary variables to mathematically guarantee exact recovery in the DDIM reverse process. Furthermore, to effectively capture the dynamic changes of the original and auxiliary variables, the LoRA of a pre-trained diffusion model implements these exact paths. Extensive experiments demonstrate the effectiveness of ESM in text-to-3D generation, particularly highlighting its superiority in detailed generation.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.15682 [pdf, other]

The Road Less Scheduled

Authors: Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, Ashok Cutkosky

Abstract: Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available (https://github.com/facebookresearch/schedule_free).

Submitted 30 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.15289 [pdf, other]

Learning Invariant Causal Mechanism from Vision-Language Models

Authors: Zeen Song, Siyu Zhao, Xingyu Zhang, Jiangmeng Li, Changwen Zheng, Wenwen Qiang

Abstract: Pre-trained large-scale models have become a major research focus, but their effectiveness is limited in real-world applications due to diverse data distributions. In contrast, humans excel at decision-making across various domains by learning reusable knowledge that remains invariant despite environmental changes in a complex world. Although CLIP, as a successful vision-language pre-trained model, demonstrates remarkable performance in various visual downstream tasks, our experiments reveal unsatisfactory results in specific domains. Our further analysis with causal inference exposes the current CLIP model's inability to capture the invariant causal mechanisms across domains, attributed to its deficiency in identifying latent factors generating the data. To address this, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM), an algorithm designed to provably identify invariant latent factors with the aid of interventional data, and perform accurate prediction on various domains. Theoretical analysis demonstrates that our method has a lower generalization bound in out-of-distribution (OOD) scenarios. Experimental results showcase the outstanding performance of CLIP-ICM.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.11792 [pdf, other]

Source Localization by Multidimensional Steered Response Power Mapping with Sparse Bayesian Learning

Authors: Wei-Ting Lai, Lachlan Birnie, Xingyu Chen, Amy Bastine, Thushara D. Abhayapala, Prasanga N. Samarasinghe

Abstract: We propose an advanced Steered Response Power (SRP) method for localizing multiple sources. While conventional SRP performs well in adverse conditions, it still struggles in scenarios with closely neighboring sources, resulting in ambiguous SRP maps. We address this issue by applying sparsity optimization in SRP to obtain high-resolution maps. Our approach represents SRP maps as multidimensional matrices to preserve time-frequency information and further improve performance in unfavorable conditions. We use multi-dictionary Sparse Bayesian Learning to localize sources without needing prior knowledge of their quantity. We validate our method through practical experiments with a 16-channel planar microphone array and compare against three other SRP and sparsity-based methods. Our multidimensional SRP approach outperforms conventional SRP and the current state-of-the-art sparse SRP methods for localizing closely spaced sources in a reverberant room.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2405.11349 [pdf, other]

Unlock the Power of Algorithm Features: A Generalization Analysis for Algorithm Selection

Authors: Xingyu Wu, Yan Zhong, Jibin Wu, Yuxiao Huang, Sheng-hao Wu, Kay Chen Tan

Abstract: In algorithm selection research, the discussion surrounding algorithm features has been significantly overshadowed by the emphasis on problem features. Although a few empirical studies have yielded evidence regarding the effectiveness of algorithm features, the potential benefits of incorporating algorithm features into algorithm selection models and their suitability for different scenarios remain unclear. In this paper, we address this gap by proposing the first provable guarantee for algorithm selection based on algorithm features, taking a generalization perspective. We analyze the benefits and costs associated with algorithm features and investigate how the generalization error is affected by different factors. Specifically, we examine adaptive and predefined algorithm features under transductive and inductive learning paradigms, respectively, and derive upper bounds for the generalization error based on their model's Rademacher complexity. Our theoretical findings not only provide tight upper bounds, but also offer analytical insights into the impact of various factors, such as the training scale of problem instances and candidate algorithms, model parameters, feature values, and distributional differences between the training and test data. Notably, we demonstrate how models will benefit from algorithm features in complex scenarios involving many algorithms, and prove the positive correlation between the generalization error bound and the $χ^2$-divergence of distributions.

Submitted 3 June, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.11252 [pdf, other]

Dreamer XL: Towards High-Resolution Text-to-3D Generation via Trajectory Score Matching

Authors: Xingyu Miao, Haoran Duan, Varun Ojha, Jun Song, Tejal Shah, Yang Long, Rajiv Ranjan

Abstract: In this work, we propose a novel Trajectory Score Matching (TSM) method that aims to solve the pseudo ground truth inconsistency problem caused by the accumulated error in Interval Score Matching (ISM) when using the Denoising Diffusion Implicit Models (DDIM) inversion process. Unlike ISM which adopts the inversion process of DDIM to calculate on a single path, our TSM method leverages the inversion process of DDIM to generate two paths from the same starting point for calculation. Since both paths start from the same starting point, TSM can reduce the accumulated error compared to ISM, thus alleviating the problem of pseudo ground truth inconsistency. TSM enhances the stability and consistency of the model's generated paths during the distillation process. We demonstrate this experimentally and further show that ISM is a special case of TSM. Furthermore, to optimize the current multi-stage optimization process from high-resolution text to 3D generation, we adopt Stable Diffusion XL for guidance. In response to the issues of abnormal replication and splitting caused by unstable gradients during the 3D Gaussian splatting process when using Stable Diffusion XL, we propose a pixel-by-pixel gradient clipping method. Extensive experiments show that our model significantly surpasses the state-of-the-art models in terms of visual quality and performance. Code: https://github.com/xingy038/Dreamer-XL.

Submitted 18 May, 2024; originally announced May 2024.

arXiv:2405.10022 [pdf, other]

Monaural speech enhancement on drone via Adapter based transfer learning

Authors: Xingyu Chen, Hanwen Bi, Wei-Ting Lai, Fei Ma

Abstract: Monaural speech enhancement on drones is challenging because the ego-noise from the rotating motors and propellers leads to extremely low signal-to-noise ratios at onboard microphones. Although recent masking-based deep neural network methods excel in monaural speech enhancement, they struggle in the challenging drone noise scenario. Furthermore, existing drone noise datasets are limited, causing models to overfit. Considering the harmonic nature of drone noise, this paper proposes a frequency domain bottleneck adapter to enable transfer learning. Specifically, the adapter's parameters are trained on drone noise while keeping the parameters of the pre-trained Frequency Recurrent Convolutional Recurrent Network (FRCRN) fixed. Evaluation results demonstrate the proposed method can effectively enhance speech quality. Moreover, it is a more efficient alternative to fine-tuning models for various drone types, which typically requires substantial computational resources.

Submitted 16 May, 2024; originally announced May 2024.

arXiv:2405.09492 [pdf, other]

MGSER-SAM: Memory-Guided Soft Experience Replay with Sharpness-Aware Optimization for Enhanced Continual Learning

Authors: Xingyu Li, Bo Tang

Abstract: Deep neural networks suffer from the catastrophic forgetting problem in the field of continual learning (CL). To address this challenge, we propose MGSER-SAM, a novel memory replay-based algorithm specifically engineered to enhance the generalization capabilities of CL models. We first integrate the SAM optimizer, a component designed for optimizing flatness, which seamlessly fits into well-known Experience Replay frameworks such as ER and DER++. Then, MGSER-SAM distinctively addresses the complex challenge of reconciling conflicts in weight perturbation directions between ongoing tasks and previously stored memories, which is underexplored in the SAM optimizer. This is effectively accomplished by the strategic integration of soft logits and the alignment of memory gradient directions, where the regularization terms facilitate the concurrent minimization of various training loss terms integral to the CL process. Through rigorous experimental analysis conducted across multiple benchmarks, MGSER-SAM has demonstrated a consistent ability to outperform existing baselines in all three CL scenarios. Compared to the representative memory replay-based baselines ER and DER++, MGSER-SAM not only improves the testing accuracy by $24.4\%$ and $17.6\%$ respectively, but also achieves the lowest forgetting on each benchmark.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 8 pages, 5 figures

arXiv:2405.09312 [pdf, ps, other]

Agnostic Active Learning of Single Index Models with Linear Sample Complexity

Authors: Aarshvi Gajjar, Wai Ming Tai, Xingyu Xu, Chinmay Hegde, Yi Li, Christopher Musco

Abstract: We study active learning methods for single index models of the form $F({\mathbf x}) = f(\langle {\mathbf w}, {\mathbf x}\rangle)$, where $f:\mathbb{R} \to \mathbb{R}$ and ${\mathbf x,\mathbf w} \in \mathbb{R}^d$. In addition to their theoretical interest as simple examples of non-linear neural networks, single index models have received significant recent attention due to applications in scientific machine learning like surrogate modeling for partial differential equations (PDEs). Such applications require sample-efficient active learning methods that are robust to adversarial noise, i.e., that work even in the challenging agnostic learning setting. We provide two main results on agnostic active learning of single index models. First, when $f$ is known and Lipschitz, we show that $\tilde{O}(d)$ samples collected via statistical leverage score sampling are sufficient to learn a near-optimal single index model. Leverage score sampling is simple to implement, efficient, and already widely used for actively learning linear models. Our result requires no assumptions on the data distribution, is optimal up to log factors, and improves quadratically on a recent ${O}(d^{2})$ bound of \cite{gajjar2023active}. Second, we show that $\tilde{O}(d)$ samples suffice even in the more difficult setting when $f$ is unknown. Our results leverage tools from high dimensional probability, including Dudley's inequality and dual Sudakov minoration, as well as a novel, distribution-aware discretization of the class of Lipschitz functions.

Submitted 9 July, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

arXiv:2405.08340 [pdf, other]

Achieving Resolution-Agnostic DNN-based Image Watermarking: A Novel Perspective of Implicit Neural Representation

Authors: Yuchen Wang, Xingyu Zhu, Guanhui Ye, Shiyao Zhang, Xuetao Wei

Abstract: DNN-based watermarking methods are rapidly developing and delivering impressive performances. Recent advances achieve resolution-agnostic image watermarking by reducing the variant resolution watermarking problem to a fixed resolution watermarking problem. However, such a reduction process can potentially introduce artifacts and low robustness. To address this issue, we propose the first, to the best of our knowledge, Resolution-Agnostic Image WaterMarking (RAIMark) framework by watermarking the implicit neural representation (INR) of an image. Unlike previous methods, our method does not rely on the previous reduction process; it directly watermarks the continuous signal instead of image pixels, thus achieving resolution-agnostic watermarking. Precisely, given an arbitrary-resolution image, we fit an INR for the target image. As a continuous signal, such an INR can be sampled to obtain images with variant resolutions. Then, we quickly fine-tune the fitted INR to get a watermarked INR conditioned on a binary secret message. A pre-trained watermark decoder extracts the hidden message from any sampled images with arbitrary resolutions. By directly watermarking INR, we achieve resolution-agnostic watermarking with increased robustness. Extensive experiments show that our method outperforms previous methods with significant improvements: bit accuracy improves by 7% to 29% on average. Notably, we observe that previous methods are vulnerable to at least one watermarking attack (e.g. JPEG, crop, resize), while ours is robust against all watermarking attacks.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2405.07801 [pdf, other]

Deep Learning-Based Object Pose Estimation: A Comprehensive Survey

Authors: Jian Liu, Wei Sun, Hui Yang, Zhiwen Zeng, Chongpei Liu, ** Zheng, Xingyu Liu, Hossein Rahmani, Nicu Sebe, Ajmal Mian

Abstract: Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and promising future directions, is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing the readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, and reports the performance of current state-of-the-art methods on these benchmarks, thereby helping readers select the most suitable method for their application.
Finally, the survey identifies key challenges, reviews the prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracking the latest work at https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation.

Submitted 31 May, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: 27 pages, 7 figures

arXiv:2405.07059 [pdf, other]

Numerical Analysis of Finite Dimensional Approximations in Finite Temperature DFT

Authors: Ge Xu, Huajie Chen, Xingyu Gao

Abstract: In this paper, we study numerical approximations of the ground states in finite temperature density functional theory. We formulate the problem with respect to the density matrices and justify the convergence of the finite dimensional approximations. Moreover, we provide an optimal a priori error estimate under some mild assumptions and present some numerical experiments to support the theory.

Submitted 11 May, 2024; originally announced May 2024.

Comments: 20 pages, 6 figures

arXiv:2405.06784 [pdf, other]

Open Challenges and Opportunities in Federated Foundation Models Towards Biomedical Healthcare

Authors: Xingyu Li, Lu Peng, Yu** Wang, Weihua Zhang

Abstract: This survey explores the transformative impact of foundation models (FMs) in artificial intelligence, focusing on their integration with federated learning (FL) for advancing biomedical research. Foundation models such as ChatGPT, LLaMa, and CLIP, which are trained on vast datasets through methods including unsupervised pretraining, self-supervised learning, instructed fine-tuning, and reinforcement learning from human feedback, represent significant advancements in machine learning. These models, with their ability to generate coherent text and realistic images, are crucial for biomedical applications that require processing diverse data forms such as clinical reports, diagnostic images, and multimodal patient interactions. The incorporation of FL with these sophisticated models presents a promising strategy to harness their analytical power while safeguarding the privacy of sensitive medical data. This approach not only enhances the capabilities of FMs in medical diagnostics and personalized treatment but also addresses critical concerns about data privacy and security in healthcare. This survey reviews the current applications of FMs in federated settings, underscores the challenges, and identifies future research directions including scaling FMs, managing data diversity, and enhancing communication efficiency within FL frameworks. The objective is to encourage further research into the combined potential of FMs and FL, laying the groundwork for groundbreaking healthcare innovations.

Submitted 10 May, 2024; originally announced May 2024.

Comments: 42 pages

arXiv:2405.05993 [pdf]

Precision Rehabilitation for Patients Post-Stroke based on Electronic Health Records and Machine Learning

Authors: Fengyi Gao, Xingyu Zhang, Sonish Sivarajkumar, Parker Denny, Bayan Aldhahwani, Shyam Visweswaran, Ryan Shi, William Hogan, Allyn Bove, Yanshan Wang

Abstract: In this study, we utilized statistical analysis and machine learning methods to examine whether rehabilitation exercises can improve patients' post-stroke functional abilities, as well as to forecast the improvement in functional abilities. Our dataset comprises patients' rehabilitation exercises and demographic information recorded in unstructured electronic health records (EHRs) data and free-text rehabilitation procedure notes. We collected data for 265 stroke patients from the University of Pittsburgh Medical Center. We employed a pre-existing natural language processing (NLP) algorithm to extract data on rehabilitation exercises and developed a rule-based NLP algorithm to extract Activity Measure for Post-Acute Care (AM-PAC) scores, covering basic mobility (BM) and applied cognitive (AC) domains, from procedure notes. Changes in AM-PAC scores were classified based on the minimal clinically important difference (MCID), and significance was assessed using Friedman and Wilcoxon tests. To identify impactful exercises, we used Chi-square tests, Fisher's exact tests, and logistic regression for odds ratios. Additionally, we developed five machine learning models, namely logistic regression (LR), Adaboost (ADB), support vector machine (SVM), gradient boosting (GB), and random forest (RF), to predict outcomes in functional ability. Statistical analyses revealed significant associations between functional improvements and specific exercises. The RF model achieved the best performance in predicting functional outcomes.
In this study, we identified three rehabilitation exercises that significantly contributed to patients' post-stroke functional ability improvement in the first two months. Additionally, the successful application of a machine learning model to predict patient-specific functional outcomes underscores the potential for precision rehabilitation.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2405.04861 [pdf, other]

Insights into Deep Learning Refactoring: Bridging the Gap Between Practices and Expectations

Authors: SiQi Wang, Xing Hu, Bei Wang, WenXin Yao, Xin Xia, XingYu Wang

Abstract: With the rapid development of deep learning, the implementation of intricate algorithms and substantial data processing have become standard elements of deep learning projects. As a result, the code has become progressively more complex as the software evolves, making it difficult to maintain and understand. Existing studies have investigated the impact of refactoring on software quality within traditional software. However, insight into code refactoring in the context of deep learning is still lacking. This study endeavors to fill this knowledge gap by empirically examining the current state of code refactoring in the deep learning realm, and practitioners' views on refactoring. We first manually analyzed the commit history of five popular and well-maintained deep learning projects (e.g., PyTorch). We mined 4,921 refactoring practices in historical commits and measured how different types and elements of refactoring operations are distributed, and found that the distribution of refactoring operation types in deep learning projects differs from that in traditional Java software. We then surveyed 159 practitioners about their views of code refactoring in deep learning projects and their expectations of current refactoring tools. The result of the survey showed that refactoring research and the development of related tools in the field of deep learning are crucial for improving project maintainability and code quality, and that current refactoring tools do not adequately meet the needs of practitioners.
Lastly, we provided our perspective on the future advancement of refactoring tools and offered suggestions for developers' development practices.

Submitted 8 May, 2024; originally announced May 2024.

Comments: 24 pages, 18 figures

arXiv:2405.04782 [pdf, other]

Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection

Authors: Zhaoxiang Zhang, Hanqiu Deng, **an Bao, Xingyu Li

Abstract: Image Anomaly Detection has been a challenging task in Computer Vision field. The advent of Vision-Language models, particularly the rise of CLIP-based frameworks, has opened new avenues for zero-shot anomaly detection. Recent studies have explored the use of CLIP by aligning images with normal and prompt descriptions. However, the exclusive dependence on textual guidance often falls short, highlighting the critical importance of additional visual references. In this work, we introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system. Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context. This dual-image strategy markedly enhanced both anomaly classification and localization performances. Furthermore, we have strengthened our model with a test-time adaptation module that incorporates synthesized anomalies to refine localization capabilities. Our approach significantly exploits the potential of vision-language joint anomaly detection and demonstrates comparable performance with current SOTA methods across various datasets.

Submitted 7 May, 2024; originally announced May 2024.

arXiv:2405.03534 [pdf, other]

Meta-Evolve: Continuous Robot Evolution for One-to-many Policy Transfer

Authors: Xingyu Liu, Deepak Pathak, Ding Zhao

Abstract: We investigate the problem of transferring an expert policy from a source robot to multiple different robots. To solve this problem, we propose a method named $Meta$-$Evolve$ that uses continuous robot evolution to efficiently transfer the policy to each target robot through a set of tree-structured evolutionary robot sequences. The robot evolution tree allows the robot evolution paths to be shared, so our approach can significantly outperform naive one-to-one policy transfer. We present a heuristic approach to determine an optimized robot evolution tree. Experiments have shown that our method is able to improve the efficiency of one-to-three transfer of manipulation policy by up to 3.2$\times$ and one-to-six transfer of agile locomotion policy by 2.4$\times$ in terms of simulation cost over the baseline of launching multiple independent one-to-one policy transfers.

Submitted 6 May, 2024; originally announced May 2024.

Comments: ICLR 2024

Showing 1–50 of 803 results for author: Xingyu