Search | arXiv e-print repository

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Authors: Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang

Abstract: Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debug… ▽ More Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at https://github.com/shirley-wu/vdebugger/ △ Less

Submitted 27 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

Comments: update reference

arXiv:2406.12799 [pdf, ps, other]

Sample-Based Matroid Prophet Inequalities

Authors: Hu Fu, Pinyan Lu, Zhihao Gavin Tang, Hongxun Wu, **zhao Wu, Qianfan Zhang

Abstract: We study matroid prophet inequalities when distributions are unknown and accessible only through samples. While single-sample prophet inequalities for special matroids are known, no constant-factor competitive algorithm with even a sublinear number of samples was known for general matroids. Adding more to the stake, the single-sample version of the question for general matroids has close (two-way)… ▽ More We study matroid prophet inequalities when distributions are unknown and accessible only through samples. While single-sample prophet inequalities for special matroids are known, no constant-factor competitive algorithm with even a sublinear number of samples was known for general matroids. Adding more to the stake, the single-sample version of the question for general matroids has close (two-way) connections with the long-standing matroid secretary conjecture. In this work, we give a $(\frac14 - \varepsilon)$-competitive matroid prophet inequality with only $O_\varepsilon(\mathrm{poly} \log n)$ samples. Our algorithm consists of two parts: (i) a novel quantile-based reduction from matroid prophet inequalities to online contention resolution schemes (OCRSs) with $O_\varepsilon(\log n)$ samples, and (ii) a $(\frac14 - \varepsilon)$-selectable matroid OCRS with $O_\varepsilon(\mathrm{poly} \log n)$ samples which carefully addresses an adaptivity challenge. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: To appear at EC'24

arXiv:2406.09411 [pdf, other]

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Authors: Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen

Abstract: We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a… ▽ More We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements. △ Less

Submitted 1 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: typos corrected, references added, Project Page: https://muirbench.github.io/

arXiv:2406.08552 [pdf, other]

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Authors: Zhihang Yuan, Pu Lu, Hanling Zhang, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention's quadratic complexity. We propose DiTFastAttn, a novel post-training compression method to alleviate DiT's computational bottleneck. We identify three key redundancies in the attention computation during DiT inference: 1. spatial redundancy, where many attention heads focus on… ▽ More Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to self-attention's quadratic complexity. We propose DiTFastAttn, a novel post-training compression method to alleviate DiT's computational bottleneck. We identify three key redundancies in the attention computation during DiT inference: 1. spatial redundancy, where many attention heads focus on local information; 2. temporal redundancy, with high similarity between neighboring steps' attention outputs; 3. conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. To tackle these redundancies, we propose three techniques: 1. Window Attention with Residual Caching to reduce spatial redundancy; 2. Temporal Similarity Reduction to exploit the similarity between steps; 3. Conditional Redundancy Elimination to skip redundant computations during conditional generation. To demonstrate the effectiveness of DiTFastAttn, we apply it to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Evaluation results show that for image generation, our method reduces up to 88\% of the FLOPs and achieves up to 1.6x speedup at high resolution generation. △ Less

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.19716 [pdf, other]

Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Authors: Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang

Abstract: Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have b… ▽ More Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings to alleviate the need for labeled data by leveraging model's own generation. However, effective self-training remains a challenge regarding the unique visual perception and reasoning capability of LVLMs. To address this, we introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: 19 pages, 14 figures, 6 tables

arXiv:2405.13872 [pdf, other]

Image-of-Thought Prompting for Visual Reasoning Refinement in Multimodal Large Language Models

Authors: Qiji Zhou, Ruochen Zhou, Zike Hu, Panzhong Lu, Siyang Gao, Yue Zhang

Abstract: Recent advancements in Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks. With the evolution of Multimodal Large Language Models (MLLMs), enhancing their capability to tackle complex multimodal reasoning problems is a crucial frontier. However, incorporating multimodal rationales in CoT ha… ▽ More Recent advancements in Chain-of-Thought (CoT) and related rationale-based works have significantly improved the performance of Large Language Models (LLMs) in complex reasoning tasks. With the evolution of Multimodal Large Language Models (MLLMs), enhancing their capability to tackle complex multimodal reasoning problems is a crucial frontier. However, incorporating multimodal rationales in CoT has yet to be thoroughly investigated. We propose the Image-of-Thought (IoT) prompting method, which helps MLLMs to extract visual rationales step-by-step. Specifically, IoT prompting can automatically design critical visual information extraction operations based on the input images and questions. Each step of visual information refinement identifies specific visual rationales that support answers to complex visual reasoning questions. Beyond the textual CoT, IoT simultaneously utilizes visual and textual rationales to help MLLMs understand complex multimodal information. IoT prompting has improved zero-shot visual reasoning performance across various visual understanding tasks in different MLLMs. Moreover, the step-by-step visual feature explanations generated by IoT prompting elucidate the visual reasoning process, aiding in analyzing the cognitive processes of large multimodal models △ Less

Submitted 28 May, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

Comments: Correct the case title

arXiv:2405.13089 [pdf, other]

SEGAN: semi-supervised learning approach for missing data imputation

Authors: Xiaohua Pan, Weifeng Wu, Peiran Liu, Zhen Li, Peng Lu, Peijian Cao, Jianfeng Zhang, Xianfei Qiu, YangYang Wu

Abstract: In many practical real-world applications, data missing is a very common phenomenon, making the development of data-driven artificial intelligence theory and technology increasingly difficult. Data completion is an important method for missing data preprocessing. Most existing miss-ing data completion models directly use the known information in the missing data set but ignore the impact of the da… ▽ More In many practical real-world applications, data missing is a very common phenomenon, making the development of data-driven artificial intelligence theory and technology increasingly difficult. Data completion is an important method for missing data preprocessing. Most existing miss-ing data completion models directly use the known information in the missing data set but ignore the impact of the data label information contained in the data set on the missing data completion model. To this end, this paper proposes a missing data completion model SEGAN based on semi-supervised learning, which mainly includes three important modules: generator, discriminator and classifier. In the SEGAN model, the classifier enables the generator to make more full use of known data and its label information when predicting missing data values. In addition, the SE-GAN model introduces a missing hint matrix to allow the discriminator to more effectively distinguish between known data and data filled by the generator. This paper theoretically proves that the SEGAN model that introduces a classifier and a missing hint matrix can learn the real known data distribution characteristics when reaching Nash equilibrium. Finally, a large number of experiments were conducted in this article, and the experimental results show that com-pared with the current state-of-the-art multivariate data completion method, the performance of the SEGAN model is improved by more than 3%. △ Less

Submitted 12 June, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2404.10696 [pdf, other]

Integrating knowledge bases to improve coreference and bridging resolution for the chemical domain

Authors: Pengcheng Lu, Massimo Poesio

Abstract: Resolving coreference and bridging relations in chemical patents is important for better understanding the precise chemical process, where chemical domain knowledge is very critical. We proposed an approach incorporating external knowledge into a multi-task learning model for both coreference and bridging resolution in the chemical domain. The results show that integrating external knowledge can b… ▽ More Resolving coreference and bridging relations in chemical patents is important for better understanding the precise chemical process, where chemical domain knowledge is very critical. We proposed an approach incorporating external knowledge into a multi-task learning model for both coreference and bridging resolution in the chemical domain. The results show that integrating external knowledge can benefit both chemical coreference and bridging resolution. △ Less

Submitted 16 April, 2024; originally announced April 2024.

Comments: working in progress

arXiv:2404.06252 [pdf, other]

Design and Characterization of Strategy-Proof Mechanisms for Two-Facility Game on a Line

Authors: Pinyan Lu, Zihan Luo, Jialin Zhang

Abstract: We focus on the problem of placing two facilities along a linear space to serve a group of agents. Each agent is committed to minimizing the distance between her location and the closest facility. A mechanism is an algorithm that maps the reported agent locations to the facility locations. We are interested in mechanisms without money that are deterministic, strategy-proof, and provide a bounded a… ▽ More We focus on the problem of placing two facilities along a linear space to serve a group of agents. Each agent is committed to minimizing the distance between her location and the closest facility. A mechanism is an algorithm that maps the reported agent locations to the facility locations. We are interested in mechanisms without money that are deterministic, strategy-proof, and provide a bounded approximation ratio for social cost. It is a fundamental problem to characterize the family of strategy-proof mechanisms with a bounded approximation ratio. Fotakis and Tzamos already demonstrated that the deterministic strategy-proof mechanisms for the 2-facility game problem are mechanisms with a unique dictator and the leftmost-rightmost mechanism. In this paper, we first present a more refined characterization of the first family. We then reveal three new classes of strategy-proof mechanisms that show the intricacy of structure within this family. This helps us get a more complete picture of the characterization of the 2-facility game problem, and may also have value in understanding and solving more general facility allocation game problems. Besides, based on our refined characterization, we surprisingly find that prediction cannot effectively improve the performance of the mechanism in the two-facility game problem, while this methodology to overcome bad approximation ratio works in many other mechanism design problems. We show that if we require that the mechanism admits a bounded approximation ratio when the prediction is arbitrarily bad, then at the same time, the mechanism can never achieve sublinear approximation ratios even with perfect prediction. △ Less

Submitted 9 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2403.19484 [pdf, other]

Improved Genetic Algorithm Based on Greedy and Simulated Annealing Ideas for Vascular Robot Ordering Strategy

Authors: Zixi Wang, Yubo Huang, Yukai Zhang, Yifei Sheng, Xin Lai, Peng Lu

Abstract: This study presents a comprehensive approach for optimizing the acquisition, utilization, and maintenance of ABLVR vascular robots in healthcare settings. Medical robotics, particularly in vascular treatments, necessitates precise resource allocation and optimization due to the complex nature of robot and operator maintenance. Traditional heuristic methods, though intuitive, often fail to achieve… ▽ More This study presents a comprehensive approach for optimizing the acquisition, utilization, and maintenance of ABLVR vascular robots in healthcare settings. Medical robotics, particularly in vascular treatments, necessitates precise resource allocation and optimization due to the complex nature of robot and operator maintenance. Traditional heuristic methods, though intuitive, often fail to achieve global optimization. To address these challenges, this research introduces a novel strategy, combining mathematical modeling, a hybrid genetic algorithm, and ARIMA time series forecasting. Considering the dynamic healthcare environment, our approach includes a robust resource allocation model for robotic vessels and operators. We incorporate the unique requirements of the adaptive learning process for operators and the maintenance needs of robotic components. The hybrid genetic algorithm, integrating simulated annealing and greedy approaches, efficiently solves the optimization problem. Additionally, ARIMA time series forecasting predicts the demand for vascular robots, further enhancing the adaptability of our strategy. Experimental results demonstrate the superiority of our approach in terms of optimization, transparency, and convergence speed from other state-of-the-art methods. △ Less

Submitted 2 July, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: 17 pages

arXiv:2403.14624 [pdf, other]

MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Authors: Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li

Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in… ▽ More The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io △ Less

Submitted 21 March, 2024; originally announced March 2024.

Comments: 46 Pages, Work in Progress, Benchmark Project Page: https://mathverse-cuhk.github.io

arXiv:2403.05225 [pdf, other]

Trust Recognition in Human-Robot Cooperation Using EEG

Authors: Caiyue Xu, Changming Zhang, Yanmin Zhou, Zhipeng Wang, ** Lu, Bin He

Abstract: Collaboration between humans and robots is becoming increasingly crucial in our daily life. In order to accomplish efficient cooperation, trust recognition is vital, empowering robots to predict human behaviors and make trust-aware decisions. Consequently, there is an urgent need for a generalized approach to recognize human-robot trust. This study addresses this need by introducing an EEG-based m… ▽ More Collaboration between humans and robots is becoming increasingly crucial in our daily life. In order to accomplish efficient cooperation, trust recognition is vital, empowering robots to predict human behaviors and make trust-aware decisions. Consequently, there is an urgent need for a generalized approach to recognize human-robot trust. This study addresses this need by introducing an EEG-based method for trust recognition during human-robot cooperation. A human-robot cooperation game scenario is used to stimulate various human trust levels when working with robots. To enhance recognition performance, the study proposes an EEG Vision Transformer model coupled with a 3-D spatial representation to capture the spatial information of EEG, taking into account the topological relationship among electrodes. To validate this approach, a public EEG-based human trust dataset called EEGTrust is constructed. Experimental results indicate the effectiveness of the proposed approach, achieving an accuracy of 74.99% in slice-wise cross-validation and 62.00% in trial-wise cross-validation. This outperforms baseline models in both recognition accuracy and generalization. Furthermore, an ablation study demonstrates a significant improvement in trust recognition performance of the spatial representation. The source code and EEGTrust dataset are available at https://github.com/CaiyueXu/EEGTrust. △ Less

Submitted 8 March, 2024; originally announced March 2024.

Comments: Accepted at IEEE International Conference on Robotics and Automation (ICRA) 2024

arXiv:2403.03740 [pdf, other]

Self-supervised Photographic Image Layout Representation Learning

Authors: Zhaoran Zhao, Peng Lu, Xujun Peng, Wenhao Guo

Abstract: In the domain of image layout representation learning, the critical process of translating image layouts into succinct vector forms is increasingly significant across diverse applications, such as image retrieval, manipulation, and generation. Most approaches in this area heavily rely on costly labeled datasets and notably lack in adapting their modeling and learning methods to the specific nuance… ▽ More In the domain of image layout representation learning, the critical process of translating image layouts into succinct vector forms is increasingly significant across diverse applications, such as image retrieval, manipulation, and generation. Most approaches in this area heavily rely on costly labeled datasets and notably lack in adapting their modeling and learning methods to the specific nuances of photographic image layouts. This shortfall makes the learning process for photographic image layouts suboptimal. In our research, we directly address these challenges. We innovate by defining basic layout primitives that encapsulate various levels of layout information and by map** these, along with their interconnections, onto a heterogeneous graph structure. This graph is meticulously engineered to capture the intricate layout information within the pixel domain explicitly. Advancing further, we introduce novel pretext tasks coupled with customized loss functions, strategically designed for effective self-supervised learning of these layout graphs. Building on this foundation, we develop an autoencoder-based network architecture skilled in compressing these heterogeneous layout graphs into precise, dimensionally-reduced layout representations. Additionally, we introduce the LODB dataset, which features a broader range of layout categories and richer semantics, serving as a comprehensive benchmark for evaluating the effectiveness of layout representation learning methods. Our extensive experimentation on this dataset demonstrates the superior performance of our approach in the realm of photographic image layout representation learning. △ Less

Submitted 6 March, 2024; originally announced March 2024.

arXiv:2403.00071 [pdf, other]

Resonance RoPE: Improving Context Length Generalization of Large Language Models

Authors: Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, Bang Liu

Abstract: This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refi… ▽ More This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD position better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications. △ Less

Submitted 10 June, 2024; v1 submitted 29 February, 2024; originally announced March 2024.

Comments: 13 pages, 4 figures, accepted at ACL 2024 Findings

arXiv:2402.17644 [pdf, other]

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data

Authors: Xiao Liu, Zirui Wu, Xueqing Wu, Pan Lu, Kai-Wei Chang, Yansong Feng

Abstract: Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied b… ▽ More Quantitative reasoning is a critical skill to analyze data, yet the assessment of such ability remains limited. To address this gap, we introduce the Quantitative Reasoning with Data (QRData) benchmark, aiming to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a carefully constructed dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText. We evaluate natural language reasoning, program-based reasoning, and agent reasoning methods including Chain-of-Thought, Program-of-Thoughts, ReAct, and code interpreter assistants on diverse models. The strongest model GPT-4 achieves an accuracy of 58%, which has much room for improvement. Among open-source models, Deepseek-coder-instruct, a code LLM pretrained on 2T tokens, gets the highest accuracy of 37%. Analysis reveals that models encounter difficulties in data analysis and causal reasoning, and struggle in using causal knowledge and provided data simultaneously. Code and data are in https://github.com/xxxiaol/QRData. △ Less

Submitted 9 June, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: Findings of ACL 2024. Project website: https://xxxiaol.github.io/QRData/

arXiv:2402.05935 [pdf, other]

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Authors: Dongyang Liu, Renrui Zhang, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng **, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao, Peng Gao

Abstract: We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we… ▽ More We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory △ Less

Submitted 26 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

arXiv:2401.17592 [pdf, other]

doi 10.1016/j.inffus.2024.102344

Local Feature Matching Using Deep Learning: A Survey

Authors: Shibiao Xu, Shunpeng Chen, Rongtao Xu, Changwei Wang, Peng Lu, Li Guo

Abstract: Local feature matching enjoys wide-ranging applications in the realm of computer vision, encompassing domains such as image retrieval, 3D reconstruction, and object recognition. However, challenges persist in improving the accuracy and robustness of matching due to factors like viewpoint and lighting variations. In recent years, the introduction of deep learning models has sparked widespread explo… ▽ More Local feature matching enjoys wide-ranging applications in the realm of computer vision, encompassing domains such as image retrieval, 3D reconstruction, and object recognition. However, challenges persist in improving the accuracy and robustness of matching due to factors like viewpoint and lighting variations. In recent years, the introduction of deep learning models has sparked widespread exploration into local feature matching techniques. The objective of this endeavor is to furnish a comprehensive overview of local feature matching methods. These methods are categorized into two key segments based on the presence of detectors. The Detector-based category encompasses models inclusive of Detect-then-Describe, Joint Detection and Description, Describe-then-Detect, as well as Graph Based techniques. In contrast, the Detector-free category comprises CNN Based, Transformer Based, and Patch Based methods. Our study extends beyond methodological analysis, incorporating evaluations of prevalent datasets and metrics to facilitate a quantitative comparison of state-of-the-art techniques. The paper also explores the practical application of local feature matching in diverse domains such as Structure from Motion, Remote Sensing Image Registration, and Medical Image Registration, underscoring its versatility and significance across various fields. Ultimately, we endeavor to outline the current challenges faced in this domain and furnish future research directions, thereby serving as a reference for researchers involved in local feature matching and its interconnected domains. A comprehensive list of studies in this survey is available at https://github.com/vignywang/Awesome-Local-Feature-Matching . △ Less

Submitted 10 March, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

Comments: Accepted by Information Fusion 2024. Project page: https://github.com/vignywang/Awesome-Local-Feature-Matching

arXiv:2401.04700 [pdf, other]

Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue

Authors: Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, Nanyun Peng

Abstract: Model editing is a technique that edits the large language models (LLMs) with updated knowledge to alleviate hallucinations without resource-intensive retraining. While current model editing methods can effectively modify a model's behavior within a specific area of interest, they often overlook the potential unintended side effects on the general abilities of LLMs such as reasoning, natural langu… ▽ More Model editing is a technique that edits the large language models (LLMs) with updated knowledge to alleviate hallucinations without resource-intensive retraining. While current model editing methods can effectively modify a model's behavior within a specific area of interest, they often overlook the potential unintended side effects on the general abilities of LLMs such as reasoning, natural language inference, and question answering. In this paper, we raise concerns that model editing's improvements on factuality may come at the cost of a significant degradation of the model's general abilities. We systematically analyze the side effects by evaluating four popular editing methods on three LLMs across eight representative tasks. Our extensive empirical experiments show that it is challenging for current editing methods to simultaneously improve factuality of LLMs and maintain their general abilities. Our analysis reveals that the side effects are caused by model editing altering the original model weights excessively, leading to overfitting to the edited facts. To mitigate this, a method named RECT (RElative Change in weighT) is proposed to regularize the edit update weights. Evaluation results show that RECT can significantly mitigate the side effects of editing while still maintaining over 94% editing performance. △ Less

Submitted 16 June, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: Propose a new regularization method

arXiv:2312.11911 [pdf, other]

EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State Estimation and 3D Dense Map**

Authors: Weipeng Guan, Peiyu Chen, Huibin Zhao, Yu Wang, Peng Lu

Abstract: Event cameras are bio-inspired, motion-activated sensors that demonstrate substantial potential in handling challenging situations, such as motion blur and high-dynamic range. In this paper, we proposed EVI-SAM to tackle the problem of 6 DoF pose tracking and 3D reconstruction using monocular event camera. A novel event-based hybrid tracking framework is designed to estimate the pose, leveraging t… ▽ More Event cameras are bio-inspired, motion-activated sensors that demonstrate substantial potential in handling challenging situations, such as motion blur and high-dynamic range. In this paper, we proposed EVI-SAM to tackle the problem of 6 DoF pose tracking and 3D reconstruction using monocular event camera. A novel event-based hybrid tracking framework is designed to estimate the pose, leveraging the robustness of feature matching and the precision of direct alignment. Specifically, we develop an event-based 2D-2D alignment to construct the photometric constraint, and tightly integrate it with the event-based reprojection constraint. The map** module recovers the dense and colorful depth of the scene through the image-guided event-based map** method. Subsequently, the appearance, texture, and surface mesh of the 3D scene can be reconstructed by fusing the dense depth map from multiple viewpoints using truncated signed distance function (TSDF) fusion. To the best of our knowledge, this is the first non-learning work to realize event-based dense map**. Numerical evaluations are performed on both publicly available and self-collected datasets, which qualitatively and quantitatively demonstrate the superior performance of our method. Our EVI-SAM effectively balances accuracy and robustness while maintaining computational efficiency, showcasing superior pose tracking and dense map** performance in challenging scenarios. Video Demo: https://youtu.be/Nn40U4e5Si8. △ Less

Submitted 23 May, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

arXiv:2312.08743 [pdf, other]

FAPP: Fast and Adaptive Perception and Planning for UAVs in Dynamic Cluttered Environments

Authors: Minghao Lu, Xiyu Fan, Han Chen, Peng Lu

Abstract: Obstacle avoidance for Unmanned Aerial Vehicles (UAVs) in cluttered environments is significantly challenging. Existing obstacle avoidance for UAVs either focuses on fully static environments or static environments with only a few dynamic objects. In this paper, we take the initiative to consider the obstacle avoidance of UAVs in dynamic cluttered environments in which dynamic objects are the domi… ▽ More Obstacle avoidance for Unmanned Aerial Vehicles (UAVs) in cluttered environments is significantly challenging. Existing obstacle avoidance for UAVs either focuses on fully static environments or static environments with only a few dynamic objects. In this paper, we take the initiative to consider the obstacle avoidance of UAVs in dynamic cluttered environments in which dynamic objects are the dominant objects. This type of environment poses significant challenges to both perception and planning. Multiple dynamic objects possess various motions, making it extremely difficult to estimate and predict their motions using one motion model. The planning must be highly efficient to avoid cluttered dynamic objects. This paper proposes Fast and Adaptive Perception and Planning (FAPP) for UAVs flying in complex dynamic cluttered environments. A novel and efficient point cloud segmentation strategy is proposed to distinguish static and dynamic objects. To address multiple dynamic objects with different motions, an adaptive estimation method with covariance adaptation is proposed to quickly and accurately predict their motions. Our proposed trajectory optimization algorithm is highly efficient, enabling it to avoid fast objects. Furthermore, an adaptive re-planning method is proposed to address the case when the trajectory optimization cannot find a feasible solution, which is common for dynamic cluttered environments. Extensive validations in both simulation and real-world experiments demonstrate the effectiveness of our proposed system for highly dynamic and cluttered environments. △ Less

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.07526 [pdf, other]

RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

Authors: Peng Lu, Tao Jiang, Yining Li, Xiangtai Li, Kai Chen, Wenming Yang

Abstract: Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordi… ▽ More Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo. △ Less

Submitted 8 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: Accepted at CVPR 2024. Project page: https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo

arXiv:2312.00949 [pdf, other]

Hyperparameter Optimization for Large Language Model Instruction-Tuning

Authors: Christophe Tribes, Sacha Benarroch-Lelong, Peng Lu, Ivan Kobyzev

Abstract: The fine-tuning of Large Language Models (LLMs) has enabled them to recently achieve milestones in natural language processing applications. The emergence of ever larger LLMs has paved the way for more efficient fine-tuning methods. Among these, the Low-Rank Adaptation (LoRA) method keeps most of the weights of the pre-trained LLM frozen while introducing a low-rank decomposition of the weight mat… ▽ More The fine-tuning of Large Language Models (LLMs) has enabled them to recently achieve milestones in natural language processing applications. The emergence of ever larger LLMs has paved the way for more efficient fine-tuning methods. Among these, the Low-Rank Adaptation (LoRA) method keeps most of the weights of the pre-trained LLM frozen while introducing a low-rank decomposition of the weight matrix, enabling the tuning of only a very small proportion of the network. The performance on downstream tasks of models fine-tuned with LoRA heavily relies on a set of hyperparameters including the rank of the decomposition. In this work, we investigate the choice of these hyperparameters through two main blackbox optimization (BBO) techniques. We examine the whole pipeline of performing fine-tuning and validation on a pre-trained LLM as a blackbox and efficiently explore the space of hyperparameters with the \nomad algorithm, achieving a boost in performance and human alignment of the tuned model. △ Less

Submitted 30 January, 2024; v1 submitted 1 December, 2023; originally announced December 2023.

arXiv:2312.00111 [pdf, other]

Multimodal Learning for Materials

Authors: Viggo Moro, Charlotte Loh, Rumen Dangovski, Ali Ghorashi, Andrew Ma, Zhuo Chen, Samuel Kim, Peter Y. Lu, Thomas Christensen, Marin Soljačić

Abstract: Artificial intelligence is transforming computational materials science, improving the prediction of material properties, and accelerating the discovery of novel materials. Recently, publicly available material data repositories have grown rapidly. This growth encompasses not only more materials, but also a greater variety and quantity of their associated properties. Existing machine learning effo… ▽ More Artificial intelligence is transforming computational materials science, improving the prediction of material properties, and accelerating the discovery of novel materials. Recently, publicly available material data repositories have grown rapidly. This growth encompasses not only more materials, but also a greater variety and quantity of their associated properties. Existing machine learning efforts in materials science focus primarily on single-modality tasks, i.e., relationships between materials and a single physical property, thus not taking advantage of the rich and multimodal set of material properties. Here, we introduce Multimodal Learning for Materials (MultiMat), which enables self-supervised multi-modality training of foundation models for materials. We demonstrate our framework's potential using data from the Materials Project database on multiple axes: (i) MultiMat achieves state-of-the-art performance for challenging material property prediction tasks; (ii) MultiMat enables novel and accurate material discovery via latent space similarity, enabling screening for stable materials with desired properties; and (iii) MultiMat encodes interpretable emergent features that may provide novel scientific insights. △ Less

Submitted 12 April, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

Comments: 11 pages, 4 figures

arXiv:2311.02327 [pdf, other]

ECMD: An Event-Centric Multisensory Driving Dataset for SLAM

Authors: Peiyu Chen, Weipeng Guan, Feng Huang, Yihan Zhong, Weisong Wen, Li-Ta Hsu, Peng Lu

Abstract: Leveraging multiple sensors enhances complex environmental perception and increases resilience to varying luminance conditions and high-speed motion patterns, achieving precise localization and map**. This paper proposes, ECMD, an event-centric multisensory dataset containing 81 sequences and covering over 200 km of various challenging driving scenarios including high-speed motion, repetitive sc… ▽ More Leveraging multiple sensors enhances complex environmental perception and increases resilience to varying luminance conditions and high-speed motion patterns, achieving precise localization and map**. This paper proposes, ECMD, an event-centric multisensory dataset containing 81 sequences and covering over 200 km of various challenging driving scenarios including high-speed motion, repetitive scenarios, dynamic objects, etc. ECMD provides data from two sets of stereo event cameras with different resolutions (640*480, 346*260), stereo industrial cameras, an infrared camera, a top-installed mechanical LiDAR with two slanted LiDARs, two consumer-level GNSS receivers, and an onboard IMU. Meanwhile, the ground-truth of the vehicle was obtained using a centimeter-level high-accuracy GNSS-RTK/INS navigation system. All sensors are well-calibrated and temporally synchronized at the hardware level, with recording data simultaneously. We additionally evaluate several state-of-the-art SLAM algorithms for benchmarking visual and LiDAR SLAM and identifying their limitations. The dataset is available at https://arclab-hku.github.io/ecmd/. △ Less

Submitted 4 November, 2023; originally announced November 2023.

arXiv:2310.11954 [pdf, other]

MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

Authors: Dingyao Yu, Kaitao Song, Peiling Lu, Tianyu He, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian

Abstract: AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data… ▽ More AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these task to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data and the model applicability across platforms among various tasks. Consequently, it is necessary to build a system to organize and integrate these tasks, and thus help practitioners to automatically analyze their demand and call suitable tools as solutions to fulfill their requirements. Inspired by the recent success of large language models (LLMs) in task automation, we develop a system, named MusicAgent, which integrates numerous music-related tools and an autonomous workflow to address user requirements. More specifically, we build 1) toolset that collects tools from diverse sources, including Hugging Face, GitHub, and Web API, etc. 2) an autonomous workflow empowered by LLMs (e.g., ChatGPT) to organize these tools and automatically decompose user requests into multiple sub-tasks and invoke corresponding music tools. The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect. By granting users the freedom to effortlessly combine tools, the system offers a seamless and enriching music experience. △ Less

Submitted 25 October, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.10967 [pdf, other]

EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset

Authors: Hang Yin, Pinren Lu, Ziang Li, Bin Sun, Kan Li

Abstract: The need for high-quality data has been a key issue hindering the research of dialogue tasks. Recent studies try to build datasets through manual, web crawling, and large pre-trained models. However, man-made data is expensive and data collected from the internet often includes generic responses, meaningless statements, and toxic dialogues. Automatic data generation through large models is a cost-… ▽ More The need for high-quality data has been a key issue hindering the research of dialogue tasks. Recent studies try to build datasets through manual, web crawling, and large pre-trained models. However, man-made data is expensive and data collected from the internet often includes generic responses, meaningless statements, and toxic dialogues. Automatic data generation through large models is a cost-effective method, but for open-domain multimodal dialogue tasks, there are still three drawbacks: 1) There is currently no open-source large model that can accept multimodal input; 2) The content generated by the model lacks interpretability; 3) The generated data is usually difficult to quality control and require extensive resource to collect. To alleviate the significant human and resource expenditure in data collection, we propose a Multimodal Data Construction Framework (MDCF). MDCF designs proper prompts to spur the large-scale pre-trained language model to generate well-formed and satisfactory content. Additionally, MDCF also automatically provides explanation for a given image and its corresponding dialogue, which can provide a certain degree of interpretability and facilitate manual follow-up quality inspection. Based on this, we release an Explanatory Multimodal Open-Domain dialogue dataset (EXMODD). Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses. Our code and data can be found at https://github.com/poplpr/EXMODD. △ Less

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.02995

IBCL: Zero-shot Model Generation for Task Trade-offs in Continual Learning

Authors: Pengyuan Lu, Michele Caprio, Eric Eaton, Insup Lee

Abstract: Like generic multi-task learning, continual learning has the nature of multi-objective optimization, and therefore faces a trade-off between the performance of different tasks. That is, to optimize for the current task distribution, it may need to compromise performance on some previous tasks. This means that there exist multiple models that are Pareto-optimal at different times, each addressing a… ▽ More Like generic multi-task learning, continual learning has the nature of multi-objective optimization, and therefore faces a trade-off between the performance of different tasks. That is, to optimize for the current task distribution, it may need to compromise performance on some previous tasks. This means that there exist multiple models that are Pareto-optimal at different times, each addressing a distinct task performance trade-off. Researchers have discussed how to train particular models to address specific trade-off preferences. However, existing algorithms require training overheads proportional to the number of preferences -- a large burden when there are multiple, possibly infinitely many, preferences. As a response, we propose Imprecise Bayesian Continual Learning (IBCL). Upon a new task, IBCL (1) updates a knowledge base in the form of a convex hull of model parameter distributions and (2) obtains particular models to address task trade-off preferences with zero-shot. That is, IBCL does not require any additional training overhead to generate preference-addressing models from its knowledge base. We show that models obtained by IBCL have guarantees in identifying the Pareto optimal parameters. Moreover, experiments on standard image classification and NLP tasks support this guarantee. Statistically, IBCL improves average per-task accuracy by at most 23\% and peak per-task accuracy by at most 15\% with respect to the baseline methods, with steadily near-zero or positive backward transfer. Most importantly, IBCL significantly reduces the training overhead from training 1 model per preference to at most 3 models for all preferences. △ Less

Submitted 9 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: This should be a replacement for arXiv:2305.14782. I falsely submitted a new paper

arXiv:2310.02255 [pdf, other]

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Authors: Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, Jianfeng Gao

Abstract: Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived… ▽ More Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at https://mathvista.github.io/. △ Less

Submitted 20 January, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

Comments: 116 pages, 120 figures. Accepted to ICLR 2024

arXiv:2309.15414 [pdf, other]

Competitive Auctions with Imperfect Predictions

Authors: Pinyan Lu, Zongqi Wan, Jialin Zhang

Abstract: The competitive auction was first proposed by Goldberg, Hartline, and Wright. In their paper, they introduce the competitive analysis framework of online algorithm designing into the traditional revenue-maximizing auction design problem. While the competitive analysis framework only cares about the worst-case bound, a growing body of work in the online algorithm community studies the learning-augm… ▽ More The competitive auction was first proposed by Goldberg, Hartline, and Wright. In their paper, they introduce the competitive analysis framework of online algorithm designing into the traditional revenue-maximizing auction design problem. While the competitive analysis framework only cares about the worst-case bound, a growing body of work in the online algorithm community studies the learning-augmented framework. In this framework, designers are allowed to leverage imperfect machine-learned predictions of unknown information and pursue better theoretical guarantees when the prediction is accurate(consistency). Meanwhile, designers also need to maintain a nearly-optimal worst-case ratio(robustness). In this work, we revisit the competitive auctions in the learning-augmented setting. We leverage the imperfect predictions of the private value of the bidders and design the learning-augmented mechanisms for several competitive auctions with different constraints, including digital good auctions, limited-supply auctions, and general downward-closed permutation environments. For all these auction environments, our mechanisms enjoy $1$-consistency against the strongest benchmark $OPT$, which is impossible to achieve $O(1)$-competitive without predictions. At the same time, our mechanisms also maintain the $O(1)$-robustness against all benchmarks considered in the traditional competitive analysis. Considering the possible inaccuracy of the predictions, we provide a reduction that transforms our learning-augmented mechanisms into an error-tolerant version, which enables the learning-augmented mechanism to ensure satisfactory revenue in scenarios where the prediction error is moderate. △ Less

Submitted 18 June, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

Comments: Accepted by EC 2024. Improved the error-tolerant results

arXiv:2309.04735 [pdf, other]

Two-State Spin Systems with Negative Interactions

Authors: Yumou Fei, Leslie Ann Goldberg, Pinyan Lu

Abstract: We study the approximability of computing the partition functions of two-state spin systems. The problem is parameterized by a $2\times 2$ symmetric matrix. Previous results on this problem were restricted either to the case where the matrix has non-negative entries, or to the case where the diagonal entries are equal, i.e. Ising models. In this paper, we study the generalization to arbitrary… ▽ More We study the approximability of computing the partition functions of two-state spin systems. The problem is parameterized by a $2\times 2$ symmetric matrix. Previous results on this problem were restricted either to the case where the matrix has non-negative entries, or to the case where the diagonal entries are equal, i.e. Ising models. In this paper, we study the generalization to arbitrary $2\times 2$ interaction matrices with real entries. We show that in some regions of the parameter space, it's \#P-hard to even determine the sign of the partition function, while in other regions there are fully polynomial approximation schemes for the partition function. Our results reveal several new computational phase transitions. △ Less

Submitted 21 November, 2023; v1 submitted 9 September, 2023; originally announced September 2023.

arXiv:2309.00894 [pdf, other]

Regularly Truncated M-estimators for Learning with Noisy Labels

Authors: Xiaobo Xia, Pengqian Lu, Chen Gong, Bo Han, Jun Yu, Jun Yu, Tongliang Liu

Abstract: The sample selection approach is very popular in learning with noisy labels. As deep networks learn pattern first, prior methods built on sample selection share a similar training procedure: the small-loss examples can be regarded as clean examples and used for hel** generalization, while the large-loss examples are treated as mislabeled ones and excluded from network parameter updates. However,… ▽ More The sample selection approach is very popular in learning with noisy labels. As deep networks learn pattern first, prior methods built on sample selection share a similar training procedure: the small-loss examples can be regarded as clean examples and used for hel** generalization, while the large-loss examples are treated as mislabeled ones and excluded from network parameter updates. However, such a procedure is arguably debatable from two folds: (a) it does not consider the bad influence of noisy labels in selected small-loss examples; (b) it does not make good use of the discarded large-loss examples, which may be clean or have meaningful information for generalization. In this paper, we propose regularly truncated M-estimators (RTME) to address the above two issues simultaneously. Specifically, RTME can alternately switch modes between truncated M-estimators and original M-estimators. The former can adaptively select small-losses examples without knowing the noise rate and reduce the side-effects of noisy labels in them. The latter makes the possibly clean examples but with large losses involved to help generalization. Theoretically, we demonstrate that our strategies are label-noise-tolerant. Empirically, comprehensive experimental results show that our method can outperform multiple baselines and is robust to broad noise types and levels. △ Less

Submitted 2 September, 2023; originally announced September 2023.

Comments: 16 pages, 11 tables, 9 figures

arXiv:2308.07611 [pdf]

GAMER-MRIL identifies Disability-Related Brain Changes in Multiple Sclerosis

Authors: Po-Jui Lu, Benjamin Odry, Muhamed Barakovic, Matthias Weigel, Robin Sandkühler, Reza Rahmanzadeh, Xinjie Chen, Mario Ocampo-Pineda, Jens Kuhle, Ludwig Kappos, Philippe Cattin, Cristina Granziera

Abstract: Objective: Identifying disability-related brain changes is important for multiple sclerosis (MS) patients. Currently, there is no clear understanding about which pathological features drive disability in single MS patients. In this work, we propose a novel comprehensive approach, GAMER-MRIL, leveraging whole-brain quantitative MRI (qMRI), convolutional neural network (CNN), and an interpretability… ▽ More Objective: Identifying disability-related brain changes is important for multiple sclerosis (MS) patients. Currently, there is no clear understanding about which pathological features drive disability in single MS patients. In this work, we propose a novel comprehensive approach, GAMER-MRIL, leveraging whole-brain quantitative MRI (qMRI), convolutional neural network (CNN), and an interpretability method from classifying MS patients with severe disability to investigating relevant pathological brain changes. Methods: One-hundred-sixty-six MS patients underwent 3T MRI acquisitions. qMRI informative of microstructural brain properties was reconstructed, including quantitative T1 (qT1), myelin water fraction (MWF), and neurite density index (NDI). To fully utilize the qMRI, GAMER-MRIL extended a gated-attention-based CNN (GAMER-MRI), which was developed to select patch-based qMRI important for a given task/question, to the whole-brain image. To find out disability-related brain regions, GAMER-MRIL modified a structure-aware interpretability method, Layer-wise Relevance Propagation (LRP), to incorporate qMRI. Results: The test performance was AUC=0.885. qT1 was the most sensitive measure related to disability, followed by NDI. The proposed LRP approach obtained more specifically relevant regions than other interpretability methods, including the saliency map, the integrated gradients, and the original LRP. The relevant regions included the corticospinal tract, where average qT1 and NDI significantly correlated with patients' disability scores ($ρ$=-0.37 and 0.44). Conclusion: These results demonstrated that GAMER-MRIL can classify patients with severe disability using qMRI and subsequently identify brain regions potentially important to the integrity of the mobile function. Significance: GAMER-MRIL holds promise for develo** biomarkers and increasing clinicians' trust in NN. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2308.04639 [pdf, ps, other]

A Hierarchical Destroy and Repair Approach for Solving Very Large-Scale Travelling Salesman Problem

Authors: Zhang-Hua Fu, Sipeng Sun, **tong Ren, Tianshu Yu, Haoyu Zhang, Yuanyuan Liu, Lingxiao Huang, Xiang Yan, Pinyan Lu

Abstract: For prohibitively large-scale Travelling Salesman Problems (TSPs), existing algorithms face big challenges in terms of both computational efficiency and solution quality. To address this issue, we propose a hierarchical destroy-and-repair (HDR) approach, which attempts to improve an initial solution by applying a series of carefully designed destroy-and-repair operations. A key innovative concept… ▽ More For prohibitively large-scale Travelling Salesman Problems (TSPs), existing algorithms face big challenges in terms of both computational efficiency and solution quality. To address this issue, we propose a hierarchical destroy-and-repair (HDR) approach, which attempts to improve an initial solution by applying a series of carefully designed destroy-and-repair operations. A key innovative concept is the hierarchical search framework, which recursively fixes partial edges and compresses the input instance into a small-scale TSP under some equivalence guarantee. This neat search framework is able to deliver highly competitive solutions within a reasonable time. Fair comparisons based on nineteen famous large-scale instances (with 10,000 to 10,000,000 cities) show that HDR is highly competitive against existing state-of-the-art TSP algorithms, in terms of both efficiency and solution quality. Notably, on two large instances with 3,162,278 and 10,000,000 cities, HDR breaks the world records (i.e., best-known results regardless of computation time), which were previously achieved by LKH and its variants, while HDR is completely independent of LKH. Finally, ablation studies are performed to certify the importance and validity of the hierarchical search framework. △ Less

Submitted 8 August, 2023; originally announced August 2023.

arXiv:2307.10635 [pdf, other]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Authors: Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang

Abstract: Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dat… ▽ More Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery. △ Less

Submitted 28 June, 2024; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: To appear at ICML 2024

arXiv:2307.08209 [pdf, other]

Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection

Authors: Tianchen Zhao, Xuefei Ning, Ke Hong, Zhongyuan Qiu, Pu Lu, Yali Zhao, Linfeng Zhang, Lipu Zhou, Guohao Dai, Huazhong Yang, Yu Wang

Abstract: Voxel-based methods have achieved state-of-the-art performance for 3D object detection in autonomous driving. However, their significant computational and memory costs pose a challenge for their application to resource-constrained vehicles. One reason for this high resource consumption is the presence of a large number of redundant background points in Lidar point clouds, resulting in spatial redu… ▽ More Voxel-based methods have achieved state-of-the-art performance for 3D object detection in autonomous driving. However, their significant computational and memory costs pose a challenge for their application to resource-constrained vehicles. One reason for this high resource consumption is the presence of a large number of redundant background points in Lidar point clouds, resulting in spatial redundancy in both 3D voxel and dense BEV map representations. To address this issue, we propose an adaptive inference framework called Ada3D, which focuses on exploiting the input-level spatial redundancy. Ada3D adaptively filters the redundant input, guided by a lightweight importance predictor and the unique properties of the Lidar point cloud. Additionally, we utilize the BEV features' intrinsic sparsity by introducing the Sparsity Preserving Batch Normalization. With Ada3D, we achieve 40% reduction for 3D voxels and decrease the density of 2D BEV feature maps from 100% to 20% without sacrificing accuracy. Ada3D reduces the model computational and memory cost by 5x, and achieves 1.52x/1.45x end-to-end GPU latency and 1.5x/4.5x GPU peak memory optimization for the 3D and 2D backbone respectively. △ Less

Submitted 8 August, 2023; v1 submitted 16 July, 2023; originally announced July 2023.

Comments: Accepted at ICCV2023

arXiv:2307.04302 [pdf, ps, other]

Auction Design for Value Maximizers with Budget and Return-on-spend Constraints

Authors: Pinyan Lu, Chenyang Xu, Ruilong Zhang

Abstract: The paper designs revenue-maximizing auction mechanisms for agents who aim to maximize their total obtained values rather than the classical quasi-linear utilities. Several models have been proposed to capture the behaviors of such agents in the literature. In the paper, we consider the model where agents are subject to budget and return-on-spend constraints. The budget constraint of an agent limi… ▽ More The paper designs revenue-maximizing auction mechanisms for agents who aim to maximize their total obtained values rather than the classical quasi-linear utilities. Several models have been proposed to capture the behaviors of such agents in the literature. In the paper, we consider the model where agents are subject to budget and return-on-spend constraints. The budget constraint of an agent limits the maximum payment she can afford, while the return-on-spend constraint means that the ratio of the total obtained value (return) to the total payment (spend) cannot be lower than the targeted bar set by the agent. The problem was first coined by [Balseiro et al., EC 2022]. In their work, only Bayesian mechanisms were considered. We initiate the study of the problem in the worst-case model and compare the revenue of our mechanisms to an offline optimal solution, the most ambitious benchmark. The paper distinguishes two main auction settings based on the accessibility of agents' information: fully private and partially private. In the fully private setting, an agent's valuation, budget, and target bar are all private. We show that if agents are unit-demand, constant approximation mechanisms can be obtained; while for additive agents, there exists a mechanism that achieves a constant approximation ratio under a large market assumption. The partially private setting is the setting considered in the previous work [Balseiro et al., EC 2022] where only the agents' target bars are private. We show that in this setting, the approximation ratio of the single-item auction can be further improved, and a $Ω(1/\sqrt{n})$-approximation mechanism can be derived for additive agents. △ Less

Submitted 9 July, 2023; originally announced July 2023.

Comments: 29 pages

arXiv:2307.01229 [pdf, other]

EmoGen: Eliminating Subjective Bias in Emotional Music Generation

Authors: Chenfei Kang, Peiling Lu, Botao Yu, Xu Tan, Wei Ye, Shikun Zhang, Jiang Bian

Abstract: Music is used to convey emotions, and thus generating emotional music is important in automatic music generation. Previous work on emotional music generation directly uses annotated emotion labels as control signals, which suffers from subjective bias: different people may annotate different emotions on the same music, and one person may feel different emotions under different situations. Therefor… ▽ More Music is used to convey emotions, and thus generating emotional music is important in automatic music generation. Previous work on emotional music generation directly uses annotated emotion labels as control signals, which suffers from subjective bias: different people may annotate different emotions on the same music, and one person may feel different emotions under different situations. Therefore, directly map** emotion labels to music sequences in an end-to-end way would confuse the learning process and hinder the model from generating music with general emotions. In this paper, we propose EmoGen, an emotional music generation system that leverages a set of emotion-related music attributes as the bridge between emotion and music, and divides the generation into two stages: emotion-to-attribute map** with supervised clustering, and attribute-to-music generation with self-supervised learning. Both stages are beneficial: in the first stage, the attribute values around the clustering center represent the general emotions of these samples, which help eliminate the impacts of the subjective bias of emotion labels; in the second stage, the generation is completely disentangled from emotion labels and thus free from the subjective bias. Both subjective and objective evaluations show that EmoGen outperforms previous methods on emotion control accuracy and music quality respectively, which demonstrate our superiority in generating emotional music. Music samples generated by EmoGen are available via this link:https://ai-muzic.github.io/emogen/, and the code is available at this link:https://github.com/microsoft/muzic/. △ Less

Submitted 3 July, 2023; originally announced July 2023.

Comments: 12 pages, 7 pages

arXiv:2306.08954 [pdf, other]

Re-Benchmarking Pool-Based Active Learning for Binary Classification

Authors: Po-Yi Lu, Chun-Liang Li, Hsuan-Tien Lin

Abstract: Active learning is a paradigm that significantly enhances the performance of machine learning models when acquiring labeled data is expensive. While several benchmarks exist for evaluating active learning strategies, their findings exhibit some misalignment. This discrepancy motivates us to develop a transparent and reproducible benchmark for the community. Our efforts result in an open-sourced im… ▽ More Active learning is a paradigm that significantly enhances the performance of machine learning models when acquiring labeled data is expensive. While several benchmarks exist for evaluating active learning strategies, their findings exhibit some misalignment. This discrepancy motivates us to develop a transparent and reproducible benchmark for the community. Our efforts result in an open-sourced implementation (https://github.com/ariapoy/active-learning-benchmark) that is reliable and extensible for future research. By conducting thorough re-benchmarking experiments, we have not only rectified misconfigurations in existing benchmark but also shed light on the under-explored issue of model compatibility, which directly causes the observed discrepancy. Resolving the discrepancy reassures that the uncertainty sampling strategy of active learning remains an effective and preferred choice for most datasets. Our experience highlights the importance of dedicating research efforts towards re-benchmarking existing benchmarks to produce more credible results and gain deeper insights. △ Less

Submitted 23 September, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

arXiv:2306.07207 [pdf, other]

Valley: Video Assistant with Large Language model Enhanced abilitY

Authors: Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Da Li, Pengcheng Lu, Tao Wang, Linmei Hu, Minghui Qiu, Zhongyu Wei

Abstract: Large language models (LLMs), with their remarkable conversational capabilities, have demonstrated impressive performance across various applications and have emerged as formidable AI assistants. In view of this, it raises an intuitive question: Can we harness the power of LLMs to build multimodal AI assistants for visual applications? Recently, several multi-modal models have been developed for t… ▽ More Large language models (LLMs), with their remarkable conversational capabilities, have demonstrated impressive performance across various applications and have emerged as formidable AI assistants. In view of this, it raises an intuitive question: Can we harness the power of LLMs to build multimodal AI assistants for visual applications? Recently, several multi-modal models have been developed for this purpose. They typically pre-train an adaptation module to align the semantics of the vision encoder and language model, followed by fine-tuning on instruction-following data. However, despite the success of this pipeline in image and language understanding, its effectiveness in joint video and language understanding has not been widely explored. In this paper, we aim to develop a novel multi-modal foundation model capable of comprehending video, image, and language within a general framework. To achieve this goal, we introduce Valley, a Video Assistant with Large Language model Enhanced abilitY. The Valley consists of a LLM, a temporal modeling module, a visual encoder, and a simple projection module designed to bridge visual and textual modes. To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset and adopt a two-stage tuning procedure to train it. Specifically, we employ ChatGPT to facilitate the construction of task-oriented conversation data encompassing various tasks, including multi-shot captions, long video descriptions, action recognition, causal relationship inference, etc. Subsequently, we adopt a pre-training-then-instructions-tuned pipeline to align visual and textual modalities and improve the instruction-following capability of Valley. Qualitative experiments demonstrate that Valley has the potential to function as a highly effective video assistant that can make complex video understanding scenarios easy. △ Less

Submitted 8 October, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

arXiv:2306.06955 [pdf, other]

A Brief Review of Hypernetworks in Deep Learning

Authors: Vinod Kumar Chauhan, Jiandong Zhou, ** Lu, Soheila Molaei, David A. Clifton

Abstract: Hypernetworks, or hypernets in short, are neural networks that generate weights for another neural network, known as the target network. They have emerged as a powerful deep learning technique that allows for greater flexibility, adaptability, dynamism, faster training, information sharing, and model compression etc. Hypernets have shown promising results in a variety of deep learning problems, in… ▽ More Hypernetworks, or hypernets in short, are neural networks that generate weights for another neural network, known as the target network. They have emerged as a powerful deep learning technique that allows for greater flexibility, adaptability, dynamism, faster training, information sharing, and model compression etc. Hypernets have shown promising results in a variety of deep learning problems, including continual learning, causal inference, transfer learning, weight pruning, uncertainty quantification, zero-shot learning, natural language processing, and reinforcement learning etc. Despite their success across different problem settings, currently, there is no review available to inform the researchers about the developments and to help in utilizing hypernets. To fill this gap, we review the progress in hypernets. We present an illustrative example to train deep neural networks using hypernets and propose categorizing hypernets based on five design criteria as inputs, outputs, variability of inputs and outputs, and architecture of hypernets. We also review applications of hypernets across different deep learning problem settings, followed by a discussion of general scenarios where hypernets can be effectively employed. Finally, we discuss the challenges and future directions that remain under-explored in the field of hypernets. We believe that hypernetworks have the potential to revolutionize the field of deep learning. They offer a new way to design and train neural networks, and they have the potential to improve the performance of deep learning models on a variety of tasks. Through this review, we aim to inspire further advancements in deep learning through hypernetworks. △ Less

Submitted 10 August, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

Comments: revised categorisation, added new Section '5 When can we use Hypernets?', and other corrections(2 figures and 2 tables) (under review)

arXiv:2306.01665 [pdf, other]

SourceP: Detecting Ponzi Schemes on Ethereum with Source Code

Authors: Pengcheng Lu, Liang Cai, Keting Yin

Abstract: As blockchain technology becomes more and more popular, a typical financial scam, the Ponzi scheme, has also emerged in the blockchain platform Ethereum. This Ponzi scheme deployed through smart contracts, also known as the smart Ponzi scheme, has caused a lot of economic losses and negative impacts. Existing methods for detecting smart Ponzi schemes on Ethereum mainly rely on bytecode features, o… ▽ More As blockchain technology becomes more and more popular, a typical financial scam, the Ponzi scheme, has also emerged in the blockchain platform Ethereum. This Ponzi scheme deployed through smart contracts, also known as the smart Ponzi scheme, has caused a lot of economic losses and negative impacts. Existing methods for detecting smart Ponzi schemes on Ethereum mainly rely on bytecode features, opcode features, account features, and transaction behavior features of smart contracts, which are unable to truly characterize the behavioral features of Ponzi schemes, and thus generally perform poorly in terms of detection accuracy and false alarm rates. In this paper, we propose SourceP, a method to detect smart Ponzi schemes on the Ethereum platform using pre-trained models and data flow, which only requires using the source code of smart contracts as features. SourceP reduces the difficulty of data acquisition and feature extraction of existing detection methods. Specifically, we first convert the source code of a smart contract into a data flow graph and then introduce a pre-trained model based on learning code representations to build a classification model to identify Ponzi schemes in smart contracts. The experimental results show that SourceP achieves 87.2% recall and 90.7% F-score for detecting smart Ponzi schemes within Ethereum's smart contract dataset, outperforming state-of-the-art methods in terms of performance and sustainability. We also demonstrate through additional experiments that pre-trained models and data flow play an important contribution to SourceP, as well as proving that SourceP has a good generalization ability. △ Less

Submitted 29 February, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: 12 pages, 5 figures, 4 tables

arXiv:2306.01187 [pdf, other]

Training neural operators to preserve invariant measures of chaotic attractors

Authors: Ruoxi Jiang, Peter Y. Lu, Elena Orlova, Rebecca Willett

Abstract: Chaotic systems make long-horizon forecasts difficult because small perturbations in initial conditions cause trajectories to diverge at an exponential rate. In this setting, neural operators trained to minimize squared error losses, while capable of accurate short-term forecasts, often fail to reproduce statistical or structural properties of the dynamics over longer time horizons and can yield d… ▽ More Chaotic systems make long-horizon forecasts difficult because small perturbations in initial conditions cause trajectories to diverge at an exponential rate. In this setting, neural operators trained to minimize squared error losses, while capable of accurate short-term forecasts, often fail to reproduce statistical or structural properties of the dynamics over longer time horizons and can yield degenerate results. In this paper, we propose an alternative framework designed to preserve invariant measures of chaotic attractors that characterize the time-invariant statistical properties of the dynamics. Specifically, in the multi-environment setting (where each sample trajectory is governed by slightly different dynamics), we consider two novel approaches to training with noisy data. First, we propose a loss based on the optimal transport distance between the observed dynamics and the neural operator outputs. This approach requires expert knowledge of the underlying physics to determine what statistical features should be included in the optimal transport loss. Second, we show that a contrastive learning framework, which does not require any specialized prior knowledge, can preserve statistical properties of the dynamics nearly as well as the optimal transport approach. On a variety of chaotic systems, our method is shown empirically to preserve invariant measures of chaotic attractors. △ Less

Submitted 16 April, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

Comments: Accepted at NeurIPS 2023

arXiv:2306.00110 [pdf, other]

MuseCoco: Generating Symbolic Music from Text

Authors: Peiling Lu, Xin Xu, Chenfei Kang, Botao Yu, Chengyi Xing, Xu Tan, Jiang Bian

Abstract: Generating music from text descriptions is a user-friendly mode since the text is a relatively easy interface for user engagement. While some approaches utilize texts to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music offers ease of editing, making it more accessible for users to manipulate specific musical elements.… ▽ More Generating music from text descriptions is a user-friendly mode since the text is a relatively easy interface for user engagement. While some approaches utilize texts to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music offers ease of editing, making it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions with musical attributes as the bridge to break down the task into text-to-attribute understanding and attribute-to-music generation stages. MuseCoCo stands for Music Composition Copilot that empowers musicians to generate music directly from given text descriptions, offering a significant improvement in efficiency compared to creating music entirely from scratch. The system has two main advantages: Firstly, it is data efficient. In the attribute-to-music generation stage, the attributes can be directly extracted from music sequences, making the model training self-supervised. In the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Secondly, the system can achieve precise control with specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in terms of musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32 respectively. Besides, there is a notable enhancement of about 20% in objective control accuracy. In addition, we have developed a robust large-scale model with 1.2 billion parameters, showcasing exceptional controllability and musicality. △ Less

Submitted 31 May, 2023; originally announced June 2023.

arXiv:2305.19685 [pdf, other]

Deep Stochastic Mechanics

Authors: Elena Orlova, Aleksei Ustimenko, Ruoxi Jiang, Peter Y. Lu, Rebecca Willett

Abstract: This paper introduces a novel deep-learning-based approach for numerical simulation of a time-evolving Schrödinger equation inspired by stochastic mechanics and generative diffusion models. Unlike existing approaches, which exhibit computational complexity that scales exponentially in the problem dimension, our method allows us to adapt to the latent low-dimensional structure of the wave function… ▽ More This paper introduces a novel deep-learning-based approach for numerical simulation of a time-evolving Schrödinger equation inspired by stochastic mechanics and generative diffusion models. Unlike existing approaches, which exhibit computational complexity that scales exponentially in the problem dimension, our method allows us to adapt to the latent low-dimensional structure of the wave function by sampling from the Markovian diffusion. Depending on the latent dimension, our method may have far lower computational complexity in higher dimensions. Moreover, we propose novel equations for stochastic quantum mechanics, resulting in quadratic computational complexity with respect to the number of dimensions. Numerical simulations verify our theoretical findings and show a significant advantage of our method compared to other deep-learning-based approaches used for quantum mechanics. △ Less

Submitted 4 June, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.14782 [pdf, other]

IBCL: Zero-shot Model Generation for Task Trade-offs in Continual Learning

Authors: Pengyuan Lu, Michele Caprio, Eric Eaton, Insup Lee

Abstract: Like generic multi-task learning, continual learning has the nature of multi-objective optimization, and therefore faces a trade-off between the performance of different tasks. That is, to optimize for the current task distribution, it may need to compromise performance on some previous tasks. This means that there exist multiple models that are Pareto-optimal at different times, each addressing a… ▽ More Like generic multi-task learning, continual learning has the nature of multi-objective optimization, and therefore faces a trade-off between the performance of different tasks. That is, to optimize for the current task distribution, it may need to compromise performance on some previous tasks. This means that there exist multiple models that are Pareto-optimal at different times, each addressing a distinct task performance trade-off. Researchers have discussed how to train particular models to address specific trade-off preferences. However, existing algorithms require training overheads proportional to the number of preferences -- a large burden when there are multiple, possibly infinitely many, preferences. As a response, we propose Imprecise Bayesian Continual Learning (IBCL). Upon a new task, IBCL (1) updates a knowledge base in the form of a convex hull of model parameter distributions and (2) obtains particular models to address task trade-off preferences with zero-shot. That is, IBCL does not require any additional training overhead to generate preference-addressing models from its knowledge base. We show that models obtained by IBCL have guarantees in identifying the Pareto optimal parameters. Moreover, experiments on standard image classification and NLP tasks support this guarantee. Statistically, IBCL improves average per-task accuracy by at most 23% and peak per-task accuracy by at most 15% with respect to the baseline methods, with steadily near-zero or positive backward transfer. Most importantly, IBCL significantly reduces the training overhead from training 1 model per preference to at most 3 models for all preferences. △ Less

Submitted 9 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: We falsely submitted this as a new article at 2310.02995. That article is withdrawn. This is the correct submission (to replace a previous version)

arXiv:2305.12524 [pdf, other]

TheoremQA: A Theorem-driven Question Answering dataset

Authors: Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia

Abstract: The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed… ▽ More The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems (e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs' capabilities to solve challenging science problems. The data and code are released in https://github.com/wenhuchen/TheoremQA. △ Less

Submitted 5 December, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted to Main Conference of EMNLP 2023

arXiv:2305.10841 [pdf, other]

GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework

Authors: Ang Lv, Xu Tan, Peiling Lu, Wei Ye, Shikun Zhang, Jiang Bian, Rui Yan

Abstract: Symbolic music generation aims to create musical notes, which can help users compose music, such as generating target instrument tracks based on provided source tracks. In practical scenarios where there's a predefined ensemble of tracks and various composition needs, an efficient and effective generative model that can generate any target tracks based on the other tracks becomes crucial. However,… ▽ More Symbolic music generation aims to create musical notes, which can help users compose music, such as generating target instrument tracks based on provided source tracks. In practical scenarios where there's a predefined ensemble of tracks and various composition needs, an efficient and effective generative model that can generate any target tracks based on the other tracks becomes crucial. However, previous efforts have fallen short in addressing this necessity due to limitations in their music representations and models. In this paper, we introduce a framework known as GETMusic, with ``GET'' standing for ``GEnerate music Tracks.'' This framework encompasses a novel music representation ``GETScore'' and a diffusion model ``GETDiff.'' GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time. At a training step, each track of a music piece is randomly selected as either the target or source. The training involves two processes: In the forward process, target tracks are corrupted by masking their tokens, while source tracks remain as the ground truth; in the denoising process, GETDiff is trained to predict the masked target tokens conditioning on the source tracks. Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with any arbitrary source-target track combinations. Our experiments demonstrate that the versatile GETMusic outperforms prior works proposed for certain specific composition tasks. △ Less

Submitted 29 September, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: 13 pages, 4 figures

arXiv:2305.08178 [pdf, other]

Path Planning for Air-Ground Robot Considering Modal Switching Point Optimization

Authors: Xiaoyu Wang, Kangyao Huang, Xinyu Zhang, Honglin Sun, Wenzhuo Liu, Hua** Liu, Jun Li, **** Lu

Abstract: An innovative sort of mobility platform that can both drive and fly is the air-ground robot. The need for an agile flight cannot be satisfied by traditional path planning techniques for air-ground robots. Prior studies had mostly focused on improving the energy efficiency of paths, seldom taking the seeking speed and optimizing take-off and landing places into account. A robot for the field applic… ▽ More An innovative sort of mobility platform that can both drive and fly is the air-ground robot. The need for an agile flight cannot be satisfied by traditional path planning techniques for air-ground robots. Prior studies had mostly focused on improving the energy efficiency of paths, seldom taking the seeking speed and optimizing take-off and landing places into account. A robot for the field application environment was proposed, and a lightweight global spatial planning technique for the robot based on the graph-search algorithm taking mode switching point optimization into account, with an emphasis on energy efficiency, searching speed, and the viability of real deployment. The fundamental concept is to lower the computational burden by employing an interchangeable search approach that combines planar and spatial search. Furthermore, to safeguard the health of the power battery and the integrity of the mission execution, a trap escape approach was also provided. Simulations are run to test the effectiveness of the suggested model based on the field DEM map. The simulation results show that our technology is capable of producing finished, plausible 3D paths with a high degree of believability. Additionally, the mode-switching point optimization method efficiently identifies additional acceptable places for mode switching, and the improved paths use less time and energy. △ Less

Submitted 14 May, 2023; originally announced May 2023.

arXiv:2305.04971 [pdf, other]

LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization

Authors: Peng Lu, Ahmad Rashid, Ivan Kobyzev, Mehdi Rezagholizadeh, Philippe Langlais

Abstract: Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Con… ▽ More Regularization techniques are crucial to improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, batch/layer normalization to converge faster and generalize. Label Smoothing (LS) is another simple, versatile and efficient regularization which can be applied to various supervised classification tasks. Conventional LS, however, regardless of the training instance assumes that each non-target class is equally likely. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing without the need to store the parameters or the output of a trained model. Finally, we conduct extensive experiments and demonstrate our LABO consistently yields improvement over conventional label regularization on various fields, including seven machine translation and three image classification tasks across various △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted at ACL2023 (Findings)

arXiv:2305.01795 [pdf, other]

Multimodal Procedural Planning via Dual Text-Image Prompting

Authors: Yujie Lu, Pan Lu, Zhiyu Chen, Wanrong Zhu, Xin Eric Wang, William Yang Wang

Abstract: Embodied agents have achieved prominent performance in following human instructions to complete tasks. However, the potential of providing instructions informed by texts and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired… ▽ More Embodied agents have achieved prominent performance in following human instructions to complete tasks. However, the potential of providing instructions informed by texts and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired text-image steps, providing more complementary and informative guidance than unimodal plans. The key challenges of MPP are to ensure the informativeness, temporal coherence,and accuracy of plans across modalities. To tackle this, we propose Text-Image Prompting (TIP), a dual-modality prompting method that jointly leverages zero-shot reasoning ability in large language models (LLMs) and compelling text-to-image generation ability from diffusion-based models. TIP improves the interaction in the dual modalities using Text-to-Image Bridge and Image-to-Text Bridge, allowing LLMs to guide the textual-grounded image plan generation and leveraging the descriptions of image plans to ground the textual plan reversely. To address the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed for MPP. Our results show compelling human preferences and automatic scores against unimodal and multimodal baselines on WIKIPLAN and RECIPEPLAN in terms of informativeness, temporal coherence, and plan accuracy. Our code and data: https://github.com/YujieLu10/MPP. △ Less

Submitted 2 May, 2023; originally announced May 2023.

Showing 1–50 of 214 results for author: Lu, P