Search | arXiv e-print repository

Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining

Authors: Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang

Abstract: In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical co… ▽ More In recent years, the rise of generative self-supervised learning (SSL) paradigms has exhibited impressive performance across visual, language, and multi-modal domains. While the varied designs of generative SSL objectives lead to distinct properties in downstream tasks, a theoretical understanding of these differences remains largely unexplored. In this paper, we establish the first theoretical comparisons between two leading generative SSL paradigms: autoregressive SSL and masked SSL. Through establishing theoretical frameworks, we elucidate the strengths and limitations of autoregressive and masked SSL within the primary evaluation tasks of classification and content generation. Our findings demonstrate that in classification tasks, the flexibility of targeted tokens in masked SSL fosters more inter-sample connections compared to the fixed position of target tokens in autoregressive SSL, which yields superior clustering performance. In content generation tasks, the misalignment between the flexible lengths of test samples and the fixed length of unmasked texts in masked SSL (vs. flexible lengths of conditional texts in autoregressive SSL) hinders its generation performance. To leverage each other's strengths and mitigate weaknesses, we propose diversity-enhanced autoregressive and variable-length masked objectives, which substantially improve the classification performance of autoregressive SSL and the generation performance of masked SSL. Code is available at https://github.com/PKU-ML/LookAheadLookAround. △ Less

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2407.00560 [pdf, other]

DCI: An Accurate Quality Assessment Criteria for Protein Complex Structure Models

Authors: Wenda Wang, Jiaqi Zhai, He Huang, Xinqi Gong

Abstract: The structure of proteins is the basis for studying protein function and drug design. The emergence of AlphaFold 2 has greatly promoted the prediction of protein 3D structures, and it is of great significance to give an overall and accurate evaluation of the predicted models, especially the complex models. Among the existing methods for evaluating multimer structures, DockQ is the most commonly us… ▽ More The structure of proteins is the basis for studying protein function and drug design. The emergence of AlphaFold 2 has greatly promoted the prediction of protein 3D structures, and it is of great significance to give an overall and accurate evaluation of the predicted models, especially the complex models. Among the existing methods for evaluating multimer structures, DockQ is the most commonly used. However, as a more suitable metric for complex docking, DockQ cannot provide a unique and accurate evaluation in the non-docking situation. Therefore, it is necessary to propose an evaluation strategy that can directly evaluate the whole complex without limitation and achieve good results. In this work, we proposed DCI score, a new evaluation strategy for protein complex structure models, which only bases on distance map and CI (contact-interface) map, DCI focuses on the prediction accuracy of the contact interface based on the overall evaluation of complex structure, is not inferior to DockQ in the evaluation accuracy according to CAPRI classification, and is able to handle the non-docking situation better than DockQ. Besides, we calculated DCI score on CASP datasets and compared it with CASP official assessment, which obtained good results. In addition, we found that DCI can better evaluate the overall structure deviation caused by interface prediction errors in the case of multi-chains. Our DCI is available at \url{https://gitee.com/WendaWang/DCI-score.git}, and the online-server is available at \url{http://mialab.ruc.edu.cn/DCIServer/}. △ Less

Submitted 29 June, 2024; originally announced July 2024.

arXiv:2407.00125 [pdf, other]

A Survey on Failure Analysis and Fault Injection in AI Systems

Authors: Guangba Yu, Gou Tan, Haojia Huang, Zhenyu Zhang, Pengfei Chen, Roberto Natella, Zibin Zheng

Abstract: The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ens… ▽ More The rapid advancement of Artificial Intelligence (AI) has led to its integration into various areas, especially with Large Language Models (LLMs) significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, there lacks a comprehensive review of FA and FI methodologies in AI systems. This study fills this gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions including (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state-of-the-art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems. △ Less

Submitted 27 June, 2024; originally announced July 2024.

arXiv:2406.19954 [pdf, other]

BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5

Authors: Zhehuai Chen, He Huang, Oleksii Hrinchuk, Krishna C. Puvvada, Nithin Rao Koluguri, Piotr Żelasko, Jagadeesh Balam, Boris Ginsburg

Abstract: Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose BESTO… ▽ More Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unifies the offline and streaming research with BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables Streaming and Multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech. △ Less

Submitted 28 June, 2024; originally announced June 2024.

MSC Class: 68T10 ACM Class: I.2.7

arXiv:2406.19674 [pdf, other]

Less is More: Accurate Speech Recognition & Translation without Web-Scale Data

Authors: Krishna C. Puvvada, Piotr Żelasko, He Huang, Oleksii Hrinchuk, Nithin Rao Koluguri, Kunal Dhawan, Somshubra Majumdar, Elena Rastorgueva, Zhehuai Chen, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg

Abstract: Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while b… ▽ More Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while being trained on an order of magnitude less data than these models. Three key factors enables such data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture (2) training on synthetic data generated with machine translation and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech-2024

arXiv:2406.19212 [pdf, other]

JuliVQC: an Efficient Variational Quantum Circuit Simulator for Near-Term Quantum Algorithms

Authors: Wei-You Liao, Xiang Wang, Xiao-Yue Xu, Chen Ding, Shuo Zhang, He-Liang Huang, Chu Guo

Abstract: We introduce JuliVQC: a light-weight, yet extremely efficient variational quantum circuit simulator. JuliVQC is part of an effort for classical simulation of the \textit{Zuchongzhi} quantum processors, where it is extensively used to characterize the circuit noises, as a building block in the Schr$\ddot{\text{o}}$dinger-Feynman algorithm for classical verification and performance benchmarking, and… ▽ More We introduce JuliVQC: a light-weight, yet extremely efficient variational quantum circuit simulator. JuliVQC is part of an effort for classical simulation of the \textit{Zuchongzhi} quantum processors, where it is extensively used to characterize the circuit noises, as a building block in the Schr$\ddot{\text{o}}$dinger-Feynman algorithm for classical verification and performance benchmarking, and for variational optimization of the Fsim gate parameters. The design principle of JuliVQC is three-fold: (1) Transparent implementation of its core algorithms, realized by using the high-performance script language Julia; (2) Efficiency is the focus, with a cache-friendly implementation of each elementary operations and support for shared-memory parallelization; (3) Native support of automatic differentiation for both the noiseless and noisy quantum circuits. We perform extensive numerical experiments on JuliVQC in different application scenarios, including quantum circuits, variational quantum circuits and their noisy counterparts, which show that its performance is among the top of the popular alternatives. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: 12 pages, 8 figures

arXiv:2406.18871 [pdf, other]

DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment

Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, He Huang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Abstract: Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby fa… ▽ More Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted to Interspeech 2024

arXiv:2406.18254 [pdf, other]

doi 10.1145/3637528.3671787

Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning

Authors: Zhijie Nie, Richong Zhang, Zhangchi Feng, Hailang Huang, Xudong Liu

Abstract: Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieves image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on l… ▽ More Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieves image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: The methods with cross-lingual style suffer from the intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset. The methods with cross-modal style suffer from the inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving the new state-of-art. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted by KDD 2024 Research Track

arXiv:2406.18146 [pdf, other]

A Refer-and-Ground Multimodal Large Language Model for Biomedicine

Authors: Xiaoshuang Huang, Haifeng Huang, Lingdong Shen, Yehui Yang, Fangxin Shang, Junwei Liu, Jia Liu

Abstract: With the rapid development of multimodal large language models (MLLMs), especially their capabilities in visual chat through refer and ground functionalities, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this chall… ▽ More With the rapid development of multimodal large language models (MLLMs), especially their capabilities in visual chat through refer and ground functionalities, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dedicated to the biomedical domain and integrating refer and ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and generate instruction datasets from text using chatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning. Extensive experiments have corroborated the efficacy of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model. This holds significant reference value for the exploration and development of intelligent biomedical assistants. △ Less

Submitted 28 June, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: Accepted by MICCAI2024

arXiv:2406.18018 [pdf, other]

A Cross Spatio-Temporal Pathology-based Lung Nodule Dataset

Authors: Muwei Jian, Haoran Zhang, Mingju Shao, Hongyu Chen, Huihui Huang, Yanjie Zhong, Changlei Zhang, Bin Wang, Penghui Gao

Abstract: Recently, intelligent analysis of lung nodules with the assistant of computer aided detection (CAD) techniques can improve the accuracy rate of lung cancer diagnosis. However, existing CAD systems and pulmonary datasets mainly focus on Computed Tomography (CT) images from one single period, while ignoring the cross spatio-temporal features associated with the progression of nodules contained in im… ▽ More Recently, intelligent analysis of lung nodules with the assistant of computer aided detection (CAD) techniques can improve the accuracy rate of lung cancer diagnosis. However, existing CAD systems and pulmonary datasets mainly focus on Computed Tomography (CT) images from one single period, while ignoring the cross spatio-temporal features associated with the progression of nodules contained in imaging data from various captured periods of lung cancer. If the evolution patterns of nodules across various periods in the patients' CT sequences can be explored, it will play a crucial role in guiding the precise screening identification of lung cancer. Therefore, a cross spatio-temporal lung nodule dataset with pathological information for nodule identification and diagnosis is constructed, which contains 328 CT sequences and 362 annotated nodules from 109 patients. This comprehensive database is intended to drive research in the field of CAD towards more practical and robust methods, and also contribute to the further exploration of precision medicine related field. To ensure patient confidentiality, we have removed sensitive information from the dataset. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17770 [pdf, other]

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

Authors: Xiangyu Zhao, Xiangtai Li, Haodong Duan, Haian Huang, Yining Li, Kai Chen, Hua Yang

Abstract: Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capa… ▽ More Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA. △ Less

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17565 [pdf, other]

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Authors: Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

Abstract: Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemP… ▽ More Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-time. △ Less

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.17507 [pdf, other]

ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling

Authors: Minghui Fang, Shengpeng Ji, Jialong Zuo, Hai Huang, Yan Xia, Jieming Zhu, Xize Cheng, Xiaoda Yang, Wenrui Liu, Gang Wang, Zhenhua Dong, Zhou Zhao

Abstract: Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a sequence-to-sequence model to directly generate candidate identifiers based on natural language queries. Without explicitly computing the similarity between queries and candidates, generative retrieval surpasses dual-tower models in both speed and accuracy on large-scale corpora, providing new insights… ▽ More Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a sequence-to-sequence model to directly generate candidate identifiers based on natural language queries. Without explicitly computing the similarity between queries and candidates, generative retrieval surpasses dual-tower models in both speed and accuracy on large-scale corpora, providing new insights for cross-modal retrieval. However, constructing identifiers for multimodal data remains an untapped problem, and the modality gap between natural language queries and multimodal candidates hinders retrieval performance due to the absence of additional encoders. To this end, we propose a pioneering generAtive Cross-modal rEtrieval framework (ACE), which is a comprehensive framework for end-to-end cross-modal retrieval based on coarse-to-fine semantic modeling. We propose combining K-Means and RQ-VAE to construct coarse and fine tokens, serving as identifiers for multimodal data. Correspondingly, we design the coarse-to-fine feature fusion strategy to efficiently align natural language queries and candidate identifiers. ACE is the first work to comprehensively demonstrate the feasibility of generative approach on text-to-image/audio/video retrieval, challenging the dominance of the embedding-based dual-tower architecture. Extensive experiments show that ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.16520 [pdf]

Gigantic-oxidative atomically layered epitaxy for designed complex oxides

Authors: Guangdi Zhou, Haoliang Huang, Fengzhe Wang, Heng Wang, Qishuo Yang, Zihao Nie, Wei Lv, Cui Ding, Yueying Li, Danfeng Li, Yujie Sun, Junhao Lin, Guang-Ming Zhang, Qi-Kun Xue, Zhuoyu Chen

Abstract: In designing material functionality within the intricate realm of transition metal oxides, lattice structure and d-orbital occupancy are two principal determinants of the correlated physical properties, such as superconductivity. However, the modulation of these two factors is inherently limited by the need to balance thermodynamic stability, kinetic mobility, and synthesis precision, particularly… ▽ More In designing material functionality within the intricate realm of transition metal oxides, lattice structure and d-orbital occupancy are two principal determinants of the correlated physical properties, such as superconductivity. However, the modulation of these two factors is inherently limited by the need to balance thermodynamic stability, kinetic mobility, and synthesis precision, particularly for oxidation-demanding phases. We introduce a methodology, namely the gigantic-oxidative atomically layered epitaxy (GOAL-Epitaxy), enhancing oxidation power 3-4 orders of magnitude beyond oxide molecular beam epitaxy (OMBE) and pulsed laser deposition (PLD), while ensuring atomic-layer-by-layer growth of designed complex structures. Consequently, thermodynamic stability is markedly augmented at elevated temperatures, improving growth kinetics. We demonstrate the accurate synthesis of complex nickelates and cuprates, especially an artificially designed structure as a parent of high-temperature superconductivity, in which alternating single and double NiO2 layers possess distinct nominal d-orbital occupancy. The GOAL-Epitaxy enables material discovery within the vastly broadened growth parameter space. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.16137 [pdf, other]

MLPHand: Real Time Multi-View 3D Hand Mesh Reconstruction via MLP Modeling

Authors: Jian Yang, Jiakun Li, Guoming Li, Zhen Shen, Huai-Yu Wu, Zhaoxin Fan, Heng Huang

Abstract: Multi-view hand mesh reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed fo… ▽ More Multi-view hand mesh reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed for real-time multi-view single hand reconstruction. MLP Hand consists of two primary modules: (1) a lightweight MLP-based Skeleton2Mesh model that efficiently recovers hand meshes from hand skeletons, and (2) a multi-view geometry feature fusion prediction module that enhances the Skeleton2Mesh model with detailed geometric information from multiple views. Experiments on three widely used datasets demonstrate that MLPHand can reduce computational complexity by 90% while achieving comparable reconstruction accuracy to existing state-of-the-art baselines. △ Less

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.15677 [pdf, other]

Open-vocabulary Pick and Place via Patch-level Semantic Maps

Authors: Mingxi Jia, Haojie Huang, Zhewen Zhang, Chenghao Wang, Linfeng Zhao, Dian Wang, Jason Xinyu Liu, Robin Walters, Robert Platt, Stefanie Tellex

Abstract: Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and complex robot behavior synthesis. However, achieving this capability poses significant challenges due to the need for a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and… ▽ More Controlling robots through natural language instructions in open-vocabulary scenarios is pivotal for enhancing human-robot collaboration and complex robot behavior synthesis. However, achieving this capability poses significant challenges due to the need for a system that can generalize from limited data to a wide range of tasks and environments. Existing methods rely on large, costly datasets and struggle with generalization. This paper introduces Grounded Equivariant Manipulation (GEM), a novel approach that leverages the generative capabilities of pre-trained vision-language models and geometric symmetries to facilitate few-shot and zero-shot learning for open-vocabulary robot manipulation tasks. Our experiments demonstrate GEM's high sample efficiency and superior generalization across diverse pick-and-place tasks in both simulation and real-world experiments, showcasing its ability to adapt to novel instructions and unseen objects with minimal data requirements. GEM advances a significant step forward in the domain of language-conditioned robot control, bridging the gap between semantic understanding and action generation in robotic systems. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14955 [pdf, other]

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

Authors: Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, Yantao Jia, Zhao Cao, Ji-Rong Wen

Abstract: In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge,… ▽ More In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source codes and benchmark are released at https://github.com/yiye3/ICLEval. △ Less

Submitted 21 June, 2024; originally announced June 2024.

arXiv:2406.14909 [pdf, other]

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Authors: Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang

Abstract: Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring thei… ▽ More Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by $3.9\times$ with the same average attention span, boosting retrieval accuracy by $1.5-7.1\times$ over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from $9\%-36\%$ to within $5\%$ across two long-context understanding benchmarks. MoA achieves a $1.2-1.4\times$ GPU memory reduction and boosts decode throughput by $5.5-6.7 \times$ for 7B and 13B dense models on a single GPU, with minimal impact on performance. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: 10 pages

ACM Class: I.2.7

arXiv:2406.14828 [pdf, other]

Word Matters: What Influences Domain Adaptation in Summarization?

Authors: Yinghao Li, Siyu Miao, Heyan Huang, Yang Gao

Abstract: Domain adaptation aims to enable Large Language Models (LLMs) to generalize domain datasets unseen effectively during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation perform… ▽ More Domain adaptation aims to enable Large Language Models (LLMs) to generalize domain datasets unseen effectively during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of `words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship, which is not directly related to the number of words. Based on this finding, predicting a model's performance on unknown domain datasets is possible without undergoing training. △ Less

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14307 [pdf, other]

QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis

Authors: Chao Hui Huang

Abstract: In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool effectively simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functi… ▽ More In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool effectively simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. Hence, QuST-LLM provides researchers with a potent functionality to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: 12 pages, 7 figures

arXiv:2406.14075 [pdf, other]

EXCEEDS: Extracting Complex Events as Connecting the Dots to Graphs in Scientific Domain

Authors: Yi-Fan Lu, Xian-Ling Mao, Bo Wang, Xiao Liu, Heyan Huang

Abstract: It is crucial to utilize events to understand a specific domain. There are lots of research on event extraction in many domains such as news, finance and biology domain. However, scientific domain still lacks event extraction research, including comprehensive datasets and corresponding methods. Compared to other domains, scientific domain presents two characteristics: denser nuggets and more compl… ▽ More It is crucial to utilize events to understand a specific domain. There are lots of research on event extraction in many domains such as news, finance and biology domain. However, scientific domain still lacks event extraction research, including comprehensive datasets and corresponding methods. Compared to other domains, scientific domain presents two characteristics: denser nuggets and more complex events. To solve the above problem, considering these two characteristics, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for scientific domain. It has 2,508 documents and 24,381 events under refined annotation and quality control. Then, we propose EXCEEDS, a novel end-to-end scientific event extraction framework by storing dense nuggets in a grid matrix and simplifying complex event extraction into a dot construction and connection task. Experimental results demonstrate state-of-the-art performances of EXCEEDS on SciEvents. Additionally, we release SciEvents and EXCEEDS on GitHub. △ Less

Submitted 20 June, 2024; originally announced June 2024.

Comments: This paper is working in process

arXiv:2406.14025 [pdf]

Direct Observation of Dendrites Nucleation in Li Metal Battery by Machine Learning Accelerated Molecular Simulations under Realistic Electrochemical Conditions

Authors: Tai** Hu, Haichao Huang, Guobing Zhou, Xinyan Wang, Zheng Cheng, Fangjia Fu, Xiaoxu Wang, Fuzhi Dai, Kuang Yu, Shenzhen Xu

Abstract: Uncontrollable dendrites growth during electrochemical cycles leads to low Coulombic efficiency and critical safety issues in Li metal batteries. Hence, a comprehensive understanding of the dendrite formation mechanism is essential for further enhancing the performance of Li metal batteries. Machine learning accelerated molecular dynamics (MD) simulations can provide atomic-scale resolution for va… ▽ More Uncontrollable dendrites growth during electrochemical cycles leads to low Coulombic efficiency and critical safety issues in Li metal batteries. Hence, a comprehensive understanding of the dendrite formation mechanism is essential for further enhancing the performance of Li metal batteries. Machine learning accelerated molecular dynamics (MD) simulations can provide atomic-scale resolution for various key processes at an ab-initio level accuracy. However, traditional MD simulation tools hardly capture Li electrochemical depositions, due to lack of an electrochemical constant potential (ConstP) condition. In this work, we propose a ConstP approach that combines a machine learning force field with the charge equilibration method to reveal the dynamic process of Li dendrites nucleation at Li metal anode surfaces. Our results show that both dead Li cluster formation and inhomogeneous Li electro-depositions can induce Li dendrites nucleation. We further reveal that the local aggregation of Li atoms in amorphous inorganic components of solid electrolyte interphase is the key factor triggering the nucleation process. Overall, our simulations provide microscopic insights for Li dendrites formations in Li metal anodes. More importantly, we present an efficient and accurate simulation method for modeling realistic ConstP conditions, which holds considerable potential for broader applications in modeling of complex electrochemical interfaces. △ Less

Submitted 28 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13978 [pdf, other]

Topological Solitons in Square-root Graphene Nanoribbons Controlled by Electric Fields

Authors: Haiyue Huang, Mamun Sarker, Percy Zahl, C. Stephen Hellberg, Jeremy Levy, Ioannis Petrides, Alexander Sinitskii, Prineha Narang

Abstract: Graphene nanoribbons (GNRs) are unique quasi-one-dimensional (1D) materials that have garnered a lot of research interest in the field of topological insulators. While the topological phases exhibited by GNRs are primarily governed by their chemical structures, the ability to externally control these phases is crucial for their potential utilization in quantum electronics and spintronics. Here we… ▽ More Graphene nanoribbons (GNRs) are unique quasi-one-dimensional (1D) materials that have garnered a lot of research interest in the field of topological insulators. While the topological phases exhibited by GNRs are primarily governed by their chemical structures, the ability to externally control these phases is crucial for their potential utilization in quantum electronics and spintronics. Here we propose a class of GNRs featured by mirror symmetry and four zigzag segments in a unit cell that has unique topological properties induced and controlled by an externally applied electric field. Their band structures manifest two finite gaps which support topological solitons, as described by an effective square-root model. To demonstrate the experimental feasibility, we design and synthesize a representative partially zigzag chevron-type GNR (pzc-GNR) with the desired zigzag segments using a bottom-up approach. First-principles calculations on pzc-GNR reveal band inversions at the two finite gaps by switching the direction of the electric field, which is in accordance with predictions from the square-root Hamiltonian. We show different topological phases can be achieved by controlling the direction of the field and the chemical potential of the system in square-root GNRs. Consequently, upon adding a step-function electric field, solitons states can be generated at the domain wall. We discuss the properties of two types of soliton states, depending on whether the terminating commensurate unit cell is mirror symmetric. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13455 [pdf, ps, other]

Johnson graphs as slices of a hypercube and an algebra homomorphism from the universal Racah algebra into $U(\mathfrak{sl}_2)$

Authors: Hau-Wen Huang, Chia-Yi Wen

Abstract: From the viewpoint of Johnson graphs as slices of a hypercube, we derive a novel algebra homomorphism $\sharp$ from the universal Racah algebra $\Re$ into $U(\mathfrak{sl}_2)$. We use the Casimir elements of $\Re$ to describe the kernel of $\sharp$. By pulling back via $\sharp$ every $U(\mathfrak{sl}_2)$-module can be viewed as an $\Re$-module. We show that for any finite-dimensional… ▽ More From the viewpoint of Johnson graphs as slices of a hypercube, we derive a novel algebra homomorphism $\sharp$ from the universal Racah algebra $\Re$ into $U(\mathfrak{sl}_2)$. We use the Casimir elements of $\Re$ to describe the kernel of $\sharp$. By pulling back via $\sharp$ every $U(\mathfrak{sl}_2)$-module can be viewed as an $\Re$-module. We show that for any finite-dimensional $U(\mathfrak{sl}_2)$-module $V$, the $\Re$-module $V$ is completely reducible and three generators of $\Re$ act on every irreducible $\Re$-submodule of $V$ as a Leonard triple. In particular, Leonard triples can be constructed in terms of the second dual distance operator of the hypercube $H(D,2)$ and a decomposition of the second distance operator of $H(D,2)$ induced by Johnson graphs. △ Less

Submitted 19 June, 2024; originally announced June 2024.

MSC Class: 05E30; 16G30; 16S30; 33D45

arXiv:2406.13324 [pdf, other]

Quantum Metric-induced Oscillations in Flat Bands

Authors: Hui Zeng, Zijian Zhou, Wenhui Duan, Huaqing Huang

Abstract: The transport of Bloch electrons under strong fields is traditionally understood through two mechanisms: intraband Bloch oscillations and interband Zener tunneling. Here we propose a new oscillation mechanism induced by the interband quantum metric, which would significantly affect the electron dynamics under strong fields. By considering the multiband dynamics to the second order of the density m… ▽ More The transport of Bloch electrons under strong fields is traditionally understood through two mechanisms: intraband Bloch oscillations and interband Zener tunneling. Here we propose a new oscillation mechanism induced by the interband quantum metric, which would significantly affect the electron dynamics under strong fields. By considering the multiband dynamics to the second order of the density matrix, we reveal that quantum metric-induced oscillations (QMO) persist regardless of band dispersion, even in exactly flat bands. The resultant drift current can reach a magnitude comparable to the Bloch oscillations-induced drift current in systems where interband tunneling is negligible. Notably, the {QMO}-induced drift current increases linearly with electric field strength under the constraints of time-reversal or spatial-inversion symmetry, emerging as the primary delocalized current. We further show that both one-dimensional and two-dimensional superlattices are potential platforms for investigating QMO. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.13167 [pdf, other]

QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism

Authors: Bo Wang, Heyan Huang, Yixin Cao, Jiahao Ying, Wei Tang, Chong Feng

Abstract: While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanism offers a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques… ▽ More While large language models (LLMs) have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanism offers a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques face challenges with static knowledge integration, leading to insufficient adaptation to task-specific needs and missing multi-segmentation relationships, which hinders the dynamic reorganization and logical combination of relevant segments during the response process. To address these issues, we introduce a novel strategy, Question then Reflection Memory Mechanism (QRMeM), incorporating a dual-structured memory pool. This pool synergizes static textual content with structured graph guidance, fostering a reflective trial-and-error approach for navigating and identifying relevant segments. Our evaluation across multiple-choice questions (MCQ) and multi-document question answering (Multi-doc QA) benchmarks showcases QRMeM enhanced performance compared to existing approaches. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12847 [pdf, other]

ChangeViT: Unleashing Plain Vision Transformers for Change Detection

Authors: Duowang Zhu, Xiaohu Huang, Haiyan Huang, Zhenfeng Shao, Qimin Cheng

Abstract: Change detection in remote sensing images is essential for tracking environmental changes on the Earth's surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain underutilized in change detection, where convolutional neural networks (CNNs) continue to dominate due to their powerful feature extraction capabilities. In this paper,… ▽ More Change detection in remote sensing images is essential for tracking environmental changes on the Earth's surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain underutilized in change detection, where convolutional neural networks (CNNs) continue to dominate due to their powerful feature extraction capabilities. In this paper, our study uncovers ViTs' unique advantage in discerning large-scale changes, a capability where CNNs fall short. Capitalizing on this insight, we introduce ChangeViT, a framework that adopts a plain ViT backbone to enhance the performance of large-scale changes. This framework is supplemented by a detail-capture module that generates detailed spatial features and a feature injector that efficiently integrates fine-grained spatial information into high-level semantic learning. The feature integration ensures that ChangeViT excels in both detecting large-scale changes and capturing fine-grained details, providing comprehensive change detection across diverse scales. Without bells and whistles, ChangeViT achieves state-of-the-art performance on three popular high-resolution datasets (i.e., LEVIR-CD, WHU-CD, and CLCD) and one low-resolution dataset (i.e., OSCD), which underscores the unleashed potential of plain ViTs for change detection. Furthermore, thorough quantitative and qualitative analyses validate the efficacy of the introduced modules, solidifying the effectiveness of our approach. The source code is available at https://github.com/zhuduowang/ChangeViT. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12640 [pdf]

Research and Implementation of Data Enhancement Techniques for Graph Neural Networks

Authors: **gzhao Gu, Haoyang Huang

Abstract: Data, algorithms, and arithmetic power are the three foundational conditions for deep learning to be effective in the application domain. Data is the focus for develo** deep learning algorithms. In practical engineering applications, some data are affected by the conditions under which more data cannot be obtained or the cost of obtaining data is too high, resulting in smaller data sets (general… ▽ More Data, algorithms, and arithmetic power are the three foundational conditions for deep learning to be effective in the application domain. Data is the focus for develo** deep learning algorithms. In practical engineering applications, some data are affected by the conditions under which more data cannot be obtained or the cost of obtaining data is too high, resulting in smaller data sets (generally several hundred to several thousand) and data sizes that are far smaller than the size of large data sets (tens of thousands). The above two methods are based on the original dataset to generate, in the case of insufficient data volume of the original data may not reflect all the real environment, such as the real environment of the light, silhouette and other information, if the amount of data is not enough, it is difficult to use a simple transformation or neural network generative model to generate the required data. The research in this paper firstly analyses the key points of the data enhancement technology of graph neural network, and at the same time introduces the composition foundation of graph neural network in depth, on the basis of which the data enhancement technology of graph neural network is optimized and analysed. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 7 pages, 7 figures, to be published in IEEE International Conference on Artificial Intelligence and Electromechanical Automation

arXiv:2406.12437 [pdf, other]

Slow rates of approximation of U-statistics and V-statistics by quadratic forms of Gaussians

Authors: Kevin Han Huang, Peter Orbanz

Abstract: We construct examples of degree-two U- and V-statistics of $n$ i.i.d.~heavy-tailed random vectors in $\mathbb{R}^{d(n)}$, whose $ν$-th moments exist for ${ν> 2}$, and provide tight bounds on the error of approximating both statistics by a quadratic form of Gaussians. In the case ${ν=3}$, the error of approximation is $Θ(n^{-1/12})$. The proof adapts a result of Huang, Austern and Orbanz [12] to U-… ▽ More We construct examples of degree-two U- and V-statistics of $n$ i.i.d.~heavy-tailed random vectors in $\mathbb{R}^{d(n)}$, whose $ν$-th moments exist for ${ν> 2}$, and provide tight bounds on the error of approximating both statistics by a quadratic form of Gaussians. In the case ${ν=3}$, the error of approximation is $Θ(n^{-1/12})$. The proof adapts a result of Huang, Austern and Orbanz [12] to U- and V-statistics. The lower bound for U-statistics is a simple example of the concept of variance domination used in [12]. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.12380 [pdf, other]

Search for fractionally charged particles with CUORE

Authors: CUORE Collaboration, D. Q. Adams, C. Alduino, K. Alfonso, F. T. Avignone III, O. Azzolini, G. Bari, F. Bellini, G. Benato, M. Beretta, M. Biassoni, A. Branca, C. Brofferio, C. Bucci, J. Camilleri, A. Caminata, A. Campani, J. Cao, S. Capelli, C. Capelli, L. Cappelli, L. Cardani, P. Carniti, N. Casali, E. Celi , et al. (95 additional authors not shown)

Abstract: The Cryogenic Underground Observatory for Rare Events (CUORE) is a detector array comprised by 988 5$\;$cm$\times$5$\;$cm$\times$5$\;$cm TeO$_2$ crystals held below 20 mK, primarily searching for neutrinoless double-beta decay in $^{130}$Te. Unprecedented in size amongst cryogenic calorimetric experiments, CUORE provides a promising setting for the study of exotic through-going particles. Using th… ▽ More The Cryogenic Underground Observatory for Rare Events (CUORE) is a detector array comprised by 988 5$\;$cm$\times$5$\;$cm$\times$5$\;$cm TeO$_2$ crystals held below 20 mK, primarily searching for neutrinoless double-beta decay in $^{130}$Te. Unprecedented in size amongst cryogenic calorimetric experiments, CUORE provides a promising setting for the study of exotic through-going particles. Using the first tonne-year of CUORE's exposure, we perform a search for hypothesized fractionally charged particles (FCPs), which are well-motivated by various Standard Model extensions and would have suppressed interactions with matter. No excess of FCP candidate tracks is observed over background, setting leading limits on the underground FCP flux with charges between $e/24-e/5$ at 90\% confidence level. Using the low background environment and segmented geometry of CUORE, we establish the sensitivity of tonne-scale sub-Kelvin detectors to diverse signatures of new physics. △ Less

Submitted 18 June, 2024; originally announced June 2024.

Comments: 7 pages, 5 figures

arXiv:2406.12100 [pdf, other]

Adaptive Uncertainty Quantification for Trajectory Prediction Under Distributional Shift

Authors: Huiqun Huang, Sihong He, Fei Miao

Abstract: Trajectory prediction models that can infer both finite future trajectories and their associated uncertainties of the target vehicles in an online setting (e.g., real-world application scenarios) is crucial for ensuring the safe and robust navigation and path planning of autonomous vehicle motion. However, the majority of existing trajectory prediction models have neither considered reducing the u… ▽ More Trajectory prediction models that can infer both finite future trajectories and their associated uncertainties of the target vehicles in an online setting (e.g., real-world application scenarios) is crucial for ensuring the safe and robust navigation and path planning of autonomous vehicle motion. However, the majority of existing trajectory prediction models have neither considered reducing the uncertainty as one objective during the training stage nor provided reliable uncertainty quantification during inference stage under potential distribution shift. Therefore, in this paper, we propose the Conformal Uncertainty Quantification under Distribution Shift framework, CUQDS, to quantify the uncertainty of the predicted trajectories of existing trajectory prediction models under potential data distribution shift, while considering improving the prediction accuracy of the models and reducing the estimated uncertainty during the training stage. Specifically, CUQDS includes 1) a learning-based Gaussian process regression module that models the output distribution of the base model (any existing trajectory prediction or time series forecasting neural networks) and reduces the estimated uncertainty by additional loss term, and 2) a statistical-based Conformal P control module to calibrate the estimated uncertainty from the Gaussian process regression module in an online setting under potential distribution shift between training and testing data. △ Less