Showing 1–50 of 8,348 results for author: Yan

Searching in archive cs.
  1. arXiv:2407.06168  [pdf, other]

    cs.RO cs.CV

    TARGO: Benchmarking Target-driven Object Grasping under Occlusions

    Authors: Yan Xia, Ran Ding, Ziyuan Qin, Guanqi Zhan, Kaichen Zhou, Long Yang, Hao Dong, Daniel Cremers

    Abstract: Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models face challenges in cluttered environments where nearby objects impact the target object's grasp. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contr… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 19 pages, 17 figures

  2. arXiv:2407.06153  [pdf, other]

    cs.SE cs.CL

    What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

    Authors: Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, Muling Wu, Mingxu Chai, Jessica Fan, Caishuang Huang, Yunbo Tao, Yan Liu, Enyu Zhou, Ming Zhang, Yuhao Zhou, Yueming Wu, Rui Zheng, Ming Wen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

    Abstract: The increasing development of large language models (LLMs) in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundar… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 17 pages, 7 figures

  3. arXiv:2407.06135  [pdf, other]

    cs.CL cs.AI cs.CV

    ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

    Authors: Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu

    Abstract: Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mi… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  4. arXiv:2407.06112  [pdf, other]

    cs.CL

    Enhancing Language Model Rationality with Bi-Directional Deliberation Reasoning

    Authors: Yadong Zhang, Shaoguang Mao, Wenshan Wu, Yan Xia, Tao Ge, Man Lan, Furu Wei

    Abstract: This paper introduces BI-Directional DEliberation Reasoning (BIDDER), a novel reasoning approach to enhance the decision rationality of language models. Traditional reasoning methods typically rely on historical information and employ uni-directional (left-to-right) reasoning strategy. This lack of bi-directional deliberation reasoning results in limited awareness of potential future outcomes and… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  5. arXiv:2407.06027  [pdf, other]

    cs.CL

    PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

    Authors: Miao Zheng, Hao Liang, Fan Yang, Haoze Sun, Tianpeng Li, Lingchu Xiong, Yan Zhang, Yozhen Wu, Kun Li, Yanjun Sheng, Mingan Lin, Tao Zhang, Guosheng Dong, Yujing Qiao, Kun Fang, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou

    Abstract: In recent years, the rise of Large Language Models (LLMs) has spurred a growing demand for plug-and-play AI systems. Among the various AI techniques, prompt engineering stands out as particularly significant. However, users often face challenges in writing prompts due to the steep learning curve and significant time investment, and existing automatic prompt engineering (APE) models can be difficul… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  6. arXiv:2407.05810  [pdf, other]

    cs.AI cs.HC

    Integrating AI in College Education: Positive yet Mixed Experiences with ChatGPT

    Authors: Xinrui Song, Jia** Zhang, **kun Yan, Juergen Hahn, Uwe Kruger, Hisham Mohamed, Ge Wang

    Abstract: The integration of artificial intelligence (AI) chatbots into higher education marks a shift towards a new generation of pedagogical tools, mirroring the arrival of milestones like the internet. With the launch of ChatGPT-4 Turbo in November 2023, we developed a ChatGPT-based teaching application (https://chat.openai.com/g/g-1imx1py4K-chatge-medical-imaging) and integrated it into our undergraduat… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  7. arXiv:2407.05771  [pdf, other]

    cs.CV

    Multi-times Monte Carlo Rendering for Inter-reflection Reconstruction

    Authors: Tengjie Zhu, Zhuo Chen, Jingnan Gao, Yichao Yan, Xiaokang Yang

    Abstract: Inverse rendering methods have achieved remarkable performance in reconstructing high-fidelity 3D objects with disentangled geometries, materials, and environmental light. However, they still face huge challenges in reflective surface reconstruction. Although recent methods model the light trace to learn specularity, the ignorance of indirect illumination makes it hard to handle inter-reflections… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 10 pages, 6 figures, submitted to NeurIPS 2024

  8. arXiv:2407.05750  [pdf, other]

    cs.CL

    Large Language Models Understand Layouts

    Authors: Weiming Li, Manni Duan, Dong An, Yan Shao

    Abstract: Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks. In this paper, we show that, beyond text understanding capability, LLMs are capable of processing text layouts that are denoted by spatial markers. They are able to answer questions that require explicit spatial perceiving and reasoning, while a drastic performance drop is o… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  9. arXiv:2407.05688  [pdf]

    cs.CV cs.AI

    Learning with Alignments: Tackling the Inter- and Intra-domain Shifts for Cross-multidomain Facial Expression Recognition

    Authors: Yuxiang Yang, Lu Wen, Xinyi Zeng, Yuanyuan Xu, Xi Wu, Jiliu Zhou, Yan Wang

    Abstract: Facial Expression Recognition (FER) holds significant importance in human-computer interactions. Existing cross-domain FER methods often transfer knowledge solely from a single labeled source domain to an unlabeled target domain, neglecting the comprehensive information across multiple sources. Nevertheless, cross-multidomain FER (CMFER) is very challenging for (i) the inherent inter-domain shifts… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  10. arXiv:2407.05682  [pdf, other]

    cs.CL

    Retrieved In-Context Principles from Previous Mistakes

    Authors: Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang

    Abstract: In-context learning (ICL) has been instrumental in adapting Large Language Models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Prin… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  11. arXiv:2407.05639  [pdf]

    cs.LG cs.CR

    Deep Learning-based Anomaly Detection and Log Analysis for Computer Networks

    Authors: Shuzhan Wang, Ruxue Jiang, Zhaoqi Wang, Yan Zhou

    Abstract: Computer network anomaly detection and log analysis, as an important topic in the field of network security, has been a key task to ensure network security and system reliability. First, existing network anomaly detection and log analysis methods are often challenged by high-dimensional data and complex network topologies, resulting in unstable performance and high false-positive rates. In additio… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 38 pages

  12. Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection

    Authors: Zhenchun Lei, Hui Yan, Changhong Liu, Minglei Ma, Yingen Yang

    Abstract: The automatic speaker verification system is sometimes vulnerable to various spoofing attacks. The 2-class Gaussian Mixture Model classifier for genuine and spoofed speech is usually used as the baseline for spoofing detection. However, the GMM classifier does not separately consider the scores of feature frames on each Gaussian component. In addition, the GMM accumulates the scores on all frames… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  13. arXiv:2407.05407  [pdf, other]

    cs.SD cs.AI

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  14. arXiv:2407.05396  [pdf, other]

    cs.CR cs.AI

    Evolutionary Trigger Detection and Lightweight Model Repair Based Backdoor Defense

    Authors: Qi Zhou, Zipeng Ye, Yubo Tang, Wenjian Luo, Yuhui Shi, Yan Jia

    Abstract: Deep Neural Networks (DNNs) have been widely used in many areas such as autonomous driving and face recognition. However, DNN model is fragile to backdoor attack. A backdoor in the DNN model can be activated by a poisoned input with trigger and leads to wrong prediction, which causes serious security issues in applications. It is challenging for current defenses to eliminate the backdoor effective… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: 13 pages, 9 figures

  15. arXiv:2407.05355  [pdf, other]

    cs.CV cs.CL

    VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool

    Authors: Yan Wang, Yawen Zeng, **gsheng Zheng, Xiaofen Xing, ** Xu, Xiangmin Xu

    Abstract: Multimodal large language models (MLLMs) are flourishing, but mainly focus on images with less attention than videos, especially in sub-fields such as prompt engineering, video chain-of-thought (CoT), and instruction tuning on videos. Therefore, we try to explore the collection of CoT datasets in videos to lead to video OpenQA and improve the reasoning ability of MLLMs. Unfortunately, making such… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: ACL 2024 Workshop

  16. arXiv:2407.05312  [pdf, other]

    cs.CV

    An Improved Method for Personalizing Diffusion Models

    Authors: Yan Zeng, Masanori Suganuma, Takayuki Okatani

    Abstract: Diffusion models have demonstrated impressive image generation capabilities. Personalized approaches, such as textual inversion and Dreambooth, enhance model individualization using specific images. These methods enable generating images of specific objects based on diverse textual contexts. Our proposed approach aims to retain the model's original knowledge during new information integration, res… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  17. arXiv:2407.05302  [pdf, other]

    cs.LG stat.ML

    Mamba Hawkes Process

    Authors: Anningzhe Gao, Shan Dai, Yan Hu

    Abstract: Irregular and asynchronous event sequences are prevalent in many domains, such as social media, finance, and healthcare. Traditional temporal point processes (TPPs), like Hawkes processes, often struggle to model mutual inhibition and nonlinearity effectively. While recent neural network models, including RNNs and Transformers, address some of these issues, they still face challenges with long-ter… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  18. arXiv:2407.05267  [pdf, other]

    cs.CV

    DTR: A Unified Deep Tensor Representation Framework for Multimedia Data Recovery

    Authors: Ting-Wei Zhou, Xi-Le Zhao, Jian-Li Wang, Yi-Si Luo, Min Wang, Xiao-Xuan Bai, Hong Yan

    Abstract: Recently, the transform-based tensor representation has attracted increasing attention in multimedia data (e.g., images and videos) recovery problems, which consists of two indispensable components, i.e., transform and characterization. Previously, the development of transform-based tensor representation mainly focuses on the transform aspect. Although several attempts consider using shallow matri… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

  19. arXiv:2407.05246  [pdf, other]

    cs.LG cs.CV

    Deep Probability Aggregation Clustering

    Authors: Yuxuan Yan, Na Lu, Ruofan Yan

    Abstract: Combining machine clustering with deep models has shown remarkable superiority in deep clustering. It modifies the data processing pipeline into two alternating phases: feature clustering and model training. However, such alternating schedule may lead to instability and computational burden issues. We propose a centerless clustering algorithm called Probability Aggregation Clustering (PAC) to proa… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

    Comments: 19 pages, 2 figures, conference

  20. arXiv:2407.05233  [pdf, other]

    cs.CL cs.AI

    Advancing Prompt Recovery in NLP: A Deep Dive into the Integration of Gemma-2b-it and Phi2 Models

    Authors: Jianlong Chen, Wei Xu, Zhicheng Ding, **xin Xu, Hao Yan, Xinyu Zhang

    Abstract: Prompt recovery, a crucial task in natural language processing, entails the reconstruction of prompts or instructions that language models use to convert input text into a specific output. Although pivotal, the design and effectiveness of prompts represent a challenging and relatively untapped field within NLP research. This paper delves into an exhaustive investigation of prompt recovery methodol… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

  21. arXiv:2407.05047  [pdf, other]

    cs.AI

    MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning

    Authors: Min Zhang, Jianye Hao, Xian Fu, Peilong Han, Hao Zhang, Lei Shi, Hongyao Tang, Yan Zheng

    Abstract: In recent years, Multi-modal Foundation Models (MFMs) and Embodied Artificial Intelligence (EAI) have been advancing side by side at an unprecedented pace. The integration of the two has garnered significant attention from the AI research community. In this work, we attempt to provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning, aiming to shed lig… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

  22. arXiv:2407.05021  [pdf, other]

    cs.CV

    Incremental Multiview Point Cloud Registration

    Authors: Xiaoya Cheng, Yu Liu, Maojun Zhang, Shen Yan

    Abstract: In this paper, we present a novel approach for multiview point cloud registration. Different from previous researches that typically employ a global scheme for multiview registration, we propose to adopt an incremental pipeline to progressively align scans into a canonical coordinate system. Specifically, drawing inspiration from image-based 3D reconstruction, our approach first builds a sparse sc… ▽ More

    Submitted 6 July, 2024; originally announced July 2024.

  23. arXiv:2407.04942  [pdf, other]

    cs.RO cs.LG

    FOSP: Fine-tuning Offline Safe Policy through World Models

    Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang

    Abstract: Model-based Reinforcement Learning (RL) has shown its high training efficiency and capability of handling high-dimensional tasks. Regarding safety issues, safe model-based RL can achieve nearly zero-cost performance and effectively manage the trade-off between performance and safety. Nevertheless, prior works still pose safety challenges due to the online exploration in real-world deployment. To a… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: 21 pages

  24. arXiv:2407.04903  [pdf, other]

    cs.CL cs.AI cs.CV

    MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension

    Authors: Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, Linda Ruth Petzold, Stephen D. Wilson, Woosang Lim, William Yang Wang

    Abstract: The rapid advancement of Large Language Models (LLMs) and Large Multimodal Models (LMMs) has heightened the demand for AI-based scientific assistants capable of understanding scientific articles and figures. Despite progress, there remains a significant gap in evaluating models' comprehension of professional, graduate-level, and even PhD-level scientific content. Current datasets and benchmarks pr… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Code and data are available at https://github.com/Leezekun/MMSci

  25. arXiv:2407.04362  [pdf, other]

    cs.CV cs.HC

    Towards Context-aware Support for Color Vision Deficiency: An Approach Integrating LLM and AR

    Authors: Shogo Morita, Yan Zhang, Takuto Yamauchi, Sinan Chen, Jialong Li, Kenji Tei

    Abstract: People with color vision deficiency often face challenges in distinguishing colors such as red and green, which can complicate daily tasks and require the use of assistive tools or environmental adjustments. Current support tools mainly focus on presentation-based aids, like the color vision modes found in iPhone accessibility settings. However, offering context-aware support, like indicating the… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  26. arXiv:2407.04305  [pdf, other]

    cs.CV

    Towards Stable 3D Object Detection

    Authors: Jiabao Wang, Qiang Meng, Guochao Liu, Liujiang Yan, Ke Wang, Ming-Ming Cheng, Qibin Hou

    Abstract: In autonomous driving, the temporal stability of 3D object detection greatly impacts the driving safety. However, the detection stability cannot be accessed by existing metrics such as mAP and MOTA, and consequently is less explored by the community. To bridge this gap, this work proposes Stability Index (SI), a new metric that can comprehensively evaluate the stability of 3D detectors in terms of… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  27. arXiv:2407.04292  [pdf, other]

    cs.AR cs.RO

    Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design

    Authors: Yiyang Huang, Yuhui Hao, Bo Yu, Feng Yan, Yuxin Yang, Feng Min, Yinhe Han, Lin Ma, Shaoshan Liu, Qiang Liu, Yiming Gan

    Abstract: Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely based on the interest of algorithm developers, where robot act… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  28. arXiv:2407.04242  [pdf, other]

    cs.CV

    Fine-grained Context and Multi-modal Alignment for Freehand 3D Ultrasound Reconstruction

    Authors: Zhongnuo Yan, Xin Yang, Mingyuan Luo, Jiongquan Chen, Rusi Chen, Lian Liu, Dong Ni

    Abstract: Fine-grained spatio-temporal learning is crucial for freehand 3D ultrasound reconstruction. Previous works mainly resorted to the coarse-grained spatial features and the separated temporal dependency learning and struggles for fine-grained spatio-temporal learning. Mining spatio-temporal information in fine-grained scales is extremely challenging due to learning difficulties in long-range dependen… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Accepted at MICCAI 2024. This is the submitted manuscript and the preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections

  29. arXiv:2407.04237  [pdf, other]

    cs.CV cs.GR

    GSD: View-Guided Gaussian Splatting Diffusion for 3D Reconstruction

    Authors: Yuxuan Mu, Xinxin Zuo, Chuan Guo, Yilin Wang, Juwei Lu, Xiaofeng Wu, Songcen Xu, Peng Dai, Youliang Yan, Li Cheng

    Abstract: We present GSD, a diffusion model approach based on Gaussian Splatting (GS) representation for 3D object reconstruction from a single view. Prior works suffer from inconsistent 3D geometry or mediocre rendering quality due to improper representations. We take a step towards resolving these shortcomings by utilizing the recent state-of-the-art 3D explicit representation, Gaussian Splatting, and an… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted for ECCV 2024

  30. arXiv:2407.04031  [pdf]

    cs.CE

    Towards reproducible machine learning-based process monitoring and quality prediction research for additive manufacturing

    Authors: Jiarui Xie, Mutahar Safdar, Andrei Mircea, Yan Lu, Hyunwoong Ko, Zhuo Yang, Yaoyao Fiona Zhao

    Abstract: Machine learning (ML)-based monitoring systems have been extensively developed to enhance the print quality of additive manufacturing (AM). In-situ and in-process data acquired using sensors can be used to train ML models that detect process anomalies, predict part quality, and adjust process parameters. However, the reproducibility of the proposed AM monitoring systems has not been investigated.… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: 13 pages, 6 figures, 2 tables. This paper has been accepted to be published in the proceedings of IDETC-CIE 2024

  31. arXiv:2407.03699  [pdf, other]

    cs.CV

    Generalized Robust Fundus Photography-based Vision Loss Estimation for High Myopia

    Authors: Zipei Yan, Zhile Liang, Zhengji Liu, Shuai Wang, Rachel Ka-Man Chun, Jizhou Li, Chea-su Kee, Dong Liang

    Abstract: High myopia significantly increases the risk of irreversible vision loss. Traditional perimetry-based visual field (VF) assessment provides systematic quantification of visual loss but it is subjective and time-consuming. Consequently, machine learning models utilizing fundus photographs to estimate VF have emerged as promising alternatives. However, due to the high variability and the limited ava… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted by MICCAI 2024, code: https://github.com/yanzipei/VF_RED

  32. arXiv:2407.03658  [pdf, other]

    cs.CL

    GPT-4 vs. Human Translators: A Comprehensive Evaluation of Translation Quality Across Languages, Domains, and Expertise Levels

    Authors: Jianhao Yan, **chuan Yan, Yulong Chen, Judy Li, Xianchao Zhu, Yue Zhang

    Abstract: This study comprehensively evaluates the translation quality of Large Language Models (LLMs), specifically GPT-4, against human translators of varying expertise levels across multiple language pairs and domains. Through carefully designed annotation rounds, we find that GPT-4 performs comparably to junior translators in terms of total errors made but lags behind medium and senior translators. We a… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  33. arXiv:2407.03625  [pdf, other]

    cs.SE

    Augmenting LLMs to Repair Obsolete Test Cases with Static Collector and Neural Reranker

    Authors: Jun Liu, Jiwei Yan, Yuanyuan Xie, Jun Yan, Jian Zhang

    Abstract: During software evolution, it is advocated that test code should co-evolve with production code. In real development scenarios, test updating may lag behind production code changing, which may cause the project to fail to compile or bring other troubles. Existing techniques based on pre-trained language models can be adopted to repair obsolete tests caused by such unsynchronized code changes, espe… ▽ More

    Submitted 4 July, 2024; originally announced July 2024.

  34. arXiv:2407.03595  [pdf, other]

    econ.GN cs.LG

    Machine Learning for Economic Forecasting: An Application to China's GDP Growth

    Authors: Yanqing Yang, Xingcheng Xu, **feng Ge, Yan Xu

    Abstract: This paper aims to explore the application of machine learning in forecasting Chinese macroeconomic variables. Specifically, it employs various machine learning models to predict the quarterly real GDP growth of China, and analyzes the factors contributing to the performance differences among these models. Our findings indicate that the average forecast errors of machine learning models are genera… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  35. arXiv:2407.03320  [pdf, other]

    cs.CV cs.CL

    InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    Authors: Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, et al. (2 additional authors not shown)

    Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. Th… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Technical Report. https://github.com/InternLM/InternLM-XComposer

  36. arXiv:2407.03314  [pdf, other]

    cs.CV cs.CL cs.DB

    BACON: Supercharge Your VLM with Bag-of-Concept Graph to Mitigate Hallucinations

    Authors: Zhantao Yang, Ruili Feng, Keyu Yan, Huangji Wang, Zhicai Wang, Shangwen Zhu, Han Zhang, Jie Xiao, **yu Wu, Kai Zhu, Jixuan Chen, Chen-Wei Xie, Chaojie Mao, Yue Yang, Hongyang Zhang, Yu Liu, Fan Cheng

    Abstract: This paper presents Bag-of-Concept Graph (BACON) to gift models with limited linguistic abilities to taste the privilege of Vision Language Models (VLMs) and boost downstream tasks such as detection, visual question answering (VQA), and image generation. Since the visual scenes in physical worlds are structured with complex relations between objects, BACON breaks down annotations into basic minimu… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  37. arXiv:2407.03251  [pdf, other]

    cs.CV

    ACTRESS: Active Retraining for Semi-supervised Visual Grounding

    Authors: Weitai Kang, Mengxue Qu, Yunchao Wei, Yan Yan

    Abstract: Semi-Supervised Visual Grounding (SSVG) is a new challenge for its sparse labeled data with the need for multimodel understanding. A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision. However, this approach is incompatible with current state-of-the-art visual gro… ▽ More

    Submitted 6 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  38. arXiv:2407.03243  [pdf, other]

    cs.CV

    Visual Grounding with Attention-Driven Constraint Balancing

    Authors: Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan

    Abstract: Unlike Object Detection, Visual Grounding task necessitates the detection of an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt transformer-based models to fuse features from both modalities, further introducing various modules that modulate visual features to align with the language exp… ▽ More

    Submitted 6 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  39. arXiv:2407.03217  [pdf, other]

    cs.CV

    MHNet: Multi-view High-order Network for Diagnosing Neurodevelopmental Disorders Using Resting-state fMRI

    Authors: Yueyang Li, Weiming Zeng, Wenhao Dong, Luhui Cai, Lei Wang, Hongyu Chen, Hongjie Yan, Lingbin Bian, Nizhuan Wang

    Abstract: Background: Deep learning models have shown promise in diagnosing neurodevelopmental disorders (NDD) like ASD and ADHD. However, many models either use graph neural networks (GNN) to construct single-level brain functional networks (BFNs) or employ spatial convolution filtering for local information extraction from rs-fMRI data, often neglecting high-order features crucial for NDD classification.… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: 18 pages

  40. arXiv:2407.03200  [pdf, other]

    cs.CV

    SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

    Authors: Weitai Kang, Gaowen Liu, Mubarak Shah, Yan Yan

    Abstract: Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results, their passive utilization of annotation, i.e. the sole use of the box annotation as regression ground truth, results in a suboptimal performance. In this paper,… ▽ More

    Submitted 6 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  41. arXiv:2407.03188  [pdf, other]

    cs.SD cs.AI cs.MM eess.AS

    MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation

    Authors: Zihao Wang, Haoxuan Liu, Jiaxing Yu, Tao Zhang, Yan Liu, Kejun Zhang

    Abstract: Amid the rising intersection of generative AI and human artistic processes, this study probes the critical yet less-explored terrain of alignment in human-centric automatic song composition. We propose a novel task of Colloquial Description-to-Song Generation, which focuses on aligning the generated content with colloquial human expressions. This task is aimed at bridging the gap between colloquia… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: 19 pages, 5 figures

    MSC Class: 68Txx(Primary)14F05; 91Fxx(Secondary) ACM Class: I.2.7; J.5

  42. arXiv:2407.03135  [pdf, other]

    cs.SD cs.AI cs.HC eess.AS

    GMM-ResNext: Combining Generative and Discriminative Models for Speaker Verification

    Authors: Hui Yan, Zhenchun Lei, Changhong Liu, Yong Zhou

    Abstract: With the development of deep learning, many different network architectures have been explored in speaker verification. However, most network architectures rely on a single deep learning architecture, and hybrid networks combining different architectures have been little studied in ASV tasks. In this paper, we propose the GMM-ResNext model for speaker verification. Conventional GMM does not consid… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

  43. Efficient IoT Devices Localization Through Wi-Fi CSI Feature Fusion and Anomaly Detection

    Authors: Yan Li, Jie Yang, Shang-Ling Shih, Wan-Ting Shih, Chao-Kai Wen, Shi Jin

    Abstract: Internet of Things (IoT) device localization is fundamental to smart home functionalities, including indoor navigation and tracking of individuals. Traditional localization relies on relative methods utilizing the positions of anchors within a home environment, yet struggles with precision due to inherent inaccuracies in these anchor positions. In response, we introduce a cutting-edge smartphone-b… ▽ More

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted in IEEE Internet of Things Journal, Early Access, 2024

    Journal ref: IEEE Internet of Things Journal, Early Access, 2024

  44. arXiv:2407.02888  [pdf, ps, other]

    cs.LG cs.AI

    Joint Optimization of Resource Allocation and Data Selection for Fast and Cost-Efficient Federated Edge Learning

    Authors: Yunjian Jia, Zhen Huang, Jiping Yan, Yulu Zhang, Kun Luo, Wanli Wen

    Abstract: Deploying federated learning at the wireless edge introduces federated edge learning (FEEL). Given FEEL's limited communication resources and potential mislabeled data on devices, improper resource allocation or data selection can hurt convergence speed and increase training costs. Thus, to realize an efficient FEEL system, this paper emphasizes jointly optimizing resource allocation and data sele…

    Submitted 3 July, 2024; originally announced July 2024.

  45. arXiv:2407.02842  [pdf, other]

    cs.CV cs.AI cs.CL

    MindBench: A Comprehensive Benchmark for Mind Map Structure Recognition and Analysis

    Authors: Lei Chen, Feng Yan, Yujie Zhong, Shaoxiang Chen, Zequn Jie, Lin Ma

    Abstract: Multimodal Large Language Models (MLLM) have made significant progress in the field of document analysis. Despite this, existing benchmarks typically focus only on extracting text and simple layout information, neglecting the complex interactions between elements in structured documents such as mind maps and flowcharts. To address this issue, we introduce the new benchmark named MindBench, which n…

    Submitted 3 July, 2024; originally announced July 2024.

    Comments: technical report

  46. arXiv:2407.02803  [pdf, other]

    cs.DB

    KnobCF: Uncertainty-aware Knob Tuning

    Authors: Yu Yan, Junfang Huang, Hongzhi Wang, Jian Geng, Kaixin Zhang, Tao Yu

    Abstract: Knob tuning aims to optimize database performance by searching for the most effective knob configuration under a certain workload. Existing works suffer from two significant problems. On the one hand, there exist multiple similar, even useless, evaluations of knob tuning even with the diverse searching methods because of the different sensitivities of knobs on a certain workload. On the other hand, t…

    Submitted 3 July, 2024; originally announced July 2024.

  47. arXiv:2407.02765  [pdf, ps, other]

    eess.SY cs.AI math.OC math.PR

    Graphon Particle Systems, Part II: Dynamics of Distributed Stochastic Continuum Optimization

    Authors: Yan Chen, Tao Li

    Abstract: We study the distributed optimization problem over a graphon with a continuum of nodes, which is regarded as the limit of the distributed networked optimization as the number of nodes goes to infinity. Each node has a private local cost function. The global cost function, which all nodes cooperatively minimize, is the integral of the local cost functions on the node set. We propose stochastic grad…

    Submitted 2 July, 2024; originally announced July 2024.

  48. arXiv:2407.02539  [pdf]

    cs.RO cs.AI cs.LG stat.ML

    Research on Autonomous Robots Navigation based on Reinforcement Learning

    Authors: Zixiang Wang, Hao Yan, Yining Wang, Zhengjia Xu, Zhuoyue Wang, Zhizhong Wu

    Abstract: Reinforcement learning continuously optimizes decision-making based on real-time feedback reward signals through continuous interaction with the environment, demonstrating strong adaptive and self-learning capabilities. In recent years, it has become one of the key methods to achieve autonomous navigation of robots. In this work, an autonomous robot navigation method based on reinforcement learnin…

    Submitted 8 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

  49. arXiv:2407.02483  [pdf, other]

    cs.CL cs.AI

    MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

    Authors: Binxu Li, Tiankai Yan, Yuanting Pan, Zhe Xu, Jie Luo, Ruiyang Ji, Shilong Liu, Haoyu Dong, Zihao Lin, Yixin Wang

    Abstract: Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To brid…

    Submitted 2 July, 2024; originally announced July 2024.

  50. arXiv:2407.02301  [pdf, other]

    cs.CL

    CFinBench: A Comprehensive Chinese Financial Benchmark for Large Language Models

    Authors: Ying Nie, Binwei Yan, Tianyu Guo, Hao Liu, Haoyu Wang, Wei He, Binfan Zheng, Weihao Wang, Qiang Li, Weijian Sun, Yunhe Wang, Dacheng Tao

    Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks, yet their potential in more challenging and domain-specific tasks, such as finance, has not been fully explored. In this paper, we present CFinBench, a meticulously crafted evaluation benchmark, the most comprehensive to date, for assessing the financial knowledge of LLMs in a Chinese context. In practice, to b…

    Submitted 2 July, 2024; originally announced July 2024.