Skip to main content

Showing 1–50 of 439 results for author: Lin, D

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.18926  [pdf, other

    cs.LG

    Fine-tuned network relies on generic representation to solve unseen cognitive task

    Authors: Dongyan Lin

    Abstract: Fine-tuning pretrained language models has shown promising results on a wide range of tasks, but when encountering a novel task, do they rely more on generic pretrained representation, or develop brand new task-specific solutions? Here, we fine-tuned GPT-2 on a context-dependent decision-making task, novel to the model but adapted from neuroscience literature. We compared its performance and inter… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2406.15486  [pdf, other

    cs.CL cs.AI cs.LG

    SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

    Authors: Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Xiuhong Li, Guanyu Feng, Xin Lv, Huanqi Cao, Xiao Chuanfu, Xingcheng Zhang, Dahua Lin, Chao Yang

    Abstract: Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for nea… ▽ More

    Submitted 28 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2406.14887  [pdf, other

    cs.CL

    InternLM-Law: An Open Source Chinese Legal Large Language Model

    Authors: Zhiwei Fei, Songyang Zhang, Xiaoyu Shen, Dawei Zhu, Xiao Wang, Maosong Cao, Fengzhe Zhou, Yining Li, Wenwei Zhang, Dahua Lin, Kai Chen, Jidong Ge

    Abstract: While large language models (LLMs) have showcased impressive capabilities, they struggle with addressing legal queries due to the intricate complexities and specialized expertise required in the legal field. In this paper, we introduce InternLM-Law, a specialized LLM tailored for addressing diverse legal queries related to Chinese laws, spanning from responding to standard legal questions (e.g., l… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Our dataset, code and models will be released at https://github.com/InternLM/InternLM-Law

  4. arXiv:2406.14544  [pdf, other

    cs.CV cs.CL

    Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

    Authors: Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

    Abstract: Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an in… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  5. arXiv:2406.14515  [pdf, other

    cs.CV cs.MM

    MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

    Authors: Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen

    Abstract: The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Vide… ▽ More

    Submitted 20 June, 2024; originally announced June 2024.

  6. arXiv:2406.12753  [pdf, other

    cs.CL cs.AI

    OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

    Authors: Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, Yikai Zhang, Yuqing Yang, Ting Wu, Binjie Wang, Shichao Sun, Yang Xiao, Yiyuan Li, Fan Zhou, Steffi Chern, Yiwei Qin, Yan Ma, Jiadi Su, Yixiu Liu, Yuxiang Zheng, Shaoting Zhang , et al. (3 additional authors not shown)

    Abstract: The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoni… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 44 pages

  7. arXiv:2406.11833  [pdf, other

    cs.CV cs.AI cs.LG

    MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

    Authors: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history wit… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: This project is available at https://github.com/Liuziyu77/MMDU

  8. arXiv:2406.11739  [pdf, other

    cs.CV

    V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results

    Authors: Jiaqi Wang, Yuhang Zang, Pan Zhang, Tao Chu, Yuhang Cao, Zeyi Sun, Ziyu Liu, Xiaoyi Dong, Tong Wu, Dahua Lin, Zeming Chen, Zhi Wang, Lingchen Meng, Wenhao Yao, Jianwei Yang, Sihong Wu, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou , et al. (9 additional authors not shown)

    Abstract: Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

  9. arXiv:2406.11340  [pdf, other

    cs.CV cs.LG

    CM2-Net: Continual Cross-Modal Map** Network for Driver Action Recognition

    Authors: Ruoyu Wang, Chen Cai, Wenqian Wang, Jianjun Gao, Dan Lin, Wenyang Liu, Kim-Hui Yap

    Abstract: Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested indep… ▽ More

    Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  10. arXiv:2406.09401  [pdf, other

    cs.CV cs.AI cs.RO

    MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

    Authors: Ruiyuan Lyu, Tai Wang, **gli Lin, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang

    Abstract: With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and makes rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Follow-up of EmbodiedScan. A multi-modal 3D dataset with the most-ever comprehensive language annotations for 3D-LLMs. Project page: https://tai-wang.github.io/mmscan/

  11. arXiv:2406.08418  [pdf, other

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang **, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang , et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an… ▽ More

    Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  12. arXiv:2406.04854  [pdf, other

    cs.CL

    Uncertainty Aware Learning for Language Model Alignment

    Authors: Yikun Wang, Rui Zheng, Liang Ding, Qi Zhang, Dahua Lin, Dacheng Tao

    Abstract: As instruction-tuned large language models (LLMs) evolve, aligning pretrained foundation models presents increasing challenges. Existing alignment strategies, which typically leverage diverse and high-quality data sources, often overlook the intrinsic uncertainty of tasks, learning all data samples equally. This may lead to suboptimal data efficiency and model performance. In response, we propose… ▽ More

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: ACL 2024

  13. arXiv:2406.04325  [pdf, other

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, **song Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  14. arXiv:2406.03847  [pdf, other

    cs.CL

    Lean Workbook: A large-scale Lean problem set formalized from natural language math problems

    Authors: Huaiyuan Ying, Zijian Wu, Yihan Geng, Jiayu Wang, Dahua Lin, Kai Chen

    Abstract: Large language models have demonstrated impressive capabilities across various natural language processing tasks, especially in solving mathematical problems. However, large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. To address this issue, we propos… ▽ More

    Submitted 7 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  15. arXiv:2406.01349  [pdf, other

    cs.CV

    Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

    Authors: Enhui Ma, Lijun Zhou, Tao Tang, Zhan Zhang, Dong Han, Junpeng Jiang, Kun Zhan, Peng Jia, Xianpeng Lang, Haiyang Sun, Di Lin, Kaicheng Yu

    Abstract: Using generative models to synthesize new data has become a de-facto standard in autonomous driving to address the data scarcity issue. Though existing approaches are able to boost perception models, we discover that these approaches fail to improve the performance of planning of end-to-end autonomous driving models as the generated videos are usually less than 8 frames and the spatial and tempora… ▽ More

    Submitted 6 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Project Page: https://westlake-autolab.github.io/delphi.github.io/, 8 figures

  16. arXiv:2406.00917  [pdf, other

    cs.CV

    Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark

    Authors: Kunpeng Wang, Danying Lin, Chenglong Li, Zhengzheng Tu, Bin Luo

    Abstract: RGB and Thermal (RGBT) Salient Object Detection (SOD) aims to achieve high-quality saliency prediction by exploiting the complementary information of visible and thermal image pairs, which are initially captured in an unaligned manner. However, existing methods are tailored for manually aligned image pairs, which are labor-intensive, and directly applying these methods to original unaligned image… ▽ More

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: Accepted by TMM 2024

  17. arXiv:2406.00093  [pdf, other

    cs.CV cs.AI cs.GR cs.LG cs.MM

    Bootstrap3D: Improving 3D Content Creation with Synthetic Data

    Authors: Zeyi Sun, Tong Wu, Pan Zhang, Yuhang Zang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically… ▽ More

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: Project Page: https://sunzey.github.io/Bootstrap3D/

  18. arXiv:2406.00023  [pdf, other

    cs.CL

    LocMoE+: Enhanced Router with Token Feature Awareness for Efficient LLM Pre-Training

    Authors: **g Li, Zhijie Sun, Dachao Lin, Xuan He, Yi Lin, Binfan Zheng, Li Zeng, Rongqian Zhao, Xin Chen

    Abstract: Mixture-of-Experts (MoE) architectures have recently gained increasing popularity within the domain of large language models (LLMs) due to their ability to significantly reduce training and inference overhead. However, MoE architectures face challenges, such as significant disparities in the number of tokens assigned to each expert and a tendency toward homogenization among experts, which adversel… ▽ More

    Submitted 23 May, 2024; originally announced June 2024.

  19. arXiv:2405.20330  [pdf, other

    cs.CV cs.AI cs.GR

    4DHands: Reconstructing Interactive Hands in 4D with Transformers

    Authors: Dixuan Lin, Yuxiang Zhang, Mengcheng Li, Yebin Liu, Wei **g, Qi Yan, Qianying Wang, Hongwen Zhang

    Abstract: In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a transforme… ▽ More

    Submitted 31 May, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

    Comments: More demo videos can be seen at our project page: https://4dhands.github.io

  20. arXiv:2405.20315  [pdf, other

    cs.CL cs.AI

    ANAH: Analytical Annotation of Hallucinations in Large Language Models

    Authors: Ziwei Ji, Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen

    Abstract: Reducing the `$\textit{hallucination}$' problem of Large Language Models (LLMs) is crucial for their wide applications. A comprehensive and fine-grained measurement of the hallucination is the first key step for the governance of this issue but is under-explored in the community. Thus, we present $\textbf{ANAH}$, a bilingual dataset that offers $\textbf{AN}$alytical $\textbf{A}$nnotation of… ▽ More

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: Accepted by ACL 2024

  21. arXiv:2405.19265  [pdf, other

    cs.CL

    AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

    Authors: Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao

    Abstract: Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality and diversity, which may insufficiently elicit the potential of pre-trained Code LLMs. In this paper, we present AlchemistCoder, a series of Code LLMs with enh… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Preprint with 20 pages and 20 figures. Source code and models at https://github.com/InternLM/AlchemistCoder

  22. arXiv:2405.18315  [pdf, other

    cs.AI cs.PL

    DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data

    Authors: Bin Wang, Linke Ouyang, Fan Wu, Wenchang Ning, Xiao Han, Zhiyuan Zhao, Jiahui Peng, Yiying Jiang, Dahua Lin, Conghui He

    Abstract: In the era of artificial intelligence, the diversity of data modalities and annotation formats often renders data unusable directly, requiring understanding and format conversion before it can be used by researchers or developers with different needs. To tackle this problem, this article introduces a framework called Dataset Description Language (DSDL) that aims to simplify dataset processing by p… ▽ More

    Submitted 28 May, 2024; originally announced May 2024.

  23. arXiv:2405.16009  [pdf, other

    cs.CV

    Streaming Long Video Understanding with Large Language Models

    Authors: Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang

    Abstract: This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for video understanding, that capably understands arbitrary-length video with a constant number of video tokens streamingly encoded and adaptively selected. The challenge of video understanding in the vision language area mainly lies in the significant computational burden caused by the great number of tokens extrac… ▽ More

    Submitted 24 May, 2024; originally announced May 2024.

  24. arXiv:2405.12538  [pdf, other

    cs.CV

    Bridging the Intent Gap: Knowledge-Enhanced Visual Generation

    Authors: Yi Cheng, Ziwei Xu, Dongyun Lin, Harry Cheng, Yongkang Wong, Ying Sun, Joo Hwee Lim, Mohan Kankanhalli

    Abstract: For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not fully captured by input prompts. The absence of such details makes it challenging for generative models to accurately reflect the intended meaning, leadi… ▽ More

    Submitted 21 May, 2024; originally announced May 2024.

  25. arXiv:2405.12209  [pdf, other

    cs.CL

    MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark

    Authors: Hongwei Liu, Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen

    Abstract: Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short in providing a holistic assessment of the LLMs' math capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large la… ▽ More

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Project: https://github.com/open-compass/MathBench

  26. arXiv:2405.11190  [pdf, other

    cs.CV

    ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing

    Authors: Ying **, Pengyang Ling, Xiaoyi Dong, Pan Zhang, Jiaqi Wang, Dahua Lin

    Abstract: Instruction-based image editing focuses on equip** a generative model with the capacity to adhere to human-written instructions for editing images. Current approaches typically comprehend explicit and specific instructions. However, they often exhibit a deficiency in executing active reasoning capacities required to comprehend instructions that are implicit or insufficiently defined. To enhance… ▽ More

    Submitted 31 May, 2024; v1 submitted 18 May, 2024; originally announced May 2024.

  27. arXiv:2405.10565  [pdf, other

    cs.GR

    Real-time Level-of-Detail Strand-based Hair Rendering

    Authors: Tao Huang, Yang Zhou, Daqi Lin, Junqiu Zhu, Ling-Qi Yan, Kui Wu

    Abstract: Strand-based hair rendering has become increasingly popular in production for its realistic appearance. However, the prevailing level-of-detail solution employing hair cards for distant hair models introduces a significant discontinuity in dynamics and appearance during the transition from strands to cards. We introduce an innovative real-time framework for strand-based hair rendering that ensures… ▽ More

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: 12 pages, 10 figures, 1 performance plot

    ACM Class: I.3.5; I.3.3

  28. arXiv:2405.10370  [pdf, other

    cs.CV

    Grounded 3D-LLM with Referent Tokens

    Authors: Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang

    Abstract: Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to ref… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: Preprint

  29. arXiv:2405.06219  [pdf, other

    cs.LG cs.CL

    SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models

    Authors: Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin

    Abstract: Large language models (LLMs) can now handle longer sequences of tokens, enabling complex tasks like book understanding and generating lengthy novels. However, the key-value (KV) cache required for LLMs consumes substantial memory as context length increasing, becoming the bottleneck for deployment. In this paper, we present a strategy called SKVQ, which stands for sliding-window KV cache quantizat… ▽ More

    Submitted 13 May, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

  30. arXiv:2404.19168  [pdf, other

    cs.CV

    PEVA-Net: Prompt-Enhanced View Aggregation Network for Zero/Few-Shot Multi-View 3D Shape Recognition

    Authors: Dongyun Lin, Yi Cheng, Shangbo Mao, Aiyuan Guo, Yiqun Li

    Abstract: Large vision-language models have impressively promote the performance of 2D visual recognition under zero/few-shot scenarios. In this paper, we focus on exploiting the large vision-language model, i.e., CLIP, to address zero/few-shot 3D shape recognition based on multi-view representations. The key challenge for both tasks is to generate a discriminative descriptor of the 3D shape represented by… ▽ More

    Submitted 29 April, 2024; originally announced April 2024.

  31. arXiv:2404.19048  [pdf, other

    cs.CL cs.AI

    A Framework for Real-time Safeguarding the Text Generation of Large Language Model

    Authors: Ximing Dong, Dayi Lin, Shaowei Wang, Ahmed E. Hassan

    Abstract: Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks but also pose ethical and societal risks due to their propensity to generate harmful content. To address this, various approaches have been developed to safeguard LLMs from producing unsafe content. However, existing methods have limitations, including the need for training specific control models and… ▽ More

    Submitted 1 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

  32. arXiv:2404.18359  [pdf, other

    cs.CL cs.AI

    FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

    Authors: Wei Li, Ren Ma, Jiang Wu, Chenya Gu, Jiahui Peng, **yang Len, Songyang Zhang, Hang Yan, Dahua Lin, Conghui He

    Abstract: In the burgeoning field of large language models (LLMs), the assessment of fundamental knowledge remains a critical challenge, particularly for models tailored to Chinese language and culture. This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. FoundaBench encompasses a diverse array of 3354 multiple-choi… ▽ More

    Submitted 28 April, 2024; originally announced April 2024.

  33. arXiv:2404.16829  [pdf, other

    cs.CV cs.AI cs.CL

    Make-it-Real: Unleashing Large Multimodal Model for Painting 3D Objects with Realistic Materials

    Authors: Ye Fang, Zeyi Sun, Tong Wu, Jiaqi Wang, Ziwei Liu, Gordon Wetzstein, Dahua Lin

    Abstract: Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs… ▽ More

    Submitted 23 May, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Project Page: https://sunzey.github.io/Make-it-Real/

  34. arXiv:2404.16821  [pdf, other

    cs.CV

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Authors: Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang **, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai , et al. (10 additional authors not shown)

    Abstract: In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model -- InternViT-6B, boosting its visual… ▽ More

    Submitted 29 April, 2024; v1 submitted 25 April, 2024; originally announced April 2024.

    Comments: Technical report

  35. arXiv:2404.14405  [pdf, other

    cs.RO

    Learning H-Infinity Locomotion Control

    Authors: Junfeng Long, Wenye Yu, Quanyi Li, Zirui Wang, Dahua Lin, Jiangmiao Pang

    Abstract: Stable locomotion in precipitous environments is an essential task for quadruped robots, requiring the ability to resist various external disturbances. Recent neural policies enhance robustness against disturbances by learning to resist external forces sampled from a fixed distribution in the simulated environment. However, the force generation process doesn't consider the robot's current state, m… ▽ More

    Submitted 12 June, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Project Page: https://junfeng-long.github.io/HINF/

  36. arXiv:2404.10552  [pdf, other

    cs.CL cs.AI

    Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning

    Authors: Xiao Wang, Tianze Chen, Xianjun Yang, Qi Zhang, Xun Zhao, Dahua Lin

    Abstract: The open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. This includes both base models, which are pre-trained on extensive datasets without alignment, and aligned models, deliberately designed to align with ethical standards and human values. Contrary to the prevalent assumption that the inherent instruction-following limitations… ▽ More

    Submitted 16 April, 2024; originally announced April 2024.

  37. arXiv:2404.10225  [pdf

    cs.SE cs.AI

    Rethinking Software Engineering in the Foundation Model Era: From Task-Driven AI Copilots to Goal-Driven AI Pair Programmers

    Authors: Ahmed E. Hassan, Gustavo A. Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, Jiang

    Abstract: The advent of Foundation Models (FMs) and AI-powered copilots has transformed the landscape of software development, offering unprecedented code completion capabilities and enhancing developer productivity. However, the current task-driven nature of these copilots falls short in addressing the broader goals and complexities inherent in software engineering (SE). In this paper, we propose a paradig… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

  38. arXiv:2404.06512  [pdf, other

    cs.CV cs.CL

    InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

    Authors: Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, **gwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

    Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow reso… ▽ More

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: Code and models are publicly available at https://github.com/InternLM/InternLM-XComposer

  39. arXiv:2404.06480  [pdf, other

    cs.CL cs.AI

    Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

    Authors: Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

    Abstract: Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct lo… ▽ More

    Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    Comments: NAACL 2024

  40. arXiv:2404.06247  [pdf, other

    cs.CV

    LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks

    Authors: Jianlang Chen, Xuhong Ren, Qing Guo, Felix Juefei-Xu, Di Lin, Wei Feng, Lei Ma, Jianjun Zhao

    Abstract: Visual object tracking plays a critical role in visual-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress made in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when the… ▽ More

    Submitted 9 April, 2024; originally announced April 2024.

  41. arXiv:2404.02015  [pdf, other

    cs.DC

    MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

    Authors: Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, Hao Zhang

    Abstract: Large language models (LLMs) have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to varying popularity of LLMs. In the paper, we present MuxServe, a flexible spatial-temporal multiplexing… ▽ More

    Submitted 12 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  42. arXiv:2404.00906  [pdf, other

    cs.CV

    From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

    Authors: Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He

    Abstract: Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-t… ▽ More

    Submitted 24 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  43. arXiv:2403.20330  [pdf, other

    cs.CV

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Authors: Lin Chen, **song Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

    Abstract: Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs. This phenomeno… ▽ More

    Submitted 9 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: Project page: https://mmstar-benchmark.github.io/

  44. arXiv:2403.17297  [pdf, other

    cs.CL cs.AI

    InternLM2 Technical Report

    Authors: Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang , et al. (75 additional authors not shown)

    Abstract: The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context m… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

  45. arXiv:2403.14097  [pdf, other

    cs.DC

    Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances

    Authors: Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, Zhihao Jia

    Abstract: Deep neural networks (DNNs) are becoming progressively large and costly to train. This paper aims to reduce DNN training costs by leveraging preemptible instances on modern clouds, which can be allocated at a much lower price when idle but may be preempted by the cloud provider at any time. Prior work that supports DNN training on preemptive instances employs a reactive approach to handling instan… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: NSDI '24

  46. arXiv:2403.13805  [pdf, other

    cs.CV cs.AI cs.LG

    RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

    Authors: Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

    Abstract: CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noise image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders the precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Project: https://github.com/Liuziyu77/RAR

  47. arXiv:2403.12881  [pdf, other

    cs.CL

    Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models

    Authors: Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, Feng Zhao

    Abstract: Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks, however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent re… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: Technical Report

  48. arXiv:2403.12013  [pdf, other

    cs.CV

    GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

    Authors: Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, ** Tan, Shaojie Shen, Dahua Lin, Xiaoxiao Long

    Abstract: We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, the progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, the prior works either are constrained to limited scenar… ▽ More

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://fuxiao0719.github.io/projects/geowizard/

  49. arXiv:2403.08604  [pdf, other

    cs.CL cs.SE

    DevBench: A Comprehensive Benchmark for Software Development

    Authors: Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, **yang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, ** Yang, Dahua Lin, Chao Peng, Kai Chen

    Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of programming, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. To this end, we propo… ▽ More

    Submitted 15 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: Our data and code are available at https://github.com/open-compass/DevBench

  50. arXiv:2403.07648  [pdf, other

    cs.DC cs.LG

    Characterization of Large Language Model Development in the Datacenter

    Authors: Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang

    Abstract: Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs, often riddled with numerous challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characteriz… ▽ More

    Submitted 3 April, 2024; v1 submitted 12 March, 2024; originally announced March 2024.