
Showing 1–50 of 297 results for author: Luo, P

Searching in archive cs.
  1. arXiv:2406.11802  [pdf, other]

    cs.CV

    PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

    Authors: Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Text-to-image (T2I) models have made substantial progress in generating images from textual prompts. However, they frequently fail to produce images consistent with physical commonsense, a vital capability for applications in world simulation and everyday tasks. Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal know…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.09953  [pdf, other]

    cs.RO cs.AI

    DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

    Authors: Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Lingyue Guo, Ping Luo, Yanfeng Lu

    Abstract: Dual-arm robots offer enhanced versatility and efficiency over single-arm counterparts by enabling concurrent manipulation of multiple objects or cooperative execution of tasks using both arms. However, effectively coordinating the two arms for complex long-horizon tasks remains a significant challenge. Existing task planning methods predominantly focus on single-arm robots or rely on predefined b…

    Submitted 30 June, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 46 pages, 13 figures

  3. arXiv:2406.08845  [pdf, other]

    cs.CV

    Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

    Authors: Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang

    Abstract: Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. Primarily, due to the limitations inherent in automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. H…

    Submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2406.08451  [pdf, other]

    cs.CV

    GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

    Authors: Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms. Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents often trained with datasets compri…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 16 pages, 8 figures, a cross-app GUI navigation dataset

  5. arXiv:2406.08394  [pdf, other]

    cs.CV

    VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

    Authors: Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai

    Abstract: We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such a…

    Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: 43 pages

  6. arXiv:2406.07230  [pdf, other]

    cs.CV cs.AI

    Needle In A Multimodal Haystack

    Authors: Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

    Abstract: With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, as a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capab…

    Submitted 11 June, 2024; originally announced June 2024.

  7. arXiv:2406.06525  [pdf, other]

    cs.CV

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Authors: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

    Abstract: We introduce LlamaGen, a new family of image generation models that apply original "next-token prediction" paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spa…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Codes and models: https://github.com/FoundationVision/LlamaGen

  8. arXiv:2406.04113  [pdf, other]

    cs.CL

    Uncovering Limitations of Large Language Models in Information Seeking from Tables

    Authors: Chaoxu Pang, Yixuan Cao, Chunhao Yang, Ping Luo

    Abstract: Tables are recognized for their high information density and widespread usage, serving as essential sources of information. Seeking information from tables (TIS) is a crucial capability for Large Language Models (LLMs), serving as the foundation of knowledge-based Q&A systems. However, this field presently suffers from an absence of thorough and reliable evaluation. This paper introduces a more re…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Findings of ACL 2024

  9. arXiv:2406.00439  [pdf, other]

    cs.RO cs.CV

    Learning Manipulation by Predicting Interaction

    Authors: Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

    Abstract: Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Accepted to RSS 2024. Project page: https://github.com/OpenDriveLab/MPI

  10. arXiv:2405.17201  [pdf, other]

    cs.CV

    Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

    Authors: Jin Wang, Shichao Dong, Yapeng Zhu, Kelu Yao, Weidong Zhao, Chao Li, Ping Luo

    Abstract: Compositional reasoning capabilities are usually considered as fundamental skills to characterize human perception. Recent studies show that current Vision Language Models (VLMs) surprisingly lack sufficient knowledge with respect to such capabilities. To this end, we propose to thoroughly diagnose the composition representations encoded by VLMs, systematically revealing the potential cause for th…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 21 pages, 8 figures

  11. arXiv:2405.16888  [pdf, other]

    cs.GR cs.CV

    Part123: Part-aware 3D Reconstruction from a Single-view Image

    Authors: Anran Liu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Zhiyang Dou, Hao-Xiang Guo, Ping Luo, Wenping Wang

    Abstract: Recently, the emergence of diffusion models has opened up new opportunities for single-view reconstruction. However, all the existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure, which is crucial for many downstream applications, of the reconstructed shape. Moreover, the generated meshes usually suffer from lar…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted to SIGGRAPH 2024 (conference track), webpage: https://liuar0512.github.io/part123_official_page/

  12. arXiv:2405.14918  [pdf, other]

    cs.LG cs.ET

    AnalogCoder: Analog Circuit Design via Training-Free Code Generation

    Authors: Yao Lai, Sungyoung Lee, Guojin Chen, Souradip Poddar, Mengkang Hu, David Z. Pan, Ping Luo

    Abstract: Analog circuit design is a significant task in modern chip technology, focusing on the selection of component types, connectivity, and parameters to ensure proper circuit functionality. Despite advances made by Large Language Models (LLMs) in digital circuit design, the complexity and scarcity of data in analog circuitry pose significant challenges. To mitigate these issues, we introduce AnalogCod…

    Submitted 30 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

  13. arXiv:2405.14554  [pdf, other]

    cs.CV cs.AI

    UDKAG: Augmenting Large Vision-Language Models with Up-to-Date Knowledge

    Authors: Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

    Abstract: Large vision-language models (LVLMs) are ignorant of the up-to-date knowledge, such as LLaVA series, because they cannot be updated frequently due to the large amount of resources required, and therefore fail in many cases. For example, if a LVLM was released on January 2024, and it wouldn't know the detailed plot of the new movie Dune 2, which wasn't released until February 2024. To solve the pro…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 12 pages, 6 figures, a framework to augment large vision-language models with up-to-date knowledge

  14. arXiv:2405.13726  [pdf, other]

    cs.LG

    Score-based Generative Models with Adaptive Momentum

    Authors: Ziqing Wen, Xiaoge Deng, Ping Luo, Tao Sun, Dongsheng Li

    Abstract: Score-based generative models have demonstrated significant practical success in data-generating tasks. The models establish a diffusion process that perturbs the ground truth data to Gaussian noise and then learn the reverse process to transform noise into data. However, existing denoising methods such as Langevin dynamic and numerical stochastic differential equation solvers enjoy randomness but…

    Submitted 22 May, 2024; originally announced May 2024.

  15. arXiv:2405.08099  [pdf, other]

    cs.CL

    KET-QA: A Dataset for Knowledge Enhanced Table Question Answering

    Authors: Mengkang Hu, Haoyu Dong, Ping Luo, Shi Han, Dongmei Zhang

    Abstract: Due to the concise and structured nature of tables, the knowledge contained therein may be incomplete or missing, posing a significant challenge for table question answering (TableQA) and data analysis systems. Most existing datasets either fail to address the issue of external knowledge in TableQA or only utilize unstructured text as supplementary information for tables. In this paper, we propose…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: LREC-Coling 2024

  16. arXiv:2405.07990  [pdf, other]

    cs.CL cs.CV

    Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

    Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

    Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figure to executable code, have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLM…

    Submitted 13 May, 2024; originally announced May 2024.

  17. arXiv:2405.06758  [pdf, other]

    cs.LG

    Scalable and Effective Arithmetic Tree Generation for Adder and Multiplier Designs

    Authors: Yao Lai, Jinxin Liu, David Z. Pan, Ping Luo

    Abstract: Across a wide range of hardware scenarios, the computational efficiency and physical size of the arithmetic units significantly influence the speed and footprint of the overall hardware system. Nevertheless, the effectiveness of prior arithmetic design techniques proves inadequate, as it does not sufficiently optimize speed and area, resulting in a reduced processing rate and larger module size. T…

    Submitted 10 May, 2024; originally announced May 2024.

  18. arXiv:2404.19401  [pdf, other]

    cs.CV

    UniFS: Universal Few-shot Instance Perception with Point Representations

    Authors: Sheng Jin, Ruijie Yao, Lumin Xu, Wentao Liu, Chen Qian, Ji Wu, Ping Luo

    Abstract: Instance perception tasks (object detection, instance segmentation, pose estimation, counting) play a key role in industrial applications of visual models. As supervised learning methods suffer from high labeling cost, few-shot learning methods which effectively learn from a limited number of labeled examples are desired. Existing few-shot learning methods primarily focus on a restricted set of ta…

    Submitted 30 April, 2024; originally announced April 2024.

  19. arXiv:2404.16006  [pdf, other]

    cs.CV

    MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

    Authors: Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

    Abstract: Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 77 pages, 41 figures

  20. arXiv:2404.06773  [pdf, other]

    cs.CV

    Adapting LLaMA Decoder to Vision Transformer

    Authors: Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

    Abstract: This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in the failure to the net…

    Submitted 27 May, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

    Comments: 23 pages, 11 figures

  21. arXiv:2404.01342  [pdf, other]

    cs.CL cs.AI

    DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

    Authors: Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

    Abstract: Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example, the Civitai community, a platform for T2I innovation, currently hosts an impressive array of 74,492 distinct models. However, this diversity presents a formidable challenge in selecting the most appropriate model and parameters, a process tha…

    Submitted 31 March, 2024; originally announced April 2024.

    Comments: Published as a conference paper at CVPR 2024

  22. arXiv:2404.00717  [pdf, other]

    cs.RO cs.CV cs.MA

    End-to-End Autonomous Driving through V2X Cooperation

    Authors: Haibao Yu, Wenxian Yang, Jiaru Zhong, Zhenwei Yang, Siqi Fan, Ping Luo, Zaiqing Nie

    Abstract: Cooperatively utilizing both ego-vehicle and infrastructure sensor data via V2X communication has emerged as a promising approach for advanced autonomous driving. However, current research mainly focuses on improving individual modules, rather than taking end-to-end learning to optimize final planning performance, resulting in underutilized data potential. In this paper, we introduce UniV2X, a pio…

    Submitted 19 April, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

  23. arXiv:2403.20194  [pdf, other]

    cs.MM

    ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

    Authors: Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang

    Abstract: This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on…

    Submitted 25 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

  24. arXiv:2403.19591  [pdf, other]

    cs.LG cs.AR cs.NE

    Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers

    Authors: Pingcheng Dong, Yonghao Tan, Dong Zhang, Tianwei Ni, Xuejiao Liu, Yu Liu, Peng Luo, Luhong Liang, Shih-Yang Liu, Xijie Huang, Huaiyu Zhu, Yun Pan, Fengwei An, Kwang-Ting Cheng

    Abstract: Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require unfriendly high-precision arithmetics such as FP/INT 32 and lack consideration of…

    Submitted 29 March, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

    Comments: 61st ACM/IEEE Design Automation Conference (DAC) 2024

  25. arXiv:2403.17146  [pdf, other]

    cs.CL

    Outcome-Constrained Large Language Models for Countering Hate Speech

    Authors: Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song

    Abstract: Counterspeech that challenges or responds to hate speech has been seen as an alternative to mitigate the negative impact of hate speech and foster productive online communications. Research endeavors have been directed to using language models for the automatic generation of counterspeech to assist efforts in combating online hate. Existing research focuses on the generation of counterspeech with…

    Submitted 25 March, 2024; originally announced March 2024.

  26. arXiv:2403.17008  [pdf, other]

    cs.CV

    FlashFace: Human Image Personalization with High-fidelity Identity Preservation

    Authors: Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

    Abstract: This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Our approach is distinguishable from existing human photo customization methods by higher-fidelity identity preservation and better instruction following, benefiting from two subtle designs. First, we encode the face i…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Project Page: https://jshilong.github.io/flashface-page

  27. arXiv:2403.16996  [pdf, other]

    cs.CV cs.RO

    DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving

    Authors: Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, Ping Luo

    Abstract: End-to-end driving has made significant progress in recent years, demonstrating benefits such as system simplicity and competitive driving performance under both open-loop and closed-loop settings. Nevertheless, the lack of interpretability and controllability in its driving decisions hinders real-world deployment for end-to-end driving systems. In this paper, we collect a comprehensive end-to-end…

    Submitted 25 March, 2024; originally announced March 2024.

  28. arXiv:2403.16557  [pdf, ps, other]

    cs.LG cs.DC

    Accelerating Federated Learning by Selecting Beneficial Herd of Local Gradients

    Authors: Ping Luo, Xiaoge Deng, Ziqing Wen, Tao Sun, Dongsheng Li

    Abstract: Federated Learning (FL) is a distributed machine learning framework in communication network systems. However, the systems' Non-Independent and Identically Distributed (Non-IID) data negatively affect the convergence efficiency of the global model, since only a subset of these data samples are beneficial for model convergence. In pursuit of this subset, a reliable approach involves determining a m…

    Submitted 25 March, 2024; originally announced March 2024.

  29. arXiv:2403.10856  [pdf, other]

    cs.CL cs.CR

    Zero-shot Generative Linguistic Steganography

    Authors: Ke Lin, Yiyang Luo, Zijian Zhang, Ping Luo

    Abstract: Generative linguistic steganography attempts to hide secret messages into covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext, however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual…

    Submitted 16 March, 2024; originally announced March 2024.

    Comments: 15 pages, 6 figures. Accepted at NAACL 2024

  30. arXiv:2403.09630  [pdf, other]

    cs.CV

    Generalized Predictive Model for Autonomous Driving

    Authors: Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li

    Abstract: In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model, we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos, spanning ar…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  31. arXiv:2403.09346  [pdf, other]

    cs.CV cs.AI

    AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

    Authors: Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

    Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in well responding to visual-instructions from users. However, these instructions, encompassing images and text, are susceptible to both intentional and inadvertent attacks. Despite the critical importance of LVLMs' robustness against such threats, current research in this area remains limited. To bridge this gap, we introduce AV…

    Submitted 14 March, 2024; originally announced March 2024.

  32. arXiv:2403.06745  [pdf, other]

    cs.CL cs.AI

    ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine Translation

    Authors: Shaojie Dai, Xin Liu, Ping Luo, Yue Yu

    Abstract: Large language model (LLM) has achieved promising performance in multilingual machine translation tasks through zero/few-shot prompts or prompt-tuning. However, due to the mixture of multilingual data during the pre-training of LLM, the LLM-based translation models face the off-target issue in both prompt-based methods, including a series of phenomena, namely instruction misunderstanding, translat…

    Submitted 11 March, 2024; originally announced March 2024.

  33. arXiv:2403.04692  [pdf, other]

    cs.CV

    PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

    Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

    Abstract: In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α,…

    Submitted 17 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Project Page: https://pixart-alpha.github.io/PixArt-sigma-project/

  34. arXiv:2403.02330  [pdf, other]

    cs.CV

    RegionGPT: Towards Region Understanding Vision Language Model

    Authors: Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu

    Abstract: Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder, and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (short…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  35. arXiv:2403.02118  [pdf, other]

    cs.CY cs.AI cs.CV

    Position: Towards Implicit Prompt For Text-To-Image Models

    Authors: Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

    Abstract: Recent text-to-image (T2I) models have had great success, and many benchmarks have been proposed to evaluate their performance and safety. However, they only consider explicit prompts while neglecting implicit prompts (hint at a target without explicitly mentioning it). These prompts may get rid of safety constraints and pose potential threats to the applications of these models. This position pap…

    Submitted 28 May, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  36. arXiv:2402.16880  [pdf, other]

    cs.LG cs.AI cs.CL

    BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

    Authors: Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

    Abstract: Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, and etc. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer…

    Submitted 19 April, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

  37. arXiv:2402.16117  [pdf, other]

    cs.RO cs.AI cs.CV

    RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

    Authors: Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

    Abstract: Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various…

    Submitted 25 February, 2024; originally announced February 2024.

  38. arXiv:2402.15351  [pdf, other]

    cs.LG cs.CV

    AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

    Authors: Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

    Abstract: Automated machine learning (AutoML) is a collection of techniques designed to automate the machine learning development process. While traditional AutoML approaches have been successfully applied in several critical steps of model development (e.g. hyperparameter optimization), there lacks a AutoML system that automates the entire end-to-end model production workflow. To fill this blank, we presen…

    Submitted 23 February, 2024; originally announced February 2024.

  39. arXiv:2402.14623  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV

    RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

    Authors: Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

    Abstract: Rapid progress in high-level task planning and code generation for open-world robot manipulation has been witnessed in Embodied AI. However, previous studies put much effort into general common sense reasoning and task planning capabilities of large-scale language or multi-modal models, relatively little effort on ensuring the deployability of generated code on real robots, and other fundamental c…

    Submitted 22 February, 2024; originally announced February 2024.

    Comments: 10 pages of main paper, 4 pages of appendix; 10 figures in main paper, 3 figures in appendix

    ACM Class: I.2.7; I.2.8; I.2.9; I.2.10

  40. arXiv:2402.09181  [pdf, other]

    eess.IV cs.CV

    OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

    Authors: Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

    Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However, their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications. To solve this problem, in this pape…

    Submitted 21 April, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

  41. arXiv:2402.00222  [pdf, other]

    physics.soc-ph cs.LG

    Uncover the nature of overlapping community in cities

    Authors: Peng Luo, Di Zhu

    Abstract: Urban spaces, though often perceived as discrete communities, are shared by various functional and social groups. Our study introduces a graph-based physics-aware deep learning framework, illuminating the intricate overlapping nature inherent in urban communities. Through analysis of individual mobile phone positioning data at Twin Cities metro area (TCMA) in Minnesota, USA, our findings reveal th…

    Submitted 31 January, 2024; originally announced February 2024.

  42. arXiv:2401.05252  [pdf, other]

    cs.CV

    PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

    Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li

    Abstract: This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: Technical Report

  43. arXiv:2401.02415  [pdf, other]

    cs.CL

    LLaMA Pro: Progressive LLaMA with Block Expansion

    Authors: Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, Ping Luo

    Abstract: Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA. To this end, we propose a new post-pretraining method for LLMs with an expansion of Transformer blocks. We tune the expanded blocks using only new corpus, efficiently and effectively improving the model's knowledge without catastrophic forge…

    Submitted 30 May, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Accepted by ACL 2024, Main Conference

  44. arXiv:2401.02384  [pdf, other]

    cs.CV

    ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

    Authors: Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

    Abstract: Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel in comprehension, they struggle with generalization. To a…

    Submitted 15 February, 2024; v1 submitted 4 January, 2024; originally announced January 2024.

    Comments: Updated and corrected experimental results, removal of inappropriate experiments, and a more comprehensive experimental setup

  45. arXiv:2401.01571  [pdf, other]

    cs.SE cs.PL

    CodeFuse-Query: A Data-Centric Static Code Analysis System for Large-Scale Organizations

    Authors: Xiaoheng Xie, Gang Fan, Xiaojun Lin, Ang Zhou, Shijie Li, Xunjin Zheng, Yinan Liang, Yu Zhang, Na Yu, Haokun Li, Xinyu Chen, Yingzhuang Chen, Yi Zhen, Dejun Dong, Xianjin Fu, Jinzhou Su, Fuxiong Pan, Pengshuai Luo, Youzheng Feng, Ruoxiang Hu, Jing Fan, Jinguo Zhou, Xiao Xiao, Peng Di

    Abstract: In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design. CodeFuse-Query reimagines code analysis as a data compu…

    Submitted 3 January, 2024; originally announced January 2024.

  46. arXiv:2312.17432  [pdf, other]

    cs.CV cs.CL

    Video Understanding with Large Language Models: A Survey

    Authors: Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

    Abstract: With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of Large Language Models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of the recent advancements in video understanding harnessing the power of LLMs (Vid-…

    Submitted 3 January, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

  47. arXiv:2312.15715  [pdf, other]

    cs.CV

    UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

    Authors: Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo

    Abstract: The reference-based object segmentation tasks, namely referring image segmentation (RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by utilizing either language or annotated masks as references. Despite significant progress in each respective field, current methods are task-specifically desig…

    Submitted 25 December, 2023; originally announced December 2023.

    Comments: Extended version of ICCV2023 UniRef. 20 pages

  48. arXiv:2312.14238  [pdf, other]

    cs.CV

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Authors: Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

    Abstract: The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model…

    Submitted 15 January, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

    Comments: 25 pages, 5 figures, 28 tables

  49. arXiv:2312.14150  [pdf, other]

    cs.CV

    DriveLM: Driving with Graph Visual Question Answering

    Authors: Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li

    Abstract: We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems to boost generalization and enable interactivity with human users. While recent approaches adapt VLMs to driving via single-round visual question answering (VQA), human drivers reason about decisions in multiple steps. Starting from the localization of key objects, humans estimate…

    Submitted 21 December, 2023; originally announced December 2023.

  50. arXiv:2312.12742  [pdf, other]

    cs.CV

    Cached Transformers: Improving Transformers with Differentiable Memory Cache

    Authors: Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

    Abstract: This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens. GRC attention enables attending to both past and current tokens, increasing the receptive field of attention and allowing for exploring long-range dependencies. By utilizing a recurrent gating…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: AAAI 2024