Showing 1–50 of 177 results for author: Yue, X

  1. arXiv:2406.18583  [pdf, other]

    cs.CV cs.LG

    Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

    Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lu… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  2. arXiv:2406.09795  [pdf, other]

    cs.LG math.NA

    DeltaPhi: Learning Physical Trajectory Residual for PDE Solving

    Authors: Xihang Yue, Linchao Zhu, Yi Yang

    Abstract: Although neural operator networks theoretically approximate any operator mapping, the limited generalization capability prevents them from learning correct physical dynamics when potential data biases exist, particularly in the practical PDE solving scenario where the available data amount is restricted or the resolution is extremely low. To address this issue, we propose and formulate the Physica… ▽ More

    Submitted 14 June, 2024; originally announced June 2024.

  3. arXiv:2406.09412  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Explore the Limits of Omni-modal Pretraining at Scale

    Authors: Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue

    Abstract: We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. In specific, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant… ▽ More

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Project Website: https://invictus717.github.io/MiCo/

  4. arXiv:2406.07645  [pdf, other]

    cs.CV cs.MM

    SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information

    Authors: Feng Wang, Haihang Ruan, Zhihuang Xie, Ronggang Wang, Xiangyu Yue

    Abstract: Recently, Neural Video Compression (NVC) techniques have achieved remarkable performance, even surpassing the best traditional lossy video codec. However, most existing NVC methods heavily rely on transmitting Motion Vector (MV) to generate accurate contextual features, which has the following drawbacks. (1) Compressing and transmitting MV requires specialized MV encoder and decoder, which makes m… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by DCC 2024 as Poster. This is the full paper

  5. arXiv:2406.06565  [pdf, other]

    cs.CL cs.AI cs.LG

    MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

    Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You

    Abstract: Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and s… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

  6. arXiv:2406.03092  [pdf, other]

    cs.CL

    FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models

    Authors: Xihang Yue, Linchao Zhu, Yi Yang

    Abstract: To process contexts with unlimited length using Large Language Models (LLMs), recent studies explore hierarchically managing the long text. Only several text fragments are taken from the external memory and passed into the temporary working memory, i.e., LLM's context window. However, existing approaches isolatedly handle the text fragments without considering their structural connections, thereby… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

  7. arXiv:2406.02582  [pdf, other]

    cs.LG cs.AI physics.ao-ph

    Spatiotemporal Predictions of Toxic Urban Plumes Using Deep Learning

    Authors: Yinan Wang, M. Giselle Fernández-Godino, Nipun Gunawardena, Donald D. Lucas, Xiaowei Yue

    Abstract: Industrial accidents, chemical spills, and structural fires can release large amounts of harmful materials that disperse into urban atmospheres and impact populated areas. Computer models are typically used to predict the transport of toxic plumes by solving fluid dynamical equations. However, these models can be computationally expensive due to the need for many grid cells to simulate turbulent f… ▽ More

    Submitted 30 May, 2024; originally announced June 2024.

    Comments: 13 pages, 10 figures

    MSC Class: 86-08 ACM Class: I.2.10

  8. arXiv:2406.01574  [pdf, other]

    cs.CL

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

    Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in… ▽ More

    Submitted 23 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  9. arXiv:2405.20421  [pdf, other]

    cs.AI

    Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

    Authors: Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang

    Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address thi… ▽ More

    Submitted 21 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  10. arXiv:2405.17461  [pdf, other]

    cs.LG cs.CV

    EMR-Merging: Tuning-Free High-Performance Model Merging

    Authors: Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, Wanli Ouyang

    Abstract: The success of pretrain-finetune paradigm brings about the release of numerous model weights. In this case, merging models finetuned on different tasks to enable a single model with multi-task capabilities is gaining increasing attention for its practicability. Existing model merging methods usually suffer from (1) significant performance degradation or (2) requiring tuning by additional data or t… ▽ More

    Submitted 23 May, 2024; originally announced May 2024.

  11. arXiv:2405.15071  [pdf, other]

    cs.CL

    Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

    Authors: Boshi Wang, Xiang Yue, Yu Su, Huan Sun

    Abstract: We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalizati… ▽ More

    Submitted 26 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 22 pages, 16 figures. Code and data: https://github.com/OSU-NLP-Group/GrokkedTransformer

  12. arXiv:2405.13581  [pdf, other]

    cs.CV cs.AI

    Safety Alignment for Vision Language Models

    Authors: Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

    Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to an LLMs can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the e… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 23 pages, 15 figures

  13. arXiv:2405.10514  [pdf, other]

    cs.IT eess.SP

    Secrecy Performance Analysis of Multi-Functional RIS-Assisted NOMA Networks

    Authors: Yingjie Pei, Wanli Ni, ** Xu, Xinwei Yue, Xiaofeng Tao, Dusit Niyato

    Abstract: Although reconfigurable intelligent surface (RIS) can improve the secrecy communication performance of wireless users, it still faces challenges such as limited coverage and double-fading effect. To address these issues, in this paper, we utilize a novel multi-functional RIS (MF-RIS) to enhance the secrecy performance of wireless users, and investigate the physical layer secrecy problem in non-ort… ▽ More

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 14 pages, 9 figures, submitted to IEEE transactions on wireless communication

  14. arXiv:2405.03939  [pdf, other]

    cs.CL

    Long Context Alignment with Short Instructions and Synthesized Positions

    Authors: Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li

    Abstract: Effectively handling instructions with extremely long context remains a challenge for Large Language Models (LLMs), typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs in the phase of alignment without the need for additional effor… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: preview

  15. arXiv:2405.03548  [pdf, other]

    cs.CL

    MAmmoTH2: Scaling Instructions from the Web

    Authors: Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen

    Abstract: Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involve… ▽ More

    Submitted 23 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

  16. arXiv:2404.10662  [pdf, other]

    cs.LG cs.AI

    Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

    Authors: Jinmei Liu, Wenbin Li, Xiangyu Yue, Shilin Zhang, Chunlin Chen, Zhi Wang

    Abstract: We study continual offline reinforcement learning, a practical paradigm that facilitates forward transfer and mitigates catastrophic forgetting to tackle sequential offline tasks. We propose a dual generative replay framework that retains previous knowledge by concurrent replay of generated pseudo-data. First, we decouple the continual learning policy into a diffusion-based generative behavior mod… ▽ More

    Submitted 18 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

  17. arXiv:2404.06393  [pdf, other]

    cs.SD cs.AI eess.AS

    MuPT: A Generative Symbolic Music Pretrained Transformer

    Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (4 additional authors not shown)

    Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the chal… ▽ More

    Submitted 10 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  18. arXiv:2404.05955  [pdf, other]

    cs.CL cs.AI

    VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

    Authors: Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

    Abstract: Multimodal Large Language models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained a… ▽ More

    Submitted 8 April, 2024; originally announced April 2024.

  19. arXiv:2404.03543  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

    Authors: Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

    Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench empha… ▽ More

    Submitted 6 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  20. arXiv:2404.02060  [pdf, other]

    cs.CL cs.AI

    Long-context LLMs Struggle with Long In-context Learning

    Authors: Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

    Abstract: Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark… ▽ More

    Submitted 11 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  21. Secrecy Performance Analysis of RIS Assisted Ambient Backscatter Communication Networks

    Authors: Yingjie Pei, Xinwei Yue, Chongwen Huang, Zhi** Lu

    Abstract: Reconfigurable intelligent surface (RIS) and ambient backscatter communication (AmBC) have been envisioned as two promising technologies due to their high transmission reliability as well as energy-efficiency. This paper investigates the secrecy performance of RIS assisted AmBC networks. New closed-form and asymptotic expressions of secrecy outage probability for RIS-AmBC networks are derived by t… ▽ More

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for publication in IEEE Transactions on Green Communications and Networking

  22. Secure Communication of Active RIS Assisted NOMA Networks

    Authors: Xuehua Li, Yingjie Pei, Xinwei Yue, Yuanwei Liu, Zhiguo Ding

    Abstract: As a revolutionary technology, reconfigurable intelligent surface (RIS) has been deemed as an indispensable part of the 6th generation communications due to its inherent ability to regulate the wireless channels. However, passive RIS (PRIS) still suffers from some pressing issues, one of which is that the fading of the entire reflection link is proportional to the product of the distances from the… ▽ More

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for publication by IEEE Transactions on Wireless Communications

  23. Secrecy Outage Probability Analysis for Downlink RIS-NOMA Networks with On-Off Control

    Authors: Yingjie Pei, Xinwei Yue, Wenqiang Yi, Yuanwei Liu, Xuehua Li, Zhiguo Ding

    Abstract: Reconfigurable intelligent surface (RIS) has been regarded as a promising technology since it has ability to create the favorable channel conditions. This paper investigates the secure communications of RIS assisted non-orthogonal multiple access (NOMA) networks, where both external and internal eavesdropping scenarios are taken into consideration. More specifically, novel approximate and asymptot… ▽ More

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: This paper has been published in IEEE Transactions on Vehicular Technology

    Journal ref: vol. 72, no. 9, pp. 11772-11786, Sep. 2023

  24. arXiv:2403.10073  [pdf, other]

    cs.CV

    Revisiting Adversarial Training under Long-Tailed Distributions

    Authors: Xinli Yue, Ningping Mou, Qian Wang, Lingchen Zhao

    Abstract: Deep neural networks are vulnerable to adversarial attacks, often leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However, existing adversarial training techniques have predominantly been tested on balanced datasets, whereas real-world data often exhibit a long-tailed distribution, casting doubt on the efficacy of… ▽ More

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  25. arXiv:2403.02502  [pdf, other]

    cs.CL cs.AI cs.LG

    Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

    Authors: Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

    Abstract: Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn… ▽ More

    Submitted 4 March, 2024; originally announced March 2024.

  26. arXiv:2403.00669  [pdf, other]

    cs.LG

    Advancing Additive Manufacturing through Deep Learning: A Comprehensive Review of Current Progress and Future Challenges

    Authors: Amirul Islam Saimon, Emmanuel Yangue, Xiaowei Yue, Zhenyu James Kong, Chenang Liu

    Abstract: Additive manufacturing (AM) has already proved itself to be the potential alternative to widely-used subtractive manufacturing due to its extraordinary capacity of manufacturing highly customized products with minimum material wastage. Nevertheless, it is still not being considered as the primary choice for the industry due to some of its major inherent challenges, including complex and dynamic pr… ▽ More

    Submitted 1 March, 2024; originally announced March 2024.

  27. arXiv:2402.16671  [pdf, other]

    cs.CL

    StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

    Authors: Alex Zhuang, Ge Zhang, Tianyu Zheng, Xinrun Du, Junjie Wang, Weiming Ren, Stephen W. Huang, Jie Fu, Xiang Yue, Wenhu Chen

    Abstract: Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs' ability to process structured data, e.g., ChatGPT lags behind state-of-the-art (… ▽ More

    Submitted 24 April, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Technical Report

  28. arXiv:2402.15159  [pdf, other]

    cs.CL cs.AI cs.CR cs.LG

    Machine Unlearning of Pre-trained Large Language Models

    Authors: Jin Yao, Eli Chien, Minxin Du, Xinyao Niu, Tianhao Wang, Zezhou Cheng, Xiang Yue

    Abstract: This study investigates the concept of the 'right to be forgotten' within the context of large language models (LLMs). We explore machine unlearning as a pivotal solution, with a focus on pre-trained models--a notably under-researched area. Our research delineates a comprehensive framework for machine unlearning in pre-trained LLMs, encompassing a critical analysis of seven diverse unlearning meth… ▽ More

    Submitted 30 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: ACL 2024 main. Code and data at https://github.com/yao**17/Unlearning_LLM

  29. arXiv:2402.15089  [pdf, other]

    cs.CL cs.AI cs.LG

    AttributionBench: How Hard is Automatic Attribution Evaluation?

    Authors: Yifei Li, Xiang Yue, Zeyi Liao, Huan Sun

    Abstract: Modern generative search engines enhance the reliability of large language model (LLM) responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for a… ▽ More

    Submitted 22 February, 2024; originally announced February 2024.

  30. arXiv:2402.14658  [pdf, other]

    cs.SE cs.AI cs.CL

    OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

    Authors: Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue

    Abstract: The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address this, we introduce OpenCodeInterpreter, a family of open-source code systems designed for generating, executing, and iteratively refining code. Supported by Co… ▽ More

    Submitted 27 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

  31. arXiv:2402.10171  [pdf, other]

    cs.CL cs.AI

    Data Engineering for Scaling Language Models to 128K Context

    Authors: Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, Hao Peng

    Abstract: We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contex… ▽ More

    Submitted 15 February, 2024; originally announced February 2024.

    Comments: Code at https://github.com/FranxYao/Long-Context-Data-Engineering

  32. arXiv:2402.08978  [pdf, other]

    cs.HC cs.CE cs.LG

    Prismatic: Interactive Multi-View Cluster Analysis of Concept Stocks

    Authors: Wong Kam-Kwai, Yan Luo, Xuanwu Yue, Wei Chen, Huamin Qu

    Abstract: Financial cluster analysis allows investors to discover investment alternatives and avoid undertaking excessive risks. However, this analytical task faces substantial challenges arising from many pairwise comparisons, the dynamic correlations across time spans, and the ambiguity in deriving implications from business relational knowledge. We propose Prismatic, a visual analytics system that integr… ▽ More

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: 14 pages. A preprint version submitted to IEEE Transactions on Visualization and Computer Graphics (TVCG), 2024

  33. arXiv:2402.07595  [pdf, other]

    eess.IV cs.LG

    Comparative Analysis of ImageNet Pre-Trained Deep Learning Models and DINOv2 in Medical Imaging Classification

    Authors: Yuning Huang, Jingchen Zou, Lanxi Meng, Xin Yue, Qing Zhao, Jianqiang Li, Changwei Song, Gabriel Jimenez, Shaowu Li, Guanghui Fu

    Abstract: Medical image analysis frequently encounters data scarcity challenges. Transfer learning has been effective in addressing this issue while conserving computational resources. The recent advent of foundational models like the DINOv2, which uses the vision transformer architecture, has opened new opportunities in the field and gathered significant interest. However, DINOv2's performance on clinical… ▽ More

    Submitted 13 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

  34. arXiv:2402.06852  [pdf]

    cs.AI cs.CL

    ChemLLM: A Chemical Large Language Model

    Authors: Di Zhang, Wei Liu, Qian Tan, Jingdan Chen, Hang Yan, Yuliang Yan, Jiatong Li, Weiran Huang, Xiangyu Yue, Wanli Ouyang, Dongzhan Zhou, Shufei Zhang, Mao Su, Han-Sen Zhong, Yuqiang Li

    Abstract: Large language models (LLMs) have made impressive progress in chemistry applications. However, the community lacks an LLM specifically designed for chemistry. The main challenges are two-fold: firstly, most chemical data and scientific knowledge are stored in structured databases, which limits the model's ability to sustain coherent dialogue when used directly. Secondly, there is an absence of obj… ▽ More

    Submitted 25 April, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: 9 pages, 5 figures

  35. arXiv:2402.03040  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions

    Authors: Yiyuan Zhang, Yuhao Kang, Zhixin Zhang, Xiaohan Ding, Sanyuan Zhao, Xiangyu Yue

    Abstract: We introduce InteractiveVideo, a user-centric framework for video generation. Different from traditional generative approaches that operate based on user-provided images or text, our framework is designed for dynamic interaction, allowing users to instruct the generative model through various intuitive mechanisms during the whole generation process, e.g. text and image prompts, painting… ▽ More

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: Code, models, and demo are available at https://github.com/invictus717/InteractiveVideo

  36. arXiv:2401.14405  [pdf, other]

    cs.CV cs.AI cs.LG

    Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

    Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

    Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalit… ▽ More

    Submitted 18 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: CVPR 2024. Code and models are available at https://github.com/AILab-CVC/M2PT

  37. arXiv:2401.12264  [pdf, other]

    eess.AS cs.MM cs.SD eess.IV

    CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

    Authors: Xianghu Yue, Xiaohai Tian, Lu Lu, Malu Zhang, Zhizheng Wu, Haizhou Li

    Abstract: There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing and reading process of human beings. Humans tend to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems… ▽ More

    Submitted 21 February, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

  38. A Unified NOMA Framework in Beam-Hopping Satellite Communication Systems

    Authors: Xuyang Zhang, Xinwei Yue, Tian Li, Zhihao Han, Yafei Wang, Yong Ding, Rongke Liu

    Abstract: This paper investigates the application of a unified non-orthogonal multiple access framework in beam hopping (U-NOMA-BH) based satellite communication systems. More specifically, the proposed U-NOMA-BH framework can be applied to code-domain NOMA based BH (CD-NOMA-BH) and power-domain NOMA based BH (PD-NOMA-BH) systems. To satisfy dynamic-uneven traffic demands, we formulate the optimization prob… ▽ More

    Submitted 16 January, 2024; originally announced January 2024.

    Journal ref: IEEE Transactions on Aerospace and Electronic Systems, vol. 59, no. 5, pp. 5390-5404, Oct. 2023

  39. arXiv:2312.14867  [pdf, other]

    cs.CV cs.AI cs.CL cs.MM

    VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation

    Authors: Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen

    Abstract: In the rapidly advancing field of conditional image generation research, challenges such as limited explainability lie in effectively evaluating the performance and capabilities of various models. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks. VIEScore leverages general knowledge from Multimodal Large Language M… ▽ More

    Submitted 3 June, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted to ACL2024 main

  40. arXiv:2312.10652  [pdf, ps, other]

    cs.CL cs.SI

    Explorers at #SMM4H 2023: Enhancing BERT for Health Applications through Knowledge and Model Fusion

    Authors: Xutong Yue, Xilai Wang, Yuxin He, Zhenkun Zhou

    Abstract: An increasing number of individuals are willing to post states and opinions in social media, which has become a valuable data resource for studying human health. Furthermore, social media has been a crucial research point for healthcare now. This paper outlines the methods in our participation in the #SMM4H 2023 Shared Tasks, including data preprocessing, continual pre-training and fine-tuned opti… ▽ More

    Submitted 17 December, 2023; originally announced December 2023.

  41. arXiv:2312.03700  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    OneLLM: One Framework to Align All Modalities with Language

    Authors: Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

    Abstract: Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this… ▽ More

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Code: https://github.com/csuhan/OneLLM

  42. arXiv:2312.03341  [pdf, other]

    cs.CV cs.AI

    Online Vectorized HD Map Construction using Geometry

    Authors: Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, Fusheng Jin, Xiangyu Yue

    Abstract: The construction of online vectorized High-Definition (HD) maps is critical for downstream prediction and planning. Recent efforts have built strong baselines for this task, however, shapes and relations of instances in urban road systems are still under-explored, such as parallelism, perpendicular, or rectangle-shape. In our work, we propose GeMap ($\textbf{Ge}$ometry $\textbf{Map}$), which end-t… ▽ More

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Project website https://invictus717.github.io/GeMap/

  43. arXiv:2311.16502  [pdf, other]

    cs.CL cs.AI cs.CV

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Authors: Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

    Abstract: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and… ▽ More

    Submitted 13 June, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: CVPR 2024 Oral

  44. arXiv:2311.15599  [pdf, other]

    cs.CV cs.AI cs.LG

    UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

    Authors: Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

    Abstract: Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention, but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers, while the architectural design for large-kernel ConvNets remains under-addressed. 2) As tr… ▽ More

    Submitted 18 March, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: CVPR 2024. Code, all the models, reproducible training scripts at https://github.com/AILab-CVC/UniRepLKNet

  45. arXiv:2311.14295  [pdf, ps, other]

    cs.IT eess.SP

    Exploiting Active RIS in NOMA Networks with Hardware Impairments

    Authors: Xinwei Yue, Meiqi Song, Chongjun Ouyang, Yuanwei Liu, Tian Li, Tianwei Hou

    Abstract: Active reconfigurable intelligent surface (ARIS) is a promising way to compensate for multiplicative fading attenuation by amplifying and reflecting event signals to selected users. This paper investigates the performance of ARIS assisted non-orthogonal multiple access (NOMA) networks over cascaded Nakagami-m fading channels. The effects of hardware impairments (HIS) and reflection coefficients on… ▽ More

    Submitted 12 January, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

  46. arXiv:2311.11176  [pdf, other]

    cs.CV cs.AI cs.LG

    Morphology-Enhanced CAM-Guided SAM for weakly supervised Breast Lesion Segmentation

    Authors: Xin Yue, Xiaoling Liu, Qing Zhao, Jianqiang Li, Changwei Song, Suqin Liu, Zhikai Yang, Guanghui Fu

    Abstract: Ultrasound imaging plays a critical role in the early detection of breast cancer. Accurate identification and segmentation of lesions are essential steps in clinical practice, requiring methods to assist physicians in lesion segmentation. However, ultrasound lesion segmentation models based on supervised learning require extensive manual labeling, which is both time-consuming and labor-intensive.… ▽ More

    Submitted 22 May, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

  47. arXiv:2311.09206  [pdf, other]

    cs.CL cs.AI cs.DB

    TableLlama: Towards Open Large Generalist Models for Tables

    Authors: Tianshu Zhang, Xiang Yue, Yifei Li, Huan Sun

    Abstract: Semi-structured tables are ubiquitous. There has been a variety of tasks that aim to automatically interpret, augment, and query tables. Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or have simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language… ▽ More

    Submitted 4 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 long paper

  48. arXiv:2311.00276  [pdf]

    cs.RO

    LiDAR-based SLAM for robotic mapping: state of the art and new frontiers

    Authors: Xiangdi Yue, Yihuan Zhang, Miaolei He

    Abstract: In recent decades, the field of robotic mapping has witnessed widespread research and development in LiDAR (Light Detection And Ranging)-based simultaneous localization and mapping (SLAM) techniques. In this paper, we review the state-of-the-art in LiDAR-based SLAM and explore the remaining challenges that still require attention to satisfy the needs of contemporary applications. A distinctive asp… ▽ More

    Submitted 31 October, 2023; originally announced November 2023.

  49. arXiv:2310.10008  [pdf, other]

    cs.CV cs.AI cs.LG

    Towards Unified and Effective Domain Generalization

    Authors: Yiyuan Zhang, Kaixiong Gong, Xiaohan Ding, Kaipeng Zhang, Fangrui Lv, Kurt Keutzer, Xiangyu Yue

    Abstract: We propose $\textbf{UniDG}$, a novel and $\textbf{Uni}$fied framework for $\textbf{D}$omain $\textbf{G}$eneralization that is capable of significantly enhancing the out-of-distribution generalization performance of foundation models regardless of their architectures. The core idea of UniDG is to finetune models during the inference stage, which saves the cost of iterative training. Specifically, w… ▽ More

    Submitted 15 October, 2023; originally announced October 2023.

    Comments: Project Website: https://invictus717.github.io/Generalization/

  50. arXiv:2310.05107  [pdf, other]

    cs.CV

    OV-PARTS: Towards Open-Vocabulary Part Segmentation

    Authors: Meng Wei, Xiaoyu Yue, Wenwei Zhang, Shu Kong, Xihui Liu, Jiangmiao Pang

    Abstract: Segmenting and recognizing diverse object parts is a crucial ability in applications spanning various computer vision and robotic tasks. While significant progress has been made in object-level Open-Vocabulary Semantic Segmentation (OVSS), i.e., segmenting objects with arbitrary text, the corresponding part-level research poses additional challenges. Firstly, part segmentation inherently involves… ▽ More

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Accepted by NeurIPS Dataset and Benchmark Track 2023