Showing 1–50 of 549 results for author: Wu, B

Searching in archive cs.
  1. arXiv:2407.00468  [pdf, other]

    cs.CV cs.AI cs.CL

    MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

    Authors: Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

    Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial p…

    Submitted 29 June, 2024; originally announced July 2024.

    Comments: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/

  2. arXiv:2407.00102  [pdf, other]

    cs.LG cs.AI cs.CL

    Curriculum Learning with Quality-Driven Data Selection

    Authors: Biao Wu, Fang Meng, Ling Chen

    Abstract: The impressive multimodal capabilities demonstrated by OpenAI's GPT-4 have generated significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs with machine-generated instruction-following data has been shown to enhance zero-shot capabilities across various tasks. However, there has been limited exploration into controlling the quality of the…

    Submitted 27 June, 2024; originally announced July 2024.

  3. Multi-agent Cooperative Games Using Belief Map Assisted Training

    Authors: Qinwei Huang, Chen Luo, Alex B. Wu, Simon Khan, Hai Li, Qinru Qiu

    Abstract: In a multi-agent system, agents share their local observations to gain global situational awareness for decision making and collaboration using a message passing system. When to send a message, how to encode a message, and how to leverage the received messages directly affect the effectiveness of the collaboration among agents. When training a multi-agent cooperative game using reinforcement learn…

    Submitted 27 June, 2024; originally announced June 2024.

    Journal ref: ECAI 2023. IOS Press, 2023: 1617-1624

  4. arXiv:2406.19188  [pdf, other]

    cs.LG

    Averaging log-likelihoods in direct alignment

    Authors: Nathan Grinsztajn, Yannis Flet-Berliac, Mohammad Gheshlaghi Azar, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Olivier Pietquin, Matthieu Geist

    Abstract: To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involvin…

    Submitted 27 June, 2024; originally announced June 2024.

  5. arXiv:2406.18373  [pdf, other]

    cs.CL cs.SD eess.AS

    Dynamic Data Pruning for Automatic Speech Recognition

    Authors: Qiao Xiao, Pingchuan Ma, Adriana Fernandez-Lopez, Boqian Wu, Lu Yin, Stavros Petridis, Mykola Pechenizkiy, Maja Pantic, Decebal Constantin Mocanu, Shiwei Liu

    Abstract: The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  6. arXiv:2406.18187  [pdf, other]

    cs.CL cs.AI cs.LG

    Selective Prompting Tuning for Personalized Conversations with LLMs

    Authors: Qiushi Huang, Xubo Liu, Tom Ko, Bo Wu, Wenwu Wang, Yu Zhang, Lilian Tang

    Abstract: In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 findings

  7. arXiv:2406.17803  [pdf, other]

    cs.CL cs.AI cs.IR

    Understanding the Role of User Profile in the Personalization of Large Language Models

    Authors: Bin Wu, Zhengyan Shi, Hossein A. Rahmani, Varsha Ramineni, Emine Yilmaz

    Abstract: Utilizing user profiles to personalize Large Language Models (LLMs) has been shown to enhance the performance on a wide range of tasks. However, the precise role of user profiles and their effect mechanism on LLMs remains unclear. This study first confirms that the effectiveness of user profiles is primarily due to personalization information rather than semantic information. Furthermore, we inves…

    Submitted 22 June, 2024; originally announced June 2024.

  8. arXiv:2406.17519  [pdf, other]

    cs.CL

    Entropy-Based Decoding for Retrieval-Augmented Large Language Models

    Authors: Zexuan Qiu, Zijing Ou, Bin Wu, Jingjing Li, Aiwei Liu, Irwin King

    Abstract: Augmenting Large Language Models (LLMs) with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses. Despite their success, retrieval-augmented LLMs still face the distractibility issue, where the generated responses are negatively influenced by noise from both external and internal knowledge sources. In this paper, we introduce a novel, trainin…

    Submitted 25 June, 2024; originally announced June 2024.

  9. arXiv:2406.17419  [pdf, other]

    cs.CL cs.AI

    Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

    Authors: Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, Yongbin Li

    Abstract: Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-contex…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: We release our code and data publicly at https://github.com/MozerWang/Loong

  10. arXiv:2406.16866  [pdf, other]

    cs.CV

    Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

    Authors: Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S. -H. Gary Chan, Hongyang Zhang

    Abstract: Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg, capture LMMs' comprehensive capabilities. We begin with…

    Submitted 24 June, 2024; originally announced June 2024.

  11. arXiv:2406.16495  [pdf, other]

    cs.CL cs.AI

    OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser

    Authors: Jingze Shi, Ting Xie, Bingheng Wu, Chunjun Zheng, Kai Wang

    Abstract: Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the seq…

    Submitted 24 June, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

  12. arXiv:2406.16254  [pdf, other]

    cs.LG cs.AI cs.CL

    Confidence Regulation Neurons in Language Models

    Authors: Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda

    Abstract: Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized b…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: 25 pages, 14 figures

  13. arXiv:2406.15797  [pdf, other]

    cs.LG cs.AI

    Synergistic Deep Graph Clustering Network

    Authors: Benyu Wu, Shifei Ding, Xiao Xu, Lili Guo, Ling Ding, Xindong Wu

    Abstract: Employing graph neural networks (GNNs) to learn cohesive and discriminative node representations for clustering has shown promising results in deep graph clustering. However, existing methods disregard the reciprocal relationship between representation learning and structure augmentation. This study suggests that enhancing embedding and structure synergistically becomes imperative for GNNs to unle…

    Submitted 22 June, 2024; originally announced June 2024.

  14. arXiv:2406.13357  [pdf, other]

    cs.CL cs.SD eess.AS

    Transferable speech-to-text large language model alignment module

    Authors: Boyong Wu, Chao Yan, Haoran Pu

    Abstract: By leveraging the power of Large Language Models (LLMs) and speech foundation models, state-of-the-art speech-text bimodal works can achieve challenging tasks like spoken translation (ST) and question answering (SQA) altogether with much simpler architectures. In this paper, we utilize the capability of Whisper encoder and pre-trained Yi-6B. Empirical results reveal that modal alignment can be achiev…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by InterSpeech 2024; 5 pages, 2 figures

  15. DLP: towards active defense against backdoor attacks with decoupled learning process

    Authors: Zonghao Ying, Bin Wu

    Abstract: Deep learning models are well known to be susceptible to backdoor attacks, where the attacker only needs to provide a tampered dataset on which the triggers are injected. Models trained on the dataset will passively implant the backdoor, and triggers on the input can mislead the models during testing. Our study shows that the model shows different learning behaviors in clean and poisoned subsets du…

    Submitted 18 June, 2024; originally announced June 2024.

  16. arXiv:2406.11157  [pdf, other]

    cs.CR

    DeFiGuard: A Price Manipulation Detection Service in DeFi using Graph Neural Networks

    Authors: Dabao Wang, Bang Wu, Xingliang Yuan, Lei Wu, Yajin Zhou, Helei Cui

    Abstract: The prosperity of Decentralized Finance (DeFi) unveils underlying risks, with reported losses surpassing 3.2 billion USD between 2018 and 2022 due to vulnerabilities in Decentralized Applications (DApps). One significant threat is the Price Manipulation Attack (PMA) that alters asset prices during transaction execution. As a result, PMA accounts for over 50 million USD in losses. To address the ur…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: 13 pages, 7 figures

  17. NBA: defensive distillation for backdoor removal via neural behavior alignment

    Authors: Zonghao Ying, Bin Wu

    Abstract: Recently, deep neural networks have been shown to be vulnerable to backdoor attacks. A backdoor is inserted into neural networks via this attack paradigm, thus compromising the integrity of the network. As soon as an attacker presents a trigger during the testing phase, the backdoor in the model is activated, allowing the network to make specific wrong predictions. It is extremely important to def…

    Submitted 16 June, 2024; originally announced June 2024.

  18. arXiv:2406.05756  [pdf, other]

    cs.AI cs.CL cs.CV cs.MM

    EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

    Authors: Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei

    Abstract: The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving the gap between current LVLMs and qualified embodied intelligence unknown. Therefore, we construct EmbSpatial-Bench, a benchmark for evaluating embodied spatial…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL 2024 Main

  19. arXiv:2406.04264  [pdf, other]

    cs.CV cs.AI cs.CL

    MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

    Authors: Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Bo Zhang, Tiejun Huang, Zheng Liu

    Abstract: The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To addres…

    Submitted 19 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

  20. arXiv:2406.03215  [pdf, other]

    cs.CV

    Searching Priors Makes Text-to-Video Synthesis Better

    Authors: Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu

    Abstract: Significant advancements in video diffusion models have brought substantial progress to the field of text-to-video (T2V) synthesis. However, existing T2V synthesis models struggle to accurately generate complex motion dynamics, leading to a reduction in video realism. One possible solution is to collect massive data and train the model on it, but this would be extremely expensive. To alleviate this…

    Submitted 5 June, 2024; originally announced June 2024.

  21. arXiv:2406.01326  [pdf, other]

    cs.CV

    TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy

    Authors: Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shu Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Houqiang Li, Can Huang

    Abstract: Tables contain factual and quantitative data accompanied by various structures and contents that pose challenges for machine comprehension. Previous methods generally design task-specific architectures and objectives for individual tasks, resulting in modal isolation and intricate workflows. In this paper, we present a novel large vision-language model, TabPedia, equipped with a concept synergy me…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 20 pages, 8 figures

  22. arXiv:2406.00587  [pdf, other]

    cs.CV

    Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024

    Authors: Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

    Abstract: Pixel-level Scene Understanding is one of the fundamental problems in computer vision, which aims at recognizing object classes, masks and semantics of each pixel in the given image. Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction, because the real-world is actually video-based rather th…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: Champion Solution for CVPR 2024 PVUW VSS Track. arXiv admin note: text overlap with arXiv:2306.02894

  23. arXiv:2406.00500  [pdf, other]

    cs.CV

    2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation

    Authors: Biao Wu, Diankai Zhang, Si Gao, Chengjian Zheng, Shaoli Liu, Ning Wang

    Abstract: Video Panoptic Segmentation (VPS) is a challenging task that extends image panoptic segmentation. VPS aims to simultaneously classify, track, and segment all objects in a video, including both things and stuff. Due to its wide application in many downstream tasks such as video understanding, video editing, and autonomous driving. In order to deal with the task of video panoptic segmentation in…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 2nd Place Solution for CVPR 2024 PVUW VPS Track

  24. arXiv:2405.19092  [pdf, other]

    cs.CV

    Benchmarking and Improving Detail Image Caption

    Authors: Hongyuan Dong, Jiawen Li, Bohong Wu, Jiacong Wang, Yuan Zhang, Haoyuan Guo

    Abstract: Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, little large vision-language model (LVLM) research discusses models' image captioning performance because of the outdated short-caption benchmarks and unreliable evaluation metrics. In this work, we propose to benchmark detail image caption task by curating high-quality evaluation datasets annota…

    Submitted 31 May, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  25. arXiv:2405.17876  [pdf, other]

    cs.LG cs.DC math.OC

    Decentralized Directed Collaboration for Personalized Federated Learning

    Authors: Yingqi Liu, Yifan Shi, Qinglun Li, Baoyuan Wu, Xueqian Wang, Li Shen

    Abstract: Personalized Federated Learning (PFL) is proposed to find the greatest personalized models for each client. To avoid the central failure and communication bottleneck in the server-based FL, we concentrate on the Decentralized Personalized Federated Learning (DPFL) that performs distributed model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL are based on undirected and sy…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: CVPR 2024. arXiv admin note: text overlap with arXiv:2305.15157

  26. arXiv:2405.17871  [pdf, other]

    cs.CV cs.AI cs.CL

    Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

    Authors: Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

    Abstract: Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing the text tokens that are less correlated with or even contradictory with the input images. In this paper, we advocate for assigning distinct contributions…

    Submitted 28 May, 2024; originally announced May 2024.

  27. arXiv:2405.16134  [pdf, other]

    cs.CV

    Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack

    Authors: Mingli Zhu, Siyuan Liang, Baoyuan Wu

    Abstract: Deep neural networks face persistent challenges in defending against backdoor attacks, leading to an ongoing battle between attacks and defenses. While existing backdoor defense strategies have shown promising performance on reducing attack success rates, can we confidently claim that the backdoor threat has truly been eliminated from the model? To address it, we re-investigate the characteristics…

    Submitted 30 May, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  28. arXiv:2405.16112  [pdf, other]

    cs.CR cs.CV

    Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor

    Authors: Shaokui Wei, Hongyuan Zha, Baoyuan Wu

    Abstract: Data-poisoning backdoor attacks are serious security threats to machine learning models, where an adversary can manipulate the training dataset to inject backdoors into models. In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned. Unlike most existing methods that primarily detect and remove/unlearn suspicious samp…

    Submitted 25 May, 2024; originally announced May 2024.

    Comments: 13 pages, 5 figures and 5 tables

  29. arXiv:2405.14407  [pdf, other]

    cs.LG

    Gradient Transformation: Towards Efficient and Model-Agnostic Unlearning for Dynamic Graph Neural Networks

    Authors: He Zhang, Bang Wu, Xiangwen Yang, Xingliang Yuan, Chengqi Zhang, Shirui Pan

    Abstract: Graph unlearning has emerged as an essential tool for safeguarding user privacy and mitigating the negative impacts of undesirable data. Meanwhile, the advent of dynamic graph neural networks (DGNNs) marks a significant advancement due to their superior capability in learning from dynamic graphs, which encapsulate spatial-temporal variations in diverse real-world applications (e.g., traffic foreca…

    Submitted 23 May, 2024; originally announced May 2024.

  30. arXiv:2405.14394  [pdf, other]

    cs.CL cs.AI

    Instruction Tuning With Loss Over Instructions

    Authors: Zhengyan Shi, Adam X. Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, Aldo Lipani

    Abstract: Instruction tuning plays a crucial role in shaping the outputs of language models (LMs) to desired styles. In this work, we propose a simple yet effective method, Instruction Modelling (IM), which trains LMs by applying a loss function to the instruction and prompt part rather than solely to the output part. Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: Code is available at https://github.com/ZhengxiangShi/InstructionModelling

  31. arXiv:2405.11286  [pdf, other]

    cs.CV

    Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion

    Authors: Zeyu Zhang, Yiran Wang, Biao Wu, Shuo Chen, Zhiyuan Zhang, Shiya Huang, Wenbo Zhang, Meng Fang, Ling Chen, Yang Zhao

    Abstract: In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas like film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences, with integrating these two aspects proving to be a persistent challenge. A…

    Submitted 18 May, 2024; originally announced May 2024.

  32. arXiv:2405.10497  [pdf, other]

    cs.MM cs.AI cs.CV cs.SI

    SMP Challenge: An Overview and Analysis of Social Media Prediction Challenge

    Authors: Bo Wu, Peiye Liu, Wen-Huang Cheng, Bei Liu, Zhaoyang Zeng, Jia Wang, Qiushi Huang, Jiebo Luo

    Abstract: Social Media Popularity Prediction (SMPP) is a crucial task that involves automatically predicting future popularity values of online posts, leveraging vast amounts of multimodal data available on social media platforms. Studying and investigating social media popularity becomes central to various online applications and requires novel methods of comprehensive analysis, multimodal comprehension, a…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: ACM Multimedia. arXiv admin note: text overlap with arXiv:1910.01795

  33. arXiv:2405.09713  [pdf, other]

    cs.CV cs.AI cs.CL

    SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

    Authors: Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, Chuang Gan

    Abstract: Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations, specifically withi…

    Submitted 16 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: CVPR

  34. arXiv:2405.09711  [pdf, other]

    cs.AI cs.CL cs.CV

    STAR: A Benchmark for Situated Reasoning in Real-World Videos

    Authors: Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, Chuang Gan

    Abstract: Reasoning in the real world is not divorced from situations. How to capture the present knowledge from surrounding situations and perform reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark that evaluates the situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos, called Situated…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: NeurIPS

  35. arXiv:2405.06676  [pdf, other]

    cs.CL cs.AI cs.AR

    EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD

    Authors: Bing-Yue Wu, Utsav Sharma, Sai Rahul Dhanvi Kankipati, Ajay Yadav, Bintu Kappil George, Sai Ritish Guntupalli, Austin Rovinski, Vidya A. Chhabria

    Abstract: Large language models (LLMs) serve as powerful tools for design, providing capabilities for both task automation and design assistance. Recent advancements have shown tremendous potential for facilitating LLM integration into the chip design process; however, many of these works rely on data that are not publicly available and/or not permissively licensed for use in LLM training and distribution.…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Under review at Workshop on LLM-Aided Design (LAD'24)

  36. arXiv:2405.03003  [pdf, other]

    cs.LG cs.AI cs.CL

    Parameter-Efficient Fine-Tuning with Discrete Fourier Transform

    Authors: Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, Jia Li

    Abstract: Low-rank adaptation (LoRA) has recently gained much interest in fine-tuning foundation models. It effectively reduces the number of trainable parameters by incorporating low-rank matrices $A$ and $B$ to represent the weight change, i.e., $ΔW=BA$. Despite LoRA's progress, it faces storage challenges when handling extensive customization adaptations or larger base models. In this work, we aim to fur…

    Submitted 5 May, 2024; originally announced May 2024.

    Comments: Accepted by ICML 2024

  37. arXiv:2404.19384  [pdf, other]

    cs.CV cs.AI

    Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection

    Authors: Zhanwei Zhang, Minghao Chen, Shuai Xiao, Liang Peng, Hengjia Li, Binbin Lin, Ping Li, Wenxiao Wang, Boxi Wu, Deng Cai

    Abstract: Recent self-training techniques have shown notable improvements in unsupervised domain adaptation for 3D object detection (3D UDA). These techniques typically select pseudo labels, i.e., 3D boxes, to supervise models for the target domain. However, this selection process inevitably introduces unreliable 3D boxes, in which 3D points cannot be definitively assigned as foreground or background. Previ…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR2024

  38. arXiv:2404.16484  [pdf, other]

    cs.CV eess.IV

    Real-Time 4K Super-Resolution of Compressed AVIF Images. AIS 2024 Challenge Survey

    Authors: Marcos V. Conde, Zhijun Lei, Wen Li, Cosmin Stejerean, Ioannis Katsavounidis, Radu Timofte, Kihwan Yoon, Ganzorig Gankhuyag, Jiangtao Lv, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Zhiyuan Li, Hao Wei, Chenyang Ge, Dongyang Zhang, Tianle Liu, Huaian Chen, Yi Jin, Menghan Zhou, Yiqiang Yan, Si Gao, Biao Wu, Shaoli Liu, et al. (50 additional authors not shown)

    Abstract: This paper introduces a novel benchmark as part of the AIS 2024 Real-Time Image Super-Resolution (RTSR) Challenge, which aims to upscale compressed images from 540p to 4K resolution (4x factor) in real-time on commercial GPUs. For this, we use a diverse test set containing a variety of 4K images ranging from digital art to gaming and photography. The images are compressed using the modern AVIF cod…

    Submitted 25 April, 2024; originally announced April 2024.

    Comments: CVPR 2024, AI for Streaming (AIS) Workshop

  39. arXiv:2404.12803  [pdf, other]

    cs.CV cs.LG

    TextSquare: Scaling up Text-Centric Visual Instruction Tuning

    Authors: Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

    Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square…

    Submitted 19 April, 2024; originally announced April 2024.

  40. arXiv:2404.09526  [pdf, other]

    cs.DC cs.LG

    LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism

    Authors: Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin

    Abstract: The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this…

    Submitted 15 April, 2024; originally announced April 2024.

  41. arXiv:2404.09113  [pdf, other]

    stat.ML cs.LG math.ST

    Extending Mean-Field Variational Inference via Entropic Regularization: Theory and Computation

    Authors: Bohan Wu, David Blei

    Abstract: Variational inference (VI) has emerged as a popular method for approximate inference for high-dimensional Bayesian models. In this paper, we propose a novel VI method that extends the naive mean field via entropic regularization, referred to as $Ξ$-variational inference ($Ξ$-VI). $Ξ$-VI has a close connection to the entropic optimal transport problem and benefits from the computationally efficient…

    Submitted 16 April, 2024; v1 submitted 13 April, 2024; originally announced April 2024.

  42. arXiv:2404.07572  [pdf, other]

    cs.CR cs.AI

    Fragile Model Watermark for integrity protection: leveraging boundary volatility and sensitive sample-pairing

    Authors: ZhenZhe Gao, Zhenjun Tang, Zhaoxia Yin, Baoyuan Wu, Yue Lu

    Abstract: Neural networks have increasingly influenced people's lives. Ensuring the faithful deployment of neural networks as designed by their model owners is crucial, as they may be susceptible to various malicious or unintentional modifications, such as backdooring and poisoning attacks. Fragile model watermarks aim to prevent unexpected tampering that could lead DNN models to make incorrect decisions. T…

    Submitted 12 June, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: The article has been accepted by IEEE International Conference on Multimedia and Expo 2024

  43. Unblind Text Inputs: Predicting Hint-text of Text Input in Mobile Apps via LLM

    Authors: Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yuekai Huang, Jun Hu, Qing Wang

    Abstract: Mobile apps have become indispensable for accessing and participating in various environments, especially for low-vision users. Users with visual impairments can use screen readers to read the content of each screen and understand the content that needs to be operated. Screen readers need to read the hint-text attribute in the text input component to remind visually impaired users what to fill in.…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: Accepted by the 2024 CHI Conference on Human Factors in Computing Systems

  44. arXiv:2404.01994  [pdf, other]

    cs.CV cs.CL cs.LG

    DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning

    Authors: Mengfei Du, Binhao Wu, Jiwen Zhang, Zhihao Fan, Zejun Li, Ruipu Luo, Xuanjing Huang, Zhongyu Wei

    Abstract: Vision-and-Language navigation (VLN) requires an agent to navigate in unseen environment by following natural language instruction. For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history. Existing works primarily concentrate on cross-modal attention at the fusion stage to achieve this objective. Neverthel…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Accepted by LREC-COLING 2024

  45. arXiv:2404.01585  [pdf, other]

    cs.DB cs.PF

    FLEXIS: FLEXible Frequent Subgraph Mining using Maximal Independent Sets

    Authors: Akshit Sharma, Sam Reinher, Dinesh Mehta, Bo Wu

    Abstract: Frequent Subgraph Mining (FSM) is the process of identifying common subgraph patterns that surpass a predefined frequency threshold. While FSM is widely applicable in fields like bioinformatics, chemical analysis, and social network anomaly detection, its execution remains time-consuming and complex. This complexity stems from the need to recognize high-frequency subgraphs and ascertain if they ex…

    Submitted 1 April, 2024; originally announced April 2024.

  46. arXiv:2403.18222  [pdf, other]

    cs.RO cs.LG

    Uncertainty-Aware Deployment of Pre-trained Language-Conditioned Imitation Learning Policies

    Authors: Bo Wu, Bruce D. Lee, Kostas Daniilidis, Bernadette Bucher, Nikolai Matni

    Abstract: Large-scale robotic policies trained on data from diverse tasks and robotic platforms hold great promise for enabling general-purpose robots; however, reliable generalization to new environment conditions remains a major challenge. Toward addressing this challenge, we propose a novel approach for uncertainty-aware deployment of pre-trained language-conditioned imitation learning agents. Specifical…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: 8 pages, 7 figures

  47. arXiv:2403.17549  [pdf]

    cs.AI cs.CV

    Practical Applications of Advanced Cloud Services and Generative AI Systems in Medical Image Analysis

    Authors: Jingyu Xu, Binbin Wu, Jiaxin Huang, Yulu Gong, Yifan Zhang, Bo Liu

    Abstract: The medical field is one of the important fields in the application of artificial intelligence technology. With the explosive growth and diversification of medical data, as well as the continuous improvement of medical needs and challenges, artificial intelligence technology is playing an increasingly important role in the medical field. Artificial intelligence technologies represented by computer…

    Submitted 26 March, 2024; originally announced March 2024.

  48. arXiv:2403.16271  [pdf, other]

    cs.CV

    Object Detectors in the Open Environment: Challenges, Solutions, and Outlook

    Authors: Siyuan Liang, Wei Wang, Ruoyu Chen, Aishan Liu, Boxi Wu, Ee-Chien Chang, Xiaochun Cao, Dacheng Tao

    Abstract: With the emergence of foundation models, deep learning-based object detectors have shown practical usability in closed set scenarios. However, for real-world tasks, object detectors often operate in open environments, where crucial factors (e.g., data distribution, objective) that influence model learning are often changing. The dynamic and intricate nature of the open environment poses novel and…

    Submitted 9 April, 2024; v1 submitted 24 March, 2024; originally announced March 2024.

    Comments: 37 pages, 17 figures

  49. arXiv:2403.14077  [pdf, other]

    cs.AI cs.CR

    Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics

    Authors: Shan Jia, Reilin Lyu, Kangran Zhao, Yize Chen, Zhiyuan Yan, Yan Ju, Chuanbo Hu, Xin Li, Baoyuan Wu, Siwei Lyu

    Abstract: DeepFakes, which refer to AI-generated media content, have become an increasing concern due to their use as a means for disinformation. Detecting DeepFakes is currently solved with programmed machine learning algorithms. In this work, we investigate the capabilities of multimodal large language models (LLMs) in DeepFake detection. We conducted qualitative and quantitative experiments to demonstrat…

    Submitted 11 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

  50. arXiv:2403.13619  [pdf]

    cs.DC cs.AI

    Dynamic Resource Allocation for Virtual Machine Migration Optimization using Machine Learning

    Authors: Yulu Gong, Jiaxin Huang, Bo Liu, Jingyu Xu, Binbin Wu, Yifan Zhang

    Abstract: The paragraph is grammatically correct and logically coherent. It discusses the importance of mobile terminal cloud computing migration technology in meeting the demands of evolving computer and cloud computing technologies. It emphasizes the need for efficient data access and storage, as well as the utilization of cloud computing migration technology to prevent additional time delays. The paragra…

    Submitted 20 March, 2024; originally announced March 2024.