
Showing 1–50 of 366 results for author: Bai, X

Searching in archive cs.
  1. arXiv:2407.01016  [pdf, other]

    cs.CV

    SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

    Authors: Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai

    Abstract: Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects common in aerial images unexplored. At the same time, the annotation cost of multi-oriented objects is significantly higher than that of their horizontal counterparts. Ther…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2407.00788  [pdf, other]

    cs.CV

    InstantStyle-Plus: Style Transfer with Content-Preserving in Text-to-Image Generation

    Authors: Haofan Wang, Peng Xing, Renyuan Huang, Hao Ai, Qixun Wang, Xu Bai

    Abstract: Style transfer is an inventive process designed to create an image that maintains the essence of the original while embracing the visual style of another. Although diffusion models have demonstrated impressive generative power in personalized subject-driven or style-driven applications, existing state-of-the-art methods still encounter difficulties in achieving a seamless balance between content p…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Technical Report

  3. arXiv:2406.11191  [pdf, other]

    cs.CL

    A Survey on Human Preference Learning for Large Language Models

    Authors: Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, Min Zhang

    Abstract: The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning, enhancing LLMs with excellent applicability and effectiveness in a wide range of contexts. Despite the numerous related studies conducted, a perspective on how human preferences are introduced into LLMs remains limited, which ma…

    Submitted 18 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: IEEE copyright statement added (also applied to the former version)

  4. arXiv:2406.08135  [pdf]

    cs.RO

    Design, modeling, and characteristics of ringshaped robot actuated by functional fluid

    Authors: Zebing Mao, Xuehang Bai, Yanhong Peng, Yayi Shen

    Abstract: The controlled actuation of hydraulic and pneumatic actuators has unveiled fresh and thrilling opportunities for designing mobile robots with adaptable structures. Previously reported rolling robots, which were powered by fluidic systems, often relied on complex principles, cumbersome pump and valve systems, and intricate control strategies, limiting their applicability in other fields. In this in…

    Submitted 12 June, 2024; originally announced June 2024.

  5. arXiv:2406.07232  [pdf, other]

    cs.CL cs.AI

    DUAL-REFLECT: Enhancing Large Language Models for Reflective Translation through Dual Learning Feedback Mechanisms

    Authors: Andong Chen, Lianzhang Lou, Kehai Chen, Xuefeng Bai, Yang Xiang, Muyun Yang, Tiejun Zhao, Min Zhang

    Abstract: Recently, large language models (LLMs) enhanced by self-reflection have achieved promising performance on machine translation. The key idea is guiding LLMs to generate translation with human-like feedback. However, existing self-reflection methods lack effective feedback information, limiting the translation performance. To address this, we introduce a DUAL-REFLECT framework, leveraging the dual l…

    Submitted 21 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 main conference

  6. arXiv:2406.07036  [pdf, other]

    cs.CL cs.AI

    Paying More Attention to Source Context: Mitigating Unfaithful Translations from Large Language Model

    Authors: Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang

    Abstract: Large language models (LLMs) have showcased impressive multilingual machine translation ability. However, unlike encoder-decoder style models, decoder-only LLMs lack an explicit alignment between source and target contexts. Analyzing contribution scores during generation processes revealed that LLMs can be biased towards previously generated tokens over corresponding source tokens, leading to unfa…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by ACL2024 Findings

  7. arXiv:2406.04801  [pdf, other]

    cs.CV

    MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

    Authors: Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai

    Abstract: The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for M…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 9 pages, 6 figures

    ACM Class: I.2

  8. arXiv:2406.03019  [pdf, other]

    cs.CV

    Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

    Authors: Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

    Abstract: Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world. However, due to the great antiquity of the era, a large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters throug…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: ICDAR 2024

  9. arXiv:2406.01302  [pdf]

    cs.CV

    Pulmonary Embolism Mortality Prediction Using Multimodal Learning Based on Computed Tomography Angiography and Clinical Data

    Authors: Zhusi Zhong, Helen Zhang, Fayez H. Fayad, Andrew C. Lancaster, John Sollee, Shreyas Kulkarni, Cheng Ting Lin, Jie Li, Xinbo Gao, Scott Collins, Colin Greineder, Sun H. Ahn, Harrison X. Bai, Zhicheng Jiao, Michael K. Atalay

    Abstract: Purpose: Pulmonary embolism (PE) is a significant cause of mortality in the United States. The objective of this study is to implement deep learning (DL) models using Computed Tomography Pulmonary Angiography (CTPA), clinical data, and PE Severity Index (PESI) scores to predict PE mortality. Materials and Methods: 918 patients (median age 64 years, range 13-99 years, 52% female) with 3,978 CTPAs w…

    Submitted 5 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

  10. arXiv:2406.00684  [pdf, other]

    cs.CV cs.CL

    Deciphering Oracle Bone Language with Diffusion Models

    Authors: Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

    Abstract: Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a no…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: ACL2024 main conference long paper

  11. arXiv:2405.16038  [pdf, other]

    cs.CV

    Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

    Authors: Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, Hui-Liang Shen

    Abstract: Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is increasingly aggressive, as recent works solely pursue higher performance rather than both performance and efficiency. In this paper, w…

    Submitted 24 May, 2024; originally announced May 2024.

  12. arXiv:2405.13874  [pdf, other]

    cs.CV

    Affine-based Deformable Attention and Selective Fusion for Semi-dense Matching

    Authors: Hongkai Chen, Zixin Luo, Yurun Tian, Xuyang Bai, Ziyu Wang, Lei Zhou, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan

    Abstract: Identifying robust and accurate correspondences across images is a fundamental problem in computer vision that enables various downstream tasks. Recent semi-dense matching methods emphasize the effectiveness of fusing relevant cross-view information through Transformer. In this paper, we propose several improvements upon this paradigm. Firstly, we introduce affine-based local attention to model cr…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted to CVPR2024 Image Matching Workshop

  13. arXiv:2405.12533  [pdf]

    cs.CV

    Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering

    Authors: Hiba Maryam, Ling Fu, Jiajun Song, Tajrian ABM Shafayet, Qidi Luo, Xiang Bai, Yuliang Liu

    Abstract: The development of Urdu scene text detection, recognition, and Visual Question Answering (VQA) technologies is crucial for advancing accessibility, information retrieval, and linguistic diversity in digital content, facilitating better understanding and interaction with Urdu-language visual data. This initiative seeks to bridge the gap between textual and visual comprehension. We propose a new mul…

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Accepted by the International Conference on Document Analysis and Recognition (ICDAR) 2024

  14. arXiv:2405.12110  [pdf, other]

    cs.CV

    CoR-GS: Sparse-View 3D Gaussian Splatting via Co-Regularization

    Authors: Jiawei Zhang, Jiahe Li, Xiaohan Yu, Lei Huang, Lin Gu, Jin Zheng, Xiao Bai

    Abstract: 3D Gaussian Splatting (3DGS) creates a radiance field consisting of 3D Gaussians to represent a scene. With sparse training views, 3DGS easily suffers from overfitting, negatively impacting the reconstruction quality. This paper introduces a new co-regularization perspective for improving sparse-view 3DGS. When training two 3D Gaussian radiance fields with the same sparse views of a scene, we obse…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Project page: https://jiaw-z.github.io/CoR-GS/

  15. arXiv:2405.11985  [pdf, other]

    cs.CV

    MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering

    Authors: Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, Can Huang

    Abstract: Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering wo…

    Submitted 11 June, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

  16. arXiv:2405.11437  [pdf, other]

    cs.CV

    The First Swahili Language Scene Text Detection and Recognition Dataset

    Authors: Fadila Wendigoundi Douamba, Jianjun Song, Ling Fu, Yuliang Liu, Xiang Bai

    Abstract: Scene text recognition is essential in many applications, including automated translation, information retrieval, driving assistance, and enhancing accessibility for individuals with visual impairments. Much research has been done to improve the accuracy and performance of scene text detection and recognition models. However, most of this research has been conducted in the most common languages, E…

    Submitted 18 May, 2024; originally announced May 2024.

    Comments: Accepted to ICDAR 2024

  17. arXiv:2405.06706  [pdf, other]

    cs.CL cs.AI

    Exploring the Capabilities of Large Multimodal Models on Dense Text

    Authors: Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai

    Abstract: While large multi-modal models (LMM) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions.…

    Submitted 9 May, 2024; originally announced May 2024.

  18. arXiv:2405.03988  [pdf, other]

    cs.IR cs.AI

    Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application

    Authors: Jian Jia, Yipei Wang, Yan Li, Honggang Chen, Xuehan Bai, Zhaocheng Liu, Jian Liang, Quan Chen, Han Li, Peng Jiang, Kun Gai

    Abstract: Contemporary recommender systems predominantly rely on collaborative filtering techniques, employing ID-embedding to capture latent associations among users and items. However, this approach overlooks the wealth of semantic information embedded within textual descriptions of items, leading to suboptimal performance in cold-start scenarios and long-tail user recommendations. Leveraging the capabili…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: 11 pages, 6 figures

  19. arXiv:2404.19652  [pdf, other]

    cs.CV cs.AI

    VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization

    Authors: Yuliang Liu, Mingxin Huang, Hao Yan, Linger Deng, Weijia Wu, Hao Lu, Chunhua Shen, Lianwen Jin, Xiang Bai

    Abstract: Text spotting, a task involving the extraction of textual information from image or video sequences, faces challenges in cross-domain adaption, such as image-to-image and image-to-video generalization. In this paper, we introduce a new method, termed VimTS, which enhances the generalization ability of the model by achieving better synergy among different tasks. Typically, we propose a Prompt Queri…

    Submitted 14 May, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

  20. arXiv:2404.17513  [pdf, other]

    cs.CL cs.AI

    A Comprehensive Evaluation on Event Reasoning of Large Language Models

    Authors: Zhengwei Tao, Zhi Jin, Yifan Zhang, Xiancai Chen, Xiaoying Bai, Yue Fang, Haiyan Zhao, Jia Li, Chongyang Tao

    Abstract: Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and needs to deal with the diversity of the inter-event relations and the reasoning paradigms. How well LLMs accomplish event reasoning on various relations and reasoning paradigms remains unknown. To mitigate this disparity, we comprehensively evaluate the abil…

    Submitted 26 April, 2024; originally announced April 2024.

  21. arXiv:2404.16176  [pdf, other]

    cs.DS math.MG

    Unweighted Layered Graph Traversal

    Authors: Xingjian Bai, Christian Coester, Romain Cosson

    Abstract: Introduced by Papadimitriou and Yannakakis in 1989, layered graph traversal is an important problem in online algorithms and mobile computing that has been studied for several decades, and which now is essentially resolved in its original formulation. In this paper, we demonstrate that what appears to be an innocuous modification of the problem actually leads to a drastic (exponential) reduction o…

    Submitted 24 April, 2024; originally announced April 2024.

  22. arXiv:2404.15264  [pdf, other]

    cs.CV

    TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting

    Authors: Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu

    Abstract: Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, due to the difficulty in fitting steep appearance changes, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance fields fram…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: Project page: https://fictionarry.github.io/TalkingGaussian/

  23. arXiv:2404.12803  [pdf, other]

    cs.CV cs.LG

    TextSquare: Scaling up Text-Centric Visual Instruction Tuning

    Authors: Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao, Wei Shi, Yuliang Liu, Hao Liu, Yuan Xie, Xiang Bai, Can Huang

    Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square…

    Submitted 19 April, 2024; originally announced April 2024.

  24. arXiv:2404.11978  [pdf, other]

    cs.CL

    EVIT: Event-Oriented Instruction Tuning for Event Reasoning

    Authors: Zhengwei Tao, Xiancai Chen, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yiwei Lou

    Abstract: Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event r…

    Submitted 18 April, 2024; originally announced April 2024.

  25. arXiv:2404.10429  [pdf, other]

    cs.AI

    MEEL: Multi-Modal Event Evolution Learning

    Authors: Zhengwei Tao, Zhi Jin, Junqiang Huang, Xiancai Chen, Xiaoying Bai, Haiyan Zhao, Yifan Zhang, Chongyang Tao

    Abstract: Multi-modal Event Reasoning (MMER) endeavors to endow machines with the ability to comprehend intricate event relations across diverse data modalities. MMER is fundamental and underlies a wide range of applications. Despite extensive instruction fine-tuning, current multi-modal large language models still fall short in such ability. The disparity stems from that existing models are insufficient to…

    Submitted 16 April, 2024; originally announced April 2024.

  26. arXiv:2404.05989  [pdf, other]

    cs.CL cs.IR

    Event-enhanced Retrieval in Real-time Search

    Authors: Yanan Zhang, Xiaoling Bai, Tianhua Zhou

    Abstract: The embedding-based retrieval (EBR) approach is widely used in mainstream search engine retrieval systems and is crucial in recent retrieval-augmented methods for eliminating LLM illusions. However, existing EBR models often face the "semantic drift" problem and insufficient focus on key information, leading to a low adoption rate of retrieval results in subsequent steps. This issue is especially…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: LREC-COLING 2024

  27. arXiv:2404.04624  [pdf, other]

    cs.CV

    Bridging the Gap Between End-to-End and Two-Step Text Spotting

    Authors: Mingxin Huang, Hongliang Li, Yuliang Liu, Xiang Bai, Lianwen Jin

    Abstract: Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies, the two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper, we introduce Bridg…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR2024

  28. arXiv:2404.03736  [pdf, other]

    cs.CV

    SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

    Authors: Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Fan Wang, Xiang Bai

    Abstract: Recent advances in 2D/3D generative models enable the generation of dynamic 3D objects from a single-view video. Existing approaches utilize score distillation sampling to form the dynamic scene as dynamic NeRF or dense 3D Gaussians. However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions due…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Project Page: https://sc4d.github.io/

  29. arXiv:2404.03248  [pdf, other]

    cs.CV

    Learning Transferable Negative Prompts for Out-of-Distribution Detection

    Authors: Tianqi Li, Guansong Pang, Xiao Bai, Wenjun Miao, Jin Zheng

    Abstract: Existing prompt learning methods have shown certain capabilities in Out-of-Distribution (OOD) detection, but the lack of OOD images in the target dataset in their training can lead to mismatches between OOD images and In-Distribution (ID) categories, resulting in a high false positive rate. To address this issue, we introduce a novel OOD detection method, named 'NegPrompt', to learn a set of negat…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024

  30. arXiv:2404.02733  [pdf, other]

    cs.CV

    InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation

    Authors: Haofan Wang, Matteo Spinelli, Qixun Wang, Xu Bai, Zekui Qin, Anthony Chen

    Abstract: Tuning-free diffusion-based models have demonstrated significant potential in the realm of image personalization and customization. However, despite this notable progress, current models continue to grapple with several complex challenges in producing style-consistent image generation. Firstly, the concept of style is inherently underdetermined, encompassing a multitude of elements such as color,…

    Submitted 4 April, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

    Comments: Technical Report

  31. arXiv:2403.19128  [pdf, other]

    cs.CV

    OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

    Authors: Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang

    Abstract: Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  32. arXiv:2403.14598  [pdf, other]

    cs.CV

    PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

    Authors: Zheng Zhang, Yeyao Ma, Enming Zhang, Xiang Bai

    Abstract: PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. To overcome the limitation of the LMM being limited to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. This schema includes images, task instructions, conditional prompts, and mask tokens, which enable the mode…

    Submitted 21 March, 2024; originally announced March 2024.

  33. arXiv:2403.11337  [pdf, other]

    cs.CV cs.AI

    Enhancing Bandwidth Efficiency for Video Motion Transfer Applications using Deep Learning Based Keypoint Prediction

    Authors: Xue Bai, Tasmiah Haque, Sumit Mohan, Yuliang Cai, Byungheon Jeong, Adam Halasz, Srinjoy Das

    Abstract: We propose a deep learning based novel prediction framework for enhanced bandwidth reduction in motion transfer enabled video applications such as video conferencing, virtual reality gaming and privacy preservation for patient health monitoring. To model complex motion, we use the First Order Motion Model (FOMM) that represents dynamic objects using learned keypoints along with their local affine…

    Submitted 17 March, 2024; originally announced March 2024.

  34. arXiv:2403.09493  [pdf, other]

    cs.CV

    Anomaly Detection by Adapting a pre-trained Vision Language Model

    Authors: Yuxuan Cai, Xinwei He, Dingkang Liang, Ao Tong, Xiang Bai

    Abstract: Recently, large vision and language models have shown their success when adapting them to many downstream tasks. In this paper, we present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model. To this end, we make two important improvements: 1) To acquire unified anomaly detection across industrial images of multiple categories, we introduce the learnable p…

    Submitted 14 March, 2024; originally announced March 2024.

  35. arXiv:2403.07705  [pdf, other]

    cs.CV

    Robust Synthetic-to-Real Transfer for Stereo Matching

    Authors: Jiawei Zhang, Jiahe Li, Lei Huang, Xiaohan Yu, Lin Gu, Jin Zheng, Xiao Bai

    Abstract: With advancements in domain generalized stereo matching networks, models pre-trained on synthetic data demonstrate strong robustness to unseen domains. However, few studies have investigated the robustness after fine-tuning them in real-world scenarios, during which the domain generalization ability can be seriously degraded. In this paper, we explore fine-tuning stereo matching networks without c…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024

  36. arXiv:2403.06912  [pdf, other]

    cs.CV

    DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization

    Authors: Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, Lin Gu

    Abstract: Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views, yet prevailing methods suffer from high training costs and slow inference speed. This paper introduces DNGaussian, a depth-regularized framework based on 3D Gaussian radiance fields, offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from t…

    Submitted 24 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024. Project page: https://fictionarry.github.io/DNGaussian/

  37. arXiv:2403.04473  [pdf, other]

    cs.CV cs.AI

    TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

    Authors: Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai

    Abstract: We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancement across several dimensions: By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; We hypothesize that images may contain redundant tokens, and by using similarity to filter o…

    Submitted 15 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  38. arXiv:2403.01439  [pdf, other]

    cs.CV

    Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis

    Authors: Xin Zhou, Dingkang Liang, Wei Xu, Xingkui Zhu, Yihan Xu, Zhikang Zou, Xiang Bai

    Abstract: Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However, existing methods for model adaptation usually update all model parameters, i.e., full fine-tuning paradigm, which is inefficient as it relies on high computational costs (e.g., training GPU memory) and massive storage space. In this paper, we aim to study parameter-efficient transfer…

    Submitted 5 April, 2024; v1 submitted 3 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024. Code is available at https://github.com/LMD0311/DAPT

  39. arXiv:2403.01038  [pdf, other]

    cs.CR cs.AI

    AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks

    Authors: Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, Zhou Li

    Abstract: Large language models (LLMs) have demonstrated impressive results on natural language tasks, and security researchers are beginning to employ them in both offensive and defensive systems. In cyber-security, there have been multiple research efforts that utilize LLMs focusing on the pre-breach stage of attacks like phishing and malware generation. However, so far there lacks a comprehensive study r…

    Submitted 1 March, 2024; originally announced March 2024.

  40. arXiv:2402.15806  [pdf, other]

    cs.CV

    Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition

    Authors: Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai

    Abstract: Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training. However, collecting and labeling real text images is expensive and time-consuming, which limits the availability of real data. Therefore, most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models. To alleviate this prob…

    Submitted 24 February, 2024; originally announced February 2024.

    Comments: Accepted by Pattern Recognition Letters

  41. arXiv:2402.13643  [pdf, other]

    cs.CV

    Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition

    Authors: Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai

    Abstract: Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-a…

    Submitted 21 February, 2024; originally announced February 2024.

    Comments: Accepted by Pattern Recognition

  42. arXiv:2402.10739  [pdf, other]

    cs.CV

    PointMamba: A Simple State Space Model for Point Cloud Analysis

    Authors: Dingkang Liang, Xin Zhou, Wei Xu, Xingkui Zhu, Zhikang Zou, Xiaoqing Ye, Xiao Tan, Xiang Bai

    Abstract: Transformers have become one of the foundational architectures in point cloud analysis tasks due to their excellent global modeling ability. However, the attention mechanism has quadratic complexity, making the design of a linear complexity method with global modeling appealing. In this paper, we propose PointMamba, transferring the success of Mamba, a recent representative state space model (SSM)…

    Submitted 29 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Update the architecture and performance. The code is available at https://github.com/LMD0311/PointMamba

  43. arXiv:2402.06126  [pdf, other

    cs.CL cs.AI cs.LG

    Learn To be Efficient: Build Structured Sparsity in Large Language Models

    Authors: Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash

    Abstract: Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural approach to reduce this cost by involving only parts of the parameters for inference. However, existing methods only focus on utilizing this naturally formed activation sparsity in a post-training…

    Submitted 3 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  44. arXiv:2402.06107  [pdf, other

    cs.CV cs.AI cs.CY cs.LG

    Multiple Instance Learning for Cheating Detection and Localization in Online Examinations

    Authors: Yemeng Liu, Jing Ren, Jianshuo Xu, Xiaomei Bai, Roopdeep Kaur, Feng Xia

    Abstract: The spread of the Coronavirus disease-2019 epidemic has caused many courses and exams to be conducted online. The cheating behavior detection model in examination invigilation systems plays a pivotal role in guaranteeing the equality of long-distance examinations. However, cheating behavior is rare, and most researchers do not comprehensively take into account features such as head posture, gaze a…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: 12 pages, 7 figures

    MSC Class: 68T40; 68T45 ACM Class: I.2.10; I.5.4

    Journal ref: IEEE Transactions on Cognitive and Developmental Systems 2024

  45. arXiv:2402.04420  [pdf, other

    cs.CY cs.AI

    Measuring machine learning harms from stereotypes: requires understanding who is being harmed by which errors in what ways

    Authors: Angelina Wang, Xuechunzi Bai, Solon Barocas, Su Lin Blodgett

    Abstract: As machine learning applications proliferate, we need an understanding of their potential for harm. However, current fairness metrics are rarely grounded in human psychological experiences of harm. Drawing on the social psychology of stereotypes, we use a case study of gender stereotypes in image search to examine how people react to machine learning errors. First, we use survey studies to show th…

    Submitted 6 February, 2024; originally announced February 2024.

    Comments: earlier draft non-archival at EAAMO 2023

  46. arXiv:2402.04105  [pdf, other

    cs.CY cs.CL

    Measuring Implicit Bias in Explicitly Unbiased Large Language Models

    Authors: Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas L. Griffiths

    Abstract: Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: as LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a con…

    Submitted 23 May, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

  47. arXiv:2401.17755  [pdf, other

    cs.CL

    CauESC: A Causal Aware Model for Emotional Support Conversation

    Authors: Wei Chen, Hengxu Lin, Qun Zhang, Xiaoping Zhang, Xiang Bai, Xuanjing Huang, Zhongyu Wei

    Abstract: Emotional Support Conversation aims at reducing the seeker's emotional distress through supportive response. Existing approaches have two limitations: (1) They ignore the emotion causes of the distress, which is important for fine-grained emotion understanding; (2) They focus on the seeker's own mental state rather than the emotional dynamics during interaction between speakers. To address these i…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: 15 pages, 5 figures

    ACM Class: I.2.7

  48. arXiv:2401.17619  [pdf, ps, other

    cs.SD eess.AS

    Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

    Authors: Jiatong Shi, Yueqian Lin, Xinyi Bai, Keyi Zhang, Yuning Wu, Yuxun Tang, Yifeng Yu, Qin Jin, Shinji Watanabe

    Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatu…

    Submitted 12 June, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech2024

  49. arXiv:2401.15365  [pdf, other

    cs.CV

    An open dataset for oracle bone script recognition and decipherment

    Authors: Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Jinpeng Wan, Haisu Guan, Zhebin Kuang, Lianwen Jin, Xiang Bai, Yuliang Liu

    Abstract: Oracle Bone Script (OBS), one of the earliest known forms of ancient Chinese writing, holds invaluable insights into the humanities and geography of the Shang Dynasty, dating back 3,000 years. The immense historical and cultural significance of these writings cannot be overstated. However, the passage of time has obscured much of their meaning, presenting a significant challenge in deciphering the…

    Submitted 5 June, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

  50. You Only Look Bottom-Up for Monocular 3D Object Detection

    Authors: Kaixin Xiong, Dingyuan Zhang, Dingkang Liang, Zhe Liu, Hongcheng Yang, Wondimu Dikubab, Jianwei Cheng, Xiang Bai

    Abstract: Monocular 3D Object Detection is an essential task for autonomous driving. Meanwhile, accurate 3D object detection from pure images is very challenging due to the loss of depth information. Most existing image-based methods infer objects' location in 3D space based on their 2D sizes on the image plane, which usually ignores the intrinsic position clues from images, leading to unsatisfactory perfor…

    Submitted 27 January, 2024; originally announced January 2024.

    Comments: Accepted by IEEE Robotics and Automation Letters (RA-L)