Skip to main content

Showing 1–50 of 312 results for author: Ge, Y

Searching in archive cs. Search in all archives.
.
  1. arXiv:2406.19311  [pdf, other

    cs.CR cs.SD eess.AS

    Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems

    Authors: Zheng Fang, Tao Wang, Lingchen Zhao, Shenyi Zhang, Bowen Li, Yunjie Ge, Qi Li, Chao Shen, Qian Wang

    Abstract: In recent years, extensive research has been conducted on the vulnerability of ASR systems, revealing that black-box adversarial example attacks pose significant threats to real-world ASR systems. However, most existing black-box attacks rely on queries to the target ASRs, which is impractical when queries are not permitted. In this paper, we propose ZQ-Attack, a transfer-based adversarial attack… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: To appear in the Proceedings of The ACM Conference on Computer and Communications Security (CCS), 2024

  2. arXiv:2406.18008  [pdf, other

    cs.IT

    Rate-Distortion-Perception Tradeoff for Gaussian Vector Sources

    Authors: **g**g Qian, Sadaf Salehkalaibar, Jun Chen, Ashish Khisti, Wei Yu, Wuxian Shi, Yiqun Ge, Wen Tong

    Abstract: This paper studies the rate-distortion-perception (RDP) tradeoff for a Gaussian vector source coding problem where the goal is to compress the multi-component source subject to distortion and perception constraints. The purpose of imposing a perception constraint is to ensure visually pleasing reconstructions. This paper studies this RDP setting with either the Kullback-Leibler (KL) divergence or… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

  3. arXiv:2406.12671  [pdf, other

    cs.CV

    GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models

    Authors: Yongtao Ge, Guangkai Xu, Zhiyue Zhao, Libo Sun, Zheng Huang, Yanlong Sun, Hao Chen, Chunhua Shen

    Abstract: Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative-based paradigms show the potential of achieving impressive generalization performance on unseen scenes by… ▽ More

    Submitted 20 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    Comments: Code and Benchmark are available at: https://github.com/aim-uofa/GeoBench

  4. arXiv:2406.12275  [pdf, other

    cs.CV

    VoCo-LLaMA: Towards Vision Compression with Large Language Models

    Authors: Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

    Abstract: Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs… ▽ More

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 18 pages, 5 figures

  5. arXiv:2406.02395  [pdf, other

    cs.LG cs.CV

    GrootVL: Tree Topology is All You Need in State Space Model

    Authors: Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

    Abstract: The state space models, employing recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, constrained by the inherent geometric constraints of sequences, it still falls short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree t… ▽ More

    Submitted 4 June, 2024; originally announced June 2024.

    Comments: The code is available at https://github.com/EasonXiao-888/GrootVL

  6. arXiv:2405.19519  [pdf, other

    cs.CL cs.AI

    Two-layer retrieval augmented generation framework for low-resource medical question-answering: proof of concept using Reddit data

    Authors: Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker

    Abstract: Retrieval augmented generation (RAG) provides the capability to constrain generative model outputs, and mitigate the possibility of hallucination, by providing relevant in-context text. The number of tokens a generative large language model (LLM) can incorporate as context is finite, thus limiting the volume of knowledge from which to generate an answer. We propose a two-layer RAG framework for qu… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

  7. arXiv:2405.15287  [pdf, other

    cs.CV

    StyleMaster: Towards Flexible Stylized Image Generation with Diffusion Models

    Authors: Chengming Xu, Kai Hu, Donghao Luo, Jiangning Zhang, Wei Li, Yanhao Ge, Chengjie Wang

    Abstract: Stylized Text-to-Image Generation (STIG) aims to generate images based on text prompts and style reference images. We in this paper propose a novel framework dubbed as StyleMaster for this task by leveraging pretrained Stable Diffusion (SD), which tries to solve the previous problems such as insufficient style and inconsistent semantics. The enhancement lies in two novel module, namely multi-sourc… ▽ More

    Submitted 24 May, 2024; originally announced May 2024.

  8. arXiv:2405.12970  [pdf, other

    cs.CV

    Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

    Authors: Yue Han, Junwei Zhu, Keke He, Xu Chen, Yanhao Ge, Wei Li, Xiangtai Li, Jiangning Zhang, Chengjie Wang, Yong Liu

    Abstract: Current face reenactment and swap** methods mainly rely on GAN frameworks, but recent focus has shifted to pre-trained diffusion models for their superior generation capabilities. However, training these models is resource-intensive, and the results have not yet achieved satisfactory performance levels. To address this issue, we introduce Face-Adapter, an efficient and effective adapter designed… ▽ More

    Submitted 21 May, 2024; originally announced May 2024.

    Comments: Project Page: https://faceadapter.github.io/face-adapter.github.io/

  9. arXiv:2405.09546  [pdf, other

    cs.CV

    BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

    Authors: Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

    Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and renderin… ▽ More

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: CVPR 2024 (Highlight). Project website: https://behavior-vision-suite.github.io/

  10. arXiv:2405.07990  [pdf, other

    cs.CL cs.CV

    Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

    Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, ** Luo

    Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figure to executable code, have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLM… ▽ More

    Submitted 13 May, 2024; originally announced May 2024.

  11. arXiv:2405.07027  [pdf, other

    cs.CV cs.AI cs.RO

    TD-NeRF: Novel Truncated Depth Prior for Joint Camera Pose and Neural Radiance Field Optimization

    Authors: Zhen Tan, Zongtan Zhou, Yangbing Ge, Zi Wang, Xieyuanli Chen, Dewen Hu

    Abstract: The reliance on accurate camera poses is a significant barrier to the widespread deployment of Neural Radiance Fields (NeRF) models for 3D reconstruction and SLAM tasks. The existing method introduces monocular depth priors to jointly optimize the camera poses and NeRF, which fails to fully exploit the depth priors and neglects the impact of their inherent noise. In this paper, we propose Truncate… ▽ More

    Submitted 11 May, 2024; originally announced May 2024.

  12. arXiv:2405.06145  [pdf, other

    cs.CL cs.AI cs.LG

    Reddit-Impacts: A Named Entity Recognition Dataset for Analyzing Clinical and Social Effects of Substance Use Derived from Social Media

    Authors: Yao Ge, Sudeshna Das, Karen O'Connor, Mohammed Ali Al-Garadi, Graciela Gonzalez-Hernandez, Abeed Sarker

    Abstract: Substance use disorders (SUDs) are a growing concern globally, necessitating enhanced understanding of the problem and its trends through data-driven research. Social media are unique and important sources of information about SUDs, particularly since the data in such sources are often generated by people with lived experiences. In this paper, we introduce Reddit-Impacts, a challenging Named Entit… ▽ More

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: 7 pages, 1 figure, 4 tables

  13. arXiv:2405.04007  [pdf, other

    cs.CV

    SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

    Authors: Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

    Abstract: In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario da… ▽ More

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: Technical Report; Dataset released in https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit

  14. arXiv:2405.03119  [pdf, ps, other

    cs.IT eess.SP

    DAFT-Spread Affine Frequency Division Multiple Access for Downlink Transmission

    Authors: Yiwei Tao, Miaowen Wen, Yao Ge, Tianqi Mao, Lixia Xiao, Jun Li

    Abstract: Affine frequency division multiplexing (AFDM) and orthogonal AFDM access (O-AFDMA) are promising techniques based on chirp signals, which are able to suppress the performance deterioration caused by Doppler shifts in high-mobility scenarios. However, the high peak-to-average power ratio (PAPR) in AFDM or O-AFDMA is still a crucial problem, which severely limits their practical applications. In thi… ▽ More

    Submitted 5 May, 2024; originally announced May 2024.

  15. arXiv:2405.02604  [pdf, ps, other

    cs.IT eess.SP

    Interleave Frequency Division Multiplexing

    Authors: Yuhao Chi, Lei Liu, Yao Ge, Xuehui Chen, Ying Li, Zhaoyang Zhang

    Abstract: In this letter, we study interleave frequency division multiplexing (IFDM) for multicarrier modulation in static multipath and mobile time-varying channels, which outperforms orthogonal frequency division multiplexing (OFDM), orthogonal time frequency space (OTFS), and affine frequency division multiplexing (AFDM) by considering practical advanced detectors. The fundamental principle underlying ex… ▽ More

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: Accepted by IEEE Wireless Communications Letters

  16. arXiv:2405.01312  [pdf, other

    cs.DB cs.CR

    Privacy-Enhanced Database Synthesis for Benchmark Publishing

    Authors: Yongrui Zhong, Yunqing Ge, Jianbin Qin, Shuyuan Zheng, Bo Tang, Yu-Xuan Qiu, Rui Mao, Ye Yuan, Makoto Onizuka, Chuan Xiao

    Abstract: Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating syn… ▽ More

    Submitted 2 May, 2024; originally announced May 2024.

  17. arXiv:2404.19752  [pdf, other

    cs.CV

    Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

    Authors: Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

    Abstract: Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning model… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  18. arXiv:2404.16957  [pdf, other

    cs.AI cs.CY

    Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability

    Authors: Yunfei Ge, Quanyan Zhu

    Abstract: The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility at… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

  19. arXiv:2404.16790  [pdf, other

    cs.CV

    SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

    Authors: Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

    Abstract: Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their pro… ▽ More

    Submitted 25 April, 2024; originally announced April 2024.

  20. arXiv:2404.14396  [pdf, other

    cs.CV

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Authors: Yuying Ge, Sijie Zhao, **guo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

    Abstract: The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. I… ▽ More

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Project released at: https://github.com/AILab-CVC/SEED-X

  21. arXiv:2404.13884  [pdf

    eess.IV cs.CV

    MambaUIE&SR: Unraveling the Ocean's Secrets with Only 2.8 GFLOPs

    Authors: Zhihao Chen, Yiyuan Ge

    Abstract: Underwater Image Enhancement (UIE) techniques aim to address the problem of underwater image degradation due to light absorption and scattering. In recent years, both Convolution Neural Network (CNN)-based and Transformer-based methods have been widely explored. In addition, combining CNN and Transformer can effectively combine global and local information for enhancement. However, this approach i… ▽ More

    Submitted 24 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: arXiv admin note: text overlap with arXiv:2305.08824 by other authors

  22. arXiv:2404.13600  [pdf, other

    cs.RO

    Are We Ready for Planetary Exploration Robots? The TAIL-Plus Dataset for SLAM in Granular Environments

    Authors: Zirui Wang, Chen Yao, Yangtao Ge, Guowei Shi, Ningbo Yang, Zheng Zhu, Kewei Dong, Hexiang Wei, Zhenzhong Jia, **g Wu

    Abstract: So far, planetary surface exploration depends on various mobile robot platforms. The autonomous navigation and decision-making of these mobile robots in complex terrains largely rely on their terrain-aware perception, localization and map** capabilities. In this paper we release the TAIL-Plus dataset, a new challenging dataset in deformable granular environments for planetary exploration robots,… ▽ More

    Submitted 21 April, 2024; originally announced April 2024.

    Comments: Accepted to the IEEE ICRA Workshop on Field Robotics 2024

  23. arXiv:2404.07855  [pdf, other

    cs.CV

    Resolve Domain Conflicts for Generalizable Remote Physiological Measurement

    Authors: Weiyu Sun, Xinyu Zhang, Hao Lu, Ying Chen, Yun Ge, Xiaolin Huang, Jie Yuan, Yingcong Chen

    Abstract: Remote photoplethysmography (rPPG) technology has become increasingly popular due to its non-invasive monitoring of various physiological indicators, making it widely applicable in multimedia interaction, healthcare, and emotion analysis. Existing rPPG methods utilize multiple datasets for training to enhance the generalizability of models. However, they often overlook the underlying conflict issu… ▽ More

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Accepted by ACM MM 2023

  24. arXiv:2404.06835  [pdf, other

    cs.CV

    Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

    Authors: Yanqi Ge, Jiaqi Liu, Qingnan Fan, Xi Jiang, Ye Huang, Shuai Qin, Hong Gu, Wen Li, Lixin Duan

    Abstract: In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this w… ▽ More

    Submitted 10 April, 2024; originally announced April 2024.

  25. arXiv:2404.03443  [pdf, ps, other

    cs.CV

    Part-Attention Based Model Make Occluded Person Re-Identification Stronger

    Authors: Zhihao Chen, Yiyuan Ge

    Abstract: The goal of occluded person re-identification (ReID) is to retrieve specific pedestrians in occluded situations. However, occluded person ReID still suffers from background clutter and low-quality local feature representations, which limits model performance. In our research, we introduce a new framework called PAB-ReID, which is a novel ReID model incorporating part-attention mechanisms to tackle… ▽ More

    Submitted 1 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

    Comments: Accepted By International Joint Conference on Neural Networks 2024

  26. arXiv:2404.00308  [pdf, other

    cs.CV

    ST-LLM: Large Language Models Are Effective Temporal Learners

    Authors: Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

    Abstract: Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation, prompting research efforts towards video LLMs to facilitate human-AI interaction at the video level. However, how to effectively encode and understand videos in video-based dialogue systems remains to be solved. In this paper, we investigate a straightforward yet unexplored question: Can we fe… ▽ More

    Submitted 30 March, 2024; originally announced April 2024.

  27. arXiv:2403.19021  [pdf, other

    cs.IR cs.AI cs.CL cs.LG

    IDGenRec: LLM-RecSys Alignment with Textual ID Learning

    Authors: Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li, Yongfeng Zhang

    Abstract: Generative recommendation based on Large Language Models (LLMs) have transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. However, in contrast to standard NLP tasks that inherently operate on human vocabulary, current research in generative recommendations struggles to effectively encode recommendation items within the text-to-text framework using… ▽ More

    Submitted 17 May, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted in SIGIR 2024

  28. arXiv:2403.17664  [pdf, other

    cs.CV

    DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

    Authors: Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu

    Abstract: Facial Appearance Editing (FAE) aims to modify physical attributes, such as pose, expression and lighting, of human facial images while preserving attributes like identity and background, showing great importance in photograph. In spite of the great progress in this area, current researches generally meet three challenges: low generation fidelity, poor attribute preservation, and inefficient infer… ▽ More

    Submitted 26 March, 2024; originally announced March 2024.

  29. arXiv:2403.16971  [pdf, other

    cs.OS cs.AI cs.CL

    AIOS: LLM Agent Operating System

    Authors: Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, Yongfeng Zhang

    Abstract: The integration and deployment of large language model (LLM)-based intelligent agents have been fraught with challenges that compromise their efficiency and efficacy. Among these issues are sub-optimal scheduling and resource allocation of agent requests over the LLM, the difficulties in maintaining context during interactions between agent and LLM, and the complexities inherent in integrating het… ▽ More

    Submitted 25 March, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: 14 pages, 5 figures, 5 tables; comments and suggestions are appreciated

  30. arXiv:2403.16875  [pdf, other

    cs.RO

    TAIL: A Terrain-Aware Multi-Modal SLAM Dataset for Robot Locomotion in Deformable Granular Environments

    Authors: Chen Yao, Yangtao Ge, Guowei Shi, Zirui Wang, Ningbo Yang, Zheng Zhu, Hexiang Wei, Yuntian Zhao, **g Wu, Zhenzhong Jia

    Abstract: Terrain-aware perception holds the potential to improve the robustness and accuracy of autonomous robot navigation in the wilds, thereby facilitating effective off-road traversals. However, the lack of multi-modal perception across various motion patterns hinders the solutions of Simultaneous Localization And Map** (SLAM), especially when confronting non-geometric hazards in demanding landscapes… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE Robotics and Automation Letters

  31. arXiv:2403.13408  [pdf, other

    cs.CV cs.AI

    S2DM: Sector-Shaped Diffusion Models for Video Generation

    Authors: Haoran Lang, Yuxuan Ge, Zheng Tian

    Abstract: Diffusion models have achieved great success in image generation. However, when leveraging this idea for video generation, we face significant challenges in maintaining the consistency and continuity across video frames. This is mainly caused by the lack of an effective framework to align frames of videos with desired temporal features while preserving consistent semantic and stochastic features.… ▽ More

    Submitted 22 March, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: 17 pages, 6 figures

  32. arXiv:2403.12450  [pdf, other

    cs.CV

    Intention Action Anticipation Model with Guide-Feedback Loop Mechanism

    Authors: Zongnan Ma, Fuchun Zhang, Zhixiong Nan, Yao Ge

    Abstract: Anticipating human intention from videos has broad applications, such as automatic driving, robot assistive technology, and virtual reality. This study addresses the problem of intention action anticipation using egocentric video sequences to estimate actions that indicate human intention. We propose a Hierarchical Complete-Recent (HCR) information fusion model that makes full use of the features… ▽ More

    Submitted 19 March, 2024; originally announced March 2024.

  33. arXiv:2403.12373  [pdf, other

    cs.CL

    RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

    Authors: Chi Hu, Yuan Ge, Xiangnan Ma, Hang Cao, Qiang Li, Yonghua Yang, Tong Xiao, **gbo Zhu

    Abstract: Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent res… ▽ More

    Submitted 22 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: LREC-Coling 2024 Long Paper

  34. arXiv:2403.11111  [pdf, other

    cs.CV

    3D Human Reconstruction in the Wild with Synthetic Data Using Generative Models

    Authors: Yongtao Ge, Wenjia Wang, Yongfan Chen, Hao Chen, Chunhua Shen

    Abstract: In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images a… ▽ More

    Submitted 11 April, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

    Comments: project page: https://yongtaoge.github.io/projects/humanwild

  35. arXiv:2403.08557  [pdf

    cs.CV

    Occluded Cloth-Changing Person Re-Identification

    Authors: Zhihao Chen, Yiyuan Ge

    Abstract: Cloth-changing person re-identification aims to retrieve and identify spe-cific pedestrians by using cloth-unrelated features in person cloth-changing scenarios. However, pedestrian images captured by surveillance probes usually contain occlusions in real-world scenarios. The perfor-mance of existing cloth-changing person re-identification methods is sig-nificantly degraded due to the reduction of… ▽ More

    Submitted 14 March, 2024; v1 submitted 13 March, 2024; originally announced March 2024.

  36. arXiv:2403.06090  [pdf, other

    cs.CV

    Diffusion Models Trained with Large Data Are Transferable Visual Models

    Authors: Guangkai Xu, Yongtao Ge, Mingyu Liu, Chengxiang Fan, Kangyang Xie, Zhiyue Zhao, Hao Chen, Chunhua Shen

    Abstract: We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data (even synthetic data only), including monocular depth, surface normal, image segmentation, matting, human pose estimation, among virtual… ▽ More

    Submitted 15 March, 2024; v1 submitted 9 March, 2024; originally announced March 2024.

  37. arXiv:2403.00691  [pdf, other

    cs.CV cs.AI

    Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

    Authors: Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

    Abstract: Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data especially in online acquirement has led to a surge in human motion research works. Prior works have mainly concentrated on dual-modality learning, such as text and motion tasks, but three-modality learning has been rarely explored. Intuitively, an extra introduced modal… ▽ More

    Submitted 1 March, 2024; originally announced March 2024.

  38. arXiv:2402.18191  [pdf, other

    cs.CL

    Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation

    Authors: Yuan Ge, Yilun Liu, Chi Hu, Weibin Meng, Shimin Tao, Xiaofeng Zhao, Hongxia Ma, Li Zhang, Hao Yang, Tong Xiao

    Abstract: With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being… ▽ More

    Submitted 28 February, 2024; originally announced February 2024.

  39. arXiv:2402.10071  [pdf, other

    eess.SP cs.IT

    Approximate Message Passing-Enhanced Graph Neural Network for OTFS Data Detection

    Authors: Wenhao Zhuang, Yuyi Mao, Hengtao He, Lei Xie, Shenghui Song, Yao Ge, Zhi Ding

    Abstract: Orthogonal time frequency space (OTFS) modulation has emerged as a promising solution to support high-mobility wireless communications, for which, cost-effective data detectors are critical. Although graph neural network (GNN)-based data detectors can achieve decent detection accuracy at reasonable computational cost, they fail to best harness prior information of transmitted data. To further mini… ▽ More

    Submitted 14 April, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

    Comments: 8 pages, 7 figures, and 3 tables. Part of this article was submitted to IEEE for possible publication

  40. arXiv:2402.07140  [pdf, other

    cs.AI

    Graph Descriptive Order Improves Reasoning with Large Language Model

    Authors: Yuyao Ge, Shenghua Liu, Wenjie Feng, Lingrui Mei, Lizhe Chen, Xueqi Cheng

    Abstract: In recent years, large language models have achieved state-of-the-art performance across multiple domains. However, the progress in the field of graph reasoning with LLM remains limited. Our work delves into this gap by thoroughly investigating graph reasoning with LLMs. In this work, we reveal the impact of the order of graph description on LLMs' graph reasoning performance, which significantly a… ▽ More

    Submitted 24 February, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

  41. arXiv:2402.00455  [pdf, ps, other

    cs.IT eess.SP

    New Lower Bounds on Aperiodic Ambiguity Function of Unimodular Sequences

    Authors: Lingsheng Meng, Yong Liang Guan, Yao Ge, Zilong Liu, **zhi Fan

    Abstract: This paper presents new aperiodic ambiguity function (AF) lower bounds of unimodular sequences under certain low ambiguity zone. Our key idea, motivated by the Levenshtein correlation bound, is to introduce two weight vectors associated to the delay and Doppler shifts, respectively, and then exploit the upper and lower bounds on the Frobenius norm of the weighted auto- and cross-AF matrices to der… ▽ More

    Submitted 1 February, 2024; originally announced February 2024.

    Comments: 5 pages, 1 figure

  42. arXiv:2402.00284  [pdf, other

    cs.IR cs.AI cs.LG

    PAP-REC: Personalized Automatic Prompt for Recommendation Language Model

    Authors: Zelong Li, Jianchao Ji, Yingqiang Ge, Wenyue Hua, Yongfeng Zhang

    Abstract: Recently emerged prompt-based Recommendation Language Models (RLM) can solve multiple recommendation tasks uniformly. The RLMs make full use of the inherited knowledge learned from the abundant pre-training data to solve the downstream recommendation tasks by prompts, without introducing additional parameters or network training. However, handcrafted prompts require significant expertise and human… ▽ More

    Submitted 31 January, 2024; originally announced February 2024.

  43. arXiv:2401.17270  [pdf, other

    cs.CV

    YOLO-World: Real-Time Open-Vocabulary Object Detection

    Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

    Abstract: The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling an… ▽ More

    Submitted 22 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Work still in progress. Code & models are available at: https://github.com/AILab-CVC/YOLO-World

  44. arXiv:2401.14686  [pdf, other

    cs.CV

    SSR: SAM is a Strong Regularizer for domain adaptive semantic segmentation

    Authors: Yanqi Ge, Ye Huang, Wen Li, Lixin Duan

    Abstract: We introduced SSR, which utilizes SAM (segment-anything) as a strong regularizer during training, to greatly enhance the robustness of the image encoder for handling various domains. Specifically, given the fact that SAM is pre-trained with a large number of images over the internet, which cover a diverse variety of domains, the feature encoding extracted by the SAM is obviously less dependent on… ▽ More

    Submitted 26 January, 2024; originally announced January 2024.

  45. arXiv:2401.14405  [pdf, other

    cs.CV cs.AI cs.LG

    Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

    Authors: Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

    Abstract: We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities, which distinguishes our method from other works utilizing paired (e.g., CLIP) or interleaved data of different modalit… ▽ More

    Submitted 18 March, 2024; v1 submitted 25 January, 2024; originally announced January 2024.

    Comments: CVPR 2024. Code and models are available at https://github.com/AILab-CVC/M2PT

  46. arXiv:2401.13672  [pdf, other

    cs.DB cs.AI cs.IR

    Transforming Agriculture with Intelligent Data Management and Insights

    Authors: Yu Pan, Jianxin Sun, Hongfeng Yu, Geng Bai, Yufeng Ge, Joe Luck, Tala Awada

    Abstract: Modern agriculture faces grand challenges to meet increased demands for food, fuel, feed, and fiber with population growth under the constraints of climate change and dwindling natural resources. Data innovation is urgently required to secure and improve the productivity, sustainability, and resilience of our agroecosystems. As various sensors and Internet of Things (IoT) instrumentation become mo… ▽ More

    Submitted 7 November, 2023; originally announced January 2024.

  47. arXiv:2401.10222  [pdf, other

    cs.CV cs.AI

    Supervised Fine-tuning in turn Improves Visual Foundation Models

    Authors: Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

    Abstract: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuni… ▽ More

    Submitted 11 April, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: 23 pages, 3 figures, Project page: https://github.com/TencentARC/ViSFT/tree/main

  48. arXiv:2401.09740  [pdf, other

    cs.CR

    Hijacking Attacks against Neural Networks by Analyzing Training Data

    Authors: Yunjie Ge, Qian Wang, Huayang Huang, Qi Li, Cong Wang, Chao Shen, Lingchen Zhao, Peipei Jiang, Zheng Fang, Shenyi Zhang

    Abstract: Backdoors and adversarial examples are the two primary threats currently faced by deep neural networks (DNNs). Both attacks attempt to hijack the model behaviors with unintended outputs by introducing (small) perturbations to the inputs. Backdoor attacks, despite the high success rates, often require a strong assumption, which is not always easy to achieve in reality. Adversarial example attacks,… ▽ More

    Submitted 19 January, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: Full version with major polishing, compared to the Usenix Security 2024 edition

  49. arXiv:2401.07781  [pdf, other

    cs.CV

    Towards A Better Metric for Text-to-Video Generation

    Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou

    Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However… ▽ More

    Submitted 15 January, 2024; originally announced January 2024.

    Comments: Project page: https://showlab.github.io/T2VScore/

  50. arXiv:2401.04150  [pdf, other

    cs.CV

    Two-stream joint matching method based on contrastive learning for few-shot action recognition

    Authors: Long Deng, Ziqiang Li, Bingxin Zhou, Zhongming Chen, Ao Li, Yongxin Ge

    Abstract: Although few-shot action recognition based on metric learning paradigm has achieved significant success, it fails to address the following issues: (1) inadequate action relation modeling and underutilization of multi-modal information; (2) challenges in handling video matching problems with different lengths and speeds, and video matching problems with misalignment of video sub-actions. To address… ▽ More

    Submitted 8 January, 2024; originally announced January 2024.