Showing 1–50 of 128 results for author: Zhan, Y

Searching in archive cs.
  1. arXiv:2406.10776  [pdf, other]

    cs.MM

    High-level Codes and Fine-grained Weights for Online Multi-modal Hashing Retrieval

    Authors: Yu-Wei Zhan, Xiao-Ming Wu, Xin Luo, Yinwei Wei, Xin-Shun Xu

    Abstract: In the real world, multi-modal data often appears in a streaming fashion, and there is a growing demand for similarity retrieval from such non-stationary data, especially at a large scale. In response to this need, online multi-modal hashing has gained significant attention. However, existing online multi-modal hashing methods face challenges related to the inconsistency of hash codes during long-…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 32 pages, 4 figures

  2. arXiv:2405.18860  [pdf, other]

    cs.RO

    Empowering Embodied Manipulation: A Bimanual-Mobile Robot Manipulation Dataset for Household Tasks

    Authors: Tianle Zhang, Dongjiang Li, Yihang Li, Zecui Zeng, Lin Zhao, Lei Sun, Yue Chen, Xuelong Wei, Yibing Zhan, Lusong Li, Xiaodong He

    Abstract: The advancements in embodied AI are increasingly enabling robots to tackle complex real-world tasks, such as household manipulation. However, the deployment of robots in these environments remains constrained by the lack of comprehensive bimanual-mobile robot manipulation data that can be learned. Existing datasets predominantly focus on single-arm manipulation tasks, while the few dual-arm datase…

    Submitted 6 June, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

  3. arXiv:2405.10642  [pdf, other]

    cs.LG

    Hi-GMAE: Hierarchical Graph Masked Autoencoders

    Authors: Chuang Liu, Zelin Yao, Yibing Zhan, Xueqi Ma, Dapeng Tao, Jia Wu, Wenbin Hu, Shirui Pan, Bo Du

    Abstract: Graph Masked Autoencoders (GMAEs) have emerged as a notable self-supervised learning approach for graph-structured data. Existing GMAE models primarily focus on reconstructing node-level information, categorizing them as single-scale GMAEs. This methodology, while effective in certain contexts, tends to overlook the complex hierarchical structures inherent in many real-world graphs. For instance,…

    Submitted 17 May, 2024; originally announced May 2024.

    Comments: 10 pages, 6 figures, 3 tables

  4. arXiv:2405.08487  [pdf, other]

    cs.CV cs.CR

    Semantic Contextualization of Face Forgery: A New Definition, Dataset, and Detection Method

    Authors: Mian Zou, Baosheng Yu, Yibing Zhan, Siwei Lyu, Kede Ma

    Abstract: In recent years, deep learning has greatly streamlined the process of generating realistic fake face images. Aware of the dangers, researchers have developed various tools to spot these counterfeits. Yet none asked the fundamental question: What digital manipulations make a real photographic face image fake, while others do not? In this paper, we put face forgery in a semantic context and define t…

    Submitted 14 May, 2024; originally announced May 2024.

  5. arXiv:2405.05663  [pdf, other]

    cs.CV

    RPBG: Towards Robust Neural Point-based Graphics in the Wild

    Authors: Qingtian Zhu, Zizhuang Wei, Zhongtian Zheng, Yifan Zhan, Zhuyu Yao, Jiawang Zhang, Kejian Wu, Yinqiang Zheng

    Abstract: Point-based representations have recently gained popularity in novel view synthesis, for their unique advantages, e.g., intuitive geometric representation, simple manipulation, and faster convergence. However, based on our observation, these point-based neural re-rendering methods are only expected to perform well under ideal conditions and suffer from noisy, patchy points and unbounded scenes, wh…

    Submitted 9 May, 2024; originally announced May 2024.

  6. arXiv:2405.04940  [pdf, other]

    cs.CV

    Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

    Authors: Wentao Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao

    Abstract: Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and…

    Submitted 30 June, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

    Comments: CVPR 2024

  7. arXiv:2405.01649  [pdf, other]

    cs.CL

    Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning

    Authors: Tianle Xia, Liang Ding, Guojia Wan, Yibing Zhan, Bo Du, Dacheng Tao

    Abstract: Answering complex queries over incomplete knowledge graphs (KGs) is a challenging job. Most previous works have focused on learning entity/relation embeddings and simulating first-order logic operators with various neural networks. However, they are bottlenecked by the inability to share world knowledge to improve logical reasoning, thus resulting in suboptimal performance. In this paper, we propo…

    Submitted 8 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

  8. arXiv:2405.01558  [pdf, other]

    cs.CV cs.GR cs.LG eess.IV physics.optics

    Configurable Learned Holography

    Authors: Yicheng Zhan, Liang Shi, Wojciech Matusik, Qi Sun, Kaan Akşit

    Abstract: In the pursuit of advancing holographic display technology, we face a unique yet persistent roadblock: the inflexibility of learned holography in adapting to various hardware configurations. This is due to the variances in the complex optical components and system settings in existing holographic displays. Although the emerging learned approaches have enabled rapid and high-quality hologram genera…

    Submitted 6 May, 2024; v1 submitted 24 March, 2024; originally announced May 2024.

    Comments: 14 pages, 5 figures

  9. arXiv:2404.17100  [pdf, other]

    cs.CV

    Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

    Authors: Yuanyuan Liu, Yuxuan Huang, Shuyang Liu, Yibing Zhan, Zijing Chen, Zhe Chen

    Abstract: In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these V-FER models cannot deal with unknown classes that are prevalent in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming at identifying not only known classe…

    Submitted 25 April, 2024; originally announced April 2024.

  10. arXiv:2404.15806  [pdf, other]

    cs.LG

    Where to Mask: Structure-Guided Masking for Graph Masked Autoencoders

    Authors: Chuang Liu, Yuyao Wang, Yibing Zhan, Xueqi Ma, Dapeng Tao, Jia Wu, Wenbin Hu

    Abstract: Graph masked autoencoders (GMAE) have emerged as a significant advancement in self-supervised pre-training for graph-structured data. Previous GMAE models primarily utilize a straightforward random masking strategy for nodes or edges during training. However, this strategy fails to consider the varying significance of different nodes within the graph structure. In this paper, we investigate the po…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 9 pages, 3 Figures. Accepted by IJCAI 2024

  11. arXiv:2404.15729  [pdf, other]

    cs.LG

    Gradformer: Graph Transformer with Exponential Decay

    Authors: Chuang Liu, Zelin Yao, Yibing Zhan, Xueqi Ma, Shirui Pan, Wenbin Hu

    Abstract: Graph Transformers (GTs) have demonstrated their advantages across a wide range of tasks. However, the self-attention mechanism in GTs overlooks the graph's inductive biases, particularly biases related to structure, which are crucial for the graph tasks. Although some methods utilize positional encoding and attention bias to model inductive biases, their effectiveness is still suboptimal analytic…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: 9 pages, 7 figures. Accepted by IJCAI 2024

  12. arXiv:2404.14139  [pdf, other]

    cs.RO

    Human Orientation Estimation under Partial Observation

    Authors: Jieting Zhao, Hanjing Ye, Yu Zhan, Hong Zhang

    Abstract: Reliable human orientation estimation (HOE) is critical for autonomous agents to understand human intention and perform human-robot interaction (HRI) tasks. Great progress has been made in HOE under full observation. However, the existing methods easily make a wrong prediction under partial observation and give it an unexpectedly high probability. To solve the above problems, this study first deve…

    Submitted 22 April, 2024; originally announced April 2024.

    Comments: Submitted to IROS 2024

  13. arXiv:2404.12909  [pdf, other]

    cs.RO cs.DC

    Cloud-based Digital Twin for Cognitive Robotics

    Authors: Arthur Niedźwiecki, Sascha Jongebloed, Yanxiang Zhan, Michaela Kümpel, Jörn Syrbe, Michael Beetz

    Abstract: The paper presents a novel cloud-based digital twin learning platform for teaching and training concepts of cognitive robotics. Instead of forcing interested learners or students to install a new operating system and bulky, fragile software onto their personal laptops just to solve tutorials or coding assignments of a single lecture on robotics, it would be beneficial to avoid technical setups and…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: 5 pages, IEEE Global Engineering Education Conference (EDUCON)

  14. arXiv:2404.09285  [pdf, other]

    cs.DC

    Egret: Reinforcement Mechanism for Sequential Computation Offloading in Edge Computing

    Authors: Haosong Peng, Yufeng Zhan, DiHua Zhai, Xiaopu Zhang, Yuanqing Xia

    Abstract: As an emerging computing paradigm, edge computing offers computing resources closer to the data sources, helping to improve the service quality of many real-time applications. A crucial problem is designing a rational pricing mechanism to maximize the revenue of the edge computing service provider (ECSP). However, prior works have considerable limitations: clients are static and are required to di…

    Submitted 29 April, 2024; v1 submitted 14 April, 2024; originally announced April 2024.

    Comments: Submitted to IEEE TSC

  15. arXiv:2404.09267  [pdf, other]

    cs.DC

    Tangram: High-resolution Video Analytics on Serverless Platform with SLO-aware Batching

    Authors: Haosong Peng, Yufeng Zhan, Peng Li, Yuanqing Xia

    Abstract: Cloud-edge collaborative computing paradigm is a promising solution to high-resolution video analytics systems. The key lies in reducing redundant data and managing fluctuating inference workloads effectively. Previous work has focused on extracting regions of interest (RoIs) from videos and transmitting them to the cloud for processing. However, a naive Infrastructure as a Service (IaaS) resource…

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: Accepted by IEEE International Conference on Distributed Computing Systems (ICDCS) 2024

  16. arXiv:2404.09245  [pdf, other]

    cs.MM cs.CV

    Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics

    Authors: Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Qihua Zhou, Yuanqing Xia

    Abstract: The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architecture (e.g., CNN, RNN, etc.), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption but show inferior performance in adverse environments. Recently, visual foundation models based on transformers h…

    Submitted 14 April, 2024; originally announced April 2024.

  17. arXiv:2403.19160  [pdf, other]

    cs.CV

    Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

    Authors: Yutong Chen, Yifan Zhan, Zhihang Zhong, Wei Wang, Xiao Sun, Yu Qiao, Yinqiang Zheng

    Abstract: Neural rendering techniques have significantly advanced 3D human body modeling. However, previous approaches often overlook dynamics induced by factors such as motion inertia, leading to challenges in scenarios like abrupt stops after rotation, where the pose remains static while the appearance changes. This limitation arises from reliance on a single pose as conditional input, resulting in ambigu…

    Submitted 28 March, 2024; originally announced March 2024.

  18. arXiv:2403.09333  [pdf, other]

    cs.CV cs.AI

    Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

    Authors: Yufei Zhan, Yousong Zhu, Hongyin Zhao, Fan Yang, Ming Tang, Jinqiao Wang

    Abstract: Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpass the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, Counting, etc. To address this…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: Tech report, work in progress. Codes, models and datasets will be released at https://github.com/jefferyZhan/Griffon

  19. arXiv:2403.02742  [pdf, other]

    cs.CL

    Towards Training A Chinese Large Language Model for Anesthesiology

    Authors: Zhonghai Wang, Jie Jiang, Yibing Zhan, Bohao Zhou, Yanhong Li, Chong Zhang, Liang Ding, Hua Jin, Jun Peng, Xu Lin, Weifeng Liu

    Abstract: Medical large language models (LLMs) have gained popularity recently due to their significant practical utility. However, most existing research focuses on general medicine, and there is a need for in-depth study of LLMs in specific fields like anesthesiology. To fill the gap, we introduce Hypnos, a Chinese Anesthesia model built upon existing LLMs, e.g., Llama. Hypnos' contributions have three as…

    Submitted 5 March, 2024; originally announced March 2024.

  20. arXiv:2402.18400  [pdf, other]

    cs.MM

    Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning

    Authors: Hanyao Wang, Yibing Zhan, Liu Liu, Liang Ding, Yan Yang, Jun Yu

    Abstract: Pretrained cross-modal models, for instance, the most representative CLIP, have recently led to a boom in using pre-trained models for cross-modal zero-shot tasks, considering the generalization properties. However, we analytically discover that CLIP suffers from the text-to-image retrieval hallucination, adversely limiting its capabilities under zero-shot learning: CLIP would select the image wit…

    Submitted 26 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  21. arXiv:2402.13874  [pdf, other]

    cs.CL

    $Se^2$: Sequential Example Selection for In-Context Learning

    Authors: Haoyu Liu, Jianfeng Liu, Shaohan Huang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Furu Wei, Qi Zhang

    Abstract: The remarkable capability of large language models (LLMs) for in-context learning (ICL) needs to be activated by demonstration examples. Prior work has extensively explored the selection of examples for ICL, predominantly following the "select then organize" paradigm, such approaches often neglect the internal relationships between examples and exist an inconsistency between the training and infer…

    Submitted 6 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: Accepted by ACL 2024 Findings

  22. arXiv:2402.13589  [pdf, other]

    cs.HC

    Affective Computing for Healthcare: Recent Trends, Applications, Challenges, and Beyond

    Authors: Yuanyuan Liu, Ke Wang, Lin Wei, Jingying Chen, Yibing Zhan, Dapeng Tao, Zhe Chen

    Abstract: Affective computing, which aims to recognize, interpret, and understand human emotions, provides benefits in healthcare, such as improving patient care and enhancing doctor-patient communication. However, there is a noticeable absence of a comprehensive summary of recent advancements in affective computing for healthcare, which could pose difficulties for researchers entering this field. To addres…

    Submitted 21 February, 2024; originally announced February 2024.

  23. arXiv:2402.13408  [pdf, other]

    cs.CL

    Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation

    Authors: Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Dacheng Tao

    Abstract: The copilot framework, which aims to enhance and tailor large language models (LLMs) for specific complex tasks without requiring fine-tuning, is gaining increasing attention from the community. In this paper, we introduce the construction of a Healthcare Copilot designed for medical consultation. The proposed Healthcare Copilot comprises three main components: 1) the Dialogue component, responsib…

    Submitted 20 February, 2024; originally announced February 2024.

  24. arXiv:2402.08552  [pdf, other]

    cs.LG cs.CV

    Confronting Reward Overoptimization for Diffusion Models: A Perspective of Inductive and Primacy Biases

    Authors: Ziyi Zhang, Sen Zhang, Yibing Zhan, Yong Luo, Yonggang Wen, Dacheng Tao

    Abstract: Bridging the gap between diffusion models and human preferences is crucial for their integration into practical generative workflows. While optimizing downstream reward models has emerged as a promising alignment strategy, concerns arise regarding the risk of excessive optimization with learned reward models, which potentially compromises ground-truth performance. In this work, we confront the rew…

    Submitted 5 June, 2024; v1 submitted 13 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024

  25. arXiv:2402.03667  [pdf, other]

    cs.CL cs.AI

    Large Language Models as an Indirect Reasoner: Contrapositive and Contradiction for Automated Reasoning

    Authors: Yanfang Zhang, Yiliu Sun, Yibing Zhan, Dapeng Tao, Dacheng Tao, Chen Gong

    Abstract: Recently, increasing attention has been drawn to improving the ability of Large Language Models (LLMs) to perform complex reasoning. However, previous methods, such as Chain-of-Thought and Self-Consistency, mainly follow Direct Reasoning (DR) frameworks, so they will meet difficulty in solving numerous real-world tasks which can hardly be solved via DR. Therefore, to strengthen the reason…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 20 pages, 13 figures, 4 tables

  26. arXiv:2401.14024  [pdf]

    cs.CV

    PLCNet: Patch-wise Lane Correction Network for Automatic Lane Correction in High-definition Maps

    Authors: Haiyang Peng, Yi Zhan, Benkang Wang, Hongtao Zhang

    Abstract: In High-definition (HD) maps, lane elements constitute the majority of components and demand stringent localization requirements to ensure safe vehicle navigation. Vision lane detection with LiDAR position assignment is a prevalent method to acquire initial lanes for HD maps. However, due to incorrect vision detection and coarse camera-LiDAR calibration, initial lanes may deviate from their true p…

    Submitted 25 January, 2024; originally announced January 2024.

  27. arXiv:2401.12479  [pdf, other]

    cs.CV

    TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation

    Authors: Xin Lin, Chong Shi, Yibing Zhan, Zuopeng Yang, Yaqi Wu, Dacheng Tao

    Abstract: Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain occluded and blurred objects. 2) Label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones.…

    Submitted 22 January, 2024; originally announced January 2024.

    Comments: Accepted by AAAI 2024

  28. arXiv:2401.09712  [pdf, other]

    cs.CV

    SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

    Authors: Yang Zhan, Zhitong Xiong, Yuan Yuan

    Abstract: Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal large language models (MLLMs) for remote sensing (RS) data is still in its infancy, and the performance is not satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal large language model specifically…

    Submitted 17 January, 2024; originally announced January 2024.

  29. arXiv:2401.08858  [pdf, ps, other]

    cs.OS

    File System Aging

    Authors: Alex Conway, Ainesh Bakshi, Arghya Bhattacharya, Rory Bennett, Yizheng Jiao, Eric Knorr, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, Martin Farach-Colton

    Abstract: File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions that eventually lead to slower performance, or aging. Conventional wisdom suggests that file system aging is a solved problem in the common case; heuristics to avoid aging, such as colocating related files and dat…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: 36 pages, 12 figures. Article is an extension of Conway et al. FAST 17. (see https://www.usenix.org/conference/fast17/technical-sessions/presentation/conway) and Conway et al. HotStorage 19. (see https://www.usenix.org/conference/hotstorage19/presentation/conway)

    ACM Class: H.3.2; D.4.3; D.4.2; D.4.8; E.1; E.5; H.3.4

  30. arXiv:2312.08022  [pdf, other]

    cs.CV

    Mono3DVG: 3D Visual Grounding in Monocular Images

    Authors: Yang Zhan, Yuan Yuan, Zhitong Xiong

    Abstract: We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end tran…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: Accepted by the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2024)

  31. arXiv:2312.05479  [pdf, other]

    cs.LG cs.AI

    Exploring Sparsity in Graph Transformers

    Authors: Chuang Liu, Yibing Zhan, Xueqi Ma, Liang Ding, Dapeng Tao, Jia Wu, Wenbin Hu, Bo Du

    Abstract: Graph Transformers (GTs) have achieved impressive results on various graph-related tasks. However, the huge computational cost of GTs hinders their deployment and application, especially in resource-constrained environments. Therefore, in this paper, we explore the feasibility of sparsifying GTs, a significant yet under-explored topic. We first discuss the redundancy of GTs based on the characteri…

    Submitted 9 December, 2023; originally announced December 2023.

    Comments: 9 pages, 8 figures

  32. arXiv:2311.16479  [pdf, other]

    cs.CV

    Mitigating Hallucination in Visual Language Models with Visual Supervision

    Authors: Zhiyang Chen, Yousong Zhu, Yufei Zhan, Zhaowen Li, Chaoyang Zhao, Jinqiao Wang, Ming Tang

    Abstract: Large vision-language models (LVLMs) suffer heavily from hallucination, occasionally generating responses that apparently contradict the image content. The key problem lies in their weak ability to comprehend detailed content in a multi-modal context, which can be mainly attributed to two factors in training data and loss function. The vision instruction dataset primarily focuses on global descript…

    Submitted 27 November, 2023; originally announced November 2023.

  33. arXiv:2311.16475  [pdf, other]

    cs.CV

    Generating Human-Centric Visual Cues for Human-Object Interaction Detection via Large Vision-Language Models

    Authors: Yu-Wei Zhan, Fan Liu, Xin Luo, Liqiang Nie, Xin-Shun Xu, Mohan Kankanhalli

    Abstract: Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, the complexity of human behavior and the diverse contexts in which these interactions occur make it challenging. Intuitively, human-centric visual cues, such as the involved participants, the body language, and the surrounding environment, play crucial roles in shaping these in…

    Submitted 26 November, 2023; originally announced November 2023.

  34. arXiv:2311.15200  [pdf, other]

    cs.CV cs.LG

    SpliceMix: A Cross-scale and Semantic Blending Augmentation Strategy for Multi-label Image Classification

    Authors: Lei Wang, Yibing Zhan, Leilei Ma, Dapeng Tao, Liang Ding, Chen Gong

    Abstract: Recently, Mix-style data augmentation methods (e.g., Mixup and CutMix) have shown promising performance in various visual tasks. However, these methods are primarily designed for single-label images, ignoring the considerable discrepancies between single- and multi-label images, i.e., a multi-label image involves multiple co-occurred categories and fickle object scales. On the other hand, previous…

    Submitted 26 November, 2023; originally announced November 2023.

    Comments: 13 pages, 10 figures

  35. arXiv:2311.14552  [pdf, other]

    cs.CV cs.AI

    Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

    Authors: Yufei Zhan, Yousong Zhu, Zhiyang Chen, Fan Yang, Ming Tang, Jinqiao Wang

    Abstract: Replicating the innate human ability to detect all objects based on free-form texts at any granularity remains a formidable challenge for Vision-Language models. Current Large Vision Language Models (LVLMs) are predominantly constrained to grounding a single, pre-existing object, relying solely on data from Referring Expression Comprehension tasks. The limitation leads to a compromise in model des…

    Submitted 27 November, 2023; v1 submitted 24 November, 2023; originally announced November 2023.

    Comments: Technical report. The codes and dataset will be released soon at https://github.com/jefferyZhan/Griffon

  36. arXiv:2311.12644  [pdf, other]

    cs.LG

    Careful Selection and Thoughtful Discarding: Graph Explicit Pooling Utilizing Discarded Nodes

    Authors: Chuang Liu, Wenhang Yu, Kuang Gao, Xueqi Ma, Yibing Zhan, Jia Wu, Bo Du, Wenbin Hu

    Abstract: Graph pooling has been increasingly recognized as crucial for Graph Neural Networks (GNNs) to facilitate hierarchical graph representation learning. Existing graph pooling methods commonly consist of two stages: selecting top-ranked nodes and discarding the remaining to construct coarsened graph representations. However, this paper highlights two key issues with these methods: 1) The process of se…

    Submitted 21 November, 2023; originally announced November 2023.

    Comments: 14 pages, 7 figures, 4 tables. Submitting to Science China Information Sciences

  37. arXiv:2310.08948  [pdf, other]

    cs.CV cs.AI

    Federated Class-Incremental Learning with Prompting

    Authors: Jiale Liu, Yu-Wei Zhan, Chong-Yu Zhang, Xin Luo, Zhen-Duo Chen, Yinwei Wei, Xin-Shun Xu

    Abstract: As Web technology continues to develop, it has become increasingly common to use data stored on different clients. At the same time, federated learning has received widespread attention due to its ability to protect data privacy when letting models learn from data that is distributed across various clients. However, most existing works assume that the client's data are fixed. In real-world scenarios,…

    Submitted 13 October, 2023; originally announced October 2023.

  38. arXiv:2310.04742  [pdf, other]

    cs.LG

    Parameter Efficient Multi-task Model Fusion with Partial Linearization

    Authors: Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, Dacheng Tao

    Abstract: Large pre-trained models have enabled significant advances in machine learning and served as foundation components. Model fusion methods, such as task arithmetic, have been proven to be powerful and scalable to incorporate fine-tuned weights from different tasks into a multi-task model. However, efficiently fine-tuning large pre-trained models on multiple downstream tasks remains challenging, lead…

    Submitted 11 March, 2024; v1 submitted 7 October, 2023; originally announced October 2023.

  39. arXiv:2309.16599  [pdf, other]

    cs.CL

    Unlikelihood Tuning on Negative Samples Amazingly Improves Zero-Shot Translation

    Authors: Changtong Zan, Liang Ding, Li Shen, Yibin Lei, Yibing Zhan, Weifeng Liu, Dacheng Tao

    Abstract: Zero-shot translation (ZST), which is generally based on a multilingual neural machine translation model, aims to translate between unseen language pairs in training data. The common practice to guide the zero-shot language mapping during inference is to deliberately insert the source and target language IDs, e.g., <EN> for English and <DE> for German. Recent studies have shown that language IDs s…

    Submitted 28 September, 2023; originally announced September 2023.

  40. arXiv:2309.13335  [pdf, other]

    cs.IR

    Model-enhanced Vector Index

    Authors: Hailin Zhang, Yujing Wang, Qi Chen, Ruiheng Chang, Ting Zhang, Ziming Miao, Yingyan Hou, Yang Ding, Xupeng Miao, Haonan Wang, Bochen Pang, Yuefeng Zhan, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Xing Xie, Mao Yang, Bin Cui

    Abstract: Embedding-based retrieval methods construct vector indices to search for document representations that are most similar to the query representations. They are widely used in document retrieval due to low latency and decent recall performance. Recent research indicates that deep retrieval solutions offer better model quality, but are hindered by unacceptable serving latency and the inability to sup…

    Submitted 9 November, 2023; v1 submitted 23 September, 2023; originally announced September 2023.

  41. arXiv:2309.11727  [pdf, other]

    cs.RO

    Person Re-Identification for Robot Person Following with Online Continual Learning

    Authors: Hanjing Ye, Jieting Zhao, Yu Zhan, Weinan Chen, Li He, Hong Zhang

    Abstract: Robot person following (RPF) is a crucial capability in human-robot interaction (HRI) applications, allowing a robot to persistently follow a designated person. In practical RPF scenarios, the person is often occluded by other objects or people. Consequently, it is necessary to re-identify the person when he/she re-appears within the robot's field of view. Previous person re-identification (ReID)…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: Under review

  42. arXiv:2309.03599  [pdf, other]

    cs.CV

    Chasing Consistency in Text-to-3D Generation from a Single Image

    Authors: Yichen Ouyang, Wenhao Chai, Jiayi Ye, Dapeng Tao, Yibing Zhan, Gaoang Wang

    Abstract: Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from the inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we p…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: 9 pages, 11 figures

  43. arXiv:2308.12526  [pdf, other]

    eess.AS cs.LG cs.SD

    UNISOUND System for VoxCeleb Speaker Recognition Challenge 2023

    Authors: Yu Zheng, Yajun Zhang, Chuanying Niu, Yibin Zhan, Yanhua Long, Dongxing Xu

    Abstract: This report describes the UNISOUND submission for Track 1 and Track 2 of VoxCeleb Speaker Recognition Challenge 2023 (VoxSRC 2023). We submit the same system on Track 1 and Track 2, which is trained with only VoxCeleb2-dev. Large-scale ResNet and RepVGG architectures are developed for the challenge. We propose a consistency-aware score calibration method, which leverages the stability of audio voice…

    Submitted 23 August, 2023; originally announced August 2023.

  44. arXiv:2308.12509  [pdf, other]

    cs.CV

    Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

    Authors: Yuan Yuan, Yang Zhan, Zhitong Xiong

    Abstract: Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a significant environmental impact. Moreover, as remote sensing (RS) d…

    Submitted 23 August, 2023; originally announced August 2023.

  45. arXiv:2308.10298  [pdf, ps, other]

    cs.DC

    Arena: A Learning-based Synchronization Scheme for Hierarchical Federated Learning--Technical Report

    Authors: Tianyu Qi, Yufeng Zhan, Peng Li, Jingcai Guo, Yuanqing Xia

    Abstract: Federated learning (FL) enables collaborative model training among distributed devices without data sharing, but existing FL suffers from poor scalability because of global model synchronization. To address this issue, hierarchical federated learning (HFL) has been recently proposed to let edge servers aggregate models of devices in proximity, while synchronizing via the cloud periodically. Howeve…

    Submitted 20 August, 2023; originally announced August 2023.

  46. arXiv:2308.04800  [pdf, other]

    cs.CL

    ADMUS: A Progressive Question Answering Framework Adaptable to Multiple Knowledge Sources

    Authors: Yirui Zhan, Yanzeng Li, Minhao Zhang, Lei Zou

    Abstract: With the introduction of deep learning models, semantic parsing-based knowledge base question answering (KBQA) systems have achieved high performance in handling complex questions. However, most existing approaches primarily focus on enhancing the model's effectiveness on individual benchmark datasets, disregarding the high costs of adapting the system to disparate datasets in real-world scenarios…

    Submitted 9 August, 2023; originally announced August 2023.

  47. arXiv:2308.04493  [pdf, other]

    quant-ph cs.LG q-fin.CP

    Efficient option pricing with unary-based photonic computing chip and generative adversarial learning

    Authors: Hui Zhang, Lingxiao Wan, Sergi Ramos-Calderer, Yuancheng Zhan, Wai-Keong Mok, Hong Cai, Feng Gao, Xianshu Luo, Guo-Qiang Lo, Leong Chuan Kwek, José Ignacio Latorre, Ai Qun Liu

    Abstract: In the modern financial industry system, the structure of products has become more and more complex, and the bottleneck constraint of classical computing power has already restricted the development of the financial industry. Here, we present a photonic chip that implements the unary approach to European option pricing, in combination with the quantum amplitude estimation algorithm, to achieve a q…

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: 11 pages, 7 figures

    Journal ref: Photonics Research 10.1364/PRJ.493865 (2023)

  48. arXiv:2307.06527  [pdf, other]

    cs.CV

    Free-Form Composition Networks for Egocentric Action Recognition

    Authors: Haoran Wang, Qinghua Cheng, Baosheng Yu, Yibing Zhan, Dapeng Tao, Liang Ding, Haibin Ling

    Abstract: Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address the data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations, and t…

    Submitted 14 October, 2023; v1 submitted 12 July, 2023; originally announced July 2023.

  49. arXiv:2306.12726  [pdf, ps, other]

    cs.LG

    On Exploring Node-feature and Graph-structure Diversities for Node Drop Graph Pooling

    Authors: Chuang Liu, Yibing Zhan, Baosheng Yu, Liu Liu, Bo Du, Wenbin Hu, Tongliang Liu

    Abstract: A pooling operation is essential for effective graph-level representation learning, where node drop pooling has become one mainstream graph pooling technology. However, current node drop pooling methods usually keep the top-k nodes according to their significance scores, which ignores the graph diversity in terms of the node features and the graph structures, thus resulting in suboptimal graph-…

    Submitted 22 June, 2023; originally announced June 2023.

    Comments: 14 pages, 14 figures

  50. arXiv:2306.00434  [pdf, other]

    cs.CL

    Divide, Conquer, and Combine: Mixture of Semantic-Independent Experts for Zero-Shot Dialogue State Tracking

    Authors: Qingyue Wang, Liang Ding, Yanan Cao, Yibing Zhan, Zheng Lin, Shi Wang, Dacheng Tao, Li Guo

    Abstract: Zero-shot transfer learning for Dialogue State Tracking (DST) helps to handle a variety of task-oriented dialogue domains without the cost of collecting in-domain data. Existing works mainly study common data- or model-level augmentation methods to enhance the generalization but fail to effectively decouple the semantics of samples, limiting the zero-shot performance of DST. In this paper, we pres…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to ACL 2023