Showing 1–50 of 331 results for author: Tu, Z

  1. arXiv:2407.00115  [pdf, other]

    cs.LG cs.AI

    Instance Temperature Knowledge Distillation

    Authors: Zhengbo Zhang, Yuxi Zhou, Jia Gong, Jun Liu, Zhigang Tu

    Abstract: Knowledge distillation (KD) enhances the performance of a student network by allowing it to learn the knowledge transferred from a teacher network incrementally. Existing methods dynamically adjust the temperature to enable the student network to adapt to the varying learning difficulties at different learning stages of KD. KD is a continuous process, but when adjusting the temperature, these meth…

    Submitted 27 June, 2024; originally announced July 2024.

  2. arXiv:2406.18011  [pdf, other]

    cs.CV

    Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation

    Authors: Yijie Yang, Jinlu Zhang, Jiaxu Zhang, Zhigang Tu

    Abstract: In the realm of skeleton-based action recognition, the traditional methods which rely on coarse body keypoints fall short of capturing subtle human actions. In this work, we propose Expressive Keypoints that incorporates hand and foot details to form a fine-grained skeletal representation, improving the discriminative ability for existing models in discerning intricate actions. To efficiently mode…

    Submitted 25 June, 2024; originally announced June 2024.

  3. arXiv:2406.16382  [pdf, other]

    cs.CL

    UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models

    Authors: Zhanyue Qin, Haochuan Wang, Deyuan Liu, Ziyang Song, Cunhang Fan, Zhao Lv, Jinlin Wu, Zhen Lei, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui

    Abstract: Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subsequent decisions. With large language models (LLMs) demonstrating powerful capabilities between tasks, we can't help but ask: Can Current LLMs Effectively Make Sequential Decisions? In order to answer this question, we propose the UNO Arena based on the card game…

    Submitted 24 June, 2024; originally announced June 2024.

  4. arXiv:2406.16330  [pdf, other]

    cs.CL cs.AI

    Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging

    Authors: Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Xi Chen, Cunhang Fan, Zhao Lv, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui

    Abstract: While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach…

    Submitted 24 June, 2024; originally announced June 2024.

  5. arXiv:2406.13381  [pdf, other]

    cs.CL

    CoAct: A Global-Local Hierarchy for Autonomous Agent Collaboration

    Authors: Xinming Hou, Mingming Yang, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Wayne Xin Zhao

    Abstract: Existing LLMs exhibit remarkable performance on various NLP tasks, but still struggle with complex real-world tasks, even equipped with advanced strategies like CoT and ReAct. In this work, we propose the CoAct framework, which transfers the hierarchical planning and collaboration patterns in human society to LLM systems. Specifically, our CoAct framework involves two agents: (1) A global planning…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: 9 pages, 4 figures

  6. arXiv:2406.08418  [pdf, other]

    cs.CV cs.AI

    OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

    Authors: Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Zhenxiang Li, Pei Chu, Yi Wang, et al. (15 additional authors not shown)

    Abstract: Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale an…

    Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2406.05871  [pdf, other]

    cs.CV cs.LG

    OmniControlNet: Dual-stage Integration for Conditional Image Generation

    Authors: Yilin Wang, Haiyang Xu, Xiang Zhang, Zeyuan Chen, Zhizhou Sha, Zirui Wang, Zhuowen Tu

    Abstract: We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method and incorporating its individually trained image generation processes into a single model. Despite its tremendous success, the ControlNet of a two-stage pipeline bears limitations in being not self-contained (e.g. calls the external condit…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR 2024 Workshop: Generative Models for Computer Vision

  8. arXiv:2406.01127  [pdf, other]

    cs.CV

    Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

    Authors: Kunpeng Wang, Zhengzheng Tu, Chenglong Li, Cheng Zhang, Bin Luo

    Abstract: Multi-modal salient object detection (MSOD) aims to boost saliency detection performance by integrating visible sources with depth or thermal infrared ones. Existing methods generally design different fusion schemes to handle certain issues or challenges. Although these fusion schemes are effective at addressing specific issues or challenges, they may struggle to handle multiple complex challenges…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted by TCSVT 2024

  9. arXiv:2406.00917  [pdf, other]

    cs.CV

    Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark

    Authors: Kunpeng Wang, Danying Lin, Chenglong Li, Zhengzheng Tu, Bin Luo

    Abstract: RGB and Thermal (RGBT) Salient Object Detection (SOD) aims to achieve high-quality saliency prediction by exploiting the complementary information of visible and thermal image pairs, which are initially captured in an unaligned manner. However, existing methods are tailored for manually aligned image pairs, which are labor-intensive, and directly applying these methods to original unaligned image…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: Accepted by TMM 2024

  10. arXiv:2406.00671  [pdf, other]

    cs.RO

    An Efficient Trajectory Generation for Bi-copter Flight in Tight Space

    Authors: Xin Dong, Yangjie Cui, Jingwu Xiang, Daochun Li, Zhan Tu

    Abstract: Unlike squared (or alike) quadrotors, elongated bi-copters leverage natural superiority in crossing tight spaces. To date, extensive works have focused on the design, modeling, and control of bi-copters. Besides, a proper motion planner utilizing bi-copters' shape characteristics is essential to efficiently and safely traverse tight spaces, yet it has rarely been studied. Current motion planning m…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: 8 pages, 8 figures

  11. arXiv:2405.17580  [pdf, other]

    cs.LG cs.AI stat.ML

    Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes

    Authors: Zhenfeng Tu, Santiago Aranguri, Arthur Jacot

    Abstract: The training dynamics of linear networks are well studied in two distinct setups: the lazy regime and balanced/active regime, depending on the initialization and width of the network. We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both lazy and balanced regimes but also a mixed regime in between the two. In the mixed regime,…

    Submitted 27 May, 2024; originally announced May 2024.

  12. arXiv:2405.15384  [pdf, other]

    cs.LG

    Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate

    Authors: Fan-Ming Luo, Zuolin Tu, Zefang Huang, Yang Yu

    Abstract: Real-world decision-making tasks are usually partially observable Markov decision processes (POMDPs), where the state is not fully observable. Recent progress has demonstrated that recurrent reinforcement learning (RL), which consists of a context encoder based on recurrent neural networks (RNNs) for unobservable state prediction and a multilayer perceptron (MLP) policy for decision making, can mi…

    Submitted 24 May, 2024; originally announced May 2024.

  13. arXiv:2405.02730  [pdf, other]

    cs.CV

    U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

    Authors: Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang

    Abstract: Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their following improvements is worth rethinking. To this end, we conduct a simple toy…

    Submitted 3 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

    Comments: 12 pages, 5 figures

  14. arXiv:2405.02673  [pdf, other]

    cs.CL

    On the Information Redundancy in Non-Autoregressive Translation

    Authors: Zhihao Wang, Longyue Wang, Jinsong Su, Junfeng Yao, Zhaopeng Tu

    Abstract: Token repetition is a typical form of multi-modal problem in fully non-autoregressive translation (NAT). In this work, we revisit the multi-modal problem in recently proposed NAT models. Our study reveals that these advanced models have introduced other types of information redundancy errors, which cannot be measured by the conventional metric - the continuous repetition ratio. By manually annotat…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: 10 pages, 10 tables

  15. arXiv:2404.18065  [pdf, other]

    cs.CV cs.AI

    Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

    Authors: Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

    Abstract: In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied na…

    Submitted 28 April, 2024; originally announced April 2024.

    Comments: 9 pages, 10 figures

  16. arXiv:2404.16205  [pdf, other]

    cs.CV cs.MM

    AIS 2024 Challenge on Video Quality Assessment of User-Generated Content: Methods and Results

    Authors: Marcos V. Conde, Saman Zadtootaghaj, Nabajeet Barman, Radu Timofte, Chenlong He, Qi Zheng, Ruoxi Zhu, Zhengzhong Tu, Haiqiang Wang, Xiangguang Chen, Wenhui Meng, Xiang Pan, Huiying Shi, Han Zhu, Xiaozhong Xu, Lei Sun, Zhenzhong Chen, Shan Liu, Zicheng Zhang, Haoning Wu, Yingjie Zhou, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, et al. (11 additional authors not shown)

    Abstract: This paper reviews the AIS 2024 Video Quality Assessment (VQA) Challenge, focused on User-Generated Content (UGC). The aim of this challenge is to gather deep learning-based methods capable of estimating the perceptual quality of UGC videos. The user-generated videos from the YouTube UGC Dataset include diverse content (sports, games, lyrics, anime, etc.), quality and resolutions. The proposed met…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: CVPR 2024 Workshop -- AI for Streaming (AIS) Video Quality Assessment Challenge

  17. arXiv:2404.14837  [pdf, other]

    eess.IV cs.CV

    Ultrasound SAM Adapter: Adapting SAM for Breast Lesion Segmentation in Ultrasound Images

    Authors: Zhengzheng Tu, Le Gu, Xixi Wang, Bo Jiang

    Abstract: Segment Anything Model (SAM) has recently achieved amazing results in the field of natural image segmentation. However, it is not effective for medical image segmentation, owing to the large domain gap between natural and medical images. In this paper, we mainly focus on ultrasound image segmentation. As we know that it is very difficult to train a foundation model for ultrasound image data due to…

    Submitted 23 April, 2024; originally announced April 2024.

  18. arXiv:2404.06075  [pdf, other]

    cs.CV

    LIPT: Latency-aware Image Processing Transformer

    Authors: Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin

    Abstract: Transformer is leading a trend in the field of image processing. Despite the great success that existing lightweight image processing transformers have achieved, they are tailored to FLOPs or parameters reduction, rather than practical inference acceleration. In this paper, we present a latency-aware image processing transformer, termed LIPT. We devise the low-latency proportion LIPT block that su…

    Submitted 28 April, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  19. arXiv:2404.04804  [pdf, other]

    cs.CV

    Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving

    Authors: Jinlong Li, Baolu Li, Zhengzhong Tu, Xinyu Liu, Qing Guo, Felix Juefei-Xu, Runsheng Xu, Hongkai Yu

    Abstract: Vision-centric perception systems for autonomous driving have gained considerable attention recently due to their cost-effectiveness and scalability, especially compared to LiDAR-based systems. However, these systems often struggle in low-light conditions, potentially compromising their performance and safety. To address this, our paper introduces LightDiff, a domain-tailored framework designed to…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: This paper is accepted by CVPR 2024

  20. arXiv:2404.02883  [pdf, other]

    cs.CV cs.AI cs.LG

    On the Scalability of Diffusion-based Text-to-Image Generation

    Authors: Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto

    Abstract: Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for the diffusion based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work,…

    Submitted 3 April, 2024; originally announced April 2024.

    Comments: CVPR2024

  21. arXiv:2404.01789  [pdf]

    cs.SE

    A Feature Dataset of Microservices-based Systems

    Authors: Weipan Yang, Yongchao Xing, Yiming Lyu, Zhihao Liang, Zhiying Tu

    Abstract: Microservice architecture has become a dominant architectural style in the service-oriented software industry. Poor practices in the design and development of microservices are called microservice bad smells. In microservice bad smells research, the detection of these bad smells relies on feature data from microservices. However, there is a lack of an appropriate open-source microservice feature d…

    Submitted 2 April, 2024; originally announced April 2024.

  22. arXiv:2404.01367  [pdf, other]

    cs.CV cs.LG

    Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

    Authors: Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, Peyman Milanfar

    Abstract: We study the scaling properties of latent diffusion models (LDMs) with an emphasis on their sampling efficiency. While improved network architecture and inference algorithms have shown to effectively boost sampling efficiency of diffusion models, the role of model size -- a critical determinant of sampling efficiency -- has not been thoroughly examined. Through empirical analysis of established te…

    Submitted 1 April, 2024; originally announced April 2024.

  23. arXiv:2404.00633  [pdf, other]

    cs.CV

    IPT-V2: Efficient Image Processing Transformer using Hierarchical Attentions

    Authors: Zhijun Tu, Kunpeng Du, Hanting Chen, Hailing Wang, Wei Li, Jie Hu, Yunhe Wang

    Abstract: Recent advances have demonstrated the powerful capability of transformer architecture in image restoration. However, our analysis indicates that existing transformer-based methods cannot establish both exact global and local dependencies simultaneously, which are much critical to restore the details and missing content of degraded images. To this end, we present an efficient image processing trans…

    Submitted 31 March, 2024; originally announced April 2024.

  24. arXiv:2403.19390  [pdf, other]

    cs.CL

    Checkpoint Merging via Bayesian Optimization in LLM Pretraining

    Authors: Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui

    Abstract: The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini underscores the intense demand for resources during their training processes, posing significant challenges due to substantial computational and environmental costs. To alleviate this issue, we propose checkpoint merging in pretraining LLM. This method utilizes LLM checkpoints with shared training trajectories, and is…

    Submitted 28 March, 2024; originally announced March 2024.

  25. arXiv:2403.12011  [pdf, other]

    cs.CV

    HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

    Authors: Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang

    Abstract: 3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper, we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more contr…

    Submitted 18 March, 2024; originally announced March 2024.

    Comments: Project page: https://mq-zhang1.github.io/HOIDiffusion

  26. arXiv:2403.11807  [pdf, other]

    cs.AI cs.CL

    How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

    Authors: Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu

    Abstract: Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the participation of more than two agents simultaneously. Subsequently, we introduce o…

    Submitted 25 April, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: 16 pages of main text. 11 pages of appendices. 15 figures, 9 tables. Updated scoring scheme

  27. arXiv:2403.11699  [pdf, other]

    eess.IV cs.CV

    A Spatial-Temporal Progressive Fusion Network for Breast Lesion Segmentation in Ultrasound Videos

    Authors: Zhengzheng Tu, Zigang Zhu, Yayang Duan, Bo Jiang, Qishun Wang, Chaoxue Zhang

    Abstract: Ultrasound video-based breast lesion segmentation provides a valuable assistance in early breast lesion detection and treatment. However, existing works mainly focus on lesion segmentation based on ultrasound breast images which usually can not be adapted well to obtain desirable results on ultrasound videos. The main challenge for ultrasound video-based breast lesion segmentation is how to exploi…

    Submitted 18 March, 2024; originally announced March 2024.

  28. arXiv:2403.11469  [pdf, other]

    cs.CV cs.GR

    Generative Motion Stylization within Canonical Motion Space

    Authors: Jiaxu Zhang, Xin Chen, Gang Yu, Zhigang Tu

    Abstract: Stylized motion breathes life into characters. However, the fixed skeleton structure and style representation hinder existing data-driven motion synthesis methods from generating stylized motion for various characters. In this work, we propose a generative motion stylization pipeline, named MotionS, for synthesizing diverse and stylized motion on cross-structure characters using cross-modality sty…

    Submitted 18 March, 2024; originally announced March 2024.

  29. arXiv:2403.11371  [pdf, other]

    cs.CV

    V2X-DGW: Domain Generalization for Multi-agent Perception under Adverse Weather Conditions

    Authors: Baolu Li, Jinlong Li, Xinyu Liu, Runsheng Xu, Zhengzhong Tu, Jiacheng Guo, Xiaopeng Li, Hongkai Yu

    Abstract: Current LiDAR-based Vehicle-to-Everything (V2X) multi-agent perception systems have shown the significant success on 3D object detection. While these models perform well in the trained clean weather, they struggle in unseen adverse weather conditions with the real-world domain gap. In this paper, we propose a domain generalization approach, named V2X-DGW, for LiDAR-based 3D object detection on mul…

    Submitted 29 March, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

  30. arXiv:2403.10254  [pdf, other]

    cs.CV cs.IR cs.MM

    Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

    Authors: Pingping Zhang, Yuhao Wang, Yang Liu, Zhengzheng Tu, Huchuan Lu

    Abstract: Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potentials for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: This work is accepted by CVPR2024. More modifications may be performed

  31. arXiv:2403.06973  [pdf, other]

    cs.CV cs.LG

    Bayesian Diffusion Models for 3D Shape Reconstruction

    Authors: Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, Zhuowen Tu

    Abstract: We present Bayesian Diffusion Models (BDM), a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We show the effectiveness of BDM on the 3D shape reconstruction task. Compared to prototypical deep learning data-driven approaches trained on paired (supervised)…

    Submitted 21 April, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024; Project Page: https://mlpc-ucsd.github.io/BDM/

  32. arXiv:2403.04580  [pdf]

    cs.LG

    Beyond Major Product Prediction: Reproducing Reaction Mechanisms with Machine Learning Models Trained on a Large-Scale Mechanistic Dataset

    Authors: Joonyoung F. Joung, Mun Hong Fong, Jihye Roh, Zhengkai Tu, John Bradshaw, Connor W. Coley

    Abstract: Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. In this study, we construct such a…

    Submitted 7 March, 2024; originally announced March 2024.

    Comments: 105 pages, 9 figures

  33. arXiv:2403.03346  [pdf, other]

    cs.CV

    Enhancing Vision-Language Pre-training with Rich Supervisions

    Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

    Abstract: We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in using image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localiza…

    Submitted 5 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  34. arXiv:2403.02249  [pdf, other]

    cs.CV cs.AI

    Non-autoregressive Sequence-to-Sequence Vision-Language Models

    Authors: Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

    Abstract: Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distributi…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  35. arXiv:2402.19282  [pdf, other]

    cs.CL

    WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

    Authors: Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Zhenxiang Li, Pei Chu, Yuan Qu, Jin Shi, Lindong Lu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Zhikai Lei, Jiawei Hong, Keyu Chen, Zhaoye Fei, Ruiliang Xu, Wei Li, Zhongying Tu, Lin Dahua, Yu Qiao, Hang Yan, et al. (1 additional author not shown)

    Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy…

    Submitted 17 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

  36. arXiv:2402.16915  [pdf, other]

    cs.LG cs.AI

    More Than Routing: Joint GPS and Route Modeling for Refine Trajectory Representation Learning

    Authors: Zhipeng Ma, Zheyan Tu, Xinhai Chen, Yan Zhang, Deguo Xia, Guyue Zhou, Yilun Chen, Yu Zheng, Jiangtao Gong

    Abstract: Trajectory representation learning plays a pivotal role in supporting various downstream tasks. Traditional methods in order to filter the noise in GPS trajectories tend to focus on routing-based methods used to simplify the trajectories. However, this approach ignores the motion details contained in the GPS data, limiting the representation capability of trajectory representation learning. To fil…

    Submitted 25 February, 2024; originally announced February 2024.

  37. VN Network: Embedding Newly Emerging Entities with Virtual Neighbors

    Authors: Yongquan He, Zihan Wang, Peng Zhang, Zhaopeng Tu, Zhaochun Ren

    Abstract: Embedding entities and relations into continuous vector spaces has attracted a surge of interest in recent years. Most embedding methods assume that all test entities are available during training, which makes it time-consuming to retrain embeddings for newly emerging entities. To address this issue, recent works apply the graph neural network on the existing neighbors of the unseen entities. In t…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: 10 pages, 5 figures

    ACM Class: I.2.4; I.2.6

    Journal ref: CIKM (2020) 505-514

  38. arXiv:2402.14007  [pdf, other]

    cs.CL cs.AI

    Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models

    Authors: Zhiwei He, Binglin Zhou, Hongkun Hao, Aiwei Liu, Xing Wang, Zhaopeng Tu, Zhuosheng Zhang, Rui Wang

    Abstract: Text watermarking technology aims to tag and identify content produced by large language models (LLMs) to prevent misuse. In this study, we introduce the concept of cross-lingual consistency in text watermarking, which assesses the ability of text watermarks to maintain their effectiveness after being translated into other languages. Preliminary empirical results from two LLMs and three watermarki…

    Submitted 4 June, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (main conference)

  39. arXiv:2402.07726  [pdf, other]

    cs.CL

    Unsupervised Sign Language Translation and Generation

    Authors: Zhengsheng Guo, Zhiwei He, Wenxiang Jiao, Xing Wang, Rui Wang, Kehai Chen, Zhaopeng Tu, Yong Xu, Min Zhang

    Abstract: Motivated by the success of unsupervised neural machine translation (UNMT), we introduce an unsupervised sign language translation and generation network (USLNet), which learns from abundant single-modality (text and video) data without parallel sign language data. USLNet comprises two main components: single-modality reconstruction modules (text and video) that rebuild the input from its noisy ve…

    Submitted 12 February, 2024; originally announced February 2024.

  40. arXiv:2402.05964  [pdf, other]

    cs.LG cs.CL cs.CV

    A Survey on Transformer Compression

    Authors: Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

    Abstract: Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLM) and large vision models (LVM). Model compression methods reduce the memory and computational cost of Transformer, which is a necessary step to implement large language/vision models on practical devices. Given the unique architecture of…

    Submitted 7 April, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Model Compression, Transformer, Large Language Model, Large Vision Model, LLM

  41. arXiv:2402.03408  [pdf, other]

    cs.SE cs.DC

    A Survey on Effective Invocation Methods of Massive LLM Services

    Authors: Can Wang, Bolin Zhang, Dianbo Sui, Zhiying Tu, Xiaoyu Liu, Jiabao Kang

    Abstract: Language models as a service (LMaaS) enable users to accomplish tasks without requiring specialized knowledge, simply by paying a service provider. However, numerous providers offer massive large language model (LLM) services with variations in latency, performance, and pricing. Consequently, constructing the cost-saving LLM services invocation strategy with low-latency and high-performance respon…

    Submitted 29 February, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

  42. arXiv:2402.02084  [pdf, other]

    cs.CL

    Revisiting the Markov Property for Machine Translation

    Authors: Cunxiao Du, Hao Zhou, Zhaopeng Tu, Jing Jiang

    Abstract: In this paper, we re-examine the Markov property in the context of neural machine translation. We design a Markov Autoregressive Transformer (MAT) and undertake a comprehensive assessment of its performance across four WMT benchmarks. Our findings indicate that MAT with an order larger than 4 can generate translations with quality on par with that of conventional autoregressive transformers. In ad…

    Submitted 3 February, 2024; originally announced February 2024.

    Comments: EACL (Findings)

  43. arXiv:2402.02082  [pdf, other]

    cs.CL

    GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding

    Authors: Cunxiao Du, Jing Jiang, Xu Yuanchen, Jiawei Wu, Sicheng Yu, Yongqi Li, Shenggui Li, Kai Xu, Liqiang Nie, Zhaopeng Tu, Yang You

    Abstract: Speculative decoding is a relatively new decoding framework that leverages small and efficient draft models to reduce the latency of LLMs. In this study, we introduce GliDe and CaPE, two low-hassle modifications to vanilla speculative decoding to further improve the decoding speed of a frozen LLM. Specifically, GliDe is a modified draft model architecture that reuses the cached keys and values fro…

    Submitted 3 February, 2024; originally announced February 2024.

  44. arXiv:2401.12873  [pdf, other]

    cs.CL cs.AI

    Improving Machine Translation with Human Feedback: An Exploration of Quality Estimation as a Reward Model

    Authors: Zhiwei He, Xing Wang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang, Shuming Shi, Zhaopeng Tu

    Abstract: Insufficient modeling of human preferences within the reward model is a major obstacle for leveraging human feedback to improve translation quality. Fortunately, quality estimation (QE), which predicts the quality of a given translation without reference, has achieved impressive alignment with human evaluations in the last two years. In this work, we investigate the potential of employing the QE m…

    Submitted 18 March, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: NAACL 2024

  45. arXiv:2401.12794  [pdf, other]

    cs.CL

    Benchmarking LLMs via Uncertainty Quantification

    Authors: Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek F. Wong, Emine Yilmaz, Shuming Shi, Zhaopeng Tu

    Abstract: The proliferation of open-source Large Language Models (LLMs) from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking…

    Submitted 25 April, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

    Comments: 25 pages, preprints

  46. arXiv:2401.08350  [pdf, other]

    cs.CL

    Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models

    Authors: Jianhui Pang, Fanghua Ye, Longyue Wang, Dian Yu, Derek F. Wong, Shuming Shi, Zhaopeng Tu

    Abstract: The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017), which have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, t…

    Submitted 17 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

    Comments: 17 pages. Longyue Wang is the Corresponding Author

  47. arXiv:2401.06341  [pdf, other]

    cs.CV cs.RO

    AffordanceLLM: Grounding Affordance from Vision Language Models

    Authors: Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li

    Abstract: Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires the comprehensive understanding of a scene in multiple aspects including detection, localization, and recognition of objects with their parts, of geo-spatial configuration/layout of the scene, of 3D shapes and physics, as…

    Submitted 17 April, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

  48. arXiv:2401.00761  [pdf, other]

    cs.SE cs.AI cs.CL

    The Earth is Flat? Unveiling Factual Errors in Large Language Models

    Authors: Wenxuan Wang, Juluan Shi, Zhaopeng Tu, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

    Abstract: Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns that they may mislead users in critical areas like healthcare, journalism, and education. Current methods for evaluating LLMs' veracity are limited by test data leakage…

    Submitted 1 January, 2024; originally announced January 2024.

  49. arXiv:2312.17161  [pdf, other]

    cs.CV

    Restoration by Generation with Constrained Priors

    Authors: Zheng Ding, Xuaner Zhang, Zhuowen Tu, Zhihao Xia

    Abstract: The inherent generative power of denoising diffusion models makes them well-suited for image restoration tasks where the objective is to find the optimal high-quality image within the generative space that closely resembles the input image. We propose a method to adapt a pretrained diffusion model for image restoration by simply adding noise to the input image to be restored and then denoising it. Our…

    Submitted 1 June, 2024; v1 submitted 28 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 (Highlight)

  50. arXiv:2312.16256  [pdf, other]

    cs.CV cs.AI

    DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision

    Authors: Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, Aniket Bera

    Abstract: We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets for deep learning-based 3D vision, limited to either synthetic environments or a narrow selection of real-world scenes, are quite insufficient. This insufficiency not…

    Submitted 29 December, 2023; v1 submitted 25 December, 2023; originally announced December 2023.