Showing 1–50 of 117 results for author: Ren, B

  1. arXiv:2406.06216  [pdf, other]

    cs.CV

    Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis

    Authors: Xin Jin, Pengyi Jiao, Zheng-Peng Duan, Xingchao Yang, Chun-Le Guo, Bo Ren, Chongyi Li

    Abstract: Volumetric rendering based methods, like NeRF, excel in HDR view synthesis from RAW images, especially for nighttime scenes. While, they suffer from long training times and cannot perform real-time rendering due to dense sampling requirements. The advent of 3D Gaussian Splatting (3DGS) enables real-time rendering and faster training. However, implementing RAW image-based view synthesis directly usi…

    Submitted 10 June, 2024; originally announced June 2024.

  2. arXiv:2405.20008  [pdf, other]

    cs.CV

    Sharing Key Semantics in Transformer Makes Efficient Image Restoration

    Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe

    Abstract: Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the Vision Transformers (ViTs) emergence has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objec…

    Submitted 30 May, 2024; originally announced May 2024.

    Comments: 9 pages

  3. arXiv:2404.19358  [pdf, other]

    cs.IT

    QML-IB: Quantized Collaborative Intelligence between Multiple Devices and the Mobile Network

    Authors: Jingchen Peng, Boxiang Ren, Lu Yang, Chenghui Peng, Panpan Niu, Hao Wu

    Abstract: The integration of artificial intelligence (AI) and mobile networks is regarded as one of the most important scenarios for 6G. In 6G, a major objective is to realize the efficient transmission of task-relevant data. Then a key problem arises, how to design collaborative AI models for the device side and the network side, so that the transmitted data between the device and the network is efficient…

    Submitted 30 April, 2024; originally announced April 2024.

  4. arXiv:2404.13528  [pdf, other]

    cs.LG cs.AI cs.DC

    SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

    Authors: Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren

    Abstract: This work is motivated by recent developments in Deep Neural Networks, particularly the Transformer architectures underlying applications such as ChatGPT, and the need for performing inference on mobile devices. Focusing on emerging transformers (specifically the ones with computationally efficient Swin-like architectures) and large models (e.g., Stable Diffusion and LLMs) based on transformers, w…

    Submitted 21 April, 2024; originally announced April 2024.

  5. arXiv:2404.10343  [pdf, other]

    cs.CV eess.IV

    The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

    Authors: Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, et al. (109 additional authors not shown)

    Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such…

    Submitted 25 June, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

    Comments: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024

  6. arXiv:2404.07560  [pdf, other]

    cs.RO cs.AI

    Socially Pertinent Robots in Gerontological Healthcare

    Authors: Xavier Alameda-Pineda, Angus Addlesee, Daniel Hernández García, Chris Reinke, Soraya Arias, Federica Arrigoni, Alex Auternaud, Lauriane Blavette, Cigdem Beyan, Luis Gomez Camara, Ohad Cohen, Alessandro Conti, Sébastien Dacunha, Christian Dondrup, Yoav Ellinson, Francesco Ferro, Sharon Gannot, Florian Gras, Nancie Gunson, Radu Horaud, Moreno D'Incà, Imad Kimouche, Séverin Lemaignan, Oliver Lemon, Cyril Liotard, et al. (19 additional authors not shown)

    Abstract: Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilitie…

    Submitted 11 April, 2024; originally announced April 2024.

  7. arXiv:2404.04140  [pdf, other]

    cs.CV cs.LG

    Improving Detection in Aerial Images by Capturing Inter-Object Relationships

    Authors: Botao Ren, Botian Xu, Yifan Pu, Jingyi Wang, Zhidong Deng

    Abstract: In many image domains, the spatial distribution of objects in a scene exhibits meaningful patterns governed by their semantic relationships. In most modern detection pipelines, however, the detection proposals are processed independently, overlooking the underlying relationships between objects. In this work, we introduce a transformer-based approach to capture these inter-object relationships to…

    Submitted 5 April, 2024; originally announced April 2024.

  8. arXiv:2403.00176  [pdf, other]

    cs.LG cs.AI cs.PL

    SoD$^2$: Statically Optimizing Dynamic Deep Neural Network

    Authors: Wei Niu, Gagan Agrawal, Bin Ren

    Abstract: Though many compilation and runtime systems have been developed for DNNs in recent years, the focus has largely been on static DNNs. Dynamic DNNs, where tensor shapes and sizes and even the set of operators used are dependent upon the input and/or execution, are becoming common. This paper presents SoD$^2$, a comprehensive framework for optimizing Dynamic DNNs. The basis of our approach is a class…

    Submitted 29 February, 2024; originally announced March 2024.

  9. arXiv:2402.02634  [pdf, other]

    cs.CV cs.LG eess.IV

    Key-Graph Transformer for Image Restoration

    Authors: Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

    Abstract: While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In…

    Submitted 4 February, 2024; originally announced February 2024.

    Comments: 9 pages, 6 figures

  10. arXiv:2402.02339  [pdf, other]

    cs.CV cs.AI cs.LG

    Uncertainty-Aware Testing-Time Optimization for 3D Human Pose Estimation

    Authors: Ti Wang, Mengyuan Liu, Hong Liu, Bin Ren, Yingxuan You, Wenhao Li, Nicu Sebe, Xia Li

    Abstract: Although data-driven methods have achieved success in 3D human pose estimation, they often suffer from domain gaps and exhibit limited generalization. In contrast, optimization-based methods excel in fine-tuning for specific cases but are generally inferior to data-driven methods in overall performance. We observe that previous optimization-based methods commonly rely on projection constraint, whi…

    Submitted 3 February, 2024; originally announced February 2024.

  11. arXiv:2402.02045  [pdf, other]

    cs.CV

    MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

    Authors: Zhe Li, Laurence T. Yang, Bocheng Ren, Xin Nie, Zhangyang Gao, Cheng Tan, Stan Z. Li

    Abstract: The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However, existing research overlooks the multi-granularity nature of medical visual representation and lacks suitable contrastive learning techniques to improve the models' generalizability across differe…

    Submitted 3 February, 2024; originally announced February 2024.

  12. arXiv:2312.08520  [pdf, other]

    cs.AI

    Revisiting Recommendation Loss Functions through Contrastive Learning (Technical Report)

    Authors: Dong Li, Ruoming Jin, Bin Ren

    Abstract: Inspired by the success of contrastive learning, we systematically examine recommendation losses, including listwise (softmax), pairwise (BPR), and pointwise (MSE and CCL) losses. In this endeavor, we introduce InfoNCE+, an optimized generalization of InfoNCE with balance coefficients, and highlight its performance advantages, particularly when aligned with our new decoupled contrastive loss, MINE…

    Submitted 13 December, 2023; originally announced December 2023.

    Comments: This manuscript was initially submitted for review in August 2023

  13. arXiv:2312.05460  [pdf, other]

    stat.ML cs.LG

    Multi-source domain adaptation for regression

    Authors: Yujie Wu, Giovanni Parmigiani, Boyu Ren

    Abstract: Multi-source domain adaptation (DA) aims at leveraging information from more than one source domain to make predictions in a target domain, where different domains may have different data distributions. Most existing methods for multi-source DA focus on classification problems while there is only limited investigation in the regression settings. In this paper, we fill in this gap through a two-ste…

    Submitted 8 December, 2023; originally announced December 2023.

  14. arXiv:2312.03032  [pdf, other]

    cs.CV

    Zero-Shot Point Cloud Registration

    Authors: Weijie Wang, Guofeng Mei, Bin Ren, Xiaoshui Huang, Fabio Poiesi, Luc Van Gool, Nicu Sebe, Bruno Lepri

    Abstract: Learning-based point cloud registration approaches have significantly outperformed their traditional counterparts. However, they typically require extensive training on specific datasets. In this paper, we propose ZeroReg, the first zero-shot point cloud registration approach that eliminates the need for training on point cloud datasets. The cornerstone of ZeroReg is the novel transfer of image features…

    Submitted 8 December, 2023; v1 submitted 5 December, 2023; originally announced December 2023.

  15. arXiv:2311.17129  [pdf, other]

    cs.CV cs.LG

    Feedback RoI Features Improve Aerial Object Detection

    Authors: Botao Ren, Botian Xu, Tengyu Liu, Jingyi Wang, Zhidong Deng

    Abstract: Neuroscience studies have shown that the human visual system utilizes high-level feedback information to guide lower-level perception, enabling adaptation to signals of different characteristics. In light of this, we propose Feedback multi-Level feature Extractor (Flex) to incorporate a similar mechanism for object detection. Flex refines feature selection based on image-wise and instance-level fe…

    Submitted 28 November, 2023; originally announced November 2023.

  16. arXiv:2308.02339  [pdf, other]

    cs.CV

    Improving Scene Graph Generation with Superpixel-Based Interaction Learning

    Authors: Jingyi Wang, Can Zhang, Jinfa Huang, Botao Ren, Zhidong Deng

    Abstract: Recent advances in Scene Graph Generation (SGG) typically model the relationships among entities utilizing box-level features from pre-defined detectors. We argue that an overlooked problem in SGG is the coarse-grained interactions between boxes, which inadequately capture contextual semantics for relationship modeling, practically limiting the development of the field. In this paper, we take the…

    Submitted 4 August, 2023; originally announced August 2023.

  17. arXiv:2307.03917  [pdf, other]

    eess.AS cs.CL cs.SD

    On decoder-only architecture for speech-to-text and large language model integration

    Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

    Abstract: Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has not been explored well. The "decoder-only" architecture has also not been well studied for speech processing tasks. In this research, we introduce Speech-LLaMA,…

    Submitted 2 October, 2023; v1 submitted 8 July, 2023; originally announced July 2023.

  18. arXiv:2306.16061  [pdf, other]

    cs.RO cs.AI

    MRHER: Model-based Relay Hindsight Experience Replay for Sequential Object Manipulation Tasks with Sparse Rewards

    Authors: Yuming Huang, Bin Ren, Ziming Xu, Lianghong Wu

    Abstract: Sparse rewards pose a significant challenge to achieving high sample efficiency in goal-conditioned reinforcement learning (RL). Specifically, in sequential manipulation tasks, the agent receives failure rewards until it successfully completes the entire manipulation task, which leads to low sample efficiency. To tackle this issue and improve sample efficiency, we propose a novel model-based RL fr…

    Submitted 21 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

  19. arXiv:2305.07498  [pdf, other]

    cs.CV

    Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution

    Authors: Jianfeng Kuang, Wei Hua, Dingkang Liang, Mingkun Yang, Deqiang Jiang, Bo Ren, Xiang Bai

    Abstract: Visual information extraction (VIE), which aims to simultaneously perform OCR and information extraction in a unified framework, has drawn increasing attention due to its essential role in various applications like understanding receipts, goods, and traffic signs. However, as existing benchmark datasets for VIE mainly consist of document images without the adequate diversity of layout structures,…

    Submitted 14 June, 2023; v1 submitted 12 May, 2023; originally announced May 2023.

    Comments: 15 pages, 6 figures, ICDAR2023

  20. arXiv:2305.04268  [pdf, other]

    cs.CV

    Multi-Space Neural Radiance Fields

    Authors: Ze-Xin Yin, Jiaxiong Qiu, Ming-Ming Cheng, Bo Ren

    Abstract: Existing Neural Radiance Fields (NeRF) methods suffer from the existence of reflective objects, often resulting in blurry or distorted rendering. Instead of calculating a single radiance field, we propose a multi-space neural radiance field (MS-NeRF) that represents the scene using a group of feature fields in parallel sub-spaces, which leads to a better understanding of the neural network toward…

    Submitted 7 May, 2023; originally announced May 2023.

    Comments: CVPR 2023, 10 pages, 12 figures

  21. arXiv:2304.08706  [pdf, other]

    cs.CV

    Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections

    Authors: Jiaxiong Qiu, Peng-Tao Jiang, Yifan Zhu, Ze-Xin Yin, Ming-Ming Cheng, Bo Ren

    Abstract: Neural implicit methods have achieved high-quality 3D object surfaces under slight specular highlights. However, high specular reflections (HSR) often appear in front of target objects when we capture them through glasses. The complex ambiguity in these scenes violates the multi-view consistency, then makes it challenging for recent methods to reconstruct target objects correctly. To remedy this i…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: 17 pages, 20 figures

  22. arXiv:2304.08304  [pdf, other]

    cs.CV cs.AI

    SDVRF: Sparse-to-Dense Voxel Region Fusion for Multi-modal 3D Object Detection

    Authors: Binglu Ren, Jianqin Yin

    Abstract: In the perception task of autonomous driving, multi-modal methods have become a trend due to the complementary characteristics of LiDAR point clouds and image data. However, the performance of multi-modal methods is usually limited by the sparsity of the point cloud or the noise problem caused by the misalignment between LiDAR and the camera. To solve these two problems, we present a new concept,…

    Submitted 17 September, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

  23. arXiv:2303.14768  [pdf, other]

    cs.CV

    Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies

    Authors: Bei Gan, Xiujun Shu, Ruizhi Qiao, Haoqian Wu, Keyu Chen, Hanjun Li, Bo Ren

    Abstract: Movie highlights stand out of the screenplay for efficient browsing and play a crucial role on social media platforms. Based on existing efforts, this work has two observations: (1) For different annotators, labeling highlight has uncertainty, which leads to inaccurate and time-consuming annotations. (2) Besides previous supervised or unsupervised settings, some existing video corpora can be usefu…

    Submitted 26 March, 2023; originally announced March 2023.

    Comments: Accepted to CVPR2023

  24. arXiv:2303.08331  [pdf, other]

    cs.CV cs.LG cs.NE eess.IV

    Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting

    Authors: Gen Li, Jie Ji, Minghai Qin, Wei Niu, Bin Ren, Fatemeh Afghah, Linke Guo, Xiaolong Ma

    Abstract: As deep convolutional neural networks (DNNs) are widely used in various fields of computer vision, leveraging the overfitting ability of the DNN to achieve video resolution upscaling has become a new trend in the modern video delivery system. By dividing videos into chunks and overfitting each chunk with a super-resolution model, the server encodes videos before transmitting them to the clients, t…

    Submitted 18 June, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

    Comments: CVPR 2023 Highlight Paper

  25. arXiv:2302.14338  [pdf, other]

    cs.CV

    Turning a CLIP Model into a Scene Text Detector

    Authors: Wenwen Yu, Yuliang Liu, Wei Hua, Deqiang Jiang, Bo Ren, Xiang Bai

    Abstract: The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks via leveraging the pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision language models have made effective progresses…

    Submitted 26 March, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

    Comments: CVPR2023

  26. arXiv:2301.03949  [pdf, other]

    cs.CV

    Modiff: Action-Conditioned 3D Motion Generation with Denoising Diffusion Probabilistic Models

    Authors: Mengyi Zhao, Mengyuan Liu, Bin Ren, Shuling Dai, Nicu Sebe

    Abstract: Diffusion-based generative models have recently emerged as powerful solutions for high-quality synthesis in multiple domains. Leveraging the bidirectional Markov chains, diffusion probabilistic models generate samples by inferring the reversed Markov chain based on the learned distribution mapping at the forward diffusion process. In this work, we propose Modiff, a conditional paradigm that benefi…

    Submitted 28 March, 2023; v1 submitted 10 January, 2023; originally announced January 2023.

  27. arXiv:2212.00465  [pdf, other]

    cs.CV eess.IV

    FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning

    Authors: Yulei Qin, Xingyu Chen, Chao Chen, Yunhang Shen, Bo Ren, Yun Gu, Jie Yang, Chunhua Shen

    Abstract: Recently, webly supervised learning (WSL) has been studied to leverage numerous and accessible data from the Internet. Most existing methods focus on learning noise-robust models from web images while neglecting the performance drop caused by the differences between web domain and real-world domain. However, only by tackling the performance gap above can we fully exploit the practical value of web…

    Submitted 1 December, 2022; originally announced December 2022.

    Comments: 7 pages, 5 figures, 5 tables. Accepted in AAAI 2023

  28. arXiv:2211.16208  [pdf, other]

    cs.CV

    SLAN: Self-Locator Aided Network for Cross-Modal Understanding

    Authors: Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Bo Ren, Ming-Ming Cheng

    Abstract: Learning fine-grained interplay between vision and language allows to a more accurate understanding for Vision-Language tasks. However, it remains challenging to extract key image regions according to the texts for semantic alignments. Most existing works are either limited by text-agnostic and redundant regions obtained with the frozen detectors, or failing to scale further due to its heavy relianc…

    Submitted 8 December, 2022; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: 12 pages

  29. arXiv:2211.15743  [pdf, other]

    cs.IR math.ST

    Towards Reliable Item Sampling for Recommendation Evaluation

    Authors: Dong Li, Ruoming Jin, Zhenming Liu, Bin Ren, Jing Gao, Zhi Liu

    Abstract: Since Rendle and Krichene argued that commonly used sampling-based evaluation metrics are "inconsistent" with respect to the global metrics (even in expectation), there have been a few studies on the sampling-based recommender system evaluation. Existing methods try either mapping the sampling-based metrics to their global counterparts or more generally, learning the empirical rank distribution to…

    Submitted 11 October, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

    Comments: AAAI 2023

  30. arXiv:2211.07210  [pdf, other]

    cs.CV cs.AI

    Grafting Pre-trained Models for Multimodal Headline Generation

    Authors: Lingfeng Qiao, Chen Wu, Ye Liu, Haoyuan Peng, Di Yin, Bo Ren

    Abstract: Multimodal headline utilizes both video frames and transcripts to generate the natural language title of the videos. Due to a lack of large-scale, manually annotated data, the task of annotating grounded headlines for video is labor intensive and impractical. Previous researches on pre-trained language models and video-language models have achieved significant progress in related downstream tasks.…

    Submitted 14 November, 2022; originally announced November 2022.

    Comments: Accepted by EMNLP 2022

  31. arXiv:2211.06742  [pdf, other]

    cs.CV cs.AI

    Deep Unsupervised Key Frame Extraction for Efficient Video Classification

    Authors: Hao Tang, Lei Ding, Songsong Wu, Bin Ren, Nicu Sebe, Paolo Rota

    Abstract: Video processing and analysis have become an urgent task since a huge amount of videos (e.g., Youtube, Hulu) are uploaded online every day. The extraction of representative key frames from videos is very important in video processing and analysis since it greatly reduces computing resources and time. Although great progress has been made recently, large-scale video classification remains an open p…

    Submitted 12 November, 2022; originally announced November 2022.

    Comments: Accepted to TOMM

  32. arXiv:2210.04473  [pdf, other]

    cs.CL

    Leveraging Key Information Modeling to Improve Less-Data Constrained News Headline Generation via Duality Fine-Tuning

    Authors: Zhuoxuan Jiang, Lingfeng Qiao, Di Yin, Shanshan Feng, Bo Ren

    Abstract: Recent language generative models are mostly trained on large-scale datasets, while in some real scenarios, the training datasets are often expensive to obtain and would be small-scale. In this paper we investigate the challenging task of less-data constrained generation, especially when the generated news headlines are short yet expected by readers to keep readable and informative simultaneously.…

    Submitted 10 October, 2022; originally announced October 2022.

    Comments: Accepted by AACL-IJCNLP 2022 main conference

  33. arXiv:2209.09476  [pdf, other]

    cs.LG cs.AI cs.CV

    SparCL: Sparse Continual Learning on the Edge

    Authors: Zifeng Wang, Zheng Zhan, Yifan Gong, Geng Yuan, Wei Niu, Tong Jian, Bin Ren, Stratis Ioannidis, Yanzhi Wang, Jennifer Dy

    Abstract: Existing work in continual learning (CL) focuses on mitigating catastrophic forgetting, i.e., model performance deterioration on past tasks when learning a new task. However, the training efficiency of a CL system is under-investigated, which limits the real-world application of CL systems under resource-limited scenarios. In this work, we propose a novel framework called Sparse Continual Learning…

    Submitted 20 September, 2022; originally announced September 2022.

    Comments: Published at NeurIPS 2022 as a conference paper

  34. arXiv:2208.13363  [pdf, other]

    cs.LG

    Survey: Exploiting Data Redundancy for Optimization of Deep Learning

    Authors: Jou-An Chen, Wei Niu, Bin Ren, Yanzhi Wang, Xipeng Shen

    Abstract: Data redundancy is ubiquitous in the inputs and intermediate results of Deep Neural Networks (DNN). It offers many significant opportunities for improving DNN performance and efficiency and has been explored in a large body of work. These studies have scattered in many venues across several years. The targets they focus on range from images to videos and texts, and the techniques they use to detec…

    Submitted 29 August, 2022; originally announced August 2022.

  35. arXiv:2208.10180  [pdf, other]

    cs.CV

    TaCo: Textual Attribute Recognition via Contrastive Learning

    Authors: Chang Nie, Yiqing Hu, Yanqiu Qu, Hao Liu, Deqiang Jiang, Bo Ren

    Abstract: As textual attributes like font are core design elements of document format and page style, automatic attributes recognition favor comprehensive practical applications. Existing approaches already yield satisfactory performance in differentiating disparate attributes, but they still suffer in distinguishing similar attributes with only subtle difference. Moreover, their performance drop severely i…

    Submitted 22 August, 2022; originally announced August 2022.

  36. arXiv:2208.09374  [pdf, other]

    cs.CV

    VLMAE: Vision-Language Masked Autoencoder

    Authors: Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren

    Abstract: Image and language modeling is of crucial importance for vision-language pre-training (VLP), which aims to learn multi-modal representations from large-scale paired image-text data. However, we observe that most existing VLP methods focus on modeling the interactions between image and text features while neglecting the information disparity between image and text, thus suffering from focal bias. T…

    Submitted 19 August, 2022; originally announced August 2022.

    Comments: 12 pages, 7 figures

  37. arXiv:2208.08608  [pdf, other]

    cs.CV

    See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

    Authors: Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang

    Abstract: Text-based person retrieval aims to find the query person based on a textual description. The key is to learn a common latent space mapping between visual-textual modalities. To achieve this goal, existing works employ segmentation to obtain explicitly cross-modal alignments or utilize attention to explore salient alignments. These methods have two shortcomings: 1) Labeling cross-modal alignments…

    Submitted 25 August, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

    Comments: Accepted at ECCV Workshop on Real-World Surveillance (RWS 2022)

  38. arXiv:2207.12577  [pdf, other]

    cs.CV cs.AR cs.LG eess.IV

    Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

    Authors: Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, Yanzhi Wang

    Abstract: Deep learning-based super-resolution (SR) has gained tremendous popularity in recent years because of its high image quality performance and wide application scenarios. However, prior methods typically suffer from large amounts of computations and huge power consumption, causing difficulties for real-time inference, especially on resource-limited platforms such as mobile devices. To mitigate this,…

    Submitted 25 July, 2022; originally announced July 2022.

  39. arXiv:2207.04713  [pdf, other]

    cs.CL

    GMN: Generative Multi-modal Network for Practical Document Information Extraction

    Authors: Haoyu Cao, Jiefeng Ma, Antai Guo, Yiqing Hu, Hao Liu, Deqiang Jiang, Yinsong Liu, Bo Ren

    Abstract: Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world. Although recent literature has already achieved competitive results, these approaches usually fail when dealing with complex documents with noisy OCR results or mutative layouts. This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to add…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: Accepted to NAACL 2022 main conference

  40. arXiv:2207.04242  [pdf, other]

    cs.CV

    PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation

    Authors: Bin Ren, Hao Tang, Yiming Wang, Xia Li, Wei Wang, Nicu Sebe

    Abstract: For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images. Hence, one not only needs to encode the long-range dependencies among pixels in both the source view im…

    Submitted 6 March, 2023; v1 submitted 9 July, 2022; originally announced July 2022.

    Comments: 5 pages, 5 figures

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing

  41. arXiv:2207.01887  [pdf, other]

    cs.CV

    Open-Vocabulary Multi-Label Classification via Multi-Modal Knowledge Transfer

    Authors: Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Bo Ren, Shu-Tao Xia

    Abstract: Real-world recognition system often encounters the challenge of unseen labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge by a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, while ignoring the rich semantic information inherent in image-text pairs. Ins…

    Submitted 1 February, 2023; v1 submitted 5 July, 2022; originally announced July 2022.

    Comments: AAAI 2023 (Oral presentation paper). Updated version

  42. arXiv:2207.01241  [pdf, other]

    cs.CV

    OS-MSL: One Stage Multimodal Sequential Link Framework for Scene Segmentation and Classification

    Authors: Ye Liu, Lingfeng Qiao, Di Yin, Zhuoxuan Jiang, Xinghua Jiang, Deqiang Jiang, Bo Ren

    Abstract: Scene segmentation and classification (SSC) serve as a critical step towards the field of video structuring analysis. Intuitively, jointly learning of these two tasks can promote each other by sharing common information. However, scene segmentation concerns more on the local difference between adjacent shots while classification needs the global representation of scene segments, which probably lea…

    Submitted 4 July, 2022; originally announced July 2022.

    Comments: Accepted by ACM MM 2022

  43. arXiv:2206.11134  [pdf, other]

    cs.CV

    Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

    Authors: Peixian Chen, Kekai Sheng, Mengdan Zhang, Mingbao Lin, Yunhang Shen, Shaohui Lin, Bo Ren, Ke Li

    Abstract: Open-vocabulary object detection (OVD) aims to scale up vocabulary size to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective in proposal-level vision-language alignment. Meanwhile, the models usually suffer from confidence bias toward base categories and perfo…

    Submitted 24 November, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

  44. arXiv:2206.10620  [pdf, other]

    cs.LG cs.AI cs.CV cs.PL

    CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework

    Authors: Xiaofeng Li, Bin Ren, Xipeng Shen, Yanzhi Wang

    Abstract: There is a growing demand for shifting the delivery of AI capability from data centers on the cloud to edge or end devices, exemplified by the fast emerging real-time AI-based apps running on smartphones, AR/VR devices, autonomous vehicles, and various IoT devices. The shift has however been seriously hampered by the large growing gap between DNN computing demands and the computing power on edge o…

    Submitted 21 June, 2022; originally announced June 2022.

  45. arXiv:2206.09195  [pdf, other]

    cs.LG cs.AI

    EEML: Ensemble Embedded Meta-learning

    Authors: Geng Li, Boyuan Ren, Hongzhi Wang

    Abstract: To accelerate learning process with few samples, meta-learning resorts to prior knowledge from previous tasks. However, the inconsistent task distribution and heterogeneity is hard to be handled through a global sharing model initialization. In this paper, based on gradient-based meta-learning, we propose an ensemble embedded meta-learning algorithm (EEML) that explicitly utilizes multi-model-ense…

    Submitted 18 June, 2022; originally announced June 2022.

  46. arXiv:2206.03377  [pdf, other]

    cs.CL

    RAAT: Relation-Augmented Attention Transformer for Relation Modeling in Document-Level Event Extraction

    Authors: Yuan Liang, Zhuoxuan Jiang, Di Yin, Bo Ren

    Abstract: In document-level event extraction (DEE) task, event arguments always scatter across sentences (across-sentence issue) and multiple events may lie in one document (multi-event issue). In this paper, we argue that the relation information of event arguments is of great significance for addressing the above two issues, and propose a new DEE framework which can model the relation dependencies, called…

    Submitted 7 June, 2022; originally announced June 2022.

    Comments: Accepted by NAACL 2022

  47. arXiv:2206.02343  [pdf, other]

    cs.CV

    Contrastive Graph Multimodal Model for Text Classification in Videos

    Authors: Ye Liu, Changchong Lu, Chen Lin, Di Yin, Bo Ren

    Abstract: The extraction of text information in videos serves as a critical step towards semantic understanding of videos. It usually involved in two steps: (1) text recognition and (2) text classification. To localize texts in videos, we can resort to large numbers of text recognition methods based on OCR technology. However, to our knowledge, there is no existing work focused on the second step of video t…

    Submitted 6 June, 2022; originally announced June 2022.

  48. arXiv:2206.01244  [pdf, other]

    cs.CV eess.IV

    Real-Time Portrait Stylization on the Edge

    Authors: Yanyu Li, Xuan Shen, Geng Yuan, Jiexiong Guan, Wei Niu, Hao Tang, Bin Ren, Yanzhi Wang

    Abstract: In this work we demonstrate real-time portrait stylization, specifically, translating self-portrait into cartoon or anime style on mobile devices. We propose a latency-driven differentiable architecture search method, maintaining realistic generative quality. With our framework, we obtain $10\times$ computation reduction on the generative model and achieve real-time video stylization on off-the-sh…

    Submitted 2 June, 2022; originally announced June 2022.

  49. arXiv:2205.12551  [pdf, other]

    cs.CV cs.CR

    Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

    Authors: Bin Ren, Yahui Liu, Yue Song, Wei Bi, Rita Cucchiara, Nicu Sebe, Wei Wang

    Abstract: Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed. This caveat naturally raises a series of interesting questions about the impact of PEs on the accuracy, priva…

    Submitted 26 May, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: Accepted to CVPR2023

  50. arXiv:2205.10884  [pdf, other]

    cs.CL

    Sequence-to-Action: Grammatical Error Correction with Action Guided Sequence Generation

    Authors: Jiquan Li, Junliang Guo, Yongxin Zhu, Xin Sheng, Deqiang Jiang, Bo Ren, Linli Xu

    Abstract: The task of Grammatical Error Correction (GEC) has received remarkable attention with wide applications in Natural Language Processing (NLP) in recent years. While one of the key principles of GEC is to keep the correct parts unchanged and avoid over-correction, previous sequence-to-sequence (seq2seq) models generate results from scratch, which are not guaranteed to follow the original sentence st…

    Submitted 22 May, 2022; originally announced May 2022.

    Comments: Accepted in AAAI 2022