Search | arXiv e-print repository

Neural Graphics Texture Compression Supporting Random Access

Authors: Farzad Farhadzadeh, Qiqi Hou, Hoang Le, Amir Said, Randall Rauwendaal, Alex Bourd, Fatih Porikli

Abstract: Advances in rendering have led to tremendous growth in texture assets, including resolution, complexity, and novel textures components, but this growth in data volume has not been matched by advances in its compression. Meanwhile Neural Image Compression (NIC) has advanced significantly and shown promising results, but the proposed methods cannot be directly adapted to neural texture compression. First, texture compression requires on-demand and real-time decoding with random access during parallel rendering (e.g. block texture decompression on GPUs). Additionally, NIC does not support multi-resolution reconstruction (mip-levels), nor does it have the ability to efficiently jointly compress different sets of texture channels. In this work, we introduce a novel approach to texture set compression that integrates traditional GPU texture representation and NIC techniques, designed to enable random access and support many-channel texture sets. To achieve this goal, we propose an asymmetric auto-encoder framework that employs a convolutional encoder to capture detailed information in a bottleneck-latent space, and at decoder side we utilize a fully connected network, whose inputs are sampled latent features plus positional information, for a given texture coordinate and mip level. This latent data is defined to enable simplified access to multi-resolution data by simply changing the scanning strides. Experimental results demonstrate that this approach provides much better results than conventional texture compression, and significant improvement over the latest method using neural networks.

Submitted 6 May, 2024; originally announced July 2024.

Comments: ECCV submission
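
The abstract above says the latent data is laid out so that multi-resolution access reduces to changing the scanning strides. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes a single 2D latent grid stored as nested lists and assumes that mip level k reads every 2**k-th latent along each axis; the function name `sample_latents` and the grid contents are hypothetical.

```python
def sample_latents(latent_grid, mip_level):
    """Return the latents visited at a given mip level.

    Assumption (not from the paper): mip level k uses a scanning
    stride of 2**k along both spatial axes of one shared grid.
    """
    if mip_level < 0:
        raise ValueError("mip_level must be non-negative")
    stride = 2 ** mip_level
    # Same underlying grid for every level; only the stride changes.
    return [row[::stride] for row in latent_grid[::stride]]


# Hypothetical 8x8 grid whose "latent" is just its (row, col) position.
grid = [[(r, c) for c in range(8)] for r in range(8)]

level0 = sample_latents(grid, 0)  # full resolution: 8x8
level2 = sample_latents(grid, 2)  # stride 4: 2x2
print(len(level0), len(level0[0]))  # 8 8
print(level2)  # [[(0, 0), (0, 4)], [(4, 0), (4, 4)]]
```

In this sketch no per-level copy of the grid is stored, which is the property the abstract appears to describe; how the actual method samples and interpolates latents is not stated in the abstract.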

arXiv:2406.15819 [pdf, other]

Automatic AI Model Selection for Wireless Systems: Online Learning via Digital Twinning

Authors: Qiushuo Hou, Matteo Zecchin, Sangwoo Park, Yunlong Cai, Guanding Yu, Kaushik Chowdhury, Osvaldo Simeone

Abstract: In modern wireless network architectures, such as O-RAN, artificial intelligence (AI)-based applications are deployed at intelligent controllers to carry out functionalities like scheduling or power control. The AI "apps" are selected on the basis of contextual information such as network conditions, topology, traffic statistics, and design goals. The mapping between context and AI model parameters is ideally done in a zero-shot fashion via an automatic model selection (AMS) mapping that leverages only contextual information without requiring any current data. This paper introduces a general methodology for the online optimization of AMS mappings. Optimizing an AMS mapping is challenging, as it requires exposure to data collected from many different contexts. Therefore, if carried out online, this initial optimization phase would be extremely time consuming. A possible solution is to leverage a digital twin of the physical system to generate synthetic data from multiple simulated contexts. However, given that the simulator at the digital twin is imperfect, a direct use of simulated data for the optimization of the AMS mapping would yield poor performance when tested in the real system. This paper proposes a novel method for the online optimization of AMS mapping that corrects for the bias of the simulator by means of limited real data collected from the physical system. Experimental results for a graph neural network-based power control app demonstrate the significant advantages of the proposed approach.

Submitted 22 June, 2024; originally announced June 2024.

Comments: submitted for a journal publication

arXiv:2406.06858 [pdf, other]

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Authors: Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu

Abstract: Large deep learning models have demonstrated strong ability to solve many tasks across a wide range of applications. Those large models typically require training and inference to be distributed. Tensor parallelism is a common technique partitioning computation of an operation or layer across devices to overcome the memory capacity limitation of a single processor, and/or to accelerate computation to meet a certain latency requirement. However, this kind of parallelism introduces additional communication that might contribute a significant portion of overall runtime. Thus limits scalability of this technique within a group of devices with high speed interconnects, such as GPUs with NVLinks in a node. This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs. Flux over-decomposes communication and computation operations into much finer-grained operations and further fuses them into a larger kernel to effectively hide communication without compromising kernel efficiency. Flux can potentially overlap up to 96% of communication given a fused kernel. Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects, and up to 1.66x and 1.30x speedups for prefill and decoding inference over vLLM on a cluster with 8 GPUs with various GPU generations and interconnects.

Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

arXiv:2406.00670 [pdf, other]

Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation

Authors: Yunheng Li, ZhongYu Li, Quansheng Zeng, Qibin Hou, Ming-Ming Cheng

Abstract: Pre-trained vision-language models, e.g., CLIP, have been successfully applied to zero-shot semantic segmentation. Existing CLIP-based approaches primarily utilize visual features from the last layer to align with text embeddings, while they neglect the crucial information in intermediate layers that contain rich object details. However, we find that directly aggregating the multi-level visual features weakens the zero-shot ability for novel classes. The large differences between the visual features from different layers make these features hard to align well with the text embeddings. We resolve this problem by introducing a series of independent decoders to align the multi-level visual features with the text embeddings in a cascaded way, forming a novel but simple framework named Cascade-CLIP. Our Cascade-CLIP is flexible and can be easily applied to existing zero-shot semantic segmentation methods. Experimental results show that our simple Cascade-CLIP achieves superior zero-shot performance on segmentation benchmarks, like COCO-Stuff, Pascal-VOC, and Pascal-Context. Our code is available at: https://github.com/HVision-NKU/Cascade-CLIP

Submitted 6 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

Comments: Accepted by ICML 2024

arXiv:2405.08021 [pdf, other]

Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion

Authors: Zhao Ren, Kevin Scheck, Qinhan Hou, Stefano van Gogh, Michael Wand, Tanja Schultz

Abstract: Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech signals. Due to an inadequate amount of available data and noisy signals, the synthesised speech often exhibits a low level of naturalness. In this work, we propose Diff-ETS, an ETS model which uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by an EMG encoder. In our experiments, we evaluated fine-tuning the diffusion model on predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated the proposed Diff-ETS significantly improved speech naturalness over the baseline.

Submitted 11 May, 2024; originally announced May 2024.

Comments: Accepted by EMBC 2024

arXiv:2405.07469 [pdf, other]

Phase coding semi-quantum key distribution system based on the Single-state protocol

Authors: Qincheng Hou, Siying Huang, Naida Mo, Jindong Wang, Zhengjun Wei, Yafei Yu, Tianming Zhao, Zhiming Zhang

Abstract: Semi-quantum key distribution (SQKD) allows sharing random keys between a quantum user and a classical user. However, implementing classical user operations is challenging, posing a hurdle to achieving the Single-state protocol. By using the "selective modulation" method, the feasibility of SQKD is verified in principle. The proposal of the selective modulation method enables the realization of other protocols for SQKD. To advance experimental progress in SQKD, we propose and implement a phase-encoded semi-quantum key distribution system based on the Single-state protocol and the "selective modulation" method. The system operates at a frequency of 100MHz and an average photon number of 0.1. The interference contrast achieved 96.52%, the average quantum bit error rate was 1.19%, and the raw key rate reached 88Kbps. Our experimental results demonstrate the feasibility and stability of the proposed phase-encoded semi-quantum key distribution system. Furthermore, by leveraging the "selective modulation" scheme proposed in this paper, we develop a comprehensive theoretical description of selective modulation. Through an analysis of quantum state evolution, we assess the security of our system, ultimately demonstrating its resilience against attacks targeting quantum states. The classical user of our system requires only two optical devices, significantly reducing the equipment requirements and enhancing its application potential. This work validates the feasibility of semi-quantum key distribution experiments and provides ideas for future research on semi-quantum key distribution experiments and security studies.

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2405.01434 [pdf, other]

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Authors: Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, Qibin Hou

Abstract: For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of self-attention calculation, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic spaces. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than the modules based on latent spaces only, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation with the presentation of images and videos, which we hope could inspire more research from the aspect of architectural modifications. Our code is made publicly available at https://github.com/HVision-NKU/StoryDiffusion.

Submitted 2 May, 2024; originally announced May 2024.

arXiv:2404.18454 [pdf, other]

doi 10.1145/3641519.3657456

3D Gaussian Splatting with Deferred Reflection

Authors: Keyang Ye, Qiming Hou, Kun Zhou

Abstract: The advent of neural and Gaussian-based radiance field methods have achieved great success in the field of novel view synthesis. However, specular reflection remains non-trivial, as the high frequency radiance field is notoriously difficult to fit stably and accurately. We present a deferred shading method to effectively render specular reflection with Gaussian splatting. The key challenge comes from the environment map reflection model, which requires accurate surface normal while simultaneously bottlenecks normal estimation with discontinuous gradients. We leverage the per-pixel reflection gradients generated by deferred shading to bridge the optimization process of neighboring Gaussians, allowing nearly correct normal estimations to gradually propagate and eventually spread over all reflective objects. Our method significantly outperforms state-of-the-art techniques and concurrent work in synthesizing high-quality specular reflection effects, demonstrating a consistent improvement of peak signal-to-noise ratio (PSNR) for both synthetic and real-world scenes, while running at a frame rate almost identical to vanilla Gaussian splatting.

Submitted 4 June, 2024; v1 submitted 29 April, 2024; originally announced April 2024.
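
The abstract above reports quality in terms of peak signal-to-noise ratio (PSNR). As a reminder of what that metric measures, here is a short sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The pixel values are made up for illustration, and images are flattened to plain lists with intensities assumed to lie in [0, 1]; this is not the paper's evaluation code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative values only: two of four pixels are off by 0.1.
reference = [0.0, 0.5, 1.0, 0.25]
rendered = [0.1, 0.5, 0.9, 0.25]
print(f"{psnr(reference, rendered):.2f} dB")  # 23.01 dB
```

Higher PSNR means lower mean squared error against the reference, so a consistent PSNR gain indicates renders that are closer to ground truth on average.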

arXiv:2404.11860 [pdf, ps, other]

Active robustness against the detuning-error for Rydberg quantum gates

Authors: Qing-Ling Hou, Han Wang, Jing Qian

Abstract: Error suppression to the experimental imperfections is a central challenge for useful quantum computing. Recent studies have shown the advantages of using single-modulated pulses based on optimal control which can realize high-fidelity two-qubit gates in neutral-atom arrays. However, typical optimization only minimizes the ideal gate error in the absence of any decay, which allows the gate to be passively influenced by all error sources leading to an exponential increase of sensitivity when the error becomes larger. In the present work, we propose the realization of two-qubit CZ gates with active robustness against two-photon detuning errors. Our method depends on a modified cost function in numerical optimization for shaping gate pulses, which can minimize, not only the ideal gate error but also the fluctuations of gate infidelity over a wide error range. We introduce a family of Rydberg blockade gates with active robustness towards the impacts of versatile noise sources such as Doppler dephasing and ac Stark shifts. The resulting gates with robust pulses can significantly increase the insensitivity to any type of errors acting on the two-photon detuning, benefiting from a relaxed requirement of colder atomic temperatures or more stable lasers for current experimental technology.

Submitted 17 April, 2024; originally announced April 2024.

Comments: 13 pages, 7 figures

arXiv:2404.11100 [pdf, other]

Synthesizing Realistic Data for Table Recognition

Authors: Qiyu Hou, Jun Wang, Meixuan Qiao, Lujun Tian

Abstract: To overcome the limitations and challenges of current automatic table data annotation methods and random table data synthesis approaches, we propose a novel method for synthesizing annotation data specifically designed for table recognition. This method utilizes the structure and content of existing complex tables, facilitating the efficient creation of tables that closely replicate the authentic styles found in the target domain. By leveraging the actual structure and content of tables from Chinese financial announcements, we have developed the first extensive table annotation dataset in this domain. We used this dataset to train several recent deep learning-based end-to-end table recognition models. Additionally, we have established the inaugural benchmark for real-world complex tables in the Chinese financial announcement domain, using it to assess the performance of models trained on our synthetic data, thereby effectively validating our method's practicality and effectiveness. Furthermore, we applied our synthesis method to augment the FinTabNet dataset, extracted from English financial announcements, by increasing the proportion of tables with multiple spanning cells to introduce greater complexity. Our experiments show that models trained on this augmented dataset achieve comprehensive improvements in performance, especially in the recognition of tables with multiple spanning cells.

Submitted 17 April, 2024; originally announced April 2024.

Comments: ICDAR 2024

arXiv:2404.04887 [pdf, other]

A Clinical-oriented Multi-level Contrastive Learning Method for Disease Diagnosis in Low-quality Medical Images

Authors: Qingshan Hou, Shuai Cheng, Peng Cao, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane, Yih Chung Tham

Abstract: Representation learning offers a conduit to elucidate distinctive features within the latent space and interpret the deep models. However, the randomness of lesion distribution and the complexity of low-quality factors in medical images pose great challenges for models to extract key lesion features. Disease diagnosis methods guided by contrastive learning (CL) have shown significant advantages in lesion feature representation. Nevertheless, the effectiveness of CL is highly dependent on the quality of the positive and negative sample pairs. In this work, we propose a clinical-oriented multi-level CL framework that aims to enhance the model's capacity to extract lesion features and discriminate between lesion and low-quality factors, thereby enabling more accurate disease diagnosis from low-quality medical images. Specifically, we first construct multi-level positive and negative pairs to enhance the model's comprehensive recognition capability of lesion features by integrating information from different levels and qualities of medical images. Moreover, to improve the quality of the learned lesion embeddings, we introduce a dynamic hard sample mining method based on self-paced learning. The proposed CL framework is validated on two public medical image datasets, EyeQ and Chest X-ray, demonstrating superior performance compared to other state-of-the-art disease diagnostic methods.

Submitted 7 April, 2024; originally announced April 2024.

arXiv:2403.17879 [pdf, other]

Low-Latency Neural Stereo Streaming

Authors: Qiqi Hou, Farzad Farhadzadeh, Amir Said, Guillaume Sautiere, Hoang Le

Abstract: The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods, both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance, they compress left and right views sequentially, leading to poor parallelization and runtime performance. This work presents Low-Latency neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video coding method designed for fast and efficient low-latency stereo video streaming. Instead of using a sequential cross-view motion compensation like existing methods, LLSS introduces a bidirectional feature shifting module to directly exploit mutual information among views and encode them effectively with a joint cross-view prior model for entropy coding. Thanks to this design, LLSS processes left and right views in parallel, minimizing latency; all while substantially improving R-D performance compared to both existing neural and conventional codecs.

Submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted by CVPR2024

arXiv:2403.17749 [pdf, other]

Multi-Task Dense Prediction via Mixture of Low-Rank Experts

Authors: Yuqi Yang, Peng-Tao Jiang, Qibin Hou, Hao Zhang, Jinwei Chen, Bo Li

Abstract: Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper, we present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing. Furthermore, to control the parameters and computational cost brought by the increase in the number of experts, we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution, the parameters and computational cost do not change much with the increase of experts. Benefiting from this design, we increase the number of experts and its reception field to enlarge the representation capacity, facilitating multiple dense tasks learning in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE.

Submitted 27 May, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

Comments: Accepted at CVPR 2024
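
The abstract above states that low-rank experts can be folded into the generic convolution. The sketch below illustrates only the underlying linear-algebra identity, under strong simplifying assumptions that are not taken from the paper: the convolutions are 1x1 (so each acts as a matrix on a pixel's channel vector), the two paths are purely linear, and their outputs are summed. All matrices and names (`W`, `B`, `A`) are hypothetical toy values.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]  # generic path, 3x3 (9 parameters)
B = [[1], [2], [0]]                    # low-rank expert, 3x1
A = [[1, -1, 2]]                       # low-rank expert, 1x3 (rank 1, 6 parameters total)
x = [[2], [1], [3]]                    # one pixel's channel vector

# Two separate paths: generic conv plus low-rank expert.
two_path = add(matmul(W, x), matmul(B, matmul(A, x)))
# Single merged weight: W + B A, applied once.
merged = matmul(add(W, matmul(B, A)), x)

print(two_path)  # [[15], [18], [7]]
print(merged)    # [[15], [18], [7]]
```

Because W x + B (A x) = (W + B A) x for linear paths, the merged weight needs only one matrix application at inference time in this toy setting; how MLoRE handles routing, larger kernels, and any nonlinearities is not specified in the abstract.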

arXiv:2403.11735 [pdf, other]

LSKNet: A Foundation Lightweight Backbone for Remote Sensing

Authors: Yuxuan Li, Xiang Li, Yimian Dai, Qibin Hou, Li Liu, Yongxiang Liu, Ming-Ming Cheng, Jian Yang

Abstract: Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.

Submitted 23 June, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2303.09030

arXiv:2403.06534 [pdf, other]

SARDet-100K: Towards Open-Source Benchmark and ToolKit for Large-Scale SAR Object Detection

Authors: Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming-Ming Cheng, Jian Yang

Abstract: Synthetic Aperture Radar (SAR) object detection has gained significant attention recently due to its irreplaceable all-weather imaging capabilities. However, this research field suffers from both limited public datasets (mostly comprising <2K images with only mono-category objects) and inaccessible source code. To tackle these challenges, we establish a new benchmark dataset and an open-source method for large-scale SAR object detection. Our dataset, SARDet-100K, is a result of intense surveying, collecting, and standardizing 10 existing SAR detection datasets, providing a large-scale and diverse dataset for research purposes. To the best of our knowledge, SARDet-100K is the first COCO-level large-scale multi-class SAR object detection dataset ever created. With this high-quality dataset, we conducted comprehensive experiments and uncovered a crucial challenge in SAR object detection: the substantial disparities between the pretraining on RGB datasets and finetuning on SAR datasets in terms of both data domain and model structure. To bridge these gaps, we propose a novel Multi-Stage with Filter Augmentation (MSFA) pretraining framework that tackles the problems from the perspective of data input, domain transition, and model migration. The proposed MSFA method significantly enhances the performance of SAR object detection models while demonstrating exceptional generalizability and flexibility across diverse models. This work aims to pave the way for further advancements in SAR object detection. The dataset and code is available at https://github.com/zcablii/SARDet_100K.

Submitted 11 March, 2024; originally announced March 2024.

Comments: 22 Pages, 10 Figures, 9 Tables

arXiv:2402.17403 [pdf, other]

Sora Generates Videos with Stunning Geometrical Consistency

Authors: Xuanyi Li, Daquan Zhou, Chenxu Zhang, Shaodong Wei, Qibin Hou, Ming-Ming Cheng

Abstract: The recently developed Sora model [1] has exhibited remarkable capabilities in video generation, sparking intense discussions regarding its ability to simulate real-world phenomena. Despite its growing popularity, there is a lack of established metrics to evaluate its fidelity to real-world physics quantitatively. In this paper, we introduce a new benchmark that assesses the quality of the generated videos based on their adherence to real-world physics principles. We employ a method that transforms the generated videos into 3D models, leveraging the premise that the accuracy of 3D reconstruction is heavily contingent on the video quality. From the perspective of 3D reconstruction, we use the fidelity of the geometric constraints satisfied by the constructed 3D models as a proxy to gauge the extent to which the generated videos conform to real-world physics rules. Project page: https://sora-geometrical-consistency.github.io/

Submitted 27 February, 2024; originally announced February 2024.

Comments: 5 pages, 3 figures

arXiv:2402.15627 [pdf, other]

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Authors: Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao , et al. (7 additional authors not shown)

Abstract: We present the design, implementation and engineering experience in building and deploying MegaScale, a production system for training large language models (LLMs) at the scale of more than 10,000 GPUs. Training LLMs at this scale brings unprecedented challenges to training efficiency and stability. We take a full-stack approach that co-designs the algorithmic and system components across model block and optimizer design, computation and communication overlapping, operator optimization, data pipeline, and network performance tuning. Maintaining high efficiency throughout the training process (i.e., stability) is an important consideration in production given the long extent of LLM training jobs. Many hard stability issues only emerge at large scale, and in-depth observability is the key to address them. We develop a set of diagnosis tools to monitor system components and events deep in the stack, identify root causes, and derive effective techniques to achieve fault tolerance and mitigate stragglers. MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs, improving the MFU by 1.34x compared to Megatron-LM. We share our operational experience in identifying and fixing failures and stragglers. We hope by articulating the problems and sharing our experience from a systems perspective, this work can inspire future LLM systems research.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.09270 [pdf, other]

Fast Window-Based Event Denoising with Spatiotemporal Correlation Enhancement

Authors: Huachen Fang, Jinjian Wu, Qibin Hou, Weisheng Dong, Guangming Shi

Abstract: Previous deep learning-based event denoising methods mostly suffer from poor interpretability and difficulty in real-time processing due to their complex architecture designs. In this paper, we propose window-based event denoising, which simultaneously deals with a stack of events while existing element-based denoising focuses on one event each time. Besides, we give the theoretical analysis based on probability distributions in both temporal and spatial domains to improve interpretability. In temporal domain, we use timestamp deviations between processing events and central event to judge the temporal correlation and filter out temporal-irrelevant events. In spatial domain, we choose maximum a posteriori (MAP) to discriminate real-world event and noise, and use the learned convolutional sparse coding to optimize the objective function. Based on the theoretical analysis, we build Temporal Window (TW) module and Soft Spatial Feature Embedding (SSFE) module to process temporal and spatial information separately, and construct a novel multi-scale window-based event denoising network, named MSDNet. The high denoising accuracy and fast running speed of our MSDNet enables us to achieve real-time denoising in complex scenes. Extensive experimental results verify the effectiveness and robustness of our MSDNet. Our algorithm can remove event noise effectively and efficiently and improve the performance of downstream tasks.

Submitted 14 February, 2024; originally announced February 2024.

arXiv:2402.05375 [pdf, other]

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

Authors: Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang

Abstract: The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to manipulate the text embeddings and remove unwanted content from them. We introduce two contributions, which we refer to as $\textit{soft-weighted regularization}$ and $\textit{inference-time text embedding optimization}$. The first regularizes the text embedding matrix and effectively suppresses the undesired content. The second method aims to further suppress the unwanted content generation of the prompt, and encourages the generation of desired content. We evaluate our method quantitatively and qualitatively on extensive experiments, validating its effectiveness. Furthermore, our method generalizes to both the pixel-space diffusion models (i.e. DeepFloyd-IF) and the latent-space diffusion models (i.e. Stable Diffusion).

Submitted 7 February, 2024; originally announced February 2024.

Comments: ICLR 2024. Our code is available in https://github.com/sen-mao/SuppressEOT

arXiv:2401.11387 [pdf, ps, other]

Rational Solutions to the First Order Difference Equations in the Bivariate Difference Field

Authors: Qing-Hu Hou, Yarong Wei

Abstract: Inspired by Karr's algorithm, we consider the summations involving a sequence satisfying a recurrence of order two. The structure of such summations provides an algebraic framework for solving the difference equations of form $aσ(g)+bg=f$ in the bivariate difference field $(\mathbb{F}(α, β), σ)$, where $a, b,f\in\mathbb{F}(α,β)\setminus\{0\}$ are known binary functions of $α$, $β$, and $α$, $β$ are two algebraically independent transcendental elements, $σ$ is a transformation that satisfies $σ(α)=β$, $σ(β)=uα+vβ$, where $u,v\neq 0\in\mathbb{F}$. Based on it, we then describe algorithms for finding the universal denominator for those equations in the bivariate difference field under certain assumptions. This reduces the general problem of finding the rational solutions of such equations to the problem of finding the polynomial solutions of such equations.

Submitted 20 January, 2024; originally announced January 2024.

arXiv:2312.13613 [pdf, ps, other]

Reduction on the congruences of partial sums of P-recursive sequences

Authors: Qing-Hu Hou, Na Li

Abstract: Hou and Liu developed a telescoping method to prove the congruence of partial sums of P-recursive sequences. We release the requirement on the telescoper and utilize the congruence of the sequence. With this approach, we are able to confirm a conjecture of Sun and find a new congruence on the central trinomial coefficient.

Submitted 21 December, 2023; originally announced December 2023.

Comments: 11 pages

MSC Class: 33F10; 11A07; 05A19; 11B65

arXiv:2312.08866 [pdf, other]

MCANet: Medical Image Segmentation with Multi-Scale Cross-Axis Attention

Authors: Hao Shao, Quansheng Zeng, Qibin Hou, Jufeng Yang

Abstract: Efficiently capturing multi-scale information and building long-range dependencies among pixels are essential for medical image segmentation because of the various sizes and shapes of the lesion regions or organs. In this paper, we present Multi-scale Cross-axis Attention (MCA) to solve the above challenging issues based on the efficient axial attention. Instead of simply connecting axial attention along the horizontal and vertical directions sequentially, we propose to calculate dual cross attentions between two parallel axial attentions to capture global information better. To process the significant variations of lesion regions or organs in individual sizes and shapes, we also use multiple convolutions of strip-shape kernels with different kernel sizes in each axial attention path to improve the efficiency of the proposed MCA in encoding spatial information. We build the proposed MCA upon the MSCAN backbone, yielding our network, termed MCANet. Our MCANet with only 4M+ parameters performs even better than most previous works with heavy backbones (e.g., Swin Transformer) on four challenging tasks, including skin lesion segmentation, nuclei segmentation, abdominal multi-organ segmentation, and polyp segmentation. Code is available at https://github.com/haoshao-nku/medical_seg.

Submitted 19 December, 2023; v1 submitted 14 December, 2023; originally announced December 2023.

arXiv:2312.08735 [pdf, other]

Polyper: Boundary Sensitive Polyp Segmentation

Authors: Hao Shao, Yang Zhang, Qibin Hou

Abstract: We present a new boundary sensitive framework for polyp segmentation, called Polyper. Our method is motivated by a clinical approach in which seasoned medical practitioners often leverage the inherent features of interior polyp regions to tackle blurred boundaries. Inspired by this, we propose explicitly leveraging polyp regions to bolster the model's boundary discrimination capability while minimizing computation. Our approach first extracts boundary and polyp regions from the initial segmentation map through morphological operators. Then, we design the boundary sensitive attention that concentrates on augmenting the features near the boundary regions using the interior polyp regions' characteristics to generate good segmentation results. Our proposed method can be seamlessly integrated with classical encoder networks, like ResNet-50, MiT-B1, and Swin Transformer. To evaluate the effectiveness of Polyper, we conduct experiments on five publicly available challenging datasets, and achieve state-of-the-art performance on all of them. Code is available at https://github.com/haoshao-nku/medical_seg.git.

Submitted 14 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI 2024

arXiv:2312.05830 [pdf, other]

A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation

Authors: Yunheng Li, Zhongyu Li, Shanghua Gao, Qilong Wang, Qibin Hou, Ming-Ming Cheng

Abstract: Effectively modeling discriminative spatio-temporal information is essential for segmenting activities in long action sequences. However, we observe that existing methods are limited in weak spatio-temporal modeling capability due to two forms of decoupled modeling: (i) cascaded interaction couples spatial and temporal modeling, which over-smooths motion modeling over the long sequence, and (ii) joint-shared temporal modeling adopts shared weights to model each joint, ignoring the distinct motion patterns of different joints. We propose a Decoupled Spatio-Temporal Framework (DeST) to address the above issues. Firstly, we decouple the cascaded spatio-temporal interaction to avoid stacking multiple spatio-temporal blocks, while achieving sufficient spatio-temporal interaction. Specifically, DeST performs unified spatial modeling once and divides the spatial features into different groups of sub-features, which then adaptively interact with temporal features from different layers. Since the different sub-features contain distinct spatial semantics, the model could learn the optimal interaction pattern at each layer. Meanwhile, inspired by the fact that different joints move at different speeds, we propose joint-decoupled temporal modeling, which employs independent trainable weights to capture distinctive temporal features of each joint. On four large-scale benchmarks of different scenes, DeST significantly outperforms current state-of-the-art methods with less computational complexity.

Submitted 10 December, 2023; originally announced December 2023.

arXiv:2312.04248 [pdf, other]

TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes

Authors: Xuying Zhang, Bo-Wen Yin, Yuming Chen, Zheng Lin, Yunheng Li, Qibin Hou, Ming-Ming Cheng

Abstract: Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly consist of an object. Meanwhile, the local details of multiple objects may be susceptible to omission due to the existing supervision manner primarily relying on coarse-grained contrast of image-text pairs. To overcome these challenges, we present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles under the contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. Particularly, a cross-modal graph is constructed to align the object points accurately and noun phrases decoupled from the 3D mesh and textual description. Then, we develop a Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss between the words in the textual description and the randomly rendered images is constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes. Our code and results will be made publicly available.

Submitted 7 December, 2023; originally announced December 2023.

arXiv:2311.06772 [pdf, other]

ChatAnything: Facetime Chat with LLM-Enhanced Personas

Authors: Yilin Zhao, Xinbin Yuan, Shanghua Gao, Zhijie Lin, Qibin Hou, Jiashi Feng, Daquan Zhou

Abstract: In this technical report, we target generating anthropomorphized personas for LLM-based characters in an online manner, including visual appearance, personality and tones, with only text descriptions. To achieve this, we first leverage the in-context learning capability of LLMs for personality generation by carefully designing a set of system prompts. We then propose two novel concepts: the mixture of voices (MoV) and the mixture of diffusers (MoD) for diverse voice and appearance generation. For MoV, we utilize the text-to-speech (TTS) algorithms with a variety of pre-defined tones and select the most matching one based on the user-provided text description automatically. For MoD, we combine the recent popular text-to-image generation techniques and talking head algorithms to streamline the process of generating talking objects. We term the whole framework ChatAnything. With it, users can animate anything with any anthropomorphic persona using just a few text inputs. However, we have observed that the anthropomorphic objects produced by current generative models are often undetectable by pre-trained face landmark detectors, leading to failure of the face motion generation, even if these faces possess human-like appearances, because such images are rarely seen during training (e.g., OOD samples). To address this issue, we incorporate pixel-level guidance to infuse human face landmarks during the image generation phase. To benchmark these metrics, we have built an evaluation dataset. Based on it, we verify that the detection rate of the face landmark is significantly increased from 57.0% to 92.5%, thus allowing automatic face animation based on generated speech content. The code and more results can be found at https://chatanything.github.io/.

Submitted 12 November, 2023; originally announced November 2023.

arXiv:2310.19234 [pdf, ps, other]

Log-behavior of the root sequences of P-recursive sequences

Authors: Qing-hu Hou, Zhongjie Li

Abstract: In recent years, Sun has proposed numerous conjectures regarding the log-concavity of root sequences $\{\sqrt[n]{a_n}\}_{n\geqslant 1}$. We establish criteria for the asymptotic log-concavity of $\{\sqrt[n]{a_n}\}_{n\geqslant 1}$ and the asymptotic ratio log-convexity of $\{\sqrt[n]{a_n}\}_{n\geqslant 1}$ for $P$-recursive sequences $\{\sqrt[n]{a_n}\}_{n\geqslant 0}$. Additionally, with the aid of symbolic computation, we present a systematic approach to determine the explicit integer $N$ such that the sequence $\{\sqrt[n]{a_n}\}_{n\geqslant N}$ is log-concave and the sequence $\{\sqrt[n]{a_n}\}_{n\geqslant N}$ is ratio log-convex.

Submitted 29 October, 2023; originally announced October 2023.

Comments: 15 pages

ACM Class: G.2.1

arXiv:2310.13235 [pdf, other]

Auxiliary Features-Guided Super Resolution for Monte Carlo Rendering

Authors: Qiqi Hou, Feng Liu

Abstract: This paper investigates super resolution to reduce the number of pixels to render and thus speed up Monte Carlo rendering algorithms. While great progress has been made to super resolution technologies, it is essentially an ill-posed problem and cannot recover high-frequency details in renderings. To address this problem, we exploit high-resolution auxiliary features to guide super resolution of low-resolution renderings. These high-resolution auxiliary features can be quickly rendered by a rendering engine and at the same time provide valuable high-frequency details to assist super resolution. To this end, we develop a cross-modality Transformer network that consists of an auxiliary feature branch and a low-resolution rendering branch. These two branches are designed to fuse high-resolution auxiliary features with the corresponding low-resolution rendering. Furthermore, we design residual densely-connected Swin Transformer groups to learn to extract representative features to enable high-quality super-resolution. Our experiments show that our auxiliary features-guided super-resolution method outperforms both super-resolution methods and Monte Carlo denoising methods in producing high-quality renderings.

Submitted 19 October, 2023; originally announced October 2023.

Comments: Accepted by CGF

Journal ref: Computer Graphics Forum 2023

arXiv:2310.13215 [pdf, other]

Zone Evaluation: Revealing Spatial Bias in Object Detection

Authors: Zhaohui Zheng, Yuming Chen, Qibin Hou, Xiang Li, Ping Wang, Ming-Ming Cheng

Abstract: A fundamental limitation of object detectors is that they suffer from "spatial bias", and in particular perform less satisfactorily when detecting objects near image borders. For a long time, there has been a lack of effective ways to measure and identify spatial bias, and little is known about where it comes from and what degree it is. To this end, we present a new zone evaluation protocol, extending from the traditional evaluation to a more generalized one, which measures the detection performance over zones, yielding a series of Zone Precisions (ZPs). For the first time, we provide numerical results, showing that the object detectors perform quite unevenly across the zones. Surprisingly, the detector's performance in the 96% border zone of the image does not reach the AP value (Average Precision, commonly regarded as the average detection performance in the entire image zone). To better understand spatial bias, a series of heuristic experiments are conducted. Our investigation excludes two intuitive conjectures about spatial bias that the object scale and the absolute positions of objects barely influence the spatial bias. We find that the key lies in the human-imperceptible divergence in data patterns between objects in different zones, thus eventually forming a visible performance gap between the zones. With these findings, we finally discuss a future direction for object detection, namely, spatial disequilibrium problem, aiming at pursuing a balanced detection ability over the entire image zone. By broadly evaluating 10 popular object detectors and 5 detection datasets, we shed light on the spatial bias of object detectors. We hope this work could raise a focus on detection robustness. The source codes, evaluation protocols, and tutorials are publicly available at https://github.com/Zzh-tju/ZoneEval.

Submitted 1 June, 2024; v1 submitted 19 October, 2023; originally announced October 2023.

Comments: Accepted by IEEE TPAMI

arXiv:2310.03699 [pdf, other]

Taylor coefficients and series involving harmonic numbers

Authors: Qing-Hu Hou, Zhi-Wei Sun

Abstract: During 2022--2023 Z.-W. Sun posed many conjectures on infinite series with summands involving generalized harmonic numbers. Motivated by this, we deduce $31$ series identities involving harmonic numbers, three of which were previously conjectured by the second author. For example, we obtain that \[ \sum_{k=1}^{\infty} \frac{(-1)^k}{k^2{2k \choose k}{3k \choose k}} \big( \frac{7 k-2}{2 k-1} H_{k-1}^{(2)}-\frac{3}{4 k^2} \big)=\frac{π^4}{720}. \] and \[ \sum_{k=1}^\infty \frac{1}{k^2 {2k \choose k}^2} \left( \frac{30k-11}{k(2k-1)} (H_{2k-1}^{(3)} + 2 H_{k-1}^{(3)}) + \frac{27}{8k^4} \right) = 4 ζ(3)^2, \] where $H_n^{(m)}$ denotes $\sum_{0<j \le n}j^{-m}$.

Submitted 26 October, 2023; v1 submitted 5 October, 2023; originally announced October 2023.

Comments: add some new series

MSC Class: Primary 05A19; 11B65; Secondary 33B15

arXiv:2309.09668 [pdf, other]

DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

Authors: Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou

Abstract: We present DFormer, a novel RGB-D pretraining framework to learn transferable representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) Unlike previous works that encode RGB-D information with RGB pretrained backbone, we pretrain the backbone using image-depth pairs from ImageNet-1K, and hence the DFormer is endowed with the capacity to encode RGB-D representations; 2) DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel building block design. DFormer avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB pretrained backbones, which widely lies in existing methods but has not been resolved. We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. Our code is available at: https://github.com/VCIP-RGBD/DFormer.

Submitted 7 February, 2024; v1 submitted 18 September, 2023; originally announced September 2023.

Comments: Accepted by ICLR 2024

arXiv:2309.04399 [pdf, other]

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

Authors: Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng

Abstract: Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. Nevertheless, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning between the prompt and the output image. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in semantic information embedding from the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can significantly improve the text-to-image consistency with negligible computation overhead compared to the original diffusion models.

Submitted 8 September, 2023; originally announced September 2023.

arXiv:2308.05778 [pdf]

doi 10.1016/j.matt.2023.11.014

Current percolation model for the special resistivity behavior observed in Cu-doped Apatite

Authors: Qiang Hou, Wei Wei, Xin Zhou, Xinyue Wang, Yue Sun, ZhiXiang Shi

Abstract: Since the initial report of the potential occurrence of room-temperature superconductivity under normal pressure [arXiv: 2307.12008], there has been significant interest in the field of condensed matter physics regarding Cu-doped Apatite (Pb10-xCux(PO4)6O). In this study, we performed temperature-dependent resistivity measurements on the synthesized Pb10-xCux(PO4)6O samples. The structure of the sample was confirmed to match the reference literature through X-ray diffraction analysis. Remarkably, we observed four distinct types of resistivity behaviors within samples from the same pellet: (1) A semiconductor-like behavior characterized by a decrease in resistivity as the temperature is lowered. (2) A gradual reduction in resistivity, reaching an exceptionally small value that falls below the resolution limits of our measurement equipment. (3) An abrupt drop in resistivity to a low value at ~ 250 K. (4) An almost linear reduction in resistivity exhibiting a transition at approximately 7 K (possibly associated with Pb). Following a thorough compositional analysis, we proposed a current percolation model, based on the formation of a Cu/Pb current channel, to elucidate the observed special resistivity behaviors. It is important to note that the Meissner effect was not observed in our magnetization measurements. Consequently, we reached the conclusion that the presence of superconductivity in Cu-doped Apatite has yet to be substantiated.

Submitted 10 August, 2023; originally announced August 2023.

Comments: This paper represents a continuation of our previous study [arXiv:2308.01192], now offering a more comprehensive analysis of the collected data

Journal ref: Matter 6, 4408-4418 (2023)

arXiv:2308.05480 [pdf, other]

YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection

Authors: Yuming Chen, Xinbin Yuan, Ruiqi Wu, Jiabao Wang, Qibin Hou, Ming-Ming Cheng

Abstract: We aim at providing the object detection community with an efficient and performant object detector, termed YOLO-MS. The core design is based on a series of investigations on how convolutions with different kernel sizes affect the detection performance of objects at different scales. The outcome is a new strategy that can strongly enhance multi-scale feature representations of real-time object detectors. To verify the effectiveness of our strategy, we build a network architecture, termed YOLO-MS. We train our YOLO-MS on the MS COCO dataset from scratch without relying on any other large-scale datasets, like ImageNet, or pre-trained weights. Without bells and whistles, our YOLO-MS outperforms the recent state-of-the-art real-time object detectors, including YOLO-v7 and RTMDet, when using a comparable number of parameters and FLOPs. Taking the XS version of YOLO-MS as an example, with only 4.5M learnable parameters and 8.7G FLOPs, it can achieve an AP score of 43%+ on MS COCO, which is about 2%+ higher than RTMDet with the same model size. Moreover, our work can also be used as a plug-and-play module for other YOLO models. Typically, our method significantly improves the AP of YOLOv8 from 37%+ to 40%+ with even fewer parameters and FLOPs. Code is available at https://github.com/FishAndWasabi/YOLO-MS.

Submitted 10 August, 2023; originally announced August 2023.

arXiv:2308.01192 [pdf]

doi 10.1016/j.matt.2023.11.014

Observation of zero resistance above 100$^\circ$ K in Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O

Authors: Qiang Hou, Wei Wei, Xin Zhou, Yue Sun, Zhixiang Shi

Abstract: Room-temperature superconductivity has always been regarded as the ultimate goal in the fields of solid-state physics and materials science, with its realization holding revolutionary significance, capable of triggering significant changes in energy transmission and storage. However, achieving it poses various challenges. Recent research revealed that material Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O displays room-temperature superconductivity under atmospheric pressure, sparking global interest in further exploration. Here, we utilized solid-phase synthesis to obtain a polycrystalline sample of Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O. X-ray diffraction confirmed its structural consistency with referenced literature. Zero resistance, which is important evidence for superconductivity, was observed above 100$^\circ$ K under ambient pressure in our experiment. Our finding indicates that Pb$_{10-x}$Cu$_x$(PO$_4$)$_6$O is a possible candidate for searching high-temperature superconductors.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 7 pages, 3 figures

Journal ref: Matter 6, 4408-4418 (2023)

arXiv:2307.02176 [pdf, other]

Molecular Dynamics

Authors: Halima Mouhib, Juami H. M. van Gils, Jose Gavaldá-Garciá, Qingzhen Hou, Ali May, Arriën Symon Rauh, Jocelyne Vreede, Sanne Abeln, K. Anton Feenstra

Abstract: While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction into Structural Bioinformatics, which is where the previous topics meet to explore three dimensional protein structures through computational analysis. We provide an overview of existing computational techniques, to validate, simulate, predict and analyse protein structures. More importantly, it will aim to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. We know that many proteins have functional motions, and in Chapter "Structure Determination" we already introduced the famous example of the allosteric cooperative binding of oxygen to the haem group in hemoglobin. However, experimentally, such motions are hard to observe. Here, we will introduce MD simulations to investigate the dynamic behaviour of proteins. In a simulation the forces and interactions between particles are used to numerically derive the resulting three-dimensional movement of these particles over a certain time-scale. We will also highlight some applications, and will see how simulation results may be interpreted.

Submitted 6 July, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: editorial responsibility: Halima Mouhib, Sanne Abeln, K. Anton Feenstra. This chapter is part of the book "Introduction to Protein Structural Bioinformatics". The Preface arXiv:1801.09442 contains links to all the (published) chapters. The update adds available arXiv hyperlinks for the chapters
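
Illustrative sketch (not from the chapter): the Molecular Dynamics abstract above says that forces between particles are used to numerically derive their motion over time. A minimal way to see this is velocity Verlet integration of a single particle in a one-dimensional harmonic potential. The choice of integrator, the potential, and all names here (`velocity_verlet`, `harmonic_force`, `total_energy`) are assumptions made for illustration only.

```python
import math


def harmonic_force(x, k=1.0):
    """Force from a 1D harmonic potential U(x) = 0.5 * k * x**2 (assumed toy model)."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy for the toy harmonic system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Advance one particle with the velocity Verlet scheme.

    Returns the lists of positions and velocities, including the initial state.
    """
    xs, vs = [x], [v]
    a = force(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update from current force
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update from averaged acceleration
        a = a_new
        xs.append(x)
        vs.append(v)
    return xs, vs


if __name__ == "__main__":
    xs, vs = velocity_verlet(1.0, 0.0, harmonic_force, mass=1.0, dt=0.01, n_steps=1000)
    # For this toy system the exact trajectory is x(t) = cos(t).
    print(xs[-1], math.cos(10.0))
    print(total_energy(xs[-1], vs[-1]))
```

A real protein simulation replaces the toy force with a molecular force field over many atoms in three dimensions, but the time-stepping loop has the same shape.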

arXiv:2307.02173 [pdf, other]

Function Prediction

Authors: Bas Stringer, Annika Jacobsen, Qingzhen Hou, Hans de Ferrante, Olga Ivanova, Katharina Waury, Jose Gavaldá-Garciá, Sanne Abeln, K. Anton Feenstra

Abstract: While many good textbooks are available on Protein Structure, Molecular Simulations, Thermodynamics and Bioinformatics methods in general, there is no good introductory level book for the field of Structural Bioinformatics. This book aims to give an introduction into Structural Bioinformatics, which is where the previous topics meet to explore three dimensional protein structures through computational analysis. We provide an overview of existing computational techniques, to validate, simulate, predict and analyse protein structures. More importantly, it will aim to provide practical knowledge about how and when to use such techniques. We will consider proteins from three major vantage points: Protein structure quantification, Protein structure prediction, and Protein simulation & dynamics. There are still huge gaps in understanding the molecular function of proteins. This raises the question on how we may predict protein function, when little to no knowledge from direct experiments is available. Protein function is a broad concept which spans different scales: from quantum scale effects for catalyzing enzymatic reactions, to phenotypes that manifest at the organism level. In fact, many of these functional scales are entirely different research areas. Here, we will consider prediction of a smaller range of functions, roughly spanning the protein residue-level up to the pathway level. We will give a conceptual overview of which functional aspects of proteins we can predict, which methods are currently available, and how well they work in practice.

Submitted 6 July, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: editorial responsibility: K. Anton Feenstra, Sanne Abeln. This chapter is part of the book "Introduction to Protein Structural Bioinformatics". The Preface arXiv:1801.09442 contains links to all the (published) chapters. The update adds available arXiv hyperlinks for the chapters

arXiv:2306.13277 [pdf, ps, other]

Meta-Gating Framework for Fast and Continuous Resource Optimization in Dynamic Wireless Environments

Authors: Qiushuo Hou, Mengyuan Lee, Guanding Yu, Yunlong Cai

Abstract: With the great success of deep learning (DL) in image classification, speech recognition, and other fields, more and more studies have applied various neural networks (NNs) to wireless resource allocation. Generally speaking, these artificial intelligence (AI) models are trained under some special learning hypotheses, especially that the statistics of the training data are static during the training stage. However, the distribution of channel state information (CSI) is constantly changing in the real-world wireless communication environment. Therefore, it is essential to study effective dynamic DL technologies to solve wireless resource allocation problems. In this paper, we propose a novel framework, named meta-gating, for solving resource allocation problems in an episodically dynamic wireless environment, where the CSI distribution changes over periods and remains constant within each period. The proposed framework, consisting of an inner network and an outer network, aims to adapt to the dynamic wireless environment by achieving three important goals, i.e., seamlessness, quickness and continuity. Specifically, for the former two goals, we propose a training method by combining a model-agnostic meta-learning (MAML) algorithm with an unsupervised learning mechanism. With this training method, the inner network is able to adapt quickly to different channel distributions because of the good initialization. As for the goal of continuity, the outer network can learn to evaluate the importance of the inner network's parameters under different CSI distributions, and then decide which subset of the inner network should be activated through the gating operation. Additionally, we theoretically analyze the performance of the proposed meta-gating framework.

Submitted 22 June, 2023; originally announced June 2023.

Comments: accepted by IEEE TCOM

arXiv:2306.11369 [pdf, other]

CrossKD: Cross-Head Knowledge Distillation for Object Detection

Authors: Jiabao Wang, Yuming Chen, Zhaohui Zheng, Xiang Li, Ming-Ming Cheng, Qibin Hou

Abstract: Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. In this paper, we present a general and effective prediction mimicking distillation scheme, called CrossKD, which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. This manner relieves the student's head from receiving contradictory supervision signals from the annotations and the teacher's predictions, greatly improving the student's detection performance. Moreover, as mimicking the teacher's predictions is the target of KD, CrossKD offers more task-oriented information in contrast with feature imitation. On MS COCO, with only prediction mimicking losses applied, our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, outperforming all existing KD methods. In addition, our method also works well when distilling detectors with heterogeneous backbones. Code is available at https://github.com/jbwang1997/CrossKD.

Submitted 15 April, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2306.07532 [pdf, other]

Referring Camouflaged Object Detection

Authors: Xuying Zhang, Bowen Yin, Zheng Lin, Qibin Hou, Deng-Ping Fan, Ming-Ming Cheng

Abstract: We consider the problem of referring camouflaged object detection (Ref-COD), a new task that aims to segment specified camouflaged objects based on a small set of referring images with salient target objects. We first assemble a large-scale dataset, called R2C7K, which consists of 7K images covering 64 object categories in real-world scenarios. Then, we develop a simple but strong dual-branch framework, dubbed R2CNet, with a reference branch embedding the common representations of target objects from referring images and a segmentation branch identifying and segmenting camouflaged objects under the guidance of the common representations. In particular, we design a Referring Mask Generation module to generate pixel-level prior mask and a Referring Feature Enrichment module to enhance the capability of identifying specified camouflaged objects. Extensive experiments show the superiority of our Ref-COD methods over their COD counterparts in segmenting specified camouflaged objects and identifying the main body of target objects. Our code and dataset are publicly available at https://github.com/zhangxuying1004/RefCOD.

Submitted 11 July, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2306.04300 [pdf, other]

CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation

Authors: Boyuan Sun, Yuqi Yang, Le Zhang, Ming-Ming Cheng, Qibin Hou

Abstract: This paper presents a simple but performant semi-supervised semantic segmentation approach, called CorrMatch. Previous approaches mostly employ complicated training strategies to leverage unlabeled data but overlook the role of correlation maps in modeling the relationships between pairs of locations. We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information, which previous works have omitted. Motivated by these, we aim to improve the use efficiency of unlabeled data by designing two novel label propagation strategies. First, we propose to conduct pixel propagation by modeling the pairwise similarities of pixels to spread the high-confidence pixels and dig out more. Then, we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps. CorrMatch achieves great performance on popular segmentation benchmarks. Taking the DeepLabV3+ with ResNet-101 backbone as our segmentation model, we receive a 76%+ mIoU score on the Pascal VOC 2012 dataset with only 92 annotated images. Code is available at https://github.com/BBBBchan/CorrMatch.

Submitted 10 December, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

arXiv:2305.15248 [pdf, other]

Delving Deeper into Data Scaling in Masked Image Modeling

Authors: Cheng-Ze Lu, Xiaojie Jin, Qibin Hou, Jun Hao Liew, Ming-Ming Cheng, Jiashi Feng

Abstract: Understanding whether self-supervised learning methods can scale with unlimited data is crucial for training large-scale models. In this work, we conduct an empirical study on the scaling capability of masked image modeling (MIM) methods (e.g., MAE) for visual recognition. Unlike most previous works that depend on the widely-used ImageNet dataset, which is manually curated and object-centric, we take a step further and propose to investigate this problem in a more practical setting. Specifically, we utilize the web-collected Coyo-700M dataset. We randomly sample varying numbers of training images from the Coyo dataset and construct a series of sub-datasets, containing 0.5M, 1M, 5M, 10M, and 100M images, for pre-training. Our goal is to investigate how the performance changes on downstream tasks when scaling with different sizes of data and models. The study reveals that: 1) MIM can be viewed as an effective method to improve the model capacity when the scale of the training data is relatively small; 2) Strong reconstruction targets can endow the models with increased capacities on downstream tasks; 3) MIM pre-training is data-agnostic under most scenarios, which means that the strategy of sampling pre-training data is non-critical. We hope these observations could provide valuable insights for future research on MIM.

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.00498 [pdf, ps, other]

Ramanujan-inspired series for $1/π$ involving harmonic numbers

Authors: Qinghu Hou, Haihong He, Xiaoxia Wang

Abstract: By applying the derivative operator to the known identities from hypergeometric series or WZ pairs, we obtain seven series associated with harmonic numbers. Specifically, six of them are Ramanujan-like formulas for $1/π$ and the remaining one contains harmonic numbers of order $2$. As conclusions, Sun's five conjectural series are proved.

Submitted 8 July, 2023; v1 submitted 30 April, 2023; originally announced May 2023.

Comments: 11 pages
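
Illustrative sketch (not from the paper): the abstract above refers to harmonic numbers, harmonic numbers of order 2, and Ramanujan-like series for $1/π$. As background, the standard definitions are $H_n = \sum_{k=1}^{n} 1/k$ and $H_n^{(2)} = \sum_{k=1}^{n} 1/k^2$, and one classical Ramanujan series is $\sum_{k \ge 0} \binom{2k}{k}^3 (6k+1)/256^k = 4/π$. The code below evaluates these numerically. The function names are hypothetical, and the series shown is the classical one, not one of the seven series obtained in the paper.

```python
import math
from fractions import Fraction


def harmonic(n, order=1):
    """Exact generalized harmonic number: sum of 1/k**order for k = 1..n."""
    return sum((Fraction(1, k ** order) for k in range(1, n + 1)), Fraction(0))


def ramanujan_partial_sum(n_terms):
    """Partial sum of the classical series sum_k C(2k,k)^3 (6k+1) / 256^k, which tends to 4/pi."""
    total = Fraction(0)
    for k in range(n_terms):
        total += Fraction(math.comb(2 * k, k) ** 3 * (6 * k + 1), 256 ** k)
    return float(total)


if __name__ == "__main__":
    print(harmonic(3))       # H_3 as an exact fraction
    print(harmonic(3, 2))    # order-2 harmonic number for n = 3
    print(ramanujan_partial_sum(40), 4 / math.pi)
```

Exact rational arithmetic is used for the partial sums so that only the final conversion to floating point rounds.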

arXiv:2305.00371 [pdf, other]

doi 10.3847/1538-4357/accf9c

New 26P(p,γ)27S thermonuclear reaction rate and its astrophysical implication in rp-process

Authors: S. Q. Hou, J. B. Liu, T. C. L. Trueman, J. G. Li, M. Pignatari, C. Bertulani, X. X. Xu

Abstract: Accurate nuclear reaction rates for 26P(p,γ)27S are pivotal for a comprehensive understanding of rp-process nucleosynthesis path in the region of proton-rich sulfur and phosphorus isotopes. However, large uncertainties still exist in the current rate of 26P(p,γ)27S because of the lack of the nuclear mass and the energy level structure information of 27S. We reevaluate this reaction rate using the experimentally constrained 27S mass, together with the shell-model predicted level structure. It is found that the 26P(p,γ)27S reaction rate is dominated by a direct-capture (DC) reaction mechanism despite the presence of three resonances at E = 1.104, 1.597, 1.777 MeV above the proton threshold in 27S. The new rate is overall smaller than the other previous rates from Hauser-Feshbach statistical model by at least one order of magnitude in the temperature range of X-ray burst interest. In addition, we consistently update the photodisintegration rate using the new 27S mass. The influence of new rates of forward and reverse reaction in the abundances of isotopes produced in rp-process is explored by post-processing nucleosynthesis calculations. The final abundance ratio of 27S/26P obtained using the new rates is only 10% of that from the old rate. The abundance flow calculations show the reaction path 26P(p,γ)27S(β+,ν)27P is not as important as thought previously for producing 27P. The adoption of the new reaction rates for 26P(p,γ)27S only reduces the final production of aluminum by 7.1%, and has no discernible impact on the yield of other elements.

Submitted 29 April, 2023; originally announced May 2023.

arXiv:2304.13240 [pdf, other]

Structure Diagram Recognition in Financial Announcements

Authors: Meixuan Qiao, Jun Wang, Junfu Xiang, Qiyu Hou, Ruixuan Li

Abstract: Accurately extracting structured data from structure diagrams in financial announcements is of great practical importance for building financial knowledge graphs and further improving the efficiency of various financial applications. First, we proposed a new method for recognizing structure diagrams in financial announcements, which can better detect and extract different types of connecting lines, including straight lines, curves, and polylines of different orientations and angles. Second, we developed a two-stage method to efficiently generate the industry's first benchmark of structure diagrams from Chinese financial announcements, where a large number of diagrams were synthesized and annotated using an automated tool to train a preliminary recognition model with fairly good performance, and then a high-quality benchmark can be obtained by automatically annotating the real-world structure diagrams using the preliminary model and then making few manual corrections. Finally, we experimentally verified the significant performance advantage of our structure diagram recognition method over previous methods.

Submitted 1 May, 2023; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: ICDAR2023

arXiv:2304.09790 [pdf, other]

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation

Authors: Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, Ming-Ming Cheng

Abstract: We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations for updating both flows and the interpolated content feature. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately. Combining these two designs enables us to generate promising task-oriented flows and reduce the difficulties in modeling large motions and handling occluded areas during frame interpolation. These qualities promote our model to achieve state-of-the-art performance on various benchmarks with high efficiency. Moreover, our convolution-based model competes favorably compared to Transformer-based models in terms of accuracy and efficiency. Our code is available at https://github.com/MCG-NKU/AMT.

Submitted 19 April, 2023; originally announced April 2023.

Comments: Accepted to CVPR2023

arXiv:2304.03582 [pdf, other]

doi 10.3847/1538-4357/acca81

Reaction kinetics of CN + toluene and its implication on the productions of aromatic nitriles in the Taurus molecular cloud and Titan's atmosphere

Authors: Mengqi Wu, Xiaoqing Wu, Qifeng Hou, Jiangbin Huang, Dongfeng Zhao, Feng Zhang

Abstract: Reactions between cyano radical and aromatic hydrocarbons are believed to be important pathways for the formation of aromatic nitriles in the interstellar medium (ISM) including those identified in the Taurus molecular cloud (TMC-1). Aromatic nitriles might participate in the formation of polycyclic aromatic nitrogen containing hydrocarbons (PANHs) in Titan's atmosphere. Here, ab initio kinetics simulations reveal a high efficiency of $\rm \sim10^{-10}~cm^{3}~s^{-1}$ and the competition of the different products of 30-1800 K and $10^{-7}$-100 atm of the CN + toluene reaction. In the star-forming region of TMC-1 environment, the product yields of benzonitrile and tolunitriles for CN reacting with toluene may be approximately 17% and 83%, respectively. The detection of main products, tolunitriles, can serve as proxies for the undetected toluene in the ISM due to their much larger dipole moments. The competition between bimolecular and unimolecular products is extremely intense under the warmer and denser PANH forming region of Titan's stratosphere. The computational results show that the fractions of tolunitriles, adducts, and benzonitrile are 19%-68%, 15%-64% and 17%, respectively, at 150-200 K and 0.0001-0.001 atm (Titan's stratosphere). Then, benzonitrile and tolunitriles may contribute to the formation of PANHs by consecutive $\rm C_{2}H$ additions. Kinetic information of aromatic nitriles for the CN + toluene reaction calculated here helps to explain the formation mechanism of polycyclic aromatic hydrocarbons (PAHs) or PANHs under different interstellar environments and constrains corresponding astrochemical models.

Submitted 7 April, 2023; originally announced April 2023.

arXiv:2303.15649 [pdf, other]

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

Authors: Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang

Abstract: A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. They either finetune the model, or invert the image in the latent space of the pretrained model. However, they suffer from two problems: (1) Unsatisfying results for selected regions, and unexpected changes in nonselected regions. (2) They require careful text prompt editing where the prompt should include all visual objects in the input image. To address this, we propose two improvements: (1) Only optimizing the input of the value linear network in the cross-attention layers is sufficiently powerful to reconstruct a real image. (2) We propose attention regularization to preserve the object-like attention maps after editing, enabling us to obtain accurate style editing without invoking significant structural changes. We further improve the editing technique which is used for the unconditional branch of classifier-free guidance, as well as the conditional one as used by P2P. Extensive experimental prompt-editing results on a variety of images demonstrate qualitatively and quantitatively that our method has superior editing capabilities compared with existing and concurrent works.

Submitted 20 August, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

arXiv:2303.09735 [pdf, other]

SRFormer: Permuted Self-Attention for Single Image Super-Resolution

Authors: Yupeng Zhou, Zhen Li, Chun-Le Guo, Song Bai, Ming-Ming Cheng, Qibin Hou

Abstract: Previous works have shown that increasing the window size for Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve the model performance but the computation overhead is also considerable. In this paper, we present SRFormer, a simple but novel method that can enjoy the benefit of large window self-attention but introduces even less computational burden. The core of our SRFormer is the permuted self-attention (PSA), which strikes an appropriate balance between the channel and spatial information for self-attention. Our PSA is simple and can be easily applied to existing super-resolution networks based on window self-attention. Without any bells and whistles, we show that our SRFormer achieves a 33.86dB PSNR score on the Urban100 dataset, which is 0.46dB higher than that of SwinIR but uses fewer parameters and computations. We hope our simple and effective approach can serve as a useful tool for future research in super-resolution model design.

Submitted 16 March, 2023; originally announced March 2023.

arXiv:2303.09030 [pdf, other]

Large Selective Kernel Network for Remote Sensing Object Detection

Authors: Yuxuan Li, Qibin Hou, Zhaohui Zheng, Ming-Ming Cheng, Jian Yang, Xiang Li

Abstract: Recent research on remote sensing object detection has largely focused on improving the representation of oriented bounding boxes but has overlooked the unique prior knowledge presented in remote sensing scenarios. Such prior knowledge can be useful because tiny remote sensing objects may be mistakenly detected without referencing a sufficiently long-range context, and the long-range context required by different types of objects can vary. In this paper, we take these priors into account and propose the Large Selective Kernel Network (LSKNet). LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To the best of our knowledge, this is the first time that large and selective kernel mechanisms have been explored in the field of remote sensing object detection. Without bells and whistles, LSKNet sets new state-of-the-art scores on standard benchmarks, i.e., HRSC2016 (98.46% mAP), DOTA-v1.0 (81.85% mAP) and FAIR1M-v1.0 (47.87% mAP). Based on a similar technique, we ranked 2nd in the 2022 Greater Bay Area International Algorithm Competition. Code is available at https://github.com/zcablii/Large-Selective-Kernel-Network.

Submitted 19 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: Preprint, under review

Showing 1–50 of 180 results for author: Hou, Q