Search | arXiv e-print repository

Improving Global Weather and Ocean Wave Forecast with Large Artificial Intelligence Models

Authors: Fenghua Ling, Lin Ouyang, Boufeniza Redouane Larbi, Jing-Jia Luo, Tao Han, Xiaohui Zhong, Lei Bai

Abstract: The rapid advancement of artificial intelligence technologies, particularly in recent years, has led to the emergence of several large-parameter artificial intelligence weather forecast models. These models represent a significant breakthrough, overcoming the limitations of traditional numerical weather prediction models and indicating the emergence of potentially powerful tools for atmosphere-ocean forecasts. This study explores the evolution of these advanced artificial intelligence forecast models and, based on the identified commonalities, proposes the "Three Large Rules" to measure their development. We discuss the potential of artificial intelligence in revolutionizing numerical weather prediction and briefly outline the underlying reasons for its great potential. While acknowledging the high accuracy, computational efficiency, and ease of deployment of large artificial intelligence forecast models, we also emphasize the irreplaceable value of traditional numerical forecasts and explore the challenges in the future development of large-scale artificial intelligence atmosphere-ocean forecast models. We believe that the optimal future of atmosphere-ocean weather forecasting lies in achieving a seamless integration of artificial intelligence and traditional numerical models. Such a synthesis is anticipated to offer a more advanced and reliable approach for improved atmosphere-ocean forecasts. Additionally, we illustrate how forecasters can adapt and leverage advanced artificial intelligence models through an example: building a large artificial intelligence model for global ocean wave forecasting.

Submitted 18 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.16416 [pdf, other]

Endo-4DGS: Endoscopic Monocular Scene Reconstruction with 4D Gaussian Splatting

Authors: Yiming Huang, Beilei Cui, Long Bai, Ziqi Guo, Mengya Xu, Mobarakol Islam, Hongliang Ren

Abstract: In the realm of robot-assisted minimally invasive surgery, dynamic scene reconstruction can significantly enhance downstream tasks and improve surgical outcomes. Neural Radiance Fields (NeRF)-based methods have recently risen to prominence for their exceptional ability to reconstruct scenes but are hampered by slow inference speed, prolonged training, and inconsistent depth estimation. Some previous work utilizes ground-truth depth for optimization, but such depth is hard to acquire in the surgical domain. To overcome these obstacles, we present Endo-4DGS, a real-time endoscopic dynamic reconstruction approach that utilizes 3D Gaussian Splatting (GS) for 3D representation. Specifically, we propose lightweight MLPs to capture temporal dynamics with Gaussian deformation fields. To obtain a satisfactory Gaussian initialization, we exploit a powerful depth estimation foundation model, Depth-Anything, to generate pseudo-depth maps as a geometry prior. We additionally propose confidence-guided learning to tackle the ill-posed problems in monocular depth estimation and enhance the depth-guided reconstruction with surface normal constraints and depth regularization. Our approach has been validated on two surgical datasets, where it can effectively render in real-time, compute efficiently, and reconstruct with remarkable accuracy.

Submitted 2 April, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

arXiv:2401.12681 [pdf, other]

Non-Neighbors Also Matter to Kriging: A New Contrastive-Prototypical Learning

Authors: Zhishuai Li, Yunhao Nie, Ziyue Li, Lei Bai, Yisheng Lv, Rui Zhao

Abstract: Kriging aims at estimating the attributes of unsampled geo-locations from observations in the spatial vicinity or physical connections, which helps mitigate skewed monitoring caused by under-deployed sensors. Existing works assume that neighbors' information offers the basis for estimating the attributes of the unobserved target while ignoring non-neighbors. However, non-neighbors could also offer constructive information, and neighbors could also be misleading. To this end, we propose "Contrastive-Prototypical" self-supervised learning for Kriging (KCP) to refine valuable information from neighbors and recycle that from non-neighbors. As a pre-trained paradigm, we conduct the Kriging task from a new perspective of representation: we aim to first learn robust and general representations and then recover attributes from representations. A neighboring contrastive module is designed that coarsely learns the representations by narrowing the representation distance between the target and its neighbors while pushing away the non-neighbors. In parallel, a prototypical module is introduced to identify similar representations via exchanged prediction, thus refining the misleading neighbors and recycling the useful non-neighbors from the neighboring contrast component. As a result, not all of the neighbors, but some of the non-neighbors, will be used to infer the target. To encourage the two modules above to learn general and robust representations, we design an adaptive augmentation module that incorporates data-driven attribute augmentation and centrality-based topology augmentation over the spatiotemporal Kriging graph data. Extensive experiments on real-world datasets demonstrate the superior performance of KCP compared to its peers with 6% improvements and exceptional transferability and robustness. The code is available at https://github.com/bonaldli/KCP

Submitted 23 January, 2024; originally announced January 2024.

Comments: Accepted in AISTATS 2024

arXiv:2401.12505 [pdf, other]

doi 10.1088/1402-4896/ad4c16

Topological magnons in a non-coplanar magnetic order on the triangular lattice

Authors: Linli Bai, Ken Chen

Abstract: The bond-dependent Kitaev interaction $K$ is familiar in the effective spin model of transition metal compounds with octahedral ligands. In this work, we find a peculiar non-coplanar magnetic order can be formed with the help of $K$ and next-nearest neighbor Heisenberg coupling $J_2$ on the triangular lattice. It can be seen as a miniature version of skyrmion crystal, since it has nine spins and an integer topological number in a magnetic unit cell. The magnon excitations in such an order are studied by the linear spin-wave theory. Of note is that the change in the relative size of $J_2$ and $K$ produces topological magnon phase transitions although the topological number remains unchanged. We also calculated the experimentally observable thermal Hall conductivity, and found that the signs of thermal Hall conductivity will change with topological phase transitions or temperature changes in certain regions.

Submitted 27 April, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.11960 [pdf, other]

Observation-Guided Meteorological Field Downscaling at Station Scale: A Benchmark and a New Method

Authors: Zili Liu, Hao Chen, Lei Bai, Wenyuan Li, Keyan Chen, Zhengyi Wang, Wanli Ouyang, Zhengxia Zou, Zhenwei Shi

Abstract: Downscaling (DS) of meteorological variables involves obtaining high-resolution states from low-resolution meteorological fields and is an important task in weather forecasting. Previous methods based on deep learning treat downscaling as a super-resolution task in computer vision and utilize high-resolution gridded meteorological fields as supervision to improve resolution at specific grid scales. However, this approach has struggled to align with the continuous distribution characteristics of meteorological fields, leading to an inherent systematic bias between the downscaled results and the actual observations at meteorological stations. In this paper, we extend meteorological downscaling to arbitrary scattered station scales, establish a brand new benchmark and dataset, and retrieve meteorological states at any given station location from a coarse-resolution meteorological field. Inspired by data assimilation techniques, we integrate observational data into the downscaling process, providing multi-scale observational priors. Building on this foundation, we propose a new downscaling model based on hypernetwork architecture, namely HyperDS, which efficiently integrates different observational information into the model training, achieving continuous scale modeling of the meteorological field. Through extensive experiments, our proposed method outperforms other specially designed baseline models on multiple surface variables. Notably, the mean squared error (MSE) for wind speed and surface pressure improved by 67% and 19.5% compared to other methods. We will release the dataset and code subsequently.

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.09274 [pdf, ps, other]

Avoiding strict saddle points of nonconvex regularized problems

Authors: Luwei Bai

Abstract: We introduce a strict saddle property for $\ell_p$ regularized functions, and propose an iterative reweighted $\ell_1$ algorithm to solve the $\ell_p$ regularized problems. The algorithm is guaranteed to converge only to local minimizers when randomly initialized. The strict saddle property is shown to be generic on these sparse optimization problems. These analyses, as well as the proposed algorithm, can be easily extended to general nonconvex regularized problems.

Submitted 9 June, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: 22 pages

arXiv:2401.06013 [pdf, other]

Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

Authors: Beilei Cui, Mobarakol Islam, Long Bai, Hongliang Ren

Abstract: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt it with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on SCARED, a MICCAI challenge dataset collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO sheds some light on the successful adaptation of foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on pre-trained weights in computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.

Submitted 12 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: Accepted by IPCAI 2024 (IJCAR Special Issue)
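
The Surgical-DINO abstract describes freezing the image encoder and training only LoRA layers. As a rough illustration of that general idea (not the authors' code), here is a minimal NumPy sketch of a single LoRA-augmented linear layer. The class name `LoRALinear`, the rank, the scaling `alpha / rank`, and the zero initialization of `B` are assumptions taken from the common LoRA formulation, not from the abstract.

```python
import numpy as np


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update.

    Output: x @ W.T + (alpha / rank) * x @ A.T @ B.T
    W stays fixed; only A and B would be updated during training.
    """

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                    # trainable up-projection, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        frozen = x @ self.weight.T              # path through the frozen weight
        update = (x @ self.A.T) @ self.B.T      # low-rank correction
        return frozen + self.scale * update

    def trainable_parameters(self):
        # Number of parameters that would be optimized (A and B only).
        return self.A.size + self.B.size
```

Because `B` starts at zero, the layer initially reproduces the frozen layer exactly, and the number of trainable parameters grows with the rank rather than with the full weight matrix.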

arXiv:2401.04148 [pdf, other]

Online Test-Time Adaptation of Spatial-Temporal Traffic Flow Forecasting

Authors: Pengxin Guo, Pengrong Jin, Ziyue Li, Lei Bai, Yu Zhang

Abstract: Accurate spatial-temporal traffic flow forecasting is crucial in aiding traffic managers in implementing control measures and assisting drivers in selecting optimal travel routes. Traditional deep-learning-based methods for traffic flow forecasting typically rely on historical data to train their models, which are then used to make predictions on future data. However, the performance of the trained model usually degrades due to the temporal drift between the historical and future data. To make the model trained on historical data better adapt to future data in a fully online manner, this paper conducts the first study of online test-time adaptation techniques for spatial-temporal traffic flow forecasting problems. To this end, we propose an Adaptive Double Correction by Series Decomposition (ADCSD) method, which first decomposes the output of the trained model into seasonal and trend-cyclical parts and then corrects them by two separate modules during the testing phase using the latest observed data entry by entry. In the proposed ADCSD method, instead of fine-tuning the whole trained model during the testing phase, a lite network is attached after the trained model, and only the lite network is fine-tuned in the testing process each time a data entry is observed. Moreover, because different time series variables may have different levels of temporal drift, two adaptive vectors are adopted to provide different weights for different time series variables. Extensive experiments on four real-world traffic flow forecasting datasets demonstrate the effectiveness of the proposed ADCSD method. The code is available at https://github.com/Pengxin-Guo/ADCSD.

Submitted 8 January, 2024; originally announced January 2024.
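
The ADCSD abstract mentions decomposing a model's output into seasonal and trend-cyclical parts. The abstract does not say how that split is computed, so the sketch below assumes a centred moving average with edge padding, a common choice in time series models. The function name `decompose` and the `window` parameter are hypothetical.

```python
import numpy as np


def decompose(series, window=5):
    """Split a 1-D series into (trend, seasonal) with a moving average.

    The trend-cyclical part is a centred moving average; the seasonal part
    is the residual, so trend + seasonal reconstructs the input exactly.
    """
    series = np.asarray(series, dtype=float)
    pad_left = (window - 1) // 2
    pad_right = window - 1 - pad_left
    # Repeat the edge values so the smoothed series keeps the input length.
    padded = np.concatenate([
        np.repeat(series[0], pad_left),
        series,
        np.repeat(series[-1], pad_right),
    ])
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")
    seasonal = series - trend
    return trend, seasonal
```

In a scheme like the one the abstract outlines, each part would then be corrected by its own small module before the two are summed again; that correction step is not sketched here.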

arXiv:2401.01759 [pdf, other]

VGA: Vision and Graph Fused Attention Network for Rumor Detection

Authors: Lin Bai, Caiyan Jia, Ziying Song, Chaoqun Cui

Abstract: With the development of social media, rumors have been spread broadly on social media platforms, causing great harm to society. Besides textual information, many rumors also use manipulated images or conceal textual information within images to deceive people and avoid being detected, making multimodal rumor detection a critical problem. The majority of multimodal rumor detection methods mainly concentrate on extracting features of source claims and their corresponding images, while ignoring the comments of rumors and their propagation structures. These comments and structures imply the wisdom of crowds and have proved crucial for debunking rumors. Moreover, these methods usually only extract visual features in a basic manner, seldom considering tampering or textual information in images. Therefore, in this study, we propose a novel Vision and Graph Fused Attention Network (VGA) for rumor detection to utilize propagation structures among posts so as to obtain the crowd opinions and further explore visual tampering features, as well as the textual information hidden in images. We conduct extensive experiments on three datasets, demonstrating that VGA can effectively detect multimodal rumors and outperform state-of-the-art methods significantly.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2401.01117 [pdf, other]

Q-Refine: A Perceptual Quality Refiner for AI-Generated Image

Authors: Chunyi Li, Haoning Wu, Zicheng Zhang, Hongkun Hao, Kaiwei Zhang, Lei Bai, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

Abstract: With the rapid evolution of Text-to-Image (T2I) models in recent years, their unsatisfactory generation results have become a challenge. However, uniformly refining AI-Generated Images (AIGIs) of different qualities not only limits optimization capabilities for low-quality AIGIs but also brings negative optimization to high-quality AIGIs. To address this issue, a quality-aware refiner named Q-Refine is proposed. Based on the preference of the Human Visual System (HVS), Q-Refine uses the Image Quality Assessment (IQA) metric to guide the refining process for the first time, and modifies images of different qualities through three adaptive pipelines. Experiments show that for mainstream T2I models, Q-Refine can perform effective optimization to AIGIs of different qualities. It can be a general refiner to optimize AIGIs from both fidelity and aesthetic quality levels, thus expanding the application of the T2I generation models.

Submitted 2 January, 2024; originally announced January 2024.

Comments: 6 pages, 5 figures

arXiv:2401.00496 [pdf, other]

SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge

Authors: Dimitrios Psychogyios, Emanuele Colleoni, Beatrice Van Amsterdam, Chih-Yang Li, Shu-Yu Huang, Yuchong Li, Fucang Jia, Baosheng Zou, Guotai Wang, Yang Liu, Maxence Boels, Jiayu Huo, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin, Mengya Xu, An Wang, Yanan Wu, Long Bai, Hongliang Ren, Atsushi Yamada, Yuriko Harai, Yuto Ishikawa, Kazuyuki Hayashi, et al. (25 additional authors not shown)

Abstract: Surgical tool segmentation and action recognition are fundamental building blocks in many computer-assisted intervention applications, ranging from surgical skills assessment to decision support systems. Nowadays, learning-based action recognition and segmentation approaches outperform classical methods, relying, however, on large, annotated datasets. Furthermore, action recognition and tool segmentation algorithms are often trained and make predictions in isolation from each other, without exploiting potential cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we release the first multimodal, publicly available, in-vivo dataset for surgical action recognition and semantic instrumentation segmentation, containing 50 suturing video segments of Robotic Assisted Radical Prostatectomy (RARP). The aim of the challenge is twofold. First, to enable researchers to leverage the scale of the provided dataset and develop robust and highly accurate single-task action recognition and tool segmentation approaches in the surgical domain. Second, to further explore the potential of multitask-based learning approaches and determine their comparative advantage against their single-task counterparts. A total of 12 teams participated in the challenge, contributing 7 action recognition methods, 9 instrument segmentation techniques, and 4 multitask approaches that integrated both action recognition and instrument segmentation. The complete SAR-RARP50 dataset is available at: https://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/191091

Submitted 23 January, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

arXiv:2312.12462 [pdf, other]

Towards an end-to-end artificial intelligence driven global weather forecasting system

Authors: Kun Chen, Lei Bai, Fenghua Ling, Peng Ye, Tao Chen, Jing-Jia Luo, Hao Chen, Yi Xiao, Kang Chen, Tao Han, Wanli Ouyang

Abstract: The weather forecasting system is important for science and society, and significant achievements have been made in applying artificial intelligence (AI) to medium-range weather forecasting. However, existing AI-based weather forecasting models rely on analysis or reanalysis products from traditional numerical weather prediction (NWP) systems as initial conditions for making predictions. Initial states are typically generated by traditional data assimilation components, which are computationally expensive and time-consuming. Here we present an AI-based data assimilation model, i.e., Adas, for global weather variables. By introducing the confidence matrix, Adas employs gated convolution to handle sparse observations and gated cross-attention for capturing the interactions between the background and observations. Further, we combine Adas with the advanced AI-based forecasting model (i.e., FengWu) to construct the first end-to-end AI-based global weather forecasting system: FengWu-Adas. We demonstrate that Adas can assimilate global observations to produce high-quality analysis, enabling the system to operate stably over the long term. Moreover, we are the first to apply the methods to real-world scenarios, which is more challenging and has considerable practical application potential. For the first time, we have also achieved forecasts based on AI-generated analyses with a skillful forecast lead time exceeding that of the IFS.

Submitted 8 April, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.12455 [pdf, other]

FengWu-4DVar: Coupling the Data-driven Weather Forecasting Model with 4D Variational Assimilation

Authors: Yi Xiao, Lei Bai, Wei Xue, Kang Chen, Tao Han, Wanli Ouyang

Abstract: Weather forecasting is a crucial yet highly challenging task. With the maturity of Artificial Intelligence (AI), the emergence of data-driven weather forecasting models has opened up a new paradigm for the development of weather forecasting systems. Despite the significant successes that have been achieved (e.g., surpassing advanced traditional physical models for global medium-range forecasting), existing data-driven weather forecasting models still rely on the analysis fields generated by the traditional assimilation and forecasting system, which hampers the significance of data-driven weather forecasting models regarding both computational cost and forecasting accuracy. In this work, we explore the possibility of coupling the data-driven weather forecasting model with data assimilation by integrating the global AI weather forecasting model, FengWu, with one of the most popular assimilation algorithms, Four-Dimensional Variational (4DVar) assimilation, and develop an AI-based cyclic weather forecasting system, FengWu-4DVar. FengWu-4DVar can incorporate observational data into the data-driven weather forecasting model and consider the temporal evolution of atmospheric dynamics to obtain accurate analysis fields for making predictions in a cycling manner without the help of physical models. Owing to the auto-differentiation ability of deep learning models, FengWu-4DVar eliminates the need to develop the cumbersome adjoint model, which is usually required in the traditional implementation of the 4DVar algorithm. Experiments on the simulated observational dataset demonstrate that FengWu-4DVar is capable of generating reasonable analysis fields for making accurate and efficient iterative predictions.

Submitted 19 May, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

Comments: 15 pages, 8 figures
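
For readers unfamiliar with why 4DVar usually needs an adjoint model, the textbook strong-constraint 4DVar cost function is sketched below. This is the standard form, not an equation quoted from the FengWu-4DVar paper, and all symbols are introduced here: $\mathbf{x}_0$ is the initial state being estimated, $\mathbf{x}_b$ the background state, $\mathbf{B}$ and $\mathbf{R}_i$ the background and observation error covariances, $\mathbf{y}_i$ the observations at time $i$, $\mathcal{H}_i$ the observation operator, and $\mathcal{M}_{0\to i}$ the forecast model integrated from time $0$ to time $i$.

```latex
J(\mathbf{x}_0) =
  \frac{1}{2}\,(\mathbf{x}_0 - \mathbf{x}_b)^{\top}\,\mathbf{B}^{-1}\,(\mathbf{x}_0 - \mathbf{x}_b)
  + \frac{1}{2}\sum_{i=0}^{N}
    \bigl(\mathcal{H}_i(\mathcal{M}_{0\to i}(\mathbf{x}_0)) - \mathbf{y}_i\bigr)^{\top}
    \mathbf{R}_i^{-1}
    \bigl(\mathcal{H}_i(\mathcal{M}_{0\to i}(\mathbf{x}_0)) - \mathbf{y}_i\bigr)
```

Minimizing $J$ requires its gradient with respect to $\mathbf{x}_0$, which involves the transpose (adjoint) of the linearized forecast model. When $\mathcal{M}$ is a differentiable neural network, automatic differentiation can supply that gradient, which is presumably the point the abstract makes about avoiding a hand-written adjoint model.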

arXiv:2312.11805 [pdf, other]

Gemini: A Family of Highly Capable Multimodal Models

Authors: Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, et al. (1325 additional authors not shown)

Abstract: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

Submitted 17 June, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10429 [pdf, other]

ResoNet: Robust and Explainable ENSO Forecasts with Hybrid Convolution and Transformer Networks

Authors: Pumeng Lyu, Tao Tang, Fenghua Ling, Jing-Jia Luo, Niklas Boers, Wanli Ouyang, Lei Bai

Abstract: Recent studies have shown that deep learning (DL) models can skillfully predict the El Niño-Southern Oscillation (ENSO) over 1.5 years ahead. However, concerns regarding the reliability of predictions made by DL methods persist, including potential overfitting issues and lack of interpretability. Here, we propose ResoNet, a DL model that combines convolutional neural network (CNN) and Transformer architectures. This hybrid architecture design enables our model to adequately capture local sea surface temperature anomalies (SSTA) as well as long-range inter-basin interactions across oceans. We show that ResoNet can robustly predict ENSO at lead times between 19 and 26 months, thus outperforming existing approaches in terms of the forecast horizon. According to an explainability method applied to ResoNet predictions of El Niño and La Niña events from 1- to 18-month lead, we find that it predicts the Niño3.4 index based on multiple physically reasonable mechanisms, such as the Recharge Oscillator concept, Seasonal Footprint Mechanism, and Indian Ocean capacitor effect. Moreover, we demonstrate for the first time that the asymmetry between El Niño and La Niña development can be captured by ResoNet. Our results could help alleviate skepticism about applying DL models for ENSO prediction and encourage more attempts to discover and predict climate phenomena using AI methods.

Submitted 16 December, 2023; originally announced December 2023.

Comments: 32 pages, 5 main figures and 12 supplementary figures

arXiv:2312.09576 [pdf, other]

SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma

Authors: Xiangde Luo, Jia Fu, Yunxin Zhong, Shuolin Liu, Bing Han, Mehdi Astaraki, Simone Bendazzoli, Iuliana Toma-Dasu, Yiwen Ye, Ziyang Chen, Yong Xia, Yanzhou Su, Jin Ye, Junjun He, Zhaohu Xing, Hongqiu Wang, Lei Zhu, Kaixiang Yang, Xin Fang, Zhiwei Wang, Chan Woong Lee, Sang Joon Park, Jaehee Chun, Constantin Ulrich, Klaus H. Maier-Hein, et al. (17 additional authors not shown)

Abstract: Radiation therapy is a primary and effective NasoPharyngeal Carcinoma (NPC) treatment strategy. The precise delineation of Gross Tumor Volumes (GTVs) and Organs-At-Risk (OARs) is crucial in radiation treatment, directly impacting patient prognosis. Previously, the delineation of GTVs and OARs was performed by experienced radiation oncologists. Recently, deep learning has achieved promising results in many medical image segmentation tasks. However, for NPC OARs and GTVs segmentation, few public datasets are available for model development and evaluation. To alleviate this problem, the SegRap2023 challenge was organized in conjunction with MICCAI2023 and presented a large-scale benchmark for OAR and GTV segmentation with 400 Computed Tomography (CT) scans from 200 NPC patients, each with a pair of pre-aligned non-contrast and contrast-enhanced CT scans. The challenge's goal was to segment 45 OARs and 2 GTVs from the paired CT scans. In this paper, we detail the challenge and analyze the solutions of all participants. The average Dice similarity coefficient scores for all submissions ranged from 76.68% to 86.70% for OARs and from 70.42% to 73.44% for GTVs. We conclude that the segmentation of large-size OARs is well-addressed, and more efforts are needed for GTVs and small-size or thin-structure OARs. The benchmark will remain publicly available here: https://segrap2023.grand-challenge.org

Submitted 15 December, 2023; originally announced December 2023.

Comments: A challenge report of SegRap2023 (organized in conjunction with MICCAI2023)

arXiv:2312.06428 [pdf, other]

VisionTraj: A Noise-Robust Trajectory Recovery Framework based on Large-scale Camera Network

Authors: Zhishuai Li, Ziyue Li, Xiaoru Hu, Guoqing Du, Yunhao Nie, Feng Zhu, Lei Bai, Rui Zhao

Abstract: Trajectory recovery based on the snapshots from the city-wide multi-camera network facilitates urban mobility sensing and driveway optimization. The state-of-the-art solutions devoted to such a vision-based scheme typically incorporate predefined rules or unsupervised iterative feedback, struggling with multi-fold challenges such as lack of open-source datasets for training the whole pipeline, and the vulnerability to the noises from visual inputs. In response to the dilemma, this paper proposes VisionTraj, the first learning-based model that reconstructs vehicle trajectories from snapshots recorded by road network cameras. Coupled with it, we elaborate on two rational vision-trajectory datasets, which produce extensive trajectory data along with corresponding visual snapshots, enabling supervised vision-trajectory interplay extraction. Following the data creation, based on the results from the off-the-shelf multi-modal vehicle clustering, we first re-formulate the trajectory recovery problem as a generative task and introduce the canonical Transformer as the autoregressive backbone. Then, to identify clustering noises (e.g., false positives) with the bound on the snapshots' spatiotemporal dependencies, a GCN-based soft-denoising module is conducted based on the fine- and coarse-grained Re-ID clusters. Additionally, we harness strong semantic information extracted from the tracklet to provide detailed insights into the vehicle's entry and exit actions during trajectory recovery. The denoising and tracklet components can also act as plug-and-play modules to boost baselines. Experimental results on the two hand-crafted datasets show that the proposed VisionTraj achieves a maximum +11.5% improvement against the sub-best model.

Submitted 11 December, 2023; originally announced December 2023.

arXiv:2312.01697 [pdf, other]

Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Authors: Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang

Abstract: Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, achieving state-of-the-art performance in 11 benchmarks. The code is available on https://github.com/OpenGVLab/Hulk.

Submitted 21 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: 24 pages, 5 figures

arXiv:2311.02962 [pdf, other]

Retrieval-Augmented Code Generation for Universal Information Extraction

Authors: Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan Liu, Xiang Li, Pan Yang, Long Bai, Jiafeng Guo, Xueqi Cheng

Abstract: Information Extraction (IE) aims to extract structural knowledge (e.g., entities, relations, events) from natural language texts, which brings challenges to existing methods due to task-specific schemas and complex text expressions. Code, as a typical kind of formalized language, is capable of describing structural knowledge under various schemas in a universal way. On the other hand, Large Language Models (LLMs) trained on both codes and texts have demonstrated powerful capabilities of transforming texts into codes, which provides a feasible solution to IE tasks. Therefore, in this paper, we propose a universal retrieval-augmented code generation framework based on LLMs, called Code4UIE, for IE tasks. Specifically, Code4UIE adopts Python classes to define task-specific schemas of various structural knowledge in a universal way. By so doing, extracting knowledge under these schemas can be transformed into generating codes that instantiate the predefined Python classes with the information in texts. To generate these codes more precisely, Code4UIE adopts the in-context learning mechanism to instruct LLMs with examples. In order to obtain appropriate examples for different tasks, Code4UIE explores several example retrieval strategies, which can retrieve examples semantically similar to the given texts. Extensive experiments on five representative IE tasks across nine datasets demonstrate the effectiveness of the Code4UIE framework.

Submitted 6 November, 2023; originally announced November 2023.

arXiv:2311.02631 [pdf, other]

A Critical Perceptual Pre-trained Model for Complex Trajectory Recovery

Authors: Dedong Li, Ziyue Li, Zhishuai Li, Lei Bai, Qingyuan Gong, Lijun Sun, Wolfgang Ketter, Rui Zhao

Abstract: The trajectory on the road traffic is commonly collected at a low sampling rate, and trajectory recovery aims to recover a complete and continuous trajectory from the sparse and discrete inputs. Recently, sequential language models have been innovatively adopted for trajectory recovery in a pre-trained manner: it learns road segment representation vectors, which will be used in the downstream tasks. However, existing methods are incapable of handling complex trajectories: when the trajectory crosses remote road segments or makes several turns, which we call critical nodes, the quality of learned representations deteriorates, and the recovered trajectories skip the critical nodes. This work is dedicated to offering a more robust trajectory recovery for complex trajectories. Firstly, we define the trajectory complexity based on the detour score and entropy score and construct the complexity-aware semantic graphs correspondingly. Then, we propose a Multi-view Graph and Complexity Aware Transformer (MGCAT) model to encode these semantics in trajectory pre-training from two aspects: 1) adaptively aggregate the multi-view graph features considering trajectory pattern, and 2) higher attention to critical nodes in a complex trajectory. Such that, our MGCAT is perceptual when handling the critical scenario of complex trajectories. Extensive experiments are conducted on large-scale datasets. The results prove that our method learns better representations for trajectory recovery, with 5.22% higher F1-score overall and 8.16% higher F1-score for complex trajectories particularly. The code is available at https://github.com/bonaldli/ComplexTraj.

Submitted 5 November, 2023; originally announced November 2023.

Comments: Accepted in ACM SIGSPATIAL 2023

arXiv:2311.00291 [pdf, other]

Graph Representation Learning for Infrared and Visible Image Fusion

Authors: Jing Li, Lu Bai, Bin Yang, Chang Li, Lingfei Ma, Edwin R. Hancock

Abstract: Infrared and visible image fusion aims to extract complementary features to synthesize a single fused image. Many methods employ convolutional neural networks (CNNs) to extract local features due to their translation invariance and locality. However, CNNs fail to consider the image's non-local self-similarity (NLss): although they can expand the receptive field by pooling operations, this still inevitably leads to information loss. In addition, the transformer structure extracts long-range dependence by considering the correlativity among all image patches, leading to information redundancy of such transformer-based methods. However, graph representation is more flexible than grid (CNN) or sequence (transformer structure) representation to address irregular objects, and graph can also construct the relationships among the spatially repeatable details or texture with far-space distance. Therefore, to address the above issues, it is significant to convert images into the graph space and thus adopt graph convolutional networks (GCNs) to extract NLss. This is because the graph can provide a fine structure to aggregate features and propagate information across the nearest vertices without introducing redundant information. Concretely, we implement a cascaded NLss extraction pattern to extract NLss of intra- and inter-modal by exploring interactions of different image pixels in intra- and inter-image positional distance. We commence by performing GCNs on each intra-modal to aggregate features and propagate information to extract independent intra-modal NLss. Then, GCNs are performed on the concatenated intra-modal NLss features of infrared and visible images, which can explore the cross-domain NLss of inter-modal to reconstruct the fused image. Ablation studies and extensive experiments illustrate the effectiveness and superiority of the proposed method on three datasets.

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.14174 [pdf, other]

An In-Context Schema Understanding Method for Knowledge Base Question Answering

Authors: Yantao Liu, Zixuan Li, Xiaolong Jin, Yucan Guo, Long Bai, Saiping Guan, Jiafeng Guo, Xueqi Cheng

Abstract: The Knowledge Base Question Answering (KBQA) task aims to answer natural language questions based on a given knowledge base. Recently, Large Language Models (LLMs) have shown strong capabilities in language understanding and can be used to solve this task. In doing so, a major challenge for LLMs is to overcome the immensity and heterogeneity of knowledge base schemas. Existing methods bypass this challenge by initially employing LLMs to generate drafts of logic forms without schema-specific details. Then, an extra module is used to inject schema information into these drafts. In contrast, in this paper, we propose a simple In-Context Schema Understanding (ICSU) method that enables LLMs to directly understand schemas by leveraging in-context learning. Specifically, ICSU provides schema information to LLMs using schema-related annotated examples. We investigate three example retrieval strategies based on raw questions, anonymized questions, and generated SPARQL queries. Experimental results show that ICSU demonstrates competitive performance compared to baseline methods on both the KQA Pro and WebQSP datasets.

Submitted 10 February, 2024; v1 submitted 22 October, 2023; originally announced October 2023.

arXiv:2310.13447 [pdf, other]

Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation

Authors: Siyu Zhang, Yeming Chen, Sirui Cheng, Yaoru Sun, Jun Yang, Lizhi Bai

Abstract: Within the multimodal field, the key to integrating vision and language lies in establishing a good alignment strategy. Recently, benefiting from the success of self-supervised learning, significant progress has been made in multimodal semantic representation based on pre-trained models for vision and language. However, there is still room for improvement in visual semantic representation. The lack of spatial semantic coherence and vulnerability to noise makes it challenging for current pixel or patch-based methods to accurately extract complex scene boundaries. To this end, this paper develops superpixel as a comprehensive compact representation of learnable image data, which effectively reduces the number of visual primitives for subsequent processing by clustering perceptually similar pixels. To mine more precise topological relations, we propose a Multiscale Difference Graph Convolutional Network (MDGCN). It parses the entire image as a fine-to-coarse hierarchical structure of constituent visual patterns, and captures multiscale features by progressively merging adjacent superpixels as graph nodes. Moreover, we predict the differences between adjacent nodes through the graph structure, facilitating key information aggregation of graph nodes to reason actual semantic relations. Afterward, we design a multi-level fusion rule in a bottom-up manner to avoid understanding deviation by learning complementary spatial information at different regional scales. Our proposed method can be well applied to multiple downstream task learning. Extensive experiments demonstrate that our method is competitive with other state-of-the-art methods in visual reasoning. Our code will be released upon publication.

Submitted 25 October, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.09937 [pdf, other]

Joint Sparse Representations and Coupled Dictionary Learning in Multi-Source Heterogeneous Image Pseudo-color Fusion

Authors: Long Bai, Shilong Yao, Kun Gao, Yanjun Huang, Ruijie Tang, Hong Yan, Max Q. -H. Meng, Hongliang Ren

Abstract: Considering that Coupled Dictionary Learning (CDL) method can obtain a reasonable linear mathematical relationship between resource images, we propose a novel CDL-based Synthetic Aperture Radar (SAR) and multispectral pseudo-color fusion method. Firstly, the traditional Brovey transform is employed as a pre-processing method on the paired SAR and multispectral images. Then, CDL is used to capture the correlation between the pre-processed image pairs based on the dictionaries generated from the source images via enforced joint sparse coding. Afterward, the joint sparse representation in the pair of dictionaries is utilized to construct an image mask via calculating the reconstruction errors, and therefore generate the final fusion image. The experimental verification results of the SAR images from the Sentinel-1 satellite and the multispectral images from the Landsat-8 satellite show that the proposed method can achieve superior visual effects, and excellent quantitative performance in terms of spectral distortion, correlation coefficient, MSE, NIQE, BRISQUE, and PIQE.

Submitted 15 October, 2023; originally announced October 2023.

Comments: To appear in IEEE Sensors Journal

arXiv:2310.08261 [pdf, other]

GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection

Authors: Ziying Song, Haiyue Wei, Lin Bai, Lei Yang, Caiyan Jia

Abstract: LiDAR and cameras are complementary sensors for 3D object detection in autonomous driving. However, it is challenging to explore the unnatural interaction between point clouds and images, and the critical factor is how to conduct feature alignment of heterogeneous modalities. Currently, many methods achieve feature alignment by projection calibration only, without considering the problem of coordinate conversion accuracy errors between sensors, leading to sub-optimal performance. In this paper, we present GraphAlign, a more accurate feature alignment strategy for 3D object detection by graph matching. Specifically, we fuse image features from a semantic segmentation encoder in the image branch and point cloud features from a 3D Sparse CNN in the LiDAR branch. To save computation, we construct the nearest neighbor relationship by calculating Euclidean distance within the subspaces that are divided into the point cloud features. Through the projection calibration between the image and point cloud, we project the nearest neighbors of point cloud features onto the image features. Then by matching the nearest neighbors with a single point cloud to multiple images, we search for a more appropriate feature alignment. In addition, we provide a self-attention module to enhance the weights of significant relations to fine-tune the feature alignment between heterogeneous modalities. Extensive experiments on nuScenes benchmark demonstrate the effectiveness and efficiency of our GraphAlign.

Submitted 12 October, 2023; originally announced October 2023.

arXiv:2310.01994 [pdf, other]

Understanding Masked Autoencoders From a Local Contrastive Perspective

Authors: Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou, Wanli Ouyang

Abstract: Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies. However, despite achieving state-of-the-art performance across various downstream vision tasks, the underlying mechanisms that drive MAE's efficacy are less well-explored compared to the canonical contrastive learning paradigm. In this paper, we first propose a local perspective to explicitly extract a local contrastive form from MAE's reconstructive objective at the patch level. And then we introduce a new empirical framework, called Local Contrastive MAE (LC-MAE), to analyze both reconstructive and contrastive aspects of MAE. LC-MAE reveals that MAE learns invariance to random masking and ensures distribution consistency between the learned token embeddings and the original images. Furthermore, we dissect the contribution of the decoder and random masking to MAE's success, revealing both the decoder's learning mechanism and the dual role of random masking as data augmentation and effective receptive field restriction. Our experimental analysis sheds light on the intricacies of MAE and summarizes some useful design methodologies, which can inspire more powerful visual self-supervised methods.

Submitted 8 December, 2023; v1 submitted 3 October, 2023; originally announced October 2023.

arXiv:2309.15718 [pdf, other]

doi 10.1038/s42256-024-00818-6

Geometry-enhanced Pre-training on Interatomic Potentials

Authors: Taoyong Cui, Chenyu Tang, Mao Su, Shufei Zhang, Yuqiang Li, Lei Bai, Yuhan Dong, Xingao Gong, Wanli Ouyang

Abstract: Machine learning interatomic potentials (MLIPs) enable molecular dynamics (MD) simulations with ab initio accuracy and have been applied to various fields of physical science. However, the performance and transferability of MLIPs are limited by insufficient labeled training data, which require expensive ab initio calculations to obtain the labels, especially for complex molecular systems. To address this challenge, we design a novel geometric structure learning paradigm that consists of two stages. We first generate a large quantity of 3D configurations of the target molecular system with classical molecular dynamics simulations. Then, we propose geometry-enhanced self-supervised learning consisting of masking, denoising, and contrastive learning to better capture the topology and 3D geometric information from the unlabeled 3D configurations. We evaluate our method on various benchmarks ranging from small molecule datasets to complex periodic molecular systems with more types of elements. The experimental results show that the proposed pre-training method can greatly enhance the accuracy of MLIPs with little extra computational cost and works well with different invariant or equivariant graph neural network architectures. Our method improves the generalization capability of MLIPs and helps to realize accurate MD simulations for complex molecular systems.

Submitted 12 April, 2024; v1 submitted 27 September, 2023; originally announced September 2023.

Journal ref: Published in Nature Machine Intelligence 2024

arXiv:2309.12960 [pdf, other]

Nested Event Extraction upon Pivot Element Recognition

Authors: Weicheng Ren, Zixuan Li, Xiaolong Jin, Long Bai, Miao Su, Yantao Liu, Saiping Guan, Jiafeng Guo, Xueqi Cheng

Abstract: Nested Event Extraction (NEE) aims to extract complex event structures where an event contains other events as its arguments recursively. Nested events involve a kind of Pivot Elements (PEs) that simultaneously act as arguments of outer-nest events and as triggers of inner-nest events, and thus connect them into nested structures. This special characteristic of PEs brings challenges to existing NEE methods, as they cannot well cope with the dual identities of PEs. Therefore, this paper proposes a new model, called PerNee, which extracts nested events mainly based on recognizing PEs. Specifically, PerNee first recognizes the triggers of both inner-nest and outer-nest events and further recognizes the PEs via classifying the relation type between trigger pairs. The model uses prompt learning to incorporate information from both event types and argument roles for better trigger and argument representations to improve NEE performance. Since existing NEE datasets (e.g., Genia11) are limited to specific domains and contain a narrow range of event types with nested structures, we systematically categorize nested events in the generic domain and construct a new NEE dataset, called ACE2005-Nest. Experimental results demonstrate that PerNee consistently achieves state-of-the-art performance on ACE2005-Nest, Genia11, and Genia13. The ACE2005-Nest dataset and the code of the PerNee model are available at https://github.com/waysonren/PerNee.

Submitted 7 April, 2024; v1 submitted 22 September, 2023; originally announced September 2023.

Comments: Accepted at LREC-COLING 2024

arXiv:2309.12892 [pdf, other]

ProtoEM: A Prototype-Enhanced Matching Framework for Event Relation Extraction

Authors: Zhilei Hu, Zixuan Li, Daozhu Xu, Long Bai, Cheng Jin, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

Abstract: Event Relation Extraction (ERE) aims to extract multiple kinds of relations among events in texts. However, existing methods simply categorize event relations as different classes, which inadequately captures the intrinsic semantics of these relations. To comprehensively understand their intrinsic semantics, in this paper, we obtain prototype representations for each type of event relation and propose a Prototype-Enhanced Matching (ProtoEM) framework for the joint extraction of multiple kinds of event relations. Specifically, ProtoEM extracts event relations in a two-step manner, i.e., prototype representing and prototype matching. In the first step, to capture the connotations of different event relations, ProtoEM utilizes examples to represent the prototypes corresponding to these relations. Subsequently, to capture the interdependence among event relations, it constructs a dependency graph for the prototypes corresponding to these relations and utilizes a Graph Neural Network (GNN)-based module for modeling. In the second step, it obtains the representations of new event pairs and calculates their similarity with those prototypes obtained in the first step to evaluate which types of event relations they belong to. Experimental results on the MAVEN-ERE dataset demonstrate that the proposed ProtoEM framework can effectively represent the prototypes of event relations and further obtain a significant improvement over baseline models.

Submitted 22 September, 2023; originally announced September 2023.

Comments: Work in progress

arXiv:2309.10431 [pdf, other]

Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions

Authors: Jie Wang, Lihe Ding, Tingfa Xu, Shaocong Dong, Xinli Xu, Long Bai, Jianan Li

Abstract: Robust 3D perception under corruption has become an essential task for the realm of 3D vision. However, current data augmentation techniques usually perform random transformations on all point cloud objects in an offline way and ignore the structure of the samples, resulting in over- or under-enhancement. In this work, we propose an alternative to make sample-adaptive transformations based on the structure of the sample to cope with potential corruption via an auto-augmentation framework, named AdaptPoint. Specifically, we leverage an imitator, consisting of a Deformation Controller and a Mask Controller, respectively in charge of predicting deformation parameters and producing a per-point mask, based on the intrinsic structural information of the input point cloud, and then conduct corruption simulations on top. Then a discriminator is utilized to prevent the generation of excessive corruption that deviates from the original data distribution. In addition, a perception-guidance feedback mechanism is incorporated to guide the generation of samples with appropriate difficulty level. Furthermore, to address the paucity of real-world corrupted point clouds, we also introduce a new dataset, ScanObjectNN-C, that exhibits greater similarity to actual data in real-world environments, especially when contrasted with preceding CAD datasets. Experiments show that our method achieves state-of-the-art results on multiple corruption benchmarks, including ModelNet-C, our ScanObjectNN-C, and ShapeNet-C.

Submitted 19 September, 2023; originally announced September 2023.

Comments: Accepted by ICCV2023; code: https://github.com/Roywangj/AdaptPoint

arXiv:2309.10242 [pdf, other]

Reinforcement Learning for optimal dividend problem under diffusion model

Authors: Lihua Bai, Thejani Gamage, ** Ma, Pengxu Xie

Abstract: In this paper, we study the optimal dividend problem under the continuous time diffusion model with the dividend rate being restricted in a given finite interval. Unlike the standard literature, we shall particularly be interested in the case when the parameters (e.g. drift and diffusion coefficients) of the model are not specified so that the optimal control cannot be explicitly determined. We therefore follow the recently developed method via Reinforcement Learning (RL) to find the optimal strategy. Specifically, we shall design a corresponding RL-type entropy-regularized exploratory control problem, which randomizes the control actions and balances exploitation and exploration. We shall first carry out a theoretical analysis of the new relaxed control problem and prove that the value function is the unique bounded classical solution to the corresponding HJB equation. We will then use a policy improvement argument, along with policy evaluation devices (e.g., Temporal Difference (TD)-based algorithm and Martingale Loss (ML)-algorithms) to construct approximating sequences of the optimal strategy. We present some numerical results using different parametrization families for the cost functional to illustrate the effectiveness of the approximation schemes.

Submitted 18 September, 2023; originally announced September 2023.

arXiv:2309.05536 [pdf, other]

doi 10.1103/PhysRevD.108.126015

Transport properties of a holographic model with novel gauge-axion coupling

Authors: Lin-Yue Bai, Jian-Pin Wu, Zhen-Hua Zhou

Abstract: We investigate the transport properties within a holographic model characterized by a novel gauge-axion coupling. A key innovation is the introduction of the direct coupling between axion fields, the antisymmetric tensor, and the gauge field in our bulk theory. This novel coupling term leads to the emergence of nondiagonal components in the conductivity tensor. An important characteristic is that the off-diagonal elements manifest antisymmetry. Remarkably, the conductivity behavior in this model is akin to that of Hall conductivity. Additionally, this model can also achieve a metal-insulator transition.

Submitted 26 December, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

Comments: 29 pages, 11 figures, Published version

Journal ref: Phys.Rev.D 108 (2023) 12, 126015

arXiv:2309.03467 [pdf, other]

Autoregressive Omni-Aware Outpainting for Open-Vocabulary 360-Degree Image Generation

Authors: Zhuqiang Lu, Kun Hu, Chaoyue Wang, Lei Bai, Zhiyong Wang

Abstract: A 360-degree (omni-directional) image provides an all-encompassing spherical view of a scene. Recently, there has been an increasing interest in synthesising 360-degree images from conventional narrow field of view (NFoV) images captured by digital cameras and smartphones, for providing immersive experiences in various scenarios such as virtual reality. Yet, existing methods typically fall short in synthesizing intricate visual details or in ensuring that the generated images align consistently with user-provided prompts. In this study, an autoregressive omni-aware generative network (AOG-Net) is proposed for 360-degree image generation by out-painting an incomplete 360-degree image progressively with NFoV and text guidances jointly or individually. This autoregressive scheme not only allows for deriving finer-grained and text-consistent patterns by dynamically generating and adjusting the process but also offers users greater flexibility to edit their conditions throughout the generation process. A global-local conditioning mechanism is devised to comprehensively formulate the outpainting guidance in each autoregressive step. Text guidances, omni-visual cues, NFoV inputs and omni-geometry are encoded and further formulated with cross-attention based transformers into a global stream and a local stream into a conditioned generative backbone model. As AOG-Net is compatible to leverage large-scale models for the conditional encoder and the generative prior, it enables the generation to use extensive open-vocabulary text guidances. Comprehensive experiments on two commonly used 360-degree image datasets for both indoor and outdoor settings demonstrate the state-of-the-art performance of our proposed method. Our code will be made publicly available.

Submitted 8 April, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: Accepted by AAAI 24

ACM Class: I.4.0

arXiv:2308.16376 [pdf, other]

Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training

Authors: Lei Bai, Dongang Wang, Michael Barnett, Mariano Cabezas, Weidong Cai, Fernando Calamante, Kain Kyle, Dongnan Liu, Linda Ly, Aria Nguyen, Chun-Chien Shieh, Ryan Sullivan, Hengrui Wang, Geng Zhan, Wanli Ouyang, Chenyu Wang

Abstract: Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not address the heterogeneous need for model robustness. Conversely, the collection of data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore the use of the federated learning framework while considering label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations.

Submitted 30 August, 2023; originally announced August 2023.

Comments: 11 pages, 4 figures, journal submission

arXiv:2308.15717 [pdf]

Risk-aware Flexible Resource Utilization in an Unbalanced Three-Phase Distribution Network using SDP-based Distributionally Robust Optimal Power Flow

Authors: Zelong Lu, Jianxue Wang, Mohammad Shahidehpour, Linquan Bai, Zuyi Li, Lei Yan, Xianlong Chen

Abstract: The variability caused by the proliferation of distributed energy resources (DERs) and the significant growth in unbalanced three-phase loads pose unprecedented challenges to distribution network operations. This paper focuses on how a distribution system operator (DSO), taking over the distribution grid and market operations, would develop a risk-aware flexibility market to mitigate uncertainties in an unbalanced three-phase power distribution network. First, a distributionally robust chance constraint (DRCC) method is devised to solve the unbalanced three-phase optimal power flow using a semidefinite programming (SDP) model. The DSO can apply the proposed solution to jointly clear energy and flexibility markets. Then, the DRCC model accuracy is improved by an information-sharing mechanism characterized by spatially-correlated uncertainties in the distribution grid. Further, a novel system-wide response function is derived to make the DRCC model tractable. Using the duality theory, the paper further investigates the physical composition of the DSO's cleared flexibility prices to guide the unbalanced distribution network operation. Finally, the effectiveness of the risk-aware flexibility market is verified in a modified three-phase IEEE 34-node system. Results demonstrate that the flexibility market can quantify the impact of spatially correlated uncertainties and facilitate the utilization of flexible resources to mitigate uncertainties across the network.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.14100 [pdf, other]

Rethinking Exemplars for Continual Semantic Segmentation in Endoscopy Scenes: Entropy-based Mini-Batch Pseudo-Replay

Authors: Guankun Wang, Long Bai, Yanan Wu, Tong Chen, Hongliang Ren

Abstract: Endoscopy is a widely used technique for the early detection of diseases or robotic-assisted minimally invasive surgery (RMIS). Numerous deep learning (DL)-based research works have been developed for automated diagnosis or processing of endoscopic view. However, existing DL models may suffer from catastrophic forgetting. When new target classes are introduced over time or across institutions, the performance of old classes may suffer severe degradation. More seriously, data privacy and storage issues may lead to the unavailability of old data when updating the model. Therefore, it is necessary to develop a continual learning (CL) methodology to solve the problem of catastrophic forgetting in endoscopic image segmentation. To tackle this, we propose an Endoscopy Continual Semantic Segmentation (EndoCSS) framework that does not involve the storage and privacy issues of exemplar data. The framework includes a mini-batch pseudo-replay (MB-PR) mechanism and a self-adaptive noisy cross-entropy (SAN-CE) loss. The MB-PR strategy circumvents privacy and storage issues by generating pseudo-replay images through a generative model. Meanwhile, the MB-PR strategy can also correct the model deviation to the replay data and current training data, which arises from the significant difference in the amount of current and replay images. Therefore, the model can perform effective representation learning on both new and old tasks. SAN-CE loss can help model fitting by adjusting the model's output logits, and also improve the robustness of training. Extensive continual semantic segmentation (CSS) experiments on public datasets demonstrate that our method can robustly and effectively address the catastrophic forgetting brought by class increment in endoscopy scenes. The results show that our framework holds excellent potential for real-world deployment in a streaming learning manner.

Submitted 27 August, 2023; originally announced August 2023.

Comments: Accepted by Computers in Biology and Medicine

arXiv:2308.10468 [pdf, other]

STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning

Authors: Tao Han, Lei Bai, Lingbo Liu, Wanli Ouyang

Abstract: Scale variation is a deep-rooted problem in object counting, which has not been effectively addressed by existing scale-aware algorithms. An important factor is that they typically involve cooperative learning across multi-resolutions, which could be suboptimal for learning the most discriminative features from each scale. In this paper, we propose a novel method termed STEERER (SelecTivE inhERitance lEaRning) that addresses the issue of scale variations in object counting. STEERER selects the most suitable scale for patch objects to boost feature extraction and only inherits discriminative features from lower to higher resolution progressively. The main insights of STEERER are a dedicated Feature Selection and Inheritance Adaptor (FSIA), which selectively forwards scale-customized features at each scale, and a Masked Selection and Inheritance Loss (MSIL) that helps to achieve high-quality density maps across all scales. Our experimental results on nine datasets with counting and localization tasks demonstrate the unprecedented scale generalization ability of STEERER. Code is available at https://github.com/taohan10200/STEERER.

Submitted 21 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV2023, 9 pages

arXiv:2308.07203 [pdf, other]

Successive Refinement of Shannon Cipher System Under Maximal Leakage

Authors: Zhuangfei Wu, Lin Bai, Lin Zhou

Abstract: We study the successive refinement setting of Shannon cipher system (SCS) under the maximal leakage secrecy metric for discrete memoryless sources under bounded distortion measures. Specifically, we generalize the threat model for the point-to-point rate-distortion setting of Issa, Wagner and Kamath (T-IT 2020) to the multiterminal successive refinement setting. Under mild conditions that correspond to partial secrecy, we characterize the asymptotically optimal normalized maximal leakage region for both the joint excess-distortion probability (JEP) and the expected distortion reliability constraints. Under JEP, in the achievability part, we propose a type-based coding scheme, analyze the reliability guarantee for JEP and bound the leakage of the information source through compressed messages. In the converse part, by analyzing a guessing scheme of the eavesdropper, we prove the optimality of our achievability result. Under expected distortion, the achievability part is established similarly to the JEP counterpart. The converse proof proceeds by generalizing the corresponding results for the rate-distortion setting of SCS by Schieler and Cuff (T-IT 2014) to the successive refinement setting. Somewhat surprisingly, the normalized maximal leakage regions under both JEP and expected distortion constraints are identical under certain conditions, although JEP appears to be a stronger reliability constraint.

Submitted 18 April, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

arXiv:2308.02869 [pdf, other]

Semi-supervised Learning for Segmentation of Bleeding Regions in Video Capsule Endoscopy

Authors: Hechen Li, Yanan Wu, Long Bai, An Wang, Tong Chen, Hongliang Ren

Abstract: In the realm of modern diagnostic technology, video capsule endoscopy (VCE) is a standout for its high efficacy and non-invasive nature in diagnosing various gastrointestinal (GI) conditions, including obscure bleeding. Importantly, for the successful diagnosis and treatment of these conditions, accurate recognition of bleeding regions in VCE images is crucial. While deep learning-based methods have emerged as powerful tools for the automated analysis of VCE images, they often demand large training datasets with comprehensive annotations. Acquiring these labeled datasets tends to be time-consuming, costly, and requires significant domain expertise. To mitigate this issue, we have embraced a semi-supervised learning (SSL) approach for the bleeding regions segmentation within VCE. By adopting the 'Mean Teacher' method, we construct a student U-Net equipped with an scSE attention block, alongside a teacher model of the same architecture. These models' parameters are alternately updated throughout the training process. We use the Kvasir-Capsule dataset for our experiments, which encompasses various GI bleeding conditions. Notably, we develop the segmentation annotations for this dataset ourselves. The findings from our experiments endorse the efficacy of the SSL-based segmentation strategy, demonstrating its capacity to reduce reliance on large volumes of annotations for model training, without compromising on the accuracy of identification.

Submitted 5 August, 2023; originally announced August 2023.

Comments: ICBIR 2023

arXiv:2308.02845 [pdf, other]

Landmark Detection using Transformer Toward Robot-assisted Nasal Airway Intubation

Authors: Tianhang Liu, Hechen Li, Long Bai, Yanan Wu, An Wang, Mobarakol Islam, Hongliang Ren

Abstract: Robot-assisted airway intubation application needs high accuracy in locating targets and organs. Two vital landmarks, nostrils and glottis, can be detected during the intubation to accommodate the stages of nasal intubation. Automated landmark detection can provide accurate localization and quantitative evaluation. The Detection Transformer (DeTR) leads object detectors to a new paradigm with long-range dependence. However, current DeTR requires long iterations to converge, and does not perform well in detecting small objects. This paper proposes a transformer-based landmark detection solution with deformable DeTR and the semantic-aligned-matching module for detecting landmarks in robot-assisted intubation. The semantics aligner can effectively align the semantics of object queries and image features in the same embedding space using the most discriminative features. To evaluate the performance of our solution, we utilize a publicly accessible glottis dataset and automatically annotate a nostril detection dataset. The experimental results demonstrate our competitive performance in detection accuracy. Our code is publicly accessible.

Submitted 5 August, 2023; originally announced August 2023.

Comments: ICBIR 2023 (Best Student Paper Award). Code availability: https://github.com/ConorLTH/airway_intubation_landmarks_detection

arXiv:2308.00588 [pdf, other]

Relation-Aware Distribution Representation Network for Person Clustering with Multiple Modalities

Authors: Kaijian Liu, Shixiang Tang, Ziyue Li, Zhishuai Li, Lei Bai, Feng Zhu, Rui Zhao

Abstract: Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for various tasks, such as movie parsing and identity-based movie editing. Related methods such as multi-view clustering mainly project multi-modal features into a joint feature space. However, multi-modal clue features are usually rather weakly correlated due to the semantic gap from the modality-specific uniqueness. As a result, these methods are not suitable for person clustering. In this paper, we propose a Relation-Aware Distribution representation Network (RAD-Net) to generate a distribution representation for multi-modal clues. The distribution representation of a clue is a vector consisting of the relation between this clue and all other clues from all modalities, thus being modality agnostic and good for person clustering. Accordingly, we introduce a graph-based method to construct distribution representation and employ a cyclic update policy to refine distribution representation progressively. Our method achieves substantial improvements of +6% and +8.2% in F-score on the Video Person-Clustering Dataset (VPCD) and VoxCeleb2 multi-view clustering dataset, respectively. Codes will be released publicly upon acceptance.

Submitted 1 August, 2023; originally announced August 2023.

Comments: Accepted in IEEE Transactions on Multimedia

arXiv:2307.12045 [pdf, other]

Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery

Authors: Long Bai, Mobarakol Islam, Hongliang Ren

Abstract: The visual-question localized-answering (VQLA) system can serve as a knowledgeable assistant in surgical education. Besides providing text-based answers, the VQLA system can highlight the region of interest for better surgical scene understanding. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning new knowledge. Specifically, when DNNs learn on incremental classes or tasks, their performance on old tasks drops dramatically. Furthermore, due to medical data privacy and licensing issues, it is often difficult to access old data when updating continual learning (CL) models. Therefore, we develop a non-exemplar continual surgical VQLA framework, to explore and balance the rigidity-plasticity trade-off of DNNs in a sequential learning paradigm. We revisit the distillation loss in CL tasks, and propose rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist) to preserve the old knowledge. The weight aligning (WA) technique is also integrated to adjust the weight bias between old and new tasks. We further establish a CL framework on three public surgical datasets in the context of surgical settings that consist of overlapping classes between old and new surgical VQLA tasks. With extensive experiments, we demonstrate that our proposed method excellently reconciles learning and forgetting on the continual surgical VQLA over conventional CL methods. Our code is publicly accessible.

Submitted 22 July, 2023; originally announced July 2023.

Comments: To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/CS-VQLA

arXiv:2307.09211 [pdf, other]

Intrinsic ferroelectric switching in two-dimension $α$-In$_2$Se$_3$

Authors: Liyi Bai, Changming Ke, Zhongshen Luo, Tianyuan Zhu, Lu You, Shi Liu

Abstract: Two-dimensional (2D) ferroelectric semiconductors present opportunities for integrating ferroelectrics into high-density ultrathin nanoelectronics. Among the few synthesized 2D ferroelectrics, $α$-In$_2$Se$_3$, known for its electrically addressable vertical polarization, has attracted significant interest. However, the understanding of many fundamental characteristics of this material, such as the existence of spontaneous in-plane polarization and switching mechanisms, remains controversial, marked by conflicting experimental and theoretical results. Here, our combined experimental characterizations with piezoresponse force microscope and symmetry analysis conclusively dismiss previous claims of in-plane ferroelectricity in $α$-In$_2$Se$_3$. The processes of vertical polarization switching in monolayer $α$-In$_2$Se$_3$ are explored with deep-learning-assisted large-scale molecular dynamics simulations, revealing atomistic mechanisms fundamentally different from those of bulk ferroelectrics. Despite lacking in-plane effective polarization, 1D domain walls can be moved by both out-of-plane and in-plane fields, exhibiting unusual avalanche dynamics characterized by abrupt, intermittent moving patterns. The propagating velocity at various temperatures, field orientations, and strengths can be statistically described with a universal creep equation, featuring a dynamical exponent of 2 that is distinct from all known values for elastic interfaces moving in disordered media. This work rectifies a long-held misunderstanding regarding the in-plane ferroelectricity of $α$-In$_2$Se$_3$, and the quantitative characterizations of domain wall velocity will hold broad implications for both the fundamental understanding and technological applications of 2D ferroelectrics.

Submitted 28 February, 2024; v1 submitted 18 July, 2023; originally announced July 2023.

Comments: 30 pages, 6 figures

arXiv:2307.05182 [pdf, other]

CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Authors: Long Bai, Mobarakol Islam, Hongliang Ren

Abstract: Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work, and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without the location of the answers. In addition, vision-language (ViL) embedding is still a less explored research in these kinds of tasks. Therefore, a surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos. We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding will feed a standard Data-Efficient Image Transformer (DeiT) module, before the parallel classifier and detector for joint prediction. We conduct the experimental validation on public surgical videos from MICCAI EndoVis Challenge 2017 and 2018. The experimental results highlight the superior performance and robustness of our proposed model compared to the state-of-the-art approaches. Ablation studies further prove the outstanding performance of all the proposed components. The proposed method provides a promising solution for surgical scene understanding, and opens up a primary step in the Artificial Intelligence (AI)-based VQLA system for surgical training. Our code is publicly available.

Submitted 19 August, 2023; v1 submitted 11 July, 2023; originally announced July 2023.

Comments: To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/CAT-ViL

arXiv:2307.02452 [pdf, other]

LLCaps: Learning to Illuminate Low-Light Capsule Endoscopy with Curved Wavelet Attention and Reverse Diffusion

Authors: Long Bai, Tong Chen, Yanan Wu, An Wang, Mobarakol Islam, Hongliang Ren

Abstract: Wireless capsule endoscopy (WCE) is a painless and non-invasive diagnostic tool for gastrointestinal (GI) diseases. However, due to GI anatomical constraints and hardware manufacturing limitations, WCE vision signals may suffer from insufficient illumination, leading to a complicated screening and examination procedure. Deep learning-based low-light image enhancement (LLIE) in the medical field is gradually attracting researchers. Given the exuberant development of the denoising diffusion probabilistic model (DDPM) in computer vision, we introduce a WCE LLIE framework based on the multi-scale convolutional neural network (CNN) and reverse diffusion process. The multi-scale design allows models to preserve high-resolution representation and context information from low-resolution, while the curved wavelet attention (CWA) block is proposed for high-frequency and local feature learning. Furthermore, we combine the reverse diffusion procedure to further optimize the shallow output and generate the most realistic image. The proposed method is compared with ten state-of-the-art (SOTA) LLIE methods and significantly outperforms them both quantitatively and qualitatively. The superior performance on GI disease segmentation further demonstrates the clinical potential of our proposed model. Our code is publicly accessible.

Submitted 22 July, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/LLCaps

arXiv:2306.14177 [pdf, other]

Enhancing Mapless Trajectory Prediction through Knowledge Distillation

Authors: Yuning Wang, Pu Zhang, Lei Bai, Jianru Xue

Abstract: Scene information plays a crucial role in trajectory forecasting systems for autonomous driving by providing semantic clues and constraints on potential future paths of traffic agents. Prevalent trajectory prediction techniques often take high-definition maps (HD maps) as part of the inputs to provide scene knowledge. Although HD maps offer accurate road information, they may suffer from the high cost of annotation or legal restrictions that limit their widespread use. Therefore, those methods are still expected to generate reliable prediction results in mapless scenarios. In this paper, we tackle the problem of improving the consistency between multi-modal predicted trajectories and the real road topology when map information is unavailable during the test phase. Specifically, we achieve this by training a map-based prediction teacher network on the annotated samples and transferring the knowledge to a student mapless prediction network using a two-fold knowledge distillation framework. Our solution is generalizable to common trajectory prediction networks and does not bring extra computation burden. Experimental results show that our method stably improves prediction performance in mapless mode on many widely used state-of-the-art trajectory prediction baselines, compensating for the gaps caused by the absence of HD maps. Qualitative visualization results demonstrate that our approach helps infer unseen map information.

Submitted 25 June, 2023; originally announced June 2023.

Comments: submitted to NeurIPS 2023

arXiv:2306.14143 [pdf, other]

Intelligent Multi-Modal Sensing-Communication Integration: Synesthesia of Machines

Authors: Xiang Cheng, Haotian Zhang, Jianan Zhang, Shijian Gao, Sijiang Li, Ziwei Huang, Lu Bai, Zonghui Yang, Xinhu Zheng, Liuqing Yang

Abstract: In the era of sixth-generation (6G) wireless communications, integrated sensing and communications (ISAC) is recognized as a promising solution to upgrade the physical system by endowing wireless communications with sensing capability. Existing ISAC is mainly oriented to static scenarios with radio-frequency (RF) sensors being the primary participants, thus lacking a comprehensive environment feature characterization and facing a severe performance bottleneck in dynamic environments. To date, extensive surveys on ISAC have been conducted but are limited to summarizing RF-based radar sensing. Currently, some research efforts have been devoted to exploring multi-modal sensing-communication integration but still lack a comprehensive review. Therefore, we generalize the concept of ISAC inspired by human synesthesia to establish a unified framework of intelligent multi-modal sensing-communication integration and provide a comprehensive review under such a framework in this paper. The so-termed Synesthesia of Machines (SoM) gives the clearest cognition of such intelligent integration and details its paradigm for the first time. We commence by justifying the necessity of the new paradigm. Subsequently, we offer a definition of SoM and zoom into the detailed paradigm, which is summarized as three operation modes. To facilitate SoM research, we overview the prerequisite of SoM research, i.e., mixed multi-modal (MMM) datasets. Then, we introduce the mapping relationships between multi-modal sensing and communications. Afterward, we cover the technological review on SoM-enhance-based and SoM-concert-based applications. To corroborate the superiority of SoM, we also present simulation results related to dual-function waveform and predictive beamforming design. Finally, we propose some potential directions to inspire future research efforts.

Submitted 20 November, 2023; v1 submitted 25 June, 2023; originally announced June 2023.

Comments: This paper has been accepted by IEEE Communications Surveys & Tutorials

arXiv:2306.14125 [pdf, other]

M$^3$SC: A Generic Dataset for Mixed Multi-Modal (MMM) Sensing and Communication Integration

Authors: Xiang Cheng, Ziwei Huang, Lu Bai, Haotian Zhang, Mingran Sun, Boxun Liu, Sijiang Li, Jianan Zhang, Minson Lee

Abstract: The sixth generation (6G) of mobile communication system is witnessing a new paradigm shift, i.e., integrated sensing-communication system. A comprehensive dataset is a prerequisite for 6G integrated sensing-communication research. This paper develops a novel simulation dataset, named M3SC, for mixed multi-modal (MMM) sensing-communication integration, and the generation framework of the M3SC dataset is further given. To obtain multi-modal sensory data in physical space and communication data in electromagnetic space, we utilize AirSim and WaveFarer to collect multi-modal sensory data and exploit Wireless InSite to collect communication data. Furthermore, the in-depth integration and precise alignment of AirSim, WaveFarer, and Wireless InSite are achieved. The M3SC dataset covers various weather conditions, various frequency bands, and different times of the day. Currently, the M3SC dataset contains 1500 snapshots, including 80 RGB images, 160 depth maps, 80 LiDAR point clouds, 256 sets of mmWave waveforms with 8 radar point clouds, and 72 channel impulse response (CIR) matrices per snapshot, thus totaling 120,000 RGB images, 240,000 depth maps, 120,000 LiDAR point clouds, 384,000 sets of mmWave waveforms with 12,000 radar point clouds, and 108,000 CIR matrices. The data processing result presents the multi-modal sensory information and communication channel statistical properties. Finally, the MMM sensing-communication application, which can be supported by the M3SC dataset, is discussed.

Submitted 25 June, 2023; originally announced June 2023.

Comments: 12 pages, 12 figures

arXiv:2306.10900 [pdf, other]

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

Authors: Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, Wanli Ouyang

Abstract: Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.

Submitted 18 March, 2024; v1 submitted 19 June, 2023; originally announced June 2023.

Comments: 18 pages, 8 figures, accepted by AAAI 2024

arXiv:2306.08259 [pdf, other]

LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting

Authors: Xu Liu, Yutong Xia, Yuxuan Liang, Junfeng Hu, Yiwei Wang, Lei Bai, Chao Huang, Zhenguang Liu, Bryan Hooi, Roger Zimmermann

Abstract: Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning in capturing non-linear patterns of traffic data. However, the promising results achieved on current public datasets may not be applicable to practical scenarios due to limitations within these datasets. First, their limited sizes may not reflect the real-world scale of traffic networks. Second, the temporal coverage of these datasets is typically short, posing hurdles in studying long-term patterns and acquiring sufficient samples for training deep models. Third, these datasets often lack adequate metadata for sensors, which compromises the reliability and interpretability of the data. To mitigate these limitations, we introduce the LargeST benchmark dataset. It encompasses a total of 8,600 sensors in California with a 5-year time coverage and includes comprehensive metadata. Using LargeST, we perform in-depth data analysis to extract data insights, benchmark well-known baselines in terms of their performance and efficiency, and identify challenges as well as opportunities for future research. We release the datasets and baseline implementations at: https://github.com/liuxu77/LargeST.

Submitted 28 October, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

Showing 51–100 of 351 results for author: Bai, L