-
Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
Authors:
Mengshuo Jia,
Zeyu Cui,
Gabriela Hug
Abstract:
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and th…
▽ More
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, a (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
△ Less
Submitted 26 June, 2024; v1 submitted 24 June, 2024;
originally announced June 2024.
-
Cephalometric Landmark Detection across Ages with Prototypical Network
Authors:
Han Wu,
Chong Wang,
Lanzhuju Mei,
Tong Yang,
Min Zhu,
Dingggang Shen,
Zhiming Cui
Abstract:
Automated cephalometric landmark detection is crucial in real-world orthodontic diagnosis. Current studies mainly focus on only adult subjects, neglecting the clinically crucial scenario presented by adolescents whose landmarks often exhibit significantly different appearances compared to adults. Hence, an open question arises about how to develop a unified and effective detection algorithm across…
▽ More
Automated cephalometric landmark detection is crucial in real-world orthodontic diagnosis. Current studies mainly focus on only adult subjects, neglecting the clinically crucial scenario presented by adolescents whose landmarks often exhibit significantly different appearances compared to adults. Hence, an open question arises about how to develop a unified and effective detection algorithm across various age groups, including adolescents and adults. In this paper, we propose CeLDA, the first work for Cephalometric Landmark Detection across Ages. Our method leverages a prototypical network for landmark detection by comparing image features with landmark prototypes. To tackle the appearance discrepancy of landmarks between age groups, we design new strategies for CeLDA to improve prototype alignment and obtain a holistic estimation of landmark prototypes from a large set of training images. Moreover, a novel prototype relation mining paradigm is introduced to exploit the anatomical relations between the landmark prototypes. Extensive experiments validate the superiority of CeLDA in detecting cephalometric landmarks on both adult and adolescent subjects. To our knowledge, this is the first effort toward develo** a unified solution and dataset for cephalometric landmark detection across age groups. Our code and dataset will be made public on https://github.com/ShanghaiTech-IMPACT/Cephalometric-Landmark-Detection-across-Ages-with-Prototypical-Network
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models
Authors:
Ziyun Cui,
Chang Lei,
Wen Wu,
Yinan Duan,
Diyang Qu,
Ji Wu,
Runsen Chen,
Chao Zhang
Abstract:
The early detection of suicide risk is important since it enables the intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged from ten to eighteen for our experiments. To leverage the diverse…
▽ More
The early detection of suicide risk is important since it enables the intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged from ten to eighteen for our experiments. To leverage the diverse acoustic and linguistic features embedded in spontaneous speech, both the Whisper speech model and textual large language models (LLMs) are used for suicide risk detection. Both all-parameter finetuning and parameter-efficient finetuning approaches are used to adapt the pre-trained models for suicide risk detection, and multiple audio-text fusion approaches are evaluated to combine the representations of Whisper and the LLM. The proposed system achieves a detection accuracy of 0.807 and an F1-score of 0.846 on the test set with 119 subjects, indicating promising potential for real suicide risk detection applications.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Decay Pruning Method: Smooth Pruning With a Self-Rectifying Procedure
Authors:
Minghao Yang,
Linlin Gao,
Pengyuan Li,
Wenbo Li,
Yihong Dong,
Zhiying Cui
Abstract:
Current structured pruning methods often result in considerable accuracy drops due to abrupt network changes and loss of information from pruned structures. To address these issues, we introduce the Decay Pruning Method (DPM), a novel smooth pruning approach with a self-rectifying mechanism. DPM consists of two key components: (i) Smooth Pruning: It converts conventional single-step pruning into m…
▽ More
Current structured pruning methods often result in considerable accuracy drops due to abrupt network changes and loss of information from pruned structures. To address these issues, we introduce the Decay Pruning Method (DPM), a novel smooth pruning approach with a self-rectifying mechanism. DPM consists of two key components: (i) Smooth Pruning: It converts conventional single-step pruning into multi-step smooth pruning, gradually reducing redundant structures to zero over N steps with ongoing optimization. (ii) Self-Rectifying: This procedure further enhances the aforementioned process by rectifying sub-optimal pruning based on gradient information. Our approach demonstrates strong generalizability and can be easily integrated with various existing pruning methods. We validate the effectiveness of DPM by integrating it with three popular pruning methods: OTOv2, Depgraph, and Gate Decorator. Experimental results show consistent improvements in performance compared to the original pruning methods, along with further reductions of FLOPs in most scenarios.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Bayesian WeakS-to-Strong from Text Classification to Generation
Authors:
Ziyun Cui,
Ziyang Zhang,
Wen Wu,
Guangzhi Sun,
Chao Zhang
Abstract:
Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of…
▽ More
Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of weak models which simulate the variability in human opinions. Confidence scores are estimated using a Bayesian approach to guide the WeakS-to-Strong generalization. Furthermore, we extend the application of WeakS-to-Strong from text classification tasks to text generation tasks where more advanced strategies are investigated for supervision. Moreover, direct preference optimization is applied to advance the student model's preference learning, beyond the basic learning framework of teacher forcing. Results demonstrate the effectiveness of the proposed approach for the reliability of a strong student model, showing potential for superalignment.
△ Less
Submitted 24 May, 2024;
originally announced June 2024.
-
OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
Authors:
Lening Wang,
Wenzhao Zheng,
Yilong Ren,
Han Jiang,
Zhiyong Cui,
Haiyang Yu,
Jiwen Lu
Abstract:
Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods mode scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffer from inefficiency in mo…
▽ More
Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods mode scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffer from inefficiency in modeling long-term temporal evolutions. To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s-videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving. Code is available at: https://github.com/wzzheng/OccSora.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis
Authors:
Boming Zhao,
Yuan Li,
Ziyu Sun,
Lin Zeng,
Yujun Shen,
Rui Ma,
Yinda Zhang,
Hujun Bao,
Zhaopeng Cui
Abstract:
Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a n…
▽ More
Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis in dynamic environments. GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes. To this end, we first propose a 3D Gaussian canonical space with deformation modeling to capture the appearance and geometry of dynamic scenes, and integrate the lifecycle property into Gaussians for irreversible deformations. To make the prediction feasible and efficient, a concentric motion distillation approach is developed by distilling the scene motion with key points. Finally, a Graph Convolutional Network is employed to predict the motions of key points, enabling the rendering of photorealistic images of future scenarios. Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Rankability-enhanced Revenue Uplift Modeling Framework for Online Marketing
Authors:
Bowei He,
Yunpeng Weng,
Xing Tang,
Ziqiang Cui,
Zexu Sun,
Liang Chen,
Xiuqiang He,
Chen Ma
Abstract:
Uplift modeling has been widely employed in online marketing by predicting the response difference between the treatment and control groups, so as to identify the sensitive individuals toward interventions like coupons or discounts. Compared with traditional \textit{conversion uplift modeling}, \textit{revenue uplift modeling} exhibits higher potential due to its direct connection with the corpora…
▽ More
Uplift modeling has been widely employed in online marketing by predicting the response difference between the treatment and control groups, so as to identify the sensitive individuals toward interventions like coupons or discounts. Compared with traditional \textit{conversion uplift modeling}, \textit{revenue uplift modeling} exhibits higher potential due to its direct connection with the corporate income. However, previous works can hardly handle the continuous long-tail response distribution in revenue uplift modeling. Moreover, they have neglected to optimize the uplift ranking among different individuals, which is actually the core of uplift modeling. To address such issues, in this paper, we first utilize the zero-inflated lognormal (ZILN) loss to regress the responses and customize the corresponding modeling network, which can be adapted to different existing uplift models. Then, we study the ranking-related uplift modeling error from the theoretical perspective and propose two tighter error bounds as the additional loss terms to the conventional response regression loss. Finally, we directly model the uplift ranking error for the entire population with a listwise uplift ranking loss. The experiment results on offline public and industrial datasets validate the effectiveness of our method for revenue uplift modeling. Furthermore, we conduct large-scale experiments on a prominent online fintech marketing platform, Tencent FiT, which further demonstrates the superiority of our method in real-world applications.
△ Less
Submitted 12 June, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing
Authors:
Ming** Zhang,
Jiannong Cao,
Xiaoming Shen,
Zeyang Cui
Abstract:
Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is promising to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works try to leverage model quantization to…
▽ More
Large language models (LLMs) have shown great potential in natural language processing and content generation. However, current LLMs heavily rely on cloud computing, leading to prolonged latency, high bandwidth cost, and privacy concerns. Edge computing is promising to address such concerns by deploying LLMs on edge devices, closer to data sources. Some works try to leverage model quantization to reduce the model size to fit the resource-constraint edge devices, but they lead to accuracy loss. Other works use cloud-edge collaboration, suffering from unstable network connections. In this work, we leverage collaborative edge computing to facilitate the collaboration among edge devices and cloud servers for jointly performing efficient LLM inference. We propose a general framework to partition the LLM model into shards and deploy on distributed devices. To achieve efficient LLM inference, we formulate an adaptive joint device selection and model partition problem and design an efficient dynamic programming algorithm to optimize the inference latency and throughput, respectively. Experiments of Llama2 serial models on a heterogeneous physical prototype demonstrate that EdgeShard achieves up to 50% latency reduction and 2x throughput improvement over baseline methods.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
PS6D: Point Cloud Based Symmetry-Aware 6D Object Pose Estimation in Robot Bin-Picking
Authors:
Yifan Yang,
Zhihao Cui,
Qianyi Zhang,
**gtai Liu
Abstract:
6D object pose estimation holds essential roles in various fields, particularly in the gras** of industrial workpieces. Given challenges like rust, high reflectivity, and absent textures, this paper introduces a point cloud based pose estimation framework (PS6D). PS6D centers on slender and multi-symmetric objects. It extracts multi-scale features through an attention-guided feature extraction m…
▽ More
6D object pose estimation holds essential roles in various fields, particularly in the gras** of industrial workpieces. Given challenges like rust, high reflectivity, and absent textures, this paper introduces a point cloud based pose estimation framework (PS6D). PS6D centers on slender and multi-symmetric objects. It extracts multi-scale features through an attention-guided feature extraction module, designs a symmetry-aware rotation loss and a center distance sensitive translation loss to regress the pose of each point to the centroid of the instance, and then uses a two-stage clustering method to complete instance segmentation and pose estimation. Objects from the Siléane and IPA datasets and typical workpieces from industrial practice are used to generate data and evaluate the algorithm. In comparison to the state-of-the-art approach, PS6D demonstrates an 11.5\% improvement in F$_{1_{inst}}$ and a 14.8\% improvement in Recall. The main part of PS6D has been deployed to the software of Mech-Mind, and achieves a 91.7\% success rate in bin-picking experiments, marking its application in industrial pose estimation tasks.
△ Less
Submitted 18 May, 2024;
originally announced May 2024.
-
3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning
Authors:
Zhentao Liu,
Huangxuan Zhao,
Wenhui Qin,
Zhenghong Zhou,
Xinggang Wang,
Wen** Wang,
Xiaochun Lai,
Chuansheng Zheng,
Dinggang Shen,
Zhiming Cui
Abstract:
Digital Subtraction Angiography (DSA) is one of the gold standards in vascular disease diagnosing. With the help of contrast agent, time-resolved 2D DSA images deliver comprehensive insights into blood flow information and can be utilized to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substanti…
▽ More
Digital Subtraction Angiography (DSA) is one of the gold standards in vascular disease diagnosing. With the help of contrast agent, time-resolved 2D DSA images deliver comprehensive insights into blood flow information and can be utilized to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. However, sparse-view DSA reconstruction, aimed at reducing radiation dosage, is still underexplored in the research community. The dynamic blood flow and insufficient input of sparse-view DSA images present significant challenges to the 3D vessel reconstruction task. In this study, we propose to use a time-agnostic vessel probability field to solve this problem effectively. Our approach, termed as vessel probability guided attenuation learning, represents the DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the vessel probability field. Functioning as a dynamic mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism facilitates a self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves the reconstruction quality. Our model is trained by minimizing the disparity between synthesized projections and real captured DSA images. We further employ two training strategies to improve our reconstruction quality: (1) coarse-to-fine progressive training to achieve better geometry and (2) temporal perturbed rendering loss to enforce temporal consistency. Experimental results have demonstrated superior quality on both 3D vessel reconstruction and 2D view synthesis.
△ Less
Submitted 17 May, 2024;
originally announced May 2024.
-
Diffusion-based Contrastive Learning for Sequential Recommendation
Authors:
Ziqiang Cui,
Haolun Wu,
Bowei He,
Ji Cheng,
Chen Ma
Abstract:
Self-supervised contrastive learning, which directly extracts inherent data correlations from unlabeled data, has been widely utilized to mitigate the data sparsity issue in sequential recommendation. The majority of existing methods create different augmented views of the same user sequence via random augmentation, and subsequently minimize their distance in the embedding space to enhance the qua…
▽ More
Self-supervised contrastive learning, which directly extracts inherent data correlations from unlabeled data, has been widely utilized to mitigate the data sparsity issue in sequential recommendation. The majority of existing methods create different augmented views of the same user sequence via random augmentation, and subsequently minimize their distance in the embedding space to enhance the quality of user representations. However, random augmentation often disrupts the semantic information and interest evolution pattern inherent in the user sequence, leading to the generation of semantically distinct augmented views. Promoting similarity of these semantically diverse augmented sequences can render the learned user representations insensitive to variations in user preferences and interest evolution, contradicting the core learning objectives of sequential recommendation. To address this issue, we leverage the inherent characteristics of sequential recommendation and propose the use of context information to generate more reasonable augmented positive samples. Specifically, we introduce a context-aware diffusion-based contrastive learning method for sequential recommendation. Given a user sequence, our method selects certain positions and employs a context-aware diffusion model to generate alternative items for these positions with the guidance of context information. These generated items then replace the corresponding original items, creating a semantically consistent augmented view of the original sequence. Additionally, to maintain representation cohesion, item embeddings are shared between the diffusion model and the recommendation model, and the entire framework is trained in an end-to-end manner. Extensive experiments on five benchmark datasets demonstrate the superiority of our proposed method.
△ Less
Submitted 7 June, 2024; v1 submitted 15 May, 2024;
originally announced May 2024.
-
Variable Substitution and Bilinear Programming for Aligning Partially Overlap** Point Sets
Authors:
Wei Lian,
Zhesen Cui,
Fei Ma,
Hang Pan,
Wangmeng Zuo
Abstract:
In many applications, the demand arises for algorithms capable of aligning partially overlap** point sets while remaining invariant to the corresponding transformations. This research presents a method designed to meet such requirements through minimization of the objective function of the robust point matching (RPM) algorithm. First, we show that the RPM objective is a cubic polynomial. Then, t…
▽ More
In many applications, the demand arises for algorithms capable of aligning partially overlap** point sets while remaining invariant to the corresponding transformations. This research presents a method designed to meet such requirements through minimization of the objective function of the robust point matching (RPM) algorithm. First, we show that the RPM objective is a cubic polynomial. Then, through variable substitution, we transform the RPM objective to a quadratic function. Leveraging the convex envelope of bilinear monomials, we proceed to relax the resulting objective function, thus obtaining a lower bound problem that can be conveniently decomposed into distinct linear assignment and low-dimensional convex quadratic program components, both amenable to efficient optimization. Furthermore, a branch-and-bound (BnB) algorithm is devised, which solely branches over the transformation parameters, thereby boosting convergence rate. Empirical evaluations demonstrate better robustness of the proposed methodology against non-rigid deformation, positional noise, and outliers, particularly in scenarios where outliers remain distinct from inliers, when compared with prevailing state-of-the-art approaches.
△ Less
Submitted 14 May, 2024;
originally announced May 2024.
-
Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning
Authors:
Wenqi Dong,
Bangbang Yang,
Lin Ma,
Xiao Liu,
Liyuan Cui,
Hujun Bao,
Yuewen Ma,
Zhaopeng Cui
Abstract:
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tas…
▽ More
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, proxy-bounded editing strategy for precise part editing, progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task.
△ Less
Submitted 13 May, 2024;
originally announced May 2024.
-
A Mixture-of-Experts Approach to Few-Shot Task Transfer in Open-Ended Text Worlds
Authors:
Christopher Z. Cui,
Xiangyu Peng,
Mark O. Riedl
Abstract:
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signal. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a pr…
▽ More
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signal. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a priori known tasks are combined into a Mixture-of-Experts model with an attention mechanism across a mix of frozen and unfrozen experts. The model learns when to attend to frozen task-specific experts when appropriate and learns new experts to handle novel situations. We work in an open-ended text-based environment in which the agent is tasked with behaving like different types of character roles and must rapidly learn behaviors associated with new character role types. We show that our agent both obtains more rewards in the zero-shot setting, and discovers these rewards with greater sample efficiency in the few-shot learning settings.
△ Less
Submitted 9 May, 2024;
originally announced May 2024.
-
Traj-LLM: A New Exploration for Empowering Trajectory Prediction with Pre-trained Large Language Models
Authors:
Zhengxing Lan,
Hongbo Li,
Lingshan Liu,
Bo Fan,
Yisheng Lv,
Yilong Ren,
Zhiyong Cui
Abstract:
Predicting the future trajectories of dynamic traffic actors is a cornerstone task in autonomous driving. Though existing notable efforts have resulted in impressive performance improvements, a gap persists in scene cognitive and understanding of the complex traffic semantics. This paper proposes Traj-LLM, the first to investigate the potential of using Large Language Models (LLMs) without explici…
▽ More
Predicting the future trajectories of dynamic traffic actors is a cornerstone task in autonomous driving. Though existing notable efforts have resulted in impressive performance improvements, a gap persists in scene cognitive and understanding of the complex traffic semantics. This paper proposes Traj-LLM, the first to investigate the potential of using Large Language Models (LLMs) without explicit prompt engineering to generate future motion from agents' past/observed trajectories and scene semantics. Traj-LLM starts with sparse context joint coding to dissect the agent and scene features into a form that LLMs understand. On this basis, we innovatively explore LLMs' powerful comprehension abilities to capture a spectrum of high-level scene knowledge and interactive information. Emulating the human-like lane focus cognitive function and enhancing Traj-LLM's scene comprehension, we introduce lane-aware probabilistic learning powered by the pioneering Mamba module. Finally, a multi-modal Laplace decoder is designed to achieve scene-compliant multi-modal predictions. Extensive experiments manifest that Traj-LLM, fortified by LLMs' strong prior knowledge and understanding prowess, together with lane-aware probability learning, outstrips state-of-the-art methods across evaluation metrics. Moreover, the few-shot analysis further substantiates Traj-LLM's performance, wherein with just 50% of the dataset, it outperforms the majority of benchmarks relying on complete data utilization. This study explores equip** the trajectory prediction task with advanced capabilities inherent in LLMs, furnishing a more universal and adaptable solution for forecasting agent motion in a new way.
△ Less
Submitted 8 May, 2024;
originally announced May 2024.
-
A Cognitive-Driven Trajectory Prediction Model for Autonomous Driving in Mixed Autonomy Environment
Authors:
Haicheng Liao,
Zhenning Li,
Chengyue Wang,
Bonan Wang,
Hanlin Kong,
Yanchen Guan,
Guofa Li,
Zhiyong Cui,
Chengzhong Xu
Abstract:
As autonomous driving technology progresses, the need for precise trajectory prediction models becomes paramount. This paper introduces an innovative model that infuses cognitive insights into trajectory prediction, focusing on perceived safety and dynamic decision-making. Distinct from traditional approaches, our model excels in analyzing interactions and behavior patterns in mixed autonomy traff…
▽ More
As autonomous driving technology progresses, the need for precise trajectory prediction models becomes paramount. This paper introduces an innovative model that infuses cognitive insights into trajectory prediction, focusing on perceived safety and dynamic decision-making. Distinct from traditional approaches, our model excels in analyzing interactions and behavior patterns in mixed autonomy traffic scenarios. It represents a significant leap forward, achieving marked performance improvements on several key datasets. Specifically, it surpasses existing benchmarks with gains of 16.2% on the Next Generation Simulation (NGSIM), 27.4% on the Highway Drone (HighD), and 19.8% on the Macao Connected Autonomous Driving (MoCAD) dataset. Our proposed model shows exceptional proficiency in handling corner cases, essential for real-world applications. Moreover, its robustness is evident in scenarios with missing or limited data, outperforming most of the state-of-the-art baselines. This adaptability and resilience position our model as a viable tool for real-world autonomous driving systems, heralding a new standard in vehicle trajectory prediction for enhanced safety and efficiency.
△ Less
Submitted 26 April, 2024;
originally announced April 2024.
-
MLS-Track: Multilevel Semantic Interaction in RMOT
Authors:
Zeliang Ma,
Song Yang,
Zhe Cui,
Zhicheng Zhao,
Fei Su,
Delong Liu,
**gyu Wang
Abstract:
The new trend in multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method base on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersect…
▽ More
The new trend in multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method base on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of Semantic Guidance Module (SGM) and Semantic Correlation Branch (SCB). Extensive experiments on Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework and it achieves state-of-the-art performance. Code and datatsets will be available.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling
Authors:
Shengze Dong,
Zhuorui Cui,
Ding Liu,
**zhi Lei
Abstract:
Motivation: Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research, facilitating the examination of gene expression at the individual cell level within a given tissue sample. While numerous tools have been developed for scRNA-seq data analysis, the challenge persists in capturing the distinct features of such data and replicating virtual d…
▽ More
Motivation: Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research, facilitating the examination of gene expression at the individual cell level within a given tissue sample. While numerous tools have been developed for scRNA-seq data analysis, the challenge persists in capturing the distinct features of such data and replicating virtual datasets that share analogous statistical properties. Results: Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT). This method generates virtual scRNA-seq data by leveraging a real dataset. The method is a neural network constructed based on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). This involves subjecting Gaussian noises to the real dataset through iterative noise-adding steps and ultimately restoring the noises to form scRNA-seq samples. This scheme allows us to learn data features from actual scRNA-seq samples during model training. Our experiments, conducted on two distinct scRNA-seq datasets, demonstrate superior performance. Additionally, the model sampling process is expedited by incorporating Denoising Diffusion Implicit Models (DDIM). scRDiT presents a unified methodology empowering users to train neural network models with their unique scRNA-seq datasets, enabling the generation of numerous high-quality scRNA-seq samples. Availability and implementation: https://github.com/DongShengze/scRDiT
△ Less
Submitted 9 April, 2024;
originally announced April 2024.
-
PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos
Authors:
Yufei Zhang,
Jeffrey O. Kephart,
Zijun Cui,
Qiang Ji
Abstract:
While current methods have shown promising progress on estimating 3D human motion from monocular videos, their motion estimates are often physically unrealistic because they mainly consider kinematics. In this paper, we introduce Physics-aware Pretrained Transformer (PhysPT), which improves kinematics-based motion estimates and infers motion forces. PhysPT exploits a Transformer encoder-decoder ba…
▽ More
While current methods have shown promising progress on estimating 3D human motion from monocular videos, their motion estimates are often physically unrealistic because they mainly consider kinematics. In this paper, we introduce Physics-aware Pretrained Transformer (PhysPT), which improves kinematics-based motion estimates and infers motion forces. PhysPT exploits a Transformer encoder-decoder backbone to effectively learn human dynamics in a self-supervised manner. Moreover, it incorporates physics principles governing human motion. Specifically, we build a physics-based body representation and contact force model. We leverage them to impose novel physics-inspired training losses (i.e., force loss, contact loss, and Euler-Lagrange loss), enabling PhysPT to capture physical properties of the human body and the forces it experiences. Experiments demonstrate that, once trained, PhysPT can be directly applied to kinematics-based estimates to significantly enhance their physical plausibility and generate favourable motion forces. Furthermore, we show that these physically meaningful quantities translate into improved accuracy of an important downstream task: human action recognition.
△ Less
Submitted 5 April, 2024;
originally announced April 2024.
-
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image
Authors:
Chong Bao,
Yinda Zhang,
Yuan Li,
Xiyu Zhang,
Bangbang Yang,
Hujun Bao,
Marc Pollefeys,
Guofeng Zhang,
Zhaopeng Cui
Abstract:
Recently, we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However, due to the diversity of frameworks, there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper, we propose a generic avatar editing approach that can be universally applied to various 3D…
▽ More
Recently, we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However, due to the diversity of frameworks, there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper, we propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. To achieve this goal, we design a novel expression-aware modification generative model, which enables lift 2D editing from a single image to a consistent 3D modification field. To ensure the effectiveness of the generative modification process, we develop several techniques, including an expression-dependent modification distillation scheme to draw knowledge from the large-scale head avatar model and 2D facial texture editing tools, implicit latent space guidance to enhance model convergence, and a segmentation-based loss reweight strategy for fine-grained texture inversion. Extensive experiments demonstrate that our method delivers high-quality and consistent results across multiple expression and viewpoints. Project page: https://zju3dv.github.io/geneavatar/
△ Less
Submitted 2 April, 2024;
originally announced April 2024.
-
IME: Integrating Multi-curvature Shared and Specific Embedding for Temporal Knowledge Graph Completion
Authors:
Jiapu Wang,
Zheng Cui,
Boyue Wang,
Shirui Pan,
Junbin Gao,
Baocai Yin,
Wen Gao
Abstract:
Temporal Knowledge Graphs (TKGs) incorporate a temporal dimension, allowing for a precise capture of the evolution of knowledge and reflecting the dynamic nature of the real world. Typically, TKGs contain complex geometric structures, with various geometric structures interwoven. However, existing Temporal Knowledge Graph Completion (TKGC) methods either model TKGs in a single space or neglect the…
▽ More
Temporal Knowledge Graphs (TKGs) incorporate a temporal dimension, allowing for a precise capture of the evolution of knowledge and reflecting the dynamic nature of the real world. Typically, TKGs contain complex geometric structures, with various geometric structures interwoven. However, existing Temporal Knowledge Graph Completion (TKGC) methods either model TKGs in a single space or neglect the heterogeneity of different curvature spaces, thus constraining their capacity to capture these intricate geometric structures. In this paper, we propose a novel Integrating Multi-curvature shared and specific Embedding (IME) model for TKGC tasks. Concretely, IME models TKGs into multi-curvature spaces, including hyperspherical, hyperbolic, and Euclidean spaces. Subsequently, IME incorporates two key properties, namely space-shared property and space-specific property. The space-shared property facilitates the learning of commonalities across different curvature spaces and alleviates the spatial gap caused by the heterogeneous nature of multi-curvature spaces, while the space-specific property captures characteristic features. Meanwhile, IME proposes an Adjustable Multi-curvature Pooling (AMP) approach to effectively retain important information. Furthermore, IME innovatively designs similarity, difference, and structure loss functions to attain the stated objective. Experimental results clearly demonstrate the superior performance of IME over existing state-of-the-art TKGC models.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
SGDM: Static-Guided Dynamic Module Make Stronger Visual Models
Authors:
Wenjie Xing,
Zhenchao Cui,
**g Qi
Abstract:
The spatial attention mechanism has been widely used to improve object detection performance. However, its operation is currently limited to static convolutions lacking content-adaptive features. This paper innovatively approaches from the perspective of dynamic convolution. We propose Razor Dynamic Convolution (RDConv) to address thetwo flaws in dynamic weight convolution, making it hard to imple…
▽ More
The spatial attention mechanism has been widely used to improve object detection performance. However, its operation is currently limited to static convolutions lacking content-adaptive features. This paper innovatively approaches from the perspective of dynamic convolution. We propose Razor Dynamic Convolution (RDConv) to address thetwo flaws in dynamic weight convolution, making it hard to implement in spatial mechanism: 1) it is computation-heavy; 2) when generating weights, spatial information is disregarded. Firstly, by using Razor Operation to generate certain features, we vastly reduce the parameters of the entire dynamic convolution operation. Secondly, we added a spatial branch inside RDConv to generate convolutional kernel parameters with richer spatial information. Embedding dynamic convolution will also bring the problem of sensitivity to high-frequency noise. We propose the Static-Guided Dynamic Module (SGDM) to address this limitation. By using SGDM, we utilize a set of asymmetric static convolution kernel parameters to guide the construction of dynamic convolution. We introduce the mechanism of shared weights in static convolution to solve the problem of dynamic convolution being sensitive to high-frequency noise. Extensive experiments illustrate that multiple different object detection backbones equipped with SGDM achieve a highly competitive boost in performance(e.g., +4% mAP with YOLOv5n on VOC and +1.7% mAP with YOLOv8n on COCO) with negligible parameter increase(i.e., +0.33M on YOLOv5n and +0.19M on YOLOv8n).
△ Less
Submitted 27 March, 2024;
originally announced March 2024.
-
Development of a Chinese Human-Automation Trust Scale
Authors:
Zixin Cui,
Xiangling Zhuang,
Seul Chan Lee,
Jieun Lee,
Xintong Li,
Makoto Itoh
Abstract:
The development of a reliable and valid assessment tool of human-automation trust is an important topic. This study aimed to develop a Chinese version of human-automation trust scale (C-HATS) with reasonable reliability and validity based on Lee and See (2004)'s trust model. After three phases of assessments including exploratory factor analysis, item analysis, and confirmatory factor analysis, di…
▽ More
The development of a reliable and valid assessment tool of human-automation trust is an important topic. This study aimed to develop a Chinese version of human-automation trust scale (C-HATS) with reasonable reliability and validity based on Lee and See (2004)'s trust model. After three phases of assessments including exploratory factor analysis, item analysis, and confirmatory factor analysis, different dimensions and items were considered for initial and posttask human-automation trust. For post-task trust, the scale had three dimensions and 11 items and reflected Lee and See (2004)'s model, whereas different from Lee and See (2004)'s model, the final scale had 14 items but only two dimensions for initial trust. Nevertheless, for both initial and post-task trust, reasonable reliability and validity of the scale were verified with various consumer automation products. Although further verification is still necessary, the developed C-HATS could be used to effectively assess human-automation trust in the Chinese context.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field
Authors:
Jiarui Hu,
Xianhao Chen,
Boyin Feng,
Guanglin Li,
Liang**g Yang,
Hujun Bao,
Guofeng Zhang,
Zhaopeng Cui
Abstract:
Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and map** (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system…
▽ More
Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and map** (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and map**. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and map** performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available. Project page: https://zju3dv.github.io/cg-slam.
△ Less
Submitted 24 March, 2024;
originally announced March 2024.
-
Emergence of Social Norms in Generative Agent Societies: Principles and Architecture
Authors:
Siyue Ren,
Zhiyao Cui,
Ruiqi Song,
Zhen Wang,
Shuyue Hu
Abstract:
Social norms play a crucial role in guiding agents towards understanding and adhering to standards of behavior, thus reducing social conflicts within multi-agent systems (MASs). However, current LLM-based (or generative) MASs lack the capability to be normative. In this paper, we propose a novel architecture, named CRSEC, to empower the emergence of social norms within generative MASs. Our archite…
▽ More
Social norms play a crucial role in guiding agents towards understanding and adhering to standards of behavior, thus reducing social conflicts within multi-agent systems (MASs). However, current LLM-based (or generative) MASs lack the capability to be normative. In this paper, we propose a novel architecture, named CRSEC, to empower the emergence of social norms within generative MASs. Our architecture consists of four modules: Creation & Representation, Spreading, Evaluation, and Compliance. This addresses several important aspects of the emergent processes all in one: (i) where social norms come from, (ii) how they are formally represented, (iii) how they spread through agents' communications and observations, (iv) how they are examined with a sanity check and synthesized in the long term, and (v) how they are incorporated into agents' planning and actions. Our experiments deployed in the Smallville sandbox game environment demonstrate the capability of our architecture to establish social norms and reduce social conflicts within generative MASs. The positive outcomes of our human evaluation, conducted with 30 evaluators, further affirm the effectiveness of our approach. Our project can be accessed via the following link: https://github.com/sxswz213/CRSEC.
△ Less
Submitted 20 May, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
TPLLM: A Traffic Prediction Framework Based on Pretrained Large Language Models
Authors:
Yilong Ren,
Yue Chen,
Shuai Liu,
Boyue Wang,
Haiyang Yu,
Zhiyong Cui
Abstract:
Traffic prediction constitutes a pivotal facet within the purview of Intelligent Transportation Systems (ITS), and the attainment of highly precise predictions holds profound significance for efficacious traffic management. The precision of prevailing deep learning-driven traffic prediction models typically sees an upward trend with a rise in the volume of training data. However, the procurement o…
▽ More
Traffic prediction constitutes a pivotal facet within the purview of Intelligent Transportation Systems (ITS), and the attainment of highly precise predictions holds profound significance for efficacious traffic management. The precision of prevailing deep learning-driven traffic prediction models typically sees an upward trend with a rise in the volume of training data. However, the procurement of comprehensive spatiotemporal datasets for traffic is often fraught with challenges, primarily stemming from the substantial costs associated with data collection and retention. Consequently, develo** a model that can achieve accurate predictions and good generalization ability in areas with limited historical traffic data is a challenging problem. It is noteworthy that the rapidly advancing pretrained Large Language Models (LLMs) of recent years have demonstrated exceptional proficiency in cross-modality knowledge transfer and few-shot learning. Recognizing the sequential nature of traffic data, similar to language, we introduce TPLLM, a novel traffic prediction framework leveraging LLMs. In this framework, we construct a sequence embedding layer based on Convolutional Neural Networks (CNNs) and a graph embedding layer based on Graph Convolutional Networks (GCNs) to extract sequence features and spatial features, respectively. These are subsequently integrated to form inputs that are suitable for LLMs. A Low-Rank Adaptation (LoRA) fine-tuning approach is applied to TPLLM, thereby facilitating efficient learning and minimizing computational demands. Experiments on two real-world datasets demonstrate that TPLLM exhibits commendable performance in both full-sample and few-shot prediction scenarios, effectively supporting the development of ITS in regions with scarce historical traffic data.
△ Less
Submitted 18 March, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
A Cognitive-Based Trajectory Prediction Approach for Autonomous Driving
Authors:
Haicheng Liao,
Yongkang Li,
Zhenning Li,
Chengyue Wang,
Zhiyong Cui,
Shengbo Eben Li,
Chengzhong Xu
Abstract:
In autonomous vehicle (AV) technology, the ability to accurately predict the movements of surrounding vehicles is paramount for ensuring safety and operational efficiency. Incorporating human decision-making insights enables AVs to more effectively anticipate the potential actions of other vehicles, significantly improving prediction accuracy and responsiveness in dynamic environments. This paper…
▽ More
In autonomous vehicle (AV) technology, the ability to accurately predict the movements of surrounding vehicles is paramount for ensuring safety and operational efficiency. Incorporating human decision-making insights enables AVs to more effectively anticipate the potential actions of other vehicles, significantly improving prediction accuracy and responsiveness in dynamic environments. This paper introduces the Human-Like Trajectory Prediction (HLTP) model, which adopts a teacher-student knowledge distillation framework inspired by human cognitive processes. The HLTP model incorporates a sophisticated teacher-student knowledge distillation framework. The "teacher" model, equipped with an adaptive visual sector, mimics the visual processing of the human brain, particularly the functions of the occipital and temporal lobes. The "student" model focuses on real-time interaction and decision-making, drawing parallels to prefrontal and parietal cortex functions. This approach allows for dynamic adaptation to changing driving scenarios, capturing essential perceptual cues for accurate prediction. Evaluated using the Macao Connected and Autonomous Driving (MoCAD) dataset, along with the NGSIM and HighD benchmarks, HLTP demonstrates superior performance compared to existing models, particularly in challenging environments with incomplete data. The project page is available at Github.
△ Less
Submitted 29 February, 2024;
originally announced February 2024.
-
Bit Rate Matching Algorithm Optimization in JPEG-AI Verification Model
Authors:
Panqi Jia,
A. Burakhan Koyuncu,
Jue Mao,
Ze Cui,
Yi Ma,
Tiansheng Guo,
Timofey Solovyev,
Alexander Karabutov,
Yin Zhao,
**g Wang,
Elena Alshina,
Andre Kaup
Abstract:
The research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn the non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices over their classical counterparts. Those properties…
▽ More
The research on neural network (NN) based image compression has shown superior performance compared to classical compression frameworks. Unlike the hand-engineered transforms in the classical frameworks, NN-based models learn the non-linear transforms providing more compact bit representations, and achieve faster coding speed on parallel devices over their classical counterparts. Those properties evoked the attention of both scientific and industrial communities, resulting in the standardization activity JPEG-AI. The verification model for the standardization process of JPEG-AI is already in development and has surpassed the advanced VVC intra codec. To generate reconstructed images with the desired bits per pixel and assess the BD-rate performance of both the JPEG-AI verification model and VVC intra, bit rate matching is employed. However, the current state of the JPEG-AI verification model experiences significant slowdowns during bit rate matching, resulting in suboptimal performance due to an unsuitable model. The proposed methodology offers a gradual algorithmic optimization for matching bit rates, resulting in a fourfold acceleration and over 1% improvement in BD-rate at the base operation point. At the high operation point, the acceleration increases up to sixfold.
△ Less
Submitted 27 February, 2024;
originally announced February 2024.
-
An Overview of the Development of Stereotactic Body Radiation Therapy
Authors:
Yanqi Zong,
Zhengrong Cui,
Luqi Lin,
Sihao Wang,
Yizhi Chen
Abstract:
Stereotactic body radiation therapy (SBRT) refers to focusing high-energy rays in three-dimensional space on the tumor lesion area, reducing the dose received by surrounding normal tissues, which can effectively improve the local control rate of the tumor and reduce the probability of complications. With the comprehensive development of medical imaging, radiation biology and other disciplines, thi…
▽ More
Stereotactic body radiation therapy (SBRT) refers to focusing high-energy rays in three-dimensional space on the tumor lesion area, reducing the dose received by surrounding normal tissues, which can effectively improve the local control rate of the tumor and reduce the probability of complications. With the comprehensive development of medical imaging, radiation biology and other disciplines, this less-fractional, high-dose radiotherapy method has been increasingly developed and applied in clinical practice. The background, radio-biological basis, key technologies and main equipment of SBRT are discussed, and its future development direction is prospected.
△ Less
Submitted 26 February, 2024;
originally announced February 2024.
-
Plugin Speech Enhancement: A Universal Speech Enhancement Framework Inspired by Dynamic Neural Network
Authors:
Yanan Chen,
Zihao Cui,
Yingying Gao,
Junlan Feng,
Chao Deng,
Shilei Zhang
Abstract:
The expectation to deploy a universal neural network for speech enhancement, with the aim of improving noise robustness across diverse speech processing tasks, faces challenges due to the existing lack of awareness within static speech enhancement frameworks regarding the expected speech in downstream modules. These limitations impede the effectiveness of static speech enhancement approaches in ac…
▽ More
The expectation to deploy a universal neural network for speech enhancement, with the aim of improving noise robustness across diverse speech processing tasks, faces challenges due to the existing lack of awareness within static speech enhancement frameworks regarding the expected speech in downstream modules. These limitations impede the effectiveness of static speech enhancement approaches in achieving optimal performance for a range of speech processing tasks, thereby challenging the notion of universal applicability. The fundamental issue in achieving universal speech enhancement lies in effectively informing the speech enhancement module about the features of downstream modules. In this study, we present a novel weighting prediction approach, which explicitly learns the task relationships from downstream training information to address the core challenge of universal speech enhancement. We found the role of deciding whether to employ data augmentation techniques as crucial downstream training information. This decision significantly impacts the expected speech and the performance of the speech enhancement module. Moreover, we introduce a novel speech enhancement network, the Plugin Speech Enhancement (Plugin-SE). The Plugin-SE is a dynamic neural network that includes the speech enhancement module, gate module, and weight prediction module. Experimental results demonstrate that the proposed Plugin-SE approach is competitive or superior to other joint training methods across various downstream tasks.
△ Less
Submitted 20 February, 2024;
originally announced February 2024.
-
Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion
Authors:
Zuoyue Li,
Zhenqiang Li,
Zhaopeng Cui,
Marc Pollefeys,
Martin R. Oswald
Abstract:
Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either ope…
▽ More
Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either operate at the object level or are difficult to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach generates texture colors at the point level for a given geometry using a 3D diffusion model first, which is then transformed into a scene representation in a feed-forward manner. The representation can be utilized to render arbitrary views which would excel in both single-frame quality and inter-frame consistency. Experiments in two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
△ Less
Submitted 1 April, 2024; v1 submitted 19 January, 2024;
originally announced January 2024.
-
Treatment-Aware Hyperbolic Representation Learning for Causal Effect Estimation with Social Networks
Authors:
Ziqiang Cui,
Xing Tang,
Yang Qiao,
Bowei He,
Liang Chen,
Xiuqiang He,
Chen Ma
Abstract:
Estimating the individual treatment effect (ITE) from observational data is a crucial research topic that holds significant value across multiple domains. How to identify hidden confounders poses a key challenge in ITE estimation. Recent studies have incorporated the structural information of social networks to tackle this challenge, achieving notable advancements. However, these methods utilize g…
▽ More
Estimating the individual treatment effect (ITE) from observational data is a crucial research topic that holds significant value across multiple domains. How to identify hidden confounders poses a key challenge in ITE estimation. Recent studies have incorporated the structural information of social networks to tackle this challenge, achieving notable advancements. However, these methods utilize graph neural networks to learn the representation of hidden confounders in Euclidean space, disregarding two critical issues: (1) the social networks often exhibit a scalefree structure, while Euclidean embeddings suffer from high distortion when used to embed such graphs, and (2) each ego-centric network within a social network manifests a treatment-related characteristic, implying significant patterns of hidden confounders. To address these issues, we propose a novel method called Treatment-Aware Hyperbolic Representation Learning (TAHyper). Firstly, TAHyper employs the hyperbolic space to encode the social networks, thereby effectively reducing the distortion of confounder representation caused by Euclidean embeddings. Secondly, we design a treatment-aware relationship identification module that enhances the representation of hidden confounders by identifying whether an individual and her neighbors receive the same treatment. Extensive experiments on two benchmark datasets are conducted to demonstrate the superiority of our method.
△ Less
Submitted 12 January, 2024;
originally announced January 2024.
-
A Hybrid Neural Network Model For Predicting The Nitrate Concentration In The Recirculating Aquaculture System
Authors:
Xiangyu Fan,
Jiaxin Lia,
Yingzhe Wang,
Yingsha Qu,
Hao Li,
Keming Qu,
Zhengguo Cui
Abstract:
This study was groundbreaking in its application of neural network models for nitrate management in the Recirculating Aquaculture System (RAS). A hybrid neural network model was proposed, which accurately predicted daily nitrate concentration and its trends using six water quality parameters. We conducted a 105-day aquaculture experiment, during which we collected 450 samples from five sets of RAS…
▽ More
This study was groundbreaking in its application of neural network models for nitrate management in the Recirculating Aquaculture System (RAS). A hybrid neural network model was proposed, which accurately predicted daily nitrate concentration and its trends using six water quality parameters. We conducted a 105-day aquaculture experiment, during which we collected 450 samples from five sets of RAS to train our model (C-L-A model) which incorporates Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and self-Attention. Furthermore, we obtained 90 samples from a standalone RAS as the testing data to evaluate the performance of the model in practical applications. The experimental results proved that the C-L-A model accurately predicted nitrate concentration in RAS and maintained good performance even with a reduced proportion of training data. We recommend using water quality parameters from the past 7 days to forecast future nitrate concentration, as this timeframe allows the model to achieve maximum generalization capability. Additionally, we compared the performance of the C-L-A model with three basic neural network models (CNN, LSTM, self-Attention) as well as three hybrid neural network models (CNN-LSTM, CNN-Attention, LSTM-Attention). The results demonstrated that the C-L-A model (R2=0.956) significantly outperformed the other neural network models (R2=0.901-0.927). Our study suggests that the utilization of neural network models, specifically the C-L-A model, could potentially assist the RAS industry in conserving resources for daily nitrate monitoring.
△ Less
Submitted 15 January, 2024; v1 submitted 2 January, 2024;
originally announced January 2024.
-
Inferring Heterogeneous Treatment Effects of Crashes on Highway Traffic: A Doubly Robust Causal Machine Learning Approach
Authors:
Shuang Li,
Ziyuan Pu,
Zhiyong Cui,
Seunghyeon Lee,
Xiucheng Guo,
Dong Ngoduy
Abstract:
Highway traffic crashes exert a considerable impact on both transportation systems and the economy. In this context, accurate and dependable emergency responses are crucial for effective traffic management. However, the influence of crashes on traffic status varies across diverse factors and may be biased due to selection bias. Therefore, there arises a necessity to accurately estimate the heterog…
▽ More
Highway traffic crashes exert a considerable impact on both transportation systems and the economy. In this context, accurate and dependable emergency responses are crucial for effective traffic management. However, the influence of crashes on traffic status varies across diverse factors and may be biased due to selection bias. Therefore, there arises a necessity to accurately estimate the heterogeneous causal effects of crashes, thereby providing essential insights to facilitate individual-level emergency decision-making. This paper proposes a novel causal machine learning framework to estimate the causal effect of different types of crashes on highway speed. The Neyman-Rubin Causal Model (RCM) is employed to formulate this problem from a causal perspective. The Conditional Shapley Value Index (CSVI) is proposed based on causal graph theory to filter adverse variables, and the Structural Causal Model (SCM) is then adopted to define the statistical estimand for causal effects. The treatment effects are estimated by Doubly Robust Learning (DRL) methods, which combine doubly robust causal inference with classification and regression machine learning models. Experimental results from 4815 crashes on Highway Interstate 5 in Washington State reveal the heterogeneous treatment effects of crashes at varying distances and durations. The rear-end crashes cause more severe congestion and longer durations than other types of crashes, and the sideswipe crashes have the longest delayed impact. Additionally, the findings show that rear-end crashes affect traffic greater at night, while crash to objects has the most significant influence during peak hours. Statistical hypothesis tests, error metrics based on matched "counterfactual outcomes", and sensitive analyses are employed for assessment, and the results validate the accuracy and effectiveness of our method.
△ Less
Submitted 1 January, 2024;
originally announced January 2024.
-
KeDuSR: Real-World Dual-Lens Super-Resolution via Kernel-Free Matching
Authors:
Huan**g Yue,
Zifan Cui,
Kun Li,
**gyu Yang
Abstract:
Dual-lens super-resolution (SR) is a practical scenario for reference (Ref) based SR by utilizing the telephoto image (Ref) to assist the super-resolution of the low-resolution wide-angle image (LR input). Different from general RefSR, the Ref in dual-lens SR only covers the overlapped field of view (FoV) area. However, current dual-lens SR methods rarely utilize these specific characteristics and…
▽ More
Dual-lens super-resolution (SR) is a practical scenario for reference (Ref) based SR by utilizing the telephoto image (Ref) to assist the super-resolution of the low-resolution wide-angle image (LR input). Different from general RefSR, the Ref in dual-lens SR only covers the overlapped field of view (FoV) area. However, current dual-lens SR methods rarely utilize these specific characteristics and directly perform dense matching between the LR input and Ref. Due to the resolution gap between LR and Ref, the matching may miss the best-matched candidate and destroy the consistent structures in the overlapped FoV area. Different from them, we propose to first align the Ref with the center region (namely the overlapped FoV area) of the LR input by combining global war** and local war** to make the aligned Ref be sharp and consistent. Then, we formulate the aligned Ref and LR center as value-key pairs, and the corner region of the LR is formulated as queries. In this way, we propose a kernel-free matching strategy by matching between the LR-corner (query) and LR-center (key) regions, and the corresponding aligned Ref (value) can be warped to the corner region of the target. Our kernel-free matching strategy avoids the resolution gap between LR and Ref, which makes our network have better generalization ability. In addition, we construct a DuSR-Real dataset with (LR, Ref, HR) triples, where the LR and HR are well aligned. Experiments on three datasets demonstrate that our method outperforms the second-best method by a large margin. Our code and dataset are available at https://github.com/ZifanCui/KeDuSR.
△ Less
Submitted 2 January, 2024; v1 submitted 28 December, 2023;
originally announced December 2023.
-
AccidentGPT: Accident Analysis and Prevention from V2X Environmental Perception with Multi-modal Large Model
Authors:
Lening Wang,
Yilong Ren,
Han Jiang,
Pinlong Cai,
Daocheng Fu,
Tianqi Wang,
Zhiyong Cui,
Haiyang Yu,
Xuesong Wang,
Hanchu Zhou,
Helai Huang,
Yinhai Wang
Abstract:
Traffic accidents, being a significant contributor to both human casualties and property damage, have long been a focal point of research for many scholars in the field of traffic safety. However, previous studies, whether focusing on static environmental assessments or dynamic driving analyses, as well as pre-accident predictions or post-accident rule analyses, have typically been conducted in is…
▽ More
Traffic accidents, being a significant contributor to both human casualties and property damage, have long been a focal point of research for many scholars in the field of traffic safety. However, previous studies, whether focusing on static environmental assessments or dynamic driving analyses, as well as pre-accident predictions or post-accident rule analyses, have typically been conducted in isolation. There has been a lack of an effective framework for develo** a comprehensive understanding and application of traffic safety. To address this gap, this paper introduces AccidentGPT, a comprehensive accident analysis and prevention multi-modal large model. AccidentGPT establishes a multi-modal information interaction framework grounded in multi-sensor perception, thereby enabling a holistic approach to accident analysis and prevention in the field of traffic safety. Specifically, our capabilities can be categorized as follows: for autonomous driving vehicles, we provide comprehensive environmental perception and understanding to control the vehicle and avoid collisions. For human-driven vehicles, we offer proactive long-range safety warnings and blind-spot alerts while also providing safety driving recommendations and behavioral norms through human-machine dialogue and interaction. Additionally, for traffic police and management agencies, our framework supports intelligent and real-time analysis of traffic safety, encompassing pedestrian, vehicles, roads, and the environment through collaborative perception from multiple vehicles and road testing devices. The system is also capable of providing a thorough analysis of accident causes and liability after vehicle collisions. Our framework stands as the first large model to integrate comprehensive scene understanding into traffic safety studies. Project page: https://accidentgpt.github.io
△ Less
Submitted 28 December, 2023; v1 submitted 20 December, 2023;
originally announced December 2023.
-
PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields
Authors:
Boming Zhao,
Luwei Yang,
Mao Mao,
Hujun Bao,
Zhaopeng Cui
Abstract:
Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) have been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRFs for data augmentation to improve the regression model training, and the performance on novel viewpoints and appearances is still limited due to the lack of geometric constraint…
▽ More
Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) have been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRFs for data augmentation to improve the regression model training, and the performance on novel viewpoints and appearances is still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, \ie, PNeRFLoc, based on a unified point-based representation. On the one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. Specifically, we propose a novel feature adaption module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also develop an efficient rendering-based framework with a war** loss function. Furthermore, several robustness techniques are developed to handle illumination changes and dynamic objects for outdoor scenarios. Experiments demonstrate that PNeRFLoc performs the best on synthetic data when the NeRF model can be well learned and performs on par with the SOTA method on the visual localization benchmark datasets.
△ Less
Submitted 17 December, 2023;
originally announced December 2023.
-
Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption
Authors:
Ziteng Cui,
Lin Gu,
Xiao Sun,
Xianzheng Ma,
Yu Qiao,
Tatsuya Harada
Abstract:
The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points. This simplified rendering approach presents challenges in accurately modeling images captured under adverse lighting conditions, such as low light or over-exposure. Motivated by the ancient Greek emission the…
▽ More
The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points. This simplified rendering approach presents challenges in accurately modeling images captured under adverse lighting conditions, such as low light or over-exposure. Motivated by the ancient Greek emission theory that posits visual perception as a result of rays emanating from the eyes, we slightly refine the conventional NeRF framework to train NeRF under challenging light conditions and generate normal-light condition novel views unsupervised. We introduce the concept of a "Concealing Field," which assigns transmittance values to the surrounding air to account for illumination effects. In dark scenarios, we assume that object emissions maintain a standard lighting level but are attenuated as they traverse the air during the rendering process. Concealing Field thus compel NeRF to learn reasonable density and colour estimations for objects even in dimly lit situations. Similarly, the Concealing Field can mitigate over-exposed emissions during the rendering stage. Furthermore, we present a comprehensive multi-view dataset captured under challenging illumination conditions for evaluation. Our code and dataset available at https://github.com/cuiziteng/Aleth-NeRF
△ Less
Submitted 24 January, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
CLIP in Medical Imaging: A Comprehensive Survey
Authors:
Zihao Zhao,
Yuxiao Liu,
Han Wu,
Yonghao Li,
Sheng Wang,
Lin Teng,
Disheng Liu,
Zhiming Cui,
Qian Wang,
Dinggang Shen
Abstract:
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training paradigm, successfully introduces text supervision to vision models. It has shown promising results across various tasks, attributable to its generalizability and interpretability. The use of CLIP has recently gained increasing interest in the medical imaging domain, serving both as a pre-training paradigm for alig…
▽ More
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training paradigm, successfully introduces text supervision to vision models. It has shown promising results across various tasks, attributable to its generalizability and interpretability. The use of CLIP has recently gained increasing interest in the medical imaging domain, serving both as a pre-training paradigm for aligning medical vision and language, and as a critical component in diverse clinical tasks. With the aim of facilitating a deeper understanding of this promising direction, this survey offers an in-depth exploration of the CLIP paradigm within the domain of medical imaging, regarding both refined CLIP pre-training and CLIP-driven applications. In this study, We (1) start with a brief introduction to the fundamentals of CLIP methodology. (2) Then, we investigate the adaptation of CLIP pre-training in the medical domain, focusing on how to optimize CLIP given characteristics of medical images and reports. (3) Furthermore, we explore the practical utilization of CLIP pre-trained models in various tasks, including classification, dense prediction, and cross-modal tasks. (4) Finally, we discuss existing limitations of CLIP in the context of medical imaging and propose forward-looking directions to address the demands of medical imaging domain. We expect that this comprehensive survey will provide researchers in the field of medical image analysis with a holistic understanding of the CLIP paradigm and its potential implications. The project page can be found on https://github.com/zhaozh10/Awesome-CLIP-in-Medical-Imaging.
△ Less
Submitted 21 May, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
On the Ground and in the Sky: A Tutorial on Radio Localization in Ground-Air-Space Networks
Authors:
Hazem Sallouha,
Sharief Saleh,
Sibren De Bast,
Zhuangzhuang Cui,
Sofie Pollin,
Henk Wymeersch
Abstract:
The inherent limitations in scaling up ground infrastructure for future wireless networks, combined with decreasing operational costs of aerial and space networks, are driving considerable research interest in multisegment ground-air-space (GAS) networks. In GAS networks, where ground and aerial users share network resources, ubiquitous and accurate user localization becomes indispensable, not onl…
▽ More
The inherent limitations in scaling up ground infrastructure for future wireless networks, combined with decreasing operational costs of aerial and space networks, are driving considerable research interest in multisegment ground-air-space (GAS) networks. In GAS networks, where ground and aerial users share network resources, ubiquitous and accurate user localization becomes indispensable, not only as an end-user service but also as an enabler for location-aware communications. This breaks the convention of having localization as a byproduct in networks primarily designed for communications. To address these imperative localization needs, the design and utilization of ground, aerial, and space anchors require thorough investigation. In this tutorial, we provide an in-depth systemic analysis of the radio localization problem in GAS networks, considering ground and aerial users as targets to be localized. Starting from a survey of the most relevant works, we then define the key characteristics of anchors and targets in GAS networks. Subsequently, we detail localization fundamentals in GAS networks, considering 3D positions, orientations, and velocities. Afterward, we thoroughly analyze radio localization systems in GAS networks, detailing the system model, design aspects, and considerations for each of the three GAS anchors. Preliminary results are presented to provide a quantifiable perspective on key design aspects in GAS-based localization scenarios. We then identify the vital roles 6G enablers are expected to play in radio localization in GAS networks.
△ Less
Submitted 17 June, 2024; v1 submitted 9 December, 2023;
originally announced December 2023.
-
Autonomous and Adaptive Role Selection for Multi-robot Collaborative Area Search Based on Deep Reinforcement Learning
Authors:
Lina Zhu,
Jiyu Cheng,
Hao Zhang,
Zhichao Cui,
Wei Zhang,
Yuehu Liu
Abstract:
In the tasks of multi-robot collaborative area search, we propose the unified approach for simultaneous map** for sensing more targets (exploration) while searching and locating the targets (coverage). Specifically, we implement a hierarchical multi-agent reinforcement learning algorithm to decouple task planning from task execution. The role concept is integrated into the upper-level task plann…
▽ More
In the tasks of multi-robot collaborative area search, we propose the unified approach for simultaneous map** for sensing more targets (exploration) while searching and locating the targets (coverage). Specifically, we implement a hierarchical multi-agent reinforcement learning algorithm to decouple task planning from task execution. The role concept is integrated into the upper-level task planning for role selection, which enables robots to learn the role based on the state status from the upper-view. Besides, an intelligent role switching mechanism enables the role selection module to function between two timesteps, promoting both exploration and coverage interchangeably. Then the primitive policy learns how to plan based on their assigned roles and local observation for sub-task execution. The well-designed experiments show the scalability and generalization of our method compared with state-of-the-art approaches in the scenes with varying complexity and number of robots.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Impact of Indoor Mobility Behavior on the Respiratory Infectious Diseases Transmission Trends
Authors:
Ziwei Cui,
Ming Cai,
Zheng Zhu,
Gongbo Chen,
Yao Xiao
Abstract:
The importance of indoor human mobility in the transmission dynamics of respiratory infectious diseases has been acknowledged. Previous studies have predominantly addressed a single type of mobility behavior such as queueing and a series of behaviors under specific scenarios. However, these studies ignore the abstraction of mobility behavior in various scenes and the critical examination of how th…
▽ More
The importance of indoor human mobility in the transmission dynamics of respiratory infectious diseases has been acknowledged. Previous studies have predominantly addressed a single type of mobility behavior such as queueing and a series of behaviors under specific scenarios. However, these studies ignore the abstraction of mobility behavior in various scenes and the critical examination of how these abstracted behaviors impact disease propagation. To address these problems, this study considers people's mobility behaviors in a general scenario, abstracting them into two main categories: crowding behavior, related to the spatial aspect, and stop** behavior, related to the temporal aspect. Accordingly, this study investigates their impacts on disease spreading and the impact of individual spatio-temporal distribution resulting from these mobility behaviors on epidemic transmission. First, a point of interest (POI) method is introduced to quantify the crowding-related spatial POI factors (i.e., the number of crowdings and the distance between crowdings) and stop**-related temporal POI factors (i.e., the number of stop**s and the duration of each stop**). Besides, a personal space determined with Voronoi diagrams is used to construct the individual spatio-temporal distribution factor. Second, two indicators (i.e., the daily number of new cases and the average exposure risk of people) are applied to quantify epidemic transmission. These indicators are derived from a fundamental model which accurately predicts disease transmission between moving individuals. Third, a set of 200 indoor scenarios is constructed and simulated to help determine variable values. Concurrently, the influences and underlying mechanisms of these behavioral factors on disease transmission are examined using structural equation modeling and causal inference modeling......
△ Less
Submitted 28 November, 2023;
originally announced November 2023.
-
Joint Diffusion: Mutual Consistency-Driven Diffusion Model for PET-MRI Co-Reconstruction
Authors:
Taofeng Xie,
Zhuo-Xu Cui,
Chen Luo,
Huayu Wang,
Congcong Liu,
Yuanzhi Zhang,
Xuemei Wang,
Yanjie Zhu,
Qiyu **,
Guoqing Chen,
Yihang Zhou,
Dong Liang,
Haifeng Wang
Abstract:
Positron Emission Tomography and Magnetic Resonance Imaging (PET-MRI) systems can obtain functional and anatomical scans. PET suffers from a low signal-to-noise ratio. Meanwhile, the k-space data acquisition process in MRI is time-consuming. The study aims to accelerate MRI and enhance PET image quality. Conventional approaches involve the separate reconstruction of each modality within PET-MRI sy…
▽ More
Positron Emission Tomography and Magnetic Resonance Imaging (PET-MRI) systems can obtain functional and anatomical scans. PET suffers from a low signal-to-noise ratio. Meanwhile, the k-space data acquisition process in MRI is time-consuming. The study aims to accelerate MRI and enhance PET image quality. Conventional approaches involve the separate reconstruction of each modality within PET-MRI systems. However, there exists complementary information among multi-modal images. The complementary information can contribute to image reconstruction. In this study, we propose a novel PET-MRI joint reconstruction model employing a mutual consistency-driven diffusion mode, namely MC-Diffusion. MC-Diffusion learns the joint probability distribution of PET and MRI for utilizing complementary information. We conducted a series of contrast experiments about LPLS, Joint ISAT-net and MC-Diffusion by the ADNI dataset. The results underscore the qualitative and quantitative improvements achieved by MC-Diffusion, surpassing the state-of-the-art method.
△ Less
Submitted 24 November, 2023;
originally announced November 2023.
-
Realization of a programmable multi-purpose photonic quantum memory with over-thousand qubit manipulations
Authors:
Sheng Zhang,
Jixuan Shi,
Zhaibin Cui,
Ye Wang,
Yukai Wu,
Luming Duan,
Yunfei Pu
Abstract:
Quantum networks can enable various applications such as distributed quantum computing, long-distance quantum communication, and network-based quantum sensing with unprecedented performances. One of the most important building blocks for a quantum network is a photonic quantum memory which serves as the interface between the communication channel and the local functional unit. A programmable quant…
▽ More
Quantum networks can enable various applications such as distributed quantum computing, long-distance quantum communication, and network-based quantum sensing with unprecedented performances. One of the most important building blocks for a quantum network is a photonic quantum memory which serves as the interface between the communication channel and the local functional unit. A programmable quantum memory which can process a large stream of flying qubits and fulfill the requirements of multiple core functions in a quantum network is still to-be-realized. Here we report a high-performance quantum memory which can simultaneously store 72 optical qubits carried by 144 spatially separated atomic ensembles and support up to a thousand consecutive write or read operations in a random access way, two orders of magnitude larger than the previous record. Due to the built-in programmability, this quantum memory can be adapted on-demand for several functions. As example applications, we realize quantum queue, stack, and buffer which closely resemble the counterpart devices for classical information processing. We further demonstrate the synchronization and reshuffle of 4 entangled pairs of photonic pulses with probabilistic arrival time and arbitrary release order via the memory, which is an essential requirement for the realization of quantum repeaters and efficient routing in quantum networks. Realization of this multi-purpose programmable quantum memory thus constitutes a key enabling building block for future large-scale fully-functional quantum networks.
△ Less
Submitted 29 April, 2024; v1 submitted 16 November, 2023;
originally announced November 2023.
-
CP-SLAM: Collaborative Neural Point-based SLAM System
Authors:
Jiarui Hu,
Mao Mao,
Hujun Bao,
Guofeng Zhang,
Zhaopeng Cui
Abstract:
This paper presents a collaborative implicit neural simultaneous localization and map** (SLAM) system with RGB-D image sequences, which consists of complete front-end and back-end modules including odometry, loop detection, sub-map fusion, and global refinement. In order to enable all these modules in a unified framework, we propose a novel neural point based 3D scene representation in which eac…
▽ More
This paper presents a collaborative implicit neural simultaneous localization and map** (SLAM) system with RGB-D image sequences, which consists of complete front-end and back-end modules including odometry, loop detection, sub-map fusion, and global refinement. In order to enable all these modules in a unified framework, we propose a novel neural point based 3D scene representation in which each point maintains a learnable neural feature for scene encoding and is associated with a certain keyframe. Moreover, a distributed-to-centralized learning strategy is proposed for the collaborative implicit SLAM to improve consistency and cooperation. A novel global optimization framework is also proposed to improve the system accuracy like traditional bundle adjustment. Experiments on various datasets demonstrate the superiority of the proposed method in both camera tracking and map**.
△ Less
Submitted 14 November, 2023;
originally announced November 2023.
-
A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection
Authors:
Wenxin Wang,
Zhuo-Xu Cui,
Guanxun Cheng,
Chentao Cao,
Xi Xu,
Ziwei Liu,
Haifeng Wang,
Yulong Qi,
Dong Liang,
Yanjie Zhu
Abstract:
Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework Two-Stage Generative Model (TSGM) that combines Cyc…
▽ More
Accurate detection and segmentation of brain tumors is critical for medical diagnosis. However, current supervised learning methods require extensively annotated images and the state-of-the-art generative models used in unsupervised methods often have limitations in covering the whole data distribution. In this paper, we propose a novel framework Two-Stage Generative Model (TSGM) that combines Cycle Generative Adversarial Network (CycleGAN) and Variance Exploding stochastic differential equation using joint probability (VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is trained on unpaired data to generate abnormal images from healthy images as data prior. Then VE-JP is implemented to reconstruct healthy images using synthetic paired abnormal images as a guide, which alters only pathological regions but not regions of healthy. Notably, our method directly learned the joint probability distribution for conditional generation. The residual between input and reconstructed images suggests the abnormalities and a thresholding method is subsequently applied to obtain segmentation results. Furthermore, the multimodal results are weighted with different weights to improve the segmentation accuracy further. We validated our method on three datasets, and compared with other unsupervised methods for anomaly detection and segmentation. The DSC score of 0.8590 in BraTs2020 dataset, 0.6226 in ITCS dataset and 0.7403 in In-house dataset show that our method achieves better segmentation performance and has better generalization.
△ Less
Submitted 6 November, 2023;
originally announced November 2023.
-
Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search
Authors:
Yingying Gao,
Shilei Zhang,
Zihao Cui,
Chao Deng,
Junlan Feng
Abstract:
Cascading multiple pre-trained models is an effective way to compose an end-to-end system. However, fine-tuning the full cascaded model is parameter and memory inefficient and our observations reveal that only applying adapter modules on cascaded model can not achieve considerable performance as fine-tuning. We propose an automatic and effective adaptive learning method to optimize end-to-end casc…
▽ More
Cascading multiple pre-trained models is an effective way to compose an end-to-end system. However, fine-tuning the full cascaded model is parameter and memory inefficient and our observations reveal that only applying adapter modules on cascaded model can not achieve considerable performance as fine-tuning. We propose an automatic and effective adaptive learning method to optimize end-to-end cascaded multi-task models based on Neural Architecture Search (NAS) framework. The candidate adaptive operations on each specific module consist of frozen, inserting an adapter and fine-tuning. We further add a penalty item on the loss to limit the learned structure which takes the amount of trainable parameters into account. The penalty item successfully restrict the searched architecture and the proposed approach is able to search similar tuning scheme with hand-craft, compressing the optimizing parameters to 8.7% corresponding to full fine-tuning on SLURP with an even better performance.
△ Less
Submitted 23 October, 2023;
originally announced October 2023.
-
DreamSpace: Dreaming Your Room Space with Text-Driven Panoramic Texture Propagation
Authors:
Bangbang Yang,
Wenqi Dong,
Lin Ma,
Wenbo Hu,
Xiao Liu,
Zhaopeng Cui,
Yuewen Ma
Abstract:
Diffusion-based methods have achieved prominent success in generating 2D media. However, accomplishing similar proficiencies for scene-level mesh texturing in 3D spatial applications, e.g., XR/VR, remains constrained, primarily due to the intricate nature of 3D geometry and the necessity for immersive free-viewpoint rendering. In this paper, we propose a novel indoor scene texturing framework, whi…
▽ More
Diffusion-based methods have achieved prominent success in generating 2D media. However, accomplishing similar proficiencies for scene-level mesh texturing in 3D spatial applications, e.g., XR/VR, remains constrained, primarily due to the intricate nature of 3D geometry and the necessity for immersive free-viewpoint rendering. In this paper, we propose a novel indoor scene texturing framework, which delivers text-driven texture generation with enchanting details and authentic spatial coherence. The key insight is to first imagine a stylized 360° panoramic texture from the central viewpoint of the scene, and then propagate it to the rest areas with inpainting and imitating techniques. To ensure meaningful and aligned textures to the scene, we develop a novel coarse-to-fine panoramic texture generation approach with dual texture alignment, which both considers the geometry and texture cues of the captured scenes. To survive from cluttered geometries during texture propagation, we design a separated strategy, which conducts texture inpainting in confidential regions and then learns an implicit imitating network to synthesize textures in occluded and tiny structural areas. Extensive experiments and the immersive VR application on real-world indoor scenes demonstrate the high quality of the generated textures and the engaging experience on VR headsets. Project webpage: https://ybbbbt.com/publication/dreamspace
△ Less
Submitted 19 October, 2023;
originally announced October 2023.
-
3D Structure-guided Network for Tooth Alignment in 2D Photograph
Authors:
Yulong Dou,
Lanzhuju Mei,
Dinggang Shen,
Zhiming Cui
Abstract:
Orthodontics focuses on rectifying misaligned teeth (i.e., malocclusions), affecting both masticatory function and aesthetics. However, orthodontic treatment often involves complex, lengthy procedures. As such, generating a 2D photograph depicting aligned teeth prior to orthodontic treatment is crucial for effective dentist-patient communication and, more importantly, for encouraging patients to a…
▽ More
Orthodontics focuses on rectifying misaligned teeth (i.e., malocclusions), affecting both masticatory function and aesthetics. However, orthodontic treatment often involves complex, lengthy procedures. As such, generating a 2D photograph depicting aligned teeth prior to orthodontic treatment is crucial for effective dentist-patient communication and, more importantly, for encouraging patients to accept orthodontic intervention. In this paper, we propose a 3D structure-guided tooth alignment network that takes 2D photographs as input (e.g., photos captured by smartphones) and aligns the teeth within the 2D image space to generate an orthodontic comparison photograph featuring aesthetically pleasing, aligned teeth. Notably, while the process operates within a 2D image space, our method employs 3D intra-oral scanning models collected in clinics to learn about orthodontic treatment, i.e., projecting the pre- and post-orthodontic 3D tooth structures onto 2D tooth contours, followed by a diffusion model to learn the map** relationship. Ultimately, the aligned tooth contours are leveraged to guide the generation of a 2D photograph with aesthetically pleasing, aligned teeth and realistic textures. We evaluate our network on various facial photographs, demonstrating its exceptional performance and strong applicability within the orthodontic industry.
△ Less
Submitted 17 October, 2023;
originally announced October 2023.